Maggy Distributed Training with PyTorch and DeepSpeed ZeRO example
MNIST training and DeepSpeed ZeRO
Maggy enables you to train with Microsoft’s DeepSpeed ZeRO optimizer. Since DeepSpeed does not follow the common PyTorch programming model, Maggy is unable to provide full distribution transparency to the user. This means that if you want to use DeepSpeed for your training, you will have to make small changes to your code. In this notebook, we show exactly what you have to change in order to make DeepSpeed run with Maggy.
from hops import hdfs
import torch
import torch.nn.functional as F
Starting Spark application
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
189 | application_1617699042861_0016 | pyspark | idle | Link | Link |
SparkSession available as 'spark'.
Define the model
First off, we have to define our model. Since DeepSpeed’s ZeRO is meant to reduce the memory consumption of our model, we will use an unreasonably large CNN for this example.
class CNN(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Conv2d(1, 1000, 3)
        self.l2 = torch.nn.Conv2d(1000, 3000, 5)
        self.l3 = torch.nn.Conv2d(3000, 3000, 5)
        # With 28x28 MNIST inputs, the three convolutions leave 18x18 feature maps.
        self.l4 = torch.nn.Linear(3000 * 18 * 18, 10)

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        # Return raw logits; the CrossEntropyLoss used in the training function
        # applies log-softmax internally, so no softmax is needed here.
        return self.l4(x.flatten(start_dim=1))
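To get a feel for why this CNN is “unreasonably large”, here is a quick back-of-the-envelope parameter count (a sketch you can run after the cell above; the numbers in the comments are approximate):

# Rough size of the CNN above. Conv2d holds out_ch * in_ch * k * k weights (+ biases);
# l2 and l3 dominate with 3000*1000*5*5 and 3000*3000*5*5 weights respectively.
model = CNN()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 310M
print(f"~{n_params * 4 / 1e9:.2f} GB in FP32, ~{n_params * 2 / 1e9:.2f} GB in FP16")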
Adapting the training function
There are a few minor changes that have to be made in order to train with DeepSpeed:
- There is no need for an optimizer anymore. You configure your optimizer later in the DeepSpeed config.
- DeepSpeed’s ZeRO requires you to use FP16 training. Therefore, convert your data to half precision!
- The backward call is not executed on the loss, but on the model (model.backward(loss) instead of loss.backward()).
- The step call is not executed on the optimizer, but also on the model (model.step() instead of optimizer.step()).
- As we have no optimizer anymore, there is also no need to call optimizer.zero_grad().
You do not have to worry about the implementation of these calls; Maggy configures your model at runtime to act as a DeepSpeed engine. The sketch below summarizes the resulting training step.
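As a minimal sketch (the helper name deepspeed_step is ours, and model is assumed to have been configured by Maggy as a DeepSpeed engine):

def deepspeed_step(model, criterion, img, label):
    # Vanilla PyTorch would instead do:
    #   optimizer.zero_grad(); loss.backward(); optimizer.step()
    loss = criterion(model(img.half()), label)  # ZeRO requires FP16 inputs
    model.backward(loss)   # backward on the model, not the loss
    model.step()           # step on the model; no optimizer.step() or zero_grad()
    return loss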
def train_fn(module, hparams, train_set, test_set):
    import torch
    from maggy.core.patching import MaggyPetastormDataLoader

    model = module(**hparams)

    batch_size = 4
    # Linear learning rate scaling rule as in https://arxiv.org/pdf/1706.02677.pdf
    # (the optimizer itself is configured in the DeepSpeed config, not here).
    lr_base = 0.1 * batch_size / 256

    loss_criterion = torch.nn.CrossEntropyLoss()
    train_loader = MaggyPetastormDataLoader(train_set, batch_size=batch_size)

    model.train()
    for idx, data in enumerate(train_loader):
        # DeepSpeed's ZeRO requires FP16 inputs.
        img, label = data["image"].half(), data["label"].half()
        prediction = model(img)
        loss = loss_criterion(prediction, label.long())
        # Backward and step are called on the model, which Maggy has configured
        # to act as a DeepSpeed engine.
        model.backward(loss)
        m1 = torch.cuda.max_memory_allocated(0)  # peak GPU memory before the step
        model.step()
        m2 = torch.cuda.max_memory_allocated(0)  # peak GPU memory after the step
        print("Optimizer pre: {}MB\n Optimizer post: {}MB".format(m1 // 1e6, m2 // 1e6))
        print(f"Finished batch {idx}")
    return float(1)  # dummy metric; Maggy expects the training function to return one
train_ds = hdfs.project_path() + "/DataSets/MNIST/PetastormMNIST/train_set"
test_ds = hdfs.project_path() + "/DataSets/MNIST/PetastormMNIST/test_set"
print(hdfs.exists(train_ds), hdfs.exists(test_ds))
True True
Configuring DeepSpeed
In order to use DeepSpeed’s ZeRO, the deepspeed backend has to be chosen. This backend also requires its own config; a full specification of the possible settings can be found in DeepSpeed’s configuration documentation.
from maggy import experiment
from maggy.experiment_config import TorchDistributedConfig
ds_config = {"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"optimizer": {"type": "Adam", "params": {"lr": 0.1}},
"fp16": {"enabled": True},
"zero_optimization": {"stage": 2},
}
config = TorchDistributedConfig(name='DS_ZeRO', module=CNN, train_set=train_ds, test_set=test_ds, backend="deepspeed", deepspeed_config=ds_config)
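As a side note on these settings, DeepSpeed derives the effective global batch size from them; for the two-executor run below this works out as:

# DeepSpeed's effective (global) batch size is the product of these settings:
#   train_batch_size = train_micro_batch_size_per_gpu
#                      * gradient_accumulation_steps
#                      * world_size
world_size = 2                             # two executors in the run below, one GPU each
effective_batch_size = 1 * 1 * world_size  # = 2 samples per global step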
Starting the training
You can now launch training with DeepSpeed ZeRO. Note that the overhead of DeepSpeed is considerably larger than PyTorch’s built-in sharding, albeit also more efficient for larger numbers of GPUs. DeepSpeed will also JIT-compile its components on the first run. If you want to compare memory efficiency with the default training, you can rewrite this notebook to work with standard PyTorch training, as sketched below.
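For such a comparison, the training function reverts to the familiar PyTorch calls. A hedged sketch (the name train_fn_torch is ours, and we assume Maggy’s default backend takes care of wrapping the model for distribution):

def train_fn_torch(module, hparams, train_set, test_set):
    import torch
    from maggy.core.patching import MaggyPetastormDataLoader

    model = module(**hparams)
    # The optimizer is created here instead of in the DeepSpeed config.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    loss_criterion = torch.nn.CrossEntropyLoss()
    train_loader = MaggyPetastormDataLoader(train_set, batch_size=4)

    model.train()
    for idx, data in enumerate(train_loader):
        img, label = data["image"], data["label"]  # FP32; no .half() needed
        optimizer.zero_grad()
        loss = loss_criterion(model(img), label.long())
        loss.backward()   # backward on the loss again
        optimizer.step()  # step on the optimizer again
    return float(1)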
result = experiment.lagom(train_fn, config)
0: Awaiting worker reservations.
1: Awaiting worker reservations.
1: All executors registered: True
1: Reservations complete, configuring PyTorch.
1: Torch config is {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '48985', 'WORLD_SIZE': '2', 'RANK': '1', 'LOCAL_RANK': '0', 'NCCL_BLOCKING_WAIT': '1', 'NCCL_DEBUG': 'INFO'}
0: All executors registered: True
0: Reservations complete, configuring PyTorch.
0: Torch config is {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '48985', 'WORLD_SIZE': '2', 'RANK': '0', 'LOCAL_RANK': '0', 'NCCL_BLOCKING_WAIT': '1', 'NCCL_DEBUG': 'INFO'}
0: Starting distributed training.
1: Starting distributed training.
0: Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
1: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions as PyTorch extensions root...
1: Creating extension directory /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions/utils...
1: Emitting ninja build file /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions/utils/build.ninja...
1: Building extension module utils...
1: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
0: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions as PyTorch extensions root...
0: Creating extension directory /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions/utils...
0: Emitting ninja build file /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions/utils/build.ninja...
0: Building extension module utils...
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
0: Loading extension module utils...
0: Time to load utils op: 14.465313196182251 seconds
0: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions as PyTorch extensions root...
0: No modifications detected for re-loaded extension module utils, skipping build step...
0: Loading extension module utils...
0: Time to load utils op: 0.0007085800170898438 seconds
0: Petastorm dataset detected in folder hdfs://rpc.namenode.service.consul:8020/Projects/PyTorch_spark_minimal//DataSets/MNIST/PetastormMNIST/train_set
1: Loading extension module utils...
1: Time to load utils op: 15.031829833984375 seconds
1: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions as PyTorch extensions root...
1: No modifications detected for re-loaded extension module utils, skipping build step...
1: Loading extension module utils...
1: Time to load utils op: 0.0007691383361816406 seconds
1: Petastorm dataset detected in folder hdfs://rpc.namenode.service.consul:8020/Projects/PyTorch_spark_minimal//DataSets/MNIST/PetastormMNIST/train_set
1: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
1: Finished batch 0
0: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
0: Finished batch 0
0: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
0: Finished batch 1
1: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
1: Finished batch 1
... (both workers repeat identical lines for batches 2-83; peak memory stays at 4336.0MB throughout) ...
1: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
1: Finished batch 84
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-6-1ca51e069556> in <module>
----> 1 get_ipython().run_cell_magic('spark', '', 'result = experiment.lagom(train_fn, config)\n')
... (IPython and sparkmagic frames elided) ...
/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/sparkmagic/livyclientlib/livysession.py in sleep(self, retries)
--> 290         sleep(self._policy.seconds_to_sleep(retries))
KeyboardInterrupt: