Maggy Distributed Training with PyTorch and DeepSpeed ZeRO example

MNIST training and DeepSpeed ZeRO

Maggy enables you to train with Microsoft’s DeepSpeed ZeRO optimizer. Since DeepSpeed does not follow the common PyTorch programming model, Maggy is unable to provide full distribution transparency to the user. This means that if you want to use DeepSpeed for your training, you will have to make small changes to your code. In this notebook, we will show you what exactly you have to change in order to make DeepSpeed run with Maggy.

from hops import hdfs
import torch
import torch.nn.functional as F
Starting Spark application
ID   YARN Application ID              Kind     State  Spark UI  Driver log
189  application_1617699042861_0016   pyspark  idle   Link      Link
SparkSession available as 'spark'.

Define the model

First off, we have to define our model. Since DeepSpeed’s ZeRO is meant to reduce the memory consumption of our model, we will use an unreasonably large CNN for this example.

class CNN(torch.nn.Module):
    """Deliberately oversized CNN so that ZeRO's memory savings become visible."""

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Conv2d(1, 1000, 3)
        self.l2 = torch.nn.Conv2d(1000, 3000, 5)
        self.l3 = torch.nn.Conv2d(3000, 3000, 5)
        self.l4 = torch.nn.Linear(3000 * 18 * 18, 10)

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        # Softmax over the class dimension (dim=1), not the batch dimension.
        x = F.softmax(self.l4(x.flatten(start_dim=1)), dim=1)
        return x
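
To get a sense of why ZeRO's memory savings matter here, a quick parameter count helps. The snippet below is a minimal sketch, not part of the original notebook, and can be run in a local cell once CNN is defined; the byte estimates in the comments are rough.

model = CNN()
n_params = sum(p.numel() for p in model.parameters())
# Roughly 3.1e8 parameters: the FP32 weights alone take about 1.2 GB, and
# replicating gradients plus Adam state on every GPU multiplies that several
# times over. This redundancy is exactly what ZeRO stage 2 shards across workers.
print(f"{n_params / 1e6:.1f}M parameters, ~{n_params * 4 / 1e9:.2f} GB in FP32")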

Adapting the training function

There are a few minor changes that have to be made in order to train with DeepSpeed:

- There is no need for an optimizer anymore. You configure your optimizer in the DeepSpeed config instead.
- DeepSpeed's ZeRO requires FP16 training, so convert your data to half precision.
- The backward call is executed on the model, not on the loss (model.backward(loss) instead of loss.backward()).
- The step call is also executed on the model, not on the optimizer (model.step() instead of optimizer.step()).
- Since there is no optimizer anymore, there is no need to call optimizer.zero_grad() either.

You do not have to worry about the implementation of these calls: Maggy configures your model at runtime to act as a DeepSpeed engine.

def train_fn(module, hparams, train_set, test_set):
    
    import torch
    
    from maggy.core.patching import MaggyPetastormDataLoader
    
    model = module(**hparams)
    
    batch_size = 4
    # There is no optimizer here; with DeepSpeed the optimizer and its
    # learning rate are defined in ds_config instead.
    loss_criterion = torch.nn.CrossEntropyLoss()

    train_loader = MaggyPetastormDataLoader(train_set, batch_size=batch_size)
                            
    model.train()
    for idx, data in enumerate(train_loader):
        # ZeRO requires FP16 training, so convert the batch to half precision.
        img, label = data["image"].half(), data["label"].half()
        prediction = model(img)
        loss = loss_criterion(prediction, label.long())
        # Backward is called on the model, which Maggy has patched into a
        # DeepSpeed engine at runtime.
        model.backward(loss)

        m1 = torch.cuda.max_memory_allocated(0)
        model.step()  # Step is also called on the model instead of an optimizer.
        m2 = torch.cuda.max_memory_allocated(0)
        print("Optimizer pre: {}MB\n Optimizer post: {}MB".format(m1//1e6, m2//1e6))
        print(f"Finished batch {idx}")
    return float(1)  # Dummy metric returned to Maggy.
train_ds = hdfs.project_path() + "/DataSets/MNIST/PetastormMNIST/train_set"
test_ds = hdfs.project_path() + "/DataSets/MNIST/PetastormMNIST/test_set"
print(hdfs.exists(train_ds), hdfs.exists(test_ds))
True True

Configuring DeepSpeed

In order to use DeepSpeed's ZeRO, the deepspeed backend has to be chosen. This backend also requires its own config; you can find a full specification of the possible settings in the DeepSpeed configuration documentation.

from maggy import experiment
from maggy.experiment_config import TorchDistributedConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 0.1}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

config = TorchDistributedConfig(name='DS_ZeRO', module=CNN, train_set=train_ds, test_set=test_ds,
                                backend="deepspeed", deepspeed_config=ds_config)

Starting the training

You can now launch training with DeepSpeed ZeRO. Note that DeepSpeed's overhead is considerably larger than that of PyTorch's built-in sharding, although it becomes more efficient for larger numbers of GPUs. DeepSpeed will also JIT-compile some of its components on the first run. If you want to compare memory efficiency with the default training, you can rewrite this notebook to use standard PyTorch training.
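
For comparison, a plain PyTorch version of the training function might look like the sketch below. This is an illustrative adaptation, not part of the original notebook: the optimizer is created explicitly with the same Adam settings as in ds_config, the FP16 conversion is dropped, and backward/step move back to the loss and the optimizer.

def train_fn_plain(module, hparams, train_set, test_set):
    
    import torch
    
    from maggy.core.patching import MaggyPetastormDataLoader
    
    model = module(**hparams)
    
    batch_size = 4
    loss_criterion = torch.nn.CrossEntropyLoss()
    # The optimizer lives in the training function again instead of ds_config.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    
    train_loader = MaggyPetastormDataLoader(train_set, batch_size=batch_size)
    
    model.train()
    for idx, data in enumerate(train_loader):
        img, label = data["image"].float(), data["label"]  # no FP16 conversion needed
        optimizer.zero_grad()
        prediction = model(img)
        loss = loss_criterion(prediction, label.long())
        loss.backward()    # backward on the loss again
        optimizer.step()   # step on the optimizer again
        print(f"Finished batch {idx}")
    return float(1)

Launching it works the same way via experiment.lagom, with a TorchDistributedConfig that simply omits the deepspeed backend and config.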

result = experiment.lagom(train_fn, config)


0: Awaiting worker reservations.
1: Awaiting worker reservations.
1: All executors registered: True
1: Reservations complete, configuring PyTorch.
1: Torch config is {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '48985', 'WORLD_SIZE': '2', 'RANK': '1', 'LOCAL_RANK': '0', 'NCCL_BLOCKING_WAIT': '1', 'NCCL_DEBUG': 'INFO'}
0: All executors registered: True
0: Reservations complete, configuring PyTorch.
0: Torch config is {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '48985', 'WORLD_SIZE': '2', 'RANK': '0', 'LOCAL_RANK': '0', 'NCCL_BLOCKING_WAIT': '1', 'NCCL_DEBUG': 'INFO'}
0: Starting distributed training.
1: Starting distributed training.
0: Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
1: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions as PyTorch extensions root...
1: Creating extension directory /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions/utils...
1: Emitting ninja build file /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions/utils/build.ninja...
1: Building extension module utils...
1: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
0: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions as PyTorch extensions root...
0: Creating extension directory /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions/utils...
0: Emitting ninja build file /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions/utils/build.ninja...
0: Building extension module utils...
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
0: Loading extension module utils...
0: Time to load utils op: 14.465313196182251 seconds
0: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000002/.cache/torch_extensions as PyTorch extensions root...
0: No modifications detected for re-loaded extension module utils, skipping build step...
0: Loading extension module utils...
0: Time to load utils op: 0.0007085800170898438 seconds
0: Petastorm dataset detected in folder hdfs://rpc.namenode.service.consul:8020/Projects/PyTorch_spark_minimal//DataSets/MNIST/PetastormMNIST/train_set
1: Loading extension module utils...
1: Time to load utils op: 15.031829833984375 seconds
1: Using /srv/hops/hopsdata/tmp/nm-local-dir/usercache/PyTorch_spark_minimal__realamac/appcache/application_1617699042861_0016/container_e78_1617699042861_0016_01_000004/.cache/torch_extensions as PyTorch extensions root...
1: No modifications detected for re-loaded extension module utils, skipping build step...
1: Loading extension module utils...
1: Time to load utils op: 0.0007691383361816406 seconds
1: Petastorm dataset detected in folder hdfs://rpc.namenode.service.consul:8020/Projects/PyTorch_spark_minimal//DataSets/MNIST/PetastormMNIST/train_set
1: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
1: Finished batch 0
0: Optimizer pre: 4336.0MB
 Optimizer post: 4336.0MB
0: Finished batch 0
[... the same "Optimizer pre: 4336.0MB / Optimizer post: 4336.0MB" output repeats on both workers for batches 1 through 84 ...]




---------------------------------------------------------------------------

KeyboardInterrupt                         Traceback (most recent call last)

<ipython-input-6-1ca51e069556> in <module>
----> 1 get_ipython().run_cell_magic('spark', '', 'result = experiment.lagom(train_fn, config)\n')


[... IPython and sparkmagic internal frames elided ...]

KeyboardInterrupt: