We start lesson 11 with a brief look at a smart and simple initialization technique called Layer-wise Sequential Unit Variance (LSUV). We implement it from scratch, and then use the methods introduced in the previous lesson to investigate the impact of this technique on our model training. It looks pretty good!

Then we look at one of the jewels of fastai: the Data Block API. We already saw how to use this API in part 1 of the course; but now we learn how to create it from scratch, and in the process we also will learn a lot about how to better use it and customize it. We’ll look closely at each step:

  • Get files: we’ll learn how os.scandir provides a highly optimized way to access the filesystem, and os.walk provides a powerful recursive tree walking abstraction on top of that
  • Transformations: we create a simple but powerful list and function composition to transform data on-the-fly
  • Split and label: we create flexible functions for each
  • DataBunch: we’ll see that DataBunch is a very simple container for our DataLoaders

Next up, we build a new StatefulOptimizer class, and show that nearly all optimizers used in modern deep learning training are just special cases of this one class. We use it to add weight decay, momentum, Adam, and LAMB, and take a detailed look at how momentum changes training.

Finally, we look at data augmentation, and benchmark various data augmentation techniques. We develop a new GPU-based data augmentation approach which we find speeds things up quite dramatically, and allows us to then add more sophisticated warp-based transformations.

Layer-wise Sequential Unit Variance: a smart and simple init

Paper: All You Need is a Good Init

Notebook: 07a_lsuv

Trick: in deep learning, modules form a tree, so we need recursion to find modules. To concatenate the lists of modules produced by the recursion, we can use sum(list_of_lists, []), which starts from an empty list.

def find_modules(m, cond):
    if cond(m): return [m]
    return sum([find_modules(o,cond) for o in m.children()], [])

mods = find_modules(learn.model, lambda o: isinstance(o,ConvLayer))

The code for LSUV

def append_stat(hook, mod, inp, outp):
    d = outp.data
    hook.mean,hook.std = d.mean().item(),d.std().item()

mdl = learn.model.cuda()

with Hooks(mods, append_stat) as hooks:
    for hook in hooks: print(hook.mean,hook.std)

def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    # `mdl(xb) is not None` just runs a forward pass so the hook records fresh stats
    while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

    h.remove()
    return h.mean,h.std

for m in mods: print(lsuv_module(m, xb))

%time run.fit(2, learn)

While the mean is not near zero, keep subtracting it from the bias.

While the std is not 1, keep dividing the weight by it.

This is the fastai way of initialization: just a loop, no math!

Note: LSUV is run once at the beginning before training. If you have a small batch size, run it for 5 batches and take the mean.

Data Block API

MNIST is a toy problem and it is too easy for serious research. CIFAR-10’s problem is that it’s 32x32, and it has very different characteristics from large images. Once the images are smaller than 96x96, things are quite different. Yet ImageNet is too large to experiment on.

So Jeremy created new datasets: Imagenette and Imagewoof. Imagenette (with French pronunciation) is designed to be easier. It has 10 very different classes. Imagewoof has only dog breeds.

Note: a big part of making deep learning useful in any domain is to make small workable datasets.

The Data Block API enables us to load huge datasets bit by bit, because we can’t fit the whole dataset in RAM.

Trick: you can add a custom method to any standard library class. Just take the class and define a new method on it (this is an advantage of using a dynamic language)

import PIL,os,mimetypes
Path.ls = lambda x: list(x.iterdir())

Joke: If a person says they are a deep learning practitioner, they must know tenches. Tench is the 1st class in ImageNet.

When we load an image, its size is (160, 239, 3) (h, w, ch), and the pixels are uint8.

Notice it’s not square, and the images in this dataset have different sizes. We need to resize them in order to put them in the same batch.

Tip: Python’s os.scandir(path) is a super fast way to list a directory’s contents; it’s written in C. os.walk(path) builds on it to recursively walk a directory tree. Both are faster and lower-level than glob, which has more functionality and should be using scandir under the hood.

For 13394 files, it took ~70ms, which is extremely fast. The original ImageNet is ~100x bigger, so it would take just a few seconds.
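To make the idea concrete, here is a hedged sketch of a scandir-backed recursive file search. `get_files_simple` is a hypothetical helper for illustration, not the notebook’s `get_files` (which adds options such as `recurse` and `include` on top of the same core):

```python
import os

def get_files_simple(path, extensions):
    "Hypothetical sketch: collect files with the given extensions, recursively."
    extensions = {e.lower() for e in extensions}
    res = []
    # os.walk is built on os.scandir, so the traversal stays fast
    for dirpath, dirnames, filenames in os.walk(path):
        res += [os.path.join(dirpath, f) for f in filenames
                if os.path.splitext(f)[1].lower() in extensions]
    return res
```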

Note: Jeremy spent a lot of time on these notebooks in part II; they are his research journal. Much of the code is already in fastai v1.

Prepare for modeling

What we need to do:

  • Get files
  • Split validation set
    • random%, folder name, csv, …
  • Label:
    • folder name, file name/re, csv, …
  • Transform per image (optional)
  • Transform to tensor
  • DataLoader
  • Transform per batch (optional)
  • DataBunch
  • Add test set (optional)
def compose(x, funcs, *args, order_key='_order', **kwargs):
    key = lambda o: getattr(o, order_key, 0)
    for f in sorted(listify(funcs), key=key): x = f(x, **kwargs)
    return x

class ItemList(ListContainer):
    def __init__(self, items, path='.', tfms=None):
        super().__init__(items)
        self.path,self.tfms = Path(path),tfms

    def __repr__(self): return f'{super().__repr__()}\nPath: {self.path}'

    def new(self, items, cls=None):
        # cls (or self.__class__ by default) becomes the constructor
        if cls is None: cls=self.__class__
        return cls(items, self.path, tfms=self.tfms)

    def  get(self, i): return i
    def _get(self, i): return compose(self.get(i), self.tfms)

    def __getitem__(self, idx):
        res = super().__getitem__(idx)
        if isinstance(res,list): return [self._get(o) for o in res]
        return self._get(res)

class ImageList(ItemList):
    @classmethod
    def from_files(cls, path, extensions=None, recurse=True, include=None, **kwargs):
        if extensions is None: extensions = image_extensions
        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)

    def get(self, fn): return PIL.Image.open(fn)

Tip: compose is a very useful concept from functional programming: it takes a list of functions, calls the first on the input, and plugs each result into the next function.
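As a tiny standalone check of the `_order` sorting, restating `compose` and a minimal `listify` so the snippet runs on its own (the two transform functions are made up for illustration):

```python
def listify(o):
    # minimal version: wrap a single function (or None) into a list
    if o is None: return []
    return o if isinstance(o, list) else [o]

def compose(x, funcs, *args, order_key='_order', **kwargs):
    key = lambda o: getattr(o, order_key, 0)
    for f in sorted(listify(funcs), key=key): x = f(x, **kwargs)
    return x

def to_upper(s): return s.upper()
to_upper._order = 1
def add_excl(s): return s + '!'
add_excl._order = 0

# add_excl (_order=0) runs before to_upper (_order=1), regardless of list order
print(compose('hello', [to_upper, add_excl]))  # HELLO!
```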


We need some kind of Processor to do things on the training set and be able to do the same to the validation/test sets later, things such as

  • processing texts to tokenize and numericalize them
  • filling in missing values with median computed from training set for tabular data, storing the statistics in the Processor
  • converting label strings to numbers in a consistent and reproducible way

In the image classification case, we create a list of possible labels in the training set, and then convert our labels to numbers based on this vocab.
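A minimal sketch of that pattern (a hypothetical class, simplified from the notebook’s CategoryProcessor, which keeps first-seen order rather than sorting): the vocab is built once from the training labels, then reused unchanged for validation.

```python
class SimpleCategoryProcessor:
    "Hypothetical sketch: map label strings to ints with a vocab built on first use."
    def __init__(self): self.vocab = None
    def __call__(self, items):
        if self.vocab is None:  # first call: build the vocab from training labels
            self.vocab = sorted(set(items))
            self.otoi = {v:i for i,v in enumerate(self.vocab)}
        return [self.otoi[o] for o in items]

proc = SimpleCategoryProcessor()
train_ids = proc(['dog', 'cat', 'dog'])  # builds vocab ['cat', 'dog']
valid_ids = proc(['cat'])                # reuses the same vocab
```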

Note: when a trained model is no better than random, the most common issue is that the validation set and the training set have different processing/mapping. Validation set should use the same processing with the same vocab and statistics as the training set!

Tip: whatever framework you use, have something like the fastai Labeler and Processor classes to help you remember the right things to do.

Question: How to make the model handle unseen categories at inference time?

Answer: Great question, especially for tasks with unlimited classes. In that case, find some rare classes in your data and label them as “other”, and train with some examples in that “other” category. That way, the model should be able to handle unseen categories better.

Note: learn Python @classmethod, what they are, when and why to use them. Refer to the post here.
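A quick illustration of what @classmethod buys you (made-up classes): cls is the class the method was called on, so alternate constructors like ImageList.from_files automatically return instances of the subclass.

```python
class ItemContainer:
    def __init__(self, items): self.items = items
    @classmethod
    def from_range(cls, n):
        # cls, not ItemContainer, so subclasses get subclass instances
        return cls(list(range(n)))

class IntList(ItemContainer): pass

il = IntList.from_range(3)
print(type(il).__name__, il.items)  # IntList [0, 1, 2]
```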

Then we resize and turn the images into tensors.

tfms = [make_rgb, ResizeFixed(128), to_byte_tensor, to_float_tensor]

il = ImageList.from_files(path, tfms=tfms)
sd = SplitData.split_by_func(il, splitter)
ll = label_by_func(sd, parent_labeler, proc_y=CategoryProcessor())


Data Bunch

class DataBunch():
    def __init__(self, train_dl, valid_dl, c_in=None, c_out=None):
        self.train_dl,self.valid_dl,self.c_in,self.c_out = \
            train_dl,valid_dl,c_in,c_out

    @property
    def train_ds(self): return self.train_dl.dataset

    @property
    def valid_ds(self): return self.valid_dl.dataset

def databunchify(sd, bs, c_in=None, c_out=None, **kwargs):
    dls = get_dls(sd.train, sd.valid, bs, **kwargs)
    return DataBunch(*dls, c_in=c_in, c_out=c_out)

SplitData.to_databunch = databunchify

Summarize all the steps from the start
path = datasets.untar_data(datasets.URLs.IMAGENETTE_160)
tfms = [make_rgb, ResizeFixed(128), to_byte_tensor, to_float_tensor]

il = ImageList.from_files(path, tfms=tfms)
sd = SplitData.split_by_func(
    il, partial(grandparent_splitter, valid_name='val'))
ll = label_by_func(sd, parent_labeler, proc_y=CategoryProcessor())
data = ll.to_databunch(bs, c_in=3, c_out=10, num_workers=4)


In CNNs, 3x3 kernels are the best bang for your buck.

cbfs = [partial(AvgStatsCallback,accuracy),
        CudaCallback]

def normalize_chan(x, mean, std):
    return (x-mean[...,None,None]) / std[...,None,None]

_m = tensor([0.47, 0.48, 0.45])
_s = tensor([0.29, 0.28, 0.30])
norm_imagenette = partial(normalize_chan, mean=_m.cuda(), std=_s.cuda())

cbfs.append(partial(BatchTransformXCallback, norm_imagenette))
nfs = [64,64,128,256]

def prev_pow_2(x): return 2**math.floor(math.log2(x))

def get_cnn_layers(data, nfs, layer, **kwargs):
    def f(ni, nf, stride=2): return layer(ni, nf, 3, stride=stride, **kwargs)
    l1 = data.c_in
    l2 = prev_pow_2(l1*3*3)
    layers =  [f(l1  , l2  , stride=1),
               f(l2  , l2*2, stride=2),
               f(l2*2, l2*4, stride=2)]
    nfs = [l2*4] + nfs
    layers += [f(nfs[i], nfs[i+1]) for i in range(len(nfs)-1)]
    layers += [nn.AdaptiveAvgPool2d(1), Lambda(flatten),
               nn.Linear(nfs[-1], data.c_out)]
    return layers

def get_cnn_model(data, nfs, layer, **kwargs):
    return nn.Sequential(*get_cnn_layers(data, nfs, layer, **kwargs))

def get_learn_run(nfs, data, lr, layer, cbs=None, opt_func=None, **kwargs):
    model = get_cnn_model(data, nfs, layer, **kwargs)
    return get_runner(model, data, lr=lr, cbs=cbs, opt_func=opt_func)

sched = combine_scheds([0.3,0.7], cos_1cycle_anneal(0.1,0.3,0.05))
learn,run = get_learn_run(nfs, data, 0.2, conv_layer, cbs=cbfs+[
    partial(ParamScheduler, 'lr', sched)])

Create model_summary()

def model_summary(run, learn, data, find_all=False):
    xb,yb = get_batch(data.valid_dl, run)
    device = next(learn.model.parameters()).device  # model may not be on the GPU yet
    xb,yb = xb.to(device),yb.to(device)
    mods = find_modules(learn.model, is_lin_layer) if find_all else learn.model.children()
    f = lambda hook,mod,inp,out: print(f"{mod}\n{out.shape}\n")
    with Hooks(mods, f) as hooks: learn.model(xb)

Training the model:

%time run.fit(5, learn)
train: [1.7975745138242594, tensor(0.3771, device='cuda:0')]
valid: [1.950084228515625, tensor(0.3640, device='cuda:0')]
train: [1.331341733558244, tensor(0.5549, device='cuda:0')]
valid: [1.182614013671875, tensor(0.6160, device='cuda:0')]
train: [1.0004353405653792, tensor(0.6729, device='cuda:0')]
valid: [0.9452028198242187, tensor(0.6740, device='cuda:0')]
train: [0.744675257750698, tensor(0.7583, device='cuda:0')]
valid: [0.8292762451171874, tensor(0.7360, device='cuda:0')]
train: [0.5341721137253761, tensor(0.8359, device='cuda:0')]
valid: [0.798895751953125, tensor(0.7360, device='cuda:0')]
CPU times: user 25.6 s, sys: 10.7 s, total: 36.4 s
Wall time: 1min 7s

This is the most basic CNN, and its performance is not bad!


Notebook: 09_optimizers

Jeremy: we don’t need to re-implement the optimizer every time a new one comes out. There is only ONE generic optimizer, and we can change it to get every optimizer.

In PyTorch, the optimizer is essentially just a dictionary of parameter groups and their hyperparameters.

class Optimizer():
    def __init__(self, params, steppers, **defaults):
        # params is a list of lists of parameter tensors;
        # it might be a generator, so materialize it first
        self.param_groups = list(params)
        # ensure params is a list of lists
        if not isinstance(self.param_groups[0], list):
            self.param_groups = [self.param_groups]
        # One dict of hyperparameters (lr, eps, etc.) for each parameter
        # group. `{**defaults}` copies the defaults dict rather than
        # pointing every group at the same reference.
        self.hypers = [{**defaults} for p in self.param_groups]
        self.steppers = listify(steppers)

    def grad_params(self):
        return [(p,hyper) for pg,hyper in zip(self.param_groups,self.hypers)
            for p in pg if p.grad is not None]

    def zero_grad(self):
        for p,hyper in self.grad_params():
            p.grad.detach_()
            p.grad.zero_()

    def step(self):
        for p,hyper in self.grad_params():
            compose(p, self.steppers, **hyper)

def sgd_step(p, lr, **kwargs):
    p.data.add_(-lr, p.grad.data)
    return p

opt_func = partial(Optimizer, steppers=[sgd_step])

We can also add weight decay. We can do it in one of two ways:

  • L2 regularization: add the penalty to the loss, which amounts to adding wd*weight to the gradient: weight.grad += wd*weight
  • Weight decay: decay the weight directly: weight = weight*(1 - lr*wd) - lr*weight.grad

These two ways are only equivalent for vanilla SGD, where both give the update
weight = weight - lr*(weight.grad + wd*weight)
For RMSProp and Adam, the second way (true weight decay) is better. It is described in the paper Decoupled Weight Decay Regularization. fastai made the second way the default.
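A quick numeric check of that equivalence for vanilla SGD, with arbitrary example values:

```python
lr, wd = 0.1, 0.01
w, g = 2.0, 0.5  # a weight and its gradient

# Way 1: L2 regularization folded into the gradient
w1 = w - lr*(g + wd*w)
# Way 2: decay the weight directly, then take the plain SGD step
w2 = w*(1 - lr*wd) - lr*g

print(abs(w1 - w2) < 1e-9)  # True
```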

We can implement each as a stepper.

def weight_decay(p, lr, wd, **kwargs):
    p.data.mul_(1 - lr*wd)
    return p

weight_decay._defaults = dict(wd=0.)

# Or
def l2_reg(p, lr, wd, **kwargs):
    # add_(a, b) multiplies a by b, then adds the result in place
    p.grad.data.add_(wd, p.data)
    return p

l2_reg._defaults = dict(wd=0.)

Tip: PyTorch’s add_ can take two parameters a, b: it multiplies a by b first, then adds the result to the tensor in place.
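For example (the two-argument form shown in the notebook is deprecated in current PyTorch; alpha= is the equivalent modern spelling):

```python
import torch

t = torch.zeros(3)
# in place: t += 2 * ones; older code writes t.add_(2, torch.ones(3))
t.add_(torch.ones(3), alpha=2)
print(t.tolist())  # [2.0, 2.0, 2.0]
```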

Add momentum

Momentum needs the previous state of all parameters to work. We store it in a dict state.

class StatefulOptimizer(Optimizer):
    def __init__(self, params, steppers, stats=None, **defaults):
        self.stats = listify(stats)
        maybe_update(self.stats, defaults, get_defaults)
        super().__init__(params, steppers, **defaults)
        self.state = {}

    def step(self):
        for p,hyper in self.grad_params():
            if p not in self.state:
                # Create a state for p and call all the
                # statistics to initialize it.
                self.state[p] = {}
                for stat in self.stats:
                    self.state[p].update(stat.init_state(p))
            state = self.state[p]
            for stat in self.stats:
                state = stat.update(p, state, **hyper)
            compose(p, self.steppers, **state, **hyper)
            self.state[p] = state

class AverageGrad(Stat):
    _defaults = dict(mom=0.9)

    def init_state(self, p):
        return {'grad_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, state, mom, **kwargs):
        state['grad_avg'].mul_(mom).add_(p.grad.data)
        return state

def momentum_step(p, lr, grad_avg, **kwargs):
    p.data.add_(-lr, grad_avg)
    return p

sgd_mom_opt = partial(
    StatefulOptimizer, steppers=[momentum_step,weight_decay],
    stats=AverageGrad(), wd=0.01)

learn,run = get_learn_run(
    nfs, data, 0.3, conv_layer, cbs=cbfs, opt_func=sgd_mom_opt)

!!NOTE!! Batch norm’s multiplicative weights make L2 regularization not work the way we expect!

Everybody has been doing it wrong! This paper L2 regularization vs batch and weight normalization pointed it out. But this paper also isn’t completely correct. L2 regularization DOES DO SOMETHING. There are some more recent papers that tried to explain what L2 regularization does along with batch norm, but the current state is that no one understands it completely.

Jeremy: The theory people who did this kind of research don’t know how to train models, and the practitioners forget about the theory. If you can combine the two, you can find interesting results!

Momentum is also interesting. We use an Exponentially Weighted Moving Average (EWMA), which is lerp in PyTorch.
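A tiny sketch of EWMA via torch.lerp, with illustrative values: lerp(avg, new, 1-mom) computes avg*mom + new*(1-mom).

```python
import torch

mom = 0.9
avg = torch.tensor(0.)
for x in [1., 1., 1.]:
    # keep a running exponentially weighted average of the incoming values
    avg = torch.lerp(avg, torch.tensor(x), 1 - mom)
print(round(avg.item(), 3))  # 0.271
```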

Next, the notebook describes the implementation of Adam and LAMB (paper) with the generic optimizer structure. Refer to the notebook for more details.

Jeremy implemented the LAMB approach a week before the paper came out. The paper is highly recommended.

Refactor: remove Runner, just have Learner

Notebook: 09b_learner


Jeremy realized the Runner class just stores 3 things and it’s much better to just put it in Learner. This makes the code much easier to use.

Add a progress bar

Notebook: 09c_add_progress_bar


The components used are

from fastprogress import master_bar, progress_bar

And we create the callback

class AvgStatsCallback(Callback):
    def __init__(self, metrics):
        self.train_stats,self.valid_stats = \
            AvgStats(metrics,True),AvgStats(metrics,False)

    def begin_fit(self):
        met_names = ['loss'] + [
            m.__name__ for m in self.train_stats.metrics]
        names = ['epoch'] + [f'train_{n}' for n in met_names] + \
            [f'valid_{n}' for n in met_names] + ['time']
        self.logger(names)

    def begin_epoch(self):
        self.train_stats.reset()
        self.valid_stats.reset()
        self.start_time = time.time()

    def after_loss(self):
        stats = self.train_stats if self.in_train else \
            self.valid_stats
        with torch.no_grad(): stats.accumulate(self.run)

    def after_epoch(self):
        stats = [str(self.epoch)]
        for o in [self.train_stats, self.valid_stats]:
            stats += [f'{v:.6f}' for v in o.avg_stats]
        stats += [format_time(time.time() - self.start_time)]
        self.logger(stats)

Then we add the progress bars… with a Callback of course! master_bar handles the count over the epochs while its child progress_bar loops over all the batches. We just create one at the beginning of each epoch/validation phase, and update it at the end of each batch. By changing the logger of the Learner to the write function of the master bar, everything is automatically written there.

Note: this requires fastprogress v0.1.21 or later.

# export
class ProgressCallback(Callback):
    def begin_fit(self):
        self.mbar = master_bar(range(self.epochs))
        self.run.logger = partial(self.mbar.write, table=True)

    def after_fit(self): self.mbar.on_iter_end()
    def after_batch(self): self.pb.update(self.iter)
    def begin_epoch   (self): self.set_pb()
    def begin_validate(self): self.set_pb()

    def set_pb(self):
        self.pb = progress_bar(self.dl, parent=self.mbar)

Then use it like this

cbfs = [partial(AvgStatsCallback,accuracy),
        CudaCallback,
        ProgressCallback,
        partial(BatchTransformXCallback, norm_imagenette)]

learn = get_learner(nfs, data, 0.4, conv_layer, cb_funcs=cbfs)

Data augmentation

Notebook: 10_augmentation

To further improve our Imagenette model, we need data augmentation.

The key takeaway is that there is no “best transform” for data augmentation. Try things out and take a close look at the results.

Some transform examples:

img.resize((128,128), resample=PIL.Image.ANTIALIAS)

img.resize((128,128), resample=PIL.Image.BILINEAR)

img.resize((128,128), resample=PIL.Image.NEAREST)

img.resize((256,256), resample=PIL.Image.BICUBIC)\
   .resize((128,128), resample=PIL.Image.NEAREST)

Tip: doing transforms on bytes (i.e. uint8) is much faster than on floats. Converting bytes to floats is much slower than something as complex as a warp!

Some particularly useful transforms:

  • Zooming in. It works great for images, text, and audio.
  • Perspective warping for images. It requires solving a system of linear equations, and PyTorch has this solver!

One comment from Jeremy is that for zooming in, if the object of interest - the tench - is cropped out by the zoom-in effect, it’s OK! It’s still the ImageNet winning strategy. Ultimately it creates noisy labels where some labels are just wrong. All the research showed that noisy labels are okay - because the model learns to link other things in that image with the label.

For music data augmentation, for example, you can do pitch shifting, volume changes, cutting, etc. Ultimately it depends on the domain and what you need.

In the next lesson, we will introduce MixUp, a data augmentation technique that dramatically improves results no matter the domain, and can be run on GPU. It will make some of the techniques here irrelevant.

Papers to read