Fastai Data Block API

The essential steps:

1. Define the source of your inputs (X values).

ImageItemList.from_folder(path) 

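## This step produces an `ItemList` object. `ImageItemList` is the name used in early fastai v1 releases; later v1 releases renamed it to `ImageList`, which is the name the example at the end of these notes uses.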

## ItemBase: The ItemBase class defines what an “item” in your inputs or your targets looks like. 

## ItemList: An ItemList defines a collection of items (e.g., ItemBase objects), including how they are individually fetched and displayed.
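
To make the ItemBase / ItemList relationship concrete, here is a minimal plain-Python sketch. `ToyItem` and `ToyItemList` are hypothetical stand-ins written for these notes, not fastai classes; they only mirror the idea that the list stores lightweight sources and builds an item object when one is fetched.

```python
# Toy stand-ins for ItemBase / ItemList (hypothetical names, not fastai code).

class ToyItem:
    """One item: keeps the raw object plus the `data` a model would consume."""
    def __init__(self, obj):
        self.obj = obj                        # original object, kept for display
        self.data = [float(v) for v in obj]   # numeric form that would be fed to a model

    def __repr__(self):
        return f"ToyItem({self.obj})"


class ToyItemList:
    """A collection of sources that builds a ToyItem on access."""
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def get(self, i):
        # Defines how a single item is fetched; a subclass would override this.
        return ToyItem(self.items[i])

    def __getitem__(self, i):
        return self.get(i)


il = ToyItemList([(0, 1), (2, 3), (4, 5)])
first = il[0]
print(len(il), first, first.data)
```

Overriding `get` in a subclass is the hook the custom `ImageTupleList` in the example at the end relies on.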

2. Define how you want to split your inputs into training and validation datasets using one of the built-in mechanisms for doing so.

(ImageItemList.from_folder(path)
              .split_by_folder())

## This step produces an `ItemLists` object.

## ItemLists: A collection of ItemList instances for your inputs or targets. The split function above returns an `ItemLists` object holding a separate ItemList instance for each of your training and validation sets.
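
The idea behind a folder-based split can be sketched in plain Python. This is an illustration only, not fastai's implementation; the helper `split_by_parent_folder`, the `train` / `valid` folder names, and the sample paths are assumptions made for the example.

```python
from pathlib import PurePosixPath

def split_by_parent_folder(paths, train="train", valid="valid"):
    """Partition paths by whether `train` or `valid` appears among their parent folders."""
    train_items, valid_items = [], []
    for p in map(PurePosixPath, paths):
        parents = p.parts[:-1]          # every path component except the file name
        if train in parents:
            train_items.append(p)
        elif valid in parents:
            valid_items.append(p)
        # paths under neither folder are left out of both sets
    return train_items, valid_items

# Made-up paths, used only to exercise the helper.
paths = [
    "data/train/cat/1.png",
    "data/train/dog/2.png",
    "data/valid/cat/3.png",
    "data/test/4.png",
]
train_items, valid_items = split_by_parent_folder(paths)
print(len(train_items), len(valid_items))
```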

3. Define the source of your targets (that is, your y values) and combine them with the inputs of your training and validation datasets in the form of fastai LabelList objects. LabelList subclasses the PyTorch Dataset class.

(ImageItemList.from_folder(path)
              .split_by_folder()
              .label_from_folder())

## This step produces a `LabelLists` object.

## LabelList: A LabelList is a PyTorch Dataset that combines your input and target ItemList classes (an inputs ItemList + a targets ItemList = a LabelList). 

## LabelLists: A collection of LabelList instances you get as a result of your labeling function. Again, a `LabelList` is a PyTorch Dataset and essentially defines the things (your inputs and, optionally, targets) fed into the forward function of your model.
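
As a rough sketch of "inputs list + targets list = dataset", the snippet below pairs each input path with a label taken from its parent folder. `ToyLabelList` and `label_from_parent` are hypothetical names for illustration and do not reproduce fastai's LabelList.

```python
from pathlib import PurePosixPath

class ToyLabelList:
    """Pairs an inputs list with a targets list, Dataset-style (hypothetical)."""
    def __init__(self, x, y):
        if len(x) != len(y):
            raise ValueError("inputs and targets must have the same length")
        self.x, self.y = list(x), list(y)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # A dataset item is the (input, target) pair at index i.
        return self.x[i], self.y[i]

def label_from_parent(path):
    """Use the name of the immediate parent folder as the label."""
    return PurePosixPath(path).parent.name

# Made-up paths, used only for the demonstration.
train_x = ["data/train/cat/1.png", "data/train/dog/2.png"]
train_ds = ToyLabelList(train_x, [label_from_parent(p) for p in train_x])
print(train_ds[1])
```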

## Pre-Processing: This is also where any PreProcessor classes you’ve passed into your ItemList class run. These classes define things you want done to your data once, before they are turned into PyTorch Datasets/DataLoaders. Examples include tokenizing and numericalizing text, filling in missing values in tabular data, etc. You can define a default `PreProcessor`, or a collection of PreProcessors, to run by overriding the `_processor` class variable in your custom ItemList.
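
The "run once, then reuse" behavior of a pre-processor can be illustrated with a toy label encoder: the vocabulary is built from the first (training) set it sees and reused unchanged for later sets. `ToyCategoryProcessor` is a hypothetical class, and the `-1` marker for unseen labels is a choice made for this sketch, not fastai behavior.

```python
class ToyCategoryProcessor:
    """Maps label strings to integer ids; the vocabulary comes from the first set processed."""
    def __init__(self):
        self.class_to_idx = None

    def process(self, labels):
        if self.class_to_idx is None:
            # First call (training set): build the vocabulary.
            classes = sorted(set(labels))
            self.class_to_idx = {c: i for i, c in enumerate(classes)}
        # Later calls (validation/test) reuse the same mapping.
        # -1 marks a label that was never seen in training (sketch-only convention).
        return [self.class_to_idx.get(label, -1) for label in labels]

proc = ToyCategoryProcessor()
train_ids = proc.process(["dog", "cat", "dog"])
valid_ids = proc.process(["cat", "bird"])
print(train_ids, valid_ids)
```

Fitting on the training set only keeps the validation and test sets from influencing the encoding.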

4. Add a test dataset (optional).

data = (ImageItemList.from_folder(path) 
                     .split_by_folder()
                     .label_from_folder()
                     .add_test_folder())

## If you add a test set, as above, the same pre-processing applied to your validation set is applied to your test set.

5. Add transforms to your LabelList objects (optional). Here you can apply data augmentation to either, or both, your inputs and targets.

data = (ImageItemList.from_folder(path) 
                     .split_by_folder()
                     .label_from_folder()
                     .add_test_folder()
                     .transform(tfms, size=64))

## Transforms define the data augmentation applied to your inputs, your targets, or both. `tfms` must be defined beforehand, and `size=64` sets the size the images are resized to.
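
The sketch below shows the general pattern of augmentation applied at fetch time, with transforms on the training set and none on the validation set. It is plain Python with hypothetical names (`hflip`, `ToyTransformedDataset`) and does not use fastai's transform machinery. In fastai v1, `tfms` is commonly the value returned by `get_transforms()`, but that is an assumption here since the notes do not define it.

```python
import random

def hflip(img):
    """Horizontally flip an image stored as a list of rows."""
    return [row[::-1] for row in img]

class ToyTransformedDataset:
    """Applies transforms each time an item is fetched; stored data is never modified."""
    def __init__(self, images, tfms=(), p=0.5, seed=0):
        self.images, self.tfms, self.p = images, list(tfms), p
        self.rng = random.Random(seed)

    def __getitem__(self, i):
        img = self.images[i]
        for tfm in self.tfms:
            if self.rng.random() < self.p:   # each transform fires with probability p
                img = tfm(img)
        return img

images = [[[1, 2, 3], [4, 5, 6]]]                               # one tiny 2x3 "image"
train_ds = ToyTransformedDataset(images, tfms=[hflip], p=1.0)   # always flip, so the demo is deterministic
valid_ds = ToyTransformedDataset(images)                        # no augmentation on validation
print(train_ds[0], valid_ds[0])
```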

6. Build PyTorch DataLoaders from the Datasets defined above and package them up into a fastai DataBunch.

data = (ImageItemList.from_folder(path) 
                     .split_by_folder()
                     .label_from_folder()
                     .add_test_folder()
                     .transform(tfms, size=64)
                     .databunch())

## This step produces a `DataBunch` object.

## A DataBunch is a collection of PyTorch DataLoaders returned when you call the databunch function. It also defines how they are created from your training, validation, and optionally test LabelList instances.
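
To illustrate what "a collection of DataLoaders" means, here is a toy version in plain Python: a generator that yields batches, and a small container that exposes one loader per split. `toy_dataloader` and `ToyDataBunch` are hypothetical and far simpler than the PyTorch and fastai classes they imitate (no shuffling, no workers, no collation to tensors).

```python
def toy_dataloader(dataset, bs=2):
    """Yield consecutive batches of (inputs, targets) from a list of (x, y) pairs."""
    for start in range(0, len(dataset), bs):
        batch = dataset[start:start + bs]
        xs, ys = zip(*batch)
        yield list(xs), list(ys)

class ToyDataBunch:
    """Bundles the datasets and exposes one loader per split (hypothetical)."""
    def __init__(self, train_ds, valid_ds, test_ds=None, bs=2):
        self.train_ds, self.valid_ds, self.test_ds = train_ds, valid_ds, test_ds
        self.bs = bs

    def train_dl(self):
        return toy_dataloader(self.train_ds, self.bs)

    def valid_dl(self):
        return toy_dataloader(self.valid_ds, self.bs)

bunch = ToyDataBunch(train_ds=[("a", 0), ("b", 1), ("c", 0)], valid_ds=[("d", 1)])
batches = list(bunch.train_dl())
print(batches)
```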

Once this is done, you’ll have everything you need to train, validate, and test any PyTorch nn.Module using the fastai library. You’ll also have everything you need to later do inference on future data.

Example

from fastai.vision import *  # assumes fastai v1; brings ItemBase, ItemList, ImageList, Image, open_image, etc. into scope

class ImageTuple(ItemBase):
    """
    We need a custom item type because we feed the model tuples of images.
    """
    def __init__(self, img1, img2):
        """
        The core task is to set the `data` attribute, which is what will be
        given to the model. We also keep track of the initial object
        (usually in an `obj` attribute) so that we can show nice
        representations later on.
        """
        self.img1 = img1
        self.img2 = img2
        self.obj = (img1,img2)
        self.data = [-1+2*img1.data,-1+2*img2.data]
    
    def apply_tfms(self, tfms, **kwargs):
        """
        Apply data augmentation to our tuple of images by implementing
        `apply_tfms`: pass the call to the two underlying images, then
        update `data`.
        """
        self.img1 = self.img1.apply_tfms(tfms, **kwargs)
        self.img2 = self.img2.apply_tfms(tfms, **kwargs)
        self.data = [-1+2*self.img1.data,-1+2*self.img2.data]
        return self   
    
    def to_one(self):
        """
        We define a last method to stack the two images next to each other, which we 
        will use later for a customized show_batch / show_results behavior.
        """
        return Image(0.5+torch.cat(self.data,2)/2)

class TargetTupleList(ItemList):
    def reconstruct(self, t:Tensor): 
        if len(t.size()) == 0: return t
        return ImageTuple(Image(t[0]/2+0.5),Image(t[1]/2+0.5))
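
The scaling used in `ImageTuple` (`-1+2*img.data`) and undone in `reconstruct` and `to_one` (`t/2+0.5`) is a pair of inverse mappings between the ranges [0, 1] and [-1, 1]. The sketch below checks that round trip on plain floats; the helper names `to_model_range` and `to_display_range` are made up for illustration.

```python
def to_model_range(x):
    """Map a pixel value from [0, 1] to [-1, 1], as in `-1 + 2*img.data`."""
    return -1 + 2 * x

def to_display_range(t):
    """Inverse mapping from [-1, 1] back to [0, 1], as in `t/2 + 0.5`."""
    return t / 2 + 0.5

pixels = [0.0, 0.25, 0.5, 1.0]
scaled = [to_model_range(p) for p in pixels]
restored = [to_display_range(s) for s in scaled]
print(scaled, restored)
```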

`_bunch` contains the name of the class that will be used to create a DataBunch.
`_processor` contains a class (or a list of classes) of PreProcessor that will then be used as the default to create the processor for this ItemList.
`_label_cls` contains the class that will be used to create the labels by default.

class ImageTupleList(ImageList):
    _label_cls=TargetTupleList
    def __init__(self, items, itemsB=None, **kwargs):
        super().__init__(items, **kwargs)
        self.itemsB = itemsB
        self.copy_new.append('itemsB')
    
    def get(self, i):
        img1 = super().get(i)
        fn = self.itemsB[random.randint(0, len(self.itemsB)-1)]
        return ImageTuple(img1, open_image(fn))
    
    def reconstruct(self, t:Tensor): 
        return ImageTuple(Image(t[0]/2+0.5),Image(t[1]/2+0.5))
    
    @classmethod
    def from_folders(cls, path, folderA, folderB, **kwargs):
        itemsB = ImageList.from_folder(path/folderB).items
        res = super().from_folder(path/folderA, itemsB=itemsB, **kwargs)
        res.path = path
        return res

    def show_xys(self, xs, ys, figsize:Tuple[int,int]=(12,6), **kwargs):
        "Show the `xs` and `ys` on a figure of `figsize`. `kwargs` are passed to the show method."
        rows = int(math.sqrt(len(xs)))
        fig, axs = plt.subplots(rows,rows,figsize=figsize)
        for i, ax in enumerate(axs.flatten() if rows > 1 else [axs]):
            xs[i].to_one().show(ax=ax, **kwargs)
        plt.tight_layout()

    
    def show_xyzs(self, xs, ys, zs, figsize:Tuple[int,int]=None, **kwargs):
        """Show `xs` (inputs), `ys` (targets) and `zs` (predictions) on a figure of `figsize`.
        `kwargs` are passed to the show method."""
        figsize = ifnone(figsize, (12,3*len(xs)))
        fig,axs = plt.subplots(len(xs), 2, figsize=figsize)
        fig.suptitle('Ground truth / Predictions', weight='bold', size=14)
        for i,(x,z) in enumerate(zip(xs,zs)):
            x.to_one().show(ax=axs[i,0], **kwargs)
            z.to_one().show(ax=axs[i,1], **kwargs)
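
The `get` method above pairs the i-th image from the first folder with a randomly drawn image from the second, so the pairs are unaligned and change between fetches. A minimal sketch of that sampling pattern, using a hypothetical `ToyPairList` and made-up item names:

```python
import random

class ToyPairList:
    """Pairs item i from domain A with a randomly drawn item from domain B (hypothetical)."""
    def __init__(self, items_a, items_b, seed=None):
        self.items_a, self.items_b = list(items_a), list(items_b)
        self.rng = random.Random(seed)

    def __len__(self):
        # The length follows domain A only, as in ImageTupleList.
        return len(self.items_a)

    def get(self, i):
        a = self.items_a[i]
        # randint is inclusive at both ends, hence len - 1.
        b = self.items_b[self.rng.randint(0, len(self.items_b) - 1)]
        return a, b

pairs = ToyPairList(["a1", "a2"], ["b1", "b2", "b3"], seed=0)
sample = [pairs.get(0) for _ in range(5)]
print(sample)
```

Because the partner is redrawn on every call, the same index can yield different pairs across epochs.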

References

https://blog.usejournal.com/finding-data-block-nirvana-a-journey-through-the-fastai-data-block-api-c38210537fe4
https://docs.fast.ai/tutorial.itemlist.html