# Parallelization with TFDS

In this week’s exercise, we’ll go back to the classic cats versus dogs example, but instead of just naively loading the data to train a model, you will parallelize various stages of the Extract, Transform, and Load (ETL) process. In particular, you will perform the following tasks:

1. Parallelize the extraction of the stored TFRecords of the cats_vs_dogs dataset by using the interleave operation.
2. Parallelize the transformation during the preprocessing of the raw dataset by using the map operation.
3. Cache the processed dataset in memory by using the cache operation for faster retrieval.
4. Parallelize the loading of the cached dataset during the training cycle by using the prefetch operation.

## Naive Approach

Just for comparison, let’s start by using the naive approach to Extract, Transform, and Load the data to train the model defined above. By naive approach we mean that we won’t apply any of the new concepts of parallelization that we learned about in this module.

The next step will be to train the model using the following code:
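A minimal runnable sketch of that training step is shown below; the tiny model and synthetic dataset here are stand-ins for the real `model` and `dataset` defined in the cells above:

```python
import tensorflow as tf

# Stand-ins so the sketch runs on its own: in the notebook, `model` and
# `dataset` come from the cells above.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([8, 224, 224, 3]), tf.zeros([8]))).batch(4)

history = model.fit(dataset, epochs=1)
```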

Since we want to focus on the parallelization techniques, we won’t go through the training process here, as this can take some time.

# Parallelize Various Stages of the ETL Processes

The following exercises revisit the four tasks listed in the introduction: parallelizing extraction with the interleave operation, transformation with the map operation, caching with the cache operation, and loading with the prefetch operation.

We start by creating a dataset of strings corresponding to the file_pattern of the TFRecords of the cats_vs_dogs dataset.
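A runnable sketch of this step is given below. It creates stand-in shard files so it runs anywhere; in the notebook, file_pattern would instead point at the TFDS data directory (its exact path and version are installation-specific):

```python
import os
import tempfile
import tensorflow as tf

# Stand-in shards: in practice the shards already exist under the TFDS
# data directory, e.g. "~/tensorflow_datasets/cats_vs_dogs/<version>/".
data_dir = tempfile.mkdtemp()
for i in range(2):
    path = os.path.join(data_dir, f"cats_vs_dogs-train.tfrecord-0000{i}-of-00002")
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(b"dummy record")

# A dataset of strings, one per matching TFRecord shard.
file_pattern = os.path.join(data_dir, "cats_vs_dogs-train.tfrecord*")
files = tf.data.Dataset.list_files(file_pattern)
```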

Let’s recall that the TFRecord format is a simple format for storing a sequence of binary records. This is very useful because serializing the data and storing it in a set of files (100–200 MB each), each of which can be read linearly, greatly increases the efficiency of reading the data.

Since we will use it later, we should also recall that a tf.train.Example message (a protocol buffer) is a flexible message type that represents a {"string": tf.train.Feature} mapping.
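As a quick refresher, such a message wrapping an image/label pair can be built and serialized like this (the byte string here is a stand-in for real encoded image data):

```python
import tensorflow as tf

# Build a {"string": tf.train.Feature} mapping and wrap it in an Example.
# b"jpeg bytes" stands in for real JPEG-encoded image bytes.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg bytes"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# Serialize the message to a binary record, ready to be written to a TFRecord file.
serialized = example.SerializeToString()
```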

## Parallelize Extraction

In the cell below you will use the interleave operation, supplying its cycle_length and num_parallel_calls arguments, to parallelize the extraction of the stored TFRecords of the cats_vs_dogs dataset.

Recall that tf.data.experimental.AUTOTUNE will delegate the decision about what level of parallelism to use to the tf.data runtime.
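A runnable sketch of the parallel extraction is given below. The stand-in shards and the cycle_length value are assumptions; in the notebook, files is the dataset of shard filenames built in the previous step:

```python
import os
import tempfile
import tensorflow as tf

# Stand-in shards so the sketch runs anywhere.
data_dir = tempfile.mkdtemp()
for i in range(4):
    path = os.path.join(data_dir, f"shard-{i}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(f"record-{i}".encode())

files = tf.data.Dataset.list_files(os.path.join(data_dir, "shard-*.tfrecord"))

# Parallel extraction: read several shards concurrently and interleave their
# records; AUTOTUNE lets the tf.data runtime pick the level of parallelism.
train_dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
```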

## Parse and Decode

At this point the train_dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors. The sample output for one record is given below:

In order to use these tensors to train our model, we must first parse and decode them. In the cell below you will create a read_tfrecord function that reads the serialized tf.train.Example messages and decodes them. The function will also normalize and resize the images after they have been decoded.

In order to parse the tf.train.Example messages we need to create a feature_description dictionary. This dictionary is necessary because tf.data uses graph execution and therefore needs the description to build the shape and type signature of the parsed tensors. The basic structure of the feature_description dictionary looks like this:

The number of features in your feature_description dictionary will vary depending on your dataset. In our particular case, the features are 'image' and 'label' and can be seen in the sample output of the string tensor above. Therefore, our feature_description dictionary will look like this:
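Concretely, a dictionary matching these two features might look like the following sketch:

```python
import tensorflow as tf

# One entry per feature stored in the serialized tf.train.Example messages:
# a scalar string for the encoded image and a scalar int64 for the label.
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string, default_value=""),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
}
```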

where we have given the default values "" and -1 to 'image' and 'label', respectively.

The next step will be to parse the serialized tf.train.Example message using the feature_description dictionary given above. This can be done with the following code:

Finally, we can decode the image by using:

Use the code given above to complete the exercise below.
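Putting these pieces together, a sketch of read_tfrecord might look like this (the 224×224 target size and the JPEG encoding of the stored images are assumptions):

```python
import tensorflow as tf

def read_tfrecord(serialized_example):
    # Description of the features stored in each tf.train.Example message.
    feature_description = {
        'image': tf.io.FixedLenFeature([], tf.string, default_value=""),
        'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    }
    # Parse the serialized message into a dict of tensors.
    example = tf.io.parse_single_example(serialized_example, feature_description)
    # Decode the raw JPEG bytes, then normalize to [0, 1] and resize.
    image = tf.io.decode_jpeg(example['image'], channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, (224, 224))  # target size is an assumption
    return image, example['label']
```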

## Parallelize Transformation

You can now apply the read_tfrecord function to each item in train_dataset by using the map method. To parallelize this transformation, set map's num_parallel_calls argument to the number of available CPU cores.
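A sketch of the parallel map is shown below, with stand-ins for train_dataset and read_tfrecord so it runs on its own; in the notebook these come from the previous steps:

```python
import multiprocessing
import tensorflow as tf

# Number of CPU cores to spread the map transformation across.
cores = multiprocessing.cpu_count()

# Stand-ins: in the notebook, `train_dataset` holds the serialized records
# and `read_tfrecord` is the parsing function defined above.
train_dataset = tf.data.Dataset.from_tensor_slices(tf.range(8))
double = lambda x: x * 2

# Parallel transformation: apply the function to `cores` elements at a time.
train_dataset = train_dataset.map(double, num_parallel_calls=cores)
```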
