- This is an assignment from the Coursera course Medical Diagnosis by deeplearning.ai
Chest X-Ray Medical Diagnosis with Deep Learning
Welcome to the first assignment of Course 1.
In this assignment, you will explore medical image diagnosis by building a state-of-the-art chest X-ray classifier using Keras.
The assignment will walk through some of the steps of building and evaluating this deep learning classifier model. In particular, you will:
- Pre-process and prepare a real-world X-ray dataset
- Use transfer learning to retrain a DenseNet model for X-ray image classification
- Learn a technique to handle class imbalance
- Measure diagnostic performance by computing the AUC (Area Under the Curve) for the ROC (Receiver Operating Characteristic) curve
- Visualize model activity using GradCAMs
In completing this assignment you will learn about the following topics:
- Data preparation
- Visualizing data
- Preventing data leakage
- Model Development
- Addressing class imbalance
- Leveraging pre-trained models using transfer learning
- Evaluation
- AUC and ROC curves
Outline
Use these links to jump to specific sections of this assignment!
- 1. Import Packages and Functions
- 2. Load the Datasets
- 3. Model Development
- 4. Training [optional]
- 5. Prediction and Evaluation
1. Import Packages and Functions
We’ll make use of the following packages:
- `numpy` and `pandas` is what we’ll use to manipulate our data
- `matplotlib.pyplot` and `seaborn` will be used to produce plots for visualization
- `util` will provide the locally defined utility functions that have been provided for this assignment

We will also use several modules from the `keras` framework for building deep learning models.
Run the next cell to import all the necessary packages.
```python
import numpy as np
```
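The rest of that cell was truncated above; judging by how the packages are used later in the assignment, the remaining imports presumably look roughly like this (exact Keras submodule paths may vary by version):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from keras.preprocessing.image import ImageDataGenerator
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
import keras.backend as K

import util  # locally defined utility functions provided with the assignment
```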
2. Load the Datasets
For this assignment, we will be using the ChestX-ray8 dataset which contains 108,948 frontal-view X-ray images of 32,717 unique patients.
- Each image in the data set contains multiple text-mined labels identifying 14 different pathological conditions.
- These in turn can be used by physicians to diagnose 8 different diseases.
- We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies.
- In other words it will predict ‘positive’ or ‘negative’ for each of the pathologies.
You can download the entire dataset for free here.
- We have provided a ~1000 image subset of the images for you.
- These can be accessed in the folder path stored in the `IMAGE_DIR` variable.
The dataset includes a CSV file that provides the labels for each X-ray.
To make your job a bit easier, we have processed the labels for our small sample and generated three new files to get you started. These three files are:
- `nih/train-small.csv`: 875 images from our dataset to be used for training.
- `nih/valid-small.csv`: 109 images from our dataset to be used for validation.
- `nih/test.csv`: 420 images from our dataset to be used for testing.
This dataset has been annotated by consensus among four different radiologists for 5 of our 14 pathologies:
- Consolidation
- Edema
- Effusion
- Cardiomegaly
- Atelectasis
Sidebar on meaning of ‘class’
It is worth noting that the word ‘class’ is used in multiple ways in these discussions.
- We sometimes refer to each of the 14 pathological conditions that are labeled in our dataset as a class.
- But for each of those pathologies we are attempting to predict whether a certain condition is present (i.e. positive result) or absent (i.e. negative result).
- These two possible labels of ‘positive’ or ‘negative’ (or the numerical equivalent of 1 or 0) are also typically referred to as classes.
- Moreover, we also use the term in reference to software code ‘classes’ such as `ImageDataGenerator`.
As long as you are aware of all this though, it should not cause you any confusion as the term ‘class’ is usually clear from the context in which it is used.
Read in the data
Let’s open these files using the pandas library:
```python
train_df = pd.read_csv("nih/train-small.csv")
train_df.head()
```
 | Image | Atelectasis | Cardiomegaly | Consolidation | Edema | Effusion | Emphysema | Fibrosis | Hernia | Infiltration | Mass | Nodule | PatientId | Pleural_Thickening | Pneumonia | Pneumothorax
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 00027079_001.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27079 | 1 | 0 | 0 |
1 | 00004477_001.png | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4477 | 0 | 0 | 0 |
2 | 00018530_002.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18530 | 0 | 0 | 0 |
3 | 00026928_001.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 26928 | 0 | 0 | 0 |
4 | 00016687_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 16687 | 0 | 0 | 0 |
```python
train_df.shape
```
(875, 16)
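The validation and test splits, which the leakage check later relies on, are presumably read the same way:

```python
valid_df = pd.read_csv("nih/valid-small.csv")
test_df = pd.read_csv("nih/test.csv")
```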
```python
labels = ['Cardiomegaly', 'Emphysema', 'Effusion', 'Hernia', 'Infiltration', 'Mass', 'Nodule',
          'Atelectasis', 'Pneumothorax', 'Pleural_Thickening', 'Pneumonia', 'Fibrosis', 'Edema', 'Consolidation']
```
2.1 Preventing Data Leakage
It is worth noting that our dataset contains multiple images for each patient. This could be the case, for example, when a patient has taken multiple X-ray images at different times during their hospital visits. In our data splitting, we have ensured that the split is done on the patient level so that there is no data “leakage” between the train, validation, and test datasets.
Exercise 1 - Checking Data Leakage
In the cell below, write a function to check whether there is leakage between two datasets. We’ll use this to make sure there are no patients in the test set that are also present in either the train or validation sets.
- Make use of Python's `set.intersection()` function.
- In order to match the automatic grader's expectations, please start the line of code with `df1_patients_unique...` [continue your code here]
```python
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
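For reference, a minimal sketch of one possible implementation, assuming the `check_for_leakage(df1, df2, patient_col)` signature used by the test cells below:

```python
def check_for_leakage(df1, df2, patient_col):
    """Return True if any patient ID appears in both dataframes."""
    df1_patients_unique = set(df1[patient_col].unique())
    df2_patients_unique = set(df2[patient_col].unique())

    # leakage exists if the two sets of patient IDs overlap
    patients_in_both_groups = df1_patients_unique.intersection(df2_patients_unique)
    leakage = len(patients_in_both_groups) > 0
    return leakage
```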
```python
# test
```
test case 1
df1
patient_id
0 0
1 1
2 2
df2
patient_id
0 2
1 3
2 4
leakage output: True
-------------------------------------
test case 2
df1:
patient_id
0 0
1 1
2 2
df2:
patient_id
0 3
1 4
2 5
leakage output: False
Expected output
```
test case 1
```
Run the next cell to check if there are patients in both train and test or in both valid and test.
```python
print("leakage between train and test: {}".format(check_for_leakage(train_df, test_df, 'PatientId')))
print("leakage between valid and test: {}".format(check_for_leakage(valid_df, test_df, 'PatientId')))
```
leakage between train and test: False
leakage between valid and test: False
If we get `False` for both, then we’re ready to start preparing the datasets for training. Remember to always check for data leakage!
2.2 Preparing Images
With our dataset splits ready, we can now proceed with setting up our model to consume them.
- For this we will use the off-the-shelf ImageDataGenerator class from the Keras framework, which allows us to build a “generator” for images specified in a dataframe.
- This class also provides support for basic data augmentation such as random horizontal flipping of images.
- We also use the generator to transform the values in each batch so that their mean is $0$ and their standard deviation is $1$.
- This will facilitate model training by standardizing the input distribution.
- The generator also converts our single channel X-ray images (gray-scale) to a three-channel format by repeating the values in the image across all channels.
- We will want this because the pre-trained model that we’ll use requires three-channel inputs.
Since it is mainly a matter of reading and understanding Keras documentation, we have implemented the generator for you. There are a few things to note:
- We normalize the mean and standard deviation of the data
- We shuffle the input after each epoch.
- We set the image size to be 320px by 320px
```python
def get_train_generator(df, image_dir, x_col, y_cols, shuffle=True, batch_size=8, seed=1, target_w = 320, target_h = 320):
    ...  # body provided with the assignment; see the sketch below
```
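For reference, here is a sketch of what the provided implementation typically looks like; the exact code shipped with the assignment may differ in details:

```python
from keras.preprocessing.image import ImageDataGenerator

def get_train_generator(df, image_dir, x_col, y_cols, shuffle=True, batch_size=8,
                        seed=1, target_w=320, target_h=320):
    # normalize each image so that its mean is 0 and its standard deviation is 1
    image_generator = ImageDataGenerator(
        samplewise_center=True,
        samplewise_std_normalization=True)

    # read the images listed in the dataframe, resize them to 320x320, and
    # return them as 3-channel (RGB) arrays with their multi-hot label vectors
    generator = image_generator.flow_from_dataframe(
        dataframe=df,
        directory=image_dir,
        x_col=x_col,
        y_col=y_cols,
        class_mode="raw",
        batch_size=batch_size,
        shuffle=shuffle,
        seed=seed,
        target_size=(target_w, target_h))
    return generator
```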
Build a separate generator for valid and test sets
Now we need to build a new generator for validation and testing data.
Why can’t we use the same generator as for the training data?
Look back at the generator we wrote for the training data.
- It normalizes each image per batch, meaning that it uses batch statistics.
- We should not do this with the test and validation data, since in a real life scenario we don’t process incoming images a batch at a time (we process one image at a time).
- Knowing the average per batch of test data would effectively give our model an advantage.
- The model should not have any information about the test data.
What we need to do is normalize incoming test data using the statistics computed from the training set.
- We implement this in the function below.
- There is one technical note. Ideally, we would want to compute our sample mean and standard deviation using the entire training set.
- However, since this is extremely large, that would be very time consuming.
- In the interest of time, we’ll take a random sample of the dataset and calculate the sample mean and sample standard deviation.
```python
def get_test_and_valid_generator(valid_df, test_df, train_df, image_dir, x_col, y_cols, sample_size=100, batch_size=8, seed=1, target_w = 320, target_h = 320):
    ...  # body provided with the assignment; see the sketch below
```
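Again as a sketch (the provided version may differ): fit a featurewise normalizer on a random sample of training images, then reuse it for both the validation and test generators.

```python
from keras.preprocessing.image import ImageDataGenerator

def get_test_and_valid_generator(valid_df, test_df, train_df, image_dir, x_col, y_cols,
                                 sample_size=100, batch_size=8, seed=1,
                                 target_w=320, target_h=320):
    # draw a sample of training images to estimate the training mean and std
    raw_train_generator = ImageDataGenerator().flow_from_dataframe(
        dataframe=train_df, directory=image_dir, x_col=x_col, y_col=y_cols,
        class_mode="raw", batch_size=sample_size, shuffle=True,
        target_size=(target_w, target_h))
    data_sample = raw_train_generator.next()[0]

    # normalize valid/test images with statistics computed from the training sample
    image_generator = ImageDataGenerator(featurewise_center=True,
                                         featurewise_std_normalization=True)
    image_generator.fit(data_sample)

    valid_generator = image_generator.flow_from_dataframe(
        dataframe=valid_df, directory=image_dir, x_col=x_col, y_col=y_cols,
        class_mode="raw", batch_size=batch_size, shuffle=False, seed=seed,
        target_size=(target_w, target_h))
    test_generator = image_generator.flow_from_dataframe(
        dataframe=test_df, directory=image_dir, x_col=x_col, y_col=y_cols,
        class_mode="raw", batch_size=batch_size, shuffle=False, seed=seed,
        target_size=(target_w, target_h))
    return valid_generator, test_generator
```

The next cell then uses these two helpers to build the train, validation, and test generators.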
With our generator function ready, let’s make one generator for our training data and one each for our test and validation datasets.
```python
IMAGE_DIR = "nih/images-small/"
```
getting train generator...
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/image/dataframe_iterator.py:273: UserWarning: Found 866 invalid image filename(s) in x_col="Image". These filename(s) will be ignored.
.format(n_invalid, x_col)
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/image/dataframe_iterator.py:273: UserWarning: Found 866 invalid image filename(s) in x_col="Image". These filename(s) will be ignored.
.format(n_invalid, x_col)
Found 9 validated image filenames.
getting train and valid generators...
Found 9 validated image filenames.
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/image/dataframe_iterator.py:273: UserWarning: Found 108 invalid image filename(s) in x_col="Image". These filename(s) will be ignored.
.format(n_invalid, x_col)
Found 1 validated image filenames.
Found 420 validated image filenames.
Let’s peek into what the generator gives our model during training and validation. We can do this by calling the `__getitem__(index)` function:
```python
x, y = train_generator.__getitem__(0)
plt.imshow(x[0]);
```
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
3. Model Development
Now we’ll move on to model training and development. We have a few practical challenges to deal with before actually training a neural network, though. The first is class imbalance.
3.1 Addressing Class Imbalance
One of the challenges with working with medical diagnostic datasets is the large class imbalance present in such datasets. Let’s plot the frequency of each of the labels in our dataset:
```python
plt.xticks(rotation=90)
plt.bar(x=labels, height=np.mean(train_generator.labels, axis=0))  # fraction of positive labels per pathology
```
We can see from this plot that the prevalence of positive cases varies significantly across the different pathologies. (These trends mirror the ones in the full dataset as well.)
- The `Hernia` pathology has the greatest imbalance, with the proportion of positive training cases being about 0.2%.
- But even the `Infiltration` pathology, which has the least amount of imbalance, has only 17.5% of the training cases labelled positive.
Ideally, we would train our model using an evenly balanced dataset so that the positive and negative training cases would contribute equally to the loss.
If we use a normal cross-entropy loss function with a highly unbalanced dataset, as we are seeing here, then the algorithm will be incentivized to prioritize the majority class (i.e. negative, in our case), since it contributes more to the loss.
Impact of class imbalance on loss function
Let’s take a closer look at this. Assume we would have used a normal cross-entropy loss for each pathology. We recall that the cross-entropy loss contribution from the $i^{th}$ training data case is:

$$\mathcal{L}_{cross-entropy}(x_i) = -(y_i \log(f(x_i)) + (1-y_i)\log(1-f(x_i))),$$
where $x_i$ and $y_i$ are the input features and the label, and $f(x_i)$ is the output of the model, i.e. the probability that it is positive.
Note that for any training case, either $y_i=0$ or else $(1-y_i)=0$, so only one of these terms contributes to the loss (the other term is multiplied by zero, and becomes zero).
We can rewrite the overall average cross-entropy loss over the entire training set $\mathcal{D}$ of size $N$ as follows:

$$\mathcal{L}_{cross-entropy}(\mathcal{D}) = -\frac{1}{N}\left(\sum_{\text{positive examples}} \log(f(x_i)) + \sum_{\text{negative examples}} \log(1-f(x_i))\right).$$
Using this formulation, we can see that if there is a large imbalance with very few positive training cases, for example, then the loss will be dominated by the negative class. Summing the contribution over all the training cases for each class (i.e. pathological condition), we see that the contribution of each class (i.e. positive or negative) is:

$$freq_{p} = \frac{\text{number of positive examples}}{N}, \qquad freq_{n} = \frac{\text{number of negative examples}}{N}.$$
Exercise 2 - Computing Class Frequencies
Complete the function below to calculate these frequencies for each label in our dataset.
- Use `numpy.sum(a, axis=)`, and choose the axis (0 or 1).
```python
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
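A minimal sketch of a solution (summing over axis 0 sums over the examples, giving one frequency per class):

```python
import numpy as np

def compute_class_freqs(labels):
    """labels: matrix of 0/1 labels with shape (num_examples, num_classes)."""
    N = labels.shape[0]
    positive_frequencies = np.sum(labels, axis=0) / N   # fraction of positive examples per class
    negative_frequencies = 1 - positive_frequencies     # fraction of negative examples per class
    return positive_frequencies, negative_frequencies
```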
```python
# Test
```
labels:
[[1 0 0]
[0 1 1]
[1 0 1]
[1 1 1]
[1 0 1]]
pos freqs: [0.8 0.4 0.8]
neg freqs: [0.2 0.6 0.2]
Expected output
```
labels:
```
Now we’ll compute frequencies for our training data.
```python
freq_pos, freq_neg = compute_class_freqs(train_generator.labels)
freq_pos
```
array([0. , 0.11111111, 0.22222222, 0. , 0.22222222,
0.11111111, 0. , 0.11111111, 0. , 0. ,
0. , 0. , 0. , 0. ])
Let’s visualize these two contribution ratios next to each other for each of the pathologies:
```python
data = pd.DataFrame({"Class": labels, "Label": "Positive", "Value": freq_pos})
```
As we see in the above plot, the contributions of positive cases is significantly lower than that of the negative ones. However, we want the contributions to be equal. One way of doing this is by multiplying each example from each class by a class-specific weight factor, $w_{pos}$ and $w_{neg}$, so that the overall contribution of each class is the same.
To have this, we want

$$w_{pos} \times freq_{p} = w_{neg} \times freq_{n},$$
which we can do simply by taking

$$w_{pos} = freq_{neg} \quad \text{and} \quad w_{neg} = freq_{pos}.$$
This way, we will be balancing the contribution of positive and negative labels.
```python
pos_weights, neg_weights = freq_neg, freq_pos
pos_contribution, neg_contribution = freq_pos * pos_weights, freq_neg * neg_weights
```
Let’s verify this by graphing the two contributions next to each other again:
```python
data = pd.DataFrame({"Class": labels, "Label": "Positive", "Value": pos_contribution})
```
As the above figure shows, by applying these weightings the positive and negative labels within each class would have the same aggregate contribution to the loss function. Now let’s implement such a loss function.
After computing the weights, our final weighted loss for each training case will be

$$\mathcal{L}_{cross-entropy}^{w}(x) = -(w_{pos}\, y \log(f(x)) + w_{neg}\,(1-y)\log(1-f(x))).$$
Exercise 3 - Weighted Loss
Fill out the `weighted_loss` function below to return a loss function that calculates the weighted loss for each batch. Recall that for the multi-class loss, we add up the average loss for each individual class. Note that we also want to add a small value, $\epsilon$, to the predicted values before taking their logs. This is simply to avoid a numerical error that would otherwise occur if the predicted value happens to be zero.
Note
Please use Keras functions to calculate the mean and the log.
```python
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
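One way to implement this with the Keras backend, as a sketch (the graded solution may be structured differently):

```python
import keras.backend as K

def get_weighted_loss(pos_weights, neg_weights, epsilon=1e-7):
    def weighted_loss(y_true, y_pred):
        loss = 0.0
        for i in range(len(pos_weights)):
            # average weighted loss for class i, added to the running total
            loss += K.mean(
                -(pos_weights[i] * y_true[:, i] * K.log(y_pred[:, i] + epsilon)
                  + neg_weights[i] * (1 - y_true[:, i]) * K.log(1 - y_pred[:, i] + epsilon)))
        return loss
    return weighted_loss
```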
Now let’s test our function with some simple cases.
```python
# Test
```
Test example:
y_true:
[[1. 1. 1.]
[1. 1. 0.]
[0. 1. 0.]
[1. 0. 1.]]
w_p:
[0.25 0.25 0.5 ]
w_n:
[0.75 0.75 0.5 ]
y_pred_1:
[[0.7 0.7 0.7]
[0.7 0.7 0.7]
[0.7 0.7 0.7]
[0.7 0.7 0.7]]
y_pred_2:
[[0.3 0.3 0.3]
[0.3 0.3 0.3]
[0.3 0.3 0.3]
[0.3 0.3 0.3]]
If we weighted them correctly, we expect the two losses to be the same.
L(y_pred_1)= -0.4956, L(y_pred_2)= -0.4956
Difference is L1 - L2 = 0.0000
Additional check
If you implemented the function correctly, then with the epsilon for `get_weighted_loss` set to `1`, the weighted losses will be as follows:

```
L(y_pred_1)= -0.4956, L(y_pred_2)= -0.4956
```
If you are missing something in your implementation, you will see a different set of losses for L1 and L2 (even though L1 and L2 will be the same).
3.2 DenseNet121
Next, we will use a pre-trained DenseNet121 model which we can load directly from Keras and then add two layers on top of it:
- A `GlobalAveragePooling2D` layer to get the average of the last convolution layers from DenseNet121.
- A `Dense` layer with `sigmoid` activation to get the prediction probability for each of our classes.

We can set our custom loss function for the model by specifying the `loss` parameter in the `compile()` function.
```python
# create the base pre-trained model
```
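A sketch of what the truncated cell above does; the path to the provided DenseNet weights file is an assumption for illustration:

```python
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# create the base pre-trained model (the weights file path below is assumed, not confirmed)
base_model = DenseNet121(weights='./nih/densenet.hdf5', include_top=False)

x = base_model.output
x = GlobalAveragePooling2D()(x)                             # average the final convolutional feature maps
predictions = Dense(len(labels), activation="sigmoid")(x)   # one probability per pathology

model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer='adam', loss=get_weighted_loss(pos_weights, neg_weights))
```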
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4070: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4074: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
4. Training [optional]
With our model ready for training, we will use the `model.fit_generator()` function in Keras to train our model.
- We are training on a small subset of the dataset (~1%).
- So what we care about at this point is to make sure that the loss on the training set is decreasing.
Since training can take a considerable time, for pedagogical purposes we have chosen not to train the model here but rather to load a set of pre-trained weights in the next section. However, you can use the code shown below to practice training the model locally on your machine or in Colab.
NOTE: Do not run the code below on the Coursera platform as it will exceed the platform’s memory limitations.
Python Code for training the model:
```python
history = model.fit_generator(train_generator,
                              validation_data=valid_generator,
                              epochs=3)  # epoch count here is illustrative; tune it for your own run
```
4.1 Training on the Larger Dataset
Given that the original dataset is 40GB+ in size and the training process on the full dataset takes a few hours, we have trained the model on a GPU-equipped machine for you and provided the weights file from our model (with a batch size of 32 instead) to be used for the rest of this assignment.
The model architecture for our pre-trained model is exactly the same, but we used a few useful Keras “callbacks” for this training (a short usage sketch follows the list below). Do spend time to read about these callbacks at your leisure, as they will be very useful for managing long-running training sessions:
- You can use the `ModelCheckpoint` callback to monitor your model’s `val_loss` metric and keep a snapshot of your model at that point.
- You can use the `TensorBoard` callback to use the TensorFlow TensorBoard utility to monitor your runs in real-time.
- You can use the `ReduceLROnPlateau` callback to slowly decay the learning rate for your model as it stops getting better on a metric such as `val_loss`, to fine-tune the model in the final steps of training.
- You can use the `EarlyStopping` callback to stop the training job when your model stops getting better in its validation loss. You can set a `patience` value, which is the number of epochs with no improvement after which training is terminated. This callback can also conveniently restore the weights for the best metric at the end of training to your model.
You can read about these callbacks and other useful Keras callbacks here.
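As a rough illustration of how these callbacks are wired together, here is a sketch (the settings below are placeholders, not the values used to produce the provided weights):

```python
from keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau, EarlyStopping

callbacks = [
    ModelCheckpoint("best_weights.h5", monitor="val_loss", save_best_only=True),  # snapshot the best model
    TensorBoard(log_dir="./logs"),                                                # monitor runs in real time
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),                # decay LR when val_loss stalls
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),     # stop early, keep best weights
]

# the list is then passed to training, e.g.
# history = model.fit_generator(train_generator, validation_data=valid_generator,
#                               epochs=10, callbacks=callbacks)
```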
Let’s load our pre-trained weights into the model now:
```python
model.load_weights("./nih/pretrained_model.h5")
```
5. Prediction and Evaluation
Now that we have a model, let’s evaluate it using our test set. We can conveniently use the `predict_generator` function to generate the predictions for the images in our test set.
Note: The following cell can take about 4 minutes to run.
```python
predicted_vals = model.predict_generator(test_generator, steps = len(test_generator))
```
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
5.1 ROC Curve and AUROC
We’ll cover the topic of model evaluation in much more detail in later weeks, but for now we’ll walk through computing a metric called the AUC (Area Under the Curve) from the ROC (Receiver Operating Characteristic) curve. This is also referred to as the AUROC value; you will see all three terms used in reference to this technique, often almost interchangeably.
For now, what you need to know in order to interpret the plot is that a curve that is more to the left and the top has more “area” under it, and indicates that the model is performing better.
We will use the `util.get_roc_curve()` function which has been provided for you in `util.py`. Look through this function and note the use of the `sklearn` library functions to generate the ROC curves and AUROC values for our model.
```python
auc_rocs = util.get_roc_curve(labels, predicted_vals, test_generator)
```
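Internally, a per-pathology ROC/AUROC computation with `sklearn` looks roughly like the following simplified sketch (an approximation, not the provided `util.py` code):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_true = test_generator.labels    # ground-truth 0/1 labels, shape (num_images, num_classes)
for i, label in enumerate(labels):
    auc = roc_auc_score(y_true[:, i], predicted_vals[:, i])
    fpr, tpr, _ = roc_curve(y_true[:, i], predicted_vals[:, i])
    plt.plot(fpr, tpr, label="{} (AUC {:.3f})".format(label, auc))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```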
You can compare the performance to the AUCs reported in the original ChexNeXt paper in the table below:
For reference, here’s the AUC figure from the ChexNeXt paper which includes AUC values for their model as well as radiologists on this dataset:
The ChexNeXt method also takes advantage of a few other tricks, such as self-training and ensembling, which can give a significant boost to performance.
For details about the best performing methods and their performance on this dataset, we encourage you to read the following papers:
5.2 Visualizing Learning with GradCAM
One of the challenges of using deep learning in medicine is that the complex architecture used for neural networks makes them much harder to interpret compared to traditional machine learning models (e.g. linear models).
One of the most common approaches aimed at increasing the interpretability of models for computer vision tasks is to use Class Activation Maps (CAM).
- Class activation maps are useful for understanding where the model is “looking” when classifying an image.
In this section we will use the GradCAM technique to produce heatmaps highlighting the important regions in the image for predicting the pathological condition.
- This is done by extracting the gradients of each predicted class flowing into our model’s final convolutional layer. Look at the `util.compute_gradcam` function, which has been provided for you in `util.py`, to see how this is done with the Keras framework.
It is worth mentioning that GradCAM does not provide a full explanation of the reasoning for each classification probability.
- However, it is still a useful tool for “debugging” our model and augmenting our prediction so that an expert could validate that a prediction is indeed due to the model focusing on the right regions of the image.
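To make the idea concrete, here is a rough sketch of how such a heatmap can be computed with the Keras backend. The layer name `"bn"` (DenseNet121's final batch-normalization layer) and the use of `cv2` for resizing are assumptions for illustration; `util.compute_gradcam` may be implemented differently.

```python
import numpy as np
import keras.backend as K
import cv2  # used here only to resize the coarse heatmap to the input image size

def grad_cam_sketch(model, image, class_index, layer_name="bn"):
    """Weight the final conv feature maps by the pooled gradients of the class score."""
    class_score = model.output[0, class_index]        # predicted probability for one pathology
    conv_output = model.get_layer(layer_name).output  # final convolutional feature maps
    grads = K.gradients(class_score, conv_output)[0]  # gradients of the score w.r.t. those maps
    compute = K.function([model.input], [conv_output, grads])

    conv_val, grads_val = compute([image[np.newaxis, ...]])
    conv_val, grads_val = conv_val[0], grads_val[0]

    weights = grads_val.mean(axis=(0, 1))             # global-average-pool the gradients per channel
    cam = np.maximum(np.dot(conv_val, weights), 0)    # weighted sum of feature maps, then ReLU
    cam = cv2.resize(cam, (image.shape[1], image.shape[0]))
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1] for overlaying
```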
First we will load the small training set and set up to look at the 4 classes with the highest performing AUC measures.
```python
df = pd.read_csv("nih/train-small.csv")
labels_to_show = np.take(labels, np.argsort(auc_rocs)[::-1])[:4]  # the 4 classes with the highest AUC
```
Now let’s look at a few specific images.
```python
util.compute_gradcam(model, '00008270_015.png', IMAGE_DIR, df, labels, labels_to_show)
```
Loading original image
Generating gradcam for class Cardiomegaly
Generating gradcam for class Mass
Generating gradcam for class Pneumothorax
Generating gradcam for class Edema
```python
util.compute_gradcam(model, '00011355_002.png', IMAGE_DIR, df, labels, labels_to_show)
```
Loading original image
Generating gradcam for class Cardiomegaly
Generating gradcam for class Mass
Generating gradcam for class Pneumothorax
Generating gradcam for class Edema
```python
util.compute_gradcam(model, '00029855_001.png', IMAGE_DIR, df, labels, labels_to_show)
```
Loading original image
Generating gradcam for class Cardiomegaly
Generating gradcam for class Mass
Generating gradcam for class Pneumothorax
Generating gradcam for class Edema
```python
util.compute_gradcam(model, '00005410_000.png', IMAGE_DIR, df, labels, labels_to_show)
```
Loading original image
Generating gradcam for class Cardiomegaly
Generating gradcam for class Mass
Generating gradcam for class Pneumothorax
Generating gradcam for class Edema
Congratulations, you’ve completed the first assignment of course one! You’ve learned how to preprocess data, check for data leakage, train a pre-trained model, and evaluate using the AUC. Great work!