Risk Models Using Tree-based Models
Welcome to the second assignment of Course 2!
Outline
- 1. Import Packages
- 2. Load the Dataset
- 3. Explore the Dataset
- 4. Dealing with Missing Data
- 5. Decision Trees
- 6. Random Forests
- 7. Imputation
- 8. Error Analysis
- 9. Imputation Approaches
- 10. Comparison
- 11. Explanations: SHAP
In this assignment, you'll gain experience with tree-based models by predicting the 10-year risk of death of individuals from the NHANES I epidemiology dataset (for a detailed description of this dataset, see the CDC Website). This is a challenging task and a great test bed for the machine learning methods we learned this week.
As you go through the assignment, you’ll learn about:
- Dealing with Missing Data
  - Complete Case Analysis
  - Imputation
- Decision Trees
  - Evaluation
  - Regularization
- Random Forests
  - Hyperparameter Tuning
1. Import Packages
We’ll first import all the common packages that we need for this assignment.
- `shap` is a library that explains predictions made by machine learning models.
- `sklearn` is one of the most popular machine learning libraries.
- `itertools` allows us to conveniently manipulate iterable objects such as lists.
- `pydotplus` is used together with `IPython.display.Image` to visualize graph structures such as decision trees.
- `numpy` is a fundamental package for scientific computing in Python.
- `pandas` is what we'll use to manipulate our data.
- `seaborn` is a plotting library which has some convenient functions for visualizing missing data.
- `matplotlib` is a plotting library.
```python
import shap
```
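For reference, a consolidated import cell consistent with the list above might look like the following sketch (`load_data` and `cindex` are helpers provided with the assignment, whose import path is not shown here):

```python
import itertools
import shap
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pydotplus

from io import StringIO
from IPython.display import Image

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: enables IterativeImputer
from sklearn.impute import SimpleImputer, IterativeImputer
```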
2. Load the Dataset
Run the next cell to load in the NHANES I epidemiology dataset. This dataset contains various features of hospital patients as well as their outcomes, i.e. whether or not they died within 10 years.
```python
X_dev, X_test, y_dev, y_test = load_data(10)
```
The dataset has been split into a development set (or dev set), which we will use to develop our risk models, and a test set, which we will use to test our models.
We further split the dev set into a training set and a validation set, used respectively to train and tune our models, with a 75/25 split (note that we set a random state to make this split reproducible).
```python
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=10)
```
3. Explore the Dataset
The first step is to familiarize yourself with the data. Run the next cell to get the size of your training set and look at a small sample.
```python
print("X_train shape: {}".format(X_train.shape))
X_train.head()
```
X_train shape: (5147, 18)
| | Age | Diastolic BP | Poverty index | Race | Red blood cells | Sedimentation rate | Serum Albumin | Serum Cholesterol | Serum Iron | Serum Magnesium | Serum Protein | Sex | Systolic BP | TIBC | TS | White blood cells | BMI | Pulse pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1599 | 43.0 | 84.0 | 637.0 | 1.0 | 49.3 | 10.0 | 5.0 | 253.0 | 134.0 | 1.59 | 7.7 | 1.0 | NaN | 490.0 | 27.3 | 9.1 | 25.803007 | 34.0 |
| 2794 | 72.0 | 96.0 | 154.0 | 2.0 | 43.4 | 23.0 | 4.3 | 265.0 | 106.0 | 1.66 | 6.8 | 2.0 | 208.0 | 301.0 | 35.2 | 6.0 | 33.394319 | 112.0 |
| 1182 | 54.0 | 78.0 | 205.0 | 1.0 | 43.8 | 12.0 | 4.2 | 206.0 | 180.0 | 1.67 | 6.6 | 2.0 | NaN | 363.0 | 49.6 | 5.9 | 20.278410 | 34.0 |
| 6915 | 59.0 | 90.0 | 417.0 | 1.0 | 43.4 | 9.0 | 4.5 | 327.0 | 114.0 | 1.65 | 7.6 | 2.0 | NaN | 347.0 | 32.9 | 6.1 | 32.917744 | 78.0 |
| 500 | 34.0 | 80.0 | 385.0 | 1.0 | 77.7 | 9.0 | 4.1 | 197.0 | 64.0 | 1.74 | 7.3 | 2.0 | NaN | 376.0 | 17.0 | 8.2 | 30.743489 | 30.0 |
Our targets `y` indicate whether or not each patient died within 10 years. Run the next cell to see the target data series.
```python
y_train.head(20)
```
1599 False
2794 True
1182 False
6915 False
500 False
1188 True
9739 False
3266 False
6681 False
8822 False
5856 True
3415 False
9366 False
7975 False
1397 False
6809 False
9461 False
9374 False
1170 True
158 False
Name: time, dtype: bool
Use the next cell to examine individual cases and familiarize yourself with the features.
```python
i = 10
```
Age 67.000000
Diastolic BP 94.000000
Poverty index 114.000000
Race 1.000000
Red blood cells 43.800000
Sedimentation rate 12.000000
Serum Albumin 3.700000
Serum Cholesterol 178.000000
Serum Iron 73.000000
Serum Magnesium 1.850000
Serum Protein 7.000000
Sex 1.000000
Systolic BP 140.000000
TIBC 311.000000
TS 23.500000
White blood cells 4.300000
BMI 17.481227
Pulse pressure 46.000000
Name: 5856, dtype: float64
Died within 10 years? True
4. Dealing with Missing Data
Looking at our data in `X_train`, we see that some of the data is missing: some values in the output of the previous cell are marked as `NaN` ("not a number").
Missing data is a common occurrence in data analysis. It can be due to a variety of reasons, such as measuring instrument malfunction, respondents not being willing or able to supply information, and errors in the data collection process.
Let's examine the missing data pattern. `seaborn` is an alternative to `matplotlib` that has some convenient plotting functions for data analysis. We can use its `heatmap` function to easily visualize the missing data pattern.
Run the cell below to plot the missing data:
```python
sns.heatmap(X_train.isnull(), cbar=False)
```
For each feature, represented as a column, values that are present are shown in black, and missing values are shown in a light color.
From this plot, we can see that many values are missing for systolic blood pressure (`Systolic BP`).
Exercise 1
In the cell below, write a function to compute the fraction of cases with missing data. This will help us decide how we handle this missing data in the future.
- The `pandas.DataFrame.isnull()` method is helpful in this case.
- Use the `pandas.DataFrame.any()` method and set the `axis` parameter.
- Divide the total number of rows with missing data by the total number of rows. Remember that in Python, `True` values are equal to 1.
```python
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
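If you get stuck, here is a minimal sketch consistent with the hints above (not necessarily the graded solution):

```python
def fraction_rows_missing(df):
    '''Return the fraction of rows in df that contain at least one missing value.'''
    # isnull() flags missing entries; any(axis=1) marks rows with at least one NaN.
    # Since True counts as 1, the mean of that boolean Series is the fraction we want.
    return df.isnull().any(axis=1).mean()
```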
Test your function by running the cell below.
```python
df_test = pd.DataFrame({'a':[None, 1, 1, None], 'b':[1, None, 0, 1]})
```
Example dataframe:
a b
0 NaN 1.0
1 1.0 NaN
2 1.0 0.0
3 NaN 1.0
Computed fraction missing: 0.75, expected: 0.75
Fraction of rows missing from X_train: 0.699
Fraction of rows missing from X_val: 0.704
Fraction of rows missing from X_test: 0.000
We see that our train and validation sets have missing values, but luckily our test set has complete cases.
As a first pass, we will begin with a complete case analysis, dropping all of the rows with any missing data. Run the following cell to drop these rows from our train and validation sets.
```python
X_train_dropped = X_train.dropna(axis='rows')
```
5. Decision Trees
Having just learned about decision trees, you choose to use a decision tree classifier. Use scikit-learn to build a decision tree for the hospital dataset using the train set.
```python
dt = DecisionTreeClassifier(max_depth=None, random_state=10)
dt.fit(X_train_dropped, y_train_dropped)
```
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=10, splitter='best')
Next we will evaluate our model. We’ll use C-Index for evaluation.
Remember from lesson 4 of week 1 that the C-Index evaluates the ability of a model to differentiate between different classes, by quantifying how often, when considering all pairs of patients (A, B), the model says that patient A has a higher risk score than patient B when, in the observed data, patient A actually died and patient B actually lived. In our case, our model is a binary classifier, where each risk score is either 1 (the model predicts that the patient will die) or 0 (the patient will live).
More formally, defining _permissible pairs_ of patients as pairs where the outcomes are different, _concordant pairs_ as permissible pairs where the patient that died had a higher risk score (i.e. our model predicted 1 for the patient that died and 0 for the one that lived), and _ties_ as permissible pairs where the risk scores were equal (i.e. our model predicted 1 for both patients or 0 for both patients), the C-Index is equal to:

$$\text{C-Index} = \frac{\#\text{concordant pairs} + 0.5 \times \#\text{ties}}{\#\text{permissible pairs}}$$
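The assignment provides a `cindex(y_true, scores)` implementation for you; a direct, if slow (O(n²)), sketch consistent with this definition would be:

```python
def cindex(y_true, scores):
    '''Fraction of permissible pairs that are concordant, counting ties as half.'''
    concordant, permissible, ties = 0, 0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] != y_true[j]:  # permissible pair: one died, one lived
                permissible += 1
                if scores[i] == scores[j]:  # tied risk scores
                    ties += 1
                # concordant: the patient who died received the higher risk score
                elif (y_true[i] and scores[i] > scores[j]) or (y_true[j] and scores[j] > scores[i]):
                    concordant += 1
    return (concordant + 0.5 * ties) / permissible
```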
Run the next cell to compute the C-Index on the train and validation set (we’ve given you an implementation this time).
```python
y_train_preds = dt.predict_proba(X_train_dropped)[:, 1]
```
Train C-Index: 1.0
Val C-Index: 0.5629321808510638
Unfortunately your tree seems to be overfitting: it fits the training data so closely that it doesn’t generalize well to other samples such as those from the validation set.
The training C-Index comes out to 1.0 because, when initializing `DecisionTreeClassifier`, we left `max_depth` and `min_samples_split` unspecified. The resulting decision tree will therefore keep splitting as far as it can, which pretty much guarantees a pure fit to the training data.
To handle this, you can change some of the hyperparameters of the tree.
Exercise 2
Try to find a set of hyperparameters that improves generalization to the validation set and recompute the C-Index. If you do it right, you should get a C-Index above 0.6 for the validation set.
You can refer to the documentation for the sklearn DecisionTreeClassifier.
- Try limiting the depth of the tree (`max_depth`).
```python
# Experiment with different hyperparameters for the DecisionTreeClassifier
```
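For example, limiting the tree to a `max_depth` of 3 reproduces the C-Indices reported below (the variable name `dt_hyperparams` is illustrative; any regularizing combination that clears 0.6 works):

```python
# A shallow tree regularizes the fit; max_depth=3 yields roughly the
# train/val C-Indices shown in the expected output below
dt_hyperparams = {'max_depth': 3}
```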
Run the next cell to fit and evaluate the regularized tree.
```python
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
Train C-Index: 0.688738755448391
Val C-Index (expected > 0.6): 0.6302692819148936
If you used a low `max_depth`, you can print the entire tree, which allows for easy interpretability. Run the next cell to print the tree splits.
```python
dot_data = StringIO()
```
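A sketch of the usual `pydotplus` pattern for the rest of this cell, assuming the regularized tree from the previous exercise is named `dt_reg`:

```python
# Export the fitted tree to DOT format, then render it inline as a PNG
export_graphviz(dt_reg, feature_names=X_train_dropped.columns,
                out_file=dot_data, filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```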
Overfitting, underfitting, and the bias-variance tradeoff
If you tested several values of `max_depth`, you may have seen that a value of 3 gives training and validation C-Indices of about 0.689 and 0.630, and that a `max_depth` of 2 gives better agreement, with values of about 0.653 and 0.607. In the latter case, we have further reduced overfitting, at the cost of a minor loss in predictive performance.

Contrast this with a `max_depth` value of 1, which results in C-Indices of about 0.597 for the training set and 0.598 for the validation set: we have eliminated overfitting, but with a much stronger degradation of predictive performance.

Lower predictive performance on the training and validation sets is indicative of the model _underfitting_ the data: it neither learns enough from the training data nor is able to generalize to unseen data (the validation data in our case).

Finding a model that minimizes and acceptably balances underfitting and overfitting (e.g. selecting the model with a `max_depth` of 2 over the other values) is a common problem in machine learning that is known as the _bias-variance tradeoff_.
6. Random Forests
No matter how you choose hyperparameters, a single decision tree is prone to overfitting. To solve this problem, you can try random forests, which combine predictions from many different trees to create a robust classifier.
As before, we will use scikit-learn to build a random forest for the data. We will use the default hyperparameters.
```python
rf = RandomForestClassifier(n_estimators=100, random_state=10)
rf.fit(X_train_dropped, y_train_dropped)
```
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=10, verbose=0,
warm_start=False)
Now compute and report the C-Index for the random forest on the training and validation set.
```python
y_train_rf_preds = rf.predict_proba(X_train_dropped)[:, 1]
```
Train C-Index: 1.0
Val C-Index: 0.6660488696808511
Training a random forest with the default hyperparameters results in a model that has better predictive performance than the individual decision trees from the previous section, but the model is still overfitting.
We therefore need to tune (or optimize) the hyperparameters, to find a model that both has good predictive performance and minimizes overfitting.
The hyperparameters we choose to adjust will be:

- `n_estimators`: the number of trees used in the forest.
- `max_depth`: the maximum depth of each tree.
- `min_samples_leaf`: the minimum number (if `int`) or proportion (if `float`) of samples in a leaf.
The approach we implement to tune the hyperparameters is known as a grid search:

1. We define a set of possible values for each of the target hyperparameters.
2. A model is trained and evaluated for every possible combination of hyperparameters.
3. The best-performing set of hyperparameters is returned.
The cell below implements a hyperparameter grid search, using the C-Index to evaluate each tested model.
```python
def holdout_grid_search(clf, X_train_hp, y_train_hp, X_val_hp, y_val_hp, hyperparams, fixed_hyperparams={}):
```
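A sketch of how such a holdout grid search can be implemented, using `itertools.product` to enumerate the combinations (the assignment's version may differ in details):

```python
import itertools

def holdout_grid_search(clf, X_train_hp, y_train_hp, X_val_hp, y_val_hp,
                        hyperparams, fixed_hyperparams={}):
    '''Train clf for every hyperparameter combination; keep the best by validation C-Index.'''
    best_estimator, best_hyperparams, best_score = None, {}, 0.0
    names = list(hyperparams.keys())
    combos = list(itertools.product(*hyperparams.values()))
    for i, values in enumerate(combos):
        params = dict(zip(names, values))
        print(f'[{i + 1}/{len(combos)}] {params}')
        estimator = clf(**params, **fixed_hyperparams)
        estimator.fit(X_train_hp, y_train_hp)
        # cindex is the evaluation helper provided with the assignment
        score = cindex(y_val_hp.values, estimator.predict_proba(X_val_hp)[:, 1])
        print(f'Val C-Index: {score}\n')
        if score > best_score:
            best_score, best_estimator = score, estimator
            best_hyperparams = {**params, **fixed_hyperparams}
    return best_estimator, best_hyperparams
```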
Exercise 3
In the cell below, define the values you want to run the hyperparameter grid search on, and run the cell to find the best-performing set of hyperparameters.
Your objective is to get a C-Index above 0.6 on both the train and validation set.
- n_estimators: try values greater than 100
- max_depth: try values in the range 1 to 100
- min_samples_leaf: try float values below .5 and/or int values greater than 2
```python
def random_forest_grid_search(X_train_dropped, y_train_dropped, X_val_dropped, y_val_dropped):
```
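For illustration, a grid of the following shape produces the 27 combinations logged below; the exact values are hypothetical and yours to choose:

```python
# Hypothetical search grid: 3 x 3 x 3 = 27 combinations
hyperparams = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_leaf': [1, 2, 3],
}
fixed_hyperparams = {'random_state': 10}
```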
```python
best_rf, best_hyperparams = random_forest_grid_search(X_train_dropped, y_train_dropped, X_val_dropped, y_val_dropped)
```
[1/27] {'n_estimators': 50, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.6724567819148937
[2/27] {'n_estimators': 50, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.6722240691489362
[3/27] {'n_estimators': 50, 'max_depth': 3, 'min_samples_leaf': 3}
Val C-Index: 0.6725066489361702
[4/27] {'n_estimators': 50, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.6637965425531915
[5/27] {'n_estimators': 50, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.6655585106382979
[6/27] {'n_estimators': 50, 'max_depth': 5, 'min_samples_leaf': 3}
Val C-Index: 0.659624335106383
[7/27] {'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 1}
Val C-Index: 0.6611535904255319
[8/27] {'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 2}
Val C-Index: 0.6647273936170213
[9/27] {'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 3}
Val C-Index: 0.6605884308510638
[10/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.6811502659574468
[11/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.6815159574468085
[12/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 3}
Val C-Index: 0.6809175531914894
[13/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.6765458776595744
[14/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.6750831117021276
[15/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 3}
Val C-Index: 0.6745844414893617
[16/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 1}
Val C-Index: 0.668467420212766
[17/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 2}
Val C-Index: 0.6737699468085107
[18/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 3}
Val C-Index: 0.667436835106383
[19/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.6811502659574468
[20/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.6815159574468085
[21/27] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 3}
Val C-Index: 0.6809175531914894
[22/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.6765458776595744
[23/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.6750831117021276
[24/27] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 3}
Val C-Index: 0.6745844414893617
[25/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 1}
Val C-Index: 0.668467420212766
[26/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 2}
Val C-Index: 0.6737699468085107
[27/27] {'n_estimators': 200, 'max_depth': 10, 'min_samples_leaf': 3}
Val C-Index: 0.667436835106383
Best hyperparameters:
{'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 2, 'random_state': 10}
Train C-Index: 0.7798145228600575
Val C-Index: 0.6815159574468085
Finally, evaluate the model on the test set. This is a crucial step: trying out many combinations of hyperparameters and evaluating them on the validation set could result in a model that ends up overfitting the validation set. We therefore need to check that the model performs well on unseen data, which is the role of the test set that we have held out until now.
```python
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
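For reference, evaluating the tuned forest on the test set reduces to scoring its predicted probabilities with the provided `cindex` helper, roughly:

```python
# Score the tuned random forest on the held-out test set
y_test_best = best_rf.predict_proba(X_test)[:, 1]
print(f"Test C-Index: {cindex(y_test.values, y_test_best)}")
```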
Test C-Index: 0.7013860174676174
Your C-Index on the test set should be greater than 0.6.
7. Imputation
You've now built and optimized a random forest model on our data. However, there was still a drop in test C-Index. This might be because you threw away more than half of the data because of missing values for systolic blood pressure. Instead, we can try filling in, or imputing, these values.
First, let's explore whether our data is missing at random or not. Let's plot histograms of the dropped rows against each of the covariates (aside from systolic blood pressure) to see if there is a trend. Compare these to the histograms of the feature in the entire dataset. Try to see if one of the covariates has a significantly different distribution in the two subsets.
```python
dropped_rows = X_train[X_train.isnull().any(axis=1)]
```
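A sketch of how the comparison histograms can be produced, assuming `seaborn`'s `distplot` (current at the time of this notebook) and the column names from the table above:

```python
# Compare each covariate's distribution in the dropped rows vs. the full training set
for col in [c for c in X_train.columns if c != 'Systolic BP']:
    sns.distplot(X_train.loc[:, col].dropna(), norm_hist=True, kde=False, label='full data')
    sns.distplot(dropped_rows.loc[:, col].dropna(), norm_hist=True, kde=False, label='dropped rows')
    plt.legend()
    plt.show()
```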
Most of the covariates are distributed similarly whether or not we have discarded rows with missing data. In other words, the missingness of the data is independent of these covariates.
If this had been true across all covariates, then the data would have been said to be missing completely at random (MCAR).
But when considering the age covariate, we see that much more data tends to be missing for patients over 65. The reason could be that blood pressure was measured less frequently for older people, to avoid placing an additional burden on them.
As missingness is related to one or more covariates, the missing data is said to be missing at random (MAR).
Based on the information we have, there is however no reason to believe that the _values_ of the missing data — or specifically the values of the missing systolic blood pressures — are related to the age of the patients.
If this was the case, then this data would be said to be missing not at random (MNAR).
8. Error Analysis
Exercise 4
Using the information from the plots above, try to find a subgroup of the test data on which the model performs poorly. You should be able to easily find a subgroup of at least 250 cases on which the model has a C-Index of less than 0.69.
- Define a mask using a feature and a threshold, e.g. patients with a BMI below 20: `mask = X_test['BMI'] < 20`.
- Try to find a subgroup for which the model had little data.
```python
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
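For instance, a hypothetical mask motivated by the age-related missingness observed earlier:

```python
# Hypothetical: older patients were underrepresented after dropping incomplete
# cases, so the model may generalize worse to them
mask = X_test['Age'] > 65
```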
Test Your Work
```python
performance, subgroup_size = bad_subset(best_rf, X_test, y_test)
```
Subgroup size should be greater than 250, and performance should be less than 0.69.
Subgroup size: 320, C-Index: 0.670638197475522
Expected Output

Note, your actual output will vary depending on the hyperparameters and the mask that you chose.

- Make sure that the C-Index is less than 0.69.

Subgroup size: 586, C-Index: 0.6275

Bonus:

- See if you can get a C-Index as low as 0.53.

Subgroup size: 251, C-Index: 0.5331
9. Imputation Approaches
Seeing that our data is not missing completely at random, we can handle the missing values by replacing them with substituted values based on the other values that we have. This is known as imputation.
The first imputation strategy that we will use is mean substitution: we will replace the missing values for each feature with the mean of the available values. In the next cell, use the `SimpleImputer` from `sklearn` to apply mean imputation to the missing values.
```python
# Impute values using the mean
```
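A sketch using `SimpleImputer` (the variable names are illustrative); note that the imputer is fit on the training data only, so no validation statistics leak into training:

```python
from sklearn.impute import SimpleImputer

# Learn each feature's mean from the training set only
mean_imputer = SimpleImputer(strategy='mean')
mean_imputer.fit(X_train)

# Replace missing values in both splits with the training means
X_train_mean_imputed = pd.DataFrame(mean_imputer.transform(X_train), columns=X_train.columns)
X_val_mean_imputed = pd.DataFrame(mean_imputer.transform(X_val), columns=X_val.columns)
```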
Exercise 5
Now perform a hyperparameter grid search to find the best-performing random forest model, and report results on the test set.
Define the parameter ranges for the hyperparameter search in the next cell, and run the cell.
Target performance
Aim for a test C-Index of at least 0.74.
- n_estimators: try values greater than 100
- max_depth: try values in the range 1 to 100
- min_samples_leaf: try float values below .5 and/or int values greater than 2
```python
# Define ranges for the random forest hyperparameter search
```
```python
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
[1/8] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.7395345453913784
[2/8] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.7397907669057344
[3/8] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.7491450235484942
[4/8] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.7481646505507676
[5/8] {'n_estimators': 500, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.740106701061148
[6/8] {'n_estimators': 500, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.7403585798379725
[7/8] {'n_estimators': 500, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.7496932941618408
[8/8] {'n_estimators': 500, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.7494088448535303
Performance for best hyperparameters:
- Train C-Index: 0.8137
- Val C-Index: 0.7497
- Test C-Index: 0.7819
Expected output
Note, your actual C-Index values will vary depending on the hyperparameters that you choose.

- Try to get a good test C-Index, similar to the numbers below:

Performance for best hyperparameters:
Next, we will apply another imputation strategy, known as multivariate feature imputation, using scikit-learn's `IterativeImputer` class (see the documentation).

With this strategy, for each feature that is missing values, a regression model is trained to predict observed values based on all of the other features, and the missing values are inferred using this model.

As a single iteration across all features may not be enough to impute all missing values, several iterations may be performed, hence the name of the class, `IterativeImputer`.

In the next cell, use `IterativeImputer` to perform multivariate feature imputation.
Note that the first time the cell is run, `imputer.fit(X_train)` may fail with the message `LinAlgError: SVD did not converge`: simply re-run the cell.
```python
# Impute using regression on other covariates
```
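A sketch using `IterativeImputer` (constructor arguments and variable names are illustrative; note the experimental-enable import that scikit-learn requires):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: must precede the next import
from sklearn.impute import IterativeImputer

# Model each feature with missing values as a function of the other features
reg_imputer = IterativeImputer(random_state=0, max_iter=10, sample_posterior=False)
reg_imputer.fit(X_train)

X_train_imputed = pd.DataFrame(reg_imputer.transform(X_train), columns=X_train.columns)
X_val_imputed = pd.DataFrame(reg_imputer.transform(X_val), columns=X_val.columns)
```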
Exercise 6
Perform a hyperparameter grid search to find the best-performing random forest model, and report results on the test set. Define the parameter ranges for the hyperparameter search in the next cell, and run the cell.
Target performance
Try to get a test C-Index of at least 0.74 or higher.
- n_estimators: try values greater than 100
- max_depth: try values in the range 1 to 100
- min_samples_leaf: try float values below .5 and/or int values greater than 2
```python
# Define ranges for the random forest hyperparameter search
```
```python
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
```
[1/18] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.7354751714838482
[2/18] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.7357596207921587
[3/18] {'n_estimators': 200, 'max_depth': 3, 'min_samples_leaf': 3}
Val C-Index: 0.7356792801478268
[4/18] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.745511237919047
[5/18] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.7446752609442414
[6/18] {'n_estimators': 200, 'max_depth': 5, 'min_samples_leaf': 3}
Val C-Index: 0.7453787844243376
[7/18] {'n_estimators': 200, 'max_depth': 7, 'min_samples_leaf': 1}
Val C-Index: 0.7486206379915707
[8/18] {'n_estimators': 200, 'max_depth': 7, 'min_samples_leaf': 2}
Val C-Index: 0.7506139545185099
[9/18] {'n_estimators': 200, 'max_depth': 7, 'min_samples_leaf': 3}
Val C-Index: 0.749875689138162
[10/18] {'n_estimators': 500, 'max_depth': 3, 'min_samples_leaf': 1}
Val C-Index: 0.73679102095588
[11/18] {'n_estimators': 500, 'max_depth': 3, 'min_samples_leaf': 2}
Val C-Index: 0.737142782695928
[12/18] {'n_estimators': 500, 'max_depth': 3, 'min_samples_leaf': 3}
Val C-Index: 0.737118897639505
[13/18] {'n_estimators': 500, 'max_depth': 5, 'min_samples_leaf': 1}
Val C-Index: 0.7460540801104792
[14/18] {'n_estimators': 500, 'max_depth': 5, 'min_samples_leaf': 2}
Val C-Index: 0.7457783162772317
[15/18] {'n_estimators': 500, 'max_depth': 5, 'min_samples_leaf': 3}
Val C-Index: 0.746544809451534
[16/18] {'n_estimators': 500, 'max_depth': 7, 'min_samples_leaf': 1}
Val C-Index: 0.7486162952540393
[17/18] {'n_estimators': 500, 'max_depth': 7, 'min_samples_leaf': 2}
Val C-Index: 0.7508050349698939
[18/18] {'n_estimators': 500, 'max_depth': 7, 'min_samples_leaf': 3}
Val C-Index: 0.7501840235028955
Performance for best hyperparameters:
- Train C-Index: 0.8774
- Val C-Index: 0.7508
- Test C-Index: 0.7834
Expected Output
Note, your actual output will vary depending on the hyperparameters that you chose.

Performance for best hyperparameters:
- Train C-Index: 0.8131
- Val C-Index: 0.7454
- Test C-Index: 0.7797
10. Comparison
For good measure, retest on the subgroup from before to see if your new models do better.
```python
performance, subgroup_size = bad_subset(best_rf, X_test, y_test)
```
C-Index (no imputation): 0.670638197475522
C-Index (mean imputation): 0.6847548267862058
C-Index (multivariate feature imputation): 0.6926978884039164
We should see that avoiding complete case analysis (i.e. analysis only on observations for which there is no missing data) allows our model to generalize a bit better. Remember to examine your missing cases to judge whether they are missing at random or not!
11. Explanations: SHAP
Using a random forest has improved results, but we’ve lost some of the natural interpretability of trees. In this section we’ll try to explain the predictions using slightly more sophisticated techniques.
You choose to apply SHAP (SHapley Additive exPlanations), a cutting-edge method that explains predictions made by black-box machine learning models (i.e. models which are too complex to be understandable by humans as is).
Given a prediction made by a machine learning model, SHAP values explain the prediction by quantifying the additive importance of each feature to the prediction. SHAP values have their roots in cooperative game theory, where Shapley values are used to quantify the contribution of each player to the game.
Although it is computationally expensive to compute SHAP values for general black-box models, in the case of trees and forests there exists a fast polynomial-time algorithm. For more details, see the TreeShap paper.
We’ll use the shap library to do this for our random forest model. Run the next cell to output the most at risk individuals in the test set according to our model.
```python
X_test_risk = X_test.copy(deep=True)
```
| | Age | Diastolic BP | Poverty index | Race | Red blood cells | Sedimentation rate | Serum Albumin | Serum Cholesterol | Serum Iron | Serum Magnesium | Serum Protein | Sex | Systolic BP | TIBC | TS | White blood cells | BMI | Pulse pressure | risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5493 | 67.0 | 80.0 | 30.0 | 1.0 | 77.7 | 59.0 | 3.4 | 231.0 | 36.0 | 1.40 | 6.3 | 1.0 | 170.0 | 202.0 | 17.8 | 8.4 | 17.029470 | 90.0 | 0.689064 |
| 6337 | 69.0 | 80.0 | 233.0 | 1.0 | 77.7 | 48.0 | 4.2 | 159.0 | 87.0 | 1.81 | 6.9 | 1.0 | 146.0 | 291.0 | 29.9 | 15.2 | 17.931276 | 66.0 | 0.639300 |
| 2044 | 74.0 | 80.0 | 83.0 | 1.0 | 47.6 | 19.0 | 4.2 | 205.0 | 72.0 | 1.71 | 6.9 | 1.0 | 180.0 | 310.0 | 23.2 | 10.8 | 20.900101 | 100.0 | 0.582532 |
| 1017 | 65.0 | 98.0 | 16.0 | 1.0 | 49.4 | 30.0 | 3.4 | 124.0 | 129.0 | 1.59 | 7.7 | 1.0 | 184.0 | 293.0 | 44.0 | 5.9 | 30.858853 | 86.0 | 0.573548 |
| 6609 | 72.0 | 90.0 | 75.0 | 1.0 | 29.3 | 59.0 | 3.9 | 216.0 | 64.0 | 1.63 | 7.4 | 2.0 | 182.0 | 322.0 | 19.9 | 9.3 | 22.281793 | 92.0 | 0.565259 |
We can use SHAP values to try and understand the model output on specific individuals using force plots. Run the cell below to see a force plot on the riskiest individual.
```python
explainer = shap.TreeExplainer(rf_imputed)
```
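With the generation of the `shap` library used here, the rest of the cell follows this rough pattern (row selection and plot arguments are assumptions):

```python
riskiest = X_test_risk.index[0]  # assumes X_test_risk is sorted by descending risk

# SHAP values for the positive class (index 1) on the riskiest individual
shap_value = explainer.shap_values(X_test.loc[[riskiest]])[1]

shap.force_plot(explainer.expected_value[1], shap_value,
                features=X_test.loc[[riskiest]], matplotlib=True)
```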
How to read this chart:
- The red sections on the left are features which push the model towards the final prediction in the positive direction (i.e. a higher Age increases the predicted risk).
- The blue sections on the right are features that push the model towards the final prediction in the negative direction (if an increase in a feature leads to a lower risk, it will be shown in blue).
- Note that the exact output of your chart will differ depending on the hyper-parameters that you choose for your model.
We can also use SHAP values to understand the model output in aggregate. Run the next cell to initialize the SHAP values (this may take a few minutes).
```python
shap_values = shap.TreeExplainer(rf_imputed).shap_values(X_test)[1]
```
Run the next cell to see a summary plot of the SHAP values for each feature on each of the test examples. The colors indicate the value of the feature.
```python
shap.summary_plot(shap_values, X_test)
```
Clearly we see that being a woman (`sex = 2.0`, as opposed to men, for whom `sex = 1.0`) has a negative SHAP value, meaning that it reduces the risk of dying within 10 years. High age and high systolic blood pressure have positive SHAP values, and are therefore related to increased mortality.
You can see how features interact using dependence plots. These plot the SHAP value for a given feature for each data point, and color the points using the value of another feature. This lets us begin to explain the variation in SHAP value for a single value of the main feature.
Run the next cell to see the interaction between Age and Sex.
```python
shap.dependence_plot('Age', shap_values, X_test, interaction_index='Sex')
```
We see that while Age > 50 is generally bad (positive SHAP value), being a woman generally reduces the impact of age. This makes sense since we know that women generally live longer than men.
Let’s now look at poverty index and age.
```python
shap.dependence_plot('Poverty index', shap_values, X_test, interaction_index='Age')
```
We see that the impact of the poverty index drops off quickly, and for higher-income individuals, age begins to explain much of the variation in the impact of the poverty index.
Try some other pairs and see what other interesting relationships you can find!
Congratulations!
You have completed the second assignment in Course 2. Along the way you’ve learned to fit decision trees, random forests, and deal with missing data. Now you’re ready to move on to week 3!