Exploratory data analysis
What and Why?
- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
Building intuition about the data
- Get domain knowledge – It helps to understand the problem more deeply
- Check if the data is intuitive – And agrees with domain knowledge
- Understand how the data was generated – It is crucial for setting up a proper validation
- Explore individual features
- Explore pairs and groups
- Clean features up
- Check for leaks!
Exploring anonymized data
Two things to do with anonymized features:
1. Try to decode the features: guess the true meaning of the feature
2. Guess the feature types: each type needs its own preprocessing
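A minimal sketch of how one could start guessing feature types and decoding anonymized columns; `df` here is a toy stand-in for the real feature DataFrame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for an anonymized dataset; replace with the real DataFrame.
df = pd.DataFrame({'f0': np.random.randint(0, 3, 100),
                   'f1': np.random.randn(100) * 0.1 + 5.0})

for col in df.columns:
    s = df[col]
    print(col, s.dtype, 'unique values:', s.nunique())
    print(s.value_counts().head())  # few distinct values -> likely categorical

    if pd.api.types.is_numeric_dtype(s):
        u = np.sort(s.dropna().unique())
        # A constant gap between sorted unique values hints at a scaled/shifted
        # integer feature, which can often be decoded by dividing by the gap.
        print('smallest gaps between unique values:', np.diff(u)[:5])
```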
Visualization
EDA is an art, and visualizations are our art tools!
Tools for individual features exploration
- Histograms:
  plt.hist(x)
- Plot (index versus value):
  plt.plot(x, '.')
- Statistics:
  df.describe()
  x.mean()
  x.var()
- Other tools:
  x.value_counts()
  x.isnull()
Explore feature relations
- Pairs – Scatter plot, scatter matrix, corrplot
  plt.scatter(x1, x2)
  pd.plotting.scatter_matrix(df)
  corr = df.corr()
  plt.matshow(corr)
- Groups – Corrplot + clustering, plot (index vs feature statistics); see the clustering sketch below
  df.mean().plot(style='.')
  df.mean().sort_values().plot(style='.')
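A hedged sketch of the corrplot + clustering idea: reorder the correlation matrix with hierarchical clustering so that groups of correlated features show up as blocks (the `df` here is a toy stand-in):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy numeric DataFrame; replace with the real features.
df = pd.DataFrame(np.random.randn(200, 6), columns=[f'f{i}' for i in range(6)])

corr = df.corr()
# Cluster features on their correlation profiles and reorder the matrix,
# so related feature groups appear as blocks along the diagonal.
order = leaves_list(linkage(corr.values, method='average'))
plt.matshow(corr.iloc[order, order].values)
plt.colorbar()
plt.show()
```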
Dataset cleaning
- Constant features
  train.nunique() == 1
- Duplicated features
  traintest.T.drop_duplicates()
  # For categorical features, compare label-encoded versions of the columns
  for f in categorical_feats:
      traintest[f] = traintest[f].factorize()[0]
  traintest.T.drop_duplicates()
- Duplicated rows (see the sketch after this list)
- Check if same rows have same label
- Find duplicated rows, understand why they are duplicated
- Check if dataset is shuffled
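A minimal sketch of the duplicated-rows checks above; the `train` frame and its `target` column are toy stand-ins:

```python
import pandas as pd

# Toy example; replace with the real train DataFrame and target column.
train = pd.DataFrame({'f0': [1, 1, 2, 2], 'f1': [0, 0, 3, 3],
                      'target': [1, 1, 0, 1]})

feat_cols = [c for c in train.columns if c != 'target']

# Find fully duplicated feature rows
dup_mask = train.duplicated(subset=feat_cols, keep=False)
print(train[dup_mask])

# Check whether identical feature rows always carry the same label
conflicts = train[dup_mask].groupby(feat_cols)['target'].nunique()
print('duplicate groups with conflicting labels:', int((conflicts > 1).sum()))
```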
Validation
- Validation helps us evaluate the quality of the model
- Validation helps us select the model which will perform best on unseen data
- Underfitting refers to not capturing enough patterns in the data
- Generally, overfitting refers to
- capturing noise
- capturing patterns which do not generalize to test data
- In competitions, overfitting refers to
- unexpectedly low model quality on test data, given the validation scores
- There are three main validation strategies:
- Holdout > sklearn.model_selection.ShuffleSplit
- Split train data into two parts: partA and partB.
- Fit the model on partA, predict for partB.
- Use the predictions for partB to estimate model quality. Find hyperparameters that maximize the quality on partB.
- KFold > sklearn.model_selection.KFold
- Split train data into K folds.
- Iterate through each fold: retrain the model on all folds except the current fold, predict for the current fold.
- Use the predictions to calculate the quality on each fold. Find hyperparameters that maximize the average quality across folds. You can also estimate the mean and variance of the loss, which is very helpful for judging the significance of an improvement.
- LOO > sklearn.model_selection.LeaveOneOut
- Iterate over samples: retrain the model on all samples except the current one, predict for the current sample. You will need to retrain the model N times (where N is the number of samples in the dataset).
- In the end you will get LOO predictions for every sample in the train set and can calculate the loss.
- Note that these validation schemes are meant to estimate the quality of the model (a combined sketch follows after this list). Once you have found the right hyperparameters and want to get test predictions, don't forget to retrain your model on all training data.
- Stratification preserves the same target distribution over different folds
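A combined sketch of the schemes above (plus a stratified variant), assuming accuracy as the metric and toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, KFold, LeaveOneOut, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)   # toy data
model = LogisticRegression(max_iter=1000)

# Holdout: a single random partA/partB split
tr, va = next(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))
model.fit(X[tr], y[tr])
print('holdout:', accuracy_score(y[va], model.predict(X[va])))

# KFold: mean and std of fold scores help judge the significance of improvements
scores = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[tr], y[tr])
    scores.append(accuracy_score(y[va], model.predict(X[va])))
print('kfold: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# LOO: N splits of size 1 (expensive, shown here on a small slice only)
Xs, ys = X[:50], y[:50]
loo_scores = [accuracy_score(ys[va], model.fit(Xs[tr], ys[tr]).predict(Xs[va]))
              for tr, va in LeaveOneOut().split(Xs)]
print('loo:', np.mean(loo_scores))

# Stratified KFold preserves the target distribution in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Once hyperparameters are fixed, retrain on ALL training data for test predictions
model.fit(X, y)
```

`skf.split(X, y)` is used the same way as `KFold.split`, but it needs `y` in order to stratify the folds.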
Data split
- In most cases data is split by:
- Row number
- Time
- Id
- Logic of feature generation depends on the data splitting strategy
- Set up your validation to mimic the train/test split of the competition (see the sketch below)
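A minimal sketch of mimicking a time-based train/test split in validation; the `date` column and the 80/20 cut are hypothetical:

```python
import pandas as pd

# Toy data with a hypothetical 'date' column; replace with the real train set.
train = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=100),
                      'target': range(100)})

# If the test set covers the period right after the train set, validate the same
# way: the last part of train (by time) becomes the validation set.
train = train.sort_values('date')
cut = int(len(train) * 0.8)            # assumed 80/20 time-based cut
tr, val = train.iloc[:cut], train.iloc[cut:]
print(tr['date'].max(), '->', val['date'].min())
```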
Validation problems
- If we see a large dispersion of scores at the validation stage, we should do extensive validation
- Average scores from different KFold splits
- Tune model on one split, evaluate score on the other
- If the submission's score does not match the local validation score, we should
- Check if we have too little data in public LB
- Check if we overfitted
- Check if we chose correct splitting strategy
- Check if train/test have different distributions (see the KS-test sketch after this list)
- Expect LB shuffle because of
- Randomness
- Little amount of data
- Different public/private distributions
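One hedged way to check whether train and test feature distributions differ is a two-sample KS test per numeric feature (toy frames shown; replace with the real data):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Toy frames; replace with the real train/test DataFrames.
rng = np.random.default_rng(0)
train = pd.DataFrame({'f0': rng.normal(0, 1, 500), 'f1': rng.normal(0, 1, 500)})
test = pd.DataFrame({'f0': rng.normal(0, 1, 500), 'f1': rng.normal(1, 1, 500)})

for col in train.columns:
    # Small p-value -> the feature is distributed differently in train and test
    stat, p = ks_2samp(train[col].dropna(), test[col].dropna())
    if p < 0.01:
        print(f'{col}: train/test distributions differ (KS stat={stat:.3f}, p={p:.2g})')
```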
Data leakage
- The split should be done by time.
- In real life we don't have information from the future
- In competitions, the first thing to check: is the train/public/private split done by time?
- Even when split by time, features may contain information about the future.
- User history in CTR tasks
- Weather