Exploratory data analysis
What and Why?
- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
Building intuition about the data
- Get domain knowledge – It helps to understand the problem more deeply
- Check if the data is intuitive – And agrees with domain knowledge
- Understand how the data was generated – It is crucial for setting up a proper validation
- Explore individual features
- Explore pairs and groups
- Clean features up
- Check for leaks!
Exploring anonymized data
Two things to do with anonymized features:
1. Try to decode the features: guess the true meaning of the feature
2. Guess the feature types: each type needs its own preprocessing
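A minimal sketch of how one could start guessing feature types and decoding anonymized columns; `df` here is a toy stand-in for the real feature DataFrame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for an anonymized dataset; replace with the real DataFrame.
df = pd.DataFrame({'f0': np.random.randint(0, 3, 100),
                   'f1': np.random.randn(100) * 0.1 + 5.0})

for col in df.columns:
    s = df[col]
    print(col, s.dtype, 'unique values:', s.nunique())
    print(s.value_counts().head())  # few distinct values -> likely categorical

    if pd.api.types.is_numeric_dtype(s):
        u = np.sort(s.dropna().unique())
        # A constant gap between sorted unique values hints at a scaled/shifted
        # integer feature, which can often be decoded by dividing by the gap.
        print('smallest gaps between unique values:', np.diff(u)[:5])
```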
Visualization
EDA is an art, and visualizations are our art tools!
Tools for individual features exploration
- Histograms:
  plt.hist(x)
- Plot (index versus value):
  plt.plot(x, '.')
- Statistics:
  df.describe()
  x.mean()
  x.var()
- Other tools:
  x.value_counts()
  x.isnull()
Explore feature relations
- Pairs – Scatter plot, scatter matrix, corrplot
  plt.scatter(x1, x2)
  pd.plotting.scatter_matrix(df)
  corr = df.corr()
  plt.matshow(corr)
- Groups – Corrplot + clustering, plot (index vs feature statistics); see the clustering sketch below
  df.mean().plot(style='.')
  df.mean().sort_values().plot(style='.')
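A hedged sketch of the corrplot + clustering idea: reorder the correlation matrix with hierarchical clustering so that groups of correlated features show up as blocks (the `df` here is a toy stand-in):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy numeric DataFrame; replace with the real features.
df = pd.DataFrame(np.random.randn(200, 6), columns=[f'f{i}' for i in range(6)])

corr = df.corr()
# Cluster features on their correlation profiles and reorder the matrix,
# so related feature groups appear as blocks along the diagonal.
order = leaves_list(linkage(corr.values, method='average'))
plt.matshow(corr.iloc[order, order].values)
plt.colorbar()
plt.show()
```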
Dataset cleaning
- Constant features
  train.nunique() == 1
- Duplicated features
  traintest.T.drop_duplicates()
  # For categorical features, compare label-encoded versions of the columns
  for f in categorical_feats:
      traintest[f] = traintest[f].factorize()[0]
  traintest.T.drop_duplicates()
- Duplicated rows (see the sketch after this list)
- Check if same rows have same label
- Find duplicated rows, understand why they are duplicated
- Check if dataset is shuffled
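A minimal sketch of the duplicated-rows checks above; the `train` frame and its `target` column are toy stand-ins:

```python
import pandas as pd

# Toy example; replace with the real train DataFrame and target column.
train = pd.DataFrame({'f0': [1, 1, 2, 2], 'f1': [0, 0, 3, 3],
                      'target': [1, 1, 0, 1]})

feat_cols = [c for c in train.columns if c != 'target']

# Find fully duplicated feature rows
dup_mask = train.duplicated(subset=feat_cols, keep=False)
print(train[dup_mask])

# Check whether identical feature rows always carry the same label
conflicts = train[dup_mask].groupby(feat_cols)['target'].nunique()
print('duplicate groups with conflicting labels:', int((conflicts > 1).sum()))
```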
Validation
- Validation helps us evaluate the quality of the model
- Validation helps us select the model which will perform best on unseen data
- Underfitting refers to not capturing enough patterns in the data
- Generally, overfitting refers to
- capturing noise
- capturing patterns which do not generalize to test data
- In competitions, overfitting refers to
- unexpectedly low model quality on test data, given the validation scores
- There are three main validation strategies:
- Holdout > sklearn.model_selection.ShuffleSplit
- Split train data into two parts: partA and partB.
- Fit the model on partA, predict for partB.
- Use the predictions for partB to estimate model quality. Find hyperparameters that maximize the quality on partB.
- KFold > sklearn.model_selection.KFold
- Split train data into K folds.
- Iterate through each fold: retrain the model on all folds except the current fold, predict for the current fold.
- Use the predictions to calculate the quality on each fold. Find hyperparameters that maximize the average quality across folds. You can also estimate the mean and variance of the loss, which is very helpful for judging the significance of an improvement.
- LOO > sklearn.model_selection.LeaveOneOut
- Iterate over samples: retrain the model on all samples except the current one, predict for the current sample. You will need to retrain the model N times (where N is the number of samples in the dataset).
- In the end you will get LOO predictions for every sample in the train set and can calculate the loss.
- Note that these validation schemes are meant to estimate the quality of the model (a combined sketch follows after this list). Once you have found the right hyperparameters and want to get test predictions, don't forget to retrain your model on all training data.
- Stratification preserves the same target distribution over different folds
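A combined sketch of the schemes above (plus a stratified variant), assuming accuracy as the metric and toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, KFold, LeaveOneOut, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)   # toy data
model = LogisticRegression(max_iter=1000)

# Holdout: a single random partA/partB split
tr, va = next(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))
model.fit(X[tr], y[tr])
print('holdout:', accuracy_score(y[va], model.predict(X[va])))

# KFold: mean and std of fold scores help judge the significance of improvements
scores = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[tr], y[tr])
    scores.append(accuracy_score(y[va], model.predict(X[va])))
print('kfold: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# LOO: N splits of size 1 (expensive, shown here on a small slice only)
Xs, ys = X[:50], y[:50]
loo_scores = [accuracy_score(ys[va], model.fit(Xs[tr], ys[tr]).predict(Xs[va]))
              for tr, va in LeaveOneOut().split(Xs)]
print('loo:', np.mean(loo_scores))

# Stratified KFold preserves the target distribution in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Once hyperparameters are fixed, retrain on ALL training data for test predictions
model.fit(X, y)
```

`skf.split(X, y)` is used the same way as `KFold.split`, but it needs `y` in order to stratify the folds.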
Data split
- In most cases data is split by:
- Row number
- Time
- Id
- Logic of feature generation depends on the data splitting strategy
- Set up your validation to mimic the train/test split of the competition (see the sketch below)
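A minimal sketch of mimicking a time-based train/test split in validation; the `date` column and the 80/20 cut are hypothetical:

```python
import pandas as pd

# Toy data with a hypothetical 'date' column; replace with the real train set.
train = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=100),
                      'target': range(100)})

# If the test set covers the period right after the train set, validate the same
# way: the last part of train (by time) becomes the validation set.
train = train.sort_values('date')
cut = int(len(train) * 0.8)            # assumed 80/20 time-based cut
tr, val = train.iloc[:cut], train.iloc[cut:]
print(tr['date'].max(), '->', val['date'].min())
```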
Validation problems
- If we see a large dispersion of scores at the validation stage, we should do extensive validation
- Average scores from different KFold splits
- Tune model on one split, evaluate score on the other
- If the submission's score does not match the local validation score, we should
- Check if we have too little data in public LB
- Check if we overfitted
- Check if we chose correct splitting strategy
- Check if train/test have different distributions (see the KS-test sketch after this list)
- Expect LB shuffle because of
- Randomness
- Little amount of data
- Different public/private distributions
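One hedged way to check whether train and test feature distributions differ is a two-sample KS test per numeric feature (toy frames shown; replace with the real data):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Toy frames; replace with the real train/test DataFrames.
rng = np.random.default_rng(0)
train = pd.DataFrame({'f0': rng.normal(0, 1, 500), 'f1': rng.normal(0, 1, 500)})
test = pd.DataFrame({'f0': rng.normal(0, 1, 500), 'f1': rng.normal(1, 1, 500)})

for col in train.columns:
    # Small p-value -> the feature is distributed differently in train and test
    stat, p = ks_2samp(train[col].dropna(), test[col].dropna())
    if p < 0.01:
        print(f'{col}: train/test distributions differ (KS stat={stat:.3f}, p={p:.2g})')
```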
Data leakage
- The split should be done by time.
- In real life we don't have information from the future
- In competitions, the first thing to check: is the train/public/private split done by time?
- Even when split by time, features may contain information about the future.
- User history in CTR tasks
- Weather