- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
- Get domain knowledge
– It helps to understand the problem more deeply
- Check if the data is intuitive
– And agrees with domain knowledge
- Understand how the data was generated
- Explore individual features
- Explore pairs and groups
- Check for leaks!
Two things to do with anonymized features:
- Try to decode the features
- Guess the true meaning of the feature
- Guess the feature types
- Each type needs its own preprocessing
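Since anonymized feature names carry no meaning, type guesses have to come from the values themselves. A minimal sketch of such heuristics (the dataframe, the `guess_type` helper, and its thresholds are hypothetical illustrations, not part of the original material):

```python
import numpy as np
import pandas as pd

# Hypothetical anonymized dataframe: column names carry no meaning.
df = pd.DataFrame({
    "x0": np.random.RandomState(0).randn(100),          # looks continuous
    "x1": np.random.RandomState(1).randint(0, 5, 100),  # few unique ints
    "x2": np.arange(100),                               # strictly increasing
})

def guess_type(col: pd.Series) -> str:
    """Rough, illustrative heuristic for the semantic type of a feature."""
    vals = col.dropna()
    # Few distinct integer-valued levels -> likely a categorical encoding.
    if col.nunique() < 20 and (vals == vals.round()).all():
        return "categorical/ordinal"
    # Strictly increasing values -> maybe a row index or timestamp.
    if col.is_monotonic_increasing:
        return "index/time-like"
    return "numeric"

guesses = {name: guess_type(df[name]) for name in df.columns}
```

A handful of unique integer values hints at a categorical encoding, while a monotonically increasing column may be a row index or a timestamp.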
EDA is an art, and visualizations are our art tools!
Plot (index versus value):
− Scatter plot, scatter matrix
− Corrplot + clustering
− Plot (index vs feature statistics)
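The "corrplot + clustering" idea can be sketched without a plotting backend: compute the correlation matrix, then reorder the features by hierarchical clustering so that correlated blocks sit together. The toy data below is hypothetical; the reordered matrix is what you would pass to `plt.matshow` or a seaborn heatmap:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

rng = np.random.RandomState(0)
# Toy data: x0 and x1 share a common signal, x2 is independent.
base = rng.randn(200)
df = pd.DataFrame({"x0": base + 0.1 * rng.randn(200),
                   "x1": base + 0.1 * rng.randn(200),
                   "x2": rng.randn(200)})

corr = df.corr()

# Cluster features on (1 - |correlation|) so strongly correlated
# features end up adjacent in the reordered matrix.
dist = squareform((1 - corr.abs()).to_numpy(), checks=False)
order = leaves_list(linkage(dist, method="average"))
corr_sorted = corr.iloc[order, order]
```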
train.nunique(axis=0) == 1  # per-column unique counts: True marks constant features
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]
- Check if same rows have same label
- Find duplicated rows, understand why they are duplicated
- Check if dataset is shuffled
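These checks map to a few pandas one-liners; a sketch on a small hypothetical train set:

```python
import pandas as pd

# Hypothetical train set with one fully duplicated row (f=2, target=1).
train = pd.DataFrame({"f": [1, 2, 2, 3], "target": [0, 1, 1, 0]})

# Fully duplicated rows (features and label): mark every copy.
dup_mask = train.duplicated(keep=False)

# Identical feature rows should carry the same label.
label_consistent = (train.groupby("f")["target"].nunique() == 1).all()

# Drop exact duplicates once you understand why they appear.
train_dedup = train.drop_duplicates()
```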
- Validation helps us evaluate the quality of the model
- Validation helps us select the model which will perform best on unseen data
- Underfitting refers to not capturing enough patterns in the data
- Generally, overfitting refers to
- capturing noise
- capturing patterns which do not generalize to test data
- In competitions, overfitting refers to
- unexpectedly low model quality on the test data, given the validation scores
- There are three main validation strategies:
- Holdout: split the train data into two parts, partA and partB.
- Fit the model on partA, predict for partB.
- Use the predictions for partB to estimate model quality. Find hyper-parameters that maximize quality on partB.
- KFold: split the train data into K folds.
- Iterate through the folds: retrain the model on all folds except the current one, predict for the current fold.
- Use the predictions to calculate quality on each fold. Find hyper-parameters that maximize quality on each fold. You can also estimate the mean and variance of the loss, which is very helpful for understanding the significance of an improvement.
- Leave-one-out (LOO): iterate over samples: retrain the model on all samples except the current one, predict for the current sample. You will need to retrain the model N times (where N is the number of samples in the dataset).
- In the end you will get LOO predictions for every sample in the train set and can calculate the loss.
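The three strategies above can be sketched with scikit-learn on a small hypothetical toy problem (the data and the logistic regression model are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split

rng = np.random.RandomState(0)
X = rng.randn(120, 3)
y = (X[:, 0] > 0).astype(int)  # toy target driven by the first feature

# Holdout: fit on partA, estimate quality on partB.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_score = accuracy_score(y_b, LogisticRegression().fit(X_a, y_a).predict(X_b))

# KFold: one score per fold; mean/std show whether an improvement is significant.
fold_scores = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[tr], y[tr])
    fold_scores.append(accuracy_score(y[va], model.predict(X[va])))
mean_score, std_score = np.mean(fold_scores), np.std(fold_scores)

# Leave-one-out: retrain N times, one prediction per sample.
loo_preds = np.empty(len(y))
for tr, te in LeaveOneOut().split(X):
    loo_preds[te] = LogisticRegression().fit(X[tr], y[tr]).predict(X[te])
loo_score = (loo_preds == y).mean()
```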
- Note that these validation schemes are meant to estimate the quality of the model. Once you have found the right hyper-parameters and want to get test predictions, don’t forget to retrain your model on all of the training data.
- Stratification preserves the same target distribution over different folds
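This effect can be checked directly: with scikit-learn's `StratifiedKFold`, every fold reproduces the positive rate of a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
y = np.array([0] * 90 + [1] * 10)  # imbalanced toy target: 10% positives
X = rng.randn(100, 2)

# Positive rate in each validation fold should match the overall 10%.
fold_rates = []
for _, va_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    fold_rates.append(y[va_idx].mean())
```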
- In most cases data is split by row number, by time, or by id
- Logic of feature generation depends on the data splitting strategy
- Set up your validation to mimic the train/test split of the competition
- If we have a large dispersion of scores across validation splits, we should do extensive validation:
- Average scores from different KFold splits
- Tune model on one split, evaluate score on the other
- If the submission’s score does not match the local validation score:
- Check if we have too little data in public LB
- Check if we overfitted
- Check if we chose correct splitting strategy
- Check if train/test have different distributions
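One way to check the last point is a two-sample Kolmogorov-Smirnov test per feature; a sketch with hypothetical data where the test feature is shifted relative to train:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(0)
train_feat = rng.randn(500)        # toy train feature
test_feat = rng.randn(500) + 1.0   # shifted -> different distribution

# A small p-value suggests the train/test distributions differ
# for this feature, so local validation may not reflect the LB.
stat, p_value = ks_2samp(train_feat, test_feat)
```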
- Expect LB shuffle because of
– Small amount of data
– Different public/private distributions
- The split should be done by time.
- In real life we don’t have information from the future
- In competitions, the first thing to check: is the train/public/private split done by time?
- Even when the split is by time, features may contain information about the future:
- User history in CTR tasks
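A time-respecting split can be sketched with scikit-learn's `TimeSeriesSplit`: every validation fold lies strictly after its training part, mimicking the "no information from the future" constraint (the row-ordered toy data is hypothetical):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data whose rows are ordered by time.
X = np.arange(10).reshape(-1, 1)

# Each split trains on the past and validates on the future only.
splits = list(TimeSeriesSplit(n_splits=3).split(X))
```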