Practical Guide for Kaggle Competition

Define your goals.

What can you get out of participating?

  1. To learn more about an interesting problem
  2. To get acquainted with new software tools
  3. To hunt for a medal

Working with ideas

  1. Organize ideas in some structure
  2. Select the most important and promising ideas
  3. Try to understand the reasons why something does/doesn’t work

Initial pipeline

  1. Get familiar with the problem domain
  2. Start with a simple (or even primitive) solution
  3. Debug the full pipeline
    − From reading the data to writing the submission file
  4. “From simple to complex”
    − I prefer to start with Random Forest rather than Gradient Boosted Decision Trees
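
A minimal sketch of such an end-to-end baseline, from reading data to writing a submission file. The file names (train.csv, test.csv, submission.csv), the column names (id, target, f0..f2) and the helpers make_toy_files and run_baseline are all hypothetical; the toy data only exists so the example runs anywhere. A real competition will have its own files and submission format.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def make_toy_files(folder: Path, seed: int = 0) -> None:
    """Write hypothetical train.csv / test.csv so the sketch runs anywhere."""
    rng = np.random.default_rng(seed)
    train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
    train.insert(0, "id", np.arange(200))
    train["target"] = (train["f0"] + train["f1"] > 0).astype(int)
    test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])
    test.insert(0, "id", np.arange(200, 250))
    train.to_csv(folder / "train.csv", index=False)
    test.to_csv(folder / "test.csv", index=False)


def run_baseline(folder: Path, seed: int = 0) -> Path:
    """Read data, fit a simple Random Forest, and write a submission file."""
    train = pd.read_csv(folder / "train.csv")
    test = pd.read_csv(folder / "test.csv")
    features = [c for c in train.columns if c not in ("id", "target")]

    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(train[features], train["target"])

    # Predicted probability of the positive class for every test row.
    submission = pd.DataFrame({
        "id": test["id"],
        "target": model.predict_proba(test[features])[:, 1],
    })
    out_path = folder / "submission.csv"
    submission.to_csv(out_path, index=False)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    make_toy_files(folder)
    submission = pd.read_csv(run_baseline(folder))
```

Once this runs without errors, every later improvement (better features, a stronger model) can be swapped into run_baseline one piece at a time.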

Data loading

  1. Do basic preprocessing once and convert csv/txt files into hdf5/npy for much faster loading
  2. Do not forget that by default data is stored in 64-bit arrays; most of the time you can safely downcast it to 32 bits
  3. Large datasets can be processed in chunks
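
The three points above can be sketched with pandas and NumPy. This is only an illustration under assumptions: the file names, the toy columns a and b, the chunk size and the helper downcast_to_32bit are hypothetical, and integer columns are downcast only when their values fit into 32 bits.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def downcast_to_32bit(df: pd.DataFrame) -> pd.DataFrame:
    """Convert float64 columns to float32 and int64 columns to int32."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == np.float64:
            out[col] = out[col].astype(np.float32)
        elif out[col].dtype == np.int64:
            # Downcast integers only when every value fits into int32.
            info = np.iinfo(np.int32)
            if out[col].min() >= info.min and out[col].max() <= info.max:
                out[col] = out[col].astype(np.int32)
    return out


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"  # hypothetical file name
    npy_path = Path(tmp) / "train.npy"

    # Toy data standing in for a real competition file.
    rng = np.random.default_rng(0)
    toy = pd.DataFrame({"a": rng.normal(size=1000), "b": np.arange(1000)})
    toy.to_csv(csv_path, index=False)

    # Read the CSV in chunks and downcast each chunk as it arrives,
    # so the full 64-bit frame never has to sit in memory.
    chunks = [downcast_to_32bit(c) for c in pd.read_csv(csv_path, chunksize=250)]
    small = pd.concat(chunks, ignore_index=True)

    # Save once as .npy; later runs load the binary file instead of parsing text.
    np.save(npy_path, small.to_numpy(dtype=np.float32))
    loaded = np.load(npy_path)
```

Note that float32 keeps roughly 7 significant digits, so identifiers or very large counts are better left in their original type.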

Performance evaluation

  1. Extensive validation is not always needed
  2. Start with the fastest models, such as LightGBM

Everything is a hyperparameter

Sort all parameters by these principles:

  1. Importance
  2. Feasibility
  3. Understanding

Note: changing one parameter can affect the whole pipeline

Tricks

  1. Fast and dirty is always better
    • Don’t pay too much attention to code quality
    • Keep things simple: save only important things
    • If you feel uncomfortable with the given computational resources, rent a larger server
  2. Use good variable names
    • If your code is hard to read, you will definitely have problems sooner or later
  3. Keep your research reproducible
    • Fix random seed
    • Write down exactly how any features were generated
    • Use Version Control Systems (VCS, for example, git)
  4. Reuse code
    • It is especially important to use the same code for the train and test stages
  5. Read papers
    • For example, how to optimize AUC
  6. Read forums and examine kernels first
  7. Code organization
    • keeping it clean
    • macros
    • test/val
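
Two of the tricks above, fixing the random seed and reusing the same code for train and test, can be sketched as follows. The helpers fix_seed and make_features and the toy features (row sum and row max) are hypothetical; frameworks with their own random generators need to be seeded separately.

```python
import random

import numpy as np


def fix_seed(seed: int = 42) -> None:
    """Seed Python's and NumPy's global random generators."""
    random.seed(seed)
    np.random.seed(seed)


def make_features(raw: np.ndarray) -> np.ndarray:
    """Single feature function, called for both train and test data."""
    row_sum = raw.sum(axis=1, keepdims=True)
    row_max = raw.max(axis=1, keepdims=True)
    return np.hstack([raw, row_sum, row_max])


# Re-seeding with the same value reproduces the same random numbers.
fix_seed(0)
train_raw = np.random.rand(5, 3)
fix_seed(0)
train_raw_again = np.random.rand(5, 3)

test_raw = np.random.rand(2, 3)

# Train and test go through exactly the same code path,
# so their feature columns cannot drift apart.
train_X = make_features(train_raw)
test_X = make_features(test_raw)
```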

Pipeline detail

Procedure                    Days
Understand the problem       1 ~ 2 days
Exploratory data analysis    1 ~ 2 days
Define CV strategy           1 day
Feature engineering          Until the last 3 ~ 4 days
Modeling                     Until the last 3 ~ 4 days
Ensembling                   Last 3 ~ 4 days

Understand the problem broadly

  1. Type of problem
  2. How big is the dataset
  3. What is the metric
  4. Relevant previous code
  5. Hardware needed (CPU, GPU, …)
  6. Software needed (TF, scikit-learn, XGBoost, LightGBM)

EDA

See the blog post “Exploratory data analysis”.

Define CV strategy

  1. This step is critical
  2. Is time important? Use time-based validation
  3. Does the test set contain different entities than the train set? Use stratified validation (StratifiedKFold)
  4. Is it completely random? Use random validation
  5. A combination of all the above
  6. Use the leaderboard to test
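
The splitting schemes in the list above map onto scikit-learn splitters roughly as sketched here. The toy arrays X, y and groups are made up for illustration. GroupKFold is not named in the list; it is included as an assumption, as one common way to keep each entity (for example a user) entirely inside one fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 12
X = rng.normal(size=(n, 2))
y = np.array([0, 1] * (n // 2))      # toy binary target
groups = np.repeat(np.arange(4), 3)  # hypothetical entity ids, e.g. users

# Time matters: every training fold ends before its validation fold starts
# (assumes the rows are already sorted by time).
time_folds = list(TimeSeriesSplit(n_splits=3).split(X))

# Stratified: keep the class balance of y the same in every fold.
strat_folds = list(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y)
)

# Assumption beyond the list above: keep each entity in a single fold.
group_folds = list(GroupKFold(n_splits=4).split(X, y, groups))

# Completely random rows: plain shuffled K-fold.
random_folds = list(KFold(n_splits=3, shuffle=True, random_state=0).split(X))
```

Each entry in these lists is a pair of index arrays (train, validation) that can be used to slice X and y.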

Feature Engineering

Modeling

Ensembling
