
Advanced Features

Statistics and distance-based features

Groupby features

# Min and max ad price per (user_id, page_id) pair
gb = df.groupby(['user_id', 'page_id'], as_index=False).agg(
    min_price=('ad_price', 'min'), max_price=('ad_price', 'max'))

  • How many pages the user visited
  • Standard deviation of prices
  • Most visited page
  • Many, many more (a few of these are sketched below)
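
A minimal sketch of a few of these statistics, assuming the same hypothetical df with user_id, page_id and ad_price columns (names are illustrative):

# Hypothetical columns: user_id, page_id, ad_price
user_stats = df.groupby('user_id').agg(
    n_pages_visited=('page_id', 'nunique'),   # how many pages the user visited
    price_std=('ad_price', 'std'))            # standard deviation of prices
most_visited = (df.groupby('user_id')['page_id']
                  .agg(lambda s: s.mode().iloc[0])   # most visited page
                  .rename('most_visited_page'))
features = user_stats.join(most_visited).reset_index()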

Neighbors

  • Explicit group is not needed
  • More flexible
  • Much harder to implement

Examples of neighbor-based features:

  • Number of houses within 500m, 1000m, ...
  • Average price per square meter within 500m, 1000m, ...
  • Number of schools/supermarkets/parking lots within 500m, 1000m, ...
  • Distance to the closest subway station

KNN features as an example:

  • Mean encode all the variables
  • For every point, find its 2000 nearest neighbors using the Bray-Curtis metric \[\frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|}\]
  • Calculate various features from those 2000 neighbors:
    • Mean target of nearest 5, 10, 15, 500, 2000 neighbors
    • Mean distance to 10 closest neighbors
    • Mean distance to 10 closest neighbors with target 1
    • Mean distance to 10 closest neighbors with target 0
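
A rough sketch of a couple of these features with scikit-learn's NearestNeighbors, assuming X is a numpy array of mean-encoded features and y a numpy array of 0/1 targets (the variable names and exact feature set are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=2000, metric='braycurtis').fit(X)
dist, idx = nn.kneighbors(X)   # note: each point is returned as its own first neighbor

mean_target_10 = y[idx[:, :10]].mean(axis=1)   # mean target of 10 nearest neighbors
mean_dist_10 = dist[:, :10].mean(axis=1)       # mean distance to 10 closest neighbors

# mean distance to the 10 closest neighbors with target 1
dist_pos = np.where(y[idx] == 1, dist, np.nan)
mean_dist_10_pos = np.nanmean(np.sort(dist_pos, axis=1)[:, :10], axis=1)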

Matrix Factorizations for Feature Extraction

  • Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
  • It can be applied to transform categorical features into real-valued ones
  • Many of the tricks suitable for linear models are also useful for MF

  • Can be applied to only some of the columns
  • Can provide additional diversity − Good for ensembles
  • It is a lossy transformation. Its efficiency depends on: − The particular task − The number of latent factors (usually 5-100)
  • Several MF methods can be found in sklearn
  • SVD and PCA − Standard tools for Matrix Factorization
  • TruncatedSVD − Works with sparse matrices
  • Non-negative Matrix Factorization (NMF) − Ensures that all latent factors are non-negative − Good for counts-like data
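
A minimal sketch with TruncatedSVD, assuming X_train / X_test are sparse count or one-hot matrices; fitting on their concatenation keeps train and test in the same latent space:

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, random_state=0)   # number of latent factors (usually 5-100)
svd.fit(sp.vstack([X_train, X_test]))                 # fit train and test together
X_train_svd = svd.transform(X_train)
X_test_svd = svd.transform(X_test)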

Feature interactions

  • We have a lot of possible interactions − N*N for N features
    • Even more if several types of interactions are used
  • We need to reduce their number
    • Dimensionality reduction
    • Feature selection
  • Interaction order
    • We looked at 2nd and higher order interactions.
    • It is hard to do generation and selection automatically.
    • Manual building of high-order interactions is some kind of art.

Frequent operations for feature interaction

  • Multiplication
  • Sum
  • Diff
  • Division
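
A small sketch generating all pairwise interactions with these four operations, assuming a purely numeric DataFrame df (the function and column naming scheme are illustrative):

from itertools import combinations
import pandas as pd

def pairwise_interactions(df):
    out = pd.DataFrame(index=df.index)
    for a, b in combinations(df.columns, 2):
        out[f'{a}_mul_{b}'] = df[a] * df[b]    # multiplication
        out[f'{a}_sum_{b}'] = df[a] + df[b]    # sum
        out[f'{a}_diff_{b}'] = df[a] - df[b]   # diff
        out[f'{a}_div_{b}'] = df[a] / df[b]    # division (may yield inf for zero denominators)
    return out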

Example of interaction generation pipeline

Extract features from DT

tree.apply() returns the index of the leaf that each sample ends up in; this is one way to obtain high-order interaction features.

tree.apply(X)   # leaf index for each sample in X
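
A minimal sketch of the full idea with a single decision tree: fit it, take the leaf index of every sample, and one-hot encode those indices into new features (X_train / X_test / y_train are assumed numeric arrays):

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
leaf_train = tree.apply(X_train).reshape(-1, 1)   # leaf index per training sample
leaf_test = tree.apply(X_test).reshape(-1, 1)

enc = OneHotEncoder(handle_unknown='ignore')
X_train_leaf = enc.fit_transform(leaf_train)      # sparse one-hot leaf indicators
X_test_leaf = enc.transform(leaf_test)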

tSNE

  • Result heavily depends on hyperparameters (perplexity)
  • Good practice is to use several projections with different perplexities (5-100)
  • Due to its stochastic nature, tSNE produces different projections even for the same data − Train and test should be projected together
  • tSNE runs for a long time when there are many features − it is common to reduce dimensionality before the projection
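
A minimal sketch following these points, assuming numeric train/test matrices; they are stacked so both get the same projection, and PCA is applied first to speed tSNE up:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_all = np.vstack([X_train, X_test])
X_all = PCA(n_components=50).fit_transform(X_all)                 # reduce dimensionality first
proj = TSNE(n_components=2, perplexity=30).fit_transform(X_all)   # try several perplexities (5-100)
tsne_train, tsne_test = proj[:len(X_train)], proj[len(X_train):]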