Advanced Features
Statistics and distance-based features
Groupby feature
```python
gb = df.groupby(['user_id', 'page_id'], as_index=False).agg(
    max_price=('ad_price', 'max'), min_price=('ad_price', 'min'))
```
- How many pages the user visited
- Standard deviation of prices
- Most visited page
- Many, many more
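A minimal sketch of a few of those extra groupby statistics, assuming the same hypothetical `df` with `user_id`, `page_id`, and `ad_price` columns:

```python
import pandas as pd

# Per-user statistics (column names are illustrative)
user_stats = df.groupby('user_id').agg(
    n_pages_visited=('page_id', 'nunique'),  # how many pages the user visited
    price_std=('ad_price', 'std'),           # standard deviation of prices
    most_visited_page=('page_id', lambda s: s.mode().iloc[0]),  # most visited page
).reset_index()

df = df.merge(user_stats, on='user_id', how='left')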
Neighbors
- An explicit group is not needed
- More flexible
- Much harder to implement
Examples of neighbor-based features:
- Number of houses within 500m, 1000m, ...
- Average price per square meter within 500m, 1000m, ...
- Number of schools/supermarkets/parking lots within 500m, 1000m, ...
- Distance to closest subway station
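A minimal sketch of the within-radius counts using a BallTree (it assumes `coords` is an `(n, 2)` array of planar coordinates in meters; the column names and radii are illustrative):

```python
from sklearn.neighbors import BallTree

tree = BallTree(coords)
for radius in (500, 1000):
    # count_only=True returns just the number of points within `radius`
    counts = tree.query_radius(coords, r=radius, count_only=True)
    df[f'n_houses_{radius}m'] = counts - 1  # exclude the point itself
```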
KNN features as an example:
- Mean-encode all the variables
- For every point, find the 2000 nearest neighbors using the Bray-Curtis metric \[d(u, v) = \frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|}\]
- Calculate various features from those 2000 neighbors:
    - Mean target of the nearest 5, 10, 15, 500, 2000 neighbors
    - Mean distance to the 10 closest neighbors
    - Mean distance to the 10 closest neighbors with target 1
    - Mean distance to the 10 closest neighbors with target 0
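A rough sketch of such KNN features with sklearn, assuming `X` is the mean-encoded feature matrix and `y` a binary target, both NumPy arrays (in practice target-based features like these should be computed out-of-fold to avoid leakage):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

K = 2000
nn = NearestNeighbors(n_neighbors=K + 1, metric='braycurtis').fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]   # drop each point itself
neigh_y = y[idx]                      # neighbor targets, shape (n_samples, K)

feats = {f'mean_target_{k}': neigh_y[:, :k].mean(axis=1)
         for k in (5, 10, 15, 500, 2000)}
feats['mean_dist_10'] = dist[:, :10].mean(axis=1)
for t in (0, 1):  # mean distance to the 10 closest neighbors with target t
    masked = np.where(neigh_y == t, dist, np.nan)  # NaNs sort to the end
    feats[f'mean_dist_10_target_{t}'] = np.nanmean(
        np.sort(masked, axis=1)[:, :10], axis=1)
```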
Matrix Factorizations for Feature Extraction
- Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
- It can be applied to transform categorical features into real-valued ones
- Many of the tricks suitable for linear models are also useful for MF
- Can be applied to only some of the columns
- Can provide additional diversity, which is good for ensembles
- It is a lossy transformation; its efficiency depends on:
    - The particular task
    - The number of latent factors (usually 5-100)
- Several MF methods can be found in sklearn:
    - SVD and PCA: standard tools for Matrix Factorization
    - TruncatedSVD: works with sparse matrices
    - Non-negative Matrix Factorization (NMF): ensures that all latent factors are non-negative; good for count-like data
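A minimal sketch with sklearn, assuming `X_train`/`X_test` are sparse count matrices (e.g., from CountVectorizer); fitting the factorization on the concatenation of train and test is a common competition trick that keeps the projections consistent:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD, NMF

X_all = sp.vstack([X_train, X_test])  # fit on train + test together

svd = TruncatedSVD(n_components=20)   # works directly on sparse input
X_svd = svd.fit_transform(X_all)

nmf = NMF(n_components=20)            # non-negative factors, good for counts
X_nmf = nmf.fit_transform(X_all)

n = X_train.shape[0]
train_feats, test_feats = X_svd[:n], X_svd[n:]
```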
Feature interactions
- We have a lot of possible interactions: N*N for N features
- Even more if we use several types of interactions
- We need to reduce their number via:
    - Dimensionality reduction
    - Feature selection
- Interactions' order
- We looked at 2nd-order and higher-order interactions.
- It is hard to generate and select them automatically.
- Manually building high-order interactions is something of an art.
Frequent operations for feature interaction
- Multiplication
- Sum
- Diff
- Division
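A minimal sketch that generates all four pairwise interactions for a few numeric columns (the column names are illustrative):

```python
from itertools import combinations

num_cols = ['f1', 'f2', 'f3']  # hypothetical numeric columns
for a, b in combinations(num_cols, 2):
    df[f'{a}_x_{b}'] = df[a] * df[b]             # multiplication
    df[f'{a}_plus_{b}'] = df[a] + df[b]          # sum
    df[f'{a}_minus_{b}'] = df[a] - df[b]         # diff
    df[f'{a}_div_{b}'] = df[a] / (df[b] + 1e-8)  # division, guarded against zero
```

The generated pool usually then needs the feature selection or dimensionality reduction step mentioned above.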
Example of interaction generation pipeline
Extract features from DT
Get the index of the leaf that each sample falls into and use it as a feature; this is one way to obtain high-order interaction features.
```python
leaf_indices = tree.apply(X)  # index of the leaf each sample falls into
```
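A fuller sketch along these lines: train a gradient boosting model and one-hot encode its leaf indices (the model choice, `n_estimators`, and variable names are illustrative; sklearn's `apply` works for both single trees and ensembles):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

gbm = GradientBoostingClassifier(n_estimators=50).fit(X_train, y_train)
# apply() returns the leaf index in every tree for every sample
leaves_train = gbm.apply(X_train).reshape(X_train.shape[0], -1)
leaves_test = gbm.apply(X_test).reshape(X_test.shape[0], -1)

enc = OneHotEncoder(handle_unknown='ignore').fit(leaves_train)
X_train_leaves = enc.transform(leaves_train)  # sparse high-order features
X_test_leaves = enc.transform(leaves_test)
```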
tSNE
- Results heavily depend on hyperparameters (perplexity)
- Good practice is to use several projections with different perplexities (5-100)
- Due to its stochastic nature, tSNE produces different projections even for the same data, so train and test should be projected together
- tSNE runs for a long time when there are many features, so it is common to apply dimensionality reduction before projecting
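A minimal sketch following those tips (PCA first, train and test projected together; the parameter values are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_all = np.vstack([X_train, X_test])  # project train and test together
X_reduced = PCA(n_components=50).fit_transform(X_all)

for perplexity in (5, 30, 100):       # several projections, as advised above
    emb = TSNE(n_components=2, perplexity=perplexity).fit_transform(X_reduced)
    train_emb, test_emb = emb[:len(X_train)], emb[len(X_train):]
```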