- Tree-based models doesn’t depend on scaling and Rank
b. Non-tree-based models hugely depend on scaling and Rank
- To [0,1]
x = (x-min()) / (x.max() - x.min())
2. To mean=0,std=1
x = (x - x.mean()) / x.std()
from scipy.stats import rankdata
You can apply log transformation through your data, or there’s another possibility. You can extract a square root of the data. Both these transformations can be useful because they drive too big values closer to the features’ average value. Along with this, the values near zero are becoming a bit more distinguishable. Despite the simplicity, one of these transformations can improve your neural network’s results significantly.
UPPERBOUND,LOWERBOUND = np.percentile(x,[1,99])
Sometimes, we can engineer these features using prior knowledge and logic. Sometimes we have to dig into the data, create and check hypothesis, and use this derived knowledge and our intuition to derive new features.
- prior knowledge
It is useful to know that adding, multiplications, divisions, and other features interactions can be of help not only for linear models. For example, although gradient within decision tree is a very powerful model, it still experiences difficulties with approximation of multiplications and divisions. And adding size features explicitly can lead to a more robust model with less amount of trees.
This feature can help the model utilize the differences in people’s perception of these prices. Also, we can find similar patterns in tasks which require distinguishing between a human and a robot.
- EDA(Exploratory data analysis)
- Label and Frequency encodings are often used for tree-
- One-hot encoding is often used for non-tree-based models
- Interactions of categorical features can help linear models
Values in ordinal features are sorted in some meaningful order
- Ticket class: 1,2,3
- Driver’s license: A, B, C, D
- Education: kindergarden, school, undergraduate, bachelor, master, doctoral
Label encoding maps categories to numbers
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
Frequency encoding maps categories to their frequencies
encoding = titanic.groupby(‘Embarked’).size()
encoding = encoding/len(titanic)
titanic["enc"] = titanic.Embarked.map(encoding)
Day number in week, month, season, year second, minute, hour.
- Time since
- Row-independent moment
For example: since 00:00:00 UTC, 1 January 1970;
- Row-dependent important moment
Number of days left until next holidays / time passed after last holiday.
- Difference between dates
- datetime_feature_1 - datetime_feature_2
- Interesting places from train/test data or additional data
Generally, you can calculate distances to important points on the map. You can extract interesting points on the map from your trained test data.
- Centers of clusters
- Aggregated statistics
use coordinates is to calculate aggregated statistics for objects surrounding area.
- If you train decision trees from them, you can add slightly rotated coordinates is new features. And this will help a model make more precise selections on the map.
- -999, -1, etc
- mean, median
- Reconstruct value
- we should be very careful with replacing missing values before our feature generation
- It can be beneficial to change the missing values or categories which present in the test data but do not present in the train data. The intention for doing so appeals to the fact that the model which didn’t have that category in the train data will eventually treat it randomly.
- Lowercase, stemming, lemmarization, stopwords
Lowercase: Very, very -> very,very
lemmarization: I had a car -> I have a car
- Lowercase, stemming, lemmarization, stopwords
Bag of words
# Design the Vocabulary
# Create Document Vectors
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
- Huge vectors
Ngrams can help to use local context
from sklearn.feature_extraction.text import CountVectorizer
# Ngram_range, analyzer
TF-IDF can be of use as postprocessing
如果某个词或短语在一篇文章中出现的频率TF高，并且在其他文章中很少出现，则认为此词或者短语具有很好的类别区分能力，适合用来分类。TFIDF实际上是：TF * IDF，TF词频(Term Frequency)，IDF反文档频率(Inverse Document Frequency)。TF表示词条在文档d中出现的频率（另一说：TF词频(Term Frequency)指的是某一个给定的词语在该文件中出现的次数）。IDF的主要思想是：如果包含词条t的文档越少，也就是n越小，IDF越大（见后续公式），则说明词条t具有很好的类别区分能力。
## Term frequency
tf = 1 / x.sum(axis=1) [:,None]
x = x * tf
## Inverse Document Frequency
idf = np.log(x.shape / (x > 0).sum(0)) x = x * idf
from sklearn.feature_extraction.text import TfidfVectorizer
- Relatively small vectors
- Pretrained models
- Features can be extracted from different layers
- Careful choosing of pretrained network can help
- Finetuning allows to refine pretrained models
- Data augmentation can improve the model