# Overview

1. Why are there so many metrics?
– Different metrics for different problems
2. Why should we care about the metric in competitions?
– It is how the competitors are ranked!

# Regression

Why the target mean minimizes the MSE error and why the target median minimizes the MAE error.

Suppose we have a dataset.

Basically, we are given pairs: features $x_i$ and a corresponding target value $y_i \in \mathbb{R}$.

We will denote the vector of targets as $y \in \mathbb{R}^N$, such that $y_i$ is the target for object $x_i$. Similarly, $\hat y \in \mathbb{R}^N$ denotes the predictions for the objects: $\hat y_i$ for object $x_i$.

## First-order and second-order derivatives

• If $f''(x) > 0$ on $(a, b)$, then the graph of $f(x)$ is convex (curves upward) on $[a, b]$;
• If $f''(x) < 0$ on $(a, b)$, then the graph of $f(x)$ is concave (curves downward) on $[a, b]$.

• If the first derivative equals 0 and the second derivative is greater than 0, the point is a local minimum;
• If the first derivative equals 0 and the second derivative is less than 0, the point is a local maximum;
• If both the first and second derivatives equal 0, the point is a stationary point and the test is inconclusive.
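The rules above can be sanity-checked numerically. A minimal sketch using finite differences on an example function $f(x) = (x-3)^2$ (the function and step sizes are assumptions for illustration):

```python
# Classify a stationary point with the second-derivative test,
# approximating derivatives by central finite differences.

def d1(f, x, h=1e-5):
    # first derivative: (f(x+h) - f(x-h)) / 2h
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    # second derivative: (f(x+h) - 2 f(x) + f(x-h)) / h^2
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

f = lambda x: (x - 3) ** 2   # stationary point at x = 3

print(d1(f, 3.0))  # first derivative ~ 0
print(d2(f, 3.0))  # second derivative ~ 2 > 0, so x = 3 is a local minimum
```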

## MSE

Now, the question is: if predictions for all the objects were the same and equal to $\alpha$: $\hat y_i = \alpha$, what value of $\alpha$ would minimize MSE error?

The function $f(\alpha) = \frac{1}{N} \sum_{i=1}^N (y_i - \alpha)^2$, that we want to minimize, is smooth with respect to $\alpha$. A necessary condition for $\alpha^*$ to be a local optimum is

$$\left. \frac{df}{d\alpha} \right|_{\alpha = \alpha^*} = 0$$

Let's find the points that satisfy the condition:

$$\frac{df}{d\alpha} = -\frac{2}{N} \sum_{i=1}^N (y_i - \alpha) = 0$$

And finally:

$$\alpha^* = \frac{1}{N} \sum_{i=1}^N y_i = mean(y)$$

Since the second derivative $\frac{d^2 f}{d \alpha^2} = 2$ is positive at the point $\alpha^*$, what we found is a local minimum.

So, that is how it is possible to find that the optimal constant for the MSE metric is the target mean value.
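This result is easy to verify numerically: scan a grid of constants $\alpha$ and pick the one with the smallest MSE. A minimal sketch, reusing the example target vector from the MAE section below:

```python
import numpy as np

# Check that the target mean minimizes MSE over constant predictions.
y = np.array([-0.5, 0.0, 1.0, 3.0, 3.4])

def mse(alpha):
    return np.mean((y - alpha) ** 2)

grid = np.linspace(-2, 5, 7001)                 # candidate constants, step 0.001
best = grid[np.argmin([mse(a) for a in grid])]  # grid point with smallest MSE

print(best, y.mean())  # best alpha is close to mean(y) = 1.38
```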

## MAE

Similarly to the way we found the optimal constant for the MSE loss, we can find it for MAE.

Recall that $\frac{d|x|}{dx} = sign(x)$, where $sign$ stands for the signum function. Thus, for $f(\alpha) = \frac{1}{N} \sum_{i=1}^N |y_i - \alpha|$,

$$\frac{df}{d\alpha} = \frac{1}{N} \sum_{i=1}^N sign(\alpha - y_i) =: g(\alpha)$$

So we need to find such $\alpha^*$ that

$$g(\alpha^*) = 0$$

Note that $g(\alpha)$ is a piecewise-constant non-decreasing function: $g(\alpha) = -1$ for all values of $\alpha$ less than the minimum $y_i$, and $g(\alpha) = 1$ for $\alpha > \max_i y_i$. The function "jumps" by $\frac{2}{N}$ at every point $y_i$; take $y = [-0.5, 0, 1, 3, 3.4]$ as an example:

Basically, there are $N$ jumps of the same size, starting from $-1$ and ending at $1$. It is clear that you need about $\frac{N}{2}$ jumps to hit zero, and that happens exactly at the median of the target vector: $g(median(y)) = 0$. We should be careful to separate the two cases of an even and an odd number of points, but the intuition remains the same.
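The same grid-search sketch as for MSE confirms that the median minimizes MAE, using the example vector above:

```python
import numpy as np

# Check that the target median minimizes MAE over constant predictions.
y = np.array([-0.5, 0.0, 1.0, 3.0, 3.4])

def mae(alpha):
    return np.mean(np.abs(y - alpha))

grid = np.linspace(-2, 5, 7001)                 # candidate constants, step 0.001
best = grid[np.argmin([mae(a) for a in grid])]  # grid point with smallest MAE

print(best, np.median(y))  # best alpha is close to median(y) = 1.0
```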

# Classification

## Accuracy

Best constant: predict the most frequent class.
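A minimal sketch of this best constant; the 90/10 label distribution is an assumption for illustration (it matches the cats-and-dogs example used later in the Kappa section):

```python
from collections import Counter

# Best constant for accuracy: always predict the most frequent class.
y = ['dog'] * 90 + ['cat'] * 10

best_constant = Counter(y).most_common(1)[0][0]
accuracy = sum(label == best_constant for label in y) / len(y)

print(best_constant, accuracy)  # 'dog', 0.9
```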

## Logarithmic loss

1. Binary:
$$LogLoss = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log \hat y_i + (1 - y_i) \log (1 - \hat y_i) \right]$$
2. Multiclass:
$$LogLoss = -\frac{1}{N} \sum_{i=1}^N \sum_{l=1}^L y_{il} \log \hat y_{il}$$
• Logloss strongly penalizes completely wrong answers
• Best constant: set $\alpha_{i}$ to the frequency of the $i$-th class.
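A small sketch comparing constant predictions under binary logloss; the 30/70 label split is an assumption for illustration:

```python
import numpy as np

# Compare binary logloss of different constant predictions p.
y = np.array([1] * 30 + [0] * 70)   # 30% positives (assumed split)

def logloss(p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(logloss(0.3))  # constant = class frequency, best among constants
print(logloss(0.5))  # worse
print(logloss(0.9))  # much worse: confident wrong answers are penalized hard
```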

## Area under ROC curve

• TP: true positives
• FP: false positives
• Best constant: all constants give the same score
• Random predictions lead to AUC = 0.5

## Kappa

### Cohen’s Kappa motivation

• $p_e$: the accuracy we would get on average if we randomly permuted our predictions
• $Kappa = 1 - \frac{1 - accuracy}{1 - p_e}$
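A minimal sketch of Cohen's Kappa under these definitions; the example labels are assumptions for illustration:

```python
# Cohen's Kappa: kappa = 1 - (1 - accuracy) / (1 - p_e), where p_e is
# the chance agreement given the marginal class frequencies.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# chance agreement: sum over classes of (true freq) * (predicted freq)
classes = set(y_true) | set(y_pred)
p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)

kappa = 1 - (1 - accuracy) / (1 - p_e)
print(round(kappa, 3))  # 0.545
```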

### Weighted Kappa

Dataset:

• 10 cats
• 90 dogs
• tigers

1. Error weight matrix $W$:

| pred/true | cat | dog | tiger |
|-----------|-----|-----|-------|
| cat       | 0   | 1   | 10    |
| dog       | 1   | 0   | 10    |
| tiger     | 1   | 1   | 0     |

You can define this matrix yourself.

2. Confusion matrix $C$:

| pred/true | cat | dog | tiger |
|-----------|-----|-----|-------|
| cat       | 4   | 2   | 3     |
| dog       | 2   | 88  | 5     |
| tiger     | 4   | 10  | 12    |

3. Weighted error:

$$weighted\ error = \frac{1}{const} \sum_{i,j} W_{ij} C_{ij}$$

4. Weighted Kappa:

$$weighted\ kappa = 1 - \frac{weighted\ error}{weighted\ baseline\ error}$$

5. Quadratic and Linear Weighted Kappa:
If the target is an ordered label, the weight matrix can simply be obtained as $W_{ij} = |i - j|$ (linear weights) or $W_{ij} = (i - j)^2$ (quadratic weights).
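The computation above can be sketched as follows, using the weight and confusion matrices from this section; taking the chance (expected) matrix built from the marginals of $C$ as the baseline is an assumption consistent with Cohen's Kappa:

```python
import numpy as np

# Weighted Kappa: kappa_w = 1 - sum(W * C) / sum(W * E), where E is the
# expected (chance) confusion matrix built from the marginals of C.
W = np.array([[0, 1, 10],
              [1, 0, 10],
              [1, 1,  0]])
C = np.array([[4,  2,  3],
              [2, 88,  5],
              [4, 10, 12]])

E = np.outer(C.sum(axis=1), C.sum(axis=0)) / C.sum()  # chance matrix
kappa_w = 1 - (W * C).sum() / (W * E).sum()
print(round(kappa_w, 3))  # 0.501
```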

# General approaches for metrics optimization

• Target metric is what we want to optimize
• Optimization loss is what model optimizes

The approaches can be broadly divided into several categories, depending on the metric we need to optimize. Some metrics can be optimized directly.

Approaches in general:

1. Just run the right model (given the metric we need to optimize)
• MSE, Logloss
2. Preprocess the train set and optimize another metric
• MSPE, MAPE, RMSLE, …
3. Optimize another metric, postprocess predictions
• Accuracy, Kappa
4. Write a custom loss function
• Any, if you can
5. Optimize another metric, use early stopping

## Regression metrics optimization

1. MSE and MAE
• Just find the right model

2. MSPE and MAPE
• Use weights for samples (sample_weights)
• And use MSE (MAE)
• Not every library accepts sample weights
• XGBoost and LightGBM accept them
• Easy to implement if not supported
• Resample the train set
• df.sample(weights=sample_weights)
• And use any model that optimizes MSE (MAE)
3. (R)MSLE
• Transform the target for the train set: $z_i = \log(y_i + 1)$
• Fit a model with MSE loss on $z_i$
• Transform predictions back: $\hat y_i = \exp(\hat z_i) - 1$
4. AUC

| ID | LABEL | TARGET |
|----|-------|--------|
| A  | 0     | 0.1    |
| B  | 0     | 0.4    |
| C  | 1     | 0.35   |
| D  | 1     | 0.8    |

• Pointwise loss
• Logloss
• Optimize MSE
• Find the right thresholds
− Better: optimize thresholds
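The AUC of the small table above can be computed directly as the fraction of correctly ordered (positive, negative) pairs:

```python
from itertools import product

# AUC as the fraction of (positive, negative) pairs ranked correctly,
# counting ties as half, using the ID/LABEL/TARGET table above.
labels  = {'A': 0,   'B': 0,   'C': 1,    'D': 1}
targets = {'A': 0.1, 'B': 0.4, 'C': 0.35, 'D': 0.8}

pos = [targets[k] for k in labels if labels[k] == 1]
neg = [targets[k] for k in labels if labels[k] == 0]

correct = sum(p > n for p, n in product(pos, neg))
ties    = sum(p == n for p, n in product(pos, neg))
auc = (correct + 0.5 * ties) / (len(pos) * len(neg))
print(auc)  # 0.75: only the pair (C=0.35, B=0.4) is misordered
```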

## Probability Calibration

• Logistic regression fits its parameters by maximum likelihood, i.e. it directly optimizes log-loss, so the probabilities returned by the logistic function are already well calibrated.
• Gaussian Naive Bayes assumes that all features are independent. In practice feature sets are often redundant and correlated, so a fitted Gaussian Naive Bayes model tends to be over-confident, pushing probabilities towards 0 or 1.
• Random Forest behaves the opposite way: since it averages over many classifiers (or takes a majority vote), it tends to be under-confident, with probabilities concentrated in the interior of (0, 1).
• Support Vector Machines, because of the hard margin, also produce probabilities concentrated in the interior of (0, 1); like Random Forest, they are under-confident.

1. Non-parametric isotonic regression: isotonic calibration is preferable for non-sigmoid calibration curves and in situations where large amounts of data are available for calibration.
• Just fit isotonic regression to your predictions (like in stacking)
2. Platt's scaling (sigmoid function): sigmoid calibration is preferable when the calibration curve is sigmoid and when there is limited calibration data.
• Just fit Logistic Regression to your predictions (like in stacking)
3. Stacking
• Just fit XGBoost or a neural net to your predictions

1. Split the dataset into train and test parts (e.g. with cross_validation.train_test_split).
2. Fit the probability-calibration model on the test part;
3. Fit the machine learning model on the train part;
4. Apply the calibration model to the predictions of the fitted machine learning model, adjusting its outputs.
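The held-out procedure above can be sketched with Platt's scaling; the synthetic scores, labels, and the plain gradient-descent fit below are assumptions for illustration:

```python
import numpy as np

# Platt's scaling sketch: fit a 1-D logistic regression
# p = sigmoid(a * score + b) on held-out model scores.
rng = np.random.default_rng(0)
scores = rng.uniform(-2, 2, 500)   # raw model outputs (synthetic)
# synthetic labels drawn from sigmoid(2 * score), so the true slope is 2
labels = (rng.uniform(size=500) < 1 / (1 + np.exp(-2 * scores))).astype(float)

a, b = 1.0, 0.0
for _ in range(2000):              # plain gradient descent on mean logloss
    p = 1 / (1 + np.exp(-(a * scores + b)))
    grad = p - labels              # d(logloss)/d(logit)
    a -= 0.1 * np.mean(grad * scores)
    b -= 0.1 * np.mean(grad)

calibrated = 1 / (1 + np.exp(-(a * scores + b)))  # calibrated probabilities
print(a, b)  # the fitted slope should land near the true value 2
```

In practice the same fit is done with sklearn's `CalibratedClassifierCV` or by stacking a logistic regression on held-out predictions.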