Evaluation Metrics

Classification Metrics

Classification is about predicting class labels given input data. Take binary classification for example: there are two possible classes, usually called positive and negative, so we can naively use accuracy to measure the performance. $$Accuracy = {n\_correct \over n\_total}$$
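As a tiny sketch of the formula (the helper name `accuracy` and the toy labels are just for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```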

[Figure: In this competition, you'll distinguish dogs from cats]

However, in reality the classes are not always balanced: the class counts may be skewed, and the cost of a wrong prediction may differ between classes (e.g. in medical diagnosis), so we need more terminology to evaluate the performance.


Confusion Matrix

|                     | Predicted as Positive | Predicted as Negative |
|---------------------|-----------------------|-----------------------|
| Labeled as Positive | True Positive (TP)    | False Negative (FN)   |
| Labeled as Negative | False Positive (FP)   | True Negative (TN)    |

Then the overall accuracy can be defined as $$Accuracy = {TP + TN \over TP + FN + FP + TN}$$ And with the confusion matrix in hand, we can introduce more specific measures: $$TPR = {TP \over TP + FN}$$ $$FPR = {FP \over FP + TN}$$ $$TNR = {TN \over FP + TN}$$ where TPR is the true positive rate, FPR is the false positive rate, and TNR is the true negative rate.
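A minimal sketch of these quantities, assuming the labels and predictions are 0/1 arrays (the helper name `confusion_rates` is made up for illustration):

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Compute the confusion-matrix cells and the derived rates for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
    }

print(confusion_rates([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# accuracy 0.6, TPR ≈ 0.67, FPR 0.5, TNR 0.5
```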

Log Loss

Log-loss, or logarithmic loss, gets into the finer details of a classifier. If the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used: $$log\_loss = -{1 \over N} \sum_{i=1}^N \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$ where \(p_i\) is the predicted probability that the \(i\)th data point belongs to class 1, and \(y_i\) is the true label, either 0 or 1. It is also called the cross-entropy loss.
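A small sketch of the formula, assuming NumPy is available (the clipping constant `eps` is an assumption to avoid taking \(\log 0\)):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss([1, 0, 1], [0.9, 0.2, 0.7]))  # ≈ 0.228
```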

AUC

Area Under the Curve (of the Receiver Operating Characteristic, ROC)

The ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate as the decision threshold varies.

The AUC summarizes the ROC curve in a single number, so that classifiers can be compared easily.

[Figure: ROC curve]
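A brief sketch of how the curve and its area can be computed, assuming scikit-learn is available (the toy scores below are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])  # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))
```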

Ranking Metrics

Ranking is closely related to binary classification. The ranker predicts how relevant each retrieved item is to the information the user asked for.

The search engine acts as a ranker. When the user types in a query, the search engine returns a ranked list of web pages that it considers to be relevant to the query. Conceptually, one can think of the task of ranking as first a binary classification of “relevant to the query” versus “irrelevant to the query,” followed by ordering the results so that the most relevant items appear at the top of the list.

Precision and recall are two separate metrics, but they are often used together.

[Figure: precision and recall]

$$Precision = {TP \over TP + FP}$$ $$Recall = {TP \over TP + FN}$$ $$F1 = {2 * precision * recall \over precision + recall}$$ However, precision and recall treat all retrieved items equally, regardless of where they appear in the ranked list.
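A minimal sketch of the three formulas for binary 0/1 labels (the helper name is hypothetical):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # ≈ (0.667, 0.667, 0.667)
```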

So to measure the effectiveness of web search engine algorithms or related applications, we often use DCG, which originates from an earlier, more primitive measure called cumulative gain.

CG (Cumulative Gain)

It is the sum of the graded relevance values of all results in a search result list: $$CG_p = {\sum_{i=1}^p rel_i}$$ where \(rel_i\) is the graded relevance of the result at position \(i\).
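As a tiny sketch (the graded relevance scores below are made up):

```python
def cumulative_gain(relevances, p):
    """Sum of the graded relevance scores of the top-p results."""
    return sum(relevances[:p])

print(cumulative_gain([3, 2, 3, 0, 1, 2], p=6))  # 11
```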

DCG (Discounted Cumulative Gain)

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: the graded relevance value is reduced logarithmically, proportional to the position of the result. $$DCG_p = {\sum_{i=1}^p {2^{rel_i} - 1 \over \log_2 (i + 1) }}$$
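A short sketch of this formulation, using the same toy relevance scores as above:

```python
import math

def dcg(relevances, p):
    """Discounted cumulative gain with the (2^rel - 1) / log2(i + 1) formulation."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))

print(dcg([3, 2, 3, 0, 1, 2], p=6))  # ≈ 13.85
```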

NDCG (Normalized Discounted Cumulative Gain)

To compare DCG across queries, it is normalized against the best achievable ranking: all relevant documents in the corpus are sorted by their relevance, producing the maximum possible DCG through position \(p\), also called the ideal DCG (IDCG). $$nDCG_p = {DCG_p \over IDCG_p}$$ where: $$IDCG_p = {\sum_{i=1}^{|REL|} {2^{rel_i} - 1 \over \log_2 (i + 1) }}$$ and \(REL\) is the list of relevant documents in the corpus, ordered by their relevance, up to position \(p\).
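A brief sketch that reuses the `dcg` helper above (assuming the relevance list contains the graded scores for every retrieved document):

```python
def ndcg(relevances, p):
    """Normalized DCG: the DCG of the list divided by the DCG of its ideal ordering."""
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances, p) / dcg(ideal, p)

print(ndcg([3, 2, 3, 0, 1, 2], p=6))  # ≈ 0.95
```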

Regression Metrics

A great example is stock prediction, where we try to predict the price of a stock on future days given its past price history and other information about the company and the market; this can be treated as a regression task.

RMSE (Root Mean Square Error)

$$RMSE = {\sqrt{\sum_i (y_i - \hat y_i)^2 \over n}}$$ where \(y_i\) denotes the true value for the \(i\)th data point, \(\hat y_i\) denotes the predicted value, and \(n\) is the number of data points.
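A minimal sketch, assuming NumPy is available (the toy values are purely illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.5]))  # ≈ 0.645
```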

MAPE (Median Absolute Percentage Error)

It deals with large outliers robustly because it looks at the median of the absolute percentage errors: $$MAPE = {median{(|(y_i - \hat y_i) / y_i|)}}$$ Inspired by this measurement, we can also decide how accurate we want the model to be by choosing a threshold on the absolute percentage error.
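A small sketch of this median-based variant, assuming NumPy is available (toy values only):

```python
import numpy as np

def mape(y_true, y_pred):
    """Median absolute percentage error (robust to large outliers)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.median(np.abs((y_true - y_pred) / y_true))

print(mape([100.0, 200.0, 50.0], [110.0, 150.0, 55.0]))  # 0.1
```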
