Sunday, September 20, 2020

A Brief Overview of Loss Functions in PyTorch

Translated from:

https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7


What are loss functions?


Training a neural network is similar to how humans learn. We give data to the model, it predicts something, and we tell it whether the prediction is correct or not. The model then corrects its mistakes. It does this repeatedly until it reaches a certain level of accuracy, decided by us. Telling the model that its prediction was wrong is crucial for it to learn well. This is where the loss function comes in: it tells the model how far off its estimate was from the actual value. While communicating this to a human is easy, to tell it to a machine we need a medium (pun intended). This communication needs a how and a what. The how is the programming language Python, and the what is the mathematics. In this post, I'll go through some hows, whats, and the intuition behind them.


For a quick recap of how neural networks train, have a look at this amazing post. My post is meant for people who are already familiar with deep learning. For the nitty-gritty details, refer to the PyTorch docs.


Mean Absolute Error

torch.nn.L1Loss


Measures the mean absolute error.


loss(x, y) = (1/n) Σ |x_i - y_i|


where x is the actual value and y is the predicted value.


What does it mean?


It measures the numerical distance between the estimated and the actual value. It is the simplest form of error metric. The absolute value of the error is taken because otherwise negative errors would cancel out positive ones, which would make the metric unreliable rather than useful.
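As a quick illustration (not part of the original article), here is a minimal sketch of how torch.nn.L1Loss might be used; the tensors are made-up toy values standing in for predictions and targets:

import torch
import torch.nn as nn

# Toy values; in practice these would be model outputs and ground-truth targets.
pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

mae = nn.L1Loss()            # reduction='mean' by default
loss = mae(pred, target)     # mean of |pred - target|
print(loss.item())           # 0.5 for these toy values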


The lower the value of MAE, the better the model. We cannot expect its value to be zero, and chasing that is usually not practically useful; it just wastes resources. For example, if our model's loss is within 5%, that is often good enough in practice, and making it more precise may not really help.


When to use it?


+ Regression problems

+ Simplistic models

+ As neural networks are usually used for complex problems, this function is rarely used.


Mean Square Error Loss

torch.nn.MSELoss


It measures the mean squared error (squared L2 norm).


loss(x, y) = (1/n) Σ (x_i - y_i)^2


where x is the actual value and y is the predicted value.


What does it mean?


Squaring the difference between the prediction and the actual value means we are amplifying large losses. If the classifier is off by 200, the error is 40,000; if it is off by 0.1, the error is 0.01. This penalizes the model heavily when it makes large mistakes and incentivizes keeping errors small.
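To make the squaring effect concrete, here is a small sketch (not from the original post) using torch.nn.MSELoss on the same kind of toy tensors:

import torch
import torch.nn as nn

# Toy values; a difference of 1.0 contributes 1.0, a difference of 0.5 only 0.25.
pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

mse = nn.MSELoss()           # reduction='mean' by default
loss = mse(pred, target)     # mean of (pred - target)^2
print(loss.item())           # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375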


When to use it?


+ Regression problems.

+ The numerical value features are not large.

+ The problem is not very high-dimensional.


Smooth L1 Loss

torch.nn.SmoothL1Loss


Also known as the Huber loss, it is given by:


loss(x, y) = (1/n) Σ z_i, where z_i = 0.5 * (x_i - y_i)^2 if |x_i - y_i| < 1, and |x_i - y_i| - 0.5 otherwise


What does it mean?


It uses a squared term if the absolute error falls below 1 and an absolute term otherwise. It is less sensitive to outliers than the mean square error loss and in some cases prevents exploding gradients. In the mean square error loss, squaring the difference produces a number much larger than the original, and these large values can lead to exploding gradients. That is avoided here because differences greater than 1 are not squared.
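A minimal sketch (not in the original article) showing both branches of torch.nn.SmoothL1Loss, with toy values chosen so that one difference falls below the threshold of 1 and one above it:

import torch
import torch.nn as nn

# Made-up values: |0.5 - 0.0| < 1 uses the squared branch, |3.0 - 0.0| >= 1 the absolute branch.
pred = torch.tensor([0.5, 3.0])
target = torch.tensor([0.0, 0.0])

huber = nn.SmoothL1Loss()
loss = huber(pred, target)
# per-element terms: 0.5 * 0.5^2 = 0.125  (squared branch)
#                    3.0 - 0.5   = 2.5    (absolute branch)
print(loss.item())           # mean = (0.125 + 2.5) / 2 = 1.3125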


When to use it?


+ Regression.

+ When the features have large values.

+ Well suited for most problems.


Negative Log-Likelihood Loss

torch.nn.NLLLoss

The negative log-likelihood loss:


loss = -Σ log(p(y_i))


What does it mean?

It maximizes the overall probability of the data. It penalizes the model when it predicts the correct class with a small probability and rewards it when the prediction is made with a high probability. The logarithm does the penalizing part here: the smaller the probability, the larger the magnitude of its logarithm. The negative sign is used because probabilities lie in the range [0, 1] and the logarithms of values in this range are negative, so it makes the loss value positive.
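As an illustration (not from the original article), torch.nn.NLLLoss expects log-probabilities, so it is usually paired with a LogSoftmax layer; the tensors below are random placeholders:

import torch
import torch.nn as nn

logits = torch.randn(3, 5)                 # 3 samples, 5 classes (random placeholder)
log_probs = nn.LogSoftmax(dim=1)(logits)   # NLLLoss expects log-probabilities
targets = torch.tensor([1, 0, 4])          # class indices

nll = nn.NLLLoss()
loss = nll(log_probs, targets)             # mean of -log_probs[i, targets[i]]
print(loss.item())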


When to use it?


+ Classification.

+ Smaller, quicker training.

+ Simple tasks.


Cross-Entropy Loss

torch.nn.CrossEntropyLoss


Measures the cross-entropy between the predicted and the actual value.


H(x, y) = -Σ x_i log(y_i)


where x is the probability of the true label and y is the probability of the predicted label.


What does it mean?


Cross-entropy as a loss function is used to learn the probability distribution of the data. While other loss functions such as the squared loss penalize wrong predictions, cross-entropy gives a greater penalty when incorrect predictions are made with high confidence. What differentiates it from the negative log-likelihood loss is that cross-entropy also penalizes predictions that are wrong but confident, and correct but less confident, whereas the negative log-likelihood loss does not penalize according to the confidence of the predictions.
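A short sketch (not part of the original post): torch.nn.CrossEntropyLoss combines LogSoftmax and NLLLoss, so it is fed raw logits and class indices; the tensors here are random placeholders:

import torch
import torch.nn as nn

logits = torch.randn(3, 5)             # raw, unnormalized scores for 3 samples, 5 classes
targets = torch.tensor([1, 0, 4])      # class indices

ce = nn.CrossEntropyLoss()
loss = ce(logits, targets)             # equivalent to LogSoftmax followed by NLLLoss
print(loss.item())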


When to use it?


+ Classification tasks.

+ For making a confident model, i.e. the model will not only predict accurately, but it will also do so with higher probability.

+ For higher precision/recall values.


Kullback-Leibler divergence

torch.nn.KLDivLoss


KL divergence gives a measure of how different two probability distributions are from each other.


KL(x || y) = Σ x_i log(x_i / y_i)


where x is the probability of the true label and y is the probability of the predicted label.


What does it mean?


It is quite similar to the cross-entropy loss. The distinction is that it works directly on the difference between the predicted and the actual probability distributions, giving a measure of the information lost when the prediction is used in place of the truth. The farther the predicted probability distribution is from the true probability distribution, the greater the loss. It does not penalize the model based on the confidence of its predictions, as the cross-entropy loss does, but on how different the prediction is from the ground truth. It usually outperforms the mean square error, especially when the data is not normally distributed. The reason cross-entropy is more widely used is that the KL divergence can be broken down into an entropy term and a cross-entropy term, so minimizing the cross-entropy is the same as minimizing the KL divergence.


KL(x || y) = Σ x log(x / y) = Σ x log(x) - Σ x log(y) = Cross-entropy(x, y) - Entropy(x)
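A minimal sketch (not in the original article) of torch.nn.KLDivLoss: PyTorch expects the prediction as log-probabilities and the target as probabilities, and reduction='batchmean' matches the mathematical definition. The tensors are random placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

pred_log_probs = F.log_softmax(torch.randn(3, 5), dim=1)   # predicted distribution (log-probabilities)
target_probs = F.softmax(torch.randn(3, 5), dim=1)         # true distribution (probabilities)

kld = nn.KLDivLoss(reduction='batchmean')
loss = kld(pred_log_probs, target_probs)
print(loss.item())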


When to use it?


+ Classification.

+ The same can be achieved with cross-entropy with less computation, so avoid it.


Margin Ranking Loss

torch.nn.MarginRankingLoss


It measures the loss given inputs x1, x2, and a label tensor y with values (1 or -1). If y == 1 then it is assumed that the first input should be ranked higher than the second input, and vice-versa for y == -1.


loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)


What does it mean?


The prediction y of the classifier is based on the ranking of the inputs x1 and x2. Assuming the margin has its default value of 0, if y and (x1 - x2) have the same sign, the loss is zero. This means that x1/x2 was ranked higher (for y = 1/-1), as expected by the data. If y and (x1 - x2) have opposite signs, the loss is the non-zero value -y * (x1 - x2). This means that x2 was ranked higher when x1 should have been, or vice versa. Its usage in PyTorch is somewhat unclear, as fewer open-source implementations and examples are available compared to other loss functions; a toy sketch follows.
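Since examples are scarce, here is a small hypothetical sketch (not from the original post) of torch.nn.MarginRankingLoss with made-up scores:

import torch
import torch.nn as nn

# y = 1 means x1 should be ranked higher than x2; y = -1 means the opposite.
x1 = torch.tensor([0.8, 0.2, 0.5])
x2 = torch.tensor([0.4, 0.6, 0.9])
y = torch.tensor([1.0, 1.0, -1.0])

ranker = nn.MarginRankingLoss(margin=0.0)
loss = ranker(x1, x2, y)     # mean of max(0, -y * (x1 - x2) + margin)
print(loss.item())           # only the second pair is mis-ranked, contributing 0.4 before averaging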


When to use it?


+ GANs.

+ Ranking tasks.


Hinge Embedding Loss

torch.nn.HingeEmbeddingLoss


Measures the loss given an input tensor x and a label tensor y containing values (1 or -1). It is used for measuring whether two inputs are similar or dissimilar.


l_i = x_i if y_i = 1, and l_i = max(0, margin - x_i) if y_i = -1


What does it mean?


The prediction y of the classifier is based on the value of the input x. Assuming the margin has its default value of 1: when y = 1, the loss is simply x itself, so larger values of x are penalized more. When y = -1, the loss is the maximum of 0 and (1 - x): if x > 1, the loss is 0; if x < 1, the loss is 1 - x, which grows as x shrinks.

/* A summary found online, which makes this clearer:

It takes x and y (1 or -1), and margin defaults to 1.

  1. When y = -1, loss = max(0, 1 - x): if x > 1 (the margin), loss = 0; if x < 1, loss = 1 - x.

  2. When y = 1, loss = x. */
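A toy sketch (not in the original article) of torch.nn.HingeEmbeddingLoss, where x would typically be a distance between two samples and y marks pairs as similar (1) or dissimilar (-1); the values are made up:

import torch
import torch.nn as nn

x = torch.tensor([0.3, 1.5, 0.7])     # e.g. pairwise distances (made-up values)
y = torch.tensor([1.0, -1.0, -1.0])   # 1 = similar pair, -1 = dissimilar pair

hinge = nn.HingeEmbeddingLoss(margin=1.0)
loss = hinge(x, y)           # x_n when y = 1, max(0, margin - x_n) when y = -1
print(loss.item())           # (0.3 + 0.0 + 0.3) / 3 = 0.2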

When to use it?


+ Learning nonlinear embeddings

+ Semi-supervised learning

+ Where the similarity or dissimilarity of two inputs is to be measured.


Cosine Embedding Loss

 

torch.nn.CosineEmbeddingLoss

It measures the loss given inputs x1, x2, and a label tensor y containing values (1 or -1). It is used for measuring whether two inputs are similar or dissimilar.


loss(x1, x2, y) = 1 - cos(x1, x2) if y = 1, and max(0, cos(x1, x2) - margin) if y = -1


What does it mean?


The prediction y of the classifier is based on the cosine distance of the inputs x1 and x2. Cosine distance refers to the angle between the two vectors. It can easily be computed using the dot product:


cos(θ) = (x1 · x2) / (||x1|| ||x2||)


As the cosine lies between -1 and +1, the loss values are small, which helps with computation. Assuming the margin has its default value of 0: if y = 1, the loss is 1 - cos(x1, x2). For y = -1, the loss is the maximum of 0 and cos(x1, x2): if cos(x1, x2) > 0, the loss is cos(x1, x2) itself (a higher value), and if cos(x1, x2) < 0, the loss is 0 (the minimum value).
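A minimal sketch (not from the original post) of torch.nn.CosineEmbeddingLoss on random placeholder embeddings:

import torch
import torch.nn as nn

x1 = torch.randn(4, 16)                      # embeddings (random placeholders)
x2 = torch.randn(4, 16)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])     # 1 = similar pair, -1 = dissimilar pair

cos_loss = nn.CosineEmbeddingLoss(margin=0.0)
loss = cos_loss(x1, x2, y)   # 1 - cos(x1, x2) for y = 1, max(0, cos(x1, x2)) for y = -1
print(loss.item())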


When to use it?


+ Learning nonlinear embeddings

+ Semi-supervised learning

+ Where the similarity or dissimilarity of two inputs is to be measured.

