Saturday, May 15, 2021

Kaggle Competition - Titanic (Top 20% Ranking)

If you don't want to become a "hyperparameter-tuning jockey", forever fiddling with parameters, just let AutoGluon train the model.

Implementation: https://github.com/jackliaoall/deep-learning_exercises/tree/master/Titanic


The steps are as follows:

(1) Install AutoGluon first. If you are not sure how to install it, see the previous post, "Installing and Importing AutoGluon in Google Colaboratory".


(2) Download the Titanic dataset to Colab (one way to do this is sketched below).
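A possible way to fetch the competition files from inside Colab (an assumption - the linked notebook may do this differently) is the Kaggle CLI, which needs a kaggle.json API token uploaded to the session first:

!pip install -q kaggle
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c titanic
!unzip -o titanic.zip    # extracts train.csv, test.csv and gender_submission.csv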



(3) Train the model and save the predictions as submission.csv (a sketch of this step follows).
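A minimal sketch of this step with AutoGluon's tabular API (the column names follow the Kaggle Titanic files; the exact settings in the linked notebook may differ):

import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('train.csv')
test_data = TabularDataset('test.csv')

# Train with default settings - no manual hyperparameter tuning
predictor = TabularPredictor(label='Survived').fit(train_data)

# Predict on the test set and write the file Kaggle expects
preds = predictor.predict(test_data)
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': preds})
submission.to_csv('submission.csv', index=False)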



(4) Upload the result to Kaggle. The submission ranked 6362nd, which is within the top 20% - not a bad result for a model trained without any hyperparameter tuning.
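If the Kaggle CLI is configured as above (an assumption - submitting through the competition website works just as well), the file can also be submitted straight from Colab:

!kaggle competitions submit -c titanic -f submission.csv -m "AutoGluon, default settings"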




Wednesday, May 12, 2021

Installing and Importing AutoGluon in Google Colaboratory

Reference: https://qiita.com/daikikatsuragawa/items/319e09b5e1472ba4c4bb


To use AutoGluon in Colab, the required packages must be installed first; if some of them are missing, importing AutoGluon fails with an error message.


The correct steps are as follows:
(1) Install the required packages with pip
!pip install --upgrade pip
!pip install --upgrade setuptools
!pip install --upgrade "mxnet<2.0.0"
!pip install --pre autogluon


(2) Click RESTART RUNTIME to restart the Colab environment.




(3) If the following libraries import without any error message, the installation succeeded.
import numpy as np
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor
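As a further quick check (assuming some CSV file, e.g. the Titanic train.csv, has already been uploaded to the session; the file name here is only an example), TabularDataset behaves like a pandas DataFrame:

df = TabularDataset('train.csv')   # hypothetical path - any CSV works
print(df.shape)
df.head()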







Thursday, November 5, 2020

Loss Functions and Optimization Algorithms - Demystified

Translated from:

https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c

 

The choice of optimisation algorithm and loss function for a deep learning model can play a big role in producing optimum and faster results. Before we begin, let us see how the different components of a deep learning model affect its result through the simple example of a perceptron.


Perceptron

If you are not familiar with the term perceptron, it refers to a particular supervised learning model, outlined by Rosenblatt in 1957. The architecture and behavior of a perceptron is very similar to that of biological neurons, and it is often considered the most basic form of neural network. Other kinds of neural networks were developed after the perceptron, and their diversity and applications continue to grow. It is easier to explain the constituents of a neural network using the example of a single layer perceptron.

A single layer perceptron works as a linear binary classifier. Consider a feature vector [x1, x2, x3] that is used to predict the probability (p) of occurrence of a certain event.


Weighting factors: Each input in the feature vector is assigned its own relative weight (w), which decides the impact that the particular input has in the summation function. In relatively simple terms, some inputs are made more important than others by giving them more weight, so that they have a greater effect in the summation function (y). A bias (w0) is also added to the summation.

 

Activation function: The result of the summation function, that is the weighted sum, is transformed into the desired output by employing a non-linear function (fNL), also known as an activation function. Since the desired output in this case is the probability of an event, a sigmoid function can be used to restrict the result (y) between 0 and 1.

 



Sigmoid Function
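A minimal numpy sketch of the single-layer perceptron forward pass described above (the weights and inputs are made-up numbers, purely for illustration):

import numpy as np

def sigmoid(z):
    # squashes the weighted sum into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # feature vector [x1, x2, x3]
w = np.array([0.8, 0.1, -0.4])   # relative weights
w0 = 0.2                         # bias

y = sigmoid(np.dot(w, x) + w0)   # weighted sum -> activation -> probability
print(y)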

Other commonly used activation functions are the Rectified Linear Unit (ReLU), the hyperbolic tangent (tanh) and the identity function.

 

Error and Loss Function: In most learning networks, the error is calculated as the difference between the actual output and the predicted output.

 

The function that is used to compute this error is known as the loss function J(.). Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. One of the most widely used loss functions is the mean square error, which calculates the square of the difference between the actual value and the predicted value. Different loss functions are used to deal with different types of tasks, i.e. regression and classification.


Back Propagation and Optimisation Function: The error J(w) is a function of the internal parameters of the model, i.e. the weights and bias. For accurate predictions, one needs to minimize the calculated error. In a neural network, this is done using back propagation: the current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in such a way that the error is minimized. The weights are modified using a function called the optimization function.

 



Optimisation functions usually calculate the gradient, i.e. the partial derivative of the loss function with respect to the weights, and the weights are modified in the direction opposite to the calculated gradient. This cycle is repeated until we reach the minimum of the loss function.


Thus, the components of a neural network model, i.e. the activation function, loss function and optimization algorithm, play a very important role in efficiently and effectively training a model and producing accurate results. Different tasks require a different set of such functions to give the most optimum results.


Loss Functions:

Thus, loss functions are helpful in training a neural network. Given an input and a target, they calculate the loss, i.e. the difference between the output and the target variable. Loss functions fall under four major categories:

 

Regressive loss functions:

They are used in the case of regressive problems, that is, when the target variable is continuous. The most widely used regressive loss function is Mean Square Error. Other loss functions are:
1. Absolute Error - measures the mean absolute value of the element-wise difference between inputs;
2. Smooth Absolute Error - a smooth version of the Absolute Error criterion.

 

Classification loss functions:

The output variable in a classification problem is usually a probability value f(x), called the score for the input x. Generally, the magnitude of the score represents the confidence of our prediction. The target variable y is a binary variable, 1 for true and -1 for false.
On an example (x, y), the margin is defined as y·f(x). The margin is a measure of how correct we are. Most classification losses mainly aim to maximize the margin. Some classification losses are:
1. Binary Cross Entropy
2. Negative Log Likelihood
3. Margin Classifier
4. Soft Margin Classifier
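A small numpy sketch of the margin idea (the scores and labels are made-up): a hinge-style margin loss is zero for confidently correct predictions and grows as the margin becomes negative.

import numpy as np

f_x = np.array([2.3, -0.7, 0.4])   # classifier scores f(x)
y   = np.array([1, -1, -1])        # true labels in {1, -1}

margin = y * f_x                   # how correct (and how confident) each prediction is
hinge_loss = np.maximum(0.0, 1.0 - margin)
print(margin)       # [ 2.3  0.7 -0.4]
print(hinge_loss)   # [ 0.   0.3  1.4]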

 

Embedding loss functions:

These deal with problems where we have to measure whether two inputs are similar or dissimilar. Some examples are:
1. L1 Hinge Error - calculates the L1 distance between two inputs.
2. Cosine Error - the cosine distance between two inputs.


Visualising Loss Functions:

We performed the task of reconstructing an image using a type of neural network called an autoencoder. Different results were obtained for the same task by using different loss functions, while everything else in the neural network architecture remained constant. Thus, the difference in results reflects the properties of the different loss functions employed. A very simple data set, the MNIST data set, was used for this purpose. Three loss functions were used to reconstruct the images:

      Absolute Loss Function

      Mean Square Loss Function

      Smooth Absolute Loss Function

While the Absolute error just calculated the mean absolute value of the pixel-wise difference, Mean Square error uses the mean squared error. Thus it was more sensitive to outliers and pushed pixel values towards 1 (in our case, white, as can be seen in the image after the first epoch itself).

 

Smooth L1 error can be thought of as a smooth version of the Absolute error. It uses a squared term if the absolute element-wise error falls below 1 and the L1 distance otherwise. It is less sensitive to outliers than the Mean Squared Error and in some cases prevents exploding gradients.

 

Optimisation Algorithms

Optimisation algorithms are used to update the weights and biases, i.e. the internal parameters of a model, in order to reduce the error. They can be divided into two categories:

 

Constant Learning Rate Algorithms:

The most widely used optimisation algorithm, Stochastic Gradient Descent, falls under this category.

 


Here η is called the learning rate, a hyperparameter that has to be tuned. Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, i.e. tiny steps towards the optimal parameter values that minimize the loss, which directly inflates the overall training time. A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.

 

A similar hyperparameter is momentum, which determines the velocity with which the learning rate has to be increased as we approach the minima.
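A bare-bones sketch of the SGD update rule described above, with an optional momentum term (the gradient function and all numbers are placeholders):

import numpy as np

def grad(w):
    # placeholder gradient of the loss J(w); here J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0          # initial weight
lr = 0.1         # learning rate (eta)
momentum = 0.9
velocity = 0.0

for step in range(100):
    velocity = momentum * velocity - lr * grad(w)  # accumulate velocity
    w = w + velocity                               # move opposite to the gradient
print(w)   # converges towards the minimum at w = 3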

 

Adaptive Learning Algorithms:

The challenge of using gradient descent is that its hyperparameters have to be defined in advance, and they depend heavily on the type of model and problem. Another problem is that the same learning rate is applied to all parameter updates. If we have sparse data, we may instead want to update the parameters to different extents.

 

Adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop and Adam provide an alternative to classical SGD. They have per-parameter learning rate methods, which provide a heuristic approach without requiring expensive manual tuning of the hyperparameters for the learning rate schedule.
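A rough sketch of the per-parameter idea using Adagrad's update rule (the gradient function and values are placeholders): each parameter accumulates its own history of squared gradients, so frequently updated parameters automatically get a smaller effective learning rate.

import numpy as np

def grad(w):
    # placeholder gradient for a toy quadratic loss
    return 2.0 * (w - np.array([1.0, -2.0]))

w = np.zeros(2)
lr = 0.5
eps = 1e-8
cache = np.zeros(2)          # running sum of squared gradients, one entry per parameter

for step in range(500):
    g = grad(w)
    cache += g ** 2                          # per-parameter gradient history
    w -= lr * g / (np.sqrt(cache) + eps)     # per-parameter effective learning rate
print(w)   # approaches [1, -2]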

 

Working with Optimisation Functions:

We used three first-order optimisation functions and studied their effect:

      Stochastic Gradient Descent

      Adagrad

      Adam

 

Gradient Descent calculates the gradient for the whole dataset and updates the values in the direction opposite to the gradients until we find a local minimum. Stochastic Gradient Descent performs a parameter update for each training example, unlike normal Gradient Descent, which performs only one update per pass. Thus it is much faster. Gradient descent algorithms can further be improved by tuning important parameters like momentum, learning rate, etc.

 

Adagrad is preferable for a sparse data set, as it makes big updates for infrequent parameters and small updates for frequent parameters. It uses a different learning rate for every parameter θ at each time step, based on the past gradients that were computed for that parameter. Thus we do not need to manually tune the learning rate.


Adam stands for Adaptive Moment Estimation. It also calculates different learning rates. Adam works well in practice, is faster, and outperforms other techniques.

 

Stochastic Gradient Descent was much faster than the other algorithms, but the results produced were far from optimum. Both Adagrad and Adam produced better results than SGD, but they were computationally more expensive. Adam was slightly faster than Adagrad. Thus, while choosing a particular optimization function, one has to make a trade-off between more computation power and more optimum results.
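In modern PyTorch (rather than the Torch7 used in the original project), swapping between these optimisers is a one-line change; a schematic sketch with a made-up model and random data:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # toy model
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = nn.MSELoss()

# Any of these can be dropped in; only the constructor call changes.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()       # back-propagate the error
    optimizer.step()      # update weights in the direction opposite to the gradient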

 

Torch:

We worked with Torch7 to complete this project, which is a Lua-based predecessor of PyTorch.

The GitHub repo with the complete project and code is: https://github.com/dsgiitr/Visualizing-Loss-Functions

Sunday, September 20, 2020

A Brief Overview of Loss Functions in PyTorch

Translated from:

https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7


What are loss functions?


Training a neural network is similar to how humans learn. We give data to the model, it predicts something, and we tell it whether the prediction is correct or not. The model then corrects its mistakes. The model does this repeatedly until it reaches a certain level of accuracy, decided by us. Telling the model that its prediction was wrong is crucial for it to learn well. This is where the loss function comes in. It tells the model how far off its estimate was from the actual value. While communicating with a human is easier, to tell a machine we need a medium (pun intended). This communication needs a how and a what. The How is the programming language Python and the What is the mathematics. In this post, I'll go through some Hows, Whats and the intuition behind them.


For a quick recap of how neural networks train, have a look at this amazing post. My post is meant for people who are familiar with deep learning. For nitty-gritty details, refer to the PyTorch docs.


Mean Absolute Error

torch.nn.L1Loss


Measures the mean absolute error:

loss = mean(|x - y|)

where x is the actual value and y is the predicted value.


What does it mean?


It measures the numerical distance between the estimated and actual value. It is the simplest form of error metric. The absolute value of the error is taken because if we don't, negatives will cancel out positives. This isn't useful to us; rather, it makes the metric more unreliable.

The lower the value of MAE, the better the model. We cannot expect its value to be zero, because that might not be practically useful and would lead to a waste of resources. For example, if our model's loss is within 5% then it is alright in practice, and making it more precise may not really be useful.
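A minimal usage sketch (random tensors stand in for real predictions and targets):

import torch
import torch.nn as nn

loss_fn = nn.L1Loss()
predicted = torch.randn(8, requires_grad=True)
actual = torch.randn(8)

loss = loss_fn(predicted, actual)   # mean of |predicted - actual|
loss.backward()
print(loss.item())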


When to use it?


+ Regression problems

+ Simplistic models

+ As neural networks are usually used for complex problems, this function is rarely used.


Mean Square Error Loss

torch.nn.MSELoss


It measures the mean squared error (squared L2 norm):

loss = mean((x - y)^2)

where x is the actual value and y is the predicted value.


What does it mean?


Squaring the difference between the prediction and the actual value means that we are amplifying large losses. If the classifier is off by 200, the error is 40000, and if the classifier is off by 0.1, the error is 0.01. This penalizes the model when it makes large mistakes and incentivizes small errors.
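The same pattern with nn.MSELoss; note how a single large error dominates the result (made-up numbers):

import torch
import torch.nn as nn

predicted = torch.tensor([2.0, 3.0, 200.0])
actual = torch.tensor([2.0, 3.0, 0.0])

print(nn.MSELoss()(predicted, actual).item())   # (200^2)/3 ≈ 13333.3
print(nn.L1Loss()(predicted, actual).item())    # 200/3 ≈ 66.7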


When to use it?


+ Regression problems.

+ The numerical value features are not large.

+ The problem is not very high dimensional.


Smooth L1 Loss

torch.nn.SmoothL1Loss


Also known as Huber loss, it is given by:

loss = 0.5 * (x - y)^2      if |x - y| < 1
     = |x - y| - 0.5        otherwise


What does it mean?


It uses a squared term if the absolute error falls below 1 and an absolute term otherwise. It is less sensitive to outliers than the mean square error loss and in some cases prevents exploding gradients. In mean square error loss, we square the difference, which results in a number much larger than the original number. These high values result in exploding gradients. This is avoided here because for differences greater than 1, the values are not squared.
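A small comparison sketch showing the outlier behaviour described above (made-up values):

import torch
import torch.nn as nn

predicted = torch.tensor([0.5, 10.0])
actual = torch.tensor([0.0, 0.0])

print(nn.MSELoss()(predicted, actual).item())       # (0.25 + 100) / 2 = 50.125
print(nn.SmoothL1Loss()(predicted, actual).item())  # (0.5*0.25 + (10 - 0.5)) / 2 = 4.8125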


When to use it?


+ Regression.

+ When the features have large values.

+ Well suited for most problems.


Negative Log-Likelihood Loss

torch.nn.NLLLoss

The negative log-likelihood loss:

loss(x, class) = -x[class]

where x contains the predicted log-probabilities for each class.


What does it mean?

It maximizes the overall probability of the data. It penalizes the model when it predicts the correct class with a small probability and incentivizes it when the prediction is made with a high probability. The logarithm does the penalizing part here: the smaller the probability, the larger its negative logarithm. The negative sign is used because the probabilities lie in the range [0, 1] and the logarithms of values in this range are negative, so it makes the loss value positive.
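A usage sketch; note that nn.NLLLoss expects log-probabilities, so it is normally paired with nn.LogSoftmax (the shapes and values here are made up):

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss_fn = nn.NLLLoss()

logits = torch.randn(4, 3, requires_grad=True)  # 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])             # true class indices

loss = loss_fn(m(logits), target)   # mean of -log p(true class)
loss.backward()
print(loss.item())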


When to use it?


+ Classification.

+ Smaller, quicker training.

+ Simple tasks.


Cross-Entropy Loss

torch.nn.CrossEntropyLoss


Measures the cross-entropy between the predicted and the actual value:

loss = -sum(x * log(y))

where x is the probability of the true label and y is the probability of the predicted label.


What does it mean?


Cross-entropy as a loss function is used to learn the probability distribution of the data. While other loss functions like squared loss penalize wrong predictions, cross-entropy gives a greater penalty when incorrect predictions are made with high confidence. What differentiates it from negative log loss is that cross-entropy also penalizes wrong but confident predictions, as well as correct but less confident predictions, while negative log loss does not penalize according to the confidence of the predictions.
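A usage sketch; nn.CrossEntropyLoss combines LogSoftmax and NLLLoss, so it takes raw, unnormalised scores (logits) directly:

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(4, 3, requires_grad=True)  # raw scores, no softmax applied
target = torch.tensor([0, 2, 1, 2])

loss = loss_fn(logits, target)
loss.backward()
print(loss.item())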


When to use it?


+ Classification tasks

+ For making a confident model, i.e. the model will not only predict accurately, but will also do so with higher probability.

+ For higher precision/recall values.


Kullback-Leibler divergence

torch.nn.KLDivLoss


KL divergence gives a measure of how different two probability distributions are from each other:

loss = sum(x * log(x / y))

where x is the probability of the true label and y is the probability of the predicted label.


What does it mean?


It is quite similar to cross-entropy loss. The distinction is that it works on the difference between the predicted and the actual probability distribution, which captures how much information is lost during model training. The farther the predicted probability distribution is from the true probability distribution, the greater the loss. It does not penalize the model based on the confidence of the prediction, as cross-entropy loss does, but on how different the prediction is from the ground truth. It usually outperforms mean square error, especially when the data is not normally distributed. The reason cross-entropy is more widely used is that KL divergence can be broken down as a function of cross-entropy. Minimizing the cross-entropy is the same as minimizing the KL divergence.

KL = -x*log(y/x) = x*log(x) - x*log(y) = Cross-entropy - Entropy
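A usage sketch; nn.KLDivLoss expects the prediction as log-probabilities and the target as probabilities (reduction='batchmean' matches the mathematical definition):

import torch
import torch.nn as nn
import torch.nn.functional as F

loss_fn = nn.KLDivLoss(reduction='batchmean')

pred_log_probs = F.log_softmax(torch.randn(4, 3, requires_grad=True), dim=1)
true_probs = F.softmax(torch.randn(4, 3), dim=1)

loss = loss_fn(pred_log_probs, true_probs)
loss.backward()
print(loss.item())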


When to use it?


+ Classification

+ The same can be achieved with cross-entropy with less computation, so it is usually avoided.


Margin Ranking Loss

torch.nn.MarginRankingLoss


It measures the loss given inputs x1, x2, and a label tensor y with values (1 or -1). If y == 1 it is assumed the first input should be ranked higher than the second input, and vice-versa for y == -1.


What does it mean?


The prediction y of the classifier is based on the ranking of the inputs x1 and x2. Assuming the margin has the default value of 0, if y and (x1 - x2) are of the same sign, the loss will be zero. This means that x1/x2 was ranked higher (for y = 1/-1), as expected by the data. If y and (x1 - x2) are of opposite signs, the loss will be the non-zero value -y * (x1 - x2). This means that either x2 was ranked higher when x1 should have been ranked higher, or vice versa. Its usage in PyTorch is somewhat unclear, as fewer open-source implementations and examples are available compared to other loss functions.
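A usage sketch with made-up scores; y = 1 means x1 should be ranked above x2:

import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=0.0)

x1 = torch.tensor([0.8, 0.2, 1.5], requires_grad=True)
x2 = torch.tensor([0.4, 0.9, 1.0], requires_grad=True)
y  = torch.tensor([1.0, 1.0, -1.0])   # which input should be ranked higher

loss = loss_fn(x1, x2, y)   # mean of max(0, -y * (x1 - x2) + margin)
loss.backward()
print(loss.item())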


When to use it?


+ GANs.

+ Ranking tasks.


Hinge Embedding Loss

torch.nn.HingeEmbeddingLoss


Measures the loss given an input tensor x and a label tensor y containing values (1 or -1). It is used for measuring whether two inputs are similar or dissimilar.


What does it mean?


The prediction y of the classifier is based on the value of the input x. Assuming the margin has the default value of 1, if y = -1 the loss will be the maximum of 0 and (1 - x): if x > 1 the loss is 0, and if x < 1 the loss is 1 - x. For y = 1, the loss is simply the value of x.

/* A clearer summary found online:
It takes an input x and a label y (1 or -1); the margin defaults to 1.
  1. When y = -1, loss = max(0, 1 - x): if x > 1 (the margin), loss = 0; if x < 1, loss = 1 - x.
  2. When y = 1, loss = x. */
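A usage sketch; x is typically a distance between two samples, and y says whether they should be similar (1) or dissimilar (-1):

import torch
import torch.nn as nn

loss_fn = nn.HingeEmbeddingLoss(margin=1.0)

x = torch.tensor([0.3, 2.0, 0.5], requires_grad=True)  # e.g. pairwise distances
y = torch.tensor([1.0, -1.0, -1.0])

loss = loss_fn(x, y)   # mean of: x where y=1, max(0, margin - x) where y=-1
loss.backward()
print(loss.item())     # (0.3 + 0.0 + 0.5) / 3 ≈ 0.2667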

When to use it?


+ Learning nonlinear embeddings

+ Semi-supervised learning

+ Where the similarity or dissimilarity of two inputs is to be measured.


Cosine Embedding Loss

 

torch.nn.CosineEmbeddingLoss

It measures the loss given inputs x1, x2, and a label tensor y containing values (1 or -1). It is used for measuring whether two inputs are similar or dissimilar.


What does it mean?


The prediction y of the classifier is based on the cosine distance of the inputs x1 and x2. Cosine distance refers to the angle between two points. It can easily be computed using the dot product:

cos(x1, x2) = (x1 · x2) / (||x1|| * ||x2||)


As the cosine lies between -1 and +1, the loss values are small, which aids computation. Assuming the margin has the default value of 0, if y = 1 the loss is (1 - cos(x1, x2)). For y = -1, the loss will be the maximum of 0 and cos(x1, x2): if cos(x1, x2) > 0 the loss will be cos(x1, x2) itself (a higher value), and if cos(x1, x2) < 0 the loss will be 0 (the minimum value).
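A usage sketch with two small batches of made-up embedding vectors; y = 1 asks the pair to be similar, y = -1 dissimilar:

import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.0)

x1 = torch.randn(4, 8, requires_grad=True)   # 4 embedding pairs of dimension 8
x2 = torch.randn(4, 8, requires_grad=True)
y  = torch.tensor([1.0, 1.0, -1.0, -1.0])

loss = loss_fn(x1, x2, y)   # 1 - cos(x1, x2) for y=1, max(0, cos(x1, x2)) for y=-1
loss.backward()
print(loss.item())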


When to use it?


+ Learning nonlinear embeddings

+ Semi-supervised learning

+ Where the similarity or dissimilarity of two inputs is to be measured.

