Thursday, November 5, 2020

Loss Functions and Optimization Algorithms - Demystified

Translated from:

https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c

 

The choice of optimisation algorithm and loss function for a deep learning model can play a big role in producing optimal results faster. Before we begin, let us see how the different components of a deep learning model affect its result through the simple example of a perceptron.



Perceptron



If you are not familiar with the term perceptron, it refers to a particular supervised learning model, outlined by Rosenblatt in 1957. The architecture and behavior of a perceptron are very similar to those of a biological neuron, and it is often considered the most basic form of neural network. Other kinds of neural networks were developed after the perceptron, and their diversity and applications continue to grow. It is easiest to explain the constituents of a neural network using the example of a single layer perceptron.


A single layer perceptron works as a linear binary classifier. Consider a feature vector [x1, x2, x3] that is used to predict the probability (p) of occurrence of a certain event.



Weighting factors: Each input in the feature vector is assigned its own relative weight (w), which decides the impact that particular input has in the summation function. In relatively simple terms, some inputs are made more important than others by giving them more weight, so that they have a greater effect on the summation function (y). A bias (w0) is also added to the summation.


 

Activation function: The result of the summation function, that is the weighted sum, is transformed to a desired output by employing a non-linear function (fNL), also known as the activation function. Since the desired output here is the probability of an event, a sigmoid function can be used to restrict the result (y) to between 0 and 1.


 



Sigmoid Function
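Below is a minimal sketch of this forward pass in Python with NumPy; the feature vector, weights and bias are illustrative values, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the (0, 1) range so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: a 3-dimensional feature vector, its relative weights, and a bias.
x = np.array([0.5, -1.2, 3.0])   # inputs [x1, x2, x3]
w = np.array([0.4, 0.3, -0.2])   # weighting factors
w0 = 0.1                         # bias

# Summation function: weighted sum of the inputs plus the bias.
y = np.dot(w, x) + w0

# Activation function: non-linear transform of the weighted sum.
p = sigmoid(y)
print(p)  # predicted probability of the event
```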

Other commonly used activation functions are the Rectified Linear Unit (ReLU), the hyperbolic tangent (tanh) and the identity function.

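For reference, a quick sketch of these three activations (standard definitions, written out with NumPy):

```python
import numpy as np

def relu(z):
    # Passes positive values through unchanged and clips negative values to zero.
    return np.maximum(0.0, z)

def tanh(z):
    # Squashes values into the (-1, 1) range.
    return np.tanh(z)

def identity(z):
    # Leaves the weighted sum untouched (a linear "activation").
    return z
```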

 

Error and Loss Function: In most learning networks, error is calculated as the difference between the actual output and the predicted output.


 

The function that is used to compute this error is known as the Loss Function J(.). Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. One of the most widely used loss functions is mean square error, which calculates the square of the difference between the actual value and the predicted value. Different loss functions are used to deal with different types of tasks, i.e. regression and classification.

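As a concrete illustration of mean square error (the numbers below are made up):

```python
import numpy as np

def mean_square_error(y_true, y_pred):
    # Average of the squared element-wise differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

# Illustrative targets and predictions.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mean_square_error(y_true, y_pred))  # (0.1**2 + 0.2**2 + 0.4**2) / 3 = 0.07
```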


Back Propagation and Optimisation Function: Error J(w) is a function of the internal parameters of the model, i.e. the weights and bias. For accurate predictions, one needs to minimize the calculated error. In a neural network, this is done using back propagation. The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in such a way that the error is minimized. The weights are modified using a function called the Optimisation Function.


 



Optimisation functions usually calculate the gradient, i.e. the partial derivative of the loss function with respect to the weights, and the weights are modified in the opposite direction of the calculated gradient. This cycle is repeated until we reach the minimum of the loss function.

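A minimal sketch of this update cycle for a single weight; the toy data, loss and learning rate are illustrative assumptions, not the article's experiment:

```python
import numpy as np

# Illustrative 1-D problem: fit y = w * x with mean square error as J(w).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # the true relationship is w = 2

w = 0.0      # initial weight
lr = 0.05    # learning rate

for step in range(200):
    y_pred = w * x
    # Gradient of J(w) = mean((y_pred - y)**2) with respect to w.
    grad = np.mean(2 * (y_pred - y) * x)
    # Move the weight in the direction opposite to the gradient.
    w -= lr * grad

print(w)  # converges towards 2.0
```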


Thus, the components of a neural network model, i.e. the activation function, loss function and optimisation algorithm, play a very important role in efficiently and effectively training a model and producing accurate results. Different tasks require a different set of such functions to give the best results.



Loss Functions:



Thus, loss functions are helpful to train a neural network. Given an input and a target, they calculate the loss, i.e. the difference between the output and the target variable. Loss functions fall into three major categories:


 

Regressive loss functions:



They are used in the case of regression problems, that is, when the target variable is continuous. The most widely used regressive loss function is Mean Square Error. Other loss functions are (see the sketch after this list):
1. Absolute Error - measures the mean absolute value of the element-wise difference between input and target;
2. Smooth Absolute Error - a smooth version of the Absolute Error criterion.

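A small sketch of these regression losses using PyTorch criteria (illustrative tensors; the article's own code used Torch7):

```python
import torch
import torch.nn as nn

# Illustrative predictions and targets.
pred = torch.tensor([0.9, 0.2, 0.6])
target = torch.tensor([1.0, 0.0, 1.0])

mse = nn.MSELoss()           # mean square error
mae = nn.L1Loss()            # absolute error (mean absolute difference)
smooth = nn.SmoothL1Loss()   # smooth absolute error

print(mse(pred, target), mae(pred, target), smooth(pred, target))
```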

 

Classification loss functions:



The output variable in a classification problem is usually a probability value f(x), called the score for the input x. Generally, the magnitude of the score represents the confidence of our prediction. The target variable y is a binary variable, 1 for true and -1 for false.
On an example (x, y), the margin is defined as yf(x). The margin is a measure of how correct we are. Most classification losses mainly aim to maximize the margin. Some classification losses are (a short sketch follows the list):
1. Binary Cross Entropy
2. Negative Log Likelihood
3. Margin Classifier
4. Soft Margin Classifier

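A brief sketch of two of these in PyTorch, with made-up scores and targets:

```python
import torch
import torch.nn as nn

score = torch.tensor([2.0, -0.5])       # raw scores f(x) for two examples
target01 = torch.tensor([1.0, 0.0])     # targets encoded as 0/1 for cross entropy
target_pm1 = torch.tensor([1.0, -1.0])  # targets encoded as +1/-1 for margin-based losses

# Binary cross entropy computed on sigmoid probabilities of the scores.
bce = nn.BCEWithLogitsLoss()
print(bce(score, target01))

# The margin y * f(x): positive when the prediction is on the correct side of the boundary.
print(target_pm1 * score)
```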

 

Embedding loss functions:

These deal with problems where we have to measure whether two inputs are similar or dissimilar. Some examples are (a short sketch follows the list):
1. L1 Hinge Error - calculates the L1 distance between two inputs.
2. Cosine Error - the cosine distance between two inputs.

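A sketch of the corresponding criteria in PyTorch (illustrative tensors; the names follow the PyTorch API rather than the Torch7 code used for the article):

```python
import torch
import torch.nn as nn

x1 = torch.tensor([[1.0, 2.0, 3.0]])
x2 = torch.tensor([[1.5, 1.5, 3.5]])
similar = torch.tensor([1.0])   # +1 means "these two inputs should be similar"

# Hinge embedding loss applied to the L1 distance between the two inputs.
l1_distance = torch.sum(torch.abs(x1 - x2), dim=1)
hinge = nn.HingeEmbeddingLoss()
print(hinge(l1_distance, similar))

# Cosine embedding loss works directly on the two input vectors.
cosine = nn.CosineEmbeddingLoss()
print(cosine(x1, x2, similar))
```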


Visualising Loss Functions:



We performed the task of reconstructing an image using a type of neural network called an autoencoder. Different results were obtained for the same task by using different loss functions, while everything else in the neural network architecture remained constant. Thus, the difference in results reflects the properties of the different loss functions employed. A very simple data set, the MNIST data set, was used for this purpose. Three loss functions were used to reconstruct images:

      Absolute Loss Function

      Mean Square Loss Function

      Smooth Absolute Loss Function.


While the Absolute error just calculated the mean absolute value of the pixel-wise difference, the Mean Square error used the mean squared difference. Thus it was more sensitive to outliers and pushed pixel values towards 1 (in our case, white, as can be seen in the image after the first epoch itself).


 

Smooth L1 error can be thought of as a smooth version of the Absolute error. It uses a squared term if the absolute element-wise error falls below 1, and the L1 distance otherwise. It is less sensitive to outliers than the Mean Squared Error and in some cases prevents exploding gradients.

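A direct implementation of that piecewise definition, as a sketch (PyTorch's nn.SmoothL1Loss gives the same behaviour with its default settings):

```python
import torch

def smooth_l1(pred, target):
    # Element-wise: 0.5 * diff**2 when |diff| < 1, otherwise |diff| - 0.5.
    diff = torch.abs(pred - target)
    loss = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss.mean()

# Illustrative values: one small error and one outlier.
pred = torch.tensor([0.2, 3.0])
target = torch.tensor([0.0, 0.0])
print(smooth_l1(pred, target))  # (0.5 * 0.04 + (3.0 - 0.5)) / 2 = 1.26
```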

 

Optimisation Algorithms



Optimisation Algorithms are used to update the weights and biases, i.e. the internal parameters of a model, to reduce the error. They can be divided into two categories:


 

Constant Learning Rate Algorithms:



The most widely used optimisation algorithm, Stochastic Gradient Descent, falls under this category.


 


The weights are updated in the direction opposite to the gradient of the loss, w ← w − η·∇J(w). Here η is called the learning rate, a hyperparameter that has to be tuned. Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, i.e. tiny steps towards the parameter values that minimize the loss, which directly inflates the overall training time. A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge.


 

A similar hyperparameter is momentum, which determines the velocity with which the learning rate has to be increased as we approach the minima.

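For instance, this is how such an optimiser could be constructed in PyTorch (a sketch; the linear model here is just an assumed placeholder for any trainable network):

```python
import torch

# Placeholder model: any trainable module would do.
model = torch.nn.Linear(10, 1)

# Constant learning rate with momentum, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One illustrative update step on random data.
x = torch.randn(4, 10)
y = torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()  # clear previously accumulated gradients
loss.backward()        # back-propagate the error
optimizer.step()       # modify the weights opposite to the gradient
```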

 

Adaptive Learning Algorithms:



The challenge with gradient descent is that its hyperparameters have to be defined in advance, and they depend heavily on the type of model and problem. Another problem is that the same learning rate is applied to all parameter updates. If we have sparse data, we may instead want to update the parameters to different extents.


 

Adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop and Adam provide an alternative to classical SGD. They use per-parameter learning rates, which provide a heuristic approach without requiring the expensive work of manually tuning hyperparameters for the learning rate schedule.

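In PyTorch these are available as drop-in replacements for SGD (a sketch; the model and learning rates are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder network

# Per-parameter learning rate methods.
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
adadelta = torch.optim.Adadelta(model.parameters())
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01)
adam = torch.optim.Adam(model.parameters(), lr=0.001)
```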

 

Working with Optimisation Functions:



We used three first order optimisation functions and studied their effect.

      Stochastic Gradient Descent

      Adagrad

      Adam



 

Gradient Descent calculates the gradient for the whole dataset and updates values in the direction opposite to the gradients until we find a local minimum. Stochastic Gradient Descent performs a parameter update for each training example, unlike normal Gradient Descent which performs only one update. Thus it is much faster. Gradient Descent algorithms can further be improved by tuning important parameters like momentum, learning rate etc.


 

Adagrad is preferable for a sparse data set as it makes big updates for infrequent parameters and small updates for frequent parameters. It uses a different learning rate for every parameter θ at each time step, based on the past gradients which were computed for that parameter. Thus we do not need to manually tune the learning rate.

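The core of that per-parameter rule, sketched directly in NumPy (standard Adagrad update; the gradient values are illustrative):

```python
import numpy as np

lr, eps = 0.1, 1e-8
w = np.zeros(3)        # parameters theta
cache = np.zeros(3)    # running sum of squared gradients, one entry per parameter

for step in range(100):
    # Illustrative gradient: the first parameter consistently receives large gradients, the others small ones.
    grad = np.array([0.5, 0.001, 0.01])
    cache += grad ** 2
    # Each parameter gets its own effective step size lr / sqrt(cache).
    w -= lr * grad / (np.sqrt(cache) + eps)

print(lr / (np.sqrt(cache) + eps))  # smaller effective rate where gradients have been large
```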


Adam stands for Adaptive Moment Estimation. It also calculates a different learning rate for each parameter. Adam works well in practice, is faster, and outperforms other techniques.

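For reference, a sketch of the standard Adam update for a single parameter vector (the constants are the commonly used defaults; the gradient is illustrative):

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w = np.zeros(3)
m = np.zeros(3)   # first moment: moving average of gradients
v = np.zeros(3)   # second moment: moving average of squared gradients

for t in range(1, 101):
    grad = np.array([0.5, -0.2, 0.1])   # illustrative gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero-initialised moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)
```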

 

Stochastic Gradient Descent was much faster than the other algorithms, but the results produced were far from optimal. Both Adagrad and Adam produced better results than SGD, but they were computationally expensive. Adam was slightly faster than Adagrad. Thus, while using a particular optimisation function, one has to make a trade-off between more computation power and better results.


 

Torch:

We worked with Torch7, a Lua-based predecessor of PyTorch, to complete this project.


The GitHub repo with the complete project and code is: https://github.com/dsgiitr/Visualizing-Loss-Functions
