Translated from:
The choice of Optimisation Algorithm and Loss Function for a deep learning model can play a big role in producing optimum and faster results. Before we begin, let us see how the different components of a deep learning model affect its result through the simple example of a Perceptron.
Perceptron
If you are not familiar with the term perceptron, it refers to a particular supervised learning model, outlined by Rosenblatt in 1957. The architecture and behavior of a perceptron are very similar to those of biological neurons, and it is often considered the most basic form of neural network. Other kinds of neural networks were developed after the perceptron, and their diversity and applications continue to grow. It is easier to explain the constituents of a neural network using the example of a single layer perceptron.
A single layer perceptron works as a linear binary classifier. Consider a feature vector [x1, x2, x3] that is used to predict the probability (p) of occurrence of a certain event.
Weighting factors: Each input in the feature vector is assigned its own relative weight (w), which decides the impact that the particular input has in the summation function. In relatively simple terms, some inputs are made more important than others by giving them more weight, so that they have a greater effect on the summation function (y). A bias (w0) is also added to the summation.
Activation function: The result of the summation function, that is, the weighted sum, is transformed into a desired output by employing a non-linear function (fNL), also known as the activation function. Since the desired output in this case is the probability of an event, a sigmoid function can be used to restrict the result (y) between 0 and 1.
Sigmoid Function
Other commonly used activation functions are the Rectified Linear Unit (ReLU), hyperbolic tangent (tanh) and the Identity function.
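To make the perceptron concrete, here is a minimal sketch in Python/NumPy (the original project used Torch7/Lua; the inputs, weights and bias below are illustrative values, not taken from the article) of the weighted sum with a bias followed by a few common activation functions:

```python
import numpy as np

def sigmoid(z):
    # squashes the weighted sum into (0, 1), so it can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # keeps positive values, zeroes out negative ones
    return np.maximum(0.0, z)

def tanh(z):
    # squashes the weighted sum into (-1, 1)
    return np.tanh(z)

# feature vector [x1, x2, x3], weights [w1, w2, w3] and bias w0
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
w0 = 0.1

y_sum = w0 + np.dot(w, x)   # summation function
p = sigmoid(y_sum)          # activation: predicted probability of the event
print(p)
```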
Error and Loss Function: In most learning networks, the error is calculated as the difference between the actual output and the predicted output.
The function that is used to compute this error is known as the Loss Function, J(.). Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. One of the most widely used loss functions is mean square error, which calculates the square of the difference between the actual value and the predicted value. Different loss functions are used to deal with different types of tasks, i.e. regression and classification.
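As a small illustration (not the project's code), mean square error reduces to a single NumPy expression over illustrative predictions and targets:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.4])

# mean square error: average of the squared element-wise differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```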
Back Propagation and Optimisation Function: The error J(w) is a function of the internal parameters of the model, i.e. the weights and bias. For accurate predictions, one needs to minimize the calculated error. In a neural network, this is done using back propagation. The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in such a way that the error is minimized. The weights are modified using a function called the Optimisation Function.
Optimisation functions usually calculate the gradient, i.e. the partial derivative of the loss function with respect to the weights, and the weights are modified in the opposite direction of the calculated gradient. This cycle is repeated until we reach the minima of the loss function.
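For a simple linear model trained with mean square error, one such cycle looks roughly like the sketch below (a simplified Python/NumPy illustration with toy data, not the project's code):

```python
import numpy as np

# toy data: 4 examples, 3 features
X = np.array([[0.1, 0.5, 1.0],
              [0.3, 0.2, 0.7],
              [0.9, 0.4, 0.1],
              [0.6, 0.8, 0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(3)
b = 0.0
lr = 0.1  # learning rate

for step in range(100):
    y_pred = X @ w + b
    error = y_pred - y
    loss = np.mean(error ** 2)            # loss function J(w)
    grad_w = 2 * X.T @ error / len(y)     # partial derivative of J w.r.t. the weights
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                      # move opposite to the gradient
    b -= lr * grad_b
```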
Thus, the components of a neural network model, i.e. the activation function, loss function and optimisation algorithm, play a very important role in efficiently and effectively training a model and producing accurate results. Different tasks require a different set of such functions to give the most optimum results.
Loss Functions:
Thus, loss functions are helpful to train a neural network. Given an input and a target, they calculate the loss, i.e. the difference between the output and the target variable. Loss functions fall under the following major categories:
Regressive loss functions:
They are used in case of regressive problems, that is, when the target variable is continuous. The most widely used regressive loss function is Mean Square Error. Other loss functions are (all three are sketched below):
1. Absolute Error, which measures the mean absolute value of the element-wise difference between inputs;
2. Smooth Absolute Error, a smooth version of the Absolute Error criterion.
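The project itself was written in Torch7; in PyTorch, its successor, these three regressive losses are available as built-in criteria. A minimal usage sketch with illustrative tensors:

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.9, 0.2, 0.8])
target = torch.tensor([1.0, 0.0, 1.0])

mse = nn.MSELoss()(pred, target)          # Mean Square Error
mae = nn.L1Loss()(pred, target)           # Absolute Error
smooth = nn.SmoothL1Loss()(pred, target)  # Smooth Absolute Error
print(mse.item(), mae.item(), smooth.item())
```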
Classification loss functions:
The output variable in a classification problem is usually a probability value f(x), called the score for the input x. Generally, the magnitude of the score represents the confidence of our prediction. The target variable y is a binary variable, 1 for true and -1 for false. On an example (x, y), the margin is defined as yf(x). The margin is a measure of how correct we are. Most classification losses mainly aim to maximize the margin. Some classification losses are (two of them are sketched after this list):
1. Binary Cross Entropy
2. Negative Log Likelihood
3. Margin Classifier
4. Soft Margin Classifier
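A rough NumPy illustration of two of these losses, built on the margin yf(x) described above. The formulas are the standard textbook forms of binary cross entropy and the hinge (margin) loss, and the scores are illustrative, not the project's exact code:

```python
import numpy as np

y = np.array([1, -1, 1, 1])            # targets: 1 for true, -1 for false
f = np.array([2.3, 0.4, -1.1, 3.0])    # raw scores f(x)

# margin: large and positive when the prediction is confidently correct
margin = y * f

# hinge (margin classifier) loss: penalises margins smaller than 1
hinge = np.mean(np.maximum(0.0, 1.0 - margin))

# binary cross entropy on sigmoid probabilities, with targets mapped to {0, 1}
p = 1.0 / (1.0 + np.exp(-f))
t = (y + 1) / 2
bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
print(hinge, bce)
```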
Embedding loss functions:
These loss functions deal with problems where we have to measure whether two inputs are similar or dissimilar. Some examples are (both distances are sketched below):
1. L1 Hinge Error, which calculates the L1 distance between two inputs.
2. Cosine Error, the cosine distance between two inputs.
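Both distances reduce to one-liners in NumPy (an illustrative sketch with made-up vectors):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.5, 1.0, 3.5])

# L1 distance between the two inputs (the quantity used by the L1 Hinge criterion)
l1_dist = np.sum(np.abs(a - b))

# cosine distance: 1 minus the cosine similarity of the two inputs
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim
print(l1_dist, cos_dist)
```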
Visualising Loss Functions:
We performed the task of reconstructing an image using a type of neural network called an Autoencoder. Different results were obtained for the same task by using different loss functions, while everything else in the neural network architecture remained constant. Thus, the difference in results reflects the properties of the different loss functions employed. A very simple data set, the MNIST data set, was used for this purpose. Three loss functions were used to reconstruct the images (a minimal sketch of the setup follows this list):
• Absolute Loss Function
• Mean Square Loss Function
• Smooth Absolute Loss Function
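The original experiment was written in Torch7; the PyTorch sketch below mirrors the idea, with an illustrative architecture, hyperparameters and random stand-in data rather than the exact ones used in the project:

```python
import torch
import torch.nn as nn

# a tiny fully connected autoencoder for 28x28 MNIST images (784 pixels)
model = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),    # encoder
    nn.Linear(64, 784), nn.Sigmoid()  # decoder, pixel values in [0, 1]
)

# swap the criterion to compare reconstructions under different losses
criterion = nn.L1Loss()        # or nn.MSELoss(), nn.SmoothL1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.rand(32, 784)   # stand-in for a batch of MNIST images
for epoch in range(5):
    recon = model(images)
    loss = criterion(recon, images)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```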
While the Absolute Error just calculates the mean absolute value of the pixel-wise difference, Mean Square Error uses the mean of the squared differences. Thus it was more sensitive to outliers and pushed the pixel values towards 1 (in our case, white, as can be seen in the image after the first epoch itself).
Smooth L1 Error can be thought of as a smooth version of the Absolute Error. It uses a squared term if the absolute element-wise error falls below 1, and the L1 distance otherwise. It is less sensitive to outliers than the Mean Squared Error and in some cases prevents exploding gradients.
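In other words, the Smooth L1 error is piecewise: quadratic for small errors and linear for large ones. A sketch of the commonly used formulation (the same form as PyTorch's SmoothL1Loss with its default threshold of 1):

```python
import numpy as np

def smooth_l1(pred, target):
    # quadratic when the absolute error is below 1, linear otherwise
    diff = np.abs(pred - target)
    return np.mean(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5))

print(smooth_l1(np.array([0.2, 3.0]), np.array([0.0, 0.0])))
```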
Optimisation Algorithms
Optimisation algorithms are used to update the weights and biases, i.e. the internal parameters of a model, in order to reduce the error. They can be divided into two categories:
Constant Learning Rate Algorithms:
The most widely used Optimisation Algorithm, Stochastic Gradient Descent, falls under this category. The weights are updated as w ← w - η·∇J(w), where η is called the learning rate, a hyperparameter that has to be tuned. Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, i.e. it results in small baby steps towards the optimal parameter values that minimize the loss, which makes the overall training time too long. A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
A similar hyperparameter is momentum, which determines the velocity with which the learning rate has to be increased as we approach the minima. The classical momentum update is sketched below.
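A sketch of the classical momentum formulation (the standard textbook rule, not necessarily the exact variant used in the project; the learning rate and momentum values are illustrative): a fraction of the previous update is carried over into the current one.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # accumulate a decaying moving average of past gradient steps
    velocity = momentum * velocity - lr * grad
    # apply the accumulated velocity to the weights
    w = w + velocity
    return w, velocity

w = np.array([0.5, -0.3])
velocity = np.zeros_like(w)
grad = np.array([0.1, -0.2])   # gradient of the loss w.r.t. w
w, velocity = sgd_momentum_step(w, grad, velocity)
```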
Adaptive Learning Algorithms:
The challenge of using gradient descent is that its hyperparameters have to be defined in advance, and they depend heavily on the type of model and problem. Another problem is that the same learning rate is applied to all parameter updates. If we have sparse data, we may instead want to update the parameters to different extents.
Adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop and Adam provide an alternative to classical SGD. They have per-parameter learning rate methods, which provide a heuristic approach without requiring the expensive work of manually tuning hyperparameters for the learning rate schedule.
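In PyTorch (the successor of the Torch7 framework used here), these adaptive methods are drop-in replacements for SGD: only the optimiser construction changes. A usage sketch with an illustrative model and learning rates:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any model's parameters can be handed to the optimiser

# classical SGD with a hand-tuned learning rate and momentum
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# adaptive alternatives that maintain a per-parameter learning rate
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001)
adam = torch.optim.Adam(model.parameters(), lr=0.001)
```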
Working with Optimisation Functions:
We used three first order optimisation functions and studied their effect.
• Stochastic Gradient Descent
• Adagrad
• Adam
Gradient Descent calculates the gradient for the whole dataset and updates the values in the direction opposite to the gradient until we find a local minimum. Stochastic Gradient Descent performs a parameter update for each training example, unlike normal Gradient Descent, which performs only one update per pass over the whole dataset. Thus it is much faster (the difference is sketched below). Gradient Descent algorithms can further be improved by tuning important parameters like momentum, learning rate etc.
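The difference shows up directly in code. In this illustrative NumPy sketch (toy linear regression data, not the project's code), batch Gradient Descent makes one update per pass over the data, while SGD makes one update per example:

```python
import numpy as np

# toy linear regression data
X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])

def grad_mse(w, X_batch, y_batch):
    # gradient of the mean square error for a linear model
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

lr, w_gd, w_sgd = 0.1, np.zeros(3), np.zeros(3)

# (batch) Gradient Descent: one update per pass over the whole dataset
for epoch in range(10):
    w_gd -= lr * grad_mse(w_gd, X, y)

# Stochastic Gradient Descent: one update per training example
for epoch in range(10):
    for i in np.random.permutation(len(y)):
        w_sgd -= lr * grad_mse(w_sgd, X[i:i+1], y[i:i+1])
```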
Adagrad is preferable for a sparse data set as it makes big updates for infrequent parameters and small updates for frequent parameters. It uses a different learning rate for every parameter θ at each time step, based on the past gradients which were computed for that parameter. Thus we do not need to manually tune the learning rate.
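The core of Adagrad's per-parameter rule can be sketched in a few lines (standard textbook form; the learning rate and epsilon below are illustrative defaults):

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    # accumulate the squared gradients seen so far for each parameter
    cache = cache + grad ** 2
    # parameters with a large accumulated gradient get a smaller effective step
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([0.5, -0.3])
cache = np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, cache = adagrad_step(w, grad, cache)
```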
Adam stands for Adaptive Moment Estimation. It also calculates a different learning rate for each parameter. Adam works well in practice, is faster, and outperforms other techniques.
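Adam combines momentum-like first-moment estimates with Adagrad/RMSprop-like second-moment estimates. A sketch of the standard update rule (the β and ε defaults come from the original Adam paper; the weights and gradient are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentred variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, m, v = adam_step(w, grad, m, v, t=1)
```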
Stochastic Gradient Descent was much faster than the other algorithms, but the results produced were far from optimum. Both Adagrad and Adam produced better results than SGD, but they were computationally more expensive. Adam was slightly faster than Adagrad. Thus, while using a particular optimisation function, one has to make a trade-off between more computation power and more optimum results.
Torch:
We worked with Torch7 to complete this project, which is a Lua-based predecessor of PyTorch.
The GitHub repo of the complete project and code is: https://github.com/dsgiitr/Visualizing-Loss-Functions