Implementing AdaGrad Optimizer in Spark

Have you ever wished for a gradient descent approach with greater flexibility? If so, Adaptive Gradient Descent (AdaGrad) could be a suitable alternative.

What you will learn: How the Adaptive Gradient algorithm works, how it improves upon Stochastic Gradient Descent, and how to implement it with Apache Spark.


Notes:?

  • This post assumes that the reader has a rudimentary knowledge of the Apache Spark Scala API and a basic understanding of machine learning.
  • The code associated with this article is written using Scala 2.12.11 and Spark 3.1.0.

Background

The optimization algorithm known as Stochastic Gradient Descent (SGD) is frequently employed to minimize the loss function during the training of various machine learning models, including support vector machines, logistic regression, and back-propagation neural networks. In its most basic form, SGD calculates the gradient using a single learning rate.

It's quite usual to find that the features of a model exhibit significant variance across different observations. Under such circumstances, an adaptive gradient (AdaGrad) algorithm that assigns a unique learning rate to each feature could be an effective solution. There exist numerous methods to develop an algorithm that assigns a distinct learning rate to every feature.

Incidentally, the Spark machine learning library, MLlib, did not include an implementation of AdaGrad as of 2020.

This article discusses the AdaGrad algorithm and an implementation for Apache Spark.

Stochastic Gradient Descent (SGD)

Apache Spark stands out as a rapid and versatile cluster computing system, offering advanced APIs in Java, Scala, Python, and R, along with an enhanced engine capable of supporting general execution graphs.

Within the Apache Spark ecosystem lies MLlib, a machine learning library. The stochastic gradient descent optimizer within this library serves as a randomized variant of the (batched) gradient descent algorithm, which is employed for minimizing a continuous differentiable objective function. In the realm of supervised machine learning, this objective function typically manifests as a loss function, such as the logistic loss, the sum of least squares, and others.

The objective function L is expressed as the summation of differentiable functions. In supervised learning, the loss associated with a specific observation is defined as a continuous, differentiable, convex function.
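
For a training set of n observations, this decomposition takes the standard form L(w) = (1/n) · Σ Li(w), i = 1, ..., n, where Li is the loss incurred on the i-th observation.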

In supervised learning, the vector w represents the vector of weights (or model parameters). At each iteration of the stochastic gradient descent, the weights are updated using the formula:
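
w(t+1) = w(t) - η · ∇L(w(t))

where η denotes the learning rate and ∇L(w(t)) the gradient of the loss evaluated on the observations sampled at iteration t.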

Stochastic Gradient Descent (SGD) aims to reduce the loss function, which represents the difference between the model's predicted values and the expected values. In each iteration, SGD picks a small group of observations, referred to as a mini-batch, for the model's training. This repetitive procedure is designed to progressively approach the true global minimum of the loss function.
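
To make this procedure concrete, here is a minimal, framework-free sketch of mini-batch SGD with a single, global learning rate. The data set, the lossGradient function and the hyper-parameters are hypothetical placeholders and are not part of the MLlib implementation discussed below.

 object MiniBatchSGDSketch {
   import scala.util.Random

   // data: sequence of (features, label) pairs
   // lossGradient: gradient of the loss for a single observation, given the current weights
   def miniBatchSGD(
       data: IndexedSeq[(Array[Double], Double)],
       lossGradient: (Array[Double], (Array[Double], Double)) => Array[Double],
       eta: Double,             // single, global learning rate
       batchSize: Int,
       numIterations: Int): Array[Double] = {

     val rand = new Random(42L)
     var weights = Array.fill(data.head._1.length)(0.0)

     (0 until numIterations).foreach { _ =>
       // Randomly sample a mini-batch and average its per-observation gradients
       val batch = IndexedSeq.fill(batchSize)(data(rand.nextInt(data.size)))
       val avgGradient = batch
         .map(obs => lossGradient(weights, obs))
         .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
         .map(_ / batchSize)

       // w <- w - eta * gradient
       weights = weights.zip(avgGradient).map { case (w, g) => w - eta * g }
     }
     weights
   }
 }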

Adaptive Gradient Descent

The core concept of AdaGrad revolves around the strategy of boosting the learning rate for sparse features (or model parameters) while reducing it for more dense features. As a result, AdaGrad enhances the efficiency of converging towards the minimum loss in models with sparse features, particularly when these sparse features hold significant information.
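
In its standard formulation, AdaGrad keeps, for each model parameter i, the sum of the squares of all past gradients and divides the base learning rate by its square root:

w(t+1, i) = w(t, i) - η / (sqrt(G(t, i)) + ε) · g(t, i)

where g(t, i) is the current gradient for parameter i, G(t, i) is the sum of g(k, i)² over all previous iterations k, and ε is a small smoothing term that prevents a division by zero. Sparse features accumulate a small G(t, i) and therefore retain a comparatively large learning rate.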

SGD in Apache Spark

The Apache Spark MLlib library has two implementations of SGD:

  • Generic Gradient Descent and related classes in the mllib.optimization package
  • SGD bundled with classifier or regression algorithms such as LogisticRegressionWithSGD, LassoWithSGD, SVMWithSGD or RidgeRegressionWithSGD

We plan to utilize the optimization package to modify the stochastic gradient descent. Our goal is to use the mllib.optimization.GradientDescent template class from MLlib and create a custom Updater that implements an adaptive gradient with per-feature learning rates.

This Updater is responsible for updating the model's weights (in models such as logistic regression or SVM) by multiplying the current learning rate with the partial derivative of the loss with respect to each weight, as previously described. We name our custom Updater AdaGradUpdater; it handles the model weight updates using the adaptive gradient approach. The SGD optimizer is then initialized in the following manner.

   // 'gradient' stands for the gradient of the loss (e.g. new LogisticGradient)
   val adaSGD = new GradientDescent(gradient, new AdaGradUpdater)
                    .setStepSize(0.01)
                    // ... other optimizer settings (numIterations, miniBatchFraction, ...)

The class AdaGradUpdater has to implement the generic compute method:

  Updater.compute(
       weightsOld: Vector,    // weights computed at the previous iteration
       gradient: Vector,      // gradient of the loss for the current mini-batch
       stepSize: Double,      // initial step size (learning rate)
       iter: Int,             // current iteration number
       regParam: Double       // regularization parameter
  ): (Vector, Double)

The method returns a tuple (vector of updated weights, value of the regularization term computed with the updated weights). Let's implement the AdaGrad algorithm.

Implementation of AdaGrad

As mentioned earlier, the implementation of AdaGrad consists of overriding the method?Updater.compute.

The computation of the learning rate requires recording the sum of the squared gradient values from previous iterations for each weight, in the array gradientsHistory. The helper method += updates this gradient history; its first invocation allocates the history array.

The next step consists of converting the existing (old) weights into a Breeze dense vector, brzWeights. The array of per-feature learning rates is computed by the inverseVelocity method as the inverse square root of the accumulated squared gradients.

The learning rates are zipped with the old weights to produce the updated weights, newWeights, as a new dense vector. The linear algebra (vector computation) on the Spark data node is actually performed by the Breeze library, which delegates to low-level BLAS routines under the cover, through the calls to brzAxpy and brzNorm.

 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.mllib.optimization.Updater
 import breeze.linalg.{DenseVector => BDV, axpy => brzAxpy, norm => brzNorm}

 final class AdaGradL2Updater(dimension: Int) extends Updater {
   // Sum of the squared gradients accumulated over past iterations, one entry per feature
   var gradientsHistory: Array[Double] = _

   override def compute(
        weightsOld: Vector,
        gradient: Vector,
        stepSize: Double,
        iter: Int,
        regParam: Double
   ): (Vector, Double) = {

     // Accumulate the squared gradient for each feature
     this += gradient

     // Convert the old weights into a Breeze dense vector
     val brzWeights: BDV[Double] = new BDV[Double](weightsOld.toArray)

     // Zip each per-feature learning rate with its weight and apply the L2 shrinkage
     val sumSquareDerivative = inverseVelocity.zip(brzWeights.toArray)
     val newWeights: BDV[Double] =
        new BDV[Double](sumSquareDerivative.map {
           case (coef, weight) => weight * (1.0 - regParam * coef)
        })

     // newWeights = newWeights - gradient
     brzAxpy(-1.0, new BDV[Double](gradient.toArray), newWeights)
     val norm = brzNorm(newWeights, 2.0)

     (Vectors.dense(newWeights.toArray), 0.5 * regParam * norm * norm)
   }

   // Update the gradient history; the first call allocates the history array
   private def +=(gradient: Vector): Unit = {
     val grad = gradient.toArray
     if (gradientsHistory == null)
       gradientsHistory = Array.fill(grad.length)(0.0)

     grad.indices.foreach { index =>
       gradientsHistory(index) += grad(index) * grad(index)
     }
   }

   // Per-feature learning rate: inverse square root of the accumulated squared
   // gradients; the small epsilon prevents a division by zero
   def inverseVelocity: Array[Double] =
     gradientsHistory.map(g => 1.0 / (Math.sqrt(g) + 1e-8))
 }

Analysis

Given a logistic regression model used to classify time series patterns, we compute and compare the training binary cross-entropy loss, without regularization, using the default MLlib SGD and our implementation of AdaGrad.
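
As an illustration, such a comparison could be set up as sketched below. This is only a sketch: it assumes a labeled data set trainingData: RDD[(Double, Vector)] and an initial weight vector initialWeights (both hypothetical placeholders), and it relies on the developer API GradientDescent.runMiniBatchSGD, which returns the weights together with the loss recorded at each iteration.

 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SimpleUpdater}
 import org.apache.spark.rdd.RDD

 // Hypothetical inputs: trainingData: RDD[(Double, Vector)], initialWeights: Vector
 def compare(trainingData: RDD[(Double, Vector)], initialWeights: Vector): Unit = {
   // Baseline: default SGD with a single learning rate and no regularization
   val (_, sgdLossHistory) = GradientDescent.runMiniBatchSGD(
     trainingData,
     new LogisticGradient,          // binary cross-entropy (logistic) loss
     new SimpleUpdater,             // default updater, single learning rate
     0.01,                          // stepSize
     200,                           // numIterations
     0.0,                           // regParam (no regularization)
     1.0,                           // miniBatchFraction
     initialWeights)

   // AdaGrad: same settings, custom per-feature updater defined above
   val (_, adaGradLossHistory) = GradientDescent.runMiniBatchSGD(
     trainingData,
     new LogisticGradient,
     new AdaGradL2Updater(initialWeights.size),
     0.01, 200, 0.0, 1.0, initialWeights)

   // The two loss histories can then be plotted or logged for comparison
   sgdLossHistory.zip(adaGradLossHistory).zipWithIndex.foreach {
     case ((sgdLoss, adaLoss), it) => println(s"iteration $it: SGD=$sgdLoss AdaGrad=$adaLoss")
   }
 }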

The AdaGrad algorithm has its limitations: the accumulated sum of squared gradients can grow very large over time, causing the learning rate to shrink and become infinitesimally small, which effectively stops the parameters from updating.

AdaGrad is not the only alternative to traditional SGD. RMSProp and Adam are also worth investigating.


Thank you for reading this article. For more information ...


---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support, with extensive knowledge in machine learning. He has been director of data engineering at Aideo Technologies since 2017 and is the author of "Scala for Machine Learning" (Packt Publishing, ISBN 978-1-78712-238-3).
