Stochastic Gradient Descent
Section summary
-
This R script illustrates the difference between stochastic gradient descent (SGD) and gradient descent (GD) in a simple linear regression problem.
-
Synthetic data generation
The dataset is simulated according to the model
y = X θ* + ε
where X is a 100×2 matrix of Gaussian random variables, θ* = (2, −1), and ε is Gaussian noise with standard deviation 0.5.
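A minimal R sketch of this step (the seed and variable names such as theta_star are illustrative assumptions, not taken from the original script):

set.seed(1)                                       # for reproducibility (assumed)
n <- 100
X <- matrix(rnorm(n * 2), nrow = n, ncol = 2)     # 100 x 2 matrix of Gaussian entries
theta_star <- c(2, -1)                            # true parameters
y <- drop(X %*% theta_star + rnorm(n, sd = 0.5))  # response with Gaussian noise, sd = 0.5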
-
Loss surface
A grid of values for θ₁ and θ₂ is created around the true parameters.
For each pair (θ₁, θ₂), the script computes the sum of squared residuals
L(θ₁, θ₂) = Σᵢ (xᵢᵀ θ − yᵢ)², where θ = (θ₁, θ₂).
A contour plot displays the loss surface in the (θ₁, θ₂) plane.
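A possible R sketch of the grid and contour plot; the grid limits below are an assumption chosen to frame the true parameters (2, −1):

theta1_grid <- seq(0, 4, length.out = 100)
theta2_grid <- seq(-3, 1, length.out = 100)
loss <- outer(theta1_grid, theta2_grid,
              Vectorize(function(t1, t2) sum((X %*% c(t1, t2) - y)^2)))
contour(theta1_grid, theta2_grid, loss, nlevels = 30,
        xlab = expression(theta[1]), ylab = expression(theta[2]))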
-
Stochastic Gradient Descent (SGD)
Starting from θ = (0,0), the algorithm updates the parameters after each observation:
θ ← θ − δ × 2xᵢ(xᵢᵀθ − yᵢ),
with learning rate δ = 0.05.
This is repeated for 10 epochs, randomly shuffling the data at each pass.
The trajectory of the parameters (θ₁, θ₂) is stored and plotted.
SGD produces a noisy trajectory but usually converges quickly toward the minimum.
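One way to implement this loop in R, following the update rule and settings above (δ = 0.05, 10 epochs, data reshuffled at each pass); sgd_path is an illustrative name for the stored trajectory:

delta <- 0.05
theta <- c(0, 0)
sgd_path <- matrix(theta, nrow = 1)               # row i = parameters after i-th update
for (epoch in 1:10) {
  for (i in sample(nrow(X))) {                    # random order at each pass
    xi <- X[i, ]
    theta <- theta - delta * 2 * xi * (sum(xi * theta) - y[i])
    sgd_path <- rbind(sgd_path, theta)
  }
}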
-
Gradient Descent (GD)
Here, the gradient is computed over the whole dataset at each iteration:
θ ← θ − δ × 2Xᵀ(Xθ − y),
with δ = 0.001 and 50 iterations.
This yields a much smoother and more regular trajectory than SGD.
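A matching sketch of the batch version (δ = 0.001, 50 iterations); gd_path is again an illustrative name:

delta_gd <- 0.001
theta_gd <- c(0, 0)
gd_path <- matrix(theta_gd, nrow = 1)
for (iter in 1:50) {
  grad <- 2 * t(X) %*% (X %*% theta_gd - y)       # gradient over the whole dataset
  theta_gd <- theta_gd - delta_gd * drop(grad)
  gd_path <- rbind(gd_path, theta_gd)
}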
-
Visual comparison
The final contour plot shows both optimization paths:
– red line: stochastic gradient descent
– black line: batch gradient descent
Both methods move toward the same minimum (close to θ* = (2, −1)), but SGD fluctuates around the valley of the loss function, whereas GD follows a smooth, deterministic path.
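Assuming the contour plot from the loss-surface step is still the active graphics device, the two stored paths can be overlaid as follows:

lines(sgd_path[, 1], sgd_path[, 2], col = "red")    # stochastic gradient descent
lines(gd_path[, 1], gd_path[, 2], col = "black")    # batch gradient descent
points(2, -1, pch = 4)                              # true parameters theta* = (2, -1)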
This example demonstrates how SGD and GD explore the parameter space differently while aiming to minimize the same quadratic loss function.
-