Stochastic Gradient Descent
Section summary
-
This R script illustrates the difference between stochastic gradient descent (SGD) and gradient descent (GD) in a simple linear regression problem.
-
Synthetic data generation
The dataset is simulated according to the model
y = X θ* + ε
where X is a 100×2 matrix of Gaussian random variables, θ* = (2, −1), and ε is Gaussian noise with standard deviation 0.5.
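A minimal R sketch of this step (the seed and variable names such as theta_star are illustrative assumptions, not taken from the original script):

set.seed(1)                                       # for reproducibility (assumed)
n <- 100
X <- matrix(rnorm(n * 2), nrow = n, ncol = 2)     # 100 x 2 matrix of Gaussian entries
theta_star <- c(2, -1)                            # true parameters
y <- drop(X %*% theta_star + rnorm(n, sd = 0.5))  # response with Gaussian noise, sd = 0.5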
-
Loss surface
A grid of values for θ₁ and θ₂ is created around the true parameters.
For each pair (θ₁, θ₂), the script computes the sum of squared residuals
L(θ₁, θ₂) = Σᵢ (xᵢᵀ θ − yᵢ)², where θ = (θ₁, θ₂).
A contour plot displays the loss surface in the (θ₁, θ₂) plane.
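A possible R sketch of the grid and contour plot; the grid limits below are an assumption chosen to frame the true parameters (2, −1):

theta1_grid <- seq(0, 4, length.out = 100)
theta2_grid <- seq(-3, 1, length.out = 100)
loss <- outer(theta1_grid, theta2_grid,
              Vectorize(function(t1, t2) sum((X %*% c(t1, t2) - y)^2)))
contour(theta1_grid, theta2_grid, loss, nlevels = 30,
        xlab = expression(theta[1]), ylab = expression(theta[2]))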
-
Stochastic Gradient Descent (SGD)
Starting from θ = (0,0), the algorithm updates the parameters after each observation:
θ ← θ − δ × 2xᵢ(xᵢᵀθ − yᵢ),
with learning rate δ = 0.05.
This is repeated for 10 epochs, randomly shuffling the data at each pass.
The trajectory of the parameters (θ₁, θ₂) is stored and plotted.
SGD produces a noisy trajectory but usually converges quickly toward the minimum.
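One way to implement this loop in R, following the update rule and settings above (δ = 0.05, 10 epochs, data reshuffled at each pass); sgd_path is an illustrative name for the stored trajectory:

delta <- 0.05
theta <- c(0, 0)
sgd_path <- matrix(theta, nrow = 1)               # row i = parameters after i-th update
for (epoch in 1:10) {
  for (i in sample(nrow(X))) {                    # random order at each pass
    xi <- X[i, ]
    theta <- theta - delta * 2 * xi * (sum(xi * theta) - y[i])
    sgd_path <- rbind(sgd_path, theta)
  }
}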
-
Gradient Descent (GD)
Here, the gradient is computed over the whole dataset at each iteration:
θ ← θ − δ × 2Xᵀ(Xθ − y),
with δ = 0.001 and 50 iterations.
This yields a much smoother and more regular trajectory than SGD.
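A matching sketch of the batch version (δ = 0.001, 50 iterations); gd_path is again an illustrative name:

delta_gd <- 0.001
theta_gd <- c(0, 0)
gd_path <- matrix(theta_gd, nrow = 1)
for (iter in 1:50) {
  grad <- 2 * t(X) %*% (X %*% theta_gd - y)       # gradient over the whole dataset
  theta_gd <- theta_gd - delta_gd * drop(grad)
  gd_path <- rbind(gd_path, theta_gd)
}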
-
Visual comparison
The final contour plot shows both optimization paths:
– red line: stochastic gradient descent
– black line: batch gradient descent
Both methods move toward the same minimum (close to θ* = (2, −1)), but SGD fluctuates around the valley of the loss function, whereas GD follows a smooth, deterministic path.
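Assuming the contour plot from the loss-surface step is still the active graphics device, the two stored paths can be overlaid as follows:

lines(sgd_path[, 1], sgd_path[, 2], col = "red")    # stochastic gradient descent
lines(gd_path[, 1], gd_path[, 2], col = "black")    # batch gradient descent
points(2, -1, pch = 4)                              # true parameters theta* = (2, -1)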
This example demonstrates how SGD and GD explore the parameter space differently while aiming to minimize the same quadratic loss function.
-