Section summary

    • This R script illustrates the difference between stochastic gradient descent (SGD) and batch gradient descent (GD) on a simple linear regression problem.

      1. Synthetic data generation

        The dataset is simulated according to the model

        y = X θ* + ε

        where X is a 100×2 matrix of Gaussian random variables, θ* = (2, −1), and ε is Gaussian noise with standard deviation 0.5.
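
        A minimal sketch of this step, assuming rnorm() is used for both the design matrix and the noise (names such as theta_star and the seed are illustrative, not taken from the script); the later sketches reuse X and y defined here:

          # Synthetic data: y = X theta* + noise, with n = 100 and theta* = (2, -1)
          set.seed(42)                              # arbitrary seed, for reproducibility only
          n <- 100
          theta_star <- c(2, -1)                    # true parameters
          X <- matrix(rnorm(n * 2), nrow = n)       # 100 x 2 matrix of Gaussian entries
          y <- as.vector(X %*% theta_star + rnorm(n, sd = 0.5))   # noise with sd = 0.5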

      2. Loss surface

        A grid of values for θ₁ and θ₂ is created around the true parameters.

        For each pair (θ₁, θ₂), the script computes the sum of squared residuals

        L(θ₁, θ₂) = Σᵢ (xᵢᵀ θ − yᵢ)², where θ = (θ₁, θ₂).

        A contour plot displays the loss surface in the (θ₁, θ₂) plane.
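
        A sketch of the grid and contour plot built from the data above; the grid limits (a window of width 4 centred on θ* = (2, −1)) and the 100-point resolution are assumptions:

          # Sum of squared residuals evaluated on a grid around the true parameters
          theta1_grid <- seq(0, 4, length.out = 100)
          theta2_grid <- seq(-3, 1, length.out = 100)
          loss <- outer(theta1_grid, theta2_grid,
                        Vectorize(function(t1, t2) sum((X %*% c(t1, t2) - y)^2)))
          contour(theta1_grid, theta2_grid, loss,
                  xlab = expression(theta[1]), ylab = expression(theta[2]),
                  main = "Sum of squared residuals")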

      3. Stochastic Gradient Descent (SGD)

        Starting from θ = (0,0), the algorithm updates the parameters after each observation:

        θ ← θ − δ × 2xᵢ(xᵢᵀθ − yᵢ),

        with learning rate δ = 0.05.

        This is repeated for 10 epochs, randomly shuffling the data at each pass.

        The parameter trajectories (θ₁, θ₂) are stored and plotted.

        SGD produces a noisy trajectory but typically approaches the minimum quickly.
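
        A sketch of the SGD loop with the settings above (δ = 0.05, 10 epochs, start at (0, 0)), reusing X and y from the first sketch; sgd_path is an illustrative name for the stored trajectory:

          delta <- 0.05
          theta <- c(0, 0)
          sgd_path <- matrix(theta, nrow = 1)       # trajectory of (theta1, theta2)
          for (epoch in 1:10) {
            for (i in sample(n)) {                  # reshuffle the observations each epoch
              xi <- X[i, ]
              theta <- theta - delta * 2 * xi * (sum(xi * theta) - y[i])   # one-observation update
              sgd_path <- rbind(sgd_path, theta)
            }
          }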

      4. Gradient Descent (GD)

        Here, the gradient is computed over the whole dataset at each iteration:

        θ ← θ − δ × 2Xᵀ(Xθ − y),

        with δ = 0.001 and 50 iterations.

        This yields a much smoother and more regular trajectory than SGD.
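
        A corresponding sketch of the batch loop (δ = 0.001, 50 iterations); the (0, 0) starting point is assumed here, since the summary only states it for SGD:

          delta_gd <- 0.001
          theta <- c(0, 0)
          gd_path <- matrix(theta, nrow = 1)
          for (iter in 1:50) {
            grad <- 2 * as.vector(t(X) %*% (X %*% theta - y))   # gradient over the whole dataset
            theta <- theta - delta_gd * grad
            gd_path <- rbind(gd_path, theta)
          }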

      5. Visual comparison

        The final contour plot shows both optimization paths:

        – red line: stochastic gradient descent

        – black line: batch gradient descent

        Both methods move toward the same minimum (close to θ* = (2, −1)), but SGD fluctuates around the valley of the loss function, whereas GD follows a smooth, deterministic path.
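
        A sketch of how such a comparison plot can be produced from the objects built above, using the colours given in the summary (red for SGD, black for batch GD):

          contour(theta1_grid, theta2_grid, loss,
                  xlab = expression(theta[1]), ylab = expression(theta[2]),
                  main = "SGD (red) vs. batch GD (black)")
          lines(sgd_path[, 1], sgd_path[, 2], col = "red")
          lines(gd_path[, 1], gd_path[, 2], col = "black", lwd = 2)
          points(theta_star[1], theta_star[2], pch = 4)       # mark the true parameters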

      This example demonstrates how SGD and GD explore the parameter space differently while aiming to minimize the same quadratic loss function.