HAX912X - Generalized Linear Model / Machine Learning
Section summary
-
-
-
In R, contrasts are used to handle categorical variables (factors) in models like linear regression. Categorical variables need to be encoded numerically so they can be included in models, and contrasts define how the levels of a factor are represented as numeric values.
When you fit a model in R with a categorical predictor (factor), R automatically creates dummy variables to represent the different levels of the factor. The type of contrast determines how these dummy variables are encoded and how comparisons between factor levels are made in the model. This R script demonstrates various ways of encoding categorical variables and fitting linear models using the lm() function. It also shows how to manipulate factor levels and contrasts when building regression models.
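A minimal Python analogue of these ideas, using statsmodels formulas rather than the course's R code (the data frame, level names, and values below are illustrative); C(group, Treatment) mirrors R's default contr.treatment and C(group, Sum) mirrors contr.sum:

```python
# Dummy (treatment) vs. sum contrasts for a categorical predictor, in Python.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=90),  # a factor with 3 levels
    "y": rng.normal(size=90),
})

# Treatment coding: intercept = mean of reference level "A",
# other coefficients = differences from that reference.
fit_treat = smf.ols("y ~ C(group, Treatment(reference='A'))", data=df).fit()

# Sum (deviation) coding: intercept = mean of the level means,
# coefficients = deviations of the non-dropped levels from it.
fit_sum = smf.ols("y ~ C(group, Sum)", data=df).fit()

print(fit_treat.params)
print(fit_sum.params)
```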
-
-
The provided R script simulates, analyzes, and models a synthetic dataset to explore linear and regularization-based regression techniques. It covers creating the dataset, fitting several models, and evaluating their performance. Below is a commentary on the operations performed by each major block of the script:
1. Setting the Seed for Reproducibility:
- set.seed(345678) ensures that the results of random processes (like data generation) are consistent every time the script runs.
2. Data Generation:
- The script generates a synthetic dataset with 50 observations (`n`). It includes three normally distributed predictor variables (x1, x2, x3), a noise variable (u) with a specific variance, and a response variable (y) determined by a linear combination of these predictors plus noise.
- Additional features (z variables) are generated uniformly, adding 30 extra predictors and increasing the dimensionality of the dataset.
3. Data Scaling and Test Set Preparation:
- The predictors are scaled to have zero mean and unit variance. This standardization is crucial for many modeling techniques, especially when regularization is involved.
- A test dataset of 1000 observations is generated similarly to the training set.
4. Linear Regression Modeling:
- A linear model (model1) is fitted using all predictors on the original data, and predictions are made on the test set. Root Mean Squared Error (RMSE) is computed to evaluate model performance.
- Another linear model (model1.scale) is fitted on the scaled data, with performance also evaluated via RMSE.
5. Reduced Linear Regression:
- A simpler model (model2.scale) using only x1, x2, and x3 as predictors is fitted to the scaled data and evaluated, allowing comparison against more complex models to see if additional predictors genuinely improve model performance.
6. Oracle Model Evaluation:
- An oracle model uses the true parameters used in generating y to calculate its predictions, serving as a baseline for the best possible performance any model could achieve.
7. Regularization Models:
- Lasso and Ridge regression models are fitted using the glmnet package, with a range of lambda values to find the optimal regularization strength.
- These models are plotted and tuned using cross-validation (train function from caret package), aiming to optimize the lambda parameter based on RMSE.
8. Model Tuning and Validation:
- The best performing lambda values for both Lasso and Ridge models are identified, and the models are re-fit using these optimal parameters.
- Predictions are made on the scaled test set, and performance is again assessed via RMSE.
9. Visualization and Analysis:
- The script includes commands to plot the coefficients of the fitted models, helping visualize how regularization affects the impact of different predictors.
10. Advanced Techniques:
- The use of repeated cross-validation (with parameters such as number of folds and repeats) helps ensure that the model evaluation is robust and reliable.
- Detailed results from model tuning are displayed, providing insights into how model performance varies with lambda. This script illustrates practical data science tasks, including data preparation, model fitting, and performance evaluation, particularly in scenarios where the dimensionality of predictors varies significantly. It also showcases the importance of regularization techniques in preventing overfitting and improving prediction accuracy on unseen data; a Python analogue of the same workflow is sketched below.
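A hedged Python analogue of this R workflow, using scikit-learn in place of glmnet/caret (the true coefficients, noise level, and alpha grid below are illustrative assumptions):

```python
# Simulate signal + noise predictors, scale, then compare OLS, lasso and ridge
# (regularization strength chosen by cross-validation) via RMSE on a test set.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(345678)

def simulate(n):
    X_signal = rng.normal(size=(n, 3))     # x1, x2, x3: informative predictors
    Z_noise = rng.uniform(size=(n, 30))    # 30 extra, irrelevant predictors
    beta = np.array([3.0, -2.0, 1.0])      # "true" coefficients (illustrative)
    y = X_signal @ beta + rng.normal(scale=1.0, size=n)
    return np.hstack([X_signal, Z_noise]), y

X_train, y_train = simulate(50)
X_test, y_test = simulate(1000)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "ols": LinearRegression(),
    "lasso": LassoCV(cv=5),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test_s)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.3f}")
```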
-
This Python script primarily demonstrates various regression techniques for modeling the relationship between vehicle speed and braking distance, utilizing a dataset loaded from a file named freinage.txt. Here's a breakdown of the script's key components and operations:
1. Imports and Data Loading:
- Several Python libraries are imported, including numpy for data manipulation, sklearn for machine learning models, and matplotlib for plotting.
- The working directory is changed to a specified path where the data file is located, and the dataset is loaded into a numpy array.
2. Data Preparation:
- The data is split into features (X_data) and target (Y_data).
- StandardScaler is used to standardize the speed feature, which is crucial for many machine learning algorithms to perform well.
3. Linear Regression:
- A simple linear regression model is fit to the standardized data.
- The fit model is used to predict the braking distance over a range of standardized speeds, and the result is plotted.
4. Polynomial Regression:
- Polynomial features of degrees 2 and 6 are generated from the speed data.
- Linear regression models are fit to these polynomial features to capture more complex relationships between speed and braking distance.
- Predictions from these models are plotted to visualize how they compare to the linear model.
5. Ridge Regression:
- Multiple ridge regression models with different regularization strengths (alpha) are fit using polynomial features of degree 6.
- This regularization can help prevent overfitting by shrinking the coefficients.
- The effectiveness of each alpha value is visualized through different plots.
6. Cross-Validation and Model Selection:
- Ridge regression models with polynomial features of degrees 1, 2, and 6 are evaluated using cross-validation to determine the best alpha value.
- Lasso regression is also applied with a similar approach to ridge, using LassoCV to automatically find the best alpha.
- Performance metrics (Mean Squared Error) and optimal alpha values are printed for each model.
7. Visualization:
- Multiple plots are created throughout the script to visualize the data alongside the predictions from various models. This helps in comparing how well each model captures the relationship between speed and braking distance.
8. Advanced Model Evaluation:
- The script includes advanced techniques like cross-validation scoring and grid search for hyperparameter tuning in Lasso regression.
- The best parameters and model coefficients are printed, providing insights into the model's performance and configuration.
This script is a demonstration of applying and comparing multiple linear and polynomial regression models, along with regularization techniques like Ridge and Lasso, to understand and predict the relationship between vehicle speed and braking distance effectively.
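A compact sketch of this pipeline (polynomial features, ridge with cross-validated alpha, cross-validated MSE per degree); it assumes freinage.txt holds two whitespace-separated columns, speed then braking distance:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

data = np.loadtxt("freinage.txt")          # assumed layout: speed, distance
X, y = data[:, [0]], data[:, 1]

for degree in (1, 2, 6):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        RidgeCV(alphas=np.logspace(-4, 4, 50)),
    )
    # 5-fold cross-validated MSE for this polynomial degree
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    model.fit(X, y)
    best_alpha = model.named_steps["ridgecv"].alpha_
    print(f"degree {degree}: CV MSE = {mse:.2f}, best alpha = {best_alpha:.4g}")
```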
-
-
Binary Cross-Entropy with Logits and Class Weights
The BCEWithLogitsLoss function in PyTorch is a numerically stable implementation of the binary cross-entropy (BCE) loss.
It combines a Sigmoid activation and the BCE computation in a single operation, which avoids numerical instabilities for large positive or negative logits.
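A minimal PyTorch illustration of both points (the logits, targets, and pos_weight value are arbitrary):

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.5, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss on raw logits equals sigmoid + BCELoss, but is computed
# in a numerically stable way (no overflow for large |logits|).
print(nn.BCEWithLogitsLoss()(logits, targets).item())
print(nn.BCELoss()(torch.sigmoid(logits), targets).item())

# pos_weight > 1 makes errors on the positive class more costly (class imbalance).
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
print(weighted(logits, targets).item())
```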
-
https://www.kaggle.com/mlg-ulb/creditcardfraud
This Python script builds several classifiers for the highly imbalanced “creditcard” dataset and compares them on held-out data. It covers preprocessing, logistic regression (with different imbalance strategies), random forests, and two PyTorch models (a logistic regression and a small MLP) trained with a class-imbalance-aware loss.
-
Setup and data loading
– Resets variables (IPython “%reset”)
– Imports NumPy, pandas, scikit-learn utilities, and later PyTorch
– Loads the CSV file creditcard.csv into a pandas DataFrame
– Splits columns into features X (all columns except “Class”) and target y = “Class”
– Prints class counts to show the severe imbalance (very few frauds)
-
Feature scaling
– Fits a StandardScaler on X (mean 0, variance 1) and transforms X to float32
Note: in a fully rigorous pipeline, the scaler should be fit on the training set only, then applied to the test set
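A small sketch of that note, with a placeholder feature matrix standing in for the creditcard features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 30))          # placeholder features
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train).astype(np.float32)  # mean/var learned on train only
X_test_s = scaler.transform(X_test).astype(np.float32)        # same statistics reused on test
```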
-
Random under-sampling to create a balanced subset
– Counts the number of fraud cases and gets their indices
– Randomly selects the same number of non-fraud indices
– Concatenates both to form a balanced subset (under_sample_data)
– Extracts X_undersample and y_undersample from that subset
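A hedged sketch of the under-sampling step on a small placeholder DataFrame with a binary Class column (variable names follow the description, not necessarily the exact script):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({                                   # tiny stand-in dataset
    "V1": rng.normal(size=1000),
    "Class": (rng.uniform(size=1000) < 0.02).astype(int),
})

fraud_idx = data.index[data["Class"] == 1]
normal_idx = data.index[data["Class"] == 0]
sampled_normal_idx = rng.choice(normal_idx.to_numpy(),
                                size=len(fraud_idx), replace=False)

under_sample_data = data.loc[np.concatenate([fraud_idx, sampled_normal_idx])]
X_undersample = under_sample_data.drop(columns="Class")
y_undersample = under_sample_data["Class"]
print(y_undersample.value_counts())                     # now balanced
```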
-
Train/test splits
– Splits the full data (X, y) into train and test (70/30)
– Splits the balanced subset similarly
– Prints class counts in each split for both the original and the undersampled data
-
Logistic regression (scikit-learn)
(a) Baseline on the original imbalanced data:
– Trains LogisticRegression without regularization
– Predicts on X_test and prints a confusion table (y_test vs predictions)
(b) Class-weighted logistic regression:
– Trains LogisticRegression with class_weight=“balanced” (the minority class is up-weighted automatically)
– Predicts on X_test and prints the confusion table
(c) Logistic regression on the undersampled (balanced) subset:
– Trains LogisticRegression on X_train_undersample, y_train_undersample, with more iterations and a Newton solver.
– Predicts on X_test_undersample and prints the confusion table
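A sketch of variants (a) and (b) on a synthetic imbalanced dataset (make_classification stands in for creditcard.csv; penalty=None requires scikit-learn 1.2 or later):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.98],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

# (a) baseline: unweighted, unregularized fit on the imbalanced data
base = LogisticRegression(penalty=None, max_iter=1000).fit(X_train, y_train)
# (b) class-weighted: the rare positive class is up-weighted automatically
weighted = LogisticRegression(penalty=None, class_weight="balanced",
                              max_iter=1000).fit(X_train, y_train)

for name, clf in [("baseline", base), ("class-weighted", weighted)]:
    print(name)
    print(confusion_matrix(y_test, clf.predict(X_test)))
```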
-
Random forest (scikit-learn)
(a) Trains a RandomForestClassifier on the original imbalanced training set (100 trees, max_features ≈ sqrt(30), OOB enabled, parallel jobs)
– Predicts on X_test and prints the confusion table
(b) Trains the same RandomForestClassifier on the undersampled training set
– Predicts on X_test_undersample and prints the confusion table
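A sketch of the random-forest configuration named above (100 trees, sqrt of the number of features per split, OOB score, parallel jobs), again on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.98],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)
print(confusion_matrix(y_test, rf.predict(X_test)))
```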
-
PyTorch setup
– Selects the Apple Silicon Metal backend device (“mps”) if available
– Converts train/test arrays to PyTorch tensors; moves the test tensors to the device
– Builds a TensorDataset and DataLoader for the training data (batch size 1024, shuffled)
– Computes pos_weight = (#negatives / #positives) from y_train. This ratio is used by the loss to counter the imbalance
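A sketch of this setup with placeholder arrays (falling back to CPU when the MPS backend is unavailable is an added assumption):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 30)).astype(np.float32)       # placeholder features
y_train = (rng.uniform(size=2000) < 0.02).astype(np.float32)   # rare positive class

X_train_t = torch.from_numpy(X_train)
y_train_t = torch.from_numpy(y_train).unsqueeze(1)             # shape (n, 1) for BCE

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t),
                          batch_size=1024, shuffle=True)

# ratio #negatives / #positives, later passed to BCEWithLogitsLoss(pos_weight=...)
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
pos_weight = torch.tensor([neg / pos], dtype=torch.float32)
print("pos_weight:", pos_weight.item())
```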
-
Generic PyTorch training loop (train_model)
– Moves the model to device and defines Adam optimizer
– Uses BCEWithLogitsLoss(pos_weight=pos_weight): this expects raw logits and internally applies a stable sigmoid + binary cross-entropy; pos_weight increases the penalty for misclassifying the positive class
– For a given number of epochs:
• Iterates over mini-batches: forward pass → compute loss → backpropagate → optimizer step
• Tracks mean training loss per epoch
• Periodically evaluates test loss on the full test set (still with BCEWithLogitsLoss)
– Returns the trained model
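A hedged sketch of a train_model-style loop matching this description (the exact signature, logging cadence, and device handling are illustrative):

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, X_test_t, y_test_t, pos_weight,
                epochs=50, lr=1e-3, weight_decay=0.0, device="cpu"):
    model.to(device)
    X_test_t, y_test_t = X_test_t.to(device), y_test_t.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
    for epoch in range(epochs):
        model.train()
        batch_losses = []
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)        # model outputs raw logits
            loss.backward()
            optimizer.step()
            batch_losses.append(loss.item())
        if epoch % 10 == 0:                        # periodic test-loss check
            model.eval()
            with torch.no_grad():
                test_loss = criterion(model(X_test_t), y_test_t).item()
            print(f"epoch {epoch}: train {sum(batch_losses)/len(batch_losses):.4f}, "
                  f"test {test_loss:.4f}")
    return model
```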
-
PyTorch logistic regression
– Defines a single Linear layer with output size 1 (equivalent to logistic regression without explicit sigmoid on the last layer, because the loss uses logits)
– Trains it with train_model (50 epochs, lr=1e-3)
– Switches to eval mode, computes logits on X_test_t, applies sigmoid, thresholds at 0.5 to obtain class predictions, and prints the confusion table against y_test
-
PyTorch MLP
– Defines a small feed-forward network: Linear → GELU → Linear → GELU → Linear → GELU → Linear(→1)
– Trains it with weight decay for mild regularization (80 epochs, lr=1e-3, weight_decay=1e-5)
– Evaluates as above: logits → sigmoid → threshold 0.5 → confusion table
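A sketch of the MLP and of the logits → sigmoid → threshold evaluation path (hidden-layer widths are illustrative; the summary only specifies the Linear/GELU pattern and the single-logit output):

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(30, 64), nn.GELU(),
    nn.Linear(64, 32), nn.GELU(),
    nn.Linear(32, 16), nn.GELU(),
    nn.Linear(16, 1),               # raw logit, paired with BCEWithLogitsLoss in training
)

mlp.eval()
X_test_t = torch.randn(5, 30)       # placeholder test batch
with torch.no_grad():
    probs = torch.sigmoid(mlp(X_test_t)).squeeze(1)
    preds = (probs > 0.5).long()    # threshold at 0.5 gives class labels
print(preds)
```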
What the script demonstrates
– Several standard ways to handle class imbalance: raw training on imbalanced data, class weighting, and random under-sampling
– Comparison of linear (logistic regression) and non-linear (random forest, MLP) models
– How to train PyTorch classifiers for imbalanced binary classification using BCEWithLogitsLoss with a positive-class weight, and how to evaluate them via a confusion table on a held-out test set
-
-
-
This R script illustrates the difference between stochastic gradient descent (SGD) and gradient descent (GD) in a simple linear regression problem.
-
Synthetic data generation
The dataset is simulated according to the model
y = X θ* + ε
where X is a 100×2 matrix of Gaussian random variables, θ* = (2, −1), and ε is Gaussian noise with standard deviation 0.5.
-
Loss surface
A grid of values for θ₁ and θ₂ is created around the true parameters.
For each pair (θ₁, θ₂), the script computes the sum of squared residuals
L(θ₁, θ₂) = Σ (xᵢᵀ θ − yᵢ)².
A contour plot displays the loss surface in the (θ₁, θ₂) plane.
-
Stochastic Gradient Descent (SGD)
Starting from θ = (0,0), the algorithm updates the parameters after each observation:
θ ← θ − δ × 2xᵢ(xᵢᵀθ − yᵢ),
with learning rate δ = 0.05.
This is repeated for 10 epochs, randomly shuffling the data at each pass.
The parameter trajectories (θ₁, θ₂) are stored and plotted.
SGD produces a noisy trajectory but usually converges quickly toward the minimum.
-
Gradient Descent (GD)
Here, the gradient is computed over the whole dataset at each iteration:
θ ← θ − δ × 2Xᵀ(Xθ − y),
with δ = 0.001 and 50 iterations.
This yields a much smoother and more regular trajectory than SGD.
-
Visual comparison
The final contour plot shows both optimization paths:
– red line: stochastic gradient descent
– black line: batch gradient descent
Both methods move toward the same minimum (close to θ* = (2, −1)), but SGD fluctuates around the valley of the loss function, whereas GD follows a smooth, deterministic path.
This example demonstrates how SGD and GD explore the parameter space differently while aiming to minimize the same quadratic loss function.
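A compact NumPy analogue of this comparison, keeping the model, learning rates, iteration counts, and the (0, 0) starting point described above (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
theta_star = np.array([2.0, -1.0])
y = X @ theta_star + rng.normal(scale=0.5, size=n)

# Stochastic gradient descent: one observation per update, 10 shuffled epochs
theta_sgd, delta = np.zeros(2), 0.05
for _ in range(10):
    for i in rng.permutation(n):
        theta_sgd -= delta * 2 * X[i] * (X[i] @ theta_sgd - y[i])

# Batch gradient descent: full-data gradient, 50 iterations
theta_gd, delta_gd = np.zeros(2), 0.001
for _ in range(50):
    theta_gd -= delta_gd * 2 * X.T @ (X @ theta_gd - y)

print("SGD estimate:", theta_sgd)
print("GD  estimate:", theta_gd)
# Both end up near theta* = (2, -1); storing the successive iterates instead of
# overwriting them would reproduce the two trajectories on the contour plot.
```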
-