Section summary

    • Binary Cross-Entropy with Logits and Class Weights

      The BCEWithLogitsLoss function in PyTorch is a numerically stable implementation of the binary cross-entropy (BCE) loss

      It combines a Sigmoid activation and the BCE computation in a single operation, which avoids numerical instabilities for large positive or negative logits
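      A toy sketch of this point (not taken from the script): for large-magnitude logits, applying a Sigmoid and then BCELoss saturates, while BCEWithLogitsLoss computes the same quantity stably from the raw logits. The pos_weight value shown is purely illustrative.

```python
import torch
import torch.nn as nn

# Toy example: the fused loss takes raw logits; the two-step version saturates.
logits = torch.tensor([-50.0, 0.0, 50.0])
targets = torch.tensor([1.0, 1.0, 0.0])

fused = nn.BCEWithLogitsLoss()(logits, targets)        # stable, computed from logits
naive = nn.BCELoss()(torch.sigmoid(logits), targets)   # sigmoid(50.) rounds to exactly 1.0
print(fused.item(), naive.item())                      # the naive value is distorted
                                                       # (PyTorch clamps log(0) to avoid -inf)

# With class weighting: pos_weight multiplies the positive-class term of the loss.
# The value 10.0 is illustrative; the script derives #negatives / #positives instead.
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(10.0))
```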

    • https://www.kaggle.com/mlg-ulb/creditcardfraud

      This Python script builds several classifiers for the highly imbalanced “creditcard” dataset and compares them on held-out data. It covers preprocessing, logistic regression (with different imbalance strategies), random forests, and two PyTorch models (a logistic regression and a small MLP) trained with a class-imbalance-aware loss.

      1. Setup and data loading

        – Resets variables (IPython “%reset”)

        – Imports NumPy, pandas, scikit-learn utilities, and later PyTorch

        – Loads the CSV file creditcard.csv into a pandas DataFrame

        – Splits columns into features X (all columns except “Class”) and target y = “Class”

        – Prints class counts to show the severe imbalance (very few frauds)
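      A minimal sketch of step 1, assuming creditcard.csv is in the working directory (variable names are illustrative):

```python
import pandas as pd

# Load the Kaggle credit card fraud data and separate features from the target.
data = pd.read_csv("creditcard.csv")
X = data.drop(columns=["Class"])   # 30 feature columns: Time, V1..V28, Amount
y = data["Class"]                  # 1 = fraud, 0 = legitimate
print(y.value_counts())            # shows the severe class imbalance
```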

      2. Feature scaling

        – Fits a StandardScaler on X (zero mean, unit variance), transforms X, and casts the result to float32

        Note: in a fully rigorous pipeline, the scaler should be fit on the training set only, then applied to the test set
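      A sketch of step 2 as described; the comment notes the leakage-free variant mentioned in the note above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize all of X and cast to float32, as the script does. The leakage-free
# variant would fit the scaler on the training rows only (after the split in step 4)
# and then apply scaler.transform to both splits.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X).astype(np.float32)
```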

      3. Random under-sampling to create a balanced subset

        – Counts the number of fraud cases and gets their indices

        – Randomly selects the same number of non-fraud indices

        – Concatenates both to form a balanced subset (under_sample_data)

        – Extracts X_undersample and y_undersample from that subset
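      A possible sketch of step 3 (the random seed and helper names are assumptions):

```python
import numpy as np

# Index-based under-sampling. Because the DataFrame keeps its default RangeIndex,
# the same indices can be used positionally on the scaled feature matrix.
fraud_idx = np.asarray(data[data["Class"] == 1].index)
normal_idx = np.asarray(data[data["Class"] == 0].index)

rng = np.random.default_rng(0)
picked_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)
under_idx = np.concatenate([fraud_idx, picked_normal_idx])

under_sample_data = data.loc[under_idx]      # balanced subset (frauds + sampled normals)
X_undersample = X_scaled[under_idx]          # scaled features of the balanced subset
y_undersample = y.iloc[under_idx]
```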

      4. Train/test splits

        – Splits the full data (X, y) into train and test (70/30)

        – Splits the balanced subset similarly

        – Prints class counts in each split for both the original and the undersampled data
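      A sketch of step 4, reusing the arrays from the previous sketches; random_state is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 70/30 splits of the full data and of the balanced subset.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0)
(X_train_undersample, X_test_undersample,
 y_train_undersample, y_test_undersample) = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print(y_train.value_counts(), y_test.value_counts())
print(y_train_undersample.value_counts(), y_test_undersample.value_counts())
```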

      5. Logistic regression (scikit-learn)

        (a) Baseline on the original imbalanced data:

        – Trains LogisticRegression without regularization

        – Predicts on X_test and prints a confusion table (y_test vs predictions)

        (b) Class-weighted logistic regression:

        – Trains LogisticRegression with class_weight=“balanced” (the minority class is up-weighted automatically)

        – Predicts on X_test and prints the confusion table

        (c) Logistic regression on the undersampled (balanced) subset:

        – Trains LogisticRegression on X_train_undersample, y_train_undersample, with more iterations and a Newton solver

        – Predicts on X_test_undersample and prints the confusion table
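      A hedged sketch of the three variants in step 5; the iteration counts, penalty=None (recent scikit-learn API), and the newton-cg solver choice are assumptions beyond what the summary states, and the confusion table is printed with pandas crosstab:

```python
from sklearn.linear_model import LogisticRegression
import pandas as pd

# (a) baseline on the imbalanced data; "without regularization" is taken as penalty=None
lr_plain = LogisticRegression(penalty=None, max_iter=1000).fit(X_train, y_train)
print(pd.crosstab(y_test.to_numpy(), lr_plain.predict(X_test),
                  rownames=["true"], colnames=["pred"]))

# (b) automatic up-weighting of the minority class
lr_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(pd.crosstab(y_test.to_numpy(), lr_weighted.predict(X_test),
                  rownames=["true"], colnames=["pred"]))

# (c) trained on the balanced subset; newton-cg is one possible Newton solver
lr_under = LogisticRegression(solver="newton-cg", max_iter=5000).fit(
    X_train_undersample, y_train_undersample)
print(pd.crosstab(y_test_undersample.to_numpy(), lr_under.predict(X_test_undersample),
                  rownames=["true"], colnames=["pred"]))
```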

      6. Random forest (scikit-learn)

        (a) Trains a RandomForestClassifier on the original imbalanced training set (100 trees, max_features ≈ sqrt(30), OOB enabled, parallel jobs)

        – Predicts on X_test and prints the confusion table

        (b) Trains the same RandomForestClassifier on the undersampled training set

        – Predicts on X_test_undersample and prints the confusion table
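      A sketch of step 6 with the settings named above (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(
    n_estimators=100,        # 100 trees
    max_features="sqrt",     # ≈ sqrt(30) features considered at each split
    oob_score=True,          # out-of-bag estimate enabled
    n_jobs=-1,               # parallel jobs
    random_state=0,          # assumption, for reproducibility
)
rf.fit(X_train, y_train)     # (a) original imbalanced training set
print(pd.crosstab(y_test.to_numpy(), rf.predict(X_test),
                  rownames=["true"], colnames=["pred"]))

rf_under = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                  oob_score=True, n_jobs=-1, random_state=0)
rf_under.fit(X_train_undersample, y_train_undersample)   # (b) balanced subset
print(pd.crosstab(y_test_undersample.to_numpy(), rf_under.predict(X_test_undersample),
                  rownames=["true"], colnames=["pred"]))
```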

      7. PyTorch setup

        – Selects the Apple Silicon Metal backend device (“mps”) if available

        – Converts train/test arrays to PyTorch tensors; moves the test tensors to the device

        – Builds a TensorDataset and DataLoader for the training data (batch size 1024, shuffled)

        – Computes pos_weight = (#negatives / #positives) from y_train. This ratio is used by the loss to counter the imbalance
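      Step 7 could look roughly like this sketch (the CPU fallback and the exact tensor handling are assumptions):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Pick the Apple Silicon MPS device when available.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Training tensors stay on CPU (batches are moved in the loop); test tensors go to the device.
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)
X_test_t = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_t = torch.tensor(y_test.to_numpy(), dtype=torch.float32).unsqueeze(1).to(device)

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=1024, shuffle=True)

# Positive-class weight = #negatives / #positives in the training labels.
pos_weight = ((y_train_t == 0).sum() / (y_train_t == 1).sum()).to(device)
```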

      8. Generic PyTorch training loop (train_model)

        – Moves the model to device and defines Adam optimizer

        – Uses BCEWithLogitsLoss(pos_weight=pos_weight): this expects raw logits and internally applies a stable sigmoid + binary cross-entropy; pos_weight increases the penalty for misclassifying the positive class

        – For a given number of epochs:

        • Iterates over mini-batches: forward pass → compute loss → backpropagate → optimizer step

        • Tracks mean training loss per epoch

        • Periodically evaluates test loss on the full test set (still with BCEWithLogitsLoss)

        – Returns the trained model
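      A sketch of a train_model function matching the description in step 8; the signature and the evaluation cadence are assumptions:

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, X_test_t, y_test_t, pos_weight, device,
                epochs=50, lr=1e-3, weight_decay=0.0):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # expects raw logits

    for epoch in range(epochs):
        model.train()
        running = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)   # forward pass on a mini-batch
            loss.backward()                   # backpropagation
            optimizer.step()
            running += loss.item()
        if (epoch + 1) % 10 == 0:             # periodic test-loss check
            model.eval()
            with torch.no_grad():
                test_loss = criterion(model(X_test_t), y_test_t).item()
            print(f"epoch {epoch+1}: train {running/len(train_loader):.4f}  test {test_loss:.4f}")
    return model
```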

      9. PyTorch logistic regression

        – Defines a single Linear layer with output size 1 (equivalent to logistic regression; no explicit sigmoid is needed on the output because the loss operates on raw logits)

        – Trains it with train_model (50 epochs, lr=1e-3)

        – Switches to eval mode, computes logits on X_test_t, applies sigmoid, thresholds at 0.5 to obtain class predictions, and prints the confusion table against y_test
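      A sketch of step 9, reusing the train_model sketch above:

```python
import torch
import torch.nn as nn
import pandas as pd

# A single Linear layer paired with BCEWithLogitsLoss is logistic regression:
# the sigmoid lives inside the loss during training.
logreg = nn.Linear(X_train_t.shape[1], 1)
logreg = train_model(logreg, train_loader, X_test_t, y_test_t, pos_weight, device,
                     epochs=50, lr=1e-3)

logreg.eval()
with torch.no_grad():
    probs = torch.sigmoid(logreg(X_test_t))             # logits -> probabilities
preds = (probs >= 0.5).int().cpu().numpy().ravel()      # threshold at 0.5
print(pd.crosstab(y_test.to_numpy(), preds, rownames=["true"], colnames=["pred"]))
```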

      10. PyTorch MLP

        – Defines a small feed-forward network: Linear → GELU → Linear → GELU → Linear → GELU → Linear(→1)

        – Trains it with weight decay for mild regularization (80 epochs, lr=1e-3, weight_decay=1e-5)

        – Evaluates as above: logits → sigmoid → threshold 0.5 → confusion table
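      A sketch of step 10; the hidden-layer widths are assumptions, since the summary only gives the Linear → GELU pattern:

```python
import torch.nn as nn

# Small feed-forward network ending in a single raw logit.
mlp = nn.Sequential(
    nn.Linear(X_train_t.shape[1], 64), nn.GELU(),
    nn.Linear(64, 32), nn.GELU(),
    nn.Linear(32, 16), nn.GELU(),
    nn.Linear(16, 1),
)
mlp = train_model(mlp, train_loader, X_test_t, y_test_t, pos_weight, device,
                  epochs=80, lr=1e-3, weight_decay=1e-5)
# Evaluation then follows the same logits -> sigmoid -> 0.5 threshold recipe as above.
```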

      What the script demonstrates

      – Several standard ways to handle class imbalance: raw training on imbalanced data, class weighting, and random under-sampling

      – Comparison of linear (logistic regression) and non-linear (random forest, MLP) models

      – How to train PyTorch classifiers for imbalanced binary classification using BCEWithLogitsLoss with a positive-class weight, and how to evaluate them via a confusion table on a held-out test set