Methods
Tabular pipeline, features, models, leakage experiment, diagnostics.
Beyond accuracy on “online shoppers”: an end-to-end tabular pipeline with leakage in mind
Goh Siang En Luke · Technical note (AI Singapore AIAP Batch 23 style assessment)
Dataset: e-commerce session features → binary purchase outcome. Code: companion Python repo aiap23-goh-siang-en-luke-963A-recreate; run with bash run.sh after placing data/online_shopping.db.
What I built
I implemented a reproducible sklearn pipeline over the SQLite source: explicit data cleaning, a small set of engineered features, and four classifiers tuned with stratified cross-validation optimising F1-weighted. On top of that sit a deliberate leakage experiment (dropping PageValue and its derived flag), precision–recall and threshold analysis, learning curves, and a persisted Pipeline for serving. Throughout, I cared about which features would actually be available at scoring time, not only headline accuracy on an imbalanced target.
The problem in one paragraph
Each row is a session. The target is whether the visit ended in a purchase (binary). The class distribution is roughly 15% positive (about 5.5:1 imbalance). In that regime, accuracy is a weak primary metric: a naive “always no purchase” baseline can look deceptively good. I therefore treat F1-weighted as the main objective during model selection, while still logging accuracy and ROC-AUC for context.
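To make that metric choice concrete, here is a small illustrative check (not repo code) of how an always-negative baseline scores under accuracy versus weighted F1 at roughly 15% positives:

```python
# Illustrative check (not repo code): why accuracy flatters an always-negative
# baseline at ~15% positives, while weighted F1 penalises it.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.15).astype(int)   # ~15% purchases
y_naive = np.zeros_like(y_true)                    # always predict "no purchase"

acc = accuracy_score(y_true, y_naive)                                 # ~0.85
f1w = f1_score(y_true, y_naive, average="weighted", zero_division=0)  # ~0.78
print(f"accuracy={acc:.3f}  f1_weighted={f1w:.3f}")
```

The gap widens further for the positive class alone, whose F1 is exactly zero here; weighted F1 keeps that failure visible in the headline number.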
Data loading and cleaning (before any model hype)
Data live in SQLite (online_shopping table), loaded in data_loader.py. Cleaning in preprocessing.clean_data is deliberately boring and important:
- Drop duplicate rows.
- Normalise dirty CustomerType strings (empty, "nan", casing drift) into a small canonical set.
- Fix negative GeographicRegion values as an encoding artefact (absolute value).
- Clamp negative BounceRate and ProductPageTime to zero; cap extreme ProductPageTime at the 99th percentile to limit tail leverage.
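The cleaning steps above can be sketched as follows; the function name mirrors preprocessing.clean_data, but the exact signature and column handling here are assumptions, not the repo's code:

```python
# Sketch of the cleaning steps, assuming these column names and dtypes.
import numpy as np
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().copy()

    # Canonicalise dirty CustomerType strings ("", "nan", casing drift).
    ct = df["CustomerType"].astype(str).str.strip().str.lower()
    df["CustomerType"] = ct.replace({"": np.nan, "nan": np.nan})

    # Negative GeographicRegion codes are an encoding artefact.
    df["GeographicRegion"] = df["GeographicRegion"].abs()

    # Clamp impossible negatives; cap the dwell-time tail at the 99th percentile.
    df["BounceRate"] = df["BounceRate"].clip(lower=0)
    df["ProductPageTime"] = df["ProductPageTime"].clip(lower=0)
    cap = df["ProductPageTime"].quantile(0.99)
    df["ProductPageTime"] = df["ProductPageTime"].clip(upper=cap)
    return df
```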
EDA (see eda.ipynb) also motivates imputation strategy: I analysed missingness patterns so imputation does not leak target information through careless fill rules. The modelling path uses sklearn’s column transformer (median for numeric, mode + one-hot for categoricals) inside a Pipeline, so train and test follow the same transformations.
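A minimal sketch of that shared preprocessing path, with illustrative column lists (the repo's actual configuration may differ):

```python
# Sketch: median imputation + scaling for numerics, mode imputation + one-hot
# for categoricals, all inside one Pipeline so train and test transform alike.
# Column lists are illustrative assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["BounceRate", "ExitRate", "ProductPageTime"]
categorical_cols = ["CustomerType", "GeographicRegion"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
```

Because the imputers are fitted inside the Pipeline, their statistics come from the training folds only, which is what keeps the fill rules leak-free.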
Feature engineering (three additions, all interpretable)
In preprocessing.engineer_features:
- HasPageValue: indicator that PageValue is non-zero.
- BounceExitInteraction: BounceRate × ExitRate, to capture "bounced hard and left aggressively" behaviour.
- LogProductPageTime: log1p(ProductPageTime), to tame right-skewed dwell time.
Each addition is a small, named hypothesis tied to session behaviour.
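In code, the three features might look like this (a sketch in the spirit of preprocessing.engineer_features; the exact repo signature is an assumption):

```python
# Sketch of the three engineered features described above.
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["HasPageValue"] = (df["PageValue"] != 0).astype(int)
    df["BounceExitInteraction"] = df["BounceRate"] * df["ExitRate"]
    df["LogProductPageTime"] = np.log1p(df["ProductPageTime"])
    return df
```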
Models and tuning: why four, and how they are compared
The brief asked for multiple models. I train Logistic Regression (class_weight='balanced'), Random Forest (class_weight='balanced'), Gradient Boosting, and XGBoost (with scale_pos_weight in the search grid). That is an intentional complexity ladder: linear baseline → bagging → boosting → regularised gradient boosting on tabular data.
Each model is wrapped as:
Pipeline([("preprocessor", …), ("classifier", …)])
Hyperparameters are tuned with RandomizedSearchCV, CV_FOLDS=5, stratified splits, scoring="f1_weighted", and n_iter capped sensibly against the grid size (see src/config.py). Nested parallelism is avoided where sklearn warns (e.g. forest / XGBoost n_jobs=1 inside search with n_jobs=-1 on the outer search).
Held-out evaluation uses a stratified train_test_split (TEST_SIZE=0.2, RANDOM_STATE=42).
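A condensed sketch of that tuning setup, shown for the random forest with an illustrative grid (see src/config.py for the real values):

```python
# Sketch of the tuning setup: stratified 5-fold CV inside RandomizedSearchCV,
# scored on weighted F1. Grid values here are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # ("preprocessor", preprocessor),  # the shared ColumnTransformer goes here
    ("classifier", RandomForestClassifier(class_weight="balanced", n_jobs=1)),
])

param_distributions = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,                   # capped against the 9-point grid
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,                  # parallelise the search, not the forest
    random_state=42,
)
```

Keeping the estimator's own n_jobs at 1 while the outer search fans out avoids the nested-parallelism warnings mentioned above.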
Results with “all features” (including PageValue)
On my last documented run (see README.md for the full table), XGBoost led on test F1 (weighted) at 0.8919, with Gradient Boosting essentially tied, Random Forest and Logistic Regression slightly behind. ROC-AUC is highest on Gradient Boosting in that snapshot, but the selection rule for “best” in the pipeline remains F1-weighted, because that aligns better with imbalance.
Accuracy sits around 0.89 for the top models, which is nice, but it is not the quantity I optimise.
The leakage experiment (the part I would lead with in an interview)
PageValue is deeply predictive in e-commerce sessions — and for good reason: it summarises value seen on pages that may already reflect post-hoc commercial context. For real-time “will this session convert?” scoring, I need to know whether PageValue (and anything derived from it) is always available and ethically/contractually safe to use at the moment of prediction.
So the pipeline runs twice:
- Full feature set (including PageValue and HasPageValue).
- Same pipeline, with PageValue and HasPageValue removed.
Documented outcome: best test F1 (weighted) falls from about 0.89 to about 0.79 (best model shifts toward Random Forest in that run). That is roughly a 0.10 drop — not a collapse, which tells me other behavioural features still carry signal, and it quantifies how much lift sat with the PageValue family in my run. I keep that A/B in the repo so the effect is explicit and reproducible.
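The A/B can be sketched like this; run_training is a hypothetical stand-in for the repo's training entry point, simplified to a single logistic regression instead of the full model ladder:

```python
# Sketch of the leakage A/B: one training routine run twice, once with all
# columns and once with the PageValue family dropped. run_training is a
# hypothetical stand-in, not the repo's actual entry point.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

LEAKY = ["PageValue", "HasPageValue"]

def run_training(X: pd.DataFrame, y, drop_cols=()):
    X = X.drop(columns=list(drop_cols))
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="weighted")

# f1_full = run_training(X, y)                      # all features
# f1_no_leak = run_training(X, y, drop_cols=LEAKY)  # PageValue family removed
```

The difference between the two returned scores is exactly the quantity the repo's A/B documents.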
Diagnostics beyond a single threshold
After training, the pipeline:
- Builds precision–recall curves and a threshold sweep on the best full-feature model (default 0.5 is not assumed optimal for imbalance).
- Plots learning curves for all four models to discuss bias–variance and data appetite.
- Persists the best estimator with joblib, so deployment is “raw rows in → same preprocessing + model out”.
Artifacts land under outputs/ when you run bash run.sh.
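A sketch of the threshold sweep and persistence step, with illustrative names:

```python
# Sketch: scan candidate thresholds on held-out scores instead of assuming
# 0.5, then persist the whole fitted Pipeline. Names (best_pipeline, the
# output path) are illustrative, not the repo's.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, scores):
    """Threshold maximising weighted F1 on held-out predicted probabilities."""
    thresholds = np.linspace(0.05, 0.95, 19)
    f1s = [f1_score(y_true, (scores >= t).astype(int),
                    average="weighted", zero_division=0)
           for t in thresholds]
    return float(thresholds[int(np.argmax(f1s))])

# import joblib
# joblib.dump(best_pipeline, "outputs/best_model.joblib")  # raw rows in, predictions out
```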
What I would do next in a real product
- Define the decision point: features and labels must align to the same timestamp (e.g. features available after N minutes on site vs at session end).
- Business-cost sensitivity: pick threshold from PR curve using FP/FN cost, not F1 alone.
- Calibration and monitoring: track score drift and cohort stability; retrain on a cadence tied to seasonality.
- Fairness review on GeographicRegion and similar fields if they influence interventions.
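The cost-sensitive threshold choice in the second point might look like this sketch, with purely illustrative FP/FN costs:

```python
# Sketch of cost-based threshold selection: minimise expected business cost
# instead of maximising F1. The FP/FN costs are illustrative assumptions.
import numpy as np

COST_FP = 1.0   # assumed cost of an unnecessary intervention
COST_FN = 5.0   # assumed cost of missing a likely purchaser

def cost_optimal_threshold(y_true, scores):
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = (scores >= t).astype(int)
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        costs.append(COST_FP * fp + COST_FN * fn)
    return float(thresholds[int(np.argmin(costs))])
```

With asymmetric costs like these, the chosen threshold generally drifts below the F1-optimal one, trading extra false positives for fewer missed purchasers.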
How to reproduce
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# place online_shopping.db under data/
bash run.sh
Read README.md for the authoritative tables, file map, and CI notes. Use eda.ipynb for charts and statistical tests behind the design choices.
Closing
In this work I controlled the evaluation on an imbalanced binary target, compared four tuned baselines on the same preprocessing, engineered a small interpretable feature set, and re-ran training after removing a suspicious feature family so the feature contract is stress-tested in numbers, not only in prose. The leakage subsection is the anchor of the analysis if I extend the pipeline further.