From notebook to service: how I would harden this pipeline for production
Goh Siang En Luke · Engineering note (companion to the tabular methods write-up)
Scope: architecture and operations only—no assessment brief or non-public dataset material.
What “production” means here
I separate training (heavy, rare, reproducible) from scoring (small, frequent, bounded latency). The repo today runs end-to-end locally with `bash run.sh`, persists `outputs/best_model.joblib`, and documents metrics under `outputs/`. Production is the layer around that: contracts, versions, and where code runs.
Preprocessing must travel with the contract
The saved `Pipeline` bundles the column transformer and classifier. `clean_data` and `engineer_features` run before that pipeline in `main.py`. For production scoring I would either:
- wrap cleaning + engineering + `Pipeline` in one serialisable object (e.g. a custom estimator or extra `Pipeline` stages), or
- ship a single Python module used by both training and inference so the order of steps cannot drift.
Drift between train and serve is the fastest way to ship a model that looks good offline and behaves wrong online.
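The first option can be sketched as a thin wrapper object. `clean_data` and `engineer_features` are the repo's existing functions; the class name and its exact shape here are illustrative, not repo code:

```python
import pandas as pd
from sklearn.base import BaseEstimator


class ScoringBundle(BaseEstimator):
    """Bundles cleaning, feature engineering, and the fitted Pipeline so that
    joblib.load yields one object exposing predict_proba for the API."""

    def __init__(self, clean_fn, engineer_fn, pipeline):
        self.clean_fn = clean_fn        # e.g. clean_data from the repo
        self.engineer_fn = engineer_fn  # e.g. engineer_features from the repo
        self.pipeline = pipeline        # already-fitted sklearn Pipeline

    def predict_proba(self, df: pd.DataFrame):
        # Same order as training: clean, engineer, then the fitted pipeline.
        return self.pipeline.predict_proba(self.engineer_fn(self.clean_fn(df)))
```

Because the wrapper carries the functions with it, `joblib.dump(bundle, path)` persists the whole scoring path, and the serve side cannot accidentally reorder or skip a step.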
Artifacts I would version
I would treat each deployable build as:
- `best_model_*.joblib` with an explicit suffix (e.g. full-features vs no-`PageValue` run), not only the last overwrite from training.
- A small `manifest.json`: git SHA, library versions (sklearn, xgboost), row counts, primary metrics, required input columns, and an optional threshold from precision–recall analysis.
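A minimal sketch of such a manifest; every concrete value below is a placeholder that CI would stamp at build time, and the field names are illustrative rather than an existing schema:

```python
import json

import sklearn

# Placeholder values; CI fills in the real SHA, counts, and metrics.
manifest = {
    "git_sha": "deadbeef",
    "versions": {"sklearn": sklearn.__version__},  # xgboost pinned the same way
    "train_rows": 0,
    "metrics": {"pr_auc": 0.0},
    "required_columns": ["..."],
    "threshold": None,  # set only if a PR-curve threshold is fixed
}
manifest_json = json.dumps(manifest, indent=2)
```

Writing this file next to each `.joblib` makes an artifact self-describing: the scorer can refuse input that is missing a required column before the model ever runs.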
Artifacts live in object storage or a registry; the inference process reads `MODEL_URI` / version from the environment, not from the repo tree.
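The environment-driven lookup can be as small as this; `MODEL_URI` is the variable named above, the default path is the repo's local artifact, and the helper name is an assumption:

```python
import os

import joblib


def load_model(default_path="outputs/best_model.joblib"):
    # Production sets MODEL_URI to a versioned artifact in object storage;
    # rollback is repointing this variable, not redeploying code.
    return joblib.load(os.environ.get("MODEL_URI", default_path))
```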
Inference shape
I would expose a narrow HTTP API (e.g. FastAPI): validate a JSON row with Pydantic, run the same preprocessing path, and return the `predict_proba` output (plus a label only if a threshold is fixed in the manifest). Health checks, timeouts, and request size limits are non-negotiable for anything public.
I would not run `RandomizedSearchCV` on the request path.
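A framework-agnostic sketch of what the request handler would do once the JSON row is parsed; the function name, the contract check, and the threshold handling are assumptions, not repo code, and the FastAPI layer would just validate and delegate to it:

```python
import pandas as pd


def score_row(row: dict, model, required_columns, threshold=None):
    # Enforce the input contract from the manifest before the model runs.
    missing = [c for c in required_columns if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")  # -> HTTP 422 upstream
    df = pd.DataFrame([row], columns=required_columns)
    proba = float(model.predict_proba(df)[0, 1])
    out = {"probability": proba}
    if threshold is not None:  # label only when the manifest fixes a threshold
        out["label"] = int(proba >= threshold)
    return out
```

Keeping this core free of web-framework imports makes it trivially unit-testable and reusable from a batch scorer.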
Training and CI
Training stays in GitHub Actions, a VM, or a job runner: install locked dependencies, run `run.sh` (or a slimmer train entrypoint), and upload artifacts and reports. Deployments of the scorer reference a tagged artifact, so rollback is switching an environment variable.
Data and compliance
Training data paths stay private (bucket + IAM, or internal network). I would not bake sensitive or assessment-only datasets into container images or public repos.
What I would add next in the codebase
A `serve/` package (FastAPI + Dockerfile), dual saves for the two leakage experiments if both are ever served, and a wrapper pipeline so that the object returned by `joblib.load` is the only thing the API needs, minimising footguns for anyone operating the service.
This note is the operational companion to the methods article; together they describe model quality and how I would ship it.