From notebook to service: how I would harden this pipeline for production
Goh Siang En Luke · Engineering note (companion to the tabular methods write-up)
Scope: architecture and operations only—no assessment brief or non-public dataset material.
What “production” means here
I separate training (heavy, rare, reproducible) from scoring (small, frequent, bounded latency). The repo today runs end-to-end locally with `bash run.sh`, persists `outputs/best_model.joblib`, and documents metrics under `outputs/`. Production is the layer around that: contracts, versions, and where code runs.
Preprocessing must travel with the contract
The saved `Pipeline` bundles the column transformer and classifier. `clean_data` and `engineer_features` run before that pipeline in `main.py`. For production scoring I would either:
- wrap cleaning + engineering + `Pipeline` in one serialisable object (e.g. a custom estimator or extra `Pipeline` stages), or
- ship a single Python module used by both training and inference so the order of steps cannot drift.
Drift between train and serve is the fastest way to ship a model that looks good offline and behaves wrong online.
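The first option can be sketched as a thin wrapper object. `clean_data` and `engineer_features` are the repo's existing functions; the class name and its exact shape here are illustrative, not repo code:

```python
import pandas as pd
from sklearn.base import BaseEstimator


class ScoringBundle(BaseEstimator):
    """Bundles cleaning, feature engineering, and the fitted Pipeline so that
    joblib.load yields one object exposing predict_proba for the API."""

    def __init__(self, clean_fn, engineer_fn, pipeline):
        self.clean_fn = clean_fn        # e.g. clean_data from the repo
        self.engineer_fn = engineer_fn  # e.g. engineer_features from the repo
        self.pipeline = pipeline        # already-fitted sklearn Pipeline

    def predict_proba(self, df: pd.DataFrame):
        # Same order as training: clean, engineer, then the fitted pipeline.
        return self.pipeline.predict_proba(self.engineer_fn(self.clean_fn(df)))
```

Because the wrapper carries the functions with it, `joblib.dump(bundle, path)` persists the whole scoring path, and the serve side cannot accidentally reorder or skip a step.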
Artifacts I would version
I would treat each deployable build as:
- `best_model_*.joblib` with an explicit suffix (e.g. full-features vs no-`PageValue` run), not only the last overwrite from training.
- A small `manifest.json`: git SHA, library versions (sklearn, xgboost), row counts, primary metrics, required input columns, and an optional threshold from precision–recall analysis.
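A minimal sketch of such a manifest; every concrete value below is a placeholder that CI would stamp at build time, and the field names are illustrative rather than an existing schema:

```python
import json

import sklearn

# Placeholder values; CI fills in the real SHA, counts, and metrics.
manifest = {
    "git_sha": "deadbeef",
    "versions": {"sklearn": sklearn.__version__},  # xgboost pinned the same way
    "train_rows": 0,
    "metrics": {"pr_auc": 0.0},
    "required_columns": ["..."],
    "threshold": None,  # set only if a PR-curve threshold is fixed
}
manifest_json = json.dumps(manifest, indent=2)
```

Writing this file next to each `.joblib` makes an artifact self-describing: the scorer can refuse input that is missing a required column before the model ever runs.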
Artifacts live in object storage or a registry; the inference process reads `MODEL_URI` / version from the environment, not from the repo tree.
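The environment-driven lookup can be as small as this; `MODEL_URI` is the variable named above, the default path is the repo's local artifact, and the helper name is an assumption:

```python
import os

import joblib


def load_model(default_path="outputs/best_model.joblib"):
    # Production sets MODEL_URI to a versioned artifact in object storage;
    # rollback is repointing this variable, not redeploying code.
    return joblib.load(os.environ.get("MODEL_URI", default_path))
```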
Inference shape
I would expose a narrow HTTP API (e.g. FastAPI): validate a JSON row with Pydantic, run the same preprocessing path, and return the `predict_proba` output (plus a label only if a threshold is fixed in the manifest). Health checks, timeouts, and request size limits are non-negotiable for anything public.
I would not run `RandomizedSearchCV` on the request path.
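A framework-agnostic sketch of what the request handler would do once the JSON row is parsed; the function name, the contract check, and the threshold handling are assumptions, not repo code, and the FastAPI layer would just validate and delegate to it:

```python
import pandas as pd


def score_row(row: dict, model, required_columns, threshold=None):
    # Enforce the input contract from the manifest before the model runs.
    missing = [c for c in required_columns if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")  # -> HTTP 422 upstream
    df = pd.DataFrame([row], columns=required_columns)
    proba = float(model.predict_proba(df)[0, 1])
    out = {"probability": proba}
    if threshold is not None:  # label only when the manifest fixes a threshold
        out["label"] = int(proba >= threshold)
    return out
```

Keeping this core free of web-framework imports makes it trivially unit-testable and reusable from a batch scorer.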
Training and CI
Training stays in GitHub Actions, a VM, or a job runner: install locked dependencies, run `run.sh` (or a slimmer train entrypoint), and upload artifacts and reports. Deployments of the scorer reference a tagged artifact, so rollback is switching an environment variable.
Data and compliance
Training data paths stay private (bucket + IAM, or internal network). I would not bake sensitive or assessment-only datasets into container images or public repos.
What I would add next in the codebase
A `serve/` package (FastAPI + Dockerfile), dual saves for the two leakage experiments if both are ever served, and a wrapper pipeline so that the object returned by `joblib.load` is the only thing the API needs, minimising footguns for anyone operating the service.
This note is the operational companion to the methods article; together they describe model quality and how I would ship it.