gohluke.com · Public · not AIS official


April 25 gauntlet: projected questions, answered with this repo

Goh Siang En Luke · Private prep for AIAP Batch 23 (interview 25 Apr 2026)
Grounded in aiap23-goh-siang-en-luke-963A-recreate: SQLite sessions → binary purchase, four tuned models, leakage A/B. No brief text, no non-public dataset dumps.


How I use this page

Panels often run 2–3 senior engineers. They probe decisions, production rigor, and fit, not syntax. Below each prompt is how I answer using my actual code path (preprocessing.engineer_features, train.py, main.py, outputs/).


1. Technical defence (take-home walkthrough)

“Walk us through feature engineering. Why these features?”

How I answer: In engineer_features I added only three interpretable columns, each a named hypothesis:

  • HasPageValue — non-zero PageValue flags sessions where commercial value already surfaced; it pairs with the leakage story.
  • BounceExitInteraction — BounceRate × ExitRate; captures “bounced hard and exited aggressively” in one scalar, stronger than either raw rate alone for intent.
  • LogProductPageTime — log1p(ProductPageTime); dampens right-skewed dwell time so extreme sessions do not dominate the linear / tree splits.

Everything else comes from the base schema after clean_data (duplicate drop, CustomerType normalisation, GeographicRegion fix, clamps/caps on rates and dwell time—see methods-article.md).
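The three columns above can be sketched in a few lines of pandas. This is a minimal stand-in, not the repo code: the real logic lives in preprocessing.engineer_features, and the base-schema column names (PageValue, BounceRate, ExitRate, ProductPageTime) are assumed from the description above.

```python
import numpy as np
import pandas as pd

def engineer_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the three engineered columns; assumes clean_data already ran."""
    out = df.copy()
    # Commercial value already surfaced in the session (the leakage-adjacent flag).
    out["HasPageValue"] = (out["PageValue"] != 0).astype(int)
    # "Bounced hard and exited aggressively" collapsed into one scalar.
    out["BounceExitInteraction"] = out["BounceRate"] * out["ExitRate"]
    # log1p dampens right-skewed dwell time so extreme sessions do not dominate.
    out["LogProductPageTime"] = np.log1p(out["ProductPageTime"])
    return out
```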

“You ended up on XGBoost. Why not only Logistic Regression—or a Transformer?”

How I answer: The brief required multiple models. I ran a complexity ladder: Logistic Regression (class_weight='balanced'), Random Forest, Gradient Boosting, and XGBoost with scale_pos_weight in the search grid—all inside the same Pipeline([preprocessor, classifier]), RandomizedSearchCV, stratified 5-fold, optimising f1_weighted.
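The ladder shape can be sketched as below. This is an illustration on synthetic data with two rungs only; the real run in train.py adds Random Forest and XGBoost (with scale_pos_weight in its grid) using the same Pipeline + RandomizedSearchCV pattern, and the column transformer here is simplified to a StandardScaler.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE, CV_FOLDS = 42, 5

# Synthetic stand-in for the session table (~15% positives, as in the take-home).
X, y = make_classification(n_samples=600, weights=[0.85], random_state=RANDOM_STATE)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)

# Two rungs of the ladder; every rung shares the same Pipeline shape,
# so preprocessing is refit inside each CV fold (no validation leakage).
ladder = {
    "logreg": (LogisticRegression(class_weight="balanced", max_iter=1000),
               {"classifier__C": [0.1, 1.0, 10.0]}),
    "gb": (GradientBoostingClassifier(random_state=RANDOM_STATE),
           {"classifier__n_estimators": randint(50, 200)}),
}
best = {}
for name, (clf, grid) in ladder.items():
    pipe = Pipeline([("preprocessor", StandardScaler()), ("classifier", clf)])
    search = RandomizedSearchCV(pipe, grid, n_iter=3, cv=CV_FOLDS,
                                scoring="f1_weighted", random_state=RANDOM_STATE)
    best[name] = search.fit(X_tr, y_tr).best_score_
```

Selection then compares best_score_ (and held-out test F1) across rungs on the identical split.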

On my documented run, XGBoost edged to ~0.892 test F1 (weighted) with Gradient Boosting essentially tied; ROC-AUC was slightly higher on GB in that snapshot, but selection stayed F1-weighted for the imbalance story.

Why not “just Transformer”: this problem is small tabular session rows, not token sequences. GBDT family models are the practical default here for signal-to-effort; a transformer would be data- and infra-heavy without a clear modality match.

“How did you validate, and which metric did you prioritise?”

How I answer: Stratified train_test_split (80/20, RANDOM_STATE=42), RandomizedSearchCV with CV_FOLDS=5, scoring f1_weighted. I still log accuracy and ROC-AUC for context, but I do not lead with accuracy on a ~15% positive class—F1-weighted matches the imbalance. I also ran precision–recall curves and a threshold sweep because the operating point depends on business FP/FN costs I would set with a stakeholder—not a fixed 0.5 default.
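The threshold-sweep idea can be shown in a few lines. This is a generic cost-based sweep, not the repo's exact analysis; the fp_cost / fn_cost values are placeholders a stakeholder would set.

```python
import numpy as np

def pick_threshold(y_true, proba, fp_cost=1.0, fn_cost=5.0):
    """Choose an operating point by expected misclassification cost
    rather than a fixed 0.5. Costs here are illustrative defaults."""
    y_true, proba = np.asarray(y_true), np.asarray(proba)
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        costs.append(fp * fp_cost + fn * fn_cost)
    return float(thresholds[int(np.argmin(costs))])
```

With asymmetric costs (missing a buyer hurts more than a false alert), the chosen threshold drifts below 0.5, which is exactly the conversation to have with the business owner.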

“If the dataset had 1,000× more rows, what breaks first in your code?”

How I answer: First, pandas loading the full SQLite extract into RAM (the code is written at exploratory scale); next, single-node RandomizedSearchCV runtime. I would move to Parquet partitions + out-of-core or distributed compute (Spark/Dask/warehouse), orchestrated retrains, and a feature store for reusable slices—the same story as my 100× batch notes in interview-article.md, scaled up.


2. Production and MLOps stress test

“How would you detect performance drop in production?”

How I answer: Data drift (input distribution vs training: PSI, KS, simple stats) and quality drift (null rates, schema violations). Score drift and rolling calibration of outcomes vs predictions. Tools in the Evidently / whylogs class—or SQL dashboards if the org is lean. I tie alerts to model version and data snapshot id in a manifest (see deployment-article.md).
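One of the drift checks named above, PSI, is simple enough to sketch without a library. This is a standard textbook formulation, not code from the repo, and the 0.1 / 0.25 cutoffs are a common industry heuristic rather than a hard rule.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training slice and a live slice.
    Heuristic read: <0.1 stable, 0.1-0.25 moderate shift, >0.25 investigate."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip live values into the training range so nothing falls outside the bins.
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```

Running this per feature per day, keyed to model version and data snapshot id, is the lean-org version of an Evidently dashboard.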

“Deploy this as a real-time API—what stack?”

How I answer: Containerised FastAPI (or equivalent) loading versioned joblib; Pydantic schema on the request body; predict_proba + optional threshold from the PR analysis. Audit logging of request metadata (not raw PII) to Postgres/Supabase. Training stays offline in CI or a worker; the API never runs RandomizedSearchCV on the request path. For my own infra I already operate long-running services on a VPS; the pattern is the same: pin deps, health checks, rollback by artifact version.
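The handler body behind that API can be sketched without the FastAPI wiring. The field names and the load/validate/predict split are illustrative assumptions; in the real service Pydantic does the schema check declaratively and the artifact loads once at startup, not per request.

```python
import joblib
import numpy as np

# Illustrative subset of the request schema, not the full contract.
REQUIRED = ("PageValue", "BounceRate", "ExitRate", "ProductPageTime")

def load_model(path: str):
    """Load a pinned, versioned artifact (e.g. model_v3.joblib)."""
    return joblib.load(path)

def predict_session(model, payload: dict, threshold: float = 0.5) -> dict:
    # Schema check: reject early with a clear error, before touching the model.
    missing = [k for k in REQUIRED if k not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    row = np.array([[float(payload[k]) for k in REQUIRED]])
    # predict_proba + a threshold chosen from the PR analysis, never a bare predict.
    proba = float(model.predict_proba(row)[0, 1])
    return {"purchase_proba": proba, "purchase": proba >= threshold}
```

Training never runs on this path; the API only deserialises and scores.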

“What is data leakage—and did you check for it?”

How I answer: Temporal / contract leakage: features not available or not ethical at decision time. I ran the pipeline twice: with PageValue + HasPageValue, then dropping both. Documented ~0.89 → ~0.79 best F1 (weighted) in my run—quantifies how much lift sat in that family. Separately, Pipeline keeps preprocessing fit inside CV folds so the model never sees validation statistics during tuning.
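The A/B itself is easy to demonstrate on synthetic data. This is not the repo run (and GradientBoosting stands in for the tuned XGBoost): one feature is deliberately built from the label plus noise, mimicking a PageValue-like column, and dropping it produces the same shape of F1 gap the take-home documented.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
honest = rng.normal(size=(n, 3))                   # legitimately available features
y = (honest[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)
leaky = y + rng.normal(scale=0.1, size=n)          # stand-in for the PageValue family
X_full = np.column_stack([honest, leaky])

def ab_f1(X):
    """Identical split, model, and seed each arm; only the columns change."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te), average="weighted")

with_leak = ab_f1(X_full)
without_leak = ab_f1(X_full[:, :3])   # drop the leaky family, rerun identically
```

The gap between the two numbers is the quantified answer to "how much of your score was leakage".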


3. Programme fit and collaboration

“You’re already a founder—why AIAP instead of only your own products?”

How I answer: (Aligned to my submitted form.) I have shipped products end-to-end; I want deeper MLOps and production ML discipline in a structured, industry-grade setting, and to contribute in cohort work—not to pause building, but to compress years of patterns I would otherwise learn only from my own mistakes.

“Teammate insists on a weaker approach in a group exercise.”

How I answer: Time-box a comparison: shared metric on a small slice or agreed baseline, then let numbers decide. If time is zero, I document assumptions and propose a risk-ranked path (fast ship vs rigor).

“How do you think about AI governance for agentic systems in 2026?”

How I answer: One tight frame: accountability (who owns the decision), risk assessment (what can go wrong, severity), technical controls (tool allowlists, evals, kill switches), transparency (logging, user-visible limits). I connect to products I run (Dayze / Coincess) only where I can speak concretely—no buzzword wall.


4. Theory flashcards (short answers I keep ready)

“L1 vs L2 regularisation?”

L1 (Lasso) can zero coefficients—implicit feature selection. L2 (Ridge) shrinks but rarely zeros. I pick based on sparsity vs collinearity needs; in this take-home the heavy lifting was tree ensembles, not my hand-tuned linear penalty grid.
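The zeroing behaviour is quick to show on synthetic data (illustrative alphas, not tuned values): only two of ten columns carry signal, and Lasso finds that sparsity while Ridge merely shrinks.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two columns are informative; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks them but (almost) never zeros
n_zero_l1 = int(np.sum(lasso.coef_ == 0))
n_zero_l2 = int(np.sum(ridge.coef_ == 0))
```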

“Cold start in recommendation systems?”

How I answer: My submission is session-level purchase prediction, not a recsys benchmark—so I say that honestly. If the interview pivots: popularity / content-based cold start until enough interactions for collaborative signals; exploration policies where business allows.


Cross-links on this site

  • interview-article.md — 100× batch + streaming systems depth.
  • methods-article.md — numbers and leakage narrative.
  • deployment-article.md — train/serve split and artifacts.
  • /admin/quiz and /admin/quiz/mock-interview — drills the same muscle out loud.

No panel script is guaranteed; this is the version tied to my repo, not a generic tutorial.