gohluke.com · Public · not AIS official

Interview — systems

100× batch + streaming, cost, CDC, incremental learning.

Interview prep: systems thinking and extra prompts I rehearse

Goh Siang En Luke · Private notes for AIAP Batch 23 (interview 25 Apr 2026)
Prep only—no assessment brief, no non-public dataset details. Same tone I use in mock interviews: first person, concrete.


Official day-of format (AIS invitation)

  • NRIC: Interviewers verify identity against your NRIC before the technical interview—have it ready.
  • Technical interview (1 hour): 10 minutes to present the Technical Assessment submission (your choice: README, notebook, slides). 50 minutes of interviewer questions.
  • Group exercise (2 hours): Case study with 2–3 other candidates, disclosed on the spot. Bring a personal laptop to code during the session; internet and other resources allowed.
  • AIS states the technical interview and group exercise are two parts that may run in either order—confirm sequencing from your agenda email.

Reference: AIAP technical assessment & interview process.


Why “100× more data” is a different question

If an interviewer asks what I would do if the data were 100× larger, I treat it as a shift from laptop-scale work to systems engineering. The failure mode is not “training takes longer”—it is RAM, I/O, and operational coupling. I want my answer to show I can decouple storage, compute, and serving.


Deep dive: “What if the data size was 100×?”

1. Architectural pivot

At that scale I cannot assume the full dataset fits in memory. I move to out-of-core or distributed processing.

  • Storage: I would move off local CSV/SQLite-only workflows for the primary corpus toward object storage (e.g. S3, GCS) or a warehouse (e.g. BigQuery, Snowflake) with clear partitioning and lifecycle rules.
  • Processing: I would reach for Spark or Dask (or warehouse-native SQL) so work is parallelised across partitions instead of one big read_csv into RAM.

How I would say it in the room: I would implement a distributed compute layer so a 100× increase is handled by scaling out workers and partitioning the data, not by hunting for a single machine with enough RAM.

2. Engineering levers (I/O and reuse)

  • Data format. Today: CSV / SQLite export. At 100×: Parquet (columnar, compressed) so scans read fewer bytes and support predicate pushdown.
  • ETL / training orchestration. Today: a sequential script (run.sh → Python). At 100×: DAG orchestration (Airflow, Prefect, Dagster) for retries, parallelism, and scheduled retrains.
  • Features. Today: recompute from raw when it is still cheap. At 100×: a feature store (Feast, Tecton-class pattern, or warehouse tables) for reusable, versioned training and online serving slices.
  • Model training. Today: single-node RandomizedSearchCV. At 100×: distributed training where the model class supports it (e.g. XGBoost / LightGBM distributed, or Horovod / Lightning for deep models—not always needed for tabular).

The bottleneck I call out explicitly is often I/O, not FLOPs.

3. Cost and reliability

If volume grows 100×, I do not assume the cloud bill scales linearly forever. I would use batch jobs for non-latency-critical work, spot / preemptible capacity for heavy training where restart is acceptable, and autoscaling bounds so a bad query cannot drain the budget. I tie that to ROI for whoever funds the stack (program partner, employer, or my own runway).

4. Ten-second summary I can deliver cold

  1. Columnar storage (Parquet) to cut I/O.
  2. Distributed or partitioned processing (Spark/Dask/warehouse) for horizontal scale.
  3. Lazy / staged transforms: materialise only what each stage needs.
  4. Data quality gates (Great Expectations–class checks, anomaly monitors): bad rows scale with volume; I catch drift and schema breaks early.
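The data-quality gate in point 4 can be sketched as a per-batch schema-plus-range check that quarantines bad rows instead of silently training on them. This is a minimal stand-in for Great Expectations-class tooling; the schema format and function name are my own illustration.

```python
def check_batch(rows, schema):
    """Minimal data-quality gate. schema maps column -> (type, lo, hi);
    lo/hi of None means unbounded. Returns (clean_rows, violations) so
    bad rows are quarantined for inspection, not dropped silently."""
    clean, violations = [], []
    for i, row in enumerate(rows):
        errs = []
        for col, (typ, lo, hi) in schema.items():
            if col not in row:
                errs.append(f"missing {col}")
                continue
            try:
                v = typ(row[col])  # type gate: does the raw value parse?
            except (TypeError, ValueError):
                errs.append(f"{col} not {typ.__name__}")
                continue
            if lo is not None and v < lo:
                errs.append(f"{col} < {lo}")
            if hi is not None and v > hi:
                errs.append(f"{col} > {hi}")
        if errs:
            violations.append({"row": i, "errors": errs})
        else:
            clean.append(row)
    return clean, violations
```

At 100× I would track the violation rate over time: a schema break shows up as a step change, drift as a slow climb.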

Deep dive (streaming): What if the same 100× volume arrived in real time?

Batch scale-up is not enough here. If the panel asks about continuous ingress, I shift the story from “load → transform → train” to event-driven architecture, state, and latency SLOs. I am describing how I would run a live system, not a bigger notebook.

1. Event backbone and stream processing

  • Single stream as source of truth (Kappa-style): I treat logs/events as the primary interface: producers write once to a durable log; downstream systems replay or fork read models. That avoids maintaining two divergent batch and speed layers (the old Lambda split) unless compliance forces it.
  • Log / bus: Kafka or Redpanda as the central ingest buffer—partitioned topics, retention, compaction where appropriate.
  • In-flight compute: Flink or Spark Structured Streaming for windowed aggregations, joins, enrichment, and exactly-once or at-least-once semantics chosen explicitly.
  • Why I say this in Singapore’s AI stack narrative: policy and industry talk in 2026 stresses event-driven and agentic systems; I connect streaming to agents that must act on fresh context (the same pressure I feel building Coincess market feeds or Dayze live assistants).
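The windowed-aggregation bullet above reduces to one piece of arithmetic: assign each event to a fixed, non-overlapping bucket. Flink and Spark Structured Streaming do this incrementally with managed state and watermarks; this batch sketch (my own toy, not a Flink API) shows only the tumbling-window assignment.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_s=60):
    """Count events per (window, key) for fixed, non-overlapping
    windows. events is an iterable of (timestamp_seconds, key).
    Integer division maps every timestamp to its window start."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_s) * window_s
        counts[(window_start, key)] += 1
    return dict(counts)
```

In a real stream the hard parts are late data and state eviction, which is exactly what watermarks and retention settings exist for; the assignment logic itself stays this simple.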

2. In-flight engineering (what actually breaks)

  • Feature freshness: Online feature stores (Redis-backed, Feast online store, or vendor equivalents) so scoring reads low-latency vectors, while batch/offline stores backfill history.
  • Model serving: I usually separate stream processing from model inference: Flink/Spark does stateful windows and feature assembly; a model server (REST/gRPC) or embedded scoring for tiny models handles predict_proba. I only colocate “model inside the operator” when the model is small and SLAs allow it—otherwise I avoid blocking the stream on GPU RPC.
  • Backpressure: Autoscaled consumers, bounded queues, and dropping or sampling non-critical telemetry under stress—especially relevant when crypto volatility spikes and my stack must not die.
  • Exactly-once where money moves: Kafka transactions / idempotent sinks so I do not double-count a fill or a payment signal.
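The exactly-once point is easiest to defend with an idempotent-sink sketch: under at-least-once delivery, duplicates will arrive, so the sink keys writes on a stable event id and makes re-applying a duplicate a no-op. Kafka transactions achieve the same end-to-end; this class is my own illustration of the sink-side half.

```python
class IdempotentSink:
    """Idempotent sink for at-least-once streams: a duplicate event id
    is detected and skipped, so replays never double-count a fill."""

    def __init__(self):
        self.applied = set()   # in production: a unique key in the sink DB
        self.balance = 0.0

    def apply_fill(self, event_id, amount):
        if event_id in self.applied:
            return False       # duplicate delivery: already counted
        self.applied.add(event_id)
        self.balance += amount
        return True
```

In the room I would add that the dedup set must be as durable as the balance itself (same transaction), or the idempotence guarantee evaporates on restart.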

3. From OLTP to stream (CDC)

For a fintech-style partner I would propose Change Data Capture from operational databases into the bus—Debezium-class patterns—so ML features track operational truth without nightly bulk ETL lag. The pitch: models stay context-aware relative to balances, limits, and state machines, not yesterday’s warehouse snapshot.
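The CDC pitch can be made concrete with a replay sketch: change events carrying an operation type and the new row image are folded into a local read model, so features always reflect the latest operational state. The field names here are illustrative, not the actual Debezium envelope.

```python
def apply_cdc(state, change_events):
    """Replay Debezium-style change events onto a read model dict.
    op 'c'/'u' (create/update) carry the new row in 'after';
    op 'd' (delete) removes the row. Replaying the full log from
    offset zero reproduces the same state: the log is the truth."""
    for ev in change_events:
        op, key = ev["op"], ev["key"]
        if op in ("c", "u"):
            state[key] = ev["after"]
        elif op == "d":
            state.pop(key, None)
    return state
```

That replayability is the real argument against nightly bulk ETL: the warehouse snapshot is one possible read model, derived on demand, instead of the only copy of recent truth.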

4. Incremental and online learning (when I bring it up)

“Retrain every hour on the full table” is a fine baseline. Where I want to sound senior, I add:

  • Incremental / online methods (e.g. River, SGD partial fits, streaming trees where supported) to update estimates as events arrive, with guardrails for concept drift and catastrophic forgetting (replay buffers, periodic full retrains, or shadow deployments).
  • I am honest: not every production model should be pure online learning; I pair micro-updates with scheduled full retrains on Parquet/warehouse history.
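The incremental-learning pairing above can be sketched with a one-feature SGD regressor updated per event, plus a replay buffer for the scheduled full retrains. This is a toy in the spirit of River or scikit-learn's partial_fit, with my own class name and untuned hyperparameters.

```python
class OnlineSGDRegressor:
    """One-feature linear model (no intercept) updated one event at a
    time, partial-fit style. The replay buffer keeps recent history so
    a periodic full retrain can correct drift and forgetting; both the
    learning rate and buffer size are illustrative, not tuned."""

    def __init__(self, lr=0.1, buffer_size=1000):
        self.w = 0.0
        self.lr = lr
        self.buffer = []           # recent (x, y) pairs for full retrains
        self.buffer_size = buffer_size

    def partial_fit(self, x, y):
        err = self.w * x - y
        self.w -= self.lr * 2 * err * x   # one SGD step on squared error
        self.buffer.append((x, y))
        if len(self.buffer) > self.buffer_size:
            self.buffer.pop(0)             # bound memory like a stream would

    def predict(self, x):
        return self.w * x
```

The interview framing: micro-updates keep the model fresh between retrains, and the buffer plus Parquet history is what makes the scheduled full retrain honest rather than a reset to stale data.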

5. Streaming checklist I can hit in under a minute (April 25)

  • Messaging: Kafka / Redpanda + Flink or Spark Streaming as the default mental picture.
  • Governance: Real-time guardrails—schema checks, rate limits, bias / policy filters on outputs before they touch users (especially for agents).
  • Cost: Streaming is not free; I use windowing (e.g. 1-minute tumbling) and tiered storage so I do not pay to shuffle raw noise forever.
  • Mindset: Batch trains the spec; streaming runs the product.

Other interview prompts I keep answers ready for

  • Leakage: How I define label time vs feature time; my PageValue A/B in the take-home as a concrete example.
  • Imbalanced metrics: Why I optimised F1-weighted and still report PR-AUC / confusion matrix; threshold choice vs default 0.5.
  • Validation: Stratified k-fold, held-out test, nested CV only if I am honest about compute cost.
  • Reproducibility: RANDOM_STATE, pinned deps, saved Pipeline, manifest with git SHA.
  • Production: Train/serve split, schema validation on the API, monitoring score drift.
  • Ethics / fairness: Sensitive categoricals (GeographicRegion); when I would not deploy a feature even if it lifts accuracy.
  • Model zoo (LR, DT, SVM, RF, boosting): If they broaden beyond my four trained models, I use the Model zoo (panel) crib—especially “why not SVM here” vs “when SVM is fair for sparse high-dim,” and RF vs boosting in one sentence each.

Links I still use the same day

  • /admin/aiap/library — one-screen cribs (ISLP, Clean Code, Databricks GenAI, Designing ML Systems, Model zoo) before the panel.
  • /admin/quiz — flashcards including AIAP-style categories.
  • /admin/quiz/mock-interview — timed spoken walkthrough on framing, leakage, validation.
  • /admin/aiap/interview — optional browser voice drill (scripted prompts + STT + short Gemini replies); same interaction pattern as Keta’s /interview, gohluke admin UI (black/white).

These notes sit next to my methods and deployment articles on this page so I have model → ship → interview (batch + streaming) in one place.