Case Study

AI-Powered Data Science Mock Interview Platform

Built a full-stack interview simulator that uses dual LLM agents — one for real-time answer evaluation against weighted rubrics, one for neutral interviewer delivery — to conduct adaptive Data Science interviews over WebSocket. The engine groups questions by domain (depth-first), branches follow-ups based on answer quality, enforces realistic pacing (~4 min/question), and generates post-interview reports with 6-dimension scoring and personalized 7-day training plans.

Role: Sole Developer
Timeline: Mar 2026
Reading Time: 4 min read
Tech: LLM Agents · FastAPI · WebSocket · Next.js · Prompt Engineering · Supabase · Docker · Zustand
10 DS domains · 100+ questions · 6 scoring dimensions

Overview

Built a full-stack mock interview platform that simulates realistic Data Science technical interviews using LLM-powered adaptive questioning. The system conducts real-time interviews over WebSocket, evaluates answers against weighted rubrics with 6 scoring dimensions, dynamically adjusts follow-up depth based on answer quality, and generates personalized post-interview reports with 7-day training plans. The interviewer persona is deliberately neutral and probing — modeled after real interview dynamics, not a chatbot.

Problem

Existing interview prep tools fall into two categories: static flashcard-style Q&A (no adaptivity, no follow-ups) and generic chatbot conversations (encouraging tone, no rubric-based evaluation, no time pressure). Neither replicates what a real DS interview actually feels like: a neutral evaluator who probes weak spots, challenges assumptions, manages time, and moves on when you're stuck — not a cheerleader who says "Great answer!" regardless.

The core design question was: how do you make an LLM behave like an interviewer, not a tutor?

Why It Mattered

Interview preparation is high-stakes and deeply personal. Generic tools don't surface the specific gaps in a candidate's understanding — they either accept everything or reject everything. A realistic simulator needs to:

  • Detect answer quality in real-time and branch accordingly
  • Apply time pressure naturally (not just a countdown timer, but verbal cues from the interviewer)
  • Score across multiple dimensions (technical correctness alone misses communication, self-awareness, problem-solving approach)
  • Provide actionable feedback, not just a pass/fail grade

Approach

Interview Engine (State Machine)

Designed a deterministic state machine (WARMUP → CORE → DEEP_PROBE → RETESTING → WRAPUP) with probabilistic transitions driven by LLM evaluation output. Questions are grouped by domain (depth-first, not breadth-first) with adaptive follow-up trees up to 2 levels deep. The engine enforces realistic pacing (~4 minutes per question including follow-ups) and automatically redirects after 5 minutes on a single question.
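A minimal sketch of the phase machine, assuming the evaluator emits a coarse quality label ("strong" / "partial" / "weak") per answer; the transition rules and the 300-second redirect cap are illustrative, not the exact production logic:

```python
from enum import Enum, auto

class Phase(Enum):
    WARMUP = auto()
    CORE = auto()
    DEEP_PROBE = auto()
    RETESTING = auto()
    WRAPUP = auto()

def next_phase(current: Phase, answer_quality: str,
               seconds_on_question: float) -> Phase:
    """Pick the next interview phase from the evaluator's quality label.

    Illustrative rules only: strong answers in CORE escalate to DEEP_PROBE,
    weak answers under probing get queued for RETESTING, and any question
    running past 5 minutes is redirected back to CORE.
    """
    if seconds_on_question > 300:      # hard cap: redirect after 5 min
        return Phase.CORE
    if current is Phase.WARMUP:
        return Phase.CORE
    if current is Phase.CORE:
        return Phase.DEEP_PROBE if answer_quality == "strong" else Phase.CORE
    if current is Phase.DEEP_PROBE:
        return Phase.RETESTING if answer_quality == "weak" else Phase.CORE
    return Phase.WRAPUP
```

Keeping the transitions in plain code (rather than inside the prompt) is what makes the engine deterministic: the LLM only supplies the quality label, never the control flow.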

LLM Integration (Dual-Agent Architecture)

Two separate LLM roles operate in parallel per answer:

  1. Evaluator agent: Scores the answer against a per-question weighted rubric and 6 behavioral dimensions (Technical Correctness 40%, Problem-Solving 25%, Communication 15%, Depth 10%, Challenge Handling 5%, Self-Awareness 5%). Returns structured JSON with quality classification, misconceptions, and missing concepts.

  2. Interviewer agent: Receives the evaluation result and generates the next response. Prompt engineering enforces neutral tone ("Okay.", "I see.", not "Great answer!"), three probing modes (strong answer → escalate difficulty, partial → probe the gap, weak → one probe then move on), and time-pressure verbal cues.

The key insight: separating evaluation from delivery lets the interviewer reference specific things the candidate said while still following a structured assessment framework.
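The weighted rubric above can be collapsed into a single composite score that drives the interviewer's probing mode. A sketch, assuming the evaluator returns per-dimension scores on a 0–10 scale and that the mode cutoffs (7.5 / 4.0) are hypothetical:

```python
# Rubric weights as stated in the write-up.
WEIGHTS = {
    "technical_correctness": 0.40,
    "problem_solving": 0.25,
    "communication": 0.15,
    "depth": 0.10,
    "challenge_handling": 0.05,
    "self_awareness": 0.05,
}

def composite_score(evaluation: dict) -> float:
    """Collapse the 6-dimension evaluator JSON into one weighted score."""
    return sum(WEIGHTS[dim] * evaluation["scores"][dim] for dim in WEIGHTS)

def probing_mode(score: float) -> str:
    """Map the composite score onto the three interviewer probing modes.

    Cutoffs are illustrative: strong -> escalate difficulty,
    partial -> probe the gap, weak -> one probe then move on.
    """
    if score >= 7.5:
        return "strong"
    if score >= 4.0:
        return "partial"
    return "weak"
```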

Question Bank

100+ questions across 10 Data Science domains, each with:

  • Weighted rubric criteria with key points
  • 2-level follow-up trees triggered by answer quality
  • Common mistakes and misconceptions for targeted probing
  • Difficulty tiers (foundational, intermediate, advanced) with level-appropriate distribution
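One question-bank entry might look like the following sketch; the field names and sample question are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    key_point: str
    weight: float

@dataclass
class Question:
    domain: str                 # e.g. "statistics", "ml_fundamentals"
    difficulty: str             # "foundational" | "intermediate" | "advanced"
    text: str
    rubric: list                # list[RubricCriterion], weights sum to 1.0
    followups: dict             # answer quality -> level-1 follow-up text
    common_mistakes: list = field(default_factory=list)

q = Question(
    domain="statistics",
    difficulty="foundational",
    text="Explain the difference between Type I and Type II errors.",
    rubric=[RubricCriterion("defines both error types", 0.5),
            RubricCriterion("relates them to the alpha/power trade-off", 0.5)],
    followups={"strong": "How does sample size affect Type II error rate?",
               "partial": "Which error does the significance level control?"},
    common_mistakes=["conflates significance level with p-value"],
)
```

Storing follow-ups keyed by answer quality is what lets the engine branch without a second LLM call: the evaluator's label indexes straight into the tree.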

Real-Time Communication

WebSocket protocol handles bidirectional text and audio streams. The server injects active silence signals (2-second pauses before responding) and time-warning messages at 15/5/2 minute thresholds — creating natural interview pressure without artificial UI elements.
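The time-warning side of that protocol can be sketched framework-agnostically; the message shape and `send` callback are assumptions, but the 15/5/2-minute thresholds and 2-second active-silence pause come from the description above:

```python
import asyncio
from typing import Awaitable, Callable

# Warning thresholds from the write-up: cues at 15/5/2 minutes remaining.
THRESHOLDS = (15 * 60, 5 * 60, 2 * 60)

def warning_schedule(total_seconds: int,
                     thresholds=THRESHOLDS) -> list:
    """Return (sleep_seconds, remaining_minutes) pairs in firing order."""
    schedule, elapsed = [], 0
    for t in sorted(thresholds, reverse=True):
        if total_seconds > t:
            schedule.append((total_seconds - t - elapsed, t // 60))
            elapsed = total_seconds - t
    return schedule

async def run_time_warnings(send: Callable[[dict], Awaitable[None]],
                            total_seconds: int) -> None:
    """Background task: push verbal time-pressure cues over the socket."""
    for sleep_for, minutes in warning_schedule(total_seconds):
        await asyncio.sleep(sleep_for)
        await send({"type": "time_warning", "remaining_minutes": minutes})

async def reply_with_active_silence(send: Callable[[dict], Awaitable[None]],
                                    text: str) -> None:
    """Pause ~2s before the interviewer responds, per the design above."""
    await asyncio.sleep(2)
    await send({"type": "interviewer", "text": text})
```

In the real handler these would run as `asyncio` tasks alongside the WebSocket receive loop, so warnings fire even while the candidate is mid-answer.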

Architecture

  • Backend: FastAPI (async Python), WebSocket interview handler, LLM client via OpenAI-compatible API
  • Frontend: Next.js 16, React 19, Zustand state management, real-time transcript with auto-scroll
  • Database: PostgreSQL (Supabase) with Row-Level Security for multi-tenancy
  • Voice (optional): OpenAI Whisper (STT) + TTS for spoken interviews
  • Infrastructure: Docker Compose orchestration (6 services), self-hosted Supabase stack

Results & Impact

  • Built a complete interview simulator that adapts difficulty, probes weaknesses, and generates actionable reports — not a chatbot wrapper
  • 100+ questions across 10 DS domains with structured follow-up trees and weighted rubrics
  • 6-dimension scoring with behavioral anchors replaces binary pass/fail evaluation
  • Realistic pacing: ~7-8 questions per 30-minute session (vs. 15 in the initial rapid-fire version), matching real interview cadence
  • Interview persona validated through E2E testing: neutral tone, no cheerleading, challenges weak answers, references candidate's specific words

Lessons Learned

  • The hardest part of building an LLM-powered interviewer isn't the technology — it's prompt engineering the absence of helpfulness. LLMs default to being encouraging and explanatory; making one behave like a neutral evaluator who doesn't coach requires explicit negative instructions ("Do NOT say 'Great answer'", "Do NOT explain the correct answer")
  • Separating evaluation from response generation was the key architectural decision — it lets each agent optimize for its role without conflicting objectives
  • Real interviews are depth-first, not breadth-first. The initial version hopped between domains question-by-question; grouping by domain and completing each before moving on felt dramatically more realistic
  • Time pressure is more about verbal cues than UI timers. "We have about 5 minutes left, let's make sure we cover the remaining topics" from the interviewer creates more pressure than a red countdown clock
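The negative-instruction pattern from the first lesson can be sketched as a system-prompt fragment; the exact wording here is illustrative, not the production prompt:

```python
# Hypothetical interviewer system prompt: the point is the explicit
# negative instructions that suppress the LLM's default helpfulness.
INTERVIEWER_SYSTEM_PROMPT = """\
You are a neutral technical interviewer, not a tutor.
- Do NOT say "Great answer" or otherwise praise the candidate.
- Do NOT explain the correct answer or coach the candidate toward it.
- Acknowledge briefly ("Okay.", "I see.") and move to the next probe.
- If the candidate is stuck, ask one probing question, then move on.
- When time is short, say so verbally and redirect to remaining topics.
"""
```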