Case Study

Real-Time Object Detection, Tracking & Segmentation Pipeline

Investigated the limits of CNN-based detection and tracking under real-world camera conditions — then independently designed a transition from bounding-box tracking to pixel-level segmentation. Started with YOLO + DeepSORT + Kalman filtering for fixed-camera scenarios, identified systematic failure modes in moving-camera settings (ID switching after occlusion, lost small/edge objects), and drove the architectural pivot to transformer-based segmentation (SegFormer, Mask2Former, SAM) to recover scene understanding where tracking broke down.

Organization: Washington University in St. Louis
Role: Lead ML Engineer
Timeline: Spring – Fall 2025
Reading Time: 5 min read
YOLO · DeepSORT · Kalman Filter · Mask2Former · SegFormer · SAM · AWS · Docker
19% → 43%
mAP improvement

Overview

Built an end-to-end computer vision pipeline that evolved from CNN-based detection and tracking to transformer-based segmentation — driven by a systematic investigation into where and why standard approaches fail under real-world conditions. This project was a direct continuation of insights from my Glodon internship, where I identified the architectural ceiling of single-stage detection. Here, I pushed further: YOLO + DeepSORT + Kalman filtering for tracking, then a principled pivot to SegFormer, Mask2Former, and SAM when tracking broke down — deployed on AWS (SageMaker, Kubernetes) as a scalable MLOps workflow.

Problem

The system needed to handle two fundamentally different input scenarios:

Fixed-camera video (e.g., surveillance, traffic monitoring): The background is static. YOLO detects objects frame-by-frame, DeepSORT assigns persistent IDs, and the Kalman filter predicts motion trajectories reliably. This works well — the motion model's assumptions hold.

Moving-camera video (e.g., handheld footage, sports broadcasts from the stands): The entire background shifts frame-to-frame. This breaks the core assumptions that Kalman filtering relies on:

  • ID switching after occlusion: A person walks behind a pole and re-emerges — DeepSORT assigns a new ID because the motion model can no longer confidently re-associate the track when background motion dominates.
  • Small object loss: Objects near frame edges or at distance are detected intermittently, and once a track is lost it rarely recovers.
  • Trajectory drift: The Kalman filter's predicted positions diverge from reality when camera ego-motion isn't compensated.

These aren't edge cases — they represent a fundamental limitation of the detection → tracking paradigm when camera motion violates the stationarity assumptions baked into Kalman filtering.
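The drift mechanism is easy to reproduce numerically. Below is a minimal constant-velocity Kalman filter in NumPy — a deliberately simplified stand-in for the filter inside DeepSORT, with made-up noise parameters — tracking an object that is stationary in the world but appears to move because the camera pans. While the pan is steady the filter adapts; the moment the pan reverses, the learned velocity points the wrong way and the prediction error spikes, which is exactly the kind of innovation that breaks track association:

```python
import numpy as np

dt = 1.0
# State is [x, y, vx, vy]; constant-velocity transition and position-only observation.
F = np.array([[1., 0., dt, 0.],
              [0., 1., 0., dt],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
H = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])
Q, R = 0.01 * np.eye(4), np.eye(2)   # illustrative noise covariances

def step(x, P, z):
    # Predict with the constant-velocity motion model, then correct with z.
    x, P = F @ x, F @ P @ F.T + Q
    y = z - H @ x                     # innovation: how wrong the prediction was
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(4) - K @ H) @ P, float(np.linalg.norm(y))

# A world-stationary object: ALL apparent motion comes from camera ego-motion.
x, P = np.zeros(4), np.eye(4)
z, innovations = np.zeros(2), []
for t in range(15):
    pan = np.array([5.0, 0.0]) if t < 12 else np.array([-5.0, 0.0])
    z = z + pan                       # camera pan shifts the measured image position
    x, P, err = step(x, P, z)
    innovations.append(err)
```

After a dozen frames of steady pan the innovation has shrunk to near zero — the filter has absorbed the ego-motion into its velocity estimate. At the reversal frame the error jumps to roughly twice the per-frame motion, large enough to fail DeepSORT-style association gating.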

Why It Mattered

At Glodon, I learned where YOLO hits its ceiling on fine-grained, dense targets. Here, I encountered a different ceiling: CNN-based detection is strong, but layering tracking on top only works when the motion model's assumptions hold. Fine-tuning the CNN or tweaking DeepSORT parameters doesn't solve a problem rooted in the paradigm itself.

The question shifted from "how do we make tracking work better?" to "is bounding-box tracking the right abstraction at all?" That reframing is what drove the pivot to segmentation — and it came from independent research motivation, not from being told what to try next.

Approach

Phase 1: Detection + Tracking (Fixed Camera)

  1. YOLO (CNN-based detection): Fine-tuned for real-time object detection — building on my Glodon experience with frozen-backbone fine-tuning and domain-specific augmentation.
  2. DeepSORT + Kalman Filter: Multi-object tracking using appearance embeddings and Kalman-filtered motion prediction for persistent ID assignment across frames.
  3. Validation: Stable on fixed-camera benchmarks — consistent IDs, smooth trajectories, reliable re-identification after brief occlusions.
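The ID-assignment step at the heart of this phase can be illustrated with a stripped-down tracker. This is not DeepSORT itself — no appearance embeddings, no Kalman gating — just greedy IoU association, which is enough to show how persistent IDs are carried across frames when objects move slowly relative to the frame rate:

```python
def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

class GreedyTracker:
    """Toy tracker: persistent IDs via greedy IoU matching against last frame."""
    def __init__(self, iou_thresh=0.3):
        self.tracks = {}              # track id -> last seen box
        self.next_id = 0
        self.iou_thresh = iou_thresh

    def update(self, detections):
        assigned, candidates = {}, list(self.tracks.items())
        for det in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, box in candidates:
                score = iou(det, box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:       # no match: spawn a new track
                best_id, self.next_id = self.next_id, self.next_id + 1
            else:                     # matched: consume that candidate
                candidates = [(t, b) for t, b in candidates if t != best_id]
            assigned[best_id] = det
        self.tracks = assigned
        return assigned
```

Two detections that overlap their previous boxes keep their IDs; a detection with no overlapping track gets a fresh one. DeepSORT adds appearance features and Kalman-predicted boxes on top of exactly this association skeleton — which is also why it inherits the motion model's failure modes.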

Phase 2: Diagnosing Failure Modes (Moving Camera)

Systematically tested the pipeline on moving-camera footage and documented where it broke:

  • Kalman filter motion predictions diverge when background optical flow dominates object motion
  • DeepSORT's re-identification fails when appearance features are ambiguous post-occlusion
  • Small objects near frame edges are detected intermittently, creating fragmented tracks

Key insight: these failure modes are inherent to the bounding-box + motion-model paradigm — not fixable by better hyperparameters or more training data.

Phase 3: Pivot to Segmentation

Investigated pixel-level scene understanding as an alternative that doesn't depend on motion priors:

  1. SegFormer: Lightweight transformer-based semantic segmentation — efficient enough for near-real-time inference, strong on scene-level understanding.
  2. Mask2Former: Unified architecture for panoptic, instance, and semantic segmentation using masked attention — explored its encoder-decoder design for fine-grained object boundaries.
  3. SAM (Segment Anything Model): Foundation model for promptable segmentation — tested for zero-shot generalization on unseen object categories.

Why segmentation works where tracking fails: Segmentation provides stronger shape and region information per frame, enabling more stable object association without relying on temporal motion models. A segmentation model understands each frame independently at the pixel level — it doesn't need the camera to be stationary.
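One way to make the "shape over motion" point concrete: per-pixel masks support translation-invariant shape descriptors, so objects can be re-associated across frames even when the whole frame shifts under camera pan. The sketch below (an illustration of the principle, not the actual SegFormer/Mask2Former pipeline) matches instance masks by area and second-order central moments, which are unchanged by a global translation:

```python
import numpy as np

def shape_descriptor(mask):
    """Translation-invariant descriptor of a boolean instance mask:
    area plus second-order central moments of the pixel coordinates."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    return np.array([float(len(xs)),
                     ((xs - cx) ** 2).mean(),
                     ((ys - cy) ** 2).mean(),
                     ((xs - cx) * (ys - cy)).mean()])

def match(prev_masks, curr_masks):
    """Greedily pair masks across frames by descriptor distance."""
    prev_d = [shape_descriptor(m) for m in prev_masks]
    pairs, used = [], set()
    for i, m in enumerate(curr_masks):
        d = shape_descriptor(m)
        dists = [np.linalg.norm(d - p) if j not in used else np.inf
                 for j, p in enumerate(prev_d)]
        j = int(np.argmin(dists))
        used.add(j)
        pairs.append((j, i))
    return pairs

# Two differently shaped objects; the whole frame shifts by a camera pan.
blank = np.zeros((100, 100), bool)
a1 = blank.copy(); a1[10:20, 10:30] = True       # wide region
b1 = blank.copy(); b1[60:80, 60:70] = True       # tall region
a2 = np.roll(a1, (15, 25), axis=(0, 1))          # pan moves both objects
b2 = np.roll(b1, (15, 25), axis=(0, 1))
```

Even with the current-frame masks presented in a different order, each one matches its true predecessor, because the descriptors depend on shape, not position — no motion prior, no stationary-camera assumption.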

Architectural Reflection: CNN vs. Transformer

Through this project, I developed a deeper understanding of the fundamental tradeoff:

  • CNN hierarchical compression (pooling, striding) is computationally efficient but progressively discards fine spatial detail — acceptable for detection but limiting for dense prediction tasks.
  • Transformer attention and mask mechanisms can more flexibly preserve and attend to key region information across spatial scales, at the cost of compute.
  • Future direction: more targeted loss design — mask supervision, joint category-and-region constraints — could further improve precision and robustness by explicitly teaching the model what spatial relationships matter.
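The first bullet — pooling discarding fine spatial detail — can be demonstrated in a few lines. Two inputs whose activations sit at different positions inside each 2×2 window max-pool to the identical feature map, so no decoder downstream of that layer can recover the pixel-exact boundary:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2, as in a typical CNN downsampling stage."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two inputs that differ only in WHERE the activation sits within each window.
a = np.zeros((4, 4)); a[0, 0] = a[2, 2] = 1.0
b = np.zeros((4, 4)); b[1, 1] = b[3, 3] = 1.0
```

`max_pool2x2(a)` and `max_pool2x2(b)` are equal even though `a != b`: the sub-window position is gone. Attention-based architectures avoid this particular loss by keeping (or re-attending to) full-resolution token positions, at the compute cost noted above.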

Deployment

  • Trained on AWS SageMaker with Docker-based Hugging Face containers for distributed fine-tuning
  • Deployed on Kubernetes with autoscaling inference endpoints
  • Data storage on AWS S3 with versioned pipeline artifacts
  • Established a reusable, modular cloud-native MLOps workflow
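For the autoscaling inference endpoints, a standard Kubernetes `HorizontalPodAutoscaler` is one way to express the behavior described above. This manifest is a hedged sketch — the deployment name, replica bounds, and CPU target are illustrative placeholders, not the project's actual values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: segmentation-inference        # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: segmentation-inference      # the inference Deployment to scale
  minReplicas: 2                      # keep capacity for baseline traffic
  maxReplicas: 10                     # cap cost under load spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # scale out above 70% average CPU
```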

Results & Impact

  • Achieved mAP improvement from 19% to 43% through systematic architecture benchmarking
  • Deployed real-time segmentation with autoscaling and reliable cloud performance
  • Documented a principled framework for when to move beyond detection + tracking toward segmentation
  • Connected insights from Glodon (detection ceiling) to this project (tracking ceiling) to form a coherent research arc across CNN → tracking → segmentation → Transformer architectures

Lessons Learned

  • The hardest part wasn't building the models — it was recognizing that the tracking paradigm itself was the bottleneck
  • Kalman filtering is powerful under its assumptions, but those assumptions don't hold when the camera moves — and no amount of fine-tuning fixes a broken assumption
  • The progression from Glodon → this project taught me to think in terms of architectural ceilings: every approach has one, and the skill is knowing when you've hit it vs. when you need to tune harder
  • Segmentation and tracking solve different problems — knowing when each is appropriate is a design decision, not a tuning decision
  • CNN vs. Transformer is not "old vs. new" — it's a tradeoff between computational efficiency and spatial flexibility, and the right choice depends on what the task demands