Real-Time Object Detection, Tracking & Segmentation Pipeline
Investigated the limits of CNN-based detection and tracking under real-world camera conditions — then independently designed a transition from bounding-box tracking to pixel-level segmentation. Started with YOLO + DeepSORT + Kalman filtering for fixed-camera scenarios, identified systematic failure modes in moving-camera settings (ID switching after occlusion, lost small/edge objects), and drove the architectural pivot to transformer-based segmentation (SegFormer, Mask2Former, SAM) to recover scene understanding where tracking broke down.
Overview
Built an end-to-end computer vision pipeline that evolved from CNN-based detection and tracking to transformer-based segmentation — driven by a systematic investigation into where and why standard approaches fail under real-world conditions. This project was a direct continuation of insights from my Glodon internship, where I identified the architectural ceiling of single-stage detection. Here, I pushed further: YOLO + DeepSORT + Kalman filtering for tracking, then a principled pivot to SegFormer, Mask2Former, and SAM when tracking broke down — deployed on AWS (SageMaker, Kubernetes) as a scalable MLOps workflow.
Problem
The system needed to handle two fundamentally different input scenarios:
Fixed-camera video (e.g., surveillance, traffic monitoring): The background is static. YOLO detects objects frame-by-frame, DeepSORT assigns persistent IDs, and the Kalman filter predicts motion trajectories reliably. This works well — the motion model's assumptions hold.
Moving-camera video (e.g., handheld footage, sports broadcasts from the stands): The entire background shifts frame-to-frame. This breaks the core assumptions that Kalman filtering relies on:
- ID switching after occlusion: A person walks behind a pole and re-emerges — DeepSORT assigns a new ID because the motion model can no longer confidently re-associate the track when background motion dominates.
- Small object loss: Objects near frame edges or at distance are detected intermittently, and once a track is lost it rarely recovers.
- Trajectory drift: The Kalman filter's predicted positions diverge from reality when camera ego-motion isn't compensated.
These aren't edge cases — they represent a fundamental limitation of the detection → tracking paradigm when camera motion violates the stationarity assumptions baked into Kalman filtering.
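To make the stationarity assumption concrete, here is a minimal sketch of the kind of constant-velocity Kalman filter a tracker like DeepSORT uses internally. All values (noise covariances, the pan magnitude, frame counts) are hypothetical; the point is that an uncompensated camera pan shows up as a spike in the filter's innovation, i.e. the prediction stops matching the measurement:

```python
import numpy as np

# Minimal 1D constant-velocity Kalman filter (per-axis), as used
# conceptually inside trackers like DeepSORT. State: [position, velocity].
F = np.array([[1.0, 1.0],   # position += velocity * dt (dt = 1 frame)
              [0.0, 1.0]])  # velocity assumed constant
H = np.array([[1.0, 0.0]])  # we only observe position
Q = np.eye(2) * 1e-3        # process noise (hypothetical value)
R = np.array([[1e-1]])      # measurement noise (hypothetical value)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                   # innovation: measurement minus prediction
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

# Object moves at +1 px/frame; from frame 10 on, a camera pan adds an
# uncompensated +3 px/frame shift to every measurement.
x, P = np.array([[0.0], [0.0]]), np.eye(2)
innovations = []
for t in range(20):
    x, P = predict(x, P)
    z = np.array([[t * 1.0 + (3.0 * (t - 10) if t >= 10 else 0.0)]])
    innovations.append(abs((z - H @ x).item()))
    x, P = update(x, P, z)

# The innovation spikes once the stationary-camera assumption breaks.
print(max(innovations[:10]) < max(innovations[10:]))  # → True
```

The filter eventually re-converges to the combined motion, but during the transient the gating distance is large, which is exactly the window in which re-association fails and IDs switch.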
Why It Mattered
At Glodon, I learned where YOLO hits its ceiling on fine-grained, dense targets. Here, I encountered a different ceiling: CNN-based detection is strong, but layering tracking on top only works when the motion model's assumptions hold. Fine-tuning the CNN or tweaking DeepSORT parameters doesn't solve a problem rooted in the paradigm itself.
The question shifted from "how do we make tracking work better?" to "is bounding-box tracking the right abstraction at all?" That reframing is what drove the pivot to segmentation — and it came from independent research motivation, not from being told what to try next.
Approach
Phase 1: Detection + Tracking (Fixed Camera)
- YOLO (CNN-based detection): Fine-tuned for real-time object detection — building on my Glodon experience with frozen-backbone fine-tuning and domain-specific augmentation.
- DeepSORT + Kalman Filter: Multi-object tracking using appearance embeddings and Kalman-filtered motion prediction for persistent ID assignment across frames.
- Validation: Stable on fixed-camera benchmarks — consistent IDs, smooth trajectories, reliable re-identification after brief occlusions.
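The per-frame association step at the heart of this phase can be illustrated with a stripped-down sketch. Real DeepSORT combines appearance-embedding distance with a Kalman-gated motion cost and solves the assignment optimally; this hypothetical version uses plain greedy IoU matching, which is the same idea in miniature:

```python
# Stripped-down per-frame association: match new detections to existing
# tracks. DeepSORT proper fuses appearance embeddings with Kalman-gated
# motion cost; this sketch greedily matches by IoU alone.

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily pair track boxes with detection boxes by descending IoU."""
    pairs = sorted(
        ((iou(t_box, d_box), tid, di)
         for tid, t_box in tracks.items()
         for di, d_box in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = {}, set(), set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break
        if tid not in used_t and di not in used_d:
            matches[tid] = di
            used_t.add(tid)
            used_d.add(di)
    return matches

tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(tracks, detections))  # → {2: 0, 1: 1}
```

On a fixed camera, the track boxes (refined by Kalman prediction) overlap well with next-frame detections, so this matching is stable; when the whole frame shifts, the overlaps collapse and the cost matrix degrades.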
Phase 2: Diagnosing Failure Modes (Moving Camera)
Systematically tested the pipeline on moving-camera footage and documented where it broke:
- Kalman filter motion predictions diverge when background optical flow dominates object motion
- DeepSORT's re-identification fails when appearance features are ambiguous post-occlusion
- Small objects near frame edges are detected intermittently, creating fragmented tracks
Key insight: these failure modes are inherent to the bounding-box + motion-model paradigm — not fixable by better hyperparameters or more training data.
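The ID-switching failure mode can be quantified per object with a counter in the spirit of the CLEAR MOT metrics. This is a deliberate simplification (not full MOTA/IDF1, and the example IDs are invented) but it captures what gets measured when a track fragments after occlusion:

```python
# Simplified ID-switch counter in the spirit of the CLEAR MOT metrics
# (not full MOTA/IDF1): for one ground-truth object, count the frames
# where the tracker re-acquires it under a different ID.

def count_id_switches(track_ids):
    """track_ids: per-frame tracker ID for one object (None = track lost)."""
    switches, last_id = 0, None
    for tid in track_ids:
        if tid is None:
            continue                 # a lost frame alone is not a switch
        if last_id is not None and tid != last_id:
            switches += 1            # object re-acquired under a new ID
        last_id = tid
    return switches

# A person occluded by a pole (frames 3-4) re-emerges:
fixed_camera  = [7, 7, 7, None, None, 7, 7, 7]     # re-identified: 0 switches
moving_camera = [7, 7, 7, None, None, 12, 12, 12]  # new ID issued: 1 switch
print(count_id_switches(fixed_camera), count_id_switches(moving_camera))  # → 0 1
```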
Phase 3: Pivot to Segmentation
Investigated pixel-level scene understanding as an alternative that doesn't depend on motion priors:
- SegFormer: Lightweight transformer-based semantic segmentation — efficient enough for near-real-time inference, strong on scene-level understanding.
- Mask2Former: Unified architecture for panoptic, instance, and semantic segmentation using masked attention — explored its encoder-decoder design for fine-grained object boundaries.
- SAM (Segment Anything Model): Foundation model for promptable segmentation — tested for zero-shot generalization on unseen object categories.
Why segmentation works where tracking fails: Segmentation provides stronger shape and region information per frame, enabling more stable object association without relying on temporal motion models. A segmentation model understands each frame independently at the pixel level — it doesn't need the camera to be stationary.
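A minimal sketch of what motion-model-free association can look like, using toy boolean masks (the shapes, shift, and threshold are illustrative, not the project's actual linker). Each frame is segmented independently and instances are linked purely by mask overlap, so there is no stationarity assumption for camera ego-motion to violate:

```python
import numpy as np

# Sketch of motion-model-free association: segment each frame
# independently, then link instances across frames by mask IoU alone.
# No Kalman prediction is involved, so camera motion cannot break a
# motion prior that was never assumed.

def mask_iou(a, b):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_instances(prev_masks, curr_masks, threshold=0.5):
    """Map each current-frame mask to its best-overlapping previous mask."""
    links = {}
    for ci, cm in enumerate(curr_masks):
        scores = [mask_iou(cm, pm) for pm in prev_masks]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= threshold:
            links[ci] = best
    return links

# Two toy instances; next frame, both shift 1 px right (e.g. camera pan).
frame = np.zeros((2, 16, 16), dtype=bool)
frame[0, 2:6, 2:6] = True      # instance 0
frame[1, 9:13, 9:13] = True    # instance 1
shifted = np.roll(frame, shift=1, axis=2)  # same instances, shifted right
print(link_instances(list(frame), list(shifted)))  # → {0: 0, 1: 1}
```

The pixel-level region carries enough shape information that overlap-based linking stays stable under moderate per-frame displacement, which is exactly where the bounding-box motion model drifted.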
Architectural Reflection: CNN vs. Transformer
Through this project, I developed a deeper understanding of the fundamental tradeoff:
- CNN hierarchical compression (pooling, striding) is computationally efficient but progressively discards fine spatial detail — acceptable for detection but limiting for dense prediction tasks.
- Transformer attention and mask mechanisms can more flexibly preserve and attend to key region information across spatial scales, at the cost of compute.
- Future direction: more targeted loss design — mask supervision, joint category-and-region constraints — could further improve precision and robustness by explicitly teaching the model what spatial relationships matter.
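The first tradeoff above has a tiny concrete illustration. Repeated 2x2 max-pooling (the hierarchical compression CNN backbones rely on) preserves *that* a small activation exists but destroys *where exactly* it is, which is precisely the spatial detail dense prediction needs back. The sizes here are toy values:

```python
import numpy as np

# Toy illustration of the CNN downsampling tradeoff: repeated 2x2
# max-pooling keeps "something is there" but discards "where exactly" --
# acceptable for detection, limiting for dense prediction.

def max_pool2x2(x):
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feat = np.zeros((16, 16))
feat[5, 9] = 1.0          # a 1-pixel "small object"

x = feat
for _ in range(3):        # 3 pooling stages: 16x16 -> 8x8 -> 4x4 -> 2x2
    x = max_pool2x2(x)

# The activation survives, but its location is now known only to within
# an 8x8 block of the original grid.
print(feat.shape, x.shape, float(x.max()))  # → (16, 16) (2, 2) 1.0
```

Attention-based decoders (as in SegFormer and Mask2Former) sidestep this by fusing multi-scale features and attending directly to high-resolution maps, paying for that spatial flexibility in compute.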
Deployment
- Trained on AWS SageMaker with Docker-based Hugging Face containers for distributed fine-tuning
- Deployed on Kubernetes with autoscaling inference endpoints
- Data storage on AWS S3 with versioned pipeline artifacts
- Established a reusable, modular cloud-native MLOps workflow
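For flavor, a configuration sketch of the SageMaker training step described above, using the SageMaker Python SDK's Hugging Face estimator. This is not runnable as-is: the role ARN, script names, instance choices, framework versions, and hyperparameters are all placeholder assumptions, not the project's actual values.

```python
# Configuration sketch (placeholders throughout): a SageMaker Hugging Face
# estimator of the kind used for Docker-based distributed fine-tuning.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # hypothetical training script
    source_dir="./scripts",
    instance_type="ml.g4dn.xlarge",    # GPU instance; choice is illustrative
    instance_count=2,                  # distributed across 2 nodes
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    transformers_version="4.28",       # assumed supported version combo
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 10},    # illustrative only
)

# estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path;
# model artifacts are versioned back to S3 alongside pipeline artifacts.
```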
Results & Impact
- Improved mAP from 19% to 43% through systematic architecture benchmarking
- Deployed real-time segmentation with autoscaling and reliable cloud performance
- Documented a principled framework for when to move beyond detection + tracking toward segmentation
- Connected insights from Glodon (detection ceiling) to this project (tracking ceiling) to form a coherent research arc across CNN → tracking → segmentation → Transformer architectures
Lessons Learned
- The hardest part wasn't building the models — it was recognizing that the tracking paradigm itself was the bottleneck
- Kalman filtering is powerful under its assumptions, but those assumptions don't hold when the camera moves — and no amount of fine-tuning fixes a broken assumption
- The progression from Glodon → this project taught me to think in terms of architectural ceilings: every approach has one, and the skill is knowing when you've hit it vs. when you need to tune harder
- Segmentation and tracking solve different problems — knowing when each is appropriate is a design decision, not a tuning decision
- CNN vs. Transformer is not "old vs. new" — it's a tradeoff between computational efficiency and spatial flexibility, and the right choice depends on what the task demands