Real-Time Object Detection, Tracking & Segmentation Pipeline
Investigated the limits of CNN-based detection and tracking under real-world camera conditions — then independently designed a transition from bounding-box tracking to pixel-level segmentation. Started with YOLO + DeepSORT + Kalman filtering for fixed-camera scenarios, identified systematic failure modes in moving-camera settings (ID switching after occlusion, lost small/edge objects), and drove the architectural pivot to transformer-based segmentation (SegFormer, Mask2Former, SAM) to recover scene understanding where tracking broke down.
Overview
Built an end-to-end computer vision pipeline that evolved from CNN-based detection and tracking to transformer-based segmentation — driven by a systematic investigation into where and why standard approaches fail under real-world conditions. This project was a direct continuation of insights from my Glodon internship, where I identified the architectural ceiling of single-stage detection. Here, I pushed further: YOLO + DeepSORT + Kalman filtering for tracking, then a principled pivot to SegFormer, Mask2Former, and SAM when tracking broke down — deployed on AWS (SageMaker, Kubernetes) as a scalable MLOps workflow.
Problem
The system needed to handle two fundamentally different input scenarios:
Fixed-camera video (e.g., surveillance, traffic monitoring): The background is static. YOLO detects objects frame-by-frame, DeepSORT assigns persistent IDs, and the Kalman filter predicts motion trajectories reliably. This works well — the motion model's assumptions hold.
Moving-camera video (e.g., handheld footage, sports broadcasts from the stands): The entire background shifts frame-to-frame. This breaks the core assumptions that Kalman filtering relies on:
- ID switching after occlusion: A person walks behind a pole and re-emerges — DeepSORT assigns a new ID because the motion model can no longer confidently re-associate the track when background motion dominates.
- Small object loss: Objects near frame edges or at distance are detected intermittently, and once a track is lost it rarely recovers.
- Trajectory drift: The Kalman filter's predicted positions diverge from reality when camera ego-motion isn't compensated.
These aren't edge cases — they represent a fundamental limitation of the detection → tracking paradigm when camera motion violates the stationarity assumptions baked into Kalman filtering.
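To make the stationarity assumption concrete, here is a minimal sketch of the kind of constant-velocity Kalman filter a tracker like DeepSORT uses internally. All values (noise covariances, the pan magnitude, frame counts) are hypothetical; the point is that an uncompensated camera pan shows up as a spike in the filter's innovation, i.e. the prediction stops matching the measurement:

```python
import numpy as np

# Minimal 1D constant-velocity Kalman filter (per-axis), as used
# conceptually inside trackers like DeepSORT. State: [position, velocity].
F = np.array([[1.0, 1.0],   # position += velocity * dt (dt = 1 frame)
              [0.0, 1.0]])  # velocity assumed constant
H = np.array([[1.0, 0.0]])  # we only observe position
Q = np.eye(2) * 1e-3        # process noise (hypothetical value)
R = np.array([[1e-1]])      # measurement noise (hypothetical value)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                   # innovation: measurement minus prediction
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

# Object moves at +1 px/frame; from frame 10 on, a camera pan adds an
# uncompensated +3 px/frame shift to every measurement.
x, P = np.array([[0.0], [0.0]]), np.eye(2)
innovations = []
for t in range(20):
    x, P = predict(x, P)
    z = np.array([[t * 1.0 + (3.0 * (t - 10) if t >= 10 else 0.0)]])
    innovations.append(abs((z - H @ x).item()))
    x, P = update(x, P, z)

# The innovation spikes once the stationary-camera assumption breaks.
print(max(innovations[:10]) < max(innovations[10:]))  # → True
```

The filter eventually re-converges to the combined motion, but during the transient the gating distance is large, which is exactly the window in which re-association fails and IDs switch.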
Why It Mattered
At Glodon, I learned where YOLO hits its ceiling on fine-grained, dense targets. Here, I encountered a different ceiling: CNN-based detection is strong, but layering tracking on top only works when the motion model's assumptions hold. Fine-tuning the CNN or tweaking DeepSORT parameters doesn't solve a problem rooted in the paradigm itself.
The question shifted from "how do we make tracking work better?" to "is bounding-box tracking the right abstraction at all?" That reframing is what drove the pivot to segmentation — and it came from independent research motivation, not from being told what to try next.
Approach
Phase 1: Detection + Tracking (Fixed Camera)
- YOLO (CNN-based detection): Fine-tuned for real-time object detection — building on my Glodon experience with frozen-backbone fine-tuning and domain-specific augmentation.
- DeepSORT + Kalman Filter: Multi-object tracking using appearance embeddings and Kalman-filtered motion prediction for persistent ID assignment across frames.
- Validation: Stable on fixed-camera benchmarks — consistent IDs, smooth trajectories, reliable re-identification after brief occlusions.
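The per-frame association step at the heart of this phase can be illustrated with a stripped-down sketch. Real DeepSORT combines appearance-embedding distance with a Kalman-gated motion cost and solves the assignment optimally; this hypothetical version uses plain greedy IoU matching, which is the same idea in miniature:

```python
# Stripped-down per-frame association: match new detections to existing
# tracks. DeepSORT proper fuses appearance embeddings with Kalman-gated
# motion cost; this sketch greedily matches by IoU alone.

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily pair track boxes with detection boxes by descending IoU."""
    pairs = sorted(
        ((iou(t_box, d_box), tid, di)
         for tid, t_box in tracks.items()
         for di, d_box in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = {}, set(), set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break
        if tid not in used_t and di not in used_d:
            matches[tid] = di
            used_t.add(tid)
            used_d.add(di)
    return matches

tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(tracks, detections))  # → {2: 0, 1: 1}
```

On a fixed camera, the track boxes (refined by Kalman prediction) overlap well with next-frame detections, so this matching is stable; when the whole frame shifts, the overlaps collapse and the cost matrix degrades.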
Phase 2: Diagnosing Failure Modes (Moving Camera)
Systematically tested the pipeline on moving-camera footage and documented where it broke:
- Kalman filter motion predictions diverge when background optical flow dominates object motion
- DeepSORT's re-identification fails when appearance features are ambiguous post-occlusion
- Small objects near frame edges are detected intermittently, creating fragmented tracks
Key insight: these failure modes are inherent to the bounding-box + motion-model paradigm — not fixable by better hyperparameters or more training data.
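The ID-switching failure mode can be quantified per object with a counter in the spirit of the CLEAR MOT metrics. This is a deliberate simplification (not full MOTA/IDF1, and the example IDs are invented) but it captures what gets measured when a track fragments after occlusion:

```python
# Simplified ID-switch counter in the spirit of the CLEAR MOT metrics
# (not full MOTA/IDF1): for one ground-truth object, count the frames
# where the tracker re-acquires it under a different ID.

def count_id_switches(track_ids):
    """track_ids: per-frame tracker ID for one object (None = track lost)."""
    switches, last_id = 0, None
    for tid in track_ids:
        if tid is None:
            continue                 # a lost frame alone is not a switch
        if last_id is not None and tid != last_id:
            switches += 1            # object re-acquired under a new ID
        last_id = tid
    return switches

# A person occluded by a pole (frames 3-4) re-emerges:
fixed_camera  = [7, 7, 7, None, None, 7, 7, 7]     # re-identified: 0 switches
moving_camera = [7, 7, 7, None, None, 12, 12, 12]  # new ID issued: 1 switch
print(count_id_switches(fixed_camera), count_id_switches(moving_camera))  # → 0 1
```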
Phase 3: Pivot to Segmentation
Investigated pixel-level scene understanding as an alternative that doesn't depend on motion priors:
- SegFormer: Lightweight transformer-based semantic segmentation — efficient enough for near-real-time inference, strong on scene-level understanding.
- Mask2Former: Unified architecture for panoptic, instance, and semantic segmentation using masked attention — explored its encoder-decoder design for fine-grained object boundaries.
- SAM (Segment Anything Model): Foundation model for promptable segmentation — tested for zero-shot generalization on unseen object categories.
Why segmentation works where tracking fails: Segmentation provides stronger shape and region information per frame, enabling more stable object association without relying on temporal motion models. A segmentation model understands each frame independently at the pixel level — it doesn't need the camera to be stationary.
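A minimal sketch of what motion-model-free association can look like, using toy boolean masks (the shapes, shift, and threshold are illustrative, not the project's actual linker). Each frame is segmented independently and instances are linked purely by mask overlap, so there is no stationarity assumption for camera ego-motion to violate:

```python
import numpy as np

# Sketch of motion-model-free association: segment each frame
# independently, then link instances across frames by mask IoU alone.
# No Kalman prediction is involved, so camera motion cannot break a
# motion prior that was never assumed.

def mask_iou(a, b):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_instances(prev_masks, curr_masks, threshold=0.5):
    """Map each current-frame mask to its best-overlapping previous mask."""
    links = {}
    for ci, cm in enumerate(curr_masks):
        scores = [mask_iou(cm, pm) for pm in prev_masks]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= threshold:
            links[ci] = best
    return links

# Two toy instances; next frame, both shift 1 px right (e.g. camera pan).
frame = np.zeros((2, 16, 16), dtype=bool)
frame[0, 2:6, 2:6] = True      # instance 0
frame[1, 9:13, 9:13] = True    # instance 1
shifted = np.roll(frame, shift=1, axis=2)  # same instances, shifted right
print(link_instances(list(frame), list(shifted)))  # → {0: 0, 1: 1}
```

The pixel-level region carries enough shape information that overlap-based linking stays stable under moderate per-frame displacement, which is exactly where the bounding-box motion model drifted.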
Architectural Reflection: CNN vs. Transformer
Through this project, I developed a deeper understanding of the fundamental tradeoff:
- CNN hierarchical compression (pooling, striding) is computationally efficient but progressively discards fine spatial detail — acceptable for detection but limiting for dense prediction tasks.
- Transformer attention and mask mechanisms can more flexibly preserve and attend to key region information across spatial scales, at the cost of compute.
- Future direction: more targeted loss design — mask supervision, joint category-and-region constraints — could further improve precision and robustness by explicitly teaching the model what spatial relationships matter.
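The first tradeoff above has a tiny concrete illustration. Repeated 2x2 max-pooling (the hierarchical compression CNN backbones rely on) preserves *that* a small activation exists but destroys *where exactly* it is, which is precisely the spatial detail dense prediction needs back. The sizes here are toy values:

```python
import numpy as np

# Toy illustration of the CNN downsampling tradeoff: repeated 2x2
# max-pooling keeps "something is there" but discards "where exactly" --
# acceptable for detection, limiting for dense prediction.

def max_pool2x2(x):
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feat = np.zeros((16, 16))
feat[5, 9] = 1.0          # a 1-pixel "small object"

x = feat
for _ in range(3):        # 3 pooling stages: 16x16 -> 8x8 -> 4x4 -> 2x2
    x = max_pool2x2(x)

# The activation survives, but its location is now known only to within
# an 8x8 block of the original grid.
print(feat.shape, x.shape, float(x.max()))  # → (16, 16) (2, 2) 1.0
```

Attention-based decoders (as in SegFormer and Mask2Former) sidestep this by fusing multi-scale features and attending directly to high-resolution maps, paying for that spatial flexibility in compute.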
Deployment
- Trained on AWS SageMaker with Docker-based Hugging Face containers for distributed fine-tuning
- Deployed on Kubernetes with autoscaling inference endpoints
- Data storage on AWS S3 with versioned pipeline artifacts
- Established a reusable, modular cloud-native MLOps workflow
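For flavor, a configuration sketch of the SageMaker training step described above, using the SageMaker Python SDK's Hugging Face estimator. This is not runnable as-is: the role ARN, script names, instance choices, framework versions, and hyperparameters are all placeholder assumptions, not the project's actual values.

```python
# Configuration sketch (placeholders throughout): a SageMaker Hugging Face
# estimator of the kind used for Docker-based distributed fine-tuning.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # hypothetical training script
    source_dir="./scripts",
    instance_type="ml.g4dn.xlarge",    # GPU instance; choice is illustrative
    instance_count=2,                  # distributed across 2 nodes
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    transformers_version="4.28",       # assumed supported version combo
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 10},    # illustrative only
)

# estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path;
# model artifacts are versioned back to S3 alongside pipeline artifacts.
```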
Results & Impact
- Improved mAP from 19% to 43% through systematic architecture benchmarking
- Deployed real-time segmentation with autoscaling and reliable cloud performance
- Documented a principled framework for when to move beyond detection + tracking toward segmentation
- Connected insights from Glodon (detection ceiling) to this project (tracking ceiling) to form a coherent research arc across CNN → tracking → segmentation → Transformer architectures
Lessons Learned
- The hardest part wasn't building the models — it was recognizing that the tracking paradigm itself was the bottleneck
- Kalman filtering is powerful under its assumptions, but those assumptions don't hold when the camera moves — and no amount of fine-tuning fixes a broken assumption
- The progression from Glodon → this project taught me to think in terms of architectural ceilings: every approach has one, and the skill is knowing when you've hit it vs. when you need to tune harder
- Segmentation and tracking solve different problems — knowing when each is appropriate is a design decision, not a tuning decision
- CNN vs. Transformer is not "old vs. new" — it's a tradeoff between computational efficiency and spatial flexibility, and the right choice depends on what the task demands