Success Stories
Custom data. Fine-tuned models. Real-world fluency across multiple domains.
Multilingual Whisper Fine-Tuning for Global Media Intelligence
We collaborated with a global media analytics company to deliver a production-ready multilingual speech recognition system for high-accuracy transcription across Japanese, Korean, Mandarin, Cantonese, and French. The goal was to enable automated subtitle generation and content indexing in noisy, mixed-language media environments such as TV shows, films, and podcasts.
We designed a pipeline to collect and align over 1,200 hours of domain-specific speech data, sourced from licensed entertainment content, interviews, and public media archives. All audio was segmented, speaker-labeled, and aligned to exact subtitle timestamps, then packaged in a Whisper-compatible format and enriched with token-based metadata for language switching and noise classification.
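For illustration, here is a minimal sketch of what such a packaging step can look like using Hugging Face datasets and the Whisper processor. The file paths, metadata fields, and record schema are hypothetical stand-ins, not the client's actual pipeline.

```python
# Minimal sketch: package segmented, subtitle-aligned audio into a
# Whisper-compatible Hugging Face dataset. Paths, metadata fields,
# and the example record are illustrative, not the real schema.
from datasets import Dataset, Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

records = [
    {
        "audio": "clips/ep01_0001.wav",     # hypothetical segment path
        "text": "字幕に合わせた書き起こし",     # subtitle-aligned transcript
        "language": "japanese",
        "speaker": "spk_03",                # speaker label from diarization
        "noise_class": "studio",            # noise-condition tag
    },
    # ... one record per aligned segment
]

ds = Dataset.from_list(records).cast_column("audio", Audio(sampling_rate=16000))

def prepare(batch):
    # Log-Mel input features expected by Whisper's encoder.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # The tokenizer prepends the language and task tokens, which is how
    # per-segment language metadata reaches the model during training.
    processor.tokenizer.set_prefix_tokens(
        language=batch["language"], task="transcribe"
    )
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=["audio", "text"])
```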
Fine-tuning was performed on a per-language basis using the Whisper small and medium models, followed by targeted LoRA adapters for code-switching and latency-sensitive streaming scenarios. Evaluation covered word error rate (WER), character error rate (CER), BLEU, and real-user transcription fluency scores.
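As a rough illustration of the adapter stage, the sketch below attaches a LoRA adapter to a Whisper checkpoint with the peft library. The rank, target modules, and checkpoint name are illustrative choices rather than the production configuration.

```python
# Minimal sketch: add a LoRA adapter to a Whisper checkpoint using peft.
# Hyperparameters here are assumptions, not the production settings.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model
```

Because only the adapter weights are trained, separate adapters can be swapped in per scenario (for example, one tuned for code-switched dialogue and one for low-latency streaming) without duplicating the base model.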
The result? Sub-15% WER across five languages in noisy, real-world conditions — enabling the client to build internal AI subtitle pipelines and scale multilingual content indexing across markets.
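For teams that want to reproduce the headline metric on their own transcripts, WER and CER can be computed with the jiwer library, as sketched below. The reference and hypothesis strings are placeholders; for languages written without spaces, such as Japanese, Mandarin, and Cantonese, CER is usually the more meaningful of the two.

```python
# Minimal sketch: compute word error rate (WER) and character error
# rate (CER) for a batch of transcripts with jiwer. The strings below
# are placeholder examples, not client data.
import jiwer

references = ["the quick brown fox", "elle est arrivée en retard"]
hypotheses = ["the quick brown fox", "elle est arrivé en retard"]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")
```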
Supervised Action Recognition for Competitive Sports Analysis
We developed a domain-specific action recognition system tailored for analyzing technical movements in competitive sports. The objective was to detect and classify precise athletic techniques—including transitions, takedowns, and holds—within full-length match videos.
Leveraging a fully supervised training approach, we collaborated with subject-matter experts to create a structured dataset featuring 100+ labeled action classes. Each video segment was manually annotated with start/end timestamps, action type, and contextual metadata to support accurate downstream modeling.
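To make the annotation structure concrete, the sketch below expresses one such record as a Python dataclass. The field names and example values are illustrative, not the project's actual schema.

```python
# Minimal sketch: a structured annotation record with start/end timestamps,
# action class, and contextual metadata. Fields and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ActionAnnotation:
    video_id: str
    start_sec: float      # segment start within the full match video
    end_sec: float        # segment end
    action_class: str     # one of 100+ expert-defined technique labels
    annotator_id: str
    context: dict = field(default_factory=dict)  # e.g. round, score state

ann = ActionAnnotation(
    video_id="match_0042",
    start_sec=312.4,
    end_sec=316.9,
    action_class="single_leg_takedown",
    annotator_id="expert_07",
    context={"round": 2},
)
```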
The final model, built on top of a transformer-based video backbone, achieved 78% top-1 accuracy on clip-level predictions and supported real-time inference. A custom labeling interface was also delivered to enable rapid expansion of the dataset and human-in-the-loop quality control.
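As an illustration of clip-level inference with a transformer video backbone, the sketch below runs a pretrained VideoMAE classifier from Hugging Face as a stand-in for the delivered model. The checkpoint, the 16-frame clip length, and the dummy frames are all assumptions for the example.

```python
# Minimal sketch: clip-level action classification with a transformer
# video backbone (VideoMAE), used here as a stand-in for the delivered
# model. The checkpoint and clip shape are assumptions.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# A clip is a list of 16 RGB frames; random pixels stand in for real video.
clip = list(np.random.randint(0, 255, (16, 224, 224, 3), dtype=np.uint8))

inputs = processor(clip, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # top-1 predicted action class
```

In production, the classification head would be fine-tuned on the expert-labeled action classes, and clips would be sampled in a sliding window over the match video so that predictions can be mapped back to timestamps.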