Itbanque

Success Story

Custom data. Fine-tuned models. Real-world fluency across multiple domains.

🌍

Multilingual Whisper Fine-Tuning for Global Media Intelligence

We collaborated with a global media analytics company to deliver a production-ready multilingual speech recognition system for high-accuracy transcription across Japanese, Korean, Mandarin, Cantonese, and French. The goal was to enable automated subtitle generation and content indexing in noisy, mixed-language media environments such as TV shows, films, and podcasts.

We designed a pipeline to collect and align over 1,200 hours of domain-specific speech data, sourced from licensed entertainment content, interviews, and public media archives. All audio was segmented, speaker-labeled, and aligned to exact subtitle timestamps. Data was packaged in Whisper-compatible format, enriched with token-based metadata for language switching and noise classification.
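As an illustrative sketch of the packaging step, the manifest row below shows how an aligned, speaker-labeled segment with a language token and a noise class might be serialized. The field names, the `make_segment` helper, and the `noise_class` values are assumptions for illustration; the exact Whisper language-token set also depends on the tokenizer version (e.g. a dedicated Cantonese token exists only in newer checkpoints).

```python
import json

# Assumed mapping from project language codes to Whisper-style language tokens.
# Treat this as a sketch: verify the token set against your Whisper tokenizer.
LANG_TOKENS = {"ja": "<|ja|>", "ko": "<|ko|>", "zh": "<|zh|>", "yue": "<|yue|>", "fr": "<|fr|>"}

def make_segment(audio_path, start_s, end_s, text, language, speaker, noise_class):
    """Build one JSONL manifest row for an aligned subtitle segment."""
    return {
        "audio": audio_path,
        "start": round(start_s, 3),   # subtitle-aligned timestamps, in seconds
        "end": round(end_s, 3),
        "text": f"{LANG_TOKENS[language]} {text}",  # token-based language metadata
        "speaker": speaker,
        "noise": noise_class,         # e.g. "clean", "music", "crowd"
    }

if __name__ == "__main__":
    seg = make_segment("ep01.wav", 12.5, 15.25, "こんにちは", "ja", "spk1", "music")
    print(json.dumps(seg, ensure_ascii=False))
```

One row per segment keeps the format streamable, so language-switching and noise metadata travel with each utterance rather than with the whole file.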

Fine-tuning was performed on a per-language basis using Whisper small and medium models, followed by targeted LoRA adapters for code-switching and latency-sensitive streaming scenarios. Evaluation included WER, CER, BLEU, and real-user transcription fluency scores.
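For reference, the two headline metrics can be computed from edit distance alone. The minimal, dependency-free sketch below implements WER (word-level) and CER (character-level) as Levenshtein distance divided by reference length; it is a generic definition of the metrics, not the client's evaluation harness.

```python
def edit_distance(ref, hyp):
    """Classic single-row dynamic-programming Levenshtein distance
    over two token sequences (insertions, deletions, substitutions)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution (or match)
            prev = cur
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` is 1/3: one substitution over three reference words, i.e. roughly the per-utterance arithmetic behind a corpus-level "sub-15% WER" claim.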

The result? Sub-15% WER across five languages in noisy, real-world conditions — enabling the client to build internal AI subtitle pipelines and scale multilingual content indexing across markets.

🗣 5 Languages 🎧 1,200h Aligned Speech ⚙️ Whisper + LoRA 🎯 Sub-15% WER 📺 Film, Podcast, News
🏃‍♂️

Supervised Action Recognition for Competitive Sports Analysis

We developed a domain-specific action recognition system tailored for analyzing technical movements in competitive sports. The objective was to detect and classify precise athletic techniques—including transitions, takedowns, and holds—within full-length match videos.

Leveraging a fully supervised training approach, we collaborated with subject-matter experts to create a structured dataset featuring 100+ labeled action classes. Each video segment was manually annotated with start/end timestamps, action type, and contextual metadata to support accurate downstream modeling.
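A schema like the one above can be captured in a small record type. The sketch below is an assumed shape for one expert annotation, with start/end timestamps, an action class, and free-form contextual metadata; the class name, fields, and example labels are hypothetical, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ActionAnnotation:
    """One expert-labeled video segment in the supervised training set."""
    video_id: str
    start_s: float                      # segment start, in seconds
    end_s: float                        # segment end, in seconds
    action: str                         # one of the 100+ expert-defined classes
    metadata: dict = field(default_factory=dict)  # contextual info, e.g. athlete, round

    def duration(self) -> float:
        return self.end_s - self.start_s

    def validate(self, known_classes: set) -> None:
        """Reject malformed segments before they reach training."""
        assert self.end_s > self.start_s, "segment must have positive duration"
        assert self.action in known_classes, f"unknown action class: {self.action}"
```

Validating against the closed class list at ingest time is what keeps a 100+-class taxonomy consistent as new annotators join.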

The final model, built on top of a transformer-based video backbone, achieved 78% top-1 accuracy on clip-level predictions and supported real-time inference. A custom labeling interface was also delivered to enable rapid expansion of the dataset and human-in-the-loop quality control.
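Clip-level predictions still have to be stitched into match-level timelines. The sketch below shows one common post-processing step, assumed rather than taken from the delivered system: merging consecutive identical clip labels into `(label, start, end)` action segments, with a hypothetical fixed clip length.

```python
def clips_to_segments(clip_labels, clip_len_s=2.0):
    """Merge consecutive identical clip-level predictions into
    (label, start_s, end_s) action segments along the video timeline."""
    segments = []
    for i, label in enumerate(clip_labels):
        start = i * clip_len_s
        if segments and segments[-1][0] == label:
            # extend the current segment instead of opening a new one
            segments[-1] = (label, segments[-1][1], start + clip_len_s)
        else:
            segments.append((label, start, start + clip_len_s))
    return segments
```

Because each clip is processed independently, this merge step is the only part that needs the full sequence, which is what makes streaming, real-time inference feasible.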

🎯 Supervised Training Pipeline 🏷️ 100+ Expert-Labeled Classes 📹 Fine-Grained Video Segmentation ⚡ Real-Time Clip Inference 🧰 Annotation Interface Included