BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing

Kaiwen Wang¹,* Kaili Zheng¹,* Yiming Shi¹ Rongrong Deng² Chenyi Guo¹,† Ji Wu¹,†

¹ Tsinghua University · ² Beijing Sport University

* Equal contribution · † Corresponding author

Paper · Dataset on Hugging Face · Benchmark on Hugging Face

445 matches, 77.8 hours: World Boxing Championship videos

52K: broadcast commentary sentences with category labels

260K: millisecond-level punch events

Overview

Why do we need BoxComm?

Existing sports commentary benchmarks mainly focus on generic narration or sentence-level alignment. In contrast, BoxComm is designed for boxing commentary, where both what to say and when to say it matter. Boxing involves highly dynamic, sub-second actions and a much higher proportion of tactical commentary than team sports (45.6% vs. 21.7%), making direct transfer from generic sports narration benchmarks insufficient. To reflect this, BoxComm evaluates commentary along two complementary dimensions: discourse type and narration rhythm.

Key Ingredients

What makes BoxComm different?

🎙️ Category-aware commentary

Each sentence is labeled as play-by-play, tactical, or contextual, so the task is explicitly about discourse control in combat-sport commentary rather than generic narration alone.

👊 Fine-grained punch events

Detected events carry boxer side, punch technique, target area, and effectiveness to support structured understanding of combat exchanges.

⏱️ Structured evaluation

We do not rely on a single holistic judge score. Instead, BoxComm factorizes commentary quality into discourse type, local semantic correctness, and global narration rhythm, making evaluation more structured than unconstrained commentary scoring.
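To make the two annotation layers concrete, here is a minimal Python sketch of how a labeled commentary sentence and a punch event could be represented, plus a helper that computes the share of each commentary category. All class names, field names, and value encodings (for example a boolean `effective` flag or "red"/"blue" sides) are assumptions for illustration; the released dataset may use a different schema.

```python
from collections import Counter
from dataclasses import dataclass

# The three categories named in the text; the exact label strings are assumed.
CATEGORIES = ("play-by-play", "tactical", "contextual")


@dataclass(frozen=True)
class PunchEvent:
    """One detected punch. Field names are illustrative, not the released schema."""
    time_ms: int      # event timestamp in milliseconds
    boxer_side: str   # e.g. "red" or "blue" (assumed encoding)
    technique: str    # e.g. "jab", "hook" (assumed vocabulary)
    target: str       # e.g. "head", "body" (assumed vocabulary)
    effective: bool   # effectiveness modeled as a flag (assumption)


@dataclass(frozen=True)
class CommentarySentence:
    """One commentary sentence with its category label (hypothetical layout)."""
    start_ms: int
    end_ms: int
    text: str
    category: str

    def __post_init__(self):
        # Reject labels outside the three categories.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def category_shares(sentences):
    """Return the fraction of sentences in each category (0.0 if there are none)."""
    counts = Counter(s.category for s in sentences)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: counts[c] / total for c in CATEGORIES}
```

Applied to a full set of labeled sentences, `category_shares` yields the kind of category proportion that the overview compares across sports.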

BoxComm dataset statistics

Construction

How the dataset is built

BoxComm is constructed by aligning commentary extracted from broadcasts with fine-grained atom events detected in boxing exchanges.

Commentary extraction and alignment pipeline from professional boxing broadcasts.

Atom-event extraction pipeline for millisecond-level punch events used in BoxComm.
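As an illustration of what aligning commentary with punch events can look like, the following Python sketch links each sentence to the events that fall inside its time span, extended backwards by a short lead window. The time-window rule, the function names, and the `lead_ms` parameter are assumptions made for this example; the page does not specify the alignment procedure used in BoxComm.

```python
from bisect import bisect_left, bisect_right


def events_in_window(event_times_ms, start_ms, end_ms, lead_ms=0):
    """Return indices of events with start_ms - lead_ms <= t <= end_ms.

    event_times_ms must be sorted in ascending order.
    """
    lo = bisect_left(event_times_ms, start_ms - lead_ms)
    hi = bisect_right(event_times_ms, end_ms)
    return list(range(lo, hi))


def align(sentences, event_times_ms, lead_ms=2000):
    """Map each (start_ms, end_ms) sentence span to the indices of nearby events.

    lead_ms is a hypothetical allowance for commentary that trails the action.
    """
    return [
        events_in_window(event_times_ms, start, end, lead_ms)
        for start, end in sentences
    ]


# Toy data: four punch timestamps and three sentence spans, all in milliseconds.
event_times = [500, 1200, 3000, 7000]
spans = [(2000, 4000), (8000, 9000), (20000, 21000)]
aligned = align(spans, event_times)
```

Binary search over the sorted timestamps keeps each lookup logarithmic in the number of events, which matters at the scale of hundreds of thousands of events.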

Acknowledgments

This research was supported by Huawei’s AI Hundred Schools Program and was carried out using the Huawei Ascend AI technology stack. Additionally, we would like to acknowledge the Xinjiang Uygur Autonomous Region Sports Science Research Center and the research group led by Prof. Qingmin Fan at Beijing Sport University for their critical assistance with the data collection and annotation iteration processes.