Two complementary protocols for accuracy and rhythm

BoxComm evaluates both instant commentary correctness and long-horizon broadcast pacing. The benchmark is intended to reveal whether a model can describe the right thing at the right time and in the right discourse mode, rather than collapsing commentary quality into a single unconstrained subjective score.

Task 1: Category-conditioned commentary generation
Task 2: Streaming commentary rhythm assessment

Protocols

From sentence-level correctness to full-stream pacing

The two benchmark protocols intentionally target different failure modes. One measures local semantic precision under category control, while the other measures global temporal behavior over continuous clips. Together they turn commentary assessment into a structured evaluation problem instead of an unconstrained preference judgment.

Figure: BoxComm benchmark architecture. Overview of the two proposed evaluation protocols: category-conditioned generation and streaming commentary rhythm assessment.

Evaluation Logic

Why this is more than subjective commentary scoring

🧭 Category control constrains the discourse space
The model is not judged under an open-ended “say anything plausible” setup. It must produce commentary in a specified discourse mode, which makes the target behavior more concrete and comparable.

📊 Rhythm is measured against professional broadcast patterns
Streaming evaluation compares timing behavior and category usage against real boxing commentary distributions, so rhythm is assessed with reference to expert broadcast practice rather than free-form preference alone.

🥊 Event-informed baselines expose perception bottlenecks
The event-informed baseline shows that fleeting-action perception materially affects commentary quality, making the benchmark sensitive to combat-specific understanding rather than surface-level fluency alone.

Category-Conditioned Generation

Given video context and a target commentary category, the model must produce a natural sentence that matches both the requested category and the current fight state.
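As a rough illustration of what a category-conditioned request could look like, the sketch below pairs a clip window with a target category and turns it into a text instruction. The category labels, field names, and prompt wording are all hypothetical; the actual BoxComm taxonomy and input format are not specified on this page.

```python
from dataclasses import dataclass

# Hypothetical category labels, used only for illustration.
# The real BoxComm category set is not listed on this page.
CATEGORIES = ("play_by_play", "tactical", "contextual")


@dataclass(frozen=True)
class CommentaryRequest:
    """One category-conditioned generation request (assumed structure)."""
    clip_id: str
    start_s: float
    end_s: float
    category: str

    def __post_init__(self):
        # Reject categories outside the (hypothetical) label set.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        # The window must have positive duration.
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")


def build_prompt(req: CommentaryRequest) -> str:
    """Render the request as a text instruction for a video-language model."""
    return (
        f"Clip {req.clip_id}, {req.start_s:.1f}s to {req.end_s:.1f}s. "
        f"Write one natural {req.category} commentary sentence "
        "that matches the current fight state."
    )
```

Validating the category up front keeps the discourse mode explicit, so an output can later be checked against the category that was actually requested.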

Rhythm Assessment

The model generates commentary over continuous streaming clips, and the evaluation checks whether its timing and category distribution resemble those of professional broadcasts.
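One plausible way to compare a model's category usage and timing with a professional reference is sketched below. It uses Jensen-Shannon divergence for category distributions and a simple speaking ratio for timing. Both choices are assumptions made for illustration; this page does not state which rhythm metrics BoxComm actually uses, and all function names here are hypothetical.

```python
import math
from collections import Counter


def category_distribution(labels, categories):
    """Normalized frequency of each category, in the order given."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels from the given categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def speaking_ratio(segments, clip_duration_s):
    """Fraction of the clip covered by commentary.

    `segments` is a list of non-overlapping (start_s, end_s) pairs.
    """
    if clip_duration_s <= 0:
        raise ValueError("clip_duration_s must be positive")
    spoken = sum(end - start for start, end in segments)
    return spoken / clip_duration_s
```

Under this sketch, a model whose category mix diverges little from the broadcast reference and whose speaking ratio is close to the reference value would count as having broadcast-like rhythm.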

Event-Informed Baseline

EIC-Gen injects punch-event cues into commentary generation, showing that fleeting-action perception directly affects combat-sport narration quality.
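The idea of injecting event cues can be sketched as formatting detected punch events into text and placing them ahead of the generation prompt. This is only an illustrative assumption about how such cues might be supplied; the event fields, cue format, and function names below are hypothetical and are not taken from EIC-Gen itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PunchEvent:
    """A detected punch (hypothetical fields, for illustration only)."""
    time_s: float
    fighter: str
    punch_type: str
    landed: bool


def format_event_cues(events, window_start_s, window_end_s):
    """List punch events inside [window_start_s, window_end_s) in time order."""
    in_window = sorted(
        (e for e in events if window_start_s <= e.time_s < window_end_s),
        key=lambda e: e.time_s,
    )
    if not in_window:
        return "No punch events detected in this window."
    return "\n".join(
        f"[{e.time_s:.1f}s] {e.fighter}: {e.punch_type} "
        f"({'landed' if e.landed else 'missed'})"
        for e in in_window
    )


def inject_cues(base_prompt, cue_text):
    """Prepend the event cues to an existing generation prompt."""
    return f"Punch events:\n{cue_text}\n\n{base_prompt}"
```

Keeping the cues as a separate text block makes it straightforward to compare generation with and without event information on the same clip.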

Citation

BibTeX

arXiv preprint citation for the current BoxComm paper.

@article{boxcomm2026,
  title   = {BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing},
  author  = {Wang, Kaiwen and Zheng, Kaili and Shi, Yiming and Deng, Rongrong and Guo, Chenyi and Wu, Ji},
  journal = {arXiv preprint arXiv:2604.04419},
  year    = {2026}
}