Two complementary protocols for accuracy and rhythm
BoxComm evaluates both instant commentary correctness and long-horizon broadcast pacing. The benchmark is intended to reveal whether a model can describe the right thing at the right time and in the right discourse mode, rather than collapsing commentary quality into a single unconstrained subjective score.
From sentence-level correctness to full-stream pacing
The two benchmark protocols intentionally target different failure modes. One measures local semantic precision under category control, while the other measures global temporal behavior over continuous clips. Together they turn commentary assessment into a structured evaluation problem instead of an unconstrained preference judgment.
Why this is more than subjective commentary scoring
Category-Conditioned Generation
Given video context and a target class, the model must produce a natural sentence that matches the requested commentary type and the current fight state.
Rhythm Assessment
The model generates commentary freely over continuous streaming clips, and the evaluation checks whether its timing and category distribution resemble those of professional broadcasts.
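One way to picture a rhythm check is sketched below. It is an illustration only, not the benchmark's actual metric. It assumes each utterance carries a category label and a (start, end) time span, and it compares a model's output with a reference broadcast on two things: the share of each commentary category and the fraction of the clip spent speaking. The category names, the function names, and the choice of Jensen-Shannon divergence are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical category names; the benchmark's real label set may differ.
CATEGORIES = ["play_by_play", "tactical", "context"]


def category_distribution(labels, categories=CATEGORIES):
    """Return the share of each category among the given utterance labels."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        # No utterances: fall back to a uniform distribution.
        return [1.0 / len(categories)] * len(categories)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 means identical, 1 means disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def speech_ratio(segments, clip_duration_s):
    """Fraction of the clip covered by speech, assuming non-overlapping segments."""
    spoken = sum(end - start for start, end in segments)
    return spoken / clip_duration_s


# Toy usage: compare a model's stream against a reference broadcast.
model_labels = ["play_by_play", "play_by_play", "tactical"]
reference_labels = ["play_by_play", "tactical", "context"]
gap = js_divergence(
    category_distribution(model_labels),
    category_distribution(reference_labels),
)
```

A lower divergence and a speech ratio close to the reference would suggest pacing nearer to a professional broadcast, under these assumptions.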
Event-Informed Baseline
EIC-Gen injects punch-event cues into commentary generation, showing that perception of fleeting actions directly affects combat-sport narration quality.
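The idea of injecting event cues can be illustrated with a small sketch. This is not the EIC-Gen implementation, whose injection mechanism is not described here. It simply assumes that detected punch events are available as timestamped records and that cues from a recent time window are prepended to a text prompt. The `PunchEvent` fields, the window length, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PunchEvent:
    """Hypothetical record for one detected punch."""
    time_s: float      # timestamp within the clip, in seconds
    fighter: str       # e.g. "red" or "blue"
    punch_type: str    # e.g. "jab", "hook"


def recent_events(events, now_s, window_s=3.0):
    """Keep only events that fall inside the last `window_s` seconds."""
    return [e for e in events if now_s - window_s <= e.time_s <= now_s]


def build_prompt(events, now_s, window_s=3.0):
    """Prepend recent punch cues to a commentary instruction (assumed format)."""
    cues = recent_events(events, now_s, window_s)
    if cues:
        cue_text = "; ".join(
            f"{e.fighter} {e.punch_type} at {e.time_s:.1f}s" for e in cues
        )
    else:
        cue_text = "none detected"
    return (
        f"Recent punch events: {cue_text}\n"
        "Write one sentence of commentary for the current moment."
    )


# Toy usage: only the punch inside the 3-second window is passed on.
events = [PunchEvent(1.0, "red", "jab"), PunchEvent(9.0, "blue", "hook")]
prompt = build_prompt(events, now_s=10.0)
```

Restricting cues to a short window reflects the point that punches are brief, so stale events would likely mislead the generated sentence.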
BibTeX
arXiv preprint citation for the current BoxComm paper.