Unhackable Temporal Rewarding for Scalable Video MLLMs

En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, and Wenbing Tao

HUST    BUPT    JHU    StepAI

We investigate shortcut learning in video multimodal large language models (MLLMs) and systematically establish the theory of temporal hacking. Our work includes:

  • Systematic exploration of the anti-scaling phenomenon in video MLLMs, establishing temporal hacking theory from a novel reinforcement-learning perspective.
  • Design of the Temporal Perplexity (TPL) score, a reliable reference metric for mitigating temporal hacking.
  • Two principles to guide the design of proxy rewards for video-language modeling, and Unhackable Temporal Rewarding (UTR) built on them.
  • Video-UTR, a new family of state-of-the-art video MLLMs.

Demo

We propose the theory of temporal hacking, from a reinforcement learning perspective, to explain the anti-scaling phenomenon in video MLLMs. We introduce a novel metric, Temporal Perplexity (TPL), to quantify the severity of temporal hacking. Through extensive experiments, we use the TPL score to analyze its causes and characteristics, leading to two guiding principles for video-language modeling. Guided by these principles, we propose Unhackable Temporal Rewarding (UTR) and build Video-UTR, a new family of state-of-the-art video MLLMs.
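For intuition only, here is a minimal sketch of a TPL-style score, under the assumption that it compares the perplexity of a caption conditioned on frames in their original order against the same caption conditioned on shuffled frames; the paper's exact TPL definition may differ, and `score_caption` is a hypothetical stand-in for a video MLLM's per-token scoring function.

```python
# Sketch of a Temporal-Perplexity-style score (illustrative assumption, not
# the paper's exact formula): measure how much harder a caption becomes to
# predict when the frame order is destroyed.
import math
import random
from typing import Callable, Sequence


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(log_probs) / max(len(log_probs), 1))


def tpl_score(
    frames: list,
    caption: str,
    score_caption: Callable[[list, str], Sequence[float]],  # hypothetical scorer
    seed: int = 0,
) -> float:
    """Perplexity gap between shuffled-frame and ordered-frame conditioning.

    A large positive gap suggests the caption genuinely depends on temporal
    order; a near-zero gap suggests it could be satisfied from isolated
    frames, i.e., a candidate for temporal hacking.
    """
    ppl_ordered = perplexity(score_caption(frames, caption))
    shuffled = frames[:]
    random.Random(seed).shuffle(shuffled)
    ppl_shuffled = perplexity(score_caption(shuffled, caption))
    return ppl_shuffled - ppl_ordered
```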

[Figures: Overview 1 and Overview 2]

Finding 1: We discover a relationship between temporal perplexity and true model performance: higher average TPL scores indicate a reduced likelihood of reward hacking and thus superior video comprehension.


Finding 2: Models trained on data with higher TPL activate more frames during inference.


Finding 3: Output attention visualization. Attention that spans more frames avoids losing crucial details in the video, yielding more accurate and detailed answers.


Finding 4: Video-text input attention visualization. The attention map with the higher TPL score (right) shows stronger image-text alignment, with the input text attending well to its corresponding frames.
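Findings 2-4 rest on per-frame attention statistics. The sketch below shows one plausible way to compute them: given a text-to-visual attention map, aggregate the mass each frame receives and count the frames above a threshold. The token layout, shapes, and threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative per-frame attention analysis (assumed layout: visual tokens
# arranged frame by frame along the attention map's second axis).
import numpy as np


def frames_activated(attn: np.ndarray, tokens_per_frame: int, thresh: float = 0.02) -> int:
    """Count frames receiving non-trivial attention mass.

    attn: (num_text_tokens, num_visual_tokens) attention weights.
    """
    num_frames = attn.shape[1] // tokens_per_frame
    # Group visual tokens by frame, then average over text tokens.
    per_frame = attn[:, : num_frames * tokens_per_frame]
    per_frame = per_frame.reshape(attn.shape[0], num_frames, tokens_per_frame)
    frame_mass = per_frame.sum(axis=2).mean(axis=0)  # mass each frame receives
    frame_mass = frame_mass / frame_mass.sum()       # normalize to a distribution
    return int((frame_mass > thresh).sum())
```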


Finding 5: The TPL distribution reflects the overall quality of a dataset. Several sub-datasets in VideoChat2, created from a first-person perspective, have higher TPL scores, indicating relatively high data quality.


Finding 6: A higher TPL score indicates higher information density in the video or a more detailed description.
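Findings 5-6 suggest TPL as a data-quality signal. As an illustration, the hypothetical helper below reuses the `tpl_score` sketch above to rank a dataset's (frames, caption) samples and inspect both tails; `dataset` and `score_caption` are assumed placeholders.

```python
def profile_dataset(dataset, score_caption, top_k: int = 100):
    """Rank samples by TPL and return both tails for inspection."""
    scored = [
        (tpl_score(sample["frames"], sample["caption"], score_caption), sample)
        for sample in dataset
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    high_tpl = scored[:top_k]    # temporally grounded, information-dense samples
    low_tpl = scored[-top_k:]    # likely answerable from a single frame
    return high_tpl, low_tpl
```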

Results

The first four columns are general video benchmarks; the last four are video-QA benchmarks.

| Model | TempCompass (mc) | MVBench (m-avg) | MMBench-Video (m-avg) | VideoMME (wo sub.) | MSVD-QA (Acc.) | MSRVTT-QA (Acc.) | TGIF-QA (Acc.) | ANet-QA (Acc.) |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | |
| GPT-4V (OpenAI, 2023) | - | 43.5 | - | 59.9 | - | - | - | - |
| GPT-4o (OpenAI, 2024) | 70.9 | - | 1.81 | 71.9 | - | - | - | - |
| Gemini-1.5-Flash (Team et al., 2023) | - | - | 1.63 | 70.3 | - | - | - | - |
| Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0 | - | - | - | - |
| Claude-3.5-Sonnet (Anthropic, 2024) | - | - | 1.35 | 60.0 | - | - | - | - |
| **Open-Source** | | | | | | | | |
| VideoChat2 (Li et al., 2024a) | 38.5 | 51.1 | 1.23 | - | 70.0 | 54.1 | - | 49.1 |
| VideoLLaMA2 (Cheng et al., 2024a) | - | 54.6 | - | 46.6 | 70.9 | - | - | 50.2 |
| LLaVA-N-Video-7B (Zhang et al., 2024f) | - | 54.6 | - | 33.7 | 67.8 | - | - | 53.5 |
| LLaVA-OV-7B* (Li et al., 2024a) | 59.0 | 56.7 | - | 58.2 | 65.3 | 43.3 | 52.8 | 56.6 |
| Video-UTR-7B | 59.7 | 58.8 | 1.35 | 52.6 | 73.5 | 58.3 | 56.4 | 55.0 |

Website under construction, more coming soon...

Citation

If you find this useful, please consider citing our work:

@article{video-utr,
    title={Unhackable Temporal Rewarding for Scalable Video MLLMs},
    author={En Yu and Kangheng Lin and Liang Zhao and Yana Wei and Zining Zhu and Haoran Wei and Jianjian Sun and Zheng Ge and Xiangyu Zhang and Jingyu Wang and Wenbing Tao},
    journal={arXiv preprint arXiv:2502.12081},
    year={2025}
}