We investigate shortcut learning in video multimodal large language models (MLLMs) and systematically establish the theory of temporal hacking. Our work includes:

- We propose the theory of temporal hacking, from a reinforcement learning perspective, to explain the anti-scaling law phenomenon in video MLLMs.
- We introduce a novel metric, Temporal Perplexity (TPL), to quantify the severity of temporal hacking.
- Through extensive experiments, we use the TPL score to analyze the causes and features of temporal hacking, leading to two guiding principles for video-language modeling.
- Guided by these two principles, we further propose Unhackable Temporal Rewarding (UTR) and build Video-UTR, a new family of state-of-the-art video MLLMs.
Results on general video benchmarks (TempCompass, MVBench, MMBench-Video, VideoMME) and video-QA benchmarks (MSVD-QA, MSRVTT-QA, TGIF-QA, ANet-QA):

| Model | TempCompass (mc) | MVBench (m-avg) | MMBench-Video (m-avg) | VideoMME (wo sub.) | MSVD-QA (Acc.) | MSRVTT-QA (Acc.) | TGIF-QA (Acc.) | ANet-QA (Acc.) |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | |
| GPT-4V (OpenAI, 2023) | - | 43.5 | - | 59.9 | - | - | - | - |
| GPT-4o (OpenAI, 2024) | 70.9 | - | 1.81 | 71.9 | - | - | - | - |
| Gemini-1.5-Flash (Team et al., 2023) | - | - | 1.63 | 70.3 | - | - | - | - |
| Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0 | - | - | - | - |
| Claude-3.5-Sonnet (Anthropic, 2024) | - | - | 1.35 | 60.0 | - | - | - | - |
| **Open-Source** | | | | | | | | |
| VideoChat2 (Li et al., 2024a) | 38.5 | 51.1 | 1.23 | - | 70.0 | 54.1 | - | 49.1 |
| VideoLLaMA2 (Cheng et al., 2024a) | - | 54.6 | - | 46.6 | 70.9 | - | - | 50.2 |
| LLaVA-N-Video-7B (Zhang et al., 2024f) | - | 54.6 | - | 33.7 | 67.8 | - | - | 53.5 |
| LLaVA-OV-7B* (Li et al., 2024a) | 59.0 | 56.7 | - | 58.2 | 65.3 | 43.3 | 52.8 | 56.6 |
| Video-UTR-7B | 59.7 | 58.8 | 1.35 | 52.6 | 73.5 | 58.3 | 56.4 | 55.0 |
Website under construction; more coming soon...
If you find this useful, please consider citing our work:
@article{video-utr,
title={Unhackable Temporal Rewarding for Scalable Video MLLMs},
author={En Yu and Kangheng Lin and Liang Zhao and Yana Wei and Zining Zhu and Haoran Wei and Jianjian Sun and Zheng Ge and Xiangyu Zhang and Jingyu Wang and Wenbing Tao},
journal={arXiv preprint arXiv:2502.12081},
year={2025}
}