We investigate shortcut learning in video multimodal large language models (MLLMs) and systematically establish the theory of temporal hacking. Our work includes:

- We propose the theory of temporal hacking, from a reinforcement learning perspective, to explain the anti-scaling law phenomenon in video MLLMs.
- We introduce a novel metric, Temporal Perplexity (TPL), to quantify the severity of temporal hacking.
- Through extensive experiments, we use the TPL score to analyze the causes and features of temporal hacking, leading to two guiding principles for video-language modeling.
- Guided by these two principles, we further propose Unhackable Temporal Rewarding (UTR) and build Video-UTR, a new family of state-of-the-art video MLLMs.
Results on general video benchmarks (TempCompass, MVBench, MMBench-Video, VideoMME) and video-QA benchmarks (MSVD-QA, MSRVTT-QA, TGIF-QA, ANet-QA):

| Model | TempCompass (mc) | MVBench (m-avg) | MMBench-Video (m-avg) | VideoMME (wo sub.) | MSVD-QA (Acc.) | MSRVTT-QA (Acc.) | TGIF-QA (Acc.) | ANet-QA (Acc.) |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | |
| GPT-4V (OpenAI, 2023) | - | 43.5 | - | 59.9 | - | - | - | - |
| GPT-4o (OpenAI, 2024) | 70.9 | - | 1.81 | 71.9 | - | - | - | - |
| Gemini-1.5-Flash (Team et al., 2023) | - | - | 1.63 | 70.3 | - | - | - | - |
| Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0 | - | - | - | - |
| Claude-3.5-Sonnet (Anthropic, 2024) | - | - | 1.35 | 60.0 | - | - | - | - |
| **Open-Source** | | | | | | | | |
| VideoChat2 (Li et al., 2024a) | 38.5 | 51.1 | 1.23 | - | 70.0 | 54.1 | - | 49.1 |
| VideoLLaMA2 (Cheng et al., 2024a) | - | 54.6 | - | 46.6 | 70.9 | - | - | 50.2 |
| LLaVA-N-Video-7B (Zhang et al., 2024f) | - | 54.6 | - | 33.7 | 67.8 | - | - | 53.5 |
| LLaVA-OV-7B* (Li et al., 2024a) | 59.0 | 56.7 | - | 58.2 | 65.3 | 43.3 | 52.8 | 56.6 |
| Video-UTR-7B | 59.7 | 58.8 | 1.35 | 52.6 | 73.5 | 58.3 | 56.4 | 55.0 |
Website under construction; more coming soon...
If you find this useful, please consider citing our work:
@article{video-utr,
title={Unhackable Temporal Rewarding for Scalable Video MLLMs},
author={En Yu and Kangheng Lin and Liang Zhao and Yana Wei and Zining Zhu and Haoran Wei and Jianjian Sun and Zheng Ge and Xiangyu Zhang and Jingyu Wang and Wenbing Tao},
journal={arXiv preprint arXiv:2502.12081},
year={2025}
}