




Empowering Multimodal LLMs with Foresight Minds

En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
HUST    Megvii Technology    Equal Contribution
The Magic of Merlin
  1. Merlin Introduction. Merlin is a multimodal LLM that generates natural language responses linked to object trajectories. Given initial observations, it predicts and reasons about future events.

  2. Future Reasoning Evaluation. Because no standardized benchmark for future reasoning exists, we built the Future Reasoning Benchmark, derived from the existing MMBench. We also evaluate Merlin on mainstream tracking benchmarks to measure how well it aligns objects and identities across multiple images. To our knowledge, Merlin is the first multimodal LLM to perform tracking tasks.

  3. Merlin-Chat Dataset Creation. To support Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), we created the Merlin-Chat dataset. It consists of future reasoning conversations generated with GPT-4V and covers three scenarios: sports, lifestyle, and transportation. It comprises 30,000 unique dialogue samples with predicted trajectories for future reasoning. We also introduce "FPT-data," a dataset designed for the FPT task and repurposed from existing open-source datasets.
Creating Merlin
*Overall pipeline of Merlin
*conversations generated with instructions provided by our users
Demo panels: General Detection, General Tracking, Image Referring, Video Referring, Image Relation, Video Relation, Future Reasoning