Hi, I am En Yu (于恩 in Chinese), a PhD student at Huazhong University of Science and Technology (HUST) and a visiting PhD student at the University of California, Santa Barbara (UCSB), working with Prof. William Wang. I am currently interning at the Foundation Model Group of StepFun AI, where I work with Prof. Xiangyu Zhang and Dr. Zheng Ge.
My research interests include (1) Perception, Understanding, and Reasoning with Multimodal LLMs, and (2) Spatial Intelligence of Visual and Multimodal Foundation Models. I have published several papers at top-tier international AI conferences, including ICLR, CVPR, ECCV, AAAI, and ICML. My next goal is to build more powerful multimodal foundation models and to develop multimodal agents on top of them for complex real-world tasks, e.g., navigation and UI assistance.
🎺🎺 I am set to graduate with my PhD in June 2026 and am currently looking for postdoctoral positions. If you are interested, please feel free to reach out to me via email!
🔥 News
- 2025.04: 🎉🎉 We present Perception-R1. This work takes a pioneering step in exploring the potential of rule-based RL in MLLM post-training for perception policy learning.
- 2025.02: 🎉🎉 Glad to announce that two of our papers, Video-UTR and OVTR, have been accepted for poster presentations at ICLR 2025! Let's meet and have a chat in Singapore!
- 2024.11: 🎉🎉 We present OVTR, the first fully end-to-end open-vocabulary multiple-object tracking framework.
- 2024.11: 🎉🎉 We present Video-UTR, which investigates shortcut learning in video multimodal large language models and systematically establishes the theory of temporal hacking.
- 2024.07: 🎉🎉 Really excited to head to UCSB for a year-long PhD visit in Prof. William Wang's NLP lab. Looking forward to growing as a researcher. Catch you all in California!
- 2024.06: 🍾🍺 Excited to share that our work, Merlin, has been accepted as a poster presentation at ECCV 2024! See you in Milan!
- 2024.02: 🎉🎉 Glad to announce that our work, ChatSpot, has been accepted for a Long Oral presentation at IJCAI 2024! See you in Jeju!
- 2023.12: 🎉🎉 We present Merlin, the first end-to-end multimodal large language model that supports video-level visual localization (including tracking, video recognition, video registration, etc.) and future reasoning.
- 2023.07: 🎉🎉 We present ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity, including mouse clicks, drag-and-drop, and drawing boxes, providing a more flexible and seamless interactive experience.
- 2023.05: 🎉🎉 We present MOTRv3, a fully end-to-end multiple-object tracking model that achieves SOTA performance on DanceTrack, outperforming tracking-by-detection trackers for the first time.
📝 Publications

Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao
- Perception-R1 pioneers the exploration of RL's potential in MLLM post-training for perception policy learning, and our experiments yield valuable insights. It sets new SOTA results on visual perception tasks, especially object detection, and its novel paradigm enables it to match and even surpass expert models, demonstrating the great potential of perception policy learning.

Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
- This work investigates shortcut learning in video multimodal large language models and systematically establishes the theory of temporal hacking, including: (1) a systematic exploration of the video MLLM unscaling phenomenon, establishing temporal hacking theory from a novel RL perspective; (2) the design of the Temporal Perplexity (TPL) score, which provides a reliable reference metric for mitigating temporal hacking; and (3) two principles guiding the design of proxy rewards for video-language modeling, together with the proposed Unhackable Temporal Rewarding (UTR).

OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
Jinyang Li, En Yu, Sijia Chen, Wenbing Tao
- OVTR serves as the first fully end-to-end open-vocabulary multiple-object tracking framework.

Merlin: Empowering Multimodal LLMs with Foresight Minds
En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
- Merlin is a groundbreaking model capable of generating natural language responses that are intricately linked with object trajectories. Based on initial observations, it excels at predicting and reasoning about future events, showcasing an unprecedented capability for future reasoning.

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Liang Zhao*, En Yu*, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
- ChatSpot is a unified end-to-end multimodal large language model that supports diverse forms of interactivity, including mouse clicks, drag-and-drop, and drawing boxes, providing a more flexible and seamless interactive experience.

MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking
En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao
- MOTRv3 is a fully end-to-end multiple-object tracking (MOT) model that outperforms existing SOTA tracking-by-detection methods without the assistance of an extra detection network or any post-processing.

GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping
Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
- GroupLane is the first fully convolutional end-to-end 3D lane detection network. It achieves SOTA performance on mainstream lane detection benchmarks, i.e., OpenLane, Once-3DLanes, and OpenLane-Huawei, while also ensuring fast inference (7× faster than PersFormer).

Quality Matters: Embracing Quality Clues for Robust 3D Multi-Object Tracking
Jinrong Yang*, En Yu*, Zeming Li, Xiaoping Li, Wenbing Tao
- QTrack achieves 51.1%, 54.8%, and 56.6% AMOTA on the nuScenes test set with the BEVDepth, VideoBEV, and StreamPETR models, respectively, significantly reducing the performance gap between pure-camera and LiDAR-based trackers.

Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation
En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong Han, Wenbing Tao
- We introduce LTrack, the first multiple-object tracking model supporting vision-language modality inputs. Thanks to the domain invariance of natural language representation, LTrack achieves SOTA performance on our established cross-domain MOT benchmark.

Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking
En Yu, Zhuoling Li, Shoudong Han
- We propose MTrack, which adopts multi-view trajectory contrastive learning, representing each trajectory as a center vector. By maintaining all the vectors in a dynamically updated memory bank, a trajectory-level contrastive loss is devised to explore inter-frame information across whole trajectories. MTrack surpasses preceding trackers and establishes new SOTA performance.

RelationTrack: Relation-Aware Multiple Object Tracking with Decoupled Representation
En Yu, Zhuoling Li, Shoudong Han, Hongwei Wang
- RelationTrack decouples the learned representation into detection-specific and ReID-specific embeddings, resolving the optimization contradiction between the detection and re-identification subtasks in MOT.
- Perception in Reflection, Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel, ICML 2025 Poster
- RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios, Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang, ICLR 2025 Workshop
- Cross-View Referring Multi-Object Tracking, Sijia Chen, En Yu, Wenbing Tao, AAAI 2024 Poster
- Delving into the Trajectory Long-tail Distribution for Multi-object Tracking, Sijia Chen, En Yu, Jinyang Li, Wenbing Tao, CVPR 2024 Poster
- MAT: Motion-Aware Multi-Object Tracking, Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan, Neurocomputing
- Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking, Pan Wang, Liangliang Ren, Shengkai Wu, Jinrong Yang, En Yu, Hangcheng Yu, Xiaoping Li, IEEE Robotics and Automation Letters
- Efficient Few-Shot Classification via Contrastive Pre-training on Web Data, Zhuoling Li, Haohan Wang, Tymoteusz Swistek, En Yu, Haoqian Wang, IEEE Transactions on Artificial Intelligence
🎖 Honors and Awards
- 2022.05 Second Prize in the First Global Artificial Intelligence Technology Innovation Competition.
- 2019.08 First Prize in the 14th National College Students' Intelligent Car Competition.
- 2018.08 National Champion in the 13th National College Students' Intelligent Car Competition.
📖 Education
- 2024.07 - 2025.05 (now), Visiting PhD Student, University of California, Santa Barbara (UCSB), USA.
- 2022.06 - 2025.05 (now), PhD Student, Huazhong University of Science and Technology, Wuhan, China.
- 2020.09 - 2022.06, Huazhong University of Science and Technology, Wuhan, China.
- 2016.09 - 2020.06, Huazhong University of Science and Technology, Wuhan, China.
💻 Internships
- 2024.03 - 2025.05 (now), StepFun AI, Multimodal Intelligence Group.
- 2022.03 - 2024.03, MEGVII Research, Foundation Model Group.