Hi, I am En Yu (于恩 in Chinese), a PhD student at Huazhong University of Science and Technology (HUST). I am currently interning at the Foundation Model Group of Megvii Research (Face++), where I work with Prof. Xiangyu Zhang and Dr. Zheng Ge.

My research interests focus on Computer Vision (CV) and Multi/Cross-Modal Modeling, specifically including 2D/3D Object Detection/Tracking, Video Understanding and Generation, and Multi-modal Large Language Models (MLLMs). I have published several papers at top international AI conferences such as CVPR and AAAI. My next goal is to develop Multi-modal Foundation Models for long-range video understanding and generation, and then to build embodied robots on top of these foundation models that can effectively learn from world knowledge and interact with humans.

🎺🎺 It is a great honor to be heading to the UCSB NLP Group, led by Prof. William, for a one-year PhD visit! I look forward to broadening my academic horizons and strengthening my research capabilities over the next year. See you in California~

🔥 News

  • 2023.12:  🎉🎉 We present Merlin, the first end-to-end multimodal large language model that supports video-level visual localization (including tracking, video recognition, video registration, etc.) and future reasoning.
  • 2023.07:  🎉🎉 We present ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience.

  • 2023.05:  🎉🎉 We present MOTRv3, a fully end-to-end multiple object tracking model that achieves SOTA performance on DanceTrack, which outperforms the tracking-by-detection trackers for the first time.

📝 Publications

CVPR 2024 (under submission)

Merlin: Empowering Multimodal LLMs with Foresight Minds

En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao

Project

  • Merlin is a groundbreaking model capable of generating natural language responses that are intricately linked with object trajectories. Merlin excels in predicting and reasoning about future events based on initial observations, showcasing an unprecedented capability in future prediction and reasoning.
Tech Report

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

Liang Zhao*, En Yu*, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang

Project

  • ChatSpot is a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience.
CVPR 2024 (under submission)

MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao

Project

  • MOTRv3 is a fully end-to-end multiple object tracking (MOT) model that outperforms existing SOTA tracking-by-detection methods without the assistance of an extra detection network or any post-processing.
ICLR 2024 (under submission)

GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang

Project

  • GroupLane is the first fully convolutional end-to-end 3D lane detection network. GroupLane achieves SOTA performance on existing mainstream lane detection benchmarks, i.e., OpenLane, Once-3DLanes, and OpenLane-Huawei, while also ensuring fast inference speed (7× faster than PersFormer).
RA-L (under submission)

Quality Matters: Embracing Quality Clues for Robust 3D Multi-Object Tracking

Jinrong Yang*, En Yu*, Zeming Li, Xiaoping Li, Wenbing Tao

Project

  • QTrack achieves 51.1%, 54.8% and 56.6% AMOTA tracking performance on the nuScenes test sets with BEVDepth, VideoBEV, and StreamPETR models, respectively, which significantly reduces the performance gap between pure camera and LiDAR-based trackers.
AAAI 2023

Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation

En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong Han, Wenbing Tao

Project

  • We introduce LTrack, the first multiple-object tracking model supporting vision-language modality inputs. Thanks to the domain invariance of natural language representations, LTrack achieves SOTA performance on our established cross-domain MOT benchmark.
CVPR 2022

Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking

En Yu, Zhuoling Li, Shoudong Han

Project

  • We propose MTrack, which adopts multi-view trajectory contrastive learning, in which each trajectory is represented as a center vector. By maintaining all the vectors in a dynamically updated memory bank, a trajectory-level contrastive loss is devised to exploit the inter-frame information across whole trajectories. MTrack surpassed preceding trackers and established new SOTA performance.

🎖 Honors and Awards

  • 2022.05 Second Prize in the First Global Artificial Intelligence Technology Innovation Competition.
  • 2019.08 First Prize in the 13th National College Students’ Intelligent Car Competition.
  • 2018.08 National Champion in the 14th National College Students’ Intelligent Car Competition.

📖 Education

  • 2022.06 - 2023.12 (now), PhD, Huazhong University of Science and Technology, China.
  • 2020.09 - 2022.06, Master, Huazhong University of Science and Technology, Wuhan, China.
  • 2016.09 - 2020.06, Undergraduate, Huazhong University of Science and Technology, Wuhan, China.

💻 Internships