Hi, I am En Yu (于恩 in Chinese), a PhD student at Huazhong University of Science and Technology (HUST). I am currently interning at the Foundation Model Group of Megvii Research (Face++), where I work with Prof. Xiangyu Zhang and Dr. Zheng Ge.
My research interests focus on Computer Vision (CV) and Multi/Cross-Modal Modeling, specifically including 2D/3D Object Detection/Tracking, Video Understanding and Generation, and Multi-modal Large Language Models (MLLM). I have published several papers at top international AI conferences such as CVPR and AAAI. My next goal is to develop Multi-modal Foundation Models for long-range video understanding and generation, and then to build embodied robots on top of these foundation models that can effectively learn from world knowledge and interact with humans.
🎺🎺 It is a great honor to be heading to the UCSB NLP Group, led by Prof. William, for a one-year PhD visit! I look forward to broadening my academic horizons and strengthening my research capabilities over the next year. See you in California~
🔥 News
- 2023.12: 🎉🎉 We present Merlin, the first end-to-end multimodal large language model that supports video-level visual localization (including tracking, video recognition, video registration, etc.) and future reasoning.
- 2023.07: 🎉🎉 We present ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience.
- 2023.05: 🎉🎉 We present MOTRv3, a fully end-to-end multiple object tracking model that achieves SOTA performance on DanceTrack, which outperforms the tracking-by-detection trackers for the first time.
📝 Publications
Merlin: Empowering Multimodal LLMs with Foresight Minds
En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
- Merlin is a groundbreaking model capable of generating natural language responses that are intricately linked with object trajectories. Merlin excels in predicting and reasoning about future events based on initial observations, showcasing an unprecedented capability in future prediction and reasoning.
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Liang Zhao*, En Yu*, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
- ChatSpot is a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience.
MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking
En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao
- MOTRv3 is a fully end-to-end multiple object tracking (MOT) model that outperforms existing SOTA tracking-by-detection methods without any assistance of an extra detection network or post-processing.
GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping
Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
- GroupLane is the first fully convolutional end-to-end 3D lane detection network. It achieves SOTA performance on existing mainstream lane detection benchmarks, i.e., OpenLane, Once-3DLanes, and OpenLane-Huawei, while also ensuring fast inference speed (7× faster than PersFormer).
Quality Matters: Embracing Quality Clues for Robust 3D Multi-Object Tracking
Jinrong Yang*, En Yu*, Zeming Li, Xiaoping Li, Wenbing Tao
- QTrack achieves 51.1%, 54.8% and 56.6% AMOTA tracking performance on the nuScenes test sets with BEVDepth, VideoBEV, and StreamPETR models, respectively, which significantly reduces the performance gap between pure camera and LiDAR-based trackers.
En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong Han, Wenbing Tao
- We introduce LTrack, the first multiple-object tracking model supporting vision-language modality inputs. Thanks to the domain invariance of natural language representations, LTrack achieves SOTA performance on our established cross-domain MOT benchmark.
En Yu, Zhuoling Li, Shoudong Han
- We propose MTrack, which adopts multi-view trajectory contrastive learning, in which each trajectory is represented as a center vector. By maintaining all the vectors in a dynamically updated memory bank, a trajectory-level contrastive loss is devised to explore the inter-frame information across whole trajectories. MTrack surpasses preceding trackers and establishes new SOTA performance.
Relationtrack: Relation-aware multiple object tracking with decoupled representation
En Yu, Zhuoling Li, Shoudong Han, Hongwei Wang
- MAT: Motion-aware Multi-Object Tracking, Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan, Neurocomputing
- Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking, Pan Wang, Liangliang Ren, Shengkai Wu, Jinrong Yang, En Yu, Hangcheng Yu, Xiaoping Li, IEEE Robotics and Automation Letters
- Efficient few-shot classification via contrastive pre-training on web data, Zhuoling Li, Haohan Wang, Tymosteusz Swistek, En Yu, Haoqian Wang, IEEE Transactions on Artificial Intelligence
🎖 Honors and Awards
- 2022.05 Second Prize in the First Global Artificial Intelligence Technology Innovation Competition.
- 2019.08 First Prize in the 13th National College Students’ Intelligent Car Competition.
- 2018.08 National Champion in the 14th National College Students’ Intelligent Car Competition.
📖 Education
- 2022.06 - 2023.12 (now), PhD, Huazhong University of Science and Technology, China.
- 2020.09 - 2022.06, Master, Huazhong University of Science and Technology, Wuhan, China.
- 2016.09 - 2020.06, Undergraduate, Huazhong University of Science and Technology, Wuhan, China.
💻 Internships
- 2022.03 - 2023.12 (now), MEGVII Research, Foundation Model Group.