Xiaofeng Wang (Jeff)
I am currently a fifth-year Ph.D. student at the Institute of Automation, Chinese Academy of Sciences (CASIA). Prior to that, I received my Bachelor's degree from the Department of Automation, Nanjing University of Science and Technology (NJUST) in 2020. I have also spent time at the University of Dayton (UD), Megvii, PhiGent, GigaAI, and Alibaba TongYi.
My research interests revolve around AIGC (video generation), world models, and 3D perception, with the aim of developing AI systems that understand physics and motion. Please feel free to reach out if you have any questions or would like to discuss further.
Email / Google Scholar / Github
News
2024-07: DriveDreamer is selected as one of the Most Influential ECCV'24 Papers.
2024-07: One paper on driving video generation is accepted to ECCV 2024.
2024-05: One paper on occupancy prediction is accepted to IJCAI 2024.
2023-11: Our ICLR'24 technical report exploring GPT-4V on autonomous driving is available. Excited to see the community sharing thoughts on our latest findings!
2023-07: One paper on 3D occupancy prediction is accepted to ICCV 2023.
2023-02: One paper on 3D streaming perception is accepted to CVPR 2023.
2023-01: One paper on 3D pretraining is accepted to ICLR 2023.
2022-11: One paper on self-supervised depth estimation is accepted to AAAI 2023.
2022-07: One paper on multi-view depth estimation is accepted to ECCV 2022.
Research
* indicates equal contribution
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
arXiv, 2024
[arXiv] [Page] [Data] [Code]
EgoVid-5M is the first curated high-quality action-video dataset designed specifically for egocentric video generation.
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
Guosheng Zhao*, Chaojun Ni*, Xiaofeng Wang*, Zheng Zhu*, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, Wenjun Mei, Xingang Wang
arXiv, 2024
[arXiv] [Page] [Code]
DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios.
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Zheng Zhu*, Xiaofeng Wang*, Wangbo Zhao*, Chen Min*, Nianchen Deng*, Min Dou*, Yuqi Wang*, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, Guan Huang
arXiv, 2024
[arXiv] [Code]
A comprehensive survey on general world models, covering world models for video generation, autonomous driving, and autonomous agents.
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
Guosheng Zhao*, Xiaofeng Wang*, Zheng Zhu*, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang
arXiv, 2024
[arXiv] [Page] [Code]
DriveDreamer-2 is the first world model to generate customized driving videos in a user-friendly manner.
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
Xiaofeng Wang*, Zheng Zhu*, Guan Huang*, Boyuan Wang, Xinze Chen, Jiwen Lu
arXiv, 2024
[arXiv] [Page]
WorldDreamer is a pioneering world model that fosters a comprehensive understanding of general world physics and motion, significantly enhancing the capabilities of video generation.
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
Licheng Wen*, Xuemeng Yang*, Daocheng Fu*, Xiaofeng Wang*, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao
ICLR Workshop on LLM Agents, 2024
[arXiv] [Page]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios.
DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving
Xiaofeng Wang*, Zheng Zhu*, Guan Huang, Xinze Chen, Jiagang Zhu, Jiwen Lu
European Conference on Computer Vision (ECCV), 2024
[arXiv] [Page] [Code]
DriveDreamer is the first world model established from real-world driving scenarios. It empowers controllable driving video generation and enables the prediction of reasonable driving policies.
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception
Xiaofeng Wang*, Zheng Zhu*, Wenbo Xu*, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang
IEEE International Conference on Computer Vision (ICCV), 2023
[arXiv] [Code]
Towards comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, the first surrounding semantic occupancy perception benchmark.
StereoScene: BEV-Assisted Stereo Matching Empowers 3D Semantic Scene Completion
Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, Dalong Du
International Joint Conferences on Artificial Intelligence (IJCAI), 2024
[arXiv] [Code]
We propose StereoScene for 3D Semantic Scene Completion (SSC), which takes full advantage of lightweight camera inputs without resorting to any external 3D sensors.
Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark
Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu, Ziwei Chen, Xingang Wang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv] [Code]
We propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving.
LiftedCL: Lifting Contrastive Learning for Human-Centric Perception
Ziwei Chen, Qiang Li, Xiaofeng Wang, Wankou Yang
International Conference on Learning Representations (ICLR), 2023
[paper] [page] [Code]
We propose Lifting Contrastive Learning (LiftedCL) to obtain 3D-aware human-centric representations that absorb 3D human structure information.
MOVEDepth: Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xu Chi, Yun Ye, Ziwei Chen, Xingang Wang
AAAI Conference on Artificial Intelligence (AAAI), 2023
[arXiv] [Code]
MOVEDepth is a self-supervised depth estimation method that exploits monocular cues to enhance multi-frame depth learning.
MVSTER: Epipolar Transformer for Efficient Multi-View Stereo
Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi, Yijia He, Xingang Wang
European Conference on Computer Vision (ECCV), 2022
[arXiv] [Code]
We propose MVSTER, a novel end-to-end Transformer-based method for multi-view stereo. It leverages the proposed epipolar Transformer to efficiently learn 3D associations along epipolar lines.
Academic Services
Conference Reviewer / Program Committee Member: CVPR, ECCV, AAAI, ICLR, TRBAM
Journal Reviewer: T-MM, T-CSVT
Honors and Awards
2019 Ruihua Cup Annual Outstanding College Student / 瑞华杯大学生年度人物
2019 President's Medal of NJUST / 校长奖章
2019 National Scholarship / 国家奖学金
2018 CSC Scholarship / 国家公派留学奖学金
2017 National Scholarship / 国家奖学金
© Xiaofeng Wang | Last updated: Nov 04, 2024