Xiaofeng Wang

Xiaofeng Wang (Jeff)

I am currently a fourth-year Ph.D. student at the Institute of Automation, Chinese Academy of Sciences (CASIA). Prior to that, I received my Bachelor's degree from the Department of Automation, Nanjing University of Science and Technology (NJUST) in 2020. Additionally, I have spent time at the University of Dayton (UD), Megvii, PhiGent, GigaAI, and Alibaba TongYi.

My research interests revolve around AIGC (video generation), world models, and 3D perception, with the aim of developing AI systems that understand physics and motion. Please feel free to reach out if you have any questions or would like to discuss further.

Email / Google Scholar / GitHub

News

2024-05: One paper on occupancy prediction is accepted to IJCAI 2024.

2023-11: Our ICLR'24 technical report exploring GPT-4V on autonomous driving is available. Excited to see the community sharing thoughts on our latest findings!

2023-07: One paper on 3D occupancy prediction is accepted to ICCV 2023.

2023-02: One paper on 3D streaming perception is accepted to CVPR 2023.

2023-01: One paper on 3D pretraining is accepted to ICLR 2023.

2022-11: One paper on self-supervised depth estimation is accepted to AAAI 2023.

2022-07: One paper on multi-view depth estimation is accepted to ECCV 2022.

Research

* indicates equal contribution

	Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang*, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, Guan Huang arXiv, 2024 [arXiv] [code] A comprehensive survey on general world models, including world models for video generation, autonomous driving and autonomous agents.
	DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation Guosheng Zhao, Xiaofeng Wang*, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang arXiv, 2024 [arXiv] [Page] [code] DriveDreamer-2 is the first world model to generate customized driving videos in a user-friendly manner.
	WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens Xiaofeng Wang, Zheng Zhu, Guan Huang*, Boyuan Wang, Xinze Chen, Jiwen Lu arXiv, 2024 [arXiv] [Page] WorldDreamer is a pioneering world model that fosters a comprehensive understanding of general world physics and motion, which significantly enhances video generation capabilities.
	On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang*, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao ICLR Workshop on LLM Agents, 2024 [arXiv] [Page] This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios.
	DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, Jiwen Lu arXiv, 2023 [arXiv] [Page] [code] DriveDreamer is the first world model established from real-world driving scenarios. It empowers controllable driving video generation and enables the prediction of reasonable driving policies.
	OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang IEEE International Conference on Computer Vision (ICCV), 2023 [arXiv] [Code] Towards comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark.
	StereoScene: BEV-Assisted Stereo Matching Empowers 3D Semantic Scene Completion Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, Dalong Du International Joint Conference on Artificial Intelligence (IJCAI), 2024 [arXiv] [Code] We propose StereoScene for 3D Semantic Scene Completion (SSC), which takes full advantage of light-weight camera inputs without resorting to any external 3D sensors.
	Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu, Ziwei Chen, Xingang Wang IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 [arXiv] [Code] We propose the Autonomous-driving StreAming Perception (ASAP) benchmark, which is the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving.
	LiftedCL: Lifting Contrastive Learning for Human-Centric Perception Ziwei Chen, Qiang Li, Xiaofeng Wang, Wankou Yang International Conference on Learning Representations (ICLR), 2023 [paper] [page] [code] We propose Lifting Contrastive Learning (LiftedCL) to obtain 3D-aware human-centric representations that absorb 3D human structure information.
	MOVEDepth: Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning Xiaofeng Wang, Zheng Zhu, Guan Huang, Xu Chi, Yun Ye, Ziwei Chen, Xingang Wang AAAI Conference on Artificial Intelligence (AAAI), 2023 [arXiv] [code] MOVEDepth is a self-supervised depth estimation method that explores monocular cues to enhance multi-frame depth learning.
	MVSTER: Epipolar Transformer for Efficient Multi-View Stereo Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi, Yijia He, Xingang Wang European Conference on Computer Vision (ECCV), 2022 [arXiv] [code] We propose a novel end-to-end Transformer-based method for multi-view stereo, named MVSTER. It leverages the proposed epipolar Transformer to efficiently learn 3D associations along the epipolar line.