Abstract
D4RT, a unified transformer-based model, efficiently reconstructs 4D scenes from videos by querying 3D positions in space-time, outperforming previous methods.
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
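The abstract's decoding interface, where any point in space-time can be probed independently, can be pictured as cross-attention from lightweight per-point queries into a single shared video encoding, so the cost of one prediction is decoupled from dense per-frame decoding. A minimal numpy sketch under that reading; all names, shapes, and weights here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_points(scene_tokens, queries, Wq, Wk, Wv, Wo):
    """Cross-attend each (u, v, t) query to a frozen video encoding and
    regress a 3D position. Single head, no biases: a toy stand-in for
    the paper's query mechanism, not its real architecture."""
    q = queries @ Wq                                 # (M, d) query embeddings
    k = scene_tokens @ Wk                            # (N, d) keys
    v = scene_tokens @ Wv                            # (N, d) values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (M, N) attention weights
    return (attn @ v) @ Wo                           # (M, 3) predicted XYZ

rng = np.random.default_rng(0)
d = 16
scene_tokens = rng.normal(size=(64, d))  # encoder output for the whole clip
queries = rng.normal(size=(5, 3))        # five (u, v, t) probe coordinates
Wq = rng.normal(size=(3, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
Wo = rng.normal(size=(d, 3))

xyz = query_points(scene_tokens, queries, Wq, Wk, Wv, Wo)
print(xyz.shape)  # (5, 3)
```

Because each query attends to the same encoded clip, adding or removing probe points changes only the cheap decoding pass, which is consistent with the efficiency claims in the abstract.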
Community
- A simple, unified interface for 3D tracking, depth, and pose
- SOTA results on 4D reconstruction & tracking
- Up to 100× faster pose estimation than prior works
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- C4D: 4D Made from 3D through Dual Correspondences (2025)
- Depth Anything 3: Recovering the Visual Space from Any Views (2025)
- WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting (2025)
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend (2025)
- 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer (2025)
- 4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation (2025)
- WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling (2025)