Video understanding
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published • 32
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
• 2407.19985
• Published • 37
TPDiff: Temporal Pyramid Video Diffusion Model
Paper
• 2503.09566
• Published • 45
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Paper
• 2506.07464
• Published • 14
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published • 100
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Paper
• 2510.05034
• Published • 51
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Paper
• 2510.20579
• Published • 56
Video Reasoning without Training
Paper
• 2510.17045
• Published • 8
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
Paper
• 2510.23473
• Published • 86
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Paper
• 2510.26802
• Published • 34
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published • 78
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published • 56
In-Video Instructions: Visual Signals as Generative Control
Paper
• 2511.19401
• Published • 32
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Paper
• 2512.01342
• Published • 18
ViDiC: Video Difference Captioning
Paper
• 2512.03405
• Published • 28
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Paper
• 2512.04678
• Published • 42
Evaluating Gemini Robotics Policies in a Veo World Simulator
Paper
• 2512.10675
• Published • 20
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Paper
• 2512.13874
• Published • 17
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Paper
• 2512.15702
• Published • 16
Kling-Omni Technical Report
Paper
• 2512.16776
• Published • 173
SemanticGen: Video Generation in Semantic Space
Paper
• 2512.20619
• Published • 94
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Paper
• 2512.20618
• Published • 56
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Paper
• 2512.21004
• Published • 13
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Paper
• 2601.10553
• Published • 12
Rethinking Video Generation Model for the Embodied World
Paper
• 2601.15282
• Published • 44
Self-Refining Video Sampling
Paper
• 2601.18577
• Published • 25
RISE-Video: Can Video Generators Decode Implicit World Rules?
Paper
• 2602.05986
• Published • 26
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Paper
• 2602.16705
• Published • 26
RynnBrain: Open Embodied Foundation Models
Paper
• 2602.14979
• Published • 43
A Very Big Video Reasoning Suite
Paper
• 2602.20159
• Published • 516
Demystifying Video Reasoning
Paper
• 2603.16870
• Published • 346