Paper Summary
Paperzilla title
Nope! Your 'Realistic' AI Videos are Still Just Guessing How Physics Works!
This paper introduces Physics-IQ, a comprehensive real-world benchmark that evaluates whether generative video models truly understand physical principles such as gravity or fluid dynamics. Across a range of current models (e.g., Sora, VideoPoet), physical understanding proved severely limited, even for models that generate highly realistic-looking videos. The research concludes that visual realism does not imply physical understanding, highlighting a significant gap in current AI capabilities.
Possible Conflicts of Interest
Authors Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos list Google DeepMind as an affiliation, and the work was done while they were at Google DeepMind. The paper evaluates models, including VideoPoet and Lumiere, that are Google/DeepMind models. This constitutes a conflict of interest: the authors are evaluating products associated with their employer.
Identified Weaknesses
Limited Physical Understanding in Models
The core finding is that even the best model scored only 29.5% on the Physics-IQ benchmark, far below the physical variance baseline of 100% (the score that two real recordings of the same scene achieve against each other). This indicates a severe lack of true physical understanding despite advances in visual realism: the models are not learning the underlying physical laws.
Uncorrelated Visual Realism and Physical Understanding
The study explicitly shows no significant correlation between how realistic a video looks (as judged by an MLLM) and its physical correctness. Models like Sora, which produced the most visually realistic videos, still scored poorly on the physical understanding metrics, showing that visual fidelity does not equate to comprehension.
Model Hallucinations and Implausible Actions
Many generative models were observed to hallucinate objects or produce physically impossible temporal sequences (e.g., a candle spontaneously appearing and lighting after a match hits water), indicating a fundamental flaw in their grasp of reality.
Dataset Biases Reflected in Generations
Models sometimes produced biased generations, such as Lumiere turning a red pool table green, reflecting common patterns in their training data rather than an understanding of the specific scenario. Sora also frequently inserted transition cuts despite instructions for a static camera perspective, which distorts the evaluation metrics.
Metrics as Proxies, not Direct Measures
The paper acknowledges that its proposed suite of metrics (Spatial IoU, MSE, etc.) serves as a set of proxies for physical understanding rather than directly quantifying physical phenomena (see the sketch after this list for a flavor of what such a proxy measures). The MLLM metric, for instance, is limited by the underlying MLLM's capabilities, and its explanations for its decisions were often incorrect, suggesting a potential gap between metric scores and actual understanding.
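To make the "proxy" point concrete, here is a minimal sketch of what a Spatial-IoU-style metric measures: where motion occurs in a generated video versus a real continuation, with each video reduced to a binary motion mask. The function names and the motion threshold below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def motion_mask(frames: np.ndarray, threshold: float = 10.0) -> np.ndarray:
    """Binary mask of pixels that move at any point in a video.

    frames: array of shape (T, H, W), grayscale intensities in [0, 255].
    A pixel counts as 'in motion' if its intensity changes by more than
    `threshold` between some pair of consecutive frames.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    return (diffs > threshold).any(axis=0)  # (H, W) boolean mask

def spatial_iou(generated: np.ndarray, real: np.ndarray) -> float:
    """Intersection-over-union of motion masks: WHERE did things move?

    Returns 1.0 when the generated video moves in exactly the same
    spatial regions as the real continuation, 0.0 when the regions
    are disjoint.
    """
    gen_mask = motion_mask(generated)
    real_mask = motion_mask(real)
    intersection = np.logical_and(gen_mask, real_mask).sum()
    union = np.logical_or(gen_mask, real_mask).sum()
    # If neither video contains motion, the masks trivially agree.
    return float(intersection / union) if union > 0 else 1.0
```

A metric like this rewards motion in the right places, but a video could score well while still violating, say, conservation of momentum. That is exactly the gap between proxy scores and genuine physical understanding that the authors flag.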
Rating Explanation
This paper presents strong research with a well-designed, novel real-world benchmark (Physics-IQ) for evaluating physical understanding in generative video models. Its systematic evaluation of multiple state-of-the-art models and clear findings that visual realism doesn't imply physical understanding are significant contributions to the field. While there is a conflict of interest due to authors evaluating models from their employer (Google DeepMind), the findings are critical of the models' performance, which lessens the impact of the COI on the scientific integrity of the results. The methodology is robust, using diverse scenarios and multiple metrics to provide a comprehensive assessment.
Good to know
This is our free standard analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.
File Information
Original Title:
Do generative video models understand physical principles?
Uploaded:
October 12, 2025 at 08:12 PM
© 2025 Paperzilla. All rights reserved.