
VideoPhy 2

Challenging Action-Centric Physical Commonsense Evaluation of Video Generation

(* Equal Contribution)
1University of California, Los Angeles
2Google Research

Abstract

Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis with modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and the grounding of physical rules in the generated videos. Our findings reveal major shortcomings: even the best model achieves only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy2. We find that the models particularly struggle with conservation laws, such as those for mass and momentum. Finally, we also train VideoPhy2-autoeval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically grounded video generation.


VideoPhy2 pipeline. Top: We generate a text prompt from the seed action using an LLM, create a video with a text-to-video model, and caption it with a VLM to extract candidate physical rules. Bottom: Human annotators rate the video's physical likelihood, verify rule violations, suggest missing rules, and assess semantic adherence to the input prompt.
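The prompt-generation and captioning stages above can be viewed as a simple pipeline. Below is a minimal, hypothetical sketch; the helper names (generate via llm, t2v, vlm) and the specific model backends are placeholders for illustration, not the exact components used in VideoPhy2.

```python
# Minimal sketch of an action -> prompt -> video -> rules pipeline (hypothetical helpers).
from dataclasses import dataclass

@dataclass
class Example:
    action: str          # seed action, e.g. "backflip"
    prompt: str          # detailed text prompt produced by the LLM
    video_path: str      # video rendered by the text-to-video model
    rules: list[str]     # candidate physical rules extracted by the VLM

def build_example(action: str, llm, t2v, vlm) -> Example:
    # 1) Upsample the seed action into a detailed, action-centric prompt.
    prompt = llm.generate(f"Write a detailed video-generation prompt for the action: {action}")
    # 2) Render a video from the prompt with a text-to-video model.
    video_path = t2v.generate(prompt)
    # 3) Caption the video and extract candidate physical rules with a VLM.
    caption = vlm.caption(video_path)
    rules = vlm.generate(f"List the physical rules relevant to this scene: {caption}").splitlines()
    return Example(action, prompt, video_path, rules)
```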

Human Leaderboard on Video Generation Models

Human evaluation results on VideoPhy2. We abbreviate semantic adherence as SA and physical commonsense as PC. Each entry is the percentage of instances for which SA=1 and PC=1, reported over all prompts (All), the hard subset (Hard), physical activities (PA), and object interactions (OI); see the sketch after the tables for how this joint score is computed.

Open Models

# Model Source All Hard PA OI
1 Wan2.1-14B Open 32.6 21.9 31.5 36.2
2 CogVideoX-5B Open 25.0 0.0 24.6 26.1
3 Cosmos-Diff-7B Open 24.1 10.9 22.6 27.4
4 Hunyuan-13B Open 17.2 6.2 17.6 15.9
5 VideoCrafter-2 Open 10.5 2.9 10.1 13.1

Closed Models

# Model Source All Hard PA OI
1 Ray2 Closed 20.3 8.3 21.0 18.5
2 Sora Closed 23.3 5.3 22.2 26.7
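As a concrete illustration of the joint score reported above, the snippet below computes the percentage of instances with SA=1 and PC=1 from binary human labels. The function name and the toy inputs are illustrative, not part of the released evaluation code.

```python
# Joint score: percentage of videos with SA = 1 and PC = 1 (binary labels assumed).
def joint_score(sa: list[int], pc: list[int]) -> float:
    assert len(sa) == len(pc) and len(sa) > 0
    hits = sum(1 for s, p in zip(sa, pc) if s == 1 and p == 1)
    return 100.0 * hits / len(sa)

# Example: 5 videos, 2 of which satisfy both criteria -> 40.0
print(joint_score([1, 1, 0, 1, 0], [1, 0, 1, 1, 0]))
```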

🚨 To submit your results to the leaderboard, please send a CSV with the video URLs and their captions to this email for human/automatic evaluation.

VideoPhy 2: Benchmark

Physical Law Violation Analysis

Auto Evaluation of Physical Commonsense, Semantic Adherence, and Rule Applicability

We use VideoCon-Physics as the base model for robust automatic evaluation. Specifically, we prompt our finetuned model to generate a score (1-5) for the semantic adherence, physical commonsense, and rule applicability of the generated videos.
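A minimal sketch of how such prompting could look is shown below. The prompt template, the score_video helper, and the generic model.generate call are illustrative assumptions, not the exact interface of VideoPhy2-autoeval.

```python
import re

# Hypothetical evaluation prompt; the real template used for VideoPhy2-autoeval may differ.
TEMPLATE = (
    "You are given a generated video and its text prompt.\n"
    "Prompt: {prompt}\n"
    "Rate the {aspect} of the video on a scale of 1 to 5. Answer with a single number."
)

def score_video(model, video, prompt: str, aspect: str) -> int:
    """Query a video-language model for a 1-5 rating of one aspect
    (semantic adherence, physical commonsense, or rule applicability)."""
    response = model.generate(video=video, text=TEMPLATE.format(prompt=prompt, aspect=aspect))
    match = re.search(r"[1-5]", response)
    return int(match.group()) if match else 1  # fall back to the lowest score if parsing fails
```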

Effectiveness of Our Auto-Evaluator


Auto-rater evaluation results (Pearson's correlation × 100) between the predicted scores and the ground-truth scores (1-5) on unseen prompts and unseen video models.


Auto-rater evaluation on joint score judgments. We present the joint accuracy and F1 score between the predicted scores and ground-truth scores (0-1) for our VideoPhy2-autoeval and VideoCon-Physics.
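For reference, the agreement metrics reported here can be computed as in the generic SciPy/scikit-learn sketch below; the toy arrays are placeholders, and this is not the evaluation code released with VideoPhy2.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

# Per-aspect correlation between predicted and ground-truth 1-5 scores (reported x 100).
pred_scores = np.array([4, 2, 5, 3, 1])   # toy model predictions
gold_scores = np.array([5, 2, 4, 3, 1])   # toy human ratings
r, _ = pearsonr(pred_scores, gold_scores)
print(100 * r)

# Joint-score agreement between binary (0/1) predictions and ground truth.
pred_joint = [1, 0, 1, 1, 0]
gold_joint = [1, 0, 0, 1, 0]
print(accuracy_score(gold_joint, pred_joint), f1_score(gold_joint, pred_joint))
```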


Auto-rater evaluation on physical rule classification. We present the accuracy results for VideoPhy2-autoeval and other video-language models on the rule classification task.