VIDEOPHY: Evaluating Physical Commonsense In Video Generation

Abstract

Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy2-eval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation.

Model: Wan2.1

Violation: The rock should roll downwards instead of upwards (Gravity)

Text Prompt: A small rock tumbles down a steep, rocky hillside, displacing soil and small stones.

Model: CogVideoX-5B

Violation: The beads should not leave the container without a hole (Permeability)

Text Prompt: A child pours colorful beads from a plastic container into a glass jar until they overflow, scattering on the floor.

Model: Cosmos

Violation: The towel should not expel a stream of water with more volume than itself (Conservation of Mass)

Text Prompt: A person vigorously twists a wet towel, water spraying outwards in a visible arc.

Model: Hunyuan

Violation: The leaves should be blown away from the leaf blower, not towards it (Conservation of Momentum)

Text Prompt: A leaf blower is pointed at a patch of leaves on a lawn; the leaves are forcefully displaced in a specific direction.

Model: Ray2

Violation: The liquid should not leave the beaker until the level of the liquid is at the edge of the beaker (Bernoulli's Principle)

Text Prompt: A chemist pours a clear liquid from a beaker into a test tube, carefully avoiding spills.

Model: Sora

Violation: The handle of the paddle should not flex at such a sharp angle without breaking, and paddle movements should leave visible ripples in the lake (Hardness, Reflection)

Text Prompt: A canoeist uses a single-bladed paddle to propel their canoe across a lake, the paddle's movement visible against the still water.

Model: VideoCrafter2

Violation: The rope should have the same length and number of ends at all times (Conservation of Mass)

Text Prompt: A jump rope is laid on the ground in a circular pattern after use.

Human Leaderboard on Video Generation Models

Human evaluation results on VideoPhy2. We abbreviate semantic adherence as SA and physical commonsense as PC. The reported joint score (SA, PC) is the percentage of instances for which SA=1 and PC=1.

Open Models

#	Model	Source	All	Hard	PA	OI
1	Wan2.1-14B	Open	32.6	21.9	31.5	36.2
2	CogVideoX-5B	Open	25.0	0.0	24.6	26.1
3	Cosmos-Diff-7B	Open	24.1	10.9	22.6	27.4
4	Hunyuan-13B	Open	17.2	6.2	17.6	15.9
5	VideoCrafter-2	Open	10.5	2.9	10.1	13.1

Closed Models

#	Model	Source	All	Hard	PA	OI
1	Sora	Closed	23.3	5.3	22.2	26.7
2	Ray2	Closed	20.3	8.3	21.0	18.5
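
The joint score reported in the leaderboard can be sketched in a few lines of Python. This is a minimal illustration, assuming the human ratings have already been reduced to binary SA and PC labels per video; the function name `joint_score` and the example labels are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def joint_score(sa: Sequence[int], pc: Sequence[int]) -> float:
    """Return the percentage of instances with SA == 1 and PC == 1."""
    if len(sa) != len(pc):
        raise ValueError("sa and pc must have the same length")
    if len(sa) == 0:
        raise ValueError("no instances to score")
    # Count videos that pass both the semantic and the physical check.
    both = sum(1 for s, p in zip(sa, pc) if s == 1 and p == 1)
    return 100.0 * both / len(sa)


# Hypothetical labels for five generated videos (not real benchmark data).
sa_labels = [1, 1, 0, 1, 0]
pc_labels = [1, 0, 0, 1, 1]
print(joint_score(sa_labels, pc_labels))  # 2 of the 5 videos pass both checks
```

A video counts toward the joint score only if it passes both checks, which is why the joint number can never exceed either the SA or the PC percentage taken alone.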

🚨 To submit your results to the leaderboard, please send a CSV file containing the video URLs and captions from the model builders to this email for human/automatic evaluation.

VideoPhy: Benchmark

Physical Laws Violation Analysis

We present the violation scores for diverse physical laws based on human annotations collected from various video generative models on the VideoPhy2 dataset.

Top-20 frequently occurring verbs (inner) and their top-5 direct nouns (outer).
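
A verb and noun breakdown of this kind could be computed roughly as follows. This is only a sketch: it assumes the (verb, direct noun) pairs have already been extracted from the prompts by some parser, and the helper name `top_verbs_and_nouns` and the example pairs are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def top_verbs_and_nouns(
    pairs: List[Tuple[str, str]], n_verbs: int = 20, n_nouns: int = 5
) -> Dict[str, List[str]]:
    """Map each of the most frequent verbs to its most frequent direct nouns."""
    verb_counts = Counter(verb for verb, _ in pairs)
    nouns_by_verb = defaultdict(Counter)
    for verb, noun in pairs:
        nouns_by_verb[verb][noun] += 1
    return {
        verb: [noun for noun, _ in nouns_by_verb[verb].most_common(n_nouns)]
        for verb, _ in verb_counts.most_common(n_verbs)
    }


# Hypothetical pairs for illustration only (not drawn from the dataset).
example_pairs = [
    ("throw", "ball"), ("throw", "ball"), ("throw", "javelin"),
    ("pour", "water"), ("pour", "water"), ("hit", "ball"),
]
print(top_verbs_and_nouns(example_pairs, n_verbs=2, n_nouns=1))
```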

Auto Evaluation of Physical Commonsense, Semantic Adherence, and Rule Applicability

We use VideoCon-Physics as a base model for robust semantic adherence evaluation. Specifically, we prompt our fine-tuned model to generate a score (1-5) for the text adherence, physical commonsense, and rule applicability of the generated videos.

Effectiveness of Our Auto-Evaluator

Auto-rater evaluation results (Pearson's correlation ×100) between the predicted scores and ground-truth scores (1-5) on the unseen prompts and unseen video models.
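
For reference, the correlation metric named in this caption can be computed as below. This is a generic, standard-library sketch of Pearson's correlation scaled by 100, not the authors' evaluation script; the function name `pearson_x100` and the example scores are illustrative assumptions.

```python
import math
from typing import Sequence


def pearson_x100(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Pearson's correlation between two score lists, multiplied by 100."""
    n = len(pred)
    if n != len(truth) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 items")
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return 100.0 * cov / math.sqrt(var_p * var_t)


# Hypothetical 1-5 scores for five videos (not real benchmark data).
predicted = [4, 2, 5, 1, 3]
human = [5, 2, 4, 1, 3]
print(round(pearson_x100(predicted, human), 1))
```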

Auto-rater evaluation on joint score judgments. We present the joint accuracy and F1 score between the predicted scores and ground-truth scores (0-1) for our VideoPhy2-autoeval and VideoCon-Physics.
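
The two metrics in this caption can be illustrated with a short sketch over binary joint labels. It assumes label 1 is the positive class for F1, which the page does not state; the function name `accuracy_and_f1` and the example labels are hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and F1 (positive class = 1) for binary 0/1 judgments."""
    if len(pred) != len(truth) or len(truth) == 0:
        raise ValueError("need two non-empty lists of equal length")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    accuracy = correct / len(truth)
    # F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when there are no positives.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical joint judgments for six videos (not real benchmark data).
auto_labels = [1, 0, 1, 1, 0, 0]
human_labels = [1, 0, 0, 1, 1, 0]
print(accuracy_and_f1(auto_labels, human_labels))
```

Reporting F1 alongside accuracy is useful when few videos pass both checks, since a rater that always predicts 0 would still score a high accuracy.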

Auto-rater evaluation on physical rule classification. We present the accuracy results for VideoPhy2-autoeval and other video-language models on the rule classification tasks.

VideoPhy 2

Challenging Action-Centric Physical Commonsense Evaluation of Video Generation
