PhyWorld: Physics-Faithful World Model for Video Generation

Abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations: generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video-generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training.

In the first stage, we improve video-to-video continuation with flow-matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines, and reaches an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

VBench overall: 0.769 (+0.013 vs. base)
PhyGround physical-faithfulness: 3.09 (+0.10 vs. base)
Optical-physics gain over Wan2.2-I2V-A14B: +0.21
DPO post-training: 16 × H100, 2 epochs / 250 steps

Why physics is hard for video models

Most prior world-model work follows Self-Forcing or LongCat-style pipelines, which focus on ultra-long video generation. These pipelines run into three persistent limitations:

Small model size

Ultra-long generation forces compact backbones because of high memory usage at training time, which caps visual fidelity.

Consistency, not physics

Existing objectives target temporal smoothness; nothing enforces that motion, contact, or material behavior actually obeys physical law.

Simple prompts, hard outcomes

Even short prompts ("a Newton's cradle") expose limits in ultra-long settings — the model has to predict a physical chain reaction without an explicit physics signal.

Even frontier video models break under simple physics

The Newton's-cradle test: same prompt, same setup, three frontier video models. None of them gets the chain-reaction collision right.

NVIDIA Cosmos

Google Veo

OpenAI Sora

Failure cases on a classic physics test — balls fail to swing, fail to transfer momentum, or float. The gap motivates PhyWorld's explicit two-stage physics post-training.

Method

PhyWorld is post-trained from Wan2.2-I2V-A14B in two stages. Stage 1 enhances physical consistency through flow-matching fine-tuning on a video-to-video continuation pipeline. Stage 2 enforces physical laws through reinforcement learning with Direct Preference Optimization (DPO), using human-annotated preference pairs drawn directly from the PhyGround evaluation pool.

Figure 1. Framework overview. Stage 1 fine-tunes with flow matching for temporal consistency; Stage 2 applies LoRA-DPO over physics preference pairs to enforce physical principles.

Stage 1 Physical Consistency Enhancement

Video-to-video continuation with Wan-VAE conditioning, a binary mask delimiting preserved vs. synthesised frames, and a CLIP global-context embedding injected via decoupled cross-attention. The DiT is trained with rectified-flow matching on OpenVid-1M clips, filtered for inter-frame CLIP cosine similarity and per-clip optical-flow magnitude, yielding smooth, motion-controlled supervision.
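The clip filtering step can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the thresholds `min_sim`, `min_flow`, and `max_flow`, and the idea that per-frame CLIP embeddings and a per-clip mean optical-flow magnitude are already precomputed are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keep_clip(frame_embs, flow_mag, min_sim=0.9, min_flow=0.5, max_flow=20.0):
    """Decide whether a clip passes the two filters.

    frame_embs: per-frame embedding vectors (assumed precomputed with CLIP).
    flow_mag:   mean optical-flow magnitude of the clip (assumed precomputed).
    All thresholds are hypothetical placeholders, not values from the paper.
    """
    if len(frame_embs) < 2:
        return False  # no consecutive frame pair to compare
    # Similarity between each pair of consecutive frames.
    sims = [cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]
    smooth = min(sims) >= min_sim              # rejects scene cuts and abrupt changes
    moving = min_flow <= flow_mag <= max_flow  # rejects static or chaotic clips
    return smooth and moving
```

Taking the minimum consecutive-frame similarity, rather than the mean, means a single hard cut is enough to reject a clip; whether the paper uses the minimum or another statistic is not stated.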

Stage 2 Physics Enforcement via DPO

A rank-16 LoRA adapter is wrapped around the Wan2.2 denoiser's attention and feed-forward projections. Preference pairs are derived from PhyGround human ratings: each pair holds a winner and a loser generated for the same prompt, with a score margin ≥ 1.0. A class-balanced training set of 1,000 pairs spans seven physical-event classes. DPO is restricted to the high-noise window t ∈ [901, 999] with β = 100 to suppress reward hacking.
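A minimal sketch of a preference loss of this kind is given below. It assumes the commonly used Diffusion-DPO form, in which the policy (LoRA-adapted) and frozen reference denoisers are compared through their per-sample prediction errors on the winner and loser videos; the exact objective in the paper may differ. Only β = 100 and the window [901, 999] come from the text. The function names and the folding of any timestep weighting into β are assumptions.

```python
import math
import random

BETA = 100.0             # preference temperature, from the text
T_MIN, T_MAX = 901, 999  # high-noise timestep window, from the text

def sample_timestep(rng):
    """Draw a training timestep uniformly from the high-noise window."""
    return rng.randint(T_MIN, T_MAX)  # randint includes both endpoints

def dpo_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=BETA):
    """Diffusion-DPO style loss for one (winner, loser) pair.

    Each err_* is a squared prediction error of the denoiser on the
    winner (w) or loser (l) video, for the policy or the reference model.
    """
    # Positive margin: relative to the reference, the policy fits the
    # winner better than it fits the loser.
    margin = (err_l_policy - err_l_ref) - (err_w_policy - err_w_ref)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls toward zero as the policy lowers its error on winners relative to losers.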

Preference data, grounded in PhyGround

Stage 2 reuses the 2,000 human-rated videos behind the PhyGround benchmark (~350 raters, ~4,500 cleaned annotations on a 1–5 Likert scale across semantic alignment, physical-temporal validity, and persistence). Pairs are produced by a four-stage content-addressed pipeline (T0 score → T1 pair → T2 encode → T3 subset), with a 42-prompt holdout reserved for validation. The final training set contains 1,000 pairs over 208 conditioning-image groups, class-balanced over collision / rebound, destruction / deformation, fluids, shadow / reflection, chain / multi-stage, rolling / sliding, and throwing / ballistic events.
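The pairing step (T1) can be sketched as follows, assuming each video already carries a single aggregated human score from the scoring step. The record keys `id`, `prompt`, and `score` and the function name are hypothetical; the margin of 1.0 and the exclusion of holdout prompts follow the description above. Class balancing (T3) is not shown.

```python
from collections import defaultdict
from itertools import combinations

def build_pairs(videos, margin=1.0, holdout=frozenset()):
    """Form within-prompt (winner, loser) pairs with a minimum score gap.

    videos:  list of dicts with hypothetical keys 'id', 'prompt', 'score'.
    holdout: prompts reserved for validation; never used for training pairs.
    """
    by_prompt = defaultdict(list)
    for video in videos:
        if video["prompt"] not in holdout:
            by_prompt[video["prompt"]].append(video)

    pairs = []
    for prompt, group in by_prompt.items():
        # Compare every two videos generated for the same prompt.
        for a, b in combinations(group, 2):
            win, lose = (a, b) if a["score"] >= b["score"] else (b, a)
            if win["score"] - lose["score"] >= margin:
                pairs.append((prompt, win["id"], lose["id"]))
    return pairs
```

Pairs never cross prompts, so the preference signal reflects how well the same scene was continued rather than differences between scenes.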

Evaluation protocol

We evaluate generation quality with 500 random prompts from VBench at 480p, reporting subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality. For physical faithfulness we use PhyGround's 250-prompt TI2V benchmark and its released PhyJudge-9B judge model under deterministic decoding, with semantic alignment (SA), physical-temporal validity (PTV), and persistence as general dimensions, and solid-body, fluid, and optical pools as per-domain physics scores.

VBench — Generation Quality

Subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic and imaging quality on VBench. Higher is better.

#	Model	Subject consistency	Background consistency	Motion smoothness	Dynamic degree	Aesthetic quality	Imaging quality	Avg.
1	PhyWorld Ours	0.932	0.944	0.986	0.564	0.555	0.632	0.769
2	Wan2.2-I2V-A14B	0.912	0.928	0.977	0.554	0.543	0.622	0.756
3	Cosmos-14B	0.899	0.923	0.973	0.559	0.536	0.629	0.753
4	LTX-2.3-22B	0.894	0.918	0.982	0.549	0.532	0.626	0.751
5	OmniWeaving	0.903	0.907	0.972	0.556	0.541	0.621	0.750
6	Cosmos-2-2B	0.887	0.905	0.964	0.543	0.524	0.608	0.739

Source: paper Table 3, 500 random VBench prompts at 480p.
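The Avg. column is consistent, up to rounding, with an unweighted mean of the six dimension scores. That aggregation is inferred from the reported numbers rather than stated by the source, so the short check below should be read as an assumption; `vbench_avg` is a hypothetical helper name.

```python
# Six VBench dimension scores per model, copied from the table above.
ROWS = {
    "PhyWorld": [0.932, 0.944, 0.986, 0.564, 0.555, 0.632],
    "Wan2.2-I2V-A14B": [0.912, 0.928, 0.977, 0.554, 0.543, 0.622],
}

def vbench_avg(scores):
    """Unweighted mean over dimensions (assumed aggregation)."""
    return sum(scores) / len(scores)

for name, scores in ROWS.items():
    print(f"{name}: {vbench_avg(scores):.3f}")
```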

PhyGround — Physical Faithfulness

Scored by PhyJudge-9B on PhyGround's 250-prompt humaneval set, 1–5 scale (higher is better). SA / PTV / Persistence are per-video general dimensions; Solid-Body / Fluid / Optical are pooled per (video, law) units within each domain.

#	Model	SA	PTV	Persist.	Solid-Body	Fluid	Optical	Overall
1	PhyWorld Ours	2.78	3.07	3.23	2.84	3.04	3.57	3.09
2	Wan2.2-I2V-A14B	2.72	2.97	3.08	2.79	3.03	3.36	2.99
3	Cosmos-14B	2.60	2.73	3.07	2.72	2.92	3.53	2.80
4	OmniWeaving	2.68	2.73	2.92	2.71	2.99	3.13	2.78
5	LTX-2.3-22B	2.63	2.79	2.91	2.55	3.02	3.21	2.72
6	Wan2.2-TI2V-5B	2.48	2.70	2.76	2.61	3.01	3.45	2.68
7	LTX-2-19B	2.50	2.62	2.79	2.49	3.01	3.09	2.62

Source: paper Table 4. Overall = 0.5 × mean(SA, PTV, Persist.) + 0.5 × pooled mean over (video, law) units across the three domains.
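The Overall formula can be written out as a short sketch. The function name, the dictionary layout, and every score in the example are hypothetical and chosen only to show the arithmetic; they are not taken from the benchmark.

```python
def overall_score(general, units_by_domain):
    """Overall = 0.5 * mean(general dims) + 0.5 * pooled mean over units.

    general:         (SA, PTV, Persist.) scores for one model.
    units_by_domain: domain name -> list of (video, law) unit scores.
    """
    general_mean = sum(general) / len(general)
    # Pool all units across domains before averaging, so every
    # (video, law) unit carries the same weight.
    pooled = [s for units in units_by_domain.values() for s in units]
    physics_mean = sum(pooled) / len(pooled)
    return 0.5 * general_mean + 0.5 * physics_mean

# Hypothetical scores for one model.
general = (3.0, 3.0, 3.0)
units = {"solid_body": [3, 2, 3, 3], "fluid": [4], "optical": [4]}
score = overall_score(general, units)
```

Because units are pooled, a domain with more (video, law) units weighs more in the physics half than a domain with fewer. In the hypothetical example the pooled mean is 19/6 ≈ 3.17, whereas averaging the three per-domain means would give about 3.58.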

Head-to-head video comparison

Same conditioning input, same prompt, four models compared. PhyWorld preserves the physical state implied by the input frame and produces temporally coherent continuations; SOTA baselines drift in color, background, or object identity, or break under physical interactions.

PhyWorld (Ours)

Cosmos

LTX

OmniWeaving

Figure 2. PhyWorld generates videos with superior physical consistency. Baselines suffer from color shift, background change, or object drift; PhyWorld respects the conditioning frame.

BibTeX

@misc{zhao2026phyworld,
  title         = {PhyWorld: Physics-Faithful World Model for Video Generation},
  author        = {Pu Zhao and Juyi Lin and Timothy Rupprecht and Arash Akbari and Chence Yang and Rahul Chowdhury and Elaheh Motamedi and Arman Akbari and Yumei He and Chen Wang and Geng Yuan and Weiwei Chen and Yanzhi Wang},
  year          = {2026},
  eprint        = {2605.19242},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.19242}
}
