PhyWorld

Physics-Faithful World Model for Video Generation

Pu Zhao*¹, Juyi Lin*¹, Timothy Rupprecht¹, Arash Akbari¹, Chence Yang², Rahul Chowdhury¹, Elaheh Motamedi¹,
Arman Akbari¹, Yumei He³, Chen Wang⁴, Geng Yuan², Weiwei Chen⁴, Yanzhi Wang¹
¹Northeastern University, ²University of Georgia, ³Tulane University, ⁴EmbodyX

*Equal contribution.

PhyWorld pipeline overview

PhyWorld pipeline. We start from an open video-generation base (e.g. NVIDIA Cosmos / Wan2.2), modify the encoder for video-to-video continuation to produce PhyWorld-Base, then iterate against the PhyGround benchmark in two loops: training for continuity via flow matching, and training for explicit physics awareness via DPO over PhyGround preference pairs. The resulting PhyWorld is a building block for downstream Physical AI: long-horizon planning, multimodality, and robotic policies such as Omni-Robot.

Abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations: generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video-generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training.

In the first stage, we improve video-to-video continuation with flow-matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines, and reaches an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

VBench overall: 0.769 (+0.013 vs. base)
PhyGround physical-faithfulness: 3.09 (+0.10 vs. base)
Optical-physics gain: +0.21 over Wan2.2-I2V-A14B
DPO post-training: 16 × H100, 2 epochs / 250 steps

Why physics is hard for video models

Prior world-model work is largely Self-Forcing or LongCat-style, focused on ultra-long video generation. These pipelines run into three persistent walls:

Small model size

Ultra-long generation forces compact backbones because of large memory usage at train time, capping visual fidelity.

Consistency, not physics

Existing objectives target temporal smoothness; nothing enforces that motion, contact, or material behavior actually obeys physical law.

Simple prompts, hard outcomes

Even short prompts ("a Newton's cradle") expose limits in ultra-long settings — the model has to predict a physical chain reaction without an explicit physics signal.

Even frontier video models break under simple physics

The Newton's-cradle test: same prompt, same setup, three frontier video models. None of them get the chain-reaction collision right.

NVIDIA Cosmos
Google Veo
OpenAI Sora

Failure cases on a classic physics test — balls fail to swing, fail to transfer momentum, or float. The gap motivates PhyWorld's explicit two-stage physics post-training.

Method

PhyWorld is post-trained from Wan2.2-I2V-A14B in two stages. Stage 1 enhances physical consistency through flow-matching fine-tuning on a video-to-video continuation pipeline. Stage 2 enforces physical laws through reinforcement learning with Direct Preference Optimization (DPO), using human-annotated preference pairs drawn directly from the PhyGround evaluation pool.

PhyWorld two-stage training framework
Figure 1. Framework overview. Stage 1 fine-tunes with flow matching for temporal consistency; Stage 2 applies LoRA-DPO over physics preference pairs to enforce physical principles.

Stage 1 Physical Consistency Enhancement

Video-to-video continuation with Wan-VAE conditioning, a binary mask delimiting preserved vs. synthesised frames, and a CLIP global-context embedding injected via decoupled cross-attention. The DiT is trained with rectified-flow matching on OpenVid-1M clips, filtered for inter-frame CLIP cosine similarity and per-clip optical-flow magnitude, yielding smooth, motion-controlled supervision.
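The Stage 1 objective can be illustrated with a minimal, dependency-free sketch of rectified-flow matching with a preserved-frame mask. Several choices here are assumptions rather than details from the paper: the path convention (t = 0 is data, t = 1 is noise), the exclusion of preserved frames from the loss, and the function names. Each "frame" is reduced to a single scalar latent to keep the example short.

```python
def interpolate(x_data, x_noise, t):
    """Point on the straight-line path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(x_data, x_noise)]


def velocity_target(x_data, x_noise):
    """Rectified-flow regression target: the constant velocity along the path."""
    return [n - d for d, n in zip(x_data, x_noise)]


def masked_flow_matching_loss(pred_velocity, x_data, x_noise, preserved_mask):
    """Mean squared velocity error over synthesised positions only.

    preserved_mask[i] == 1 marks a conditioning frame (assumed here to be
    excluded from the loss); 0 marks a frame the model must synthesise.
    """
    target = velocity_target(x_data, x_noise)
    total, count = 0.0, 0
    for p, v, m in zip(pred_velocity, target, preserved_mask):
        if m == 0:
            total += (p - v) ** 2
            count += 1
    return total / count if count else 0.0


# Toy clip: two preserved frames followed by two frames to synthesise.
x_data = [1.0, 2.0, 3.0, 4.0]
x_noise = [0.5, -0.5, 1.0, 0.0]
mask = [1, 1, 0, 0]

x_t = interpolate(x_data, x_noise, t=0.5)      # noisy input the model would see
perfect = velocity_target(x_data, x_noise)     # an oracle prediction
loss = masked_flow_matching_loss(perfect, x_data, x_noise, mask)
```

In a real pipeline the prediction would come from the DiT evaluated on `x_t`, the mask, and the conditioning embeddings; here an oracle prediction stands in so that the loss is zero by construction.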

Stage 2 Physics Enforcement via DPO

A rank-16 LoRA adapter is wrapped around the Wan2.2 denoiser's attention and feed-forward projections. Preference pairs are derived from PhyGround human ratings: within-prompt winner/loser pairs with score margin ≥ 1.0. A class-balanced 1,000-pair trainset spans seven physical-event classes; DPO is restricted to the high-noise window t ∈ [901, 999] with β = 100 to suppress reward hacking.
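The sketch below shows how the quoted hyperparameters (β = 100, t ∈ [901, 999]) could enter a Diffusion-DPO-style preference loss. The page does not spell out the exact objective, so the loss form, the uniform timestep sampling, and every function name are assumptions. The `err_*` arguments stand for per-sample denoising errors (for example, squared error between predicted and target velocity) under the trainable policy and the frozen reference model.

```python
import math
import random

BETA = 100.0               # DPO temperature quoted in the text
T_MIN, T_MAX = 901, 999    # high-noise timestep window quoted in the text


def sample_timestep(rng: random.Random) -> int:
    """Draw a timestep uniformly from the restricted window (both ends inclusive)."""
    return rng.randint(T_MIN, T_MAX)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=BETA):
    """Preference loss on denoising errors for a (winner, loser) pair.

    The loss falls when the policy reduces its error on the winner relative
    to the reference more than it does on the loser.
    """
    margin = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    # -log(sigmoid(-beta * margin)) == softplus(beta * margin)
    return softplus(beta * margin)


rng = random.Random(0)
t = sample_timestep(rng)
baseline = dpo_loss(0.20, 0.20, 0.30, 0.30)   # policy == reference
improved = dpo_loss(0.19, 0.20, 0.30, 0.30)   # policy fits the winner better
```

When the policy equals the reference the margin is zero and the loss is log 2; with β = 100, even a 0.01 improvement on the winner moves the loss noticeably, which is why a large β keeps the adapter close to the reference model.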

Preference data, grounded in PhyGround

Stage 2 reuses the 2,000 human-rated videos behind the PhyGround benchmark (~350 raters, ~4,500 cleaned annotations on a 1–5 Likert scale across semantic alignment, physical-temporal validity, and persistence). Pairs are produced by a four-stage content-addressed pipeline (T0 score → T1 pair → T2 encode → T3 subset), with a 42-prompt holdout reserved for validation. The final trainset contains 1,000 pairs over 208 conditioning-image groups, class-balanced over collision / rebound, destruction / deformation, fluids, shadow / reflection, chain / multi-stage, rolling / sliding, and throwing / ballistic events.
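A rough sketch of the pairing and subsetting steps follows. The record fields (`prompt_id`, `video_id`, `score`, `event_class`), the function names, and the rule of keeping the largest-margin pairs first within each class are all hypothetical; the page only states that pairs are within-prompt, have a score margin of at least 1.0, exclude the holdout prompts, and are class-balanced.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(videos, min_margin=1.0, holdout_prompts=frozenset()):
    """Form within-prompt (winner, loser) pairs whose score gap is >= min_margin."""
    by_prompt = defaultdict(list)
    for v in videos:
        if v["prompt_id"] not in holdout_prompts:
            by_prompt[v["prompt_id"]].append(v)

    pairs = []
    for prompt_id, group in by_prompt.items():
        for a, b in combinations(group, 2):
            margin = abs(a["score"] - b["score"])
            if margin >= min_margin:
                winner, loser = (a, b) if a["score"] > b["score"] else (b, a)
                pairs.append({
                    "prompt_id": prompt_id,
                    "winner": winner["video_id"],
                    "loser": loser["video_id"],
                    "margin": margin,
                    "event_class": winner["event_class"],
                })
    return pairs


def class_balanced_subset(pairs, per_class):
    """Keep at most per_class pairs per event class, largest margins first (assumed rule)."""
    kept, counts = [], defaultdict(int)
    for p in sorted(pairs, key=lambda p: -p["margin"]):
        if counts[p["event_class"]] < per_class:
            kept.append(p)
            counts[p["event_class"]] += 1
    return kept


videos = [
    {"prompt_id": "p1", "video_id": "a", "score": 4.5, "event_class": "fluids"},
    {"prompt_id": "p1", "video_id": "b", "score": 3.0, "event_class": "fluids"},
    {"prompt_id": "p1", "video_id": "c", "score": 3.8, "event_class": "fluids"},
    {"prompt_id": "p2", "video_id": "d", "score": 2.0, "event_class": "rolling / sliding"},
    {"prompt_id": "p2", "video_id": "e", "score": 4.0, "event_class": "rolling / sliding"},
    {"prompt_id": "p3", "video_id": "f", "score": 5.0, "event_class": "fluids"},
    {"prompt_id": "p3", "video_id": "g", "score": 1.0, "event_class": "fluids"},
]
train_pairs = build_pairs(videos, holdout_prompts={"p3"})
```

In this toy input only the a/b and e/d pairs survive: the other p1 combinations fall below the 1.0 margin, and p3 is held out for validation.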

Evaluation protocol

We evaluate generation quality with 500 random prompts from VBench at 480p, reporting subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality. For physical faithfulness we use PhyGround's 250-prompt TI2V benchmark and its released PhyJudge-9B judge model under deterministic decoding, with semantic alignment (SA), physical-temporal validity (PTV), and persistence as general dimensions, and solid-body, fluid, and optical pools as per-domain physics scores.

VBench — Generation Quality

Subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic and imaging quality on VBench. Higher is better.

| # | Model | Subject consistency | Background consistency | Motion smoothness | Dynamic degree | Aesthetic quality | Imaging quality | Avg. |
|---|---|---|---|---|---|---|---|---|
| 1 | PhyWorld (Ours) | 0.932 | 0.944 | 0.986 | 0.564 | 0.555 | 0.632 | 0.769 |
| 2 | Wan2.2-I2V-A14B | 0.912 | 0.928 | 0.977 | 0.554 | 0.543 | 0.622 | 0.756 |
| 3 | Cosmos-14B | 0.899 | 0.923 | 0.973 | 0.559 | 0.536 | 0.629 | 0.753 |
| 4 | LTX-2.3-22B | 0.894 | 0.918 | 0.982 | 0.549 | 0.532 | 0.626 | 0.751 |
| 5 | OmniWeaving | 0.903 | 0.907 | 0.972 | 0.556 | 0.541 | 0.621 | 0.750 |
| 6 | Cosmos-2-2B | 0.887 | 0.905 | 0.964 | 0.543 | 0.524 | 0.608 | 0.739 |

Source: paper Table 3, 500 random VBench prompts at 480p.
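As a quick sanity check on the table, the Avg. column appears to be the unweighted mean of the six VBench dimensions. That reading is an assumption (the page does not state the aggregation), but it reproduces the reported value for the PhyWorld row; the function name is illustrative.

```python
def vbench_average(scores):
    """Unweighted mean over the reported VBench dimensions (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


phyworld = {
    "subject_consistency": 0.932,
    "background_consistency": 0.944,
    "motion_smoothness": 0.986,
    "dynamic_degree": 0.564,
    "aesthetic_quality": 0.555,
    "imaging_quality": 0.632,
}
print(round(vbench_average(phyworld), 3))  # 0.769
```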

PhyGround — Physical Faithfulness

Scored by PhyJudge-9B on PhyGround's 250-prompt humaneval set, 1–5 scale (higher is better). SA / PTV / Persistence are per-video general dimensions; Solid-Body / Fluid / Optical are pooled per (video, law) units within each domain.

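General quality: SA, PTV, Persist. Physics adherence: Solid-Body, Fluid, Optical.

| # | Model | SA | PTV | Persist. | Solid-Body | Fluid | Optical | Overall |
|---|---|---|---|---|---|---|---|---|
| 1 | PhyWorld (Ours) | 2.78 | 3.07 | 3.23 | 2.84 | 3.04 | 3.57 | 3.09 |
| 2 | Wan2.2-I2V-A14B | 2.72 | 2.97 | 3.08 | 2.79 | 3.03 | 3.36 | 2.99 |
| 3 | Cosmos-14B | 2.60 | 2.73 | 3.07 | 2.72 | 2.92 | 3.53 | 2.80 |
| 4 | OmniWeaving | 2.68 | 2.73 | 2.92 | 2.71 | 2.99 | 3.13 | 2.78 |
| 5 | LTX-2.3-22B | 2.63 | 2.79 | 2.91 | 2.55 | 3.02 | 3.21 | 2.72 |
| 6 | Wan2.2-TI2V-5B | 2.48 | 2.70 | 2.76 | 2.61 | 3.01 | 3.45 | 2.68 |
| 7 | LTX-2-19B | 2.50 | 2.62 | 2.79 | 2.49 | 3.01 | 3.09 | 2.62 |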

Source: paper Table 4. Overall = 0.5 × mean(SA, PTV, Persist.) + 0.5 × pooled mean over (video, law) units across the three domains.
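The Overall formula can be written out as a short function. The input layout (a dict mapping each domain to its list of per-(video, law) scores) and the names are assumptions for illustration, and the numbers below are made up rather than taken from the benchmark. The point of the example is that a pooled mean weights each domain by its number of units, so it generally differs from the plain average of the three domain columns.

```python
def overall_score(sa, ptv, persistence, law_unit_scores):
    """0.5 * mean of the general dimensions + 0.5 * pooled mean over (video, law) units.

    law_unit_scores maps a domain name to the list of scores of its
    (video, law) units; pooling concatenates all units before averaging.
    """
    general = (sa + ptv + persistence) / 3.0
    pooled = [s for scores in law_unit_scores.values() for s in scores]
    physics = sum(pooled) / len(pooled)
    return 0.5 * general + 0.5 * physics


# Hypothetical scores: three solid-body units and one fluid unit.
units = {"solid_body": [2.0, 2.0, 2.0], "fluid": [4.0]}
score = overall_score(3.0, 3.0, 3.0, units)
```

Here the pooled physics mean is 10 / 4 = 2.5, giving an overall score of 2.75, whereas averaging the two domain means (2.0 and 4.0) would give 3.0 for both the physics term and the overall score.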

Head-to-head video comparison

Same conditioning input, same prompt, four models compared. PhyWorld preserves the physical state implied by the input frame and produces temporally coherent continuations; SOTA baselines drift in color, background, or object identity, or break under physical interactions.

PhyWorld (Ours)
Cosmos
LTX
OmniWeaving

Figure 2. PhyWorld generates videos with superior physical consistency. Baselines suffer color shift, background change, or object drift; PhyWorld respects the conditioning frame.

BibTeX

@misc{zhao2026phyworld,
  title         = {PhyWorld: Physics-Faithful World Model for Video Generation},
  author        = {Pu Zhao and Juyi Lin and Timothy Rupprecht and Arash Akbari and Chence Yang and Rahul Chowdhury and Elaheh Motamedi and Arman Akbari and Yumei He and Chen Wang and Geng Yuan and Weiwei Chen and Yanzhi Wang},
  year          = {2026},
  eprint        = {2605.19242},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.19242}
}