Solaris is a multiplayer video world model for Minecraft that generates consistent first-person observations for two players simultaneously. It is trained on 12.6M frames of coordinated Minecraft gameplay created by SolarisEngine, a scalable framework for producing realistic multiplayer Minecraft gameplay.
Several frameworks exist for controlling agents in Minecraft, including Malmo, MineRL, MineDojo, and Mineflayer. While these tools each offer useful capabilities, none of them were designed with multiplayer data collection in mind. Frameworks like MineRL and MineDojo expose low-level action spaces in which scripting purposeful behavior is impractical, so agents end up taking essentially random actions, producing chaotic gameplay that is unusable for world modeling. The problem compounds when multiple agents act randomly at the same time. Mineflayer provides high-level primitives like pathfinding and block placement, but it was never built for coordinated multi-agent interaction. No existing system could be used off the shelf to collect realistic multiplayer gameplay data, so we built one from scratch.
We chose Mineflayer as our foundation because it provides composable primitives for things like pathfinding, block placement, and combat. On top of this, we built a communication layer that lets bots coordinate with each other during episodes. We also introduced higher-level primitives for building, scaffolding, tool use, and navigation. When combined, these form complete episodes where two bots work together toward a predefined goal. We built a library of episode types covering core aspects of Minecraft interaction: building houses and bridges, PvP and PvE combat, chasing and exploration, and mining (Figure 1). Although the episode logic is written using these high-level primitives, the system translates everything down to a low-level action space compatible with VPT, a single-player Minecraft dataset collected from human players.
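To make the translation concrete, here is a toy sketch of how a high-level primitive could expand into VPT-style per-tick action dictionaries at 20 Hz. The primitive names, action keys, and tick counts are illustrative assumptions, not the actual SolarisEngine API.

```python
# Hypothetical expansion of high-level primitives into per-tick,
# VPT-style action dicts. All names here are illustrative.

def look_at(yaw_delta, pitch_delta, ticks):
    """Spread a camera rotation across several 20 Hz ticks."""
    per_tick = (yaw_delta / ticks, pitch_delta / ticks)
    return [{"camera": per_tick, "attack": 0, "use": 0} for _ in range(ticks)]

def place_block():
    """A single 'use' press places the held block."""
    return [{"camera": (0.0, 0.0), "attack": 0, "use": 1}]

def bridge_segment():
    # One bridge step: look down at the edge, place a block, step forward.
    actions = look_at(0.0, 30.0, ticks=4) + place_block()
    actions += [{"camera": (0.0, 0.0), "forward": 1, "attack": 0, "use": 0}]
    return actions

# A (toy) episode is just a concatenation of primitive expansions.
episode = [a for _ in range(3) for a in bridge_segment()]
```

Real episodes also interleave the coordination layer (bots exchanging messages), which is omitted here for brevity.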
Mineflayer controls a character but cannot render graphics. To get visual observations, we pair each controller bot with a camera bot running the official Minecraft Java client in headless mode. A custom server-side plugin synchronizes the camera to mirror the controller's position, orientation, and even animations in real time. Although they run as separate processes, the controller and camera together form a single logical player. Actions and visual observations are aligned in post-processing using timestamps at a shared 20 FPS frame rate. The overall architecture is shown in Figure 2.
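The timestamp alignment at 20 FPS (50 ms per frame) can be sketched as picking, for each rendered frame, the latest logged action at or before the frame's timestamp. This is a minimal sketch assuming millisecond timestamps in both logs; the actual post-processing pipeline may differ.

```python
# Toy timestamp alignment between a camera's frame clock and a
# controller's action log (assumed millisecond timestamps).
FRAME_MS = 50  # 1000 ms / 20 FPS

def align(frame_ts, action_log):
    """For each frame timestamp, take the latest action at or before it."""
    actions = sorted(action_log, key=lambda a: a["ts"])
    aligned, i = [], 0
    for t in frame_ts:
        while i + 1 < len(actions) and actions[i + 1]["ts"] <= t:
            i += 1
        aligned.append(actions[i]["action"])
    return aligned

frames = [i * FRAME_MS for i in range(4)]  # 0, 50, 100, 150 ms
log = [{"ts": 0, "action": "idle"},
       {"ts": 60, "action": "jump"},
       {"ts": 140, "action": "attack"}]
```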
We package the controller bots, camera bots, and Minecraft server as Docker containers orchestrated through Docker Compose. A suite of Python scripts manages these units, launching multiple workers in parallel for scalable collection. The bots run in a loop, sampling and executing episodes from our library, teleporting to a random location at the start of each episode to diversify terrain. Since Minecraft is complex and stochastic, episodes inevitably encounter errors or get stuck. We built a safety mechanism that detects failures during execution, notifies all components, and aborts the current episode. The system then proceeds with a fresh state, enabling continuous data collection without manual intervention.
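The abort-and-reset behavior can be sketched as a worker loop that catches failures and simply continues with a fresh episode. The `EpisodeFailure` exception and episode names below are hypothetical stand-ins, not the actual SolarisEngine interface.

```python
# Toy worker loop with the abort-and-continue safety behavior.
import random

class EpisodeFailure(Exception):
    """Raised when a bot gets stuck or an action errors out (hypothetical)."""

def run_worker(episodes, n_episodes, run_episode):
    completed, aborted = 0, 0
    for _ in range(n_episodes):
        episode = random.choice(episodes)  # sample from the episode library
        try:
            run_episode(episode)           # teleport to a random spot, execute
            completed += 1
        except EpisodeFailure:
            aborted += 1                   # notify components, discard, reset
    return completed, aborted
```

In the real system each worker is a Docker Compose unit (controller bots, camera bots, server), and the failure notification fans out to all of its components.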
Using SolarisEngine, we collected a multiplayer Minecraft training dataset totaling 9,240 episodes and 6.32M frames per player (12.64M combined). The episodes fall into four broad categories: building (houses, walls, towers, bridges), combat (PvP and PvE), movement (chasing, navigation, exploration), and mining. We split episodes evenly between Superflat and Normal world types, and vary times of day, biomes, and weather to maximize visual diversity. Episode lengths range from 128 to 512 frames (6.4 to 25.6 seconds at 20 fps), with episode types sampled randomly using weights that decrease with typical episode length to keep the distribution balanced. All actions are annotated as semantic game events in the format compatible with VPT, covering movement, camera, and interactive inputs like digging, placing, and attacking. To our knowledge, this is the first action-annotated multiplayer Minecraft dataset suitable for training world models. Figure 3 shows the full dataset breakdown.
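One way to realize sampling weights that decrease with typical episode length is inverse-length weighting, sketched below. The exact weighting scheme and the per-category lengths are assumptions for illustration.

```python
# Length-weighted episode sampling: shorter episode types get higher
# sampling probability so long episodes don't dominate the frame budget.
# The inverse-length scheme and these typical lengths are assumptions.
typical_len = {"building": 512, "combat": 256, "movement": 128, "mining": 256}

def sampling_weights(lengths):
    inv = {k: 1.0 / v for k, v in lengths.items()}
    total = sum(inv.values())
    return {k: w / total for k, w in inv.items()}

weights = sampling_weights(typical_len)
```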
Solaris is a controllable video diffusion model that jointly predicts future observations for multiple players, conditioned on their past observations and actions. We train it using flow matching combined with diffusion forcing, sampling independent noise levels per player and per timestep. This teaches the model to denoise each player's observation stream while staying consistent across players.
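A minimal NumPy sketch of this noising scheme: an independent noise level is drawn for every (player, timestep) pair and used to interpolate between clean data and Gaussian noise, in the flow-matching style. This is a toy illustration on raw arrays; the actual model operates on latent frames and a learned denoiser.

```python
# Diffusion-forcing-style corruption with independent per-player,
# per-timestep noise levels (toy NumPy sketch).
import numpy as np

def noise_levels(n_players, n_timesteps, rng):
    # One sigma in [0, 1] per (player, timestep) pair.
    return rng.uniform(0.0, 1.0, size=(n_players, n_timesteps))

def corrupt(x, sigma, rng):
    """Flow-matching interpolation: x_sigma = (1 - sigma) * x + sigma * eps."""
    eps = rng.standard_normal(x.shape)
    s = sigma[..., None, None, None]  # broadcast over H, W, C
    return (1.0 - s) * x + s * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 16, 3))  # players, time, H, W, C
sigma = noise_levels(2, 8, rng)
x_noisy = corrupt(x, sigma, rng)
```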
We build on MatrixGame 2.0.
Autoregressive video generation models are trained on ground truth context but must condition on their own outputs at inference time. This train-test mismatch causes errors to accumulate over long rollouts, leading to visual degradation. Self Forcing addresses this by training the model on its own generated outputs: the generator unrolls autoregressively during training, and a pretrained bidirectional teacher provides a distributional loss on the resulting frames. This closes the gap between training and inference and significantly improves long-horizon generation quality. The videos below show the effect on our model. The left and right views correspond to the two players (Alpha and Bravo).
Before Self Forcing (Causal only)
After Self Forcing
We apply Self Forcing with a long-context bidirectional teacher. Our generator uses a rolling KV cache with a sliding window of 6 latent frames, but the teacher operates over the full sequence. In the original Self Forcing formulation, the generator must unroll autoregressively during training, building up a computation graph that grows with sequence length. When combined with a long-context teacher, the memory cost of backpropagating through this graph becomes prohibitive.
To fix this, we introduce Checkpointed Self Forcing. Instead of backpropagating through the full autoregressive rollout, we split training into two phases. First, we run the autoregressive rollout forward with gradients disabled, caching the clean frame estimates and their corresponding noisy inputs at each step. Second, we recompute the generator's outputs in a single parallelized forward pass using a custom attention mask that reproduces the sliding-window causal dependencies from the rollout. This converts the sequential unrolling into one parallel operation, reducing memory in the backward pass. The resulting memory savings also make it feasible to backpropagate through the KV cache representations, which the original Self Forcing implementation blocks with a stop-gradient. We find that enabling these gradients further improves generation quality (see the paper for details). Figure 6 compares the peak memory usage of naive Self Forcing to our checkpointed variant.
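The custom attention mask can be sketched at the latent-frame level: frame i attends to frames in [max(0, i − w + 1), i], which reproduces a rolling KV cache of window w in a single parallel pass. Token-level details and the two-phase gradient logic are omitted; this sketch only constructs the mask.

```python
# Sliding-window causal mask over latent frames: replaying the
# autoregressive rollout's dependencies in one parallel forward pass.
import numpy as np

def sliding_window_causal_mask(n_frames, window=6):
    mask = np.zeros((n_frames, n_frames), dtype=bool)  # True = attend allowed
    for i in range(n_frames):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True  # attend to self and up to window-1 past frames
    return mask

m = sliding_window_causal_mask(10, window=6)
```

Window size 6 matches the rolling KV cache described above; in practice each latent frame spans many tokens, so the mask is expanded blockwise.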
We create the Solaris Eval dataset, which tests five multiplayer capabilities across seven held-out ground-truth episodes. Here we show the ground-truth episodes in Solaris Eval. The left and right panels show each player's first-person view, and the center panel shows a third-person view (square center-cropped).
Tests the model's ability to render visually consistent agent translation (WASD) and camera rotation (mouse) from both players' views simultaneously. One bot moves while the other observes; the VLM judges whether the moving player's position changed correctly in the observer's view.
Tests whether the model remembers the other player's position through observation. One agent turns away (losing sight of the other), pauses, then turns back. Because the turning agent was continuously observed by the stationary player, it should know where the other agent is — the VLM checks whether it sees the other player upon returning.
Tests whether co-visible regions are rendered consistently across both players. Two agents near each other simultaneously turn to look in the same random direction; the VLM checks whether both players see the same scene.
Tests whether the model can remember the environment and other agents across time. Both agents turn away from each other, pause, then return to their original orientations. The VLM checks whether both agents see each other again after turning back.
Tests the model's ability to reflect environmental changes caused by agents' actions. One bot constructs a pre-defined shape, either a square, horizontal strip, or vertical strip, while the other bot watches. After construction the builder moves next to the observer so the full structure is in view for both. The VLM evaluates whether the observer sees the completed structure.
Here we discuss results from our model Solaris. As can be seen in the teaser video section above, our model is capable of simulating complex aspects of Minecraft gameplay, including building, mining, fighting, and multiplayer viewpoint modeling. Examples of advanced model capabilities—including inventory tracking, simulating weather, placing torches, generating animations, and simulating PvP—are presented in the Solaris Model Capabilities section below. Sections Architecture Experiments and Self Forcing Ablations discuss and present results from our architecture comparison and self-forcing training variations respectively.
We compare our architecture implementation to the frame-concatenation method of Multiverse.
| Method | Movement VLM ↑ | Movement FID ↓ | Grounding VLM ↑ | Grounding FID ↓ | Memory VLM ↑ | Memory FID ↓ | Building VLM ↑ | Building FID ↓ | Consistency VLM ↑ | Consistency FID ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Frame concat | 77.08 | 68.87 | 53.13 | 66.57 | 37.50 | 74.44 | 0.00 | 103.24 | 53.11 | 129.38 |
| Solaris w/o pretrain | 69.27 | 42.53 | 29.17 | 49.85 | 18.75 | 67.80 | 0.00 | 86.60 | 49.48 | 121.39 |
| Solaris | 68.23 | 38.48 | 62.50 | 38.03 | 37.50 | 55.13 | 20.83 | 83.58 | 71.35 | 99.40 |
We ablate each component of our Self Forcing pipeline. We study the two main stages of CausVid, ODE regression initialization and few-step distillation, and find that straightforward causal finetuning is sufficient instead. Although the original Self Forcing paper assumes the generator is a few-step model at the start of training, we find that few-step ability can be learned simultaneously with stable autoregressive generation during Self Forcing. Next, we study backpropagating into the KV representations of the self-attention layers, which becomes feasible with the memory savings of our method. Allowing KV backpropagation achieves better visual quality (FID) than all other variants, as shown in Table 2. We do observe decreased action-following performance in some categories; however, our method remains competitive across all categories and excels in the challenging Building and Consistency VLM tasks.
| Init. | Pre-DMD | KV-BP | Movement VLM ↑ | Movement FID ↓ | Grounding VLM ↑ | Grounding FID ↓ | Memory VLM ↑ | Memory FID ↓ | Building VLM ↑ | Building FID ↓ | Consistency VLM ↑ | Consistency FID ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ODE Reg | × | ✓ | 23.44 | 65.28 | 3.13 | 56.59 | 0.00 | 99.47 | 3.13 | 95.69 | 48.96 | 142.33 |
| Causal FT | ✓ | ✓ | 21.35 | 49.07 | 3.13 | 40.38 | 0.00 | 55.67 | 8.33 | 90.51 | 55.21 | 160.08 |
| Causal FT | × | × | 78.65 | 60.29 | 72.92 | 55.23 | 48.96 | 63.80 | 15.63 | 87.43 | 70.83 | 105.08 |
| Causal FT | × | ✓ | 68.23 | 38.48 | 62.50 | 38.03 | 37.50 | 55.13 | 20.83 | 83.58 | 71.35 | 99.40 |
We showcase the learned capabilities of our Solaris model. We present generated videos demonstrating the model's ability to simulate complex game dynamics.
As one player mines or places blocks, the changes are reflected in both players' perspectives.
As one player mines or places blocks, the changes are reflected in both players' perspectives.
Rain starts spontaneously and simultaneously for both players.
Rain starts spontaneously and simultaneously (around 00:04s) for both players.
Both players see the same house as they turn together, which is initially only in Alpha's view.
Both players see a similar desert scene with a pond as they turn together. The pond is not initially in view.
Both players see a similar snowy hill with a tree on the right.
Both players see a similar grassy hill with flowers in the foreground.
Players take turns mining and placing torches.
Players take turns mining and placing torches.
Players build parts of a house.
Players build a tower of blocks.
Players engage in PvP combat.
Players engage in PvP combat.
Players move around and jump randomly. The movement is reflected in both players' perspectives.
Players move around and jump randomly. The movement is reflected in both players' perspectives.
Srivats Poddar completed this work while studying at NYU. Suppakit Waiwitlikhit and Timothy Meehan contributed to the project during their time as visiting students at NYU. We thank Nanye Ma for his help with our Jax codebase. We are grateful to Jihan Yang, Sihyun Yu, Shusheng Yang, and Yucen Lily Li for their advice on the draft, and to Charles Herrmann for helpful discussions. Egor Gikalo made very useful improvements to the SolarisEngine codebase. This work was primarily supported by the Google TPU Research Cloud (TRC) program and the Google Cloud Research Credits program (GCP19980904). Saining Xie acknowledges support from the MSIT IITP grant (RS-2024-00457882) and the NSF award IIS-2443404. Oscar Michel is supported by the NSF Graduate Research Fellowship Program.
@article{solaris2026,
title={Solaris: Building a Multiplayer Video World Model in Minecraft},
author={Georgy Savva and Oscar Michel and Daohan Lu and Suppakit Waiwitlikhit and Timothy Meehan and Dhairya Mishra and Srivats Poddar and Jack Lu and Saining Xie},
year={2026},
journal={arXiv preprint arXiv:2602.22208}
}