Video world models, which predict future frames conditioned on actions, hold immense promise for artificial intelligence, enabling agents to plan and reason in dynamic environments. Recent progress, especially with video diffusion models, has shown impressive ability to generate realistic future sequences. However, a critical bottleneck remains: maintaining long-term memory. Current models struggle to remember events and states from the distant past because of the high computational cost of extending traditional attention layers over long sequences. This limits their ability to perform complex tasks that require a sustained understanding of a scene.
A new paper by researchers at Stanford University, Princeton University, and Adobe Research, “Long-Context State-Space Video World Models”, proposes an innovative solution to this challenge. They introduce a novel architecture that leverages state-space models (SSMs) to extend temporal memory without sacrificing computational efficiency.
The core problem lies in how the computational complexity of the attention mechanism scales with sequence length. As the video context grows, the resources required by attention layers explode, making long-term memory impractical for real-world applications. In effect, after a certain number of frames the model “forgets”, hurting its performance on tasks that demand long-range consistency or reasoning over extended periods.
The authors’ key insight is to harness the inherent strength of state-space models (SSMs) for causal sequence modeling. Unlike previous efforts that retrofit SSMs for non-causal vision tasks, this work fully exploits their advantages in processing sequences.
The proposed Long-Context State-Space Video World Model (LSSVWM) includes several important design choices:
- Block-wise SSM scanning scheme: This is central to the design. Instead of processing the entire video sequence with a single SSM scan, they employ a block-wise scheme, trading some spatial consistency (within a block) for a significantly extended temporal memory. By breaking the long sequence into manageable blocks, the model maintains a compressed state that carries information across blocks, effectively extending its memory horizon (see the sketch after this list).
- Dense local attention: To compensate for the potential loss of spatial coherence introduced by block-wise SSM scanning, the model incorporates dense local attention. This ensures that frames within and across neighboring blocks maintain strong relationships, preserving the fine-grained detail and consistency required for realistic video generation. This dual approach of global (SSM) and local (attention) processing lets the model achieve both long-term memory and local fidelity.
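To make the block-wise idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors’ code): a long token sequence is split into fixed-size blocks, each block is scanned by a toy SSM that starts from the compressed state left by the previous block, and dense bidirectional attention is applied locally within each block. All names (`ssm_scan`, `local_attention`, `block_len`) and the simplified diagonal recurrence are assumptions for illustration.

```python
import torch

def ssm_scan(x, state, A, B, C):
    """Toy diagonal linear SSM recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    A stand-in for a real SSM layer such as Mamba. x: (T, d), state: (d,)."""
    ys = []
    for t in range(x.shape[0]):
        state = A * state + B * x[t]
        ys.append(C * state)
    return torch.stack(ys), state

def local_attention(x):
    """Dense bidirectional attention restricted to one block (toy single-head version)."""
    q = k = v = x  # in a real model these would be learned projections
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def blockwise_scan(tokens, block_len, A, B, C):
    """Split a long sequence into blocks; the SSM state is carried across blocks,
    while dense attention stays local to each block."""
    d = tokens.shape[-1]
    state = torch.zeros(d)                             # compressed memory carried between blocks
    outputs = []
    for start in range(0, tokens.shape[0], block_len):
        block = tokens[start:start + block_len]
        y, state = ssm_scan(block, state, A, B, C)     # long-range memory via the SSM state
        outputs.append(y + local_attention(block))     # local fidelity via dense attention
    return torch.cat(outputs)

# Usage: 64 frames' worth of tokens, scanned in blocks of 8.
d = 16
tokens = torch.randn(64, d)
A, B, C = torch.full((d,), 0.9), torch.ones(d), torch.ones(d)
print(blockwise_scan(tokens, block_len=8, A=A, B=B, C=C).shape)  # torch.Size([64, 16])
```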
The paper also introduces two key training strategies to improve long-context performance:
- Diffusion forcing: This technique trains the model to generate frames conditioned on a prefix of the input, effectively forcing it to learn to maintain consistency over long horizons. By sometimes sampling no prefix and keeping all tokens noised, the training reduces to diffusion forcing, which the paper frames as a special case of long-context training where the prefix length is zero. This pushes the model to generate coherent sequences from minimal initial context.
- Frame local attention: To speed up training and sampling, the authors implement a “frame local attention” mechanism. It uses FlexAttention to achieve significant speedups compared to a fully causal mask. By grouping frames into chunks (for example, a chunk size of 5 with a frame window of 10), frames within a chunk remain bidirectional while also attending to frames in the previous chunk. This yields an effective receptive field while keeping the computational load manageable (a sketch of both strategies follows below).
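A minimal sketch of the prefix-conditioned noising used in diffusion forcing, under our own simplifying assumptions (linear interpolation noising, hypothetical function name, frames of shape `(T, C, H, W)`): a random prefix of frames is kept clean as conditioning while the rest are noised, and a prefix length of zero recovers the case where every token is noised.

```python
import torch

def diffusion_forcing_batch(frames, max_prefix):
    """Toy construction of one training example: keep a random prefix of frames clean
    as conditioning and noise the remaining frames. prefix == 0 means all tokens are noised."""
    T = frames.shape[0]
    prefix = torch.randint(0, max_prefix + 1, (1,)).item()   # sometimes zero
    t = torch.rand(T)                                         # per-frame noise levels
    t[:prefix] = 0.0                                          # prefix stays clean (pure conditioning)
    noise = torch.randn_like(frames)
    noisy = (1 - t).view(-1, 1, 1, 1) * frames + t.view(-1, 1, 1, 1) * noise
    return noisy, t, noise                                    # loss would be taken on noised frames
```

For frame local attention, the paper uses FlexAttention; the sketch below instead builds an equivalent boolean mask with plain PyTorch, and the exact masking rule is one plausible reading of the description (chunk size 5, window 10), not the authors’ implementation. A query token may attend to every token in its own chunk (bidirectional) and to tokens of earlier frames within the window, which covers the previous chunk.

```python
import torch

def frame_local_mask(num_frames, tokens_per_frame, chunk_size=5, window=10):
    """Boolean (T, T) mask over frame tokens: True = attention allowed."""
    idx = torch.arange(num_frames * tokens_per_frame)
    frame = idx // tokens_per_frame              # frame index of each token
    chunk = frame // chunk_size                  # chunk index of each token
    q_frame, k_frame = frame[:, None], frame[None, :]
    same_chunk = chunk[:, None] == chunk[None, :]                       # bidirectional within a chunk
    past_window = (k_frame < q_frame) & (q_frame - k_frame < window)    # earlier frames in the window
    return same_chunk | past_window

mask = frame_local_mask(num_frames=20, tokens_per_frame=4)
print(mask.shape)  # torch.Size([80, 80])
```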

The researchers evaluated LSSVWM on challenging datasets, including Memory Maze and Minecraft, which are specifically designed to test long-term memory capabilities through spatial retrieval and reasoning tasks.
Experiments show that their approach substantially outperforms baselines in preserving long-range memory. Qualitative results, shown in the supplementary figures (e.g., S1, S2, S3), indicate that LSSVWM generates more coherent and accurate sequences over extended horizons than causal-attention baselines, or even Mamba2 without frame local attention. For example, on reasoning tasks in the maze dataset, their model maintains better consistency and accuracy over long horizons. Similarly, for retrieval tasks, LSSVWM shows a better ability to recall and use information from frames far in the past. Importantly, these improvements are achieved while maintaining practical inference speeds, making the model suitable for interactive applications.

The paper “Long-Context State-Space Video World Models” is available on arXiv.