Streaming Video Generation · USTC · FrameX.AI

Stream-T1: Test-Time Scaling for Streaming Video Generation

Yijing Tu1 Shaojin Wu3 Mengqi Huang1 Wenchuan Wang1 Yuxin Wang2 Chunxiao Liu3 Zhendong Mao1
1 University of Science and Technology of China 2 FrameX.AI 3 Independent Researcher
Corresponding author Project lead


Abstract

Combined with candidate selection, Stream-T1 actively optimizes the generation trajectory by dynamically refining both the latent noise and the context memory.

While Test-Time Scaling (TTS) offers a promising direction for enhancing video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate-exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited to TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a comprehensive TTS framework tailored to streaming video generation. Evaluated on comprehensive 5s and 30s video benchmarks, Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality.

01

Stream‑Scaled Noise Propagation

actively refines the initial latent noise of the chunk being generated using historically proven, high-quality noise from previous chunks, effectively establishing temporal dependencies and using the historical Gaussian prior to guide the current generation;
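As a rough illustration of the idea, the sketch below blends the highest-reward noise of the previous chunk with fresh Gaussian noise. The function name, the `blend` weight, and the square-root mixing rule are illustrative assumptions, not the paper's exact formulation; the square-root weights simply keep the blended latent unit-variance, so it remains a valid Gaussian prior for the denoiser.

```python
import numpy as np

def propagate_noise(prev_noise, blend=0.3, rng=None):
    """Blend the best previous chunk's initial noise with fresh Gaussian
    noise so the new chunk inherits a proven latent prior.

    `blend` (hypothetical) controls how much historical noise carries over;
    sqrt weights keep the mixture unit-variance when both inputs are N(0, 1).
    """
    rng = rng or np.random.default_rng(0)
    fresh = rng.standard_normal(prev_noise.shape)
    return np.sqrt(blend) * prev_noise + np.sqrt(1.0 - blend) * fresh
```

With `blend=0` this degenerates to independent per-chunk sampling; larger values tie the current chunk's starting point more tightly to the historically selected trajectory.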

02

Stream‑Scaled Reward Pruning

comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations;
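A minimal sketch of such pruning, assuming hypothetical `short_reward` and `long_reward` scorers and a simple weighted sum; the actual reward models, window size, and weighting used by Stream-T1 are not specified here.

```python
def prune_candidates(candidates, short_reward, long_reward,
                     history, window=3, alpha=0.5, keep=2):
    """Score each candidate chunk by mixing an immediate (short-term)
    reward with a sliding-window average against recent chunks, then
    keep the top-`keep` candidates.

    `short_reward(cand)` scores local spatial quality; `long_reward(prev,
    cand)` scores temporal coherence with an earlier chunk. Both are
    placeholder callables, as are `window`, `alpha`, and `keep`.
    """
    recent = history[-window:]
    scored = []
    for cand in candidates:
        s = short_reward(cand)
        l = sum(long_reward(c, cand) for c in recent) / max(len(recent), 1)
        scored.append((alpha * s + (1 - alpha) * l, cand))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:keep]]
```

The sliding window is what supplies the global temporal signal: a candidate that looks good in isolation but drifts from recent chunks is penalized by the long-term term.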

03

Stream‑Scaled Memory Sinking

dynamically routes the context evicted from KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream.
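One way to picture this routing, with the KV-cache reduced to a list of `(entry, reward)` pairs: the two pathways here (promote to a persistent sink slot vs. drop) and the `n_sink` budget are illustrative assumptions standing in for real attention key/value tensors and the paper's actual update rules.

```python
def sink_memory(kv_cache, capacity, n_sink=2):
    """Route entries evicted from an over-capacity KV-cache by reward:
    the highest-reward evicted entries are retained as persistent sink
    slots that keep anchoring later chunks; the rest are dropped.

    `kv_cache` is a list of (entry, reward) pairs, oldest first -- a toy
    stand-in for cached attention states tagged with reward feedback.
    """
    if len(kv_cache) <= capacity:
        return kv_cache
    n_evict = len(kv_cache) - capacity
    evicted, kept = kv_cache[:n_evict], kv_cache[n_evict:]
    # pathway 1: promote the best evicted entries to sink slots;
    # pathway 2: everything else evicted is discarded.
    sinks = sorted(evicted, key=lambda e: e[1], reverse=True)[:n_sink]
    return sinks + kept
```

The design intuition is that naive FIFO eviction forgets exactly the high-quality context that later chunks should stay anchored to; reward-guided sinking keeps it resident.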

Method

Three sequential stages

For each chunk, Stream-T1 performs optimization in the following order: pre-generation, post-generation, and post-pruning.
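The three stages above can be sketched as a single toy loop. Every name here (`denoise`, `reward`, the float-valued "chunks", `blend`) is a hypothetical stand-in for the real streaming generator, reward ensemble, and latent tensors; only the stage ordering follows the text.

```python
import numpy as np

def generate_stream(n_chunks, denoise, reward, n_candidates=4,
                    blend=0.3, seed=0):
    """Toy per-chunk loop in the stated order: pre-generation noise
    refinement, post-generation reward pruning, post-pruning memory update."""
    rng = np.random.default_rng(seed)
    best_noise, memory, video = None, [], []
    for _ in range(n_chunks):
        # 1) pre-generation: propagate the winning noise of the last chunk
        candidates = []
        for _ in range(n_candidates):
            noise = rng.standard_normal(4)
            if best_noise is not None:
                noise = (np.sqrt(blend) * best_noise
                         + np.sqrt(1.0 - blend) * noise)
            candidates.append((noise, denoise(noise, memory)))
        # 2) post-generation: prune down to the highest-reward candidate
        best_noise, chunk = max(candidates,
                                key=lambda c: reward(c[1], memory))
        # 3) post-pruning: sink the surviving chunk into context memory
        memory.append(chunk)
        video.append(chunk)
    return video
```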

Stream-T1 method overview
Figure 1. Overview of the Stream-T1 chunk-level scaling pipeline.

Video Generation

High coherence and high fidelity

Each row plays clips that were independently rolled out at a different duration: 5 seconds or 30 seconds.

5 seconds
30 seconds

Method Comparison

LongLive vs. Stream-T1

Same prompt, same seed. Stream-T1 significantly elevates the temporal consistency and frame-level visual fidelity of the generated videos.

LongLive Stream-T1 (ours)
Baseline
Ours

Full search visualization

How to prune

Stream-T1 search overview
Figure 2. The complete search path for the case in Figure 1.

Quantitative Results

5s and 30s evaluation

To comprehensively evaluate the effectiveness of Stream-T1, we compare it against three representative open-source models on both 5s and 30s video generation. Overall, extensive experiments demonstrate that Stream-T1 significantly improves the temporal consistency, motion coherence, and frame-level visual fidelity of the generated videos.

VBench and VideoAlign — 5-second videos, 832×480

Stream-T1 records the best results on six metrics (Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, MQ, and TA) and ranks second on Imaging Quality and VQ.

Columns 1–5 are VBench↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| CausVid | 96.33 | 95.56 | 98.66 | 69.69 | 62.90 | 0.433 | 0.550 | 1.02 |
| Self-Forcing | 95.26 | 95.67 | 98.67 | 71.61 | 63.97 | 0.099 | 0.088 | 1.193 |
| LongLive | 97.00 | 96.78 | 99.12 | 71.28 | 65.28 | 0.285 | 0.350 | 1.193 |
| Stream-T1 (on LongLive) | 97.25 | 97.05 | 99.15 | 71.42 | 65.98 | 0.426 | 0.629 | 1.305 |
| Relative gain | Δ0.26% | Δ0.28% | Δ0.03% | Δ0.20% | Δ1.07% | Δ49.47% | Δ79.71% | Δ9.39% |
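The Δ row reports gains of Stream-T1 relative to its LongLive base. A quick check of the Subject Consistency and MQ columns; the helper below is plain arithmetic, not part of the method.

```python
def relative_gain(ours, base):
    """Percent improvement of `ours` over `base` (absolute value of the
    base in the denominator, so near-zero or negative baselines work)."""
    return (ours - base) / abs(base) * 100.0

print(round(relative_gain(97.25, 97.00), 2))  # Subject Consistency -> 0.26
print(round(relative_gain(0.629, 0.350), 2))  # MQ -> 79.71
```

The same rule explains the outsized Δ11400% for MQ in the 30-second table below, where the LongLive baseline is nearly zero (−0.002).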

VBench and VideoAlign — 30-second videos, 832×480

Stream-T1 outperforms state-of-the-art baselines on almost all metrics, achieving the highest scores in Subject Consistency, Background Consistency, Motion Smoothness, Imaging Quality, and Aesthetic Quality. This superiority is corroborated by the human-aligned VideoAlign metrics, where our model obtains the best VQ and TA and ranks second in MQ.

Columns 1–5 are VBench Long↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| CausVid | 97.91 | 96.74 | 98.15 | 66.32 | 59.71 | -0.144 | 0.328 | 0.501 |
| Self-Forcing | 97.18 | 96.37 | 98.35 | 68.35 | 59.19 | -0.461 | -0.216 | 0.656 |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| Stream-T1 (on LongLive) | 98.43 | 97.18 | 99.03 | 69.10 | 62.11 | -0.073 | 0.226 | 1.170 |
| Relative gain | Δ0.54% | Δ0.37% | Δ0.25% | Δ0.16% | Δ0.89% | Δ56.8% | Δ11400% | Δ9% |
Per-metric quality versus video length
Figure 3. Qualitative comparisons of Stream-T1 with CausVid, Self-Forcing, and LongLive. Stream-T1 significantly improves the temporal consistency and frame-level visual fidelity of the generated videos.

Ablation

Each component contributes

To validate the individual contributions of the proposed components, we conduct ablation studies on 30s video generation. Both quantitative metrics and qualitative visual comparisons consistently demonstrate the necessity and efficacy of each module.

Columns 1–5 are VBench Long↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| w/o Stream‑Scaled Memory Sinking | 98.30 | 97.04 | 98.92 | 69.51 | 61.90 | -0.083 | 0.188 | 1.146 |
| w/o Stream‑Scaled Noise Propagation | 98.35 | 97.14 | 98.98 | 69.07 | 61.99 | -0.094 | 0.176 | 1.164 |
| w/o Stream‑Scaled Reward Pruning | 98.04 | 96.88 | 98.87 | 69.17 | 61.22 | -0.173 | 0.014 | 1.035 |
| Ours | 98.43 | 97.18 | 99.03 | 69.10 | 62.11 | -0.073 | 0.226 | 1.170 |
Per-metric quality versus video length
Figure 4. Qualitative ablation studies on each component. Omitting Stream-Scaled Memory Sinking degrades background stability. Removing Stream-Scaled Noise Propagation introduces local structural artifacts (e.g., on the subject's tail). Eliminating Stream-Scaled Reward Pruning leads to distinct semantic misalignment and deteriorated aesthetic quality.

BibTeX

Cite Stream-T1


@misc{tu2026streamt1testtimescalingstreaming,
      title={Stream-T1: Test-Time Scaling for Streaming Video Generation}, 
      author={Yijing Tu and Shaojin Wu and Mengqi Huang and Wenchuan Wang and Yuxin Wang and Chunxiao Liu and Zhendong Mao},
      year={2026},
      eprint={2605.04461},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.04461}, 
}