Streaming Video Generation · USTC · FrameX.AI

Stream-T1: Test-Time Scaling for Streaming Video Generation

Yijing Tu1 Shaojin Wu3 Mengqi Huang1 Wenchuan Wang1 Yuxin Wang2 Chunxiao Liu3 Zhendong Mao1
1 University of Science and Technology of China 2 FrameX.AI 3 Independent Researcher
Corresponding author Project lead


Abstract

Combined with candidate selection, Stream-T1 actively optimizes the generation trajectory by dynamically refining both the latent noise and the context memory.

While Test-Time Scaling (TTS) offers a promising direction for enhancing video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate-exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited to TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a comprehensive TTS framework tailored to streaming video generation. Evaluated on comprehensive 5s and 30s video benchmarks, Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality.

01

Stream‑Scaled Noise Propagation

actively refines the initial latent noise of the chunk being generated using historically proven, high-quality noise from previous chunks, effectively establishing temporal dependencies and using the historical Gaussian prior to guide the current generation;
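As a rough illustration of the idea, the sketch below blends the highest-reward noise of the previous chunk with fresh Gaussian noise. The function name, the `blend` weight, and the square-root mixing rule are illustrative assumptions, not the paper's exact formulation; the square-root weights simply keep the blended latent unit-variance, so it remains a valid Gaussian prior for the denoiser.

```python
import numpy as np

def propagate_noise(prev_noise, blend=0.3, rng=None):
    """Blend the best previous chunk's initial noise with fresh Gaussian
    noise so the new chunk inherits a proven latent prior.

    `blend` (hypothetical) controls how much historical noise carries over;
    sqrt weights keep the mixture unit-variance when both inputs are N(0, 1).
    """
    rng = rng or np.random.default_rng(0)
    fresh = rng.standard_normal(prev_noise.shape)
    return np.sqrt(blend) * prev_noise + np.sqrt(1.0 - blend) * fresh
```

With `blend=0` this degenerates to independent per-chunk sampling; larger values tie the current chunk's starting point more tightly to the historically selected trajectory.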

02

Stream‑Scaled Reward Pruning

comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations;
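A minimal sketch of such pruning, assuming hypothetical `short_reward` and `long_reward` scorers and a simple weighted sum; the actual reward models, window size, and weighting used by Stream-T1 are not specified here.

```python
def prune_candidates(candidates, short_reward, long_reward,
                     history, window=3, alpha=0.5, keep=2):
    """Score each candidate chunk by mixing an immediate (short-term)
    reward with a sliding-window average against recent chunks, then
    keep the top-`keep` candidates.

    `short_reward(cand)` scores local spatial quality; `long_reward(prev,
    cand)` scores temporal coherence with an earlier chunk. Both are
    placeholder callables, as are `window`, `alpha`, and `keep`.
    """
    recent = history[-window:]
    scored = []
    for cand in candidates:
        s = short_reward(cand)
        l = sum(long_reward(c, cand) for c in recent) / max(len(recent), 1)
        scored.append((alpha * s + (1 - alpha) * l, cand))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:keep]]
```

The sliding window is what supplies the global temporal signal: a candidate that looks good in isolation but drifts from recent chunks is penalized by the long-term term.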

03

Stream‑Scaled Memory Sinking

dynamically routes the context evicted from KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream.
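One way to picture this routing, with the KV-cache reduced to a list of `(entry, reward)` pairs: the two pathways here (promote to a persistent sink slot vs. drop) and the `n_sink` budget are illustrative assumptions standing in for real attention key/value tensors and the paper's actual update rules.

```python
def sink_memory(kv_cache, capacity, n_sink=2):
    """Route entries evicted from an over-capacity KV-cache by reward:
    the highest-reward evicted entries are retained as persistent sink
    slots that keep anchoring later chunks; the rest are dropped.

    `kv_cache` is a list of (entry, reward) pairs, oldest first -- a toy
    stand-in for cached attention states tagged with reward feedback.
    """
    if len(kv_cache) <= capacity:
        return kv_cache
    n_evict = len(kv_cache) - capacity
    evicted, kept = kv_cache[:n_evict], kv_cache[n_evict:]
    # pathway 1: promote the best evicted entries to sink slots;
    # pathway 2: everything else evicted is discarded.
    sinks = sorted(evicted, key=lambda e: e[1], reverse=True)[:n_sink]
    return sinks + kept
```

The design intuition is that naive FIFO eviction forgets exactly the high-quality context that later chunks should stay anchored to; reward-guided sinking keeps it resident.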

Method

Three sequential stages

For each chunk, Stream-T1 performs optimization in the following order: pre-generation, post-generation, and post-pruning.
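The three stages above can be sketched as a single toy loop. Every name here (`denoise`, `reward`, the float-valued "chunks", `blend`) is a hypothetical stand-in for the real streaming generator, reward ensemble, and latent tensors; only the stage ordering follows the text.

```python
import numpy as np

def generate_stream(n_chunks, denoise, reward, n_candidates=4,
                    blend=0.3, seed=0):
    """Toy per-chunk loop in the stated order: pre-generation noise
    refinement, post-generation reward pruning, post-pruning memory update."""
    rng = np.random.default_rng(seed)
    best_noise, memory, video = None, [], []
    for _ in range(n_chunks):
        # 1) pre-generation: propagate the winning noise of the last chunk
        candidates = []
        for _ in range(n_candidates):
            noise = rng.standard_normal(4)
            if best_noise is not None:
                noise = (np.sqrt(blend) * best_noise
                         + np.sqrt(1.0 - blend) * noise)
            candidates.append((noise, denoise(noise, memory)))
        # 2) post-generation: prune down to the highest-reward candidate
        best_noise, chunk = max(candidates,
                                key=lambda c: reward(c[1], memory))
        # 3) post-pruning: sink the surviving chunk into context memory
        memory.append(chunk)
        video.append(chunk)
    return video
```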

Stream-T1 method overview
Figure 1. Overview of the Stream-T1 chunk-level scaling pipeline.

Video Generation

High coherence and high fidelity

Each row plays clips that were independently rolled out at a different duration: 5 seconds or 30 seconds.

5 seconds
30 seconds

Method Comparison

LongLive vs. Stream-T1

Same prompt, same seed. Stream-T1 significantly elevates the temporal consistency and frame-level visual fidelity of the generated videos.

LongLive Stream-T1 (ours)
Baseline
Ours

Full search visualization

How to prune

Stream-T1 search overview
Figure 2. The complete search path for the case in Figure 1.

Quantitative Results

5s and 30s evaluation

To comprehensively evaluate the effectiveness of Stream-T1, we compare it against three representative open-source models on both 5s and 30s video generation. Overall, extensive experiments demonstrate that Stream-T1 significantly improves the temporal consistency, motion coherence, and frame-level visual fidelity of the generated videos.

VBench and VideoAlign — 5-second videos, 832×480

Stream-T1 records the best results on six metrics (Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, MQ, and TA) and ranks second on Imaging Quality and VQ.

Columns 1–5 are VBench↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| CausVid | 96.33 | 95.56 | 98.66 | 69.69 | 62.90 | 0.433 | 0.550 | 1.02 |
| Self-Forcing | 95.26 | 95.67 | 98.67 | 71.61 | 63.97 | 0.099 | 0.088 | 1.193 |
| LongLive | 97.00 | 96.78 | 99.12 | 71.28 | 65.28 | 0.285 | 0.350 | 1.193 |
| Stream-T1 (on LongLive) | 97.25 | 97.05 | 99.15 | 71.42 | 65.98 | 0.426 | 0.629 | 1.305 |
| Relative gain | Δ0.26% | Δ0.28% | Δ0.03% | Δ0.20% | Δ1.07% | Δ49.47% | Δ79.71% | Δ9.39% |
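The Δ row reports gains of Stream-T1 relative to its LongLive base. A quick check of the Subject Consistency and MQ columns; the helper below is plain arithmetic, not part of the method.

```python
def relative_gain(ours, base):
    """Percent improvement of `ours` over `base` (absolute value of the
    base in the denominator, so near-zero or negative baselines work)."""
    return (ours - base) / abs(base) * 100.0

print(round(relative_gain(97.25, 97.00), 2))  # Subject Consistency -> 0.26
print(round(relative_gain(0.629, 0.350), 2))  # MQ -> 79.71
```

The same rule explains the outsized Δ11400% for MQ in the 30-second table below, where the LongLive baseline is nearly zero (−0.002).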

VBench and VideoAlign — 30-second videos, 832×480

Stream-T1 outperforms state-of-the-art baselines on almost all metrics, achieving the highest scores in Subject Consistency, Background Consistency, Motion Smoothness, Imaging Quality, and Aesthetic Quality. This superiority is corroborated by the human-aligned VideoAlign metrics, where our model obtains the best VQ and TA and ranks second in MQ.

Columns 1–5 are VBench Long↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| CausVid | 97.91 | 96.74 | 98.15 | 66.32 | 59.71 | -0.144 | 0.328 | 0.501 |
| Self-Forcing | 97.18 | 96.37 | 98.35 | 68.35 | 59.19 | -0.461 | -0.216 | 0.656 |
| LongLive | 97.90 | 96.82 | 98.78 | 68.99 | 61.56 | -0.169 | -0.002 | 1.073 |
| Stream-T1 (on LongLive) | 98.43 | 97.18 | 99.03 | 69.10 | 62.11 | -0.073 | 0.226 | 1.170 |
| Relative gain | Δ0.54% | Δ0.37% | Δ0.25% | Δ0.16% | Δ0.89% | Δ56.8% | Δ11400% | Δ9% |
Per-metric quality versus video length
Figure 3. Qualitative comparisons of Stream-T1 with CausVid, Self-Forcing, and LongLive. Stream-T1 significantly improves the temporal consistency and frame-level visual fidelity of the generated videos.

Ablation

Each component contributes

To validate the individual contributions of the proposed components, we conduct ablation studies on 30s video generation. Both quantitative metrics and qualitative visual comparisons consistently demonstrate the necessity and efficacy of each module.

Columns 1–5 are VBench Long↑ metrics; columns 6–8 are VideoAlign↑ metrics.

| Method | Subject Consistency | Background Consistency | Motion Smoothness | Imaging Quality | Aesthetic Quality | VQ | MQ | TA |
|---|---|---|---|---|---|---|---|---|
| w/o Stream‑Scaled Memory Sinking | 98.30 | 97.04 | 98.92 | 69.51 | 61.90 | -0.083 | 0.188 | 1.146 |
| w/o Stream‑Scaled Noise Propagation | 98.35 | 97.14 | 98.98 | 69.07 | 61.99 | -0.094 | 0.176 | 1.164 |
| w/o Stream‑Scaled Reward Pruning | 98.04 | 96.88 | 98.87 | 69.17 | 61.22 | -0.173 | 0.014 | 1.035 |
| Ours | 98.43 | 97.18 | 99.03 | 69.10 | 62.11 | -0.073 | 0.226 | 1.170 |
Per-metric quality versus video length
Figure 4. Qualitative ablation studies on each component. Omitting Stream-Scaled Memory Sinking degrades background stability. Removing Stream-Scaled Noise Propagation introduces local structural artifacts (e.g., on the subject's tail). Eliminating Stream-Scaled Reward Pruning leads to distinct semantic misalignment and deteriorated aesthetic quality.

BibTeX

Cite Stream-T1


@misc{tu2026streamt1testtimescalingstreaming,
      title={Stream-T1: Test-Time Scaling for Streaming Video Generation}, 
      author={Yijing Tu and Shaojin Wu and Mengqi Huang and Wenchuan Wang and Yuxin Wang and Chunxiao Liu and Zhendong Mao},
      year={2026},
      eprint={2605.04461},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.04461}, 
}