What is Text-to-video?

Question

Accepted Answer

Text-to-video is the generative AI operation that takes a text prompt and produces a short video clip, typically 5–10 seconds long at 720p–1080p, with no input image required. It is the video analogue of text-to-image and contrasts with image-to-video, which conditions generation on a starting frame. The 2024–2025 generation of text-to-video models (OpenAI Sora 2, Google Veo 3 and Veo 3.5, Runway Gen-3/Gen-4, Luma Dream Machine, Kling 2, Pika 2, Hailuo, Vidu, Hunyuan, Wan) produces clips that are good enough for casual social content but still struggles with fine-grained product fidelity, brand-specific identity preservation, and multi-shot consistency. For product marketing, image-to-video from a reference product photo is generally more reliable than pure text-to-video. Pure text-to-video works best for ambient B-roll, scene cutaways, and secondary creative content where exact product fidelity is not required.
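
The rule of thumb in this answer can be written down as a small routing helper. This is only an illustrative sketch: `ClipRequest`, its fields, and `choose_workflow` are hypothetical names made up for this example and do not correspond to any vendor's API. It assumes the only inputs that matter are whether exact product fidelity is needed and whether a reference product photo is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipRequest:
    """Hypothetical request shape for illustration; not a real vendor API."""
    prompt: str
    # Path or URL of a reference product photo, if one exists (assumed field).
    reference_image: Optional[str] = None
    # True when the clip must reproduce the product accurately (assumed field).
    needs_exact_product_fidelity: bool = False


def choose_workflow(request: ClipRequest) -> str:
    """Pick a workflow name using the rule of thumb from the answer above."""
    # Image-to-video conditions on a starting frame, so it is only an option
    # when a reference image is actually available.
    if request.needs_exact_product_fidelity and request.reference_image:
        return "image-to-video"
    # B-roll, cutaways, and other clips without strict fidelity needs
    # (or requests with no image to condition on) fall back to text-to-video.
    return "text-to-video"


if __name__ == "__main__":
    product_ad = ClipRequest(
        prompt="Slow orbit around the sneaker on a white pedestal",
        reference_image="sneaker.jpg",
        needs_exact_product_fidelity=True,
    )
    b_roll = ClipRequest(prompt="Morning fog rolling over a pine forest")
    print(choose_workflow(product_ad))
    print(choose_workflow(b_roll))
```

The helper deliberately simplifies: a request that needs exact fidelity but has no reference photo still falls through to text-to-video, which in practice is a signal to obtain a product photo first.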
