Text-to-video model

A text-to-video model is a machine learning model that takes a natural language description as input and produces a video relevant to the input text.^[1] Recent advancements in generating high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.^[2]