Text-to-video model
A text-to-video model is a machine learning model that takes a natural language description as input and generates a video matching that description.[1] Recent advances in generating high-quality, text-conditioned video have been driven largely by video diffusion models.[2]
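The input/output contract described above, a text prompt in and a sequence of frames out, can be illustrated with a toy sketch. It is not a real diffusion model: `embed_text`, `toy_denoiser`, and `generate_video` are hypothetical names invented for this illustration, the "text encoder" is just a hash of the prompt, and the "denoiser" is a fixed formula standing in for a trained neural network. The only ideas it borrows from diffusion sampling are starting from Gaussian noise and refining it over several steps under text conditioning.

```python
import hashlib
import random


def embed_text(prompt, dim=8):
    """Hypothetical stand-in for a text encoder: hash the prompt into `dim` floats in [0, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def toy_denoiser(embedding, frame_index, num_frames):
    """Hypothetical stand-in for a trained network: returns the 'clean' pixel
    value this toy associates with the prompt at a given frame."""
    base = sum(embedding) / len(embedding)
    # Brightness drifts over time so that frames differ from one another.
    return base * (frame_index + 1) / num_frames


def generate_video(prompt, num_frames=4, height=2, width=2, steps=10, seed=0):
    """Return a video as nested lists indexed [frame][row][column]."""
    rng = random.Random(seed)
    embedding = embed_text(prompt)
    # Start from pure Gaussian noise, as diffusion samplers do.
    video = [[[rng.gauss(0.0, 1.0) for _ in range(width)]
              for _ in range(height)] for _ in range(num_frames)]
    for step in range(steps):
        remaining = steps - step
        for t in range(num_frames):
            target = toy_denoiser(embedding, t, num_frames)
            for row in video[t]:
                for x in range(width):
                    # Move a fraction of the way toward the predicted clean value.
                    row[x] += (target - row[x]) / remaining
    return video
```

Calling `generate_video("a cat surfing", num_frames=3)` returns three 2×2 grayscale frames. The same prompt and seed always give the same result, while a different prompt generally changes the pixel values because the conditioning embedding changes. A real system would replace the two stand-in functions with a learned text encoder and a learned video denoising network.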