
Stable Diffusion

Stable Diffusion is a deep learning text-to-image model, released in 2022, based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

Original author(s): Runway, CompVis, and Stability AI

Developer(s): Stability AI

Initial release: August 22, 2022

Stable release: SDXL 1.0 (model)[1] / July 26, 2023

License: Creative ML OpenRAIL-M

It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt.[3] Its development involved researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway, with a computational donation from Stability AI and training data from non-profit organizations.[4][5][6][7]


Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been released publicly,[8] and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney which were accessible only via cloud services.[9][10]

Development[edit]

Stable Diffusion originated from a project called Latent Diffusion,[11] developed in Germany by researchers at Ludwig Maximilian University of Munich and Heidelberg University. Four of the original five authors (Robin Rombach, Andreas Blattmann, Patrick Esser and Dominik Lorenz) later joined Stability AI and released subsequent versions of Stable Diffusion.[12]


The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich.[10] Development was led by Patrick Esser of Runway and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion.[7] Stability AI also credited EleutherAI and LAION (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project.[7]

An "embedding" can be trained from a collection of user-provided images, and allows the model to generate visually similar images whenever the name of the embedding is used within a generation prompt. Embeddings are based on the "textual inversion" concept developed by researchers from Tel Aviv University in 2022 with support from Nvidia, where vector representations for specific tokens used by the model's text encoder are linked to new pseudo-words. Embeddings can be used to reduce biases within the original model, or mimic visual styles.[45]

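The training setup described above can be sketched in miniature. In this illustrative numpy sketch (not the real Stable Diffusion training loop), the frozen model is stood in for by a fixed linear map, and only the new pseudo-token's embedding vector is optimized by gradient descent; all names and shapes are invented for the example.

```python
import numpy as np

# Minimal sketch of the textual-inversion idea: all model weights stay
# frozen, and only the embedding vector of one new pseudo-token is
# optimized so the frozen model's output matches a target derived from
# the user's images. The "encoder" and "target" here are toy stand-ins.
rng = np.random.default_rng(0)

dim = 8
frozen_encoder = rng.normal(size=(dim, dim))   # stands in for the frozen model
target = rng.normal(size=dim)                  # stands in for the signal from user images

embedding = np.zeros(dim)                      # the ONLY trainable parameter
lr = 0.01
for _ in range(2000):
    out = frozen_encoder @ embedding
    grad = frozen_encoder.T @ (out - target)   # gradient of 0.5 * ||out - target||^2
    embedding -= lr * grad

final_loss = 0.5 * np.sum((frozen_encoder @ embedding - target) ** 2)
```

After training, the learned `embedding` is what gets saved and shared; at generation time it is substituted wherever the pseudo-token's name appears in the prompt.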

A "hypernetwork" is a small pretrained neural network that is applied at various points within a larger neural network; the term here refers to the technique created by NovelAI developer Kurumuz in 2021, originally intended for text-generation transformer models. Hypernetworks steer results towards a particular direction, allowing Stable Diffusion-based models to imitate the art style of specific artists, even if the artist is not recognised by the original model; they process the image by finding key areas of importance such as hair and eyes, and then patch these areas in secondary latent space.[46]
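The structural idea (a small trainable network grafted onto an intermediate point of a larger frozen network) can be sketched as follows. All layers, shapes, and the bottleneck design here are invented for illustration; in Stable Diffusion, hypernetworks typically wrap the cross-attention projections.

```python
import numpy as np

# Illustrative sketch of the hypernetwork idea: a small bottleneck MLP is
# inserted at an intermediate layer of a larger frozen network, and only
# the small network's weights (A, B) would be trained to steer the output.
rng = np.random.default_rng(1)

W1 = rng.normal(size=(16, 8))       # frozen layer 1 of the "big" network
W2 = rng.normal(size=(4, 16))       # frozen layer 2

A = rng.normal(size=(4, 16)) * 0.1  # trainable hypernetwork weights (down-projection)
B = rng.normal(size=(16, 4)) * 0.1  # trainable hypernetwork weights (up-projection)

def small_hypernetwork(h, A, B):
    # bottleneck MLP whose output is added back onto the activation
    return h + B @ np.tanh(A @ h)

x = rng.normal(size=8)
h = W1 @ x                          # intermediate activation of the frozen network
y = W2 @ small_hypernetwork(h, A, B)  # steered output

y_plain = W2 @ (W1 @ x)             # output without the hypernetwork, for comparison
```

Because `W1` and `W2` stay frozen, the hypernetwork file is small and can be swapped in or out without touching the base model's weights.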

DreamBooth is a deep learning generation model developed by researchers from Google Research and Boston University in 2022 which can fine-tune the model to generate precise, personalised outputs that depict a specific subject, following training on a set of images depicting that subject.[47]

Key papers[edit]

Learning Transferable Visual Models From Natural Language Supervision (2021). This paper describes the CLIP method for training text encoders, which convert text into floating-point vectors. The diffusion model uses such text encodings to condition image generation.[67]
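The core of CLIP training can be sketched in a few lines: text and images are embedded into a shared vector space, and training pushes matching image/text pairs to have higher cosine similarity than mismatched ones. The toy "encoders" below are just random vectors, invented for illustration.

```python
import numpy as np

# Sketch of the CLIP training objective with toy embeddings: the logits
# matrix holds cosine similarities between every caption and every image,
# and the contrastive loss makes the diagonal (matching pairs) largest.
rng = np.random.default_rng(2)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

text_emb = normalize(rng.normal(size=(3, 16)))                    # 3 captions -> unit vectors
img_emb = normalize(text_emb + 0.05 * rng.normal(size=(3, 16)))   # paired images, near their captions

logits = text_emb @ img_emb.T          # 3x3 cosine-similarity matrix
predicted_match = logits.argmax(axis=1)  # which image each caption best matches
```

In the real method, a symmetric cross-entropy loss over the rows and columns of `logits` trains both encoders jointly; Stable Diffusion reuses only the trained text encoder.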

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations (2021). This paper describes SDEdit, also known as "img2img".[68]
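The img2img trick can be shown with the forward-noising step alone: instead of starting from pure noise, the input image is partially noised to an intermediate diffusion step chosen by a "strength" setting, and denoising runs from there, preserving the input's overall layout. The toy noise-schedule constant below is an assumption for illustration.

```python
import numpy as np

# Sketch of the SDEdit / img2img idea. `strength` controls how far toward
# pure noise the input is pushed before denoising begins: 0 keeps the
# image, 1 discards it entirely. `alpha_bar` is a toy schedule value.
rng = np.random.default_rng(3)

image = rng.normal(size=(8, 8))   # stands in for the input image (or its latent)
strength = 0.6
alpha_bar = 1.0 - strength        # toy cumulative noise-schedule value at the chosen step

noised = np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * rng.normal(size=(8, 8))
# A real pipeline would now run the reverse diffusion loop starting from
# `noised` at that intermediate step instead of from pure Gaussian noise.
```

Because the starting point still correlates with the input image, the denoised result inherits its composition while the prompt reshapes the details.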

High-Resolution Image Synthesis with Latent Diffusion Models (2021, updated in 2022). This paper describes the latent diffusion model (LDM), the backbone of the Stable Diffusion architecture.[69]
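The LDM arrangement can be sketched with a toy linear autoencoder: the encoder compresses the image into a much smaller latent, the expensive diffusion process runs in that latent space, and the decoder maps the result back to pixels. The shapes and the linear encoder/decoder here are invented for illustration; Stable Diffusion's VAE compresses 512×512×3 pixels to a 64×64×4 latent.

```python
import numpy as np

# Sketch of the latent diffusion idea with a toy linear autoencoder.
# Diffusion operates on the small latent z, never on the full image.
rng = np.random.default_rng(4)

D = 64   # pixel dimension (flattened), toy value
d = 8    # latent dimension, toy value

E = rng.normal(size=(d, D)) / np.sqrt(D)   # toy encoder
Dec = E.T @ np.linalg.inv(E @ E.T)         # pseudo-inverse decoder, so E @ Dec == identity

image = rng.normal(size=D)
z = E @ image                  # encode: the diffusion loop would run here, on z
z_denoised = z                 # placeholder for the latent diffusion loop
reconstruction = Dec @ z_denoised  # decode back to pixel space
```

Running diffusion on `z` instead of `image` is what makes the model cheap enough to fit on consumer GPUs.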

Classifier-Free Diffusion Guidance (2022). This paper describes CFG, which allows the text encoding vector to steer the diffusion model toward creating the image described by the text.[28]
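The CFG combination rule itself is a one-liner: at each denoising step the model is run twice, once with the prompt and once without, and the guided prediction extrapolates toward the conditioned one. The denoiser below is a toy stand-in; the extrapolation formula is the one the paper describes.

```python
import numpy as np

# Classifier-free guidance: combine conditioned and unconditioned noise
# predictions, extrapolating past the unconditioned one by `guidance_scale`.
rng = np.random.default_rng(5)

def toy_denoiser(x, cond):
    # stands in for the U-Net's noise prediction; +0.1 mimics the prompt's pull
    return x * 0.5 + (0.1 if cond else 0.0)

x = rng.normal(size=4)
guidance_scale = 7.5   # a commonly used CFG scale for Stable Diffusion

eps_cond = toy_denoiser(x, cond=True)
eps_uncond = toy_denoiser(x, cond=False)
eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

A scale of 1.0 reduces to the plain conditioned prediction; larger scales follow the prompt more strongly at some cost in image diversity.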

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2023). This paper describes SDXL.[19]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (2022).[22] This paper describes rectified flow, which is used for the backbone architecture of SD 3.0.[21]
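The rectified-flow objective can be sketched directly: data and noise samples are connected by straight-line paths, and the model regresses the constant velocity along each line; sampling then integrates the learned velocity field. The toy vectors below are invented for illustration.

```python
import numpy as np

# Sketch of rectified flow: points on the straight path between a data
# sample x0 and a noise sample x1 all share the constant velocity x1 - x0,
# which is the model's regression target at every t.
rng = np.random.default_rng(6)

x0 = rng.normal(size=4)     # a data sample
x1 = rng.normal(size=4)     # a noise sample
t = 0.3

x_t = (1.0 - t) * x0 + t * x1   # interpolated training input
velocity_target = x1 - x0       # regression target along this path

# Integrating the path's own velocity from x_t recovers the endpoint:
x1_recovered = x_t + (1.0 - t) * velocity_target
```

Because the paths are straight, fewer integration steps are needed at sampling time than with curved diffusion trajectories.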

Scaling Rectified Flow Transformers for High-resolution Image Synthesis (2024). This paper describes SD 3.0.[20]


Litigation[edit]

In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.[76] The same month, Stability AI was also sued by Getty Images for using its images in the training data.[77]


In July 2023, U.S. District Judge William Orrick indicated that he was inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz, but allowed them to file an amended complaint.[78]

License[edit]

Unlike models like DALL-E, Stable Diffusion makes its source code available,[79][8] along with the model (pretrained weights). It applies the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (M).[80] The license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories".[81][82] The user owns the rights to their generated output images, and is free to use them commercially.[83]

See also[edit]

Artificial intelligence art

Midjourney

Craiyon

Hugging Face

Imagen (Google Brain)

External links[edit]

Stable Diffusion Demo

Interactive Explanation of Stable Diffusion

Investigation on sensitive and private data in Stable Diffusion's training data

"We Are All Raw Material for AI"


Negative Prompts in Stable Diffusion