DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models

Dmitry Spodarets

Diffusion models have recently gained enormous popularity thanks to their ability to generate high-quality, controllable images from text prompts written in natural language. However, generating images with the desired details is challenging, because users must write prompts that describe the expected result precisely. Developing such prompts takes trial and error, and the outcome can often seem random.

DiffusionDB is the first large-scale text-to-image prompt dataset, with 2 million real-world prompt-image pairs. It opens up broad research opportunities: understanding the interaction between prompts and generative models, detecting deepfakes, and developing human-AI interaction tools that help users work with these models more easily.

DiffusionDB contains 2 million images generated by Stable Diffusion using prompts and hyperparameters set by real users. Most prompts are in English, but the dataset also includes other languages such as Spanish, Chinese and Russian.

DiffusionDB is distributed using a modular file structure: the 2 million images are split into 2,000 folders, each containing 1,000 images and a JSON file that links those images to their prompts and hyperparameters. The subfolders are named part-00xxx, and each image is a PNG file with a unique name generated as a version 4 UUID. The JSON file in a subfolder has the same name as the subfolder and contains key-value pairs that map image file names to their prompts and hyperparameters.
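The folder layout described above can be read with a few lines of Python. This is a minimal sketch, not the official loader: it assumes one subfolder has already been downloaded and unpacked locally, the function name `load_part` is hypothetical, and the metadata values are treated as opaque objects because their exact field names are not given here.

```python
import json
from pathlib import Path


def load_part(part_dir):
    """Return {image_path: metadata} for one downloaded DiffusionDB subfolder.

    Assumption: part_dir is a local directory that holds the PNG images
    and a JSON file named after the directory itself.
    """
    part_dir = Path(part_dir)
    # The JSON file has the same name as the subfolder that contains it.
    json_path = part_dir / f"{part_dir.name}.json"
    with json_path.open(encoding="utf-8") as f:
        metadata = json.load(f)
    # Keys are image file names; values hold the prompt and hyperparameters.
    return {part_dir / image_name: info for image_name, info in metadata.items()}
```

Calling `load_part` on a subfolder path returns a dictionary whose keys are full image paths, so each image can be opened alongside its prompt and hyperparameters.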

DiffusionDB is quite large: as much as 1.6 TB! However, thanks to the modular file structure, you can easily load only as many images as you need, together with their prompts and hyperparameters.
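To get a feel for how much of the dataset a given experiment needs, the figures above can be turned into a rough estimate. The sketch below is only back-of-the-envelope arithmetic: the function names are hypothetical, the per-folder size is an average that assumes all folders are about equally large, and terabytes are converted to gigabytes with a factor of 1,000.

```python
import math

# Figures taken from the description of the dataset.
TOTAL_IMAGES = 2_000_000
TOTAL_FOLDERS = 2_000
TOTAL_SIZE_TB = 1.6

IMAGES_PER_FOLDER = TOTAL_IMAGES // TOTAL_FOLDERS  # 1,000 images per folder


def folders_needed(num_images):
    """Number of whole folders to download to get at least num_images."""
    if not 0 < num_images <= TOTAL_IMAGES:
        raise ValueError("num_images must be between 1 and 2,000,000")
    return math.ceil(num_images / IMAGES_PER_FOLDER)


def estimated_size_gb(num_images):
    """Rough download size in GB, assuming equally sized folders."""
    avg_folder_gb = TOTAL_SIZE_TB * 1000 / TOTAL_FOLDERS  # about 0.8 GB
    return folders_needed(num_images) * avg_folder_gb
```

For example, `folders_needed(2500)` gives 3 folders, which under the equal-size assumption comes to a few gigabytes rather than the full 1.6 TB.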

Links to the work and code are left below.

Also worth recalling is another recent addition to the world of AI: eDiff-I, a new-generation generative AI content creation tool that offers unprecedented text-to-image synthesis, instant style transfer, and intuitive painting-with-words capabilities.