LAION introduced Re-LAION-5B, a relaunch of its popular dataset, free of CSAM and other sensitive content

The non-profit AI research organization LAION recently announced the release of Re-LAION-5B, a dataset of image links paired with text captions. It is the result of a safety revision of the original LAION-5B dataset that removed links to child sexual abuse material (CSAM) and other sensitive content.

The LAION dataset does not contain images. It consists of pairs of image links and alt-text captions extracted from Common Crawl, an open repository of web crawl data. The safety revision was prompted by the findings of the Stanford Internet Observatory, which reported in December 2023 that 1,008 URLs in the dataset pointed to material classified as either "CSAM" or "likely CSAM". Working in partnership with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P), LAION removed all known links to suspected CSAM from the dataset. In total, the revision removed 2,236 links, including the 1,008 identified in the Stanford report.

Re-LAION-5B comes in two variants: Re-LAION-5B research and Re-LAION-5B research-safe. The former has been revised only to remove the 2,236 reported links, while the latter has also been cleaned of all sensitive material. LAION strongly recommends that all dataset users migrate to Re-LAION-5B immediately. Both variants are now available under an Apache 2.0 license via gated access on Hugging Face. In addition, existing copies of LAION-5B, which are now deprecated, can be cleaned using Re-LAION-5B's metadata so that they remain usable.
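
To make the last point concrete, the following is a minimal sketch of how a local copy could be filtered against the revised metadata. It is not LAION's own tooling: the function name `keep_rows_in_revised`, the `url` and `caption` field names, and the in-memory list representation are all assumptions made for illustration, and the real metadata files may use a different schema and storage format.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]


def keep_rows_in_revised(local_rows: Iterable[Row],
                         revised_urls: Iterable[str]) -> Iterator[Row]:
    """Yield only the local rows whose URL also appears in the revised metadata.

    Hypothetical helper: assumes each row is a dict with a "url" key.
    """
    allowed = set(revised_urls)  # set membership keeps each lookup O(1)
    for row in local_rows:
        if row["url"] in allowed:
            yield row


# Toy stand-in for an existing local copy (link + alt-text pairs).
local_copy = [
    {"url": "https://example.com/a.jpg", "caption": "a cat"},
    {"url": "https://example.com/b.jpg", "caption": "a dog"},
    {"url": "https://example.com/c.jpg", "caption": "a tree"},
]

# Toy stand-in for the URLs that remain in the revised metadata.
revised = ["https://example.com/a.jpg", "https://example.com/c.jpg"]

cleaned = list(keep_rows_in_revised(local_copy, revised))
```

Filtering by what is still present in the revised metadata, rather than by a list of removed links, means the list of flagged URLs never has to be distributed or handled by dataset users.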