A recently published report by the Stanford Internet Observatory has revealed alarming content within LAION-5B, an open-source AI dataset released in March 2022. The dataset, which has been instrumental in training popular AI models like Stable Diffusion and Google's Imagen, contains at least 1,008 instances of child sexual abuse material (CSAM), with thousands more suspected, according to the findings.
The LAION-5B dataset comprises over 5 billion images and related captions from the internet and is used to train AI text-to-image generators. The report warns that the inclusion of CSAM in the dataset may enable AI products developed using this data to generate new and disturbingly realistic child abuse content.
In response to the revelations, LAION has temporarily taken down its datasets, citing an "abundance of caution," and says it will ensure they are safe before republishing them.
This is not the first controversy surrounding LAION datasets. In October 2021, cognitive scientist Abeba Birhane flagged problematic content, including explicit image-text pairs, in an earlier LAION dataset called LAION-400M.
In a separate incident in September 2022, photos from private medical records surfaced in the LAION-5B dataset, raising concerns about the privacy and ethical implications of using such data. The dataset also became entangled in a class-action lawsuit, Andersen et al. v. Stability AI LTD et al., in which LAION is named but not directly sued.
Renowned artist Karla Ortiz, who is involved in the lawsuit, spoke about the LAION-5B dataset during a virtual FTC panel in October 2023. Ortiz expressed concerns that the dataset contains not only her intellectual property but also sensitive material such as private medical records, non-consensual pornography, and images of children.
AI pioneer Andrew Ng, in a newsletter, defended the importance of free access to large datasets like LAION for the progress of machine learning. He argued that restricting access could impede advancements in various fields, including art, education, drug development, and manufacturing.
LAION, founded by Hamburg-based high school teacher Christoph Schuhmann, started with the aim of creating an open-source dataset for training text-to-image diffusion models. The dataset, which initially held 3 million image-text pairs, has since grown to over 5 billion, making it the largest free dataset of its kind.
The controversy has prompted discussions about transparency and ethics in the use of such vast datasets. LAION's origins in scraping visual data from online shopping sites have also come under scrutiny, with concerns raised about the dataset's sources and the potential biases it contains.
As the AI community navigates this controversy, questions arise about the balance between innovation, access to data, and ethical considerations in the development of AI technologies.