A recent study by Cornell researchers examines how ChatGPT, a prominent language model, memorizes text verbatim, raising ethical questions about its training methods and its implications for privacy. The study found that ChatGPT, and likely other proprietary AI models, can "memorize" entire texts, including copyrighted poems.
Lyra D'Souza, the first author of the study, said that the memorization of long texts by large language models is a privacy concern because their training data, often sourced from the internet, might include private information. The study, presented at the Computational Humanities Research Conference in Paris, focused on ChatGPT's ability to retrieve poems and on the ethical considerations surrounding such memorization.
David Mimno, senior author and associate professor of information science at Cornell, explained the choice of poems for the study. Poems, being short and available from reputable sources, presented an opportunity to examine the model's behavior. Poems, he added, are expected to be surprising and meaningful, making them candidates for memorization.
Large language models like ChatGPT are trained to generate text by predicting the most likely next word based on their training data, which consists primarily of webpages. The study demonstrated that memorization occurs when the training data includes duplicated passages, which reinforce specific word sequences. D'Souza tested ChatGPT and three other language models. ChatGPT successfully retrieved 72 of 240 poems (30%), while the other models had varying degrees of success.
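The effect of duplicated passages on next-word prediction can be illustrated with a toy model. The sketch below is not the study's method and is far simpler than ChatGPT: it is a minimal bigram predictor, and the function names (train_bigram, generate) and the sample sentences are hypothetical, chosen only for illustration. It shows that once a passage appears several times in the training text, always picking the most likely next word reproduces that passage word for word.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word follows each other word in the corpus."""
    counts = defaultdict(Counter)
    for text in corpus:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=10):
    """Greedy generation: always append the most frequent next word."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:  # no known continuation, so stop
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical training text: one "poem" line plus unrelated sentences.
passage = "the river keeps its quiet promise"
other = [
    "the river floods the valley",
    "the river floods every spring",
    "a river floods the plain",
]

# The passage appears once: the more common phrasing wins.
seen_once = generate(train_bigram([passage] + other), "the")

# The passage appears five times: its word sequence is reinforced.
seen_often = generate(train_bigram([passage] * 5 + other), "the")

print(seen_once)
print(seen_often == passage)
```

In this toy setting, repetition alone decides whether the passage comes back verbatim. Real language models are far more complex, so this only illustrates the intuition behind the duplication effect described above.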
Notably, whether a poem belongs to the poetry canon, and in particular whether it appears in the "Norton Anthology of Poetry," emerged as the most important factor in whether the chatbot had memorized it. The poet's race, gender, and era were less significant predictors. The study also observed that ChatGPT's responses changed over time, suggesting that the model itself continues to change.
D'Souza expressed concern about relying on powerful tools that claim to know everything and emphasized the importance of learning from diverse sources. The study concluded that tools like ChatGPT must be used responsibly and transparently.
While the study focused on American poets, future research aims to explore how chatbots respond to requests in different languages and whether factors such as poem length, meter, and rhyming patterns influence memorization. D'Souza emphasized the need to address ethical considerations as tools like ChatGPT become integral parts of daily life.