
ChatGPT is multilingual but monocultural, and it’s learning your values

Like the rest of the internet, I’ve been playing with ChatGPT, the new AI chatbot released by OpenAI, and I’ve been fascinated by how much it does well and how it still gets a lot wrong.

Table of Contents
1. What is ChatGPT trained on?
2. How deep learning models make sense of the world
3. Common Crawl (filtered)
4. WebText2
5. Books1 and Books2

ChatGPT is a foundation model, that is, a deep learning model (also called a neural network) that is trained on so much data and with so many parameters that it is qualitatively different from models you could feasibly train yourself. I wanted to know what data ChatGPT is trained on, but it turns out that information is not readily available.


My conclusion, after reading up on all this, is that ChatGPT is multilingual but monocultural – but that by using it, we’re all helping to train it to align its values with our own.


Let me explain.


What is ChatGPT trained on?


The basics are clear. ChatGPT is based on the GPT models (GPT-1, GPT-2, GPT-3, and the current GPT-3.5 series), which are trained on data scraped from the web and some books. I’ll discuss them in more detail below.


In addition, as described in Ouyang et al. 2022 (or, for non-scholars, here), ChatGPT is based on InstructGPT, which was fine-tuned on "desired responses" that humans wrote for a set of prompts. After that, human labellers ranked several of GPT-3's responses to the same prompt (presumably much like the way ChatGPT asks us to rate its responses). A reward model was trained on those rankings to predict which responses humans would prefer, and the language model was then tuned, using reinforcement learning, to produce responses that the reward model scores highly. The result is InstructGPT, which is what ChatGPT is based on.
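To make those steps a bit more concrete, here is a deliberately tiny Python sketch of the pipeline's shape. Everything in it (the prompt, the rankings, the stand-in "reward model") is made up for illustration; the real system fine-tunes a large neural network with reinforcement learning rather than looking things up in dictionaries.

```python
# Toy sketch of the InstructGPT three-step pipeline. All data here is invented,
# and the "models" are stand-ins: the point is only the shape of the process.

# Step 1: humans write desired responses to prompts (supervised fine-tuning data).
demonstrations = {
    "Explain the moon landing to a 6 year old":
        "People went to the moon in a big rocket and walked around on it.",
}

# Step 2: humans rank several candidate responses to the same prompt.
# A reward model is trained on these rankings to predict what humans prefer;
# here it is faked by simply looking the score up.
human_rankings = {
    "Explain the moon landing to a 6 year old": [
        ("The Apollo program was a Cold War initiative...", 1),  # least preferred
        ("Moon landing may refer to several events...", 2),
        ("People went to the moon in a big rocket and walked around on it.", 3),  # most preferred
    ],
}

def reward_model(prompt: str, response: str) -> float:
    """Stand-in for a learned reward model: return the labellers' score."""
    for candidate, score in human_rankings.get(prompt, []):
        if candidate == response:
            return float(score)
    return 0.0

# Step 3: the language model is then optimised (with reinforcement learning)
# to produce responses the reward model scores highly. Here that optimisation
# is caricatured as simply picking the best-scoring candidate.
def aligned_response(prompt: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda response: reward_model(prompt, response))

prompt = "Explain the moon landing to a 6 year old"
candidates = [response for response, _ in human_rankings[prompt]]
print(aligned_response(prompt, candidates))
```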


Here’s OpenAI’s visual explanation of the process for the value alignment training.



The team describes InstructGPT (which ChatGPT is based on) as aligned with the values of the roughly 40 contractors they hired to write and rate responses for it. It is also "biased towards the cultural values of English-speaking people".

The model card for InstructGPT explains that it still has issues. For instance, and rather seriously, it makes up "facts". Unfortunately, it's really good at making those "facts" sound quite convincing.


In this blog post, I'll explain more about the data it's trained on, how it all works, and how you and I are training the model each time we use it.


How deep learning models make sense of the world

First of all: what is the data the GPT series of AI models were trained on? Here is the table from the paper that introduced GPT-3 in 2020 (Brown et al. 2020).

I’ll return to each of these datasets below, but first I need to explain tokens and vectors and latent space.


Models like GPT-3 count things in tokens. A token is the smallest semantic unit for a machine learning model, much as a phoneme is the smallest unit of sound that can distinguish one word from another in spoken language. Often a token corresponds to a word, although it gets more complicated. The basic GPT-3 model is trained on unlabelled data, and the tokens themselves are derived statistically from the training text rather than from a human-made dictionary.

A model like GPT-3 calculates how tokens (let's just say words) relate to each other by assigning each word a vector. For example, in a specific model trained on Wikipedia and newswire data, McCoy and Ullman explain that "the word 'dog' is represented as the vector [0.308, 0.309, 0.528, −0.925, ….]". If you plot that into a coordinate system, then words that often co-occur with "dog" in the training data will be positioned close to "dog". This "map" of how words are related to each other is also called the "vector space" or "latent space", or even just "space".
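To make "close in latent space" concrete, here is a minimal Python sketch with made-up three-dimensional vectors; real models use thousands of dimensions per token and learn the numbers from the training data.

```python
# Toy word vectors in three dimensions (the numbers are invented; GPT-3's vectors
# have thousands of dimensions and are learned from the training data).
import numpy as np

vectors = {
    "dog":   np.array([0.31, 0.31, 0.53]),
    "puppy": np.array([0.30, 0.35, 0.50]),
    "tax":   np.array([-0.80, 0.10, -0.40]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How close two words sit in the latent space (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["dog"], vectors["puppy"]))  # high: the words co-occur a lot
print(cosine_similarity(vectors["dog"], vectors["tax"]))    # low: the words rarely co-occur
```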


Remember those x/y coordinate grids we drew in 6th grade? It's kind of like that. Except instead of two dimensions (an x-axis and a y-axis), each word's vector has thousands of dimensions, and the model tying it all together has 175 billion parameters.


Once GPT-3 is trained, it doesn’t “know” anything about its training data any more. All it knows is those coordinates. Dog is [0.308, 0.309, 0.528, −0.925, ….], and that …. stands for a lot more numbers. It also knows what other words (or tokens) “dog” is close to. All those tokens and their coordinates across billions of different parameters make up the “latent space” of the model.

OK, so back to the table about the data GPT-3 was trained on.


The Common Crawl is a lot of scraped web data. WebText2 consists of webpages that were linked to from Reddit posts with at least three upvotes. Books1 and Books2 are not specified, but people have suggested the Gutenberg library, BookCorpus (free, self-published books) and libgen as possibilities. Finally, Wikipedia means the English-language Wikipedia, not all the language editions. The "Quantity (tokens)" column shows how much text is in each dataset, but the datasets are not weighted equally during training. Here is a table showing the relative weighting.
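As a rough sketch of what that weighting means, here are a few lines of Python that sample training sources according to the mix reported for GPT-3 in Brown et al. 2020; the percentages are the paper's, but the sampling loop itself is a big simplification of how training batches are actually assembled.

```python
# Sample training sources according to the GPT-3 mix reported in Brown et al. 2020.
# The percentages are from the paper (they sum to 101% due to rounding); the
# sampling itself is a simplification of how training batches are really built.
import random

training_mix = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia (English)": 3,
}

sources = list(training_mix)
weights = list(training_mix.values())

# Common Crawl turns up most often, but WebText2 and Wikipedia are sampled far
# more heavily than their raw token counts alone would suggest.
for source in random.choices(sources, weights=weights, k=10):
    print(source)
```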


Common Crawl (filtered)

 

The Common Crawl is an open repository of web crawl data in more than 40 languages. In the paper Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, Jesse Dodge and co-authors (including Margaret Mitchell, who was fired from Google's AI ethics team last year along with Timnit Gebru and now works at Hugging Face) document a version of the Common Crawl that is filtered in the way described for the GPT-2 training data.


Dodge et al. analyse the Common Crawl dataset at three levels: the metadata (such as which domains the data comes from and when it was created or collected), the text itself, and what is missing or excluded.


At the metadata level, they found that US domains dominate, contributing far more content than domains from other countries with many English speakers, such as India or Pakistan.
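Here is a minimal sketch of the kind of domain-level metadata analysis this involves, with made-up URLs standing in for the corpus.

```python
# Count which internet domains the documents in a (tiny, made-up) corpus come from,
# the kind of metadata analysis Dodge et al. run on the full Common Crawl corpus.
from collections import Counter
from urllib.parse import urlparse

document_urls = [
    "https://patents.google.com/patent/US1234567",
    "https://www.nytimes.com/2020/01/01/some-article.html",
    "https://www.bbc.co.uk/news/some-story",
    "https://www.nytimes.com/2020/02/02/another-article.html",
]

domain_counts = Counter(urlparse(url).netloc for url in document_urls)
print(domain_counts.most_common())  # which domains contribute the most documents
```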

They found a surprising amount of patent data, and a lot of it is machine-translated, because various countries require patents to be filed in their own languages. Some patents have even been run through OCR, so quite a bit of the text is machine-generated in one way or another.

Finally, they found that filters that remove words on a banned-words list "disproportionately remove documents in dialects of English associated with minority identities (e.g., text in African American English, text discussing LGBTQ+ identities)" (Dodge et al., 2021, p. 2). You can take a look at the "bad words list" yourself. Most of the words are clearly there so porn can be filtered out, and there are some slurs and swearwords as well. This means that texts representing minorities are missing. Removing words about sex also means that non-offensive material about queer culture, including legal documents about same-sex marriage, has been filtered out.
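Here is a minimal sketch of how that kind of document-level blocklist filtering works; the blocklist and the documents are made up, but the mechanism (a single listed word removes the entire document, whatever the context) is the one Dodge et al. describe.

```python
# Minimal sketch of document-level blocklist filtering. The blocklist and the
# documents are invented; the point is that one listed word drops the whole document.
banned_words = {"sex"}  # stand-in for the much longer "bad words" list

documents = [
    "The court upheld the right to same-sex marriage.",
    "A recipe for sourdough bread.",
]

def keep(document: str) -> bool:
    """Keep a document only if none of its words appear on the blocklist."""
    words = document.lower().replace("-", " ").split()
    return not any(word.strip(".,") in banned_words for word in words)

print([d for d in documents if keep(d)])
# The legal text about same-sex marriage is filtered out; only the recipe survives.
```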


OK, so there is some bias there, and a crawl of “all the web” is bound to have a lot of not exactly high quality language. The next corpus is meant to remedy that.


WebText2


WebText2 is a corpus of webpages that were linked to from Reddit posts with three or more upvotes. The idea is that three upvotes on Reddit act as a rough quality filter for the linked pages. The exact corpus used to train GPT-3 is not publicly available, but it has been recreated and can be downloaded as OpenWebText2, which also comes with instructions for how to recreate the dataset.
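Here is a sketch of that construction recipe, assuming a hypothetical list of Reddit submissions with scores and outbound links; the real pipeline then scrapes, deduplicates and cleans the linked pages.

```python
# Sketch of the WebText recipe: keep outbound links from Reddit submissions with
# at least three upvotes, then (in the real pipeline) scrape the linked pages.
# The `submissions` list is hypothetical.
MIN_SCORE = 3

submissions = [
    {"url": "https://example.com/well-liked-article", "score": 57},
    {"url": "https://example.com/ignored-spam", "score": 1},
]

corpus_urls = {s["url"] for s in submissions if s["score"] >= MIN_SCORE}
print(corpus_urls)  # only links that cleared the upvote threshold go into the corpus
```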

Unfortunately, Reddit users are not a representative sample of humanity, so there is likely to be bias here too. And three upvotes is not a lot. But OpenAI must trust this dataset, because relative to its size, WebText2 is one of the most heavily weighted of the five datasets used to train GPT-3.


Books1 and Books2


The description of these datasets in the original paper is disappointingly vague: “two internet-based books corpora (Books1 and Books2).”

Presumably the reason that OpenAI is so vague about what, exactly, these two datasets are is that it’s a bit dodgy in terms of copyright. I assume (hope?) that at least one of these is the Gutenberg library, which is books in the public domain. But if it is, why not just say so?


Many assume that one of these is BookCorpus, which consists of 11,038 books that were self-published on Smashwords and are available for free. BookCorpus was definitely used for training GPT-1 and BERT (another large language model). The BookCorpus dataset is available from Hugging Face, and Jack Bandy and Nicholas Vincent have published a paper retrospectively documenting it.
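If you want to poke at it yourself, it can be loaded with the Hugging Face datasets library, something like the sketch below; the dataset identifier and loading options may have changed since this was written.

```python
# Load BookCorpus from the Hugging Face hub with the `datasets` library.
# The dataset id and loading options may have changed since this was written.
from datasets import load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
print(bookcorpus)             # number of rows and column names
print(bookcorpus[0]["text"])  # the first text snippet in the dataset
```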


The biggest issues Bandy and Vincent identify in the BookCorpus dataset are:

Even though the books are free, the licence doesn’t really permit this use. It’s legally dubious.

There are lots of duplicates (particularly of romance novels) and some authors have published hundreds of novels in the dataset, so it’s not exactly representative.

There’s a skewed religious orientation – Christianity is overrepresented.
