Google’s Bard AI is trained on website content, but little is known about how that content was collected and whose content was used. Bard is based on the LaMDA language model, which was trained on a dataset of Internet content called Infiniset, and very little is known about where that data came from and how it was obtained.
The LaMDA Research Paper
The 2022 LaMDA research paper lists percentages of different kinds of data used to train LaMDA, but only 12.5% comes from a public dataset of crawled content from the web and another 12.5% comes from Wikipedia.
The LaMDA research paper (PDF) explains why they chose this composition of content:
“…this composition was chosen to achieve a more robust performance on dialog tasks …while still keeping its ability to perform other tasks like code generation.
As future work, we can study how the choice of this composition may affect the quality of some of the other NLP tasks performed by the model.”
The research paper refers to “dialog” and “dialogs,” which is how those words are spelled in this context, within the realm of computer science.
In total, LaMDA was pre-trained on 1.56 trillion words of “public dialog data and web text.”
The dataset consists of the following mix (a rough word-count breakdown follows the list):
- 12.5% C4-based data
- 12.5% English language Wikipedia
- 12.5% code documents from programming Q&A websites, tutorials, and others
- 6.25% English web documents
- 6.25% Non-English web documents
- 50% dialogs data from public forums
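Since LaMDA was pre-trained on 1.56 trillion words in total, those percentages translate into concrete word counts. Here is a quick sketch in plain Python that does the arithmetic (the total and the percentages come straight from the research paper; nothing else is assumed):

```python
# Rough per-source word counts derived from the 1.56 trillion-word
# total and the percentages listed in the LaMDA research paper.
TOTAL_WORDS = 1.56e12  # 1.56 trillion words

mix = {
    "C4-based data": 0.125,
    "English Wikipedia": 0.125,
    "Code documents (Q&A sites, tutorials, etc.)": 0.125,
    "English web documents": 0.0625,
    "Non-English web documents": 0.0625,
    "Dialogs data from public forums": 0.50,
}

for source, share in mix.items():
    print(f"{source}: ~{share * TOTAL_WORDS / 1e9:,.1f} billion words")
```

By that arithmetic, the forum dialog data alone comes to roughly 780 billion words, and the C4 and Wikipedia portions to about 195 billion words each.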
The first two parts of Infiniset (C4 and Wikipedia) consist of data that is known.
The C4 dataset, which will be explored shortly, is a specially filtered version of the Common Crawl dataset.
However, only 25% of the data is from named sources (the C4 dataset and Wikipedia).
The remaining 75%, which makes up the bulk of the Infiniset dataset, consists of words that were scraped from the Internet.
The research paper doesn’t say how the data was obtained from websites, what websites it was obtained from, or any other details about the scraped content.
Google only uses generalized descriptions like “Non-English web documents.”
The word “murky” describes something that is not explained and is mostly concealed.
Murky is the best word for describing the 75% of data that Google used for training LaMDA.
There are some clues that may give a general idea of what sites are contained within the 75% of web content, but we can’t know for certain.
C4 Dataset
C4 is a dataset developed by Google in 2020. C4 stands for “Colossal Clean Crawled Corpus.”
This dataset is based on the Common Crawl data, which is an open-source dataset.
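Google has not published the exact copy of C4 used to train LaMDA, but a public version of the corpus is hosted for research use. As a minimal sketch (assuming the Hugging Face `datasets` library and its community-hosted `allenai/c4` mirror, neither of which is mentioned in the LaMDA paper), you could peek at a few documents like this:

```python
# A minimal sketch of inspecting the public C4 corpus, assuming the
# Hugging Face `datasets` library and the allenai/c4 mirror.
# Streaming avoids downloading the full multi-terabyte dataset.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    # Each record carries the cleaned text plus the source URL,
    # which is what lets researchers trace whose content is in C4.
    print(record["url"])
    print(record["text"][:200])
    if i >= 2:  # look at the first three documents only
        break
```

The `url` field on each record is what makes it possible to analyze which websites are represented in C4, a point that matters for the question of whose content was used.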
About Common Crawl
Common Crawl is a registered non-profit organization that crawls the Internet on a monthly basis to create free datasets that anyone can use.
The Common Crawl organization is currently run by people who have worked for the Wikimedia Foundation, former Googlers, and a founder of Blekko, and it counts as advisors people like Peter Norvig, Director of Research at Google, and Danny Sullivan (also of Google).
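Because the datasets are free for anyone to use, it is even possible to check whether a given URL appears in a Common Crawl snapshot. As a rough sketch (assuming Python’s `requests` library and Common Crawl’s public CDX index API; the crawl ID below is only an example, since a new one is published with each monthly crawl):

```python
# A sketch of querying Common Crawl's public CDX index API to see
# whether a URL was captured. Assumes the `requests` library; the
# crawl ID "CC-MAIN-2023-06" is an example and changes monthly.
import json
import requests

CRAWL_ID = "CC-MAIN-2023-06"  # example monthly crawl identifier
API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

resp = requests.get(
    API,
    params={"url": "example.com/*", "output": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# The API returns one JSON object per line for each capture found.
for line in resp.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture["url"])
```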