The report, Inside the secret list of websites that make AI like ChatGPT sound smart (subscription required), is a fascinating read.
I was especially interested to see that the analysis includes a tool to check if your own website data is included in Google’s C4 data set (Colossal Clean Crawled Corpus), a data set used to train large language models, the technology behind chatbots like ChatGPT and Google Bard.
The analysis ranked the roughly 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text, typically a word or phrase, used to process disorganized information.
Business and industrial websites made up the biggest category of content in Google’s C4 data set (16 percent of categorized tokens). The data set also includes more than half a million personal blogs (3.8 percent of categorized tokens).
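For readers who like to see the idea in code, here is a minimal sketch of ranking websites by token count. It is not the method used in the report: the function names and sample data are hypothetical, and splitting text on whitespace is only a rough stand-in for a real tokenizer.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_sites_by_tokens(documents):
    """Rank domains by total token count.

    `documents` is an iterable of (url, text) pairs (a hypothetical input
    format). Tokens are approximated by whitespace-separated words.
    """
    counts = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same site.
        if domain.startswith("www."):
            domain = domain[4:]
        counts[domain] += len(text.split())
    # most_common() returns (domain, count) pairs, largest count first.
    return counts.most_common()


def token_share(ranking):
    """Return each domain's fraction of all counted tokens."""
    total = sum(count for _, count in ranking)
    if total == 0:
        return {}
    return {domain: count / total for domain, count in ranking}


# Made-up sample pages, for illustration only.
sample = [
    ("https://www.example.com/a", "marketing and public relations"),
    ("https://example.com/b", "new rules"),
    ("https://blog.example.org/post", "hello world again"),
]
ranking = rank_sites_by_tokens(sample)
shares = token_share(ranking)
```

With the sample above, `example.com` ranks first because its two pages together contribute more words than the single `blog.example.org` page. The same share calculation is what a figure like "16 percent of categorized tokens" expresses at the category level.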
Many people are concerned that these AI models harvest their data. They see it as “stealing” because the content is used without attribution. As a writer, I can certainly understand that.
While the source of data used in AI-generated results isn’t yet reported, I firmly believe that over time AI companies will list where the data in a specific response comes from.
Perhaps governments will require reporting. Perhaps there will eventually be a way for a website owner to opt in or opt out of having their data used. I suspect that soon, AI companies will volunteer the source of data used in a response.
No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results is valuable today.
Two of my URLs are included in the Google C4 data set: DavidMeermanScott.com (where my blog is hosted) and newsjacking.com. I already knew that both sites are also included in the ChatGPT data set because when I enter specific queries, the resulting answers clearly pull from my content.
Today, companies are investing billions into surfacing content on search engines like Google via paid search ads and optimized content.
In the future, if your content is used to train AI and the chatbots include the websites accessed in their responses, that becomes a new way to generate attention for your content.
AI presents a new world with many opportunities! It’s fun to think about what’s coming next and to play around with what’s available now.
Bogdan Laketic 09.05.2023, 20:45:48
I was surprised to see that there is already an 8th edition of the book "The new rules of marketing and PR." I got my 4th edition as a gift a few years ago, but I started to read it just now since I am in the process of building my own business where Marketing and PR are the most important components. I am amazed by the book! Marketing described in the best way imo. Currently I am at chapter 5: Blogs, so I was interested to check out your blog as well. Thank you very much for all the great content David! As I am reading the book, I am automatically converting the principles to the social media of today. But I am also looking forward to reading the newest edition, which probably brings more insight into the newest trends.
All the best from Austria!