The internet isn’t totally weird yet; AI can fix it

The Internet is descending into a hurricane of AI-generated nonsense, and no one knows how to stop it.

That’s the sobering possibility presented in a pair of papers that examine AI models trained on AI-generated data. This possibly avoidable fate is nothing new to AI researchers. But the two new papers bring to the fore concrete results that detail the consequences of a feedback loop in which a model is trained on its own output. While the research couldn’t replicate the scale of the largest AI models, such as ChatGPT, the results are still not encouraging. And they can be reasonably extrapolated to larger models.


“With the concept of data generation, and the reuse of data generation to retrain, tune, or refine machine-learning models, you are now entering a very dangerous game,” says Jennifer Prendki, CEO and founder of DataPrepOps company Alectio.

Artificial intelligence plummets towards collapse

The two papers, both preprints, approach the problem from slightly different angles. “The Curse of Recursion: Training on Generated Data Makes Models Forget” examines the potential effect on large language models (LLMs), such as ChatGPT and Google Bard, as well as Gaussian mixture models (GMMs) and variational autoencoders (VAEs). The second paper, “Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet,” examines the effect on diffusion models, such as those used by the image generators Stable Diffusion and DALL-E.

While the models discussed differ, the papers reach similar results. Both found that training a model on model-generated data can lead to a failure known as model collapse.

“This is because when the first model fits the data, it has its own errors. And then the second model, which trains on data produced by the first model that contains errors, basically learns those errors and adds its own errors on top,” says Ilia Shumailov, a computer science Ph.D. candidate at the University of Cambridge and co-author of the “Recursion” paper. “Over time, these errors accumulate. So, at some point, your data is essentially dominated by errors rather than the original data.”
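The accumulation Shumailov describes can be illustrated with a toy experiment. The sketch below is not the setup used in either paper; it is a minimal illustration, assuming a single Gaussian as the "model" and hypothetical values for the sample size and number of generations. Each generation fits a mean and standard deviation to the previous generation's samples, then produces the data the next generation trains on.

```python
import random
import statistics


def simulate_recursive_fitting(sample_size=10, generations=500, seed=0):
    """Refit a Gaussian to its own samples; return the fitted std per generation.

    Illustrative only: sample_size, generations, and seed are assumed values,
    not parameters taken from the papers discussed in the article.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "real" data drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    std_history = []
    for _ in range(generations + 1):
        mu = statistics.fmean(data)      # fitted mean (with estimation error)
        sigma = statistics.stdev(data)   # fitted spread (with estimation error)
        std_history.append(sigma)
        # The next generation only ever sees samples from the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return std_history


if __name__ == "__main__":
    history = simulate_recursive_fitting()
    for generation in (0, 10, 100, 500):
        print(f"generation {generation:3d}: fitted std = {history[generation]:.3e}")
```

With small samples, the estimation error of each fit compounds, and the fitted spread tends to shrink toward zero over many generations, so the later "models" describe only a sliver of the original distribution.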

The quality of the output generated by LLMs decreases with each generation of training on AI-generated data. (Source: “The Curse of Recursion: Training on Generated Data Makes Models Forget”)

And mistakes pile up quickly. Shumailov and his co-authors used OPT-125M, an open-source LLM introduced by Meta researchers in 2022, and fine-tuned the model with the wikitext2 dataset. While the first few generations produced decent results, the responses became nonsensical within ten generations. A ninth-generation response repeated the phrase “tailed jackrabbits” and cycled through various colors, none of which had anything to do with the initial prompt about the architecture of towers in Somerset, England.

Diffusion models are just as susceptible. Rik Sarkar, co-author of “Towards Understanding” and deputy director of the Laboratory for Foundations of Computer Science at the University of Edinburgh, says, “It seems that as soon as you have a reasonable volume of artificial data, it degenerates.” The paper found that a simple diffusion model trained on a specific category of images, such as photos of birds and flowers, produced unusable results within two generations.

Sarkar cautions that the results are a worst-case scenario: The dataset was limited, and the results from each generation were fed directly back into the model. However, the paper’s findings show that model collapse can occur if a model’s training dataset includes too much AI-generated data.

AI training data represents a new frontier for cybersecurity

This comes as no shock to those who closely study the interaction between AI models and the data used to train them. Prendki is an expert in the field of machine learning operations (MLOps), but she also holds a Ph.D. in particle physics and sees the problem through a more fundamental lens.

“It’s basically the concept of entropy, right? Data has entropy. More entropy, more information, right?” says Prendki. “But having a dataset twice as large does not absolutely guarantee double the entropy. It’s like putting sugar in a teacup and then adding more water. You are not increasing the amount of sugar.”
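Prendki's point can be made concrete with a small calculation. The sketch below is an illustration only, assuming a tiny made-up dataset of labels and using the Shannon entropy of the empirical distribution as a stand-in for "information." Doubling the dataset by repeating what is already there leaves the entropy unchanged; only genuinely new kinds of examples raise it.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical toy dataset: four equally frequent labels.
    original = ["bird", "flower", "tower", "hare"]
    doubled = original * 2                          # twice the size, same content
    enriched = original + ["teacup", "sugar", "water", "spoon"]  # new content

    print(shannon_entropy(original))   # 2.0 bits
    print(shannon_entropy(doubled))    # still 2.0 bits: more water, same sugar
    print(shannon_entropy(enriched))   # 3.0 bits: new information was added
```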


Model collapse, seen from this perspective, seems like an obvious problem with an obvious solution. Just turn off the tap and add another spoonful of sugar. This, however, is easier said than done. Pedro Reviriego, co-author of “Towards Understanding,” says that while there are ways to weed out AI-generated data, the daily release of new AI models quickly renders them obsolete. “It’s like [cyber]security,” Reviriego says. “You have to keep running after something that’s moving fast.”

Prendki agrees with Reviriego and takes the argument one step further. She says organizations and researchers training an AI model should view the training data as a potential adversary that must be vetted to avoid degrading the model. “This is the next generation of cybersecurity issues that very few people talk about,” Prendki says.

There is a solution that could solve the problem completely: watermarking. Images generated by OpenAI’s DALL-E include a specific color pattern by default as a watermark (although users have the option to remove it). LLMs can also contain watermarks, in the form of algorithmically detectable patterns that are not obvious to humans. A watermark provides an easy way to detect and exclude AI-generated data.
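To show what an “algorithmically detectable pattern” might look like, here is a toy sketch. It is not how any particular LLM vendor watermarks text; it assumes a hypothetical scheme in which a hash of each (previous word, word) pair marks roughly half of all word choices as “green,” and a watermarking generator prefers green words. A detector that knows the hash can then count green words and compute a z-score, while a human reader sees nothing unusual. The function names, the vocabulary, and the threshold are all made up for illustration.

```python
import hashlib
import math


def is_green(prev_token, token):
    """Hypothetical rule: a hash of the token pair marks about half of pairs green."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    return digest[0] < 128


def watermark_z_score(tokens):
    """How far the green count is from the 50% expected for unmarked text."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    green = sum(is_green(prev, tok) for prev, tok in pairs)
    n = len(pairs)
    return (green - 0.5 * n) / math.sqrt(0.25 * n)


def generate_watermarked(vocab, length, start="the"):
    """Toy generator: always picks a green word when the vocabulary offers one."""
    tokens = [start]
    for i in range(length):
        shift = i % len(vocab)
        candidates = vocab[shift:] + vocab[:shift]  # rotate to vary the output
        choice = next((w for w in candidates if is_green(tokens[-1], w)),
                      candidates[0])                # fallback if none is green
        tokens.append(choice)
    return tokens


if __name__ == "__main__":
    vocab = ["tower", "hare", "bird", "flower", "sugar", "water", "model", "data",
             "error", "image", "cup", "spoon", "paper", "loop", "tap", "web"]
    marked = generate_watermarked(vocab, 60)
    score = watermark_z_score(marked)
    # A large positive z-score suggests watermarked text; 4.0 is an assumed cutoff.
    print(f"z = {score:.2f}, flagged = {score > 4.0}")
```

A data-curation pipeline could, under these assumptions, drop any document whose score exceeds the cutoff before training.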

However, effective watermarking requires agreement on how it is implemented and a means of enforcement to prevent bad actors from distributing AI-generated data without a watermark. China has introduced a draft measure that would require watermarks on AI content (among other regulations), but it’s an unlikely model for Western democracies.

Images created with OpenAI’s DALL-E have a watermark in the lower right corner, although users can choose to remove it. (Credit: OpenAI)

There are some glimmers of hope left. The models presented in both papers are small compared to the largest models used today, such as Stable Diffusion and GPT-4, and it is possible that large models will prove more robust. It is also possible that new methods of data curation will improve the quality of future datasets. In the absence of such solutions, however, Shumailov says AI models could be subject to a first-mover advantage, as early models will have better access to datasets untainted by AI-generated data.

“Once we have the ability to generate synthetic data with some error in it, and we have large-scale use of such models, inevitably the data produced by these models will end up being used online,” says Shumailov. “If I want to build a company that provides a large language model as a service to someone [today], and I then go and scrape a year of data online and try to build a model, then my model will experience model collapse within it.”
