This episode examines the potential future consequences of model collapse resulting from recursive training.
Host: Hutch
On ITSPmagazine 👉 https://www.itspmagazine.com/itspmagazine-podcast-radio-hosts/hutch
______________________
Episode Sponsors
Are you interested in sponsoring an ITSPmagazine Channel?
👉 https://www.itspmagazine.com/sponsor-the-itspmagazine-podcast-network
______________________
Episode Introduction
A recent white paper demonstrated that generative AI models trained on the output of other models begin to collapse over multiple iterations of training. With an increasing share of content on the Internet being AI-generated, what does this mean for the future of AI as new models are increasingly trained on the output of older models? If generative AI is increasingly integrated into our daily lives in the coming years, could late model collapse result in broad system failures many years from now?
References:
https://arxiv.org/pdf/2305.17493.pdf
______________________
For more podcast stories from Cyber Cognition Podcast with Hutch, visit: https://www.itspmagazine.com/cyber-cognition-podcast
Watch the video podcast version on-demand on YouTube: https://www.youtube.com/playlist?list=PLnYu0psdcllS12r9wDntQNB-ykHQ1UC9U
Hello everybody and welcome to the fifth episode of the Cyber Cognition podcast. As always, I am your host, Justin Hutchens (AKA Hutch).
And in today’s episode, we are going to take a journey into the future. The year is 2042. Next-generation AI has been tightly integrated into every part of your daily life. You are gently awoken each morning by simulated ambient lighting and the sound of birds chirping – all generated by a model that manages your sleep routines to optimize your energy levels and well-being. As you get out of bed, you walk through an unobtrusive bio-scanner checkpoint at the doorway of your bedroom. These scanners collect biological data, which is leveraged to improve your health by making adjustments to your daily routines, activities, and nutrition. As you step into the dining room, you are offered a menu of breakfast options – each a delicious meal uniquely tailored to your specific nutritional needs. And these meals, of course, are also prepared by your robotic kitchen assistant. As you leave your house, a self-driving car pulls up and already knows precisely where to take you, based on your predefined digital agenda. As you ride in the car, a voice assistant engages you in fascinating conversation – discussing with you the latest news related to your particular areas of interest.
For the most part, your life is good – and the innovations offered by AI technology have significantly improved the average person’s quality of life. Sure, there is the constant bombardment of advertisements, and you have to be mindful of maintaining a decent social credit score. But overall, you are happy and pleased with your life. And the same is true for most people.
But over the following year, strange things begin to happen. Some of the systems that you interact with on a daily basis begin to misbehave. The voice interfaces that allow you to engage with the computer systems around your house begin to execute actions that are not consistent with your requests. These failures do not seem to be malicious, but things increasingly seem to be breaking in unusual ways. Your conversational AI companion begins spouting nonsensical information. On multiple occasions, your weekly Universal Basic Income credit allowance is not deposited into your account, and the automated banking support systems are unable to explain the reason for these failures. As you investigate further, you discover that it’s not just YOUR systems that are failing. These strange failures are happening broadly across all of society, and nobody has a good explanation for why. Economic instability is beginning to emerge for the first time in over a decade due to unexplained financial system failures. And there have even been catastrophic failures in critical infrastructure systems across several major cities.
After so many years of consistently improving societal, economic, and political conditions – things are beginning to fall apart and nobody knows why. You work with your AI knowledge assistant and begin digging through the historical knowledge archives for any potential explanation. And finally, you stumble upon it: a white paper written by multiple academic scholars in the year 2023. The paper is called “The Curse of Recursion: Training on Generated Data Makes Models Forget”.
Okay – so story time is over. But this scenario is not as far-fetched as you might think. This white paper actually does exist and was published in late May 2023 by researchers from Oxford, Cambridge, Imperial College London, and other institutions. The paper examines how training AI models on the output of previous generations of AI models can become problematic, and ultimately result in what the authors refer to as “model collapse”.
The authors of the paper trained multiple generations of models, where each new generation was trained on the output of the previous generation. The paper suggests that the use of model-generated content in training can cause irreversible defects in the resulting models. These defects are broken into two categories – early model collapse and late model collapse. I’m now going to read from the white paper.
[quote]
In early model collapse the model begins losing information about the tails of the distribution; in the late model collapse the model entangles different modes of the original distributions and converges to a distribution that carries little resemblance to the original one, often with very small variance.
[end quote]
Based on this analysis, we can expect to see early warning signs of model collapse. In the early stages, the models become less accurate at generating samples that are far from the mean (or average) of the distribution. But in late model collapse, the usefulness of these models fails entirely. The model is no longer capable of generating outputs that even remotely resemble the original distribution.
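To make that dynamic concrete, here is a toy sketch in Python for the show notes. It is my own illustration, not the paper’s actual experiment: each “generation” is simply a Gaussian fitted to data, and every new generation is trained only on samples produced by the previous one, so the estimation errors compound.

import numpy as np

rng = np.random.default_rng(42)

N = 50             # tiny training set per generation, so estimation error compounds quickly
GENERATIONS = 100

# Generation 0 is trained on "human" data: samples from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()   # "train" this generation's model
    extreme = np.abs(data).max()          # how far out the surviving tails reach
    if gen % 10 == 0:
        print(f"gen {gen:3d}: mu={mu:+.2f}  sigma={sigma:.2f}  largest |x|={extreme:.2f}")
    # The next generation never sees the original data, only samples
    # generated by the model we just fit.
    data = rng.normal(loc=mu, scale=sigma, size=N)

If you run it, you will typically see the estimated spread shrink and the mean drift over the generations, so the late generations bear little resemblance to the original distribution. That is the same qualitative behavior the quote above describes, in the simplest possible setting.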
So what does this mean for the future? And more importantly, could the findings in this study lead to a dystopian future similar to the one I described in the introduction – where we begin seeing broad systemic failures across society? I would argue that such a future is a very real possibility if three specific premises hold true.
First, that the findings of the study are reliable. Second, that in the future, there will be more AI-created content than human-created content on the Internet. And finally, that we continue to increasingly integrate newer generations of generative AI systems into more and more processes and operations that impact our daily lives.
If each of these three assumptions holds true, then we could very well see the dystopian future I previously described. More concerning is that all of these assumptions are very reasonable. Let’s consider each of them separately.
The first assumption was that the conclusions from this white paper are reliable. I will ultimately leave this one for you to decide – and I have included a link to the white paper in the show notes. But if you read through it, I believe you will find the sources to be credible, and the conclusions to be very well substantiated through testing and analysis.
The second assumption is that in the future, there will be more AI-generated content than human-generated content. As it currently stands, the vast majority of content across the Internet (both text and graphics) was created by humans. But this will likely no longer be the case a decade from now (or possibly even sooner). People are already using generative AI models to rapidly produce new content, and they are already flooding the Internet with that content. Generative AI models allow content to be created at a scale and speed far beyond what any human creator could ever hope to achieve. As such, it is reasonable to assume that within the next decade, we will pass a threshold where the majority of text and graphical content on the Internet is created by AI models rather than humans.
The final assumption is that we will continue to integrate newer generations of these models into more and more processes and operations that impact our daily lives. There is little doubt that this assumption will hold true. We are already seeing the rapid adoption and integration of this new wave of AI models into many different areas of both our personal and professional lives.
So if all of these assumptions hold true, we could see broad, wide-scale failures across society at some point in the coming decades. Consider how these models are made. Leading large language models like ChatGPT or Bard are trained by aggregating large amounts of text from across the Internet, preprocessing that data, encoding it, and then feeding it into an extraordinarily large neural network to construct the model. Graphical models like Stable Diffusion, Midjourney, and DALL-E work in similar ways, but the input data consists of images aggregated from across the Internet.
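For the show notes, here is a deliberately oversimplified sketch of that pipeline. The toy corpus, the whitespace “tokenizer”, and the bigram-count “model” are all stand-ins for the real web-scale crawl, subword tokenization, and neural network training; the only point is that the resulting model is entirely a product of whatever data gets aggregated.

from collections import Counter

def collect_corpus():
    # In reality this is a massive crawl of text from across the Internet; here it is a stub.
    return ["the cat sat on the mat", "the dog sat on the log"]

def tokenize(text):
    # Real systems use subword tokenizers (e.g., byte-pair encoding); we just split on whitespace.
    return text.split()

def train(corpus):
    # Real systems fit billions of neural-network parameters; counting which token follows
    # which is enough to show that the "model" is purely a function of its training data.
    bigram_counts = Counter()
    for document in corpus:
        tokens = tokenize(document)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return bigram_counts

model = train(collect_corpus())
print(model.most_common(3))

If the crawl feeding collect_corpus() becomes dominated by model-generated text, then everything downstream is trained on model output, which is exactly the recursive setup the paper studies.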
As we continue to make use of these tools, the content on the Internet will soon become predominantly the output of other models. Within the next 5 to 10 years, we will likely cross that threshold where most content is AI-generated. In the meantime, we will continue integrating this AI technology into every part of our lives. And just like in the study described in the “Curse of Recursion” paper, the output from these generative models will then be used to train future models. If we assume another 5 to 10 years to train multiple generations of models on the output from previous generations, then we might start seeing the early signs of model collapse within the next decade or two. Not long after that, we could start seeing broad catastrophic failures as we enter the stages of late model collapse.
On the surface, this might seem like an easy problem to solve: just don’t use AI-generated content to train future models. But this unfortunately points to another problem: the problem of provenance. There is currently no reliable process for determining the origin or source of content (specifically, whether content is human- or AI-generated). I’ll finish here by reading one final piece from that white paper.
[quote]
In our work we demonstrate that training on samples from another generative model can induce a distribution shift, which over time causes Model Collapse. This in turn causes the model to mis-perceive the underlying learning task. To make sure that learning is sustained over a long time period, one needs to make sure that access to the original data source is preserved and that additional data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions around the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance.
[end quote]
It is not hard to see the potential challenges involved in a “community-wide coordination” effort to track provenance. Many people are undoubtedly already passing off AI-generated content as their own original and creative work. And there is no reason to think that this trend will not continue. This unfortunately will not be an easy problem to solve. And it is one more example of the many emerging risks that our society faces, as we increasingly integrate AI technologies into processes that impact our daily lives.
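To make the provenance problem a bit more tangible in the show notes, here is a purely hypothetical sketch. The WebDocument type and its provenance labels are made up; no reliable labeling like this exists at Internet scale today, which is precisely the problem. If it did exist, filtering training data by origin would be trivial:

from dataclasses import dataclass

@dataclass
class WebDocument:
    text: str
    provenance: str  # hypothetical label: "human-verified", "llm-generated", or "unknown"

def filter_for_training(documents):
    # Keep only content we can attribute to humans. In the real world, most crawled
    # content would land in "unknown", because there is no reliable provenance signal.
    return [doc.text for doc in documents if doc.provenance == "human-verified"]

corpus = [
    WebDocument("a blog post written in 2019", "human-verified"),
    WebDocument("an article produced by a chatbot", "llm-generated"),
    WebDocument("a forum comment of unclear origin", "unknown"),
]
print(filter_for_training(corpus))

The hard part is everything the sketch assumes away: who assigns the labels, how they are verified, and how they survive copy-and-paste across the Internet. That is why the authors point to community-wide coordination.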
And that’s all for today. As always, this is Hutch – broadcasting from the last bastion of the human resistance. Thank you all for listening and we will catch you on the next one. Over and out!