There is no need to worry that your secret ChatGPT conversations were obtained in a recent breach of OpenAI’s systems. The hack itself, while troubling, appears to have been superficial, but it serves as a reminder that AI companies have quickly become one of the juiciest targets for hackers.
The New York Times reported the hack in more detail after former OpenAI employee Leopold Aschenbrenner recently alluded to it on a podcast. He called it a “major security incident,” but anonymous sources at the company told the Times that the hacker only gained access to an employee chat room. (I’ve reached out to OpenAI for confirmation and comment.)
No security breach should be considered trivial, and listening to internal discussions about OpenAI’s development certainly has its uses. But this isn’t about a hacker gaining access to internal systems, models in development, secret roadmaps, etc.
But this should scare us anyway, and not necessarily because of the threat of China or other adversaries overtaking us in the AI arms race. The fact is that these AI companies have become the gatekeepers of a huge amount of very valuable data.
Let’s talk about three types of data that OpenAI and, to a lesser extent, other AI companies create or have access to: high-quality training data, bulk user interactions, and customer data.
It’s unclear what training data these companies have, as they’re extremely secretive about their stash. But it’s a mistake to think that it’s just big piles of scraped web data. Yes, they use web scrapers or datasets like the Pile, but turning that raw data into something that can be used to train a model like GPT-4o is a gargantuan task. A huge number of human labor hours are required to achieve this; it can only be partially automated.
Some machine learning engineers have theorized that of all the factors that go into creating a great language model (or, perhaps, any transformer-based system), the most important is the quality of the dataset. That’s why a model trained on Twitter and Reddit will never be as eloquent as a model trained on every book published in the last century. (And probably why OpenAI reportedly used questionable sources, such as copyrighted books, in its training data, a practice it claims to have abandoned.)
So the training datasets that OpenAI creates are invaluable to competitors, whether they’re other companies, adversary states, or regulators here in the United States. Wouldn’t the FTC or the courts like to know exactly what data was used and whether OpenAI was honest about it?
But perhaps even more valuable is OpenAI’s treasure trove of user data: likely billions of ChatGPT conversations across hundreds of thousands of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT takes the pulse of a population that may not be as large as Google’s user universe, but offers much more depth. (In case you didn’t know, unless you opt out, your conversations are used for training data.)
In Google’s case, an increase in searches for “air conditioners” indicates that the market is heating up a bit. But these users aren’t having a real conversation about what they want, how much they’re willing to spend, what their home looks like, which manufacturers they want to avoid, etc. You know this is valuable because Google itself is trying to get its users to provide this same information by replacing searches with AI interactions!
Think about how many conversations people have had with ChatGPT and how useful that information is, not only for AI developers, but also for marketing teams, consultants, analysts… it’s a goldmine.
The last category of data is perhaps the most valuable in the open market: how customers actually use AI and the data they themselves have provided to the models.
Hundreds of large companies and countless smaller businesses use tools like OpenAI’s and Anthropic’s APIs for a wide variety of tasks. And for a language model to be useful to them, it usually needs to be fine-tuned or have access to their own internal databases.
These can be things as mundane as old budgets or personnel files (to make them more easily searchable, for example) or as valuable as the code for a new software product. What they do with the AI capabilities (and whether they are actually useful) is their business, but the fact is that the AI vendor has privileged access, just like any other SaaS product.
These are trade secrets, and AI companies are suddenly at the heart of many of them. What’s new on this side of the industry carries a particular risk as AI processes are simply not yet standardized or fully understood.
Like any SaaS provider, AI companies are perfectly capable of providing industry-standard levels of security, privacy, on-premises options, and generally delivering their services responsibly. I have no doubt that OpenAI’s Fortune 500 customers have their private databases and API calls locked down very tightly! They should certainly be equally, if not more, aware of the risks inherent in handling confidential data in the context of AI. (OpenAI’s failure to report this attack is their choice, but it doesn’t inspire confidence in a company that desperately needs it.)
But good security practices don’t change the value of what they’re meant to protect, or the fact that malicious actors and various adversaries are clawing at the door to get in. Security isn’t just about choosing the right settings or keeping your software up to date, although of course the basics are important too. It’s a never-ending game of cat and mouse that is, ironically, now supercharged by AI itself: attack agents and automatons probe every corner of these companies’ attack surfaces.
There’s no reason to panic: companies with access to a lot of personal or commercial data have faced and managed similar risks for years. But AI companies represent a newer, younger, and potentially juicier target than your misconfigured corporate server or irresponsible data broker. Even a hack like the one described above, with no known serious exfiltration, should be a cause for concern for anyone doing business with AI companies. They’ve painted targets on their backs. Don’t be surprised if someone, or everyone, tries their luck.