One of the key ingredients that made ChatGPT a resounding success was an army of human trainers who gave the artificial intelligence model behind the bot advice on what constitutes good and bad output. OpenAI now says that adding even more AI into the mix, to help those human trainers, could make AI assistants smarter and more reliable.
In developing ChatGPT, OpenAI pioneered the use of reinforcement learning with human feedback, or RLHF. This technique uses data from human testers to fine-tune an AI model so that its output is judged to be more consistent, less objectionable, and more accurate. The ratings given by the trainers feed into an algorithm that determines the model’s behavior. This technique has proven crucial both in making chatbots more reliable and useful and in preventing them from misbehaving.
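To make that idea concrete, here is a minimal, hypothetical sketch of the preference-learning step at the heart of RLHF: a small reward model is trained so that the responses human trainers preferred score higher than the ones they rejected, and that reward model then stands in for the trainers during reinforcement learning. The embedding size, toy data, and training loop below are illustrative stand-ins, not OpenAI's implementation.

```python
# Minimal sketch of the RLHF preference step: train a reward model so that
# responses human trainers preferred score higher than rejected ones.
# All shapes and data here are toy placeholders, not OpenAI's pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 128  # stand-in for whatever representation the base model produces

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar 'how good is this' score."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

reward_model = RewardModel(EMB_DIM)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Pretend dataset: embeddings of the response each trainer chose vs. rejected.
chosen = torch.randn(32, EMB_DIM)
rejected = torch.randn(32, EMB_DIM)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push preferred responses to score higher.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores candidate outputs during RL fine-tuning,
# standing in for the human ratings described above.
```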
“RLHF works very well, but it has some important limitations,” says Nat McAleese, a researcher at OpenAI involved in the new work. For one thing, human feedback can be inconsistent. For another, it can be difficult even for skilled humans to evaluate extremely complex output, such as sophisticated software code. The process can also optimize a model to produce output that merely seems convincing rather than output that is actually accurate.
OpenAI developed the new model by tweaking its most powerful offering, GPT-4, to help human trainers tasked with evaluating code. The company found that the new model, called CriticGPT, could catch bugs that humans missed, and that human judges preferred its code critiques 63 percent of the time. OpenAI says it will look at extending the approach to areas beyond code in the future.
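OpenAI has not published CriticGPT's internals or offered an API for it, but the workflow described above can be sketched as a hypothetical pipeline: a critic model annotates a model-written code sample with suspected bugs, and those notes are shown to the human rater alongside the code. The `query_critic` function and the data structures below are illustrative assumptions, not a real interface.

```python
# Hypothetical sketch of critique-assisted labeling: a critic model flags
# suspected bugs in a model-written code sample, and its comments are shown
# to the human trainer alongside the code. `query_critic` is a placeholder,
# not a real OpenAI endpoint.
from dataclasses import dataclass

@dataclass
class Critique:
    line: int      # line number the critic is pointing at
    comment: str   # the critic's explanation of the suspected bug

def query_critic(code: str) -> list[Critique]:
    """Placeholder for a call to a critique model such as CriticGPT.
    Here it just hard-codes one finding so the pipeline runs end to end."""
    return [Critique(line=3, comment="Division by zero when `values` is empty.")]

def review_packet(code: str) -> str:
    """Bundle the code and the critic's notes into what a human rater reviews."""
    critiques = query_critic(code)
    notes = "\n".join(f"  line {c.line}: {c.comment}" for c in critiques)
    return f"--- candidate code ---\n{code}--- critic's notes ---\n{notes or '  (none)'}"

sample = """def mean(values):
    total = sum(values)
    return total / len(values)
"""

print(review_packet(sample))
# The human trainer still makes the final judgment; the critic's notes are
# only meant to surface bugs the trainer might otherwise miss.
```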
“We are starting to work to integrate this technique into our RLHF chat stack,” says McAleese. He notes that the approach is not foolproof, since CriticGPT can itself make errors by hallucinating, but he adds that the technique could help make OpenAI’s models, as well as tools like ChatGPT, more accurate by reducing errors in human training. It could also prove crucial, he says, in helping AI models become much smarter, because it could allow humans to help train AI that exceeds their own capabilities. “And as the models continue to improve, we think people will need more help,” McAleese says.
The new technique is one of several now being developed to improve large language models and extract more capabilities from them. It is also part of a broader effort to ensure that AI behaves acceptably even as it becomes more capable.
Earlier this month, Anthropic, an OpenAI rival founded by former OpenAI employees, announced a more efficient version of its own chatbot, Claude, thanks to improvements to the model’s training regimen and the data it is fed. Anthropic and OpenAI have also both recently touted new ways of inspecting AI models to understand how they arrive at their output, the better to prevent unwanted behavior such as deception.
The new technique could help OpenAI train increasingly powerful AI models while ensuring their output is more reliable and better aligned with human values, especially if the company succeeds in deploying it in areas beyond code. OpenAI has said it is training its next major AI model, and the company is evidently keen to show that it takes the behavior of its models seriously. That concern follows the dissolution of a prominent team dedicated to assessing the long-term risks posed by AI. The team was co-led by Ilya Sutskever, a company co-founder and former board member who briefly pushed CEO Sam Altman out of the company before reversing course and helping him regain control. Several members of that team have since criticized the company for taking risks as it rushes to develop and commercialize powerful AI algorithms.
Dylan Hadfield-Menell, a professor at MIT who studies ways to align AI, says the idea of having AI models help train more powerful models has been around for some time. “It’s a pretty natural evolution,” he says.
Hadfield-Menell notes that the researchers who originally developed the techniques used for RLHF discussed related ideas several years ago. It remains to be seen, he says, how broadly applicable and powerful the approach will prove. “This could lead to big advances in individual capabilities, and it could be a stepping stone toward some kind of more effective feedback in the long term,” he says.