Chatbots use large language models, or LLMs, that consume vast amounts of text from the internet and can be used for various tasks, including generating text by predicting the next word in a sentence. The bots find patterns through trial and error, and human feedback is then used to fine-tune the model.
To test this, Farquhar and his colleagues asked a chatbot questions, then used a second chatbot to review the responses for inconsistencies, similar to the way police might try to trip up a suspect by asking them the same question over and over. If the responses had vastly different meanings, that meant they were probably garbled.
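The consistency check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' code: the function names are invented here, and `toy_same_meaning` is a crude stand-in for the second chatbot, which in the study judged whether two answers meant the same thing. The score is a standard entropy over groups of answers: it is zero when every answer lands in one group and rises as the answers scatter.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Group answers so each cluster holds answers judged to mean the same thing."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([answer])
    return clusters


def meaning_entropy(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy over meaning clusters: 0.0 when all answers agree, higher otherwise."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return sum(
        -(len(c) / total) * math.log(len(c) / total) for c in clusters
    )


def toy_same_meaning(a: str, b: str) -> bool:
    """Hypothetical stand-in for the second chatbot: ignores case and trailing dots."""
    def normalize(text: str) -> str:
        return text.lower().strip(" .")
    return normalize(a) == normalize(b)


# Made-up answers to the same question, asked three times.
consistent = ["Paris", "paris.", "Paris"]
scattered = ["Paris", "Lyon", "Rome"]

consistent_score = meaning_entropy(consistent, toy_same_meaning)
scattered_score = meaning_entropy(scattered, toy_same_meaning)
```

In this toy example the three matching answers form a single group and score zero, while the three different answers form three groups and score higher, which is the signal that a response may be unreliable.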
He said the chatbot was asked a set of common trivia questions, as well as elementary school math word problems.
The researchers cross-checked the accuracy of the chatbot evaluation by comparing it against human evaluation on the same subset of questions. They found the chatbot agreed with the human raters 93 percent of the time, while the human raters agreed with one another 92 percent of the time — close enough that chatbots evaluating each other was “unlikely to be concerning,” Farquhar said.
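For readers curious how such an agreement rate is calculated, a minimal sketch follows. The verdicts below are invented for illustration and are not the study's data, and `percent_agreement` is a name chosen here: the function simply counts how often two raters gave the same verdict on the same questions.

```python
from typing import Sequence


def percent_agreement(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Share of questions, in percent, on which two raters gave the same verdict."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must judge the same set of questions.")
    if not rater_a:
        raise ValueError("At least one question is required.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Invented verdicts (True = "answer is correct") on four questions.
chatbot_verdicts = [True, True, False, True]
human_verdicts = [True, False, False, True]

agreement = percent_agreement(chatbot_verdicts, human_verdicts)
```

Here the two raters match on three of four questions, so the agreement is 75 percent. The study's comparison rests on the same idea: the chatbot's agreement with human raters was about as high as the humans' agreement with one another.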
Farquhar said that for the average reader, identifying some AI errors is “pretty hard.”
He often has difficulty spotting such anomalies when using LLMs for his work because chatbots are “often telling you what you want to hear, inventing things that are not only plausible but would be helpful if true, something researchers have labeled ‘sycophancy,’” he said in an email.
Unreliable answers are a barrier to the widespread adoption of AI chatbots, especially in medical fields such as radiology where they “could pose a risk to human life,” the researchers said. They could also lead to fabricated legal precedents or fake news.
Not everyone is convinced that using chatbots to evaluate the responses of other chatbots is a great idea.
In an accompanying News and Views article in Nature, Karin Verspoor, a professor of computing technologies at RMIT University in Melbourne, Australia, said there are risks in “fighting fire with fire.”
The number of errors produced by an LLM appears to be reduced if a second chatbot groups the answers into semantically similar clusters, but “using an LLM to evaluate an LLM-based method does seem circular, and might be biased,” Verspoor wrote.
“Researchers will need to grapple with the issue of whether this approach is truly controlling the output of LLMs, or inadvertently fueling the fire by layering multiple systems that are prone to hallucinations and unpredictable errors,” she added.
Farquhar sees it “more like building a wooden house with wooden crossbeams for support.”
“There’s nothing unusual about having reinforcing components supporting each other,” he said.