Which AI Hallucinates The Most? (And Why You Can't Trust ChatGPT-Style Chatbots)

AI is already impacting our lives and has the potential to develop even further. As amazing as the technology is, though, large language model (LLM) AIs have a pretty big downside: they tend to make outlandish things up and present them as the truth. This phenomenon is known as "AI hallucination," and it involves the generative AI coming to the wrong conclusion when drawing on its data bank. The models work by finding common patterns across a huge bank of information and using those patterns to respond to prompts.

The problem is compounded by the model's ability to "lie" convincingly, which is itself a byproduct of its programming. LLMs are designed to interact with people in a conversational, human-like fashion, which can make inaccuracies seem all the more genuine. Then there is the model's aversion to simply saying "I don't know," which leads it to force an answer where one doesn't really fit. While combating AI hallucination may take a while, we can at least be more aware of it. Double-checking the information you're presented with should be standard practice. Bard, one of the most popular chatbots, comes with a "Google It" button that can be used to quickly verify the information the AI is providing.

All bots are prone to this phenomenon, but some are noticeably worse than others. Research group Arthur AI has tested many of the more popular options and ranked them based on how prone they are to hallucination. So, at least you'll know which ones to swerve if you want to improve your chances of not being pulled into some kind of warped robot fantasy.

Cohere's model was the worst, GPT-4 the best

The experiment involved a set of "challenging questions" across three categories: Combinatorial Mathematics, U.S. Presidents, and Moroccan Political Leaders. Each model gave three responses per question, and those responses were compared against a pre-prepared, accurate answer. The models tested were OpenAI's GPT-3.5 and GPT-4, Claude-2 from Anthropic, Llama-2 from Meta, and the Command model from Cohere. You may notice a couple of omissions: neither Google's Bard nor anything from Amazon was part of the experiment.
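Arthur AI hasn't published its exact grading code alongside the write-up, but the scoring the experiment describes is simple enough to sketch. The Python below is a rough illustration only: the function and variable names are made up for this example, and the correctness check is a crude stand-in for whatever comparison the researchers actually used.

    # Rough sketch of the scoring the experiment implies (hypothetical names throughout).
    def is_correct(response: str, reference: str) -> bool:
        # Crude stand-in: count the response as correct if it contains the reference answer.
        return reference.lower() in response.lower()

    def score_model(answers: dict[str, list[str]], references: dict[str, str]) -> dict:
        # answers maps each question to the model's three responses;
        # references maps each question to the vetted correct answer.
        correct = hallucinated = abstained = 0
        for question, responses in answers.items():
            for response in responses:
                if not response.strip():
                    abstained += 1        # the model declined to answer
                elif is_correct(response, references[question]):
                    correct += 1
                else:
                    hallucinated += 1     # a confident but wrong answer
        attempted = correct + hallucinated
        return {
            "correct": correct,
            "hallucinated": hallucinated,
            "abstained": abstained,
            "accuracy": correct / attempted if attempted else 0.0,
        }

Counting a blank or refused response separately matters here, because abstaining is a very different failure mode from hallucinating, a distinction that shows up clearly in the results below.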

Still, the results were clear enough. GPT-4 was the most accurate AI tested, beating its predecessor GPT-3.5 along with every other model in the pool. It had an accuracy rate of 50% or better in two of the three categories, only producing more hallucinations than correct answers on the topic of U.S. Presidents. The same could not be said for Cohere's Command model. Its responses were hallucinations almost every time, with only four correct answers out of 33 on the topic of U.S. Presidents and nothing but pure hallucination on the other two subjects.

Claude-2 and Llama-2 were the most likely to abstain from answering rather than risk a hallucination, and both did so heavily on the subject of Moroccan Political Leaders. Claude-2 also outperformed GPT-4 on the subject of U.S. Presidents. And while Cohere's model may have been the most prone to hallucination, GPT-3.5 wasn't far behind, so be careful if you're a fan of the free version of ChatGPT.