Galileo's LLM Hallucination Index crowned Claude 3.5 Sonnet as the best LLM for RAG

Acknowledging that retrieval-augmented generation has quickly become one of the most popular techniques in generative AI solutions, Galileo has devoted the latest edition of its LLM Hallucination Index to evaluating the accuracy of 22 leading open and closed models on RAG tasks with contexts of varying length. After preparing realistic datasets for short-, medium-, and long-context retrieval, the team behind the report evaluated each model for factual accuracy and closed-domain hallucinations using its Context Adherence evaluation model and the ChainPoll evaluation methodology.
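
ChainPoll, as Galileo describes it, scores a response by asking a judge LLM, via a chain-of-thought prompt, whether the response is supported by the retrieved context, then polling several independent judgments and averaging them. The following Python sketch illustrates that general idea; the judge model, prompt wording, and verdict parsing are illustrative assumptions, not Galileo's actual implementation.

```python
# Illustrative ChainPoll-style adherence scorer (a sketch, not Galileo's code).
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """\
Context:
{context}

Response:
{response}

Does the response contain any claim that is not supported by the context?
Think step by step, then end with a single line reading VERDICT: YES or VERDICT: NO."""


def chainpoll_adherence(context: str, response: str, polls: int = 5) -> float:
    """Return the fraction of chain-of-thought judgments that find the
    response fully supported by the context (1.0 = no hallucination found)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable judge model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, response=response),
        }],
        n=polls,          # poll several independent chains of thought
        temperature=1.0,  # diversity across polls is the point of polling
    )
    # A "NO" verdict means the judge found no unsupported claims.
    supported = sum(
        1 for choice in completion.choices
        if "VERDICT: NO" in (choice.message.content or "").upper()
    )
    return supported / polls
```

Averaging several sampled judgments smooths out the noise of any single judge completion, which is the core intuition behind the polling step.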

Several insights emerged from the analysis. Some, such as the finding that open models are closing the performance gap with proprietary ones, or that smaller models can be just as capable as larger ones, confirm trends that had already gained traction. Others, such as the observation that most models do remarkably well on long-context tasks, or that Anthropic's models consistently outperformed OpenAI's, come as more of a surprise. In particular, Anthropic's Claude 3.5 Sonnet earned the title of best model overall thanks to its combination of price and performance. Notably, the long-context tasks, which used contexts of up to 100K tokens, exercised only half of the model's supported context window, suggesting that it could process even larger contexts without losing accuracy.

OpenAI's complete absence from the leaderboards may be the most surprising result. Google's Gemini 1.5 Flash emerged as a worthy competitor to Claude 3.5 Sonnet, earning the title of most affordable model tested. Qwen2-72B-Instruct was the top performer on the short- and medium-context tasks, which earned it the title of best open model overall. Meta's Llama-3-70b-instruct offered strong performance, but its restricted context window is an important limitation. Galileo plans to update the Hallucination Index every two quarters, so the next installment will likely evaluate Meta's new Llama 3.1 models, including the updated 70B and 8B parameter models featuring a 128K-token context window.