A Blog by Jonathan Low


Mar 7, 2025

Tech Companies Are Cheating On Tests To "Measure" AI As Its Progress Slows

Two major concerns are emerging about AI, even as big tech and VC investments grow and the Silicon Valley hype machine abandons all restraint.

The first concern is that AI progress, i.e., gains in intelligence and capability, is slowing as evidence mounts that the technology is running into limits. The second is that big tech and others heavily invested in selling AI as a breakthrough technology are cheating on the tests that measure its progress by training their AIs on the very benchmark tests used to gauge it, thereby tainting the results. This is akin to giving a student the answers to a quiz before taking it. All of which raises questions about whether AI is truly 'the future,' as its proponents claim, or an interesting but not cataclysmic technology being propagated by those who have invested heavily in it and now need a way out. JL

Alex Reisner reports in The Atlantic:

There is growing evidence that AI progress is slowing and that the LLM-powered chatbot may be near its peak. How much is it actually improving? How much better can it get? These questions are nearly impossible to answer because the tests that measure AI progress are not working. Published studies show that ChatGPT, DeepSeek, Llama, Mistral, Google’s Gemma, Microsoft’s Phi, and Alibaba’s Qwen have been trained on benchmark tests, tainting the legitimacy of their scores. The problem is known as benchmark contamination. GPT-4 did well on questions published online before September 2021, text on which it was trained. After that date, its performance tanked, leading researchers to suggest it had memorized the questions, “casting doubt on its actual reasoning abilities.”

Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its “largest and best model for chat yet.” Earlier in February, Google called its latest version of Gemini “the world’s best AI model.” And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI’s o1 model—which Sam Altman had called “the smartest model in the world” the previous month.

Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak. This is troubling, given that the promise of advancement has become a political issue; massive amounts of land, power, and money have been earmarked to drive the technology forward. How much is it actually improving? How much better can it get? These are important questions, and they’re nearly impossible to answer because the tests that measure AI progress are not working. (The Atlantic entered into a corporate partnership with OpenAI in 2024. The editorial division of The Atlantic operates independently from the business division.)


Unlike conventional computer programs, generative AI is designed not to produce precise answers to certain questions, but to generalize. A chatbot needs to be able to answer questions that it hasn’t been specifically trained to answer, like a human student who learns not only the fact that 2 x 3 = 6 but also how to multiply any two numbers. A model that can’t do this wouldn’t be capable of “reasoning” or making meaningful contributions to science, as AI companies promise.


Generalization can be tricky to measure, and trickier still is proving that a model is getting better at it. To measure the success of their work, companies cite industry-standard benchmark tests whenever they release a new model. The tests supposedly contain questions the models haven’t seen, showing that they’re not simply memorizing facts.

Yet over the past two years, researchers have published studies and experiments showing that ChatGPT, DeepSeek, Llama, Mistral, Google’s Gemma (the “open-access” cousin of its Gemini product), Microsoft’s Phi, and Alibaba’s Qwen have been trained on the text of popular benchmark tests, tainting the legitimacy of their scores. Think of it like a human student who steals and memorizes a math test, fooling his teacher into thinking he’s learned how to do long division.

The problem is known as benchmark contamination. It’s so widespread that one industry newsletter concluded in October that “Benchmark Tests Are Meaningless.” Yet despite how established the problem is, AI companies keep citing these tests as the primary indicators of progress. (A spokesperson for Google DeepMind told me that the company takes the problem seriously and is constantly looking for new ways to evaluate its models. No other company mentioned in this article commented on the issue.)


Benchmark contamination is not necessarily intentional. Most benchmarks are published on the internet, and models are trained on large swaths of text harvested from the internet. Training data sets contain so much text, in fact, that finding and filtering out the benchmarks is extremely difficult.


When Microsoft launched a new language model in December, a researcher on the team bragged about “aggressively” rooting out benchmarks in its training data—yet the model’s accompanying technical report admitted that the team’s methods were “not effective against all scenarios.” One of the most commonly cited benchmarks is called Massive Multitask Language Understanding. It consists of roughly 16,000 multiple-choice questions covering 57 subjects, including anatomy, philosophy, marketing, nutrition, religion, math, and programming. Over the past year, OpenAI, Google, Microsoft, Meta, and DeepSeek have all advertised their models’ scores on MMLU, and yet researchers have shown that models from all of these companies have been trained on its questions.
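One reason the filtering is so hard: the decontamination checks described in published model reports are typically n-gram overlap filters, which only catch near-verbatim copies. The sketch below is a generic illustration of that approach in Python, not the method Microsoft’s team actually used.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_questions: list[str], n: int = 13) -> bool:
    """Flag a training document that shares any long n-gram with a benchmark question."""
    benchmark_grams: set[tuple[str, ...]] = set()
    for q in benchmark_questions:
        benchmark_grams |= ngrams(q, n)
    # Any shared 13-word sequence is treated as a verbatim leak.
    return bool(ngrams(document, n) & benchmark_grams)
```

A paraphrased or reformatted benchmark question sails right through a check like this, which helps explain why such methods are “not effective against all scenarios.”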

How do researchers know that “closed” models, such as OpenAI’s, have been trained on benchmarks? Their techniques are clever, and reveal interesting things about how large language models work.

One research team took questions from MMLU and asked ChatGPT not for the correct answers but for a specific incorrect multiple-choice option. ChatGPT was able to provide the exact text of incorrect answers on MMLU 57 percent of the time, something it likely couldn’t do unless it was trained on the test, because the options are selected from an infinite number of wrong answers.
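To picture how such a probe works, here is a minimal sketch in Python. It assumes a hypothetical query_model() helper standing in for a call to the chatbot being audited, and it pulls questions from the publicly available “cais/mmlu” dataset on Hugging Face; the prompt wording is illustrative, not the researchers’ actual protocol.

```python
from datasets import load_dataset

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the chatbot being audited."""
    raise NotImplementedError

def incorrect_option_probe(subject: str = "anatomy", n_questions: int = 50) -> float:
    """Ask the model to reproduce one specific *wrong* MMLU option verbatim."""
    mmlu = load_dataset("cais/mmlu", subject, split="test")
    hits, total = 0, 0
    for row in mmlu.select(range(min(n_questions, len(mmlu)))):
        correct_idx = row["answer"]
        wrong_idx = next(i for i in range(len(row["choices"])) if i != correct_idx)
        prompt = (
            "The following question appears in a public multiple-choice benchmark.\n"
            f"Question: {row['question']}\n"
            f"Reproduce, word for word, incorrect answer option {chr(65 + wrong_idx)} "
            "exactly as it appears in the benchmark. Reply with the option text only."
        )
        reply = query_model(prompt).strip()
        # An exact match to a specific wrong option is hard to explain unless
        # the model saw the benchmark text during training.
        if reply == str(row["choices"][wrong_idx]).strip():
            hits += 1
        total += 1
    return hits / total if total else 0.0
```

The signal is the rate of exact matches: there is no obvious reason a model should reproduce a specific wrong option verbatim unless it saw the benchmark text during training.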

Another team of researchers from Microsoft and Xiamen University, in China, investigated GPT-4’s performance on questions from programming competitions hosted on the Codeforces website. The competitions are widely regarded as a way for programmers to sharpen their skills. How did GPT-4 do? Quite well on questions that were published online before September 2021. On questions published after that date, its performance tanked. That version of GPT-4 was trained only on data from before September 2021, leading the researchers to suggest that it had memorized the questions, “casting doubt on its actual reasoning abilities.” Lending further support to this hypothesis, other researchers have shown that GPT-4’s performance on coding questions is better for questions that appear more frequently on the internet. (The more often a model sees the same text, the more likely it is to memorize it.)
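The before-and-after comparison boils down to splitting problems by publication date and measuring the solve rate on each side of the model’s training cutoff. A rough sketch, assuming a hypothetical evaluate() helper that runs the model on one problem and checks its answer:

```python
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # the GPT-4 data cutoff cited in the study

def evaluate(problem: dict) -> bool:
    """Hypothetical: run the model on one problem and check whether it solves it."""
    raise NotImplementedError

def pass_rates(problems: list[dict]) -> tuple[float, float]:
    """Compare solve rates on problems published before vs. after the cutoff."""
    before = [p for p in problems if p["published"] < TRAINING_CUTOFF]
    after = [p for p in problems if p["published"] >= TRAINING_CUTOFF]

    def rate(group: list[dict]) -> float:
        return sum(evaluate(p) for p in group) / len(group) if group else float("nan")

    # A steep drop after the cutoff suggests memorization rather than reasoning.
    return rate(before), rate(after)
```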


Can the benchmark-contamination problem be solved? A few suggestions have been made by AI companies and independent researchers. One is to update benchmarks constantly with questions based on new information sources. This might prevent answers from appearing in training data, but it also breaks the concept of a benchmark: a standard test that gives consistent, stable results for purposes of comparison.


Another approach is taken by a website called Chatbot Arena, which pits LLMs against one another, gladiator style, and lets users choose which model gives the better answers to their questions. This approach is immune to contamination concerns, but it is subjective and similarly unstable. Others have suggested the use of one LLM to judge the performance of another, a process that is not entirely reliable. None of these methods delivers confident measurements of LLMs’ ability to generalize.
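Chatbot Arena turns those head-to-head votes into a leaderboard using chess-style ratings (the site has described Elo-style and Bradley-Terry scoring). Below is a minimal Elo sketch on a handful of made-up votes; the constants and model names are illustrative, not the site’s actual code.

```python
from collections import defaultdict

K = 32  # update step size (illustrative)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(votes: list[tuple[str, str, str]]) -> dict[str, float]:
    """votes: (model_a, model_b, winner), where winner is model_a, model_b, or 'tie'."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for a, b, winner in votes:
        score_a = 0.5 if winner == "tie" else (1.0 if winner == a else 0.0)
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))
    return dict(ratings)

# Example: three hypothetical user votes between two made-up models.
print(rate([("model-x", "model-y", "model-x"),
            ("model-x", "model-y", "tie"),
            ("model-y", "model-x", "model-y")]))
```

Because the ratings move with every new vote and every new crop of users, the resulting rankings shift over time, which is the instability noted above.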

Although AI companies have started talking about “reasoning models,” the technology is largely the same as it was when ChatGPT was released in November 2022. LLMs are still word-prediction algorithms: They piece together responses based on works written by authors, scholars, and bloggers. With casual use, ChatGPT does appear to be “figuring out” the answers to your queries. But is that what’s happening, or is it just very hard to come up with questions that aren’t in its unfathomably massive training corpora?

Meanwhile, the AI industry is running ostentatiously into the red. AI companies have yet to discover how to make a profit from building foundation models. They could use a good story about progress.
