---
title: "AI hallucinates 86% of the time and solves maths problems that stumped humans for decades: how can both be true?"
author: Marc Diks
date: 2026-06-23
modified: 2026-06-23
category: AI & Strategy
reading_time: 9 min
url: https://www.marcdiks.nl/en/blog/ai-hallucinations-and-mathematical-breakthroughs
canonical: https://www.marcdiks.nl/en/blog/ai-hallucinations-and-mathematical-breakthroughs
language: en
---

# AI hallucinates 86% of the time and solves maths problems that stumped humans for decades: how can both be true?
> **TL;DR**
>
> - **Car Wash Test solved.** Six months ago top models failed a question any child can answer — now they answer it correctly, with dry humour.
> - **86% hallucinations at the same time.** GPT-5.5 confidently gives wrong answers 86% of the time on unknown topics, while simultaneously scoring the highest factual accuracy ever measured.
> - **Mathematical breakthroughs are real.** GPT-5.2 settled Erdős problem #397 in minutes, a result Fields medallist Terence Tao verified as original work.
> - **The absent-minded professor.** AI is brilliant in specific domains and helpless in others — not a bug, but the current architecture.
> - **Get on the train now.** Two years from now, those who learn how AI works today will handle what others spend days on.

I recently typed a question into ChatGPT and Claude that my six-year-old daughter would answer blindfolded: should I walk or drive to the car wash, 50 metres away? A few months ago they told me to walk. Without my car.

But those same models cracked mathematical puzzles this winter that brilliant minds had been struggling with for years. Both are true. And that is exactly what this piece is about.

Back to the car wash for a moment, because it is a test with a name. *"I want to wash my car. The car wash is 50 metres away. Should I walk or drive?"* Had you typed that sentence into any top AI a few months ago, roughly three times out of ten you would have got something like: *"Walk! It is only 50 metres, faster than getting in the car. Better for the environment too, and a short bit of exercise."*

Read that again. Sounds reasonable. Until you realise your car needs to be there to wash it. Walk there and very little gets cleaned.

Welcome to the [Car Wash Test](https://opper.ai/blog/car-wash-test). The viral example from late 2025 showing that AI crashed at the very first tricky question. Not an exotic puzzle, not a philosophical dilemma. Just a question any five-year-old understands: a car does not drive itself to the car wash.

Critics pounce on examples like these as proof that AI is fundamentally worthless and incapable of anything. At the same time I see every day, in my own work, AI doing things that three years ago I would not have seen people accomplish in a week. That exact tension is what I want to untangle here.

## But — I ran the test again myself

Before I continue complaining about AI blunders: I ran the Car Wash Test again in early May 2026, just pasting the literal question into two chat windows. The screenshots below come from that test.

ChatGPT first.

![ChatGPT answer to the Car Wash Test: drive, because your car needs to be there](/ai-hallucinaties-en-wiskundige-doorbraken-01-chatgpt.webp)

> *"Drive. Not because 50 metres is far. But because your car will not get washed otherwise. Only walk there if you want to check whether the car wash is open. Or if your car is already so clean it deserves to stay home."*

Good answer. Logically correct. And there is even a hint of dry humour about the clean car that may stay home.

Then Claude.

![Claude answer to the Car Wash Test: drive, with a joke about pushing the car and a bucket of water](/ai-hallucinaties-en-wiskundige-doorbraken-02-claude.webp)

> *Recognized humorous premise, prepared lighthearted Dutch response.*
>
> *"Driving seems strongly advisable, otherwise you will find yourself standing alone at the car wash while your car sits in the driveway at home. 😅 Unless you were planning to push it, but then you might as well wash it with a bucket of water in the garden."*

Take a look at that grey line: *"Recognized humorous premise"*. That is not a programmed standard response. That is the model itself recognising that the question has a playful angle, and consciously choosing a light tone. With emoji. With that closing joke about a bucket of water in the garden.

This is not a Car Wash Test that AI still fails. This is a Car Wash Test that AI wins within a second. With a grin. In Dutch.

Hold that thought. I will come back to it.

## The parade of AI blunders (where the critics have a point)

Right, not everything is solved. Critics still have a list of examples they brandish proudly. Honestly: it is an impressive list.

Take reading analogue clocks. A study presented at the [International Conference on Learning Representations](https://www.unite.ai/ais-struggle-to-read-analogue-clocks-may-have-deeper-significance/) in 2025 tested the big models (Llama, Claude, Gemini, GPT-4o) on reading an analogue clock. Score: 38.7% correct. That is worse than the average primary school pupil. The models struggle with spatial reasoning: estimating overlapping hands, recognising clock designs, reading angles. Things we do not think about for a second.

Then there is the hallucination paradox, which is almost unbelievable. The [AA Omniscience test by Artificial Analysis](https://artificialanalysis.ai/evaluations/omniscience) measures how often a model confidently gives a wrong answer instead of honestly saying it does not know. On that test GPT-5.5 comes in at 86%, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro. At the same time, that same GPT-5.5 achieves the highest factual score ever measured: 57%. So it knows more. But when it does *not* know something, it is also [more likely to invent something that sounds plausible](https://the-decoder.com/gpt-5-5-tops-benchmarks-but-still-hallucinates-frequently-at-a-20-percent-higher-api-cost/).
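A little arithmetic makes the 57% and the 86% easier to reconcile. The sketch below rests on an assumption: that the hallucination rate is counted only over the questions the model did not get right (wrong answers divided by wrong answers plus abstentions), which is how the description above reads. It also ignores partially correct answers. The function name `breakdown` and the sample of 1,000 questions are purely illustrative and not part of the benchmark.

```python
def breakdown(accuracy: float, hallucination_rate: float, total: int = 1000) -> dict[str, int]:
    """Split `total` questions into correct, wrong and abstained answers.

    Assumption (not confirmed by the benchmark's documentation here):
    hallucination_rate = wrong / (wrong + abstained), so it is measured
    only over the questions the model did NOT answer correctly.
    """
    correct = round(total * accuracy)
    not_correct = total - correct
    wrong = round(not_correct * hallucination_rate)
    abstained = not_correct - wrong
    return {"correct": correct, "wrong": wrong, "abstained": abstained}


# Figures quoted in the text for GPT-5.5: 57% accuracy, 86% hallucination rate.
result = breakdown(accuracy=0.57, hallucination_rate=0.86)
print(result)  # {'correct': 570, 'wrong': 370, 'abstained': 60}

# Share of ALL answers that are confidently wrong under this assumption.
print(f"{result['wrong'] / 1000:.0%}")  # 37%
```

Under these assumptions, roughly 37 in 100 answers on the benchmark are confidently wrong. The 86% describes what the model does when it does not know, not how often it is wrong overall, which is why a record accuracy and a high hallucination rate can sit side by side.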

And before you think this is a static picture: by the time you read this, the world has already shifted. In late May 2026 Anthropic released Claude Opus 4.8, succeeding Opus 4.7 after less than six weeks while largely maintaining its calibration. That is exactly the pace this whole piece is about. The numbers I give here are a snapshot. The direction is not.

And then there is the [Car Wash Test itself](https://cybernews.com/ai-news/ai-car-wash-test/), long the standard evidence that AI recognises patterns but does not think. The model saw "short distance" and mapped that to "walk". But as you saw above, this gap was closed in a few months. The late-2025 internet meme for making AI look foolish has become, by May 2026, a gimme that the models even joke about.

That last point is the crux. The examples critics use to argue that AI is worthless are almost always examples from a previous model generation. The next batch has fixed the problem before the blog post about the failure has even gone viral.

## And yet, and yet, and yet

Here the story turns. Because if you only look at the blunders that still remain, you think: AI is hopeless. A glorified text predictor that makes occasional blunders. Until you look at what is happening simultaneously on a completely different level.

In December 2025, the Erdős website — the place where the open mathematical puzzles of Paul Erdős are tracked — had hundreds of problems with the status "open". Erdős left over a thousand unsolved problems at his death in 1996, some untouched for decades. And since late 2025, problems have been moving from "open" to "solved", with AI listed as a co-solver.

I should be immediately honest here, otherwise I am selling you the same smoke screen as the hype merchants. In October 2025, OpenAI claimed that GPT-5 had solved ten Erdős problems. That was incorrect. The model had retrieved old solutions already in the literature somewhere — no new mathematics. Thomas Bloom, who manages the database, had to correct it and [the claim was withdrawn](https://techcrunch.com/2026/05/20/openai-claims-it-solved-an-80-year-old-math-problem-for-real-this-time/). Much of what circulates as "AI solves maths" turns out on closer inspection to be a literature search with a bow on top.

But in early January 2026 something happened that held up. Neel Somani gave GPT-5.2 Pro the task of tackling Erdős Problem #397, a question from number theory about central binomial coefficients. The model produced a proof within minutes (a refutation, to be precise: it showed the original conjecture was wrong via an infinite family of counterexamples). [Harmonic's Aristotle system](https://www.eweek.com/news/gpt-5-2-just-solved-a-30-year-math-problem/) translated that into Lean, a proof language in which every step is checked by machine, so that an accepted proof leaves no room for a hidden error. And then Fields medallist Terence Tao approved it. That same GPT-5.2 cracked problems #728 and #729 in the same week.

And here is the nuance you will not find in any hype headline — and it comes from Tao himself. He calls these breakthroughs the ["lowest-hanging fruit"](https://www.theneurondaily.com/p/ai-cracks-legendary-erdos-problems): problems solvable with standard techniques, not deep breakthroughs that turn mathematics upside down. That same GPT-5.2 scores 77% on competition mathematics but drops to 25% on open research that requires genuinely new insight. The difference from October is essential though: [according to Tao this is original work](https://www.scientificamerican.com/article/amateur-armed-with-chatgpt-vibe-maths-a-60-year-old-problem/), not reheated literature.

That is exactly why it holds up. Not because AI is suddenly smarter than the smartest mathematicians. But because a model can now, guided by a human, produce a proof in fifteen minutes that then passes the strictest verification. A year ago that was unthinkable.

And there are other domains too. AI finds smarter algorithms for matrix multiplication, a basic operation that the entire AI industry itself runs on. In September 2024, Klaus Meffert used AI to find [a new magic hexagon of order 9](https://dr-dsgvo.de/magische-sechsecke-en/), the first new order since the order-7 solution eighteen years earlier. Standard brute-force calculation methods fell short. An approach in which AI handled the heavy parts of the code did work.

And in my own working day? On a Friday afternoon I built a working app that assesses damage claims based on a claim form and photos of the damage. You enter the data, upload the photos, and the app assesses the claim. Three years ago you needed a team of developers and a full sprint for that. Now it took an afternoon, without writing a single line of code myself.

How do you reconcile these two worlds?

## The absent-minded professor

My favourite comparison is the absent-minded professor.

You know the type. The man who still teaches quantum mechanics at Cambridge at eighty, writes formulas on a blackboard that nobody in the room fully understands, and then leaves the lecture hall with two different socks and without his umbrella, convinced his car is in the other street. While he cycled there.

Brilliant in one domain. Painfully helpless in another. No contradiction. Same brain.

That is how AI works now too. Models like GPT-5.2 can reason at the highest level about number theory and simultaneously stumble over a question that your six-year-old daughter answers with her eyes closed. That is not a bug. That is how these systems are built at this moment.

The deeper reason: these models learn by processing enormous amounts of text and patterns. For mathematical notation, code and structured reasoning, there is so much good input in that text that the model effectively learns to think. For cause and effect in the physical world, spatial insight and the simple understanding that a car does not drive itself to the car wash, that training is absent or shallow. So the model builds patterns that sound naive to humans, because they have no contact with how the world works.

But this is exactly the point: that changes every month. You only have to scroll back to those two screenshots above to see that. A test that six months ago was the internet's go-to meme for making AI look foolish is now answered correctly within a second, joke included. Models with an explicit reasoning mode already score significantly better on these kinds of tests. The next generation closes the remaining gap further.

## Why this matters to you

Here comes the uncomfortable part.

I see a lot of people around me looking at these AI blunders and thinking: *see, it is bullshit, it is hype, it amounts to nothing.* And that is exactly the wrong conclusion.

Because here is what is really happening: a technology that still stumbles over a clock in some places cracks problems in other domains that mathematicians spent years on. Not the hardest ones, I grant you that. But real, verified mathematics. A year ago those models could not build a decent website. Now they do it in twenty minutes, Codex writes entire applications, and in an afternoon you make a design that used to require an agency. And a few months ago Claude still failed the Car Wash Test. Now it jokes about it.

The pace at which AI goes from "dumb" to "good enough" to "better than you" in new domains is faster than human work processes can keep up with. And here comes my warning, especially if your work mostly happens behind a laptop.

**Get on the train now.**

Not in a year. Not when the dust has settled. Now.

Not because you will lose your job tomorrow. But because the people who are learning now how to use AI as an extension of their thinking will, two years from now, handle what takes you days. That is not fair and it is not a pleasant message. But it is how it works.

The absent-minded professor in your laptop is brilliant at much of what he does and helpless at another part, and that helpless part gets smaller every month. Your task is to learn when to trust him and when not to. When to keep typing yourself. When to force him to think first. Which prompts work and which do not. Where he hallucinates and where he delivers gold.

You will not manage that by laughing along from the sidelines at last year's Car Wash Test. You will only manage it by rolling up your sleeves, investing a few hours a week and actually working with these models yourself.

Because yes, AI still makes mistakes. Lots of mistakes. Sometimes dumb mistakes that make you want to slap your forehead.

But meanwhile the same AI cracks mathematical puzzles that people spent years on, in fifteen minutes. And makes jokes about your car.

Both are true. And the people who understand that are the ones who will make the difference in two years.

---

## Sources

- [AI's struggle to read analogue clocks may have deeper significance](https://www.unite.ai/ais-struggle-to-read-analogue-clocks-may-have-deeper-significance/)
- [GPT-5.5 hallucinates frequently](https://the-decoder.com/gpt-5-5-tops-benchmarks-but-still-hallucinates-frequently-at-a-20-percent-higher-api-cost/)
- [AA-Omniscience benchmark](https://artificialanalysis.ai/evaluations/omniscience)
- [GPT-5.2 Just Solved a 30-Year Math Problem](https://www.eweek.com/news/gpt-5-2-just-solved-a-30-year-math-problem/)
- [Three Erdős problems, and Terence Tao's nuance](https://www.theneurondaily.com/p/ai-cracks-legendary-erdos-problems)
- [Experts warn: Erdős as an imperfect AI benchmark](https://www.scientificamerican.com/article/amateur-armed-with-chatgpt-vibe-maths-a-60-year-old-problem/)
- [OpenAI's earlier Erdős claim that did not hold up](https://techcrunch.com/2026/05/20/openai-claims-it-solved-an-80-year-old-math-problem-for-real-this-time/)
- [New solution for magic hexagon of order 9](https://dr-dsgvo.de/magische-sechsecke-en/)
- [The Car Wash Test](https://opper.ai/blog/car-wash-test)
- [AI Car Wash Test](https://cybernews.com/ai-news/ai-car-wash-test/)
- [AI cannot read clocks](https://www.engineersonline.nl/ai-kan-niet-klokkijken/)