Apple Says AI Isn’t Thinking. It's Guessing! You Better Check.

Jun 13, 2025
By Ryan Flanagan

TL;DR
Despite the hype, some AI tools still fail at basic reasoning. A recent Apple research paper found that even today’s most advanced language models often give wrong answers to simple logic questions—because they don’t really understand what they’re saying.

What’s going on with AI and ‘thinking’?

If you’ve used ChatGPT or any similar tool, you’ve likely seen moments of brilliance and others where it confidently makes something up. It might explain physics in plain English, then completely miss the point of a yes/no question. That contradiction is exactly what a group of Apple researchers explored in a recent paper titled The Illusion of Thinking (published May 2025).

They tested large language models (LLMs) like GPT-4 and Claude 3 on tasks that look simple to humans but require real reasoning—like tracking what’s true, what’s false, and how the two interact. The models got many of these questions wrong. Not because they’re broken, but because they work differently than most people assume.

Instead of thinking, they predict. And that difference matters.

So what did Apple actually test?

The researchers gave the AI questions like:

“John has a red ball. Mary does not have a red ball. Does Mary have a red ball?”

And the AI would sometimes say: “Yes.”

This isn’t a trick question. But because language models are trained to predict the most likely next word rather than reason through truth and falsehood, they can fumble even when the logic is simple.

The paper tested hundreds of examples like this. The results were consistent: models that sounded fluent often got basic logic questions wrong. And when the questions got a bit more complex, for example by adding double negatives or stacking more conditions to keep track of, the models broke down further.
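To make that failure mode concrete, here is a toy sketch in Python of what a probe like this can look like: generate questions from a template so the correct answer is known in advance, ask the model, and measure how often the reply is right. This is not the paper's actual benchmark, and `ask_model` is a hypothetical placeholder for whatever LLM call you use.

```python
# Toy logic probe: build questions with a known yes/no answer, ask the model,
# and measure how often the fluent-sounding reply is actually correct.
import random


def ask_model(question: str) -> str:
    """Placeholder: send the question to an LLM and return its text reply."""
    raise NotImplementedError("Connect this to your model or API of choice.")


def make_probe(rng: random.Random) -> tuple[str, str]:
    """Build a John/Mary-style question whose correct answer we already know."""
    mary_has_it = rng.choice([True, False])
    statement = "Mary has a red ball." if mary_has_it else "Mary does not have a red ball."
    question = (
        f"John has a red ball. {statement} "
        "Does Mary have a red ball? Answer yes or no."
    )
    return question, ("yes" if mary_has_it else "no")


def evaluate(n: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        question, expected = make_probe(rng)
        reply = ask_model(question).strip().lower()
        model_says_yes = reply.startswith("yes")
        correct += model_says_yes == (expected == "yes")
    return correct / n


if __name__ == "__main__":
    print(f"Accuracy on simple negation probes: {evaluate():.0%}")
```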

Why does this matter?

Because many businesses are betting on these tools to do real work: summarising contracts, reviewing regulations, answering policy questions, or helping customers navigate legal or financial documents. That holds whether the model is used directly, as part of a workflow, or behind an internal tool.

If your team assumes these models “understand” what they’re saying, you are in trouble.

The Apple team (a brilliant young PhD student and colleagues) didn’t just offer criticism. They built a new kind of test to evaluate whether a model actually grasps truth and logic, rather than just sounding roughly right. Their findings suggest a need for much more caution when deploying AI in general, and LLMs in particular, into tasks where accuracy matters.

Some are calling this a ‘knockout blow’. Is that fair?

A few commentators, including well-known AI critic Gary Marcus, have described the Apple paper as a wake-up call. His point isn’t that LLMs are useless. It’s that we’re using them in ways they’re not built for, and this lands on the same day OpenAI forecast it would reach a billion users within the year.

These tools are great at:

  • Writing emails
  • Drafting creative ideas
  • Summarising large chunks of text

But they’re still unreliable at:

  • Logical reasoning
  • Truth tracking
  • Understanding context across time and space

As Marcus put it: “They sound smart. But they aren’t.” That disconnect creates risk when people start trusting the output too much.

So should you stop using LLMs?

Yes. Sort of. No! But you should change how you think about them.

If you’re using AI to speed up workflows, write content drafts, or help structure thinking, it’s still a powerful tool. But if you’re treating the output as fact, or expecting the model to reason through complex issues reliably, you may be overreaching, no matter how much RAG or prompt engineering you layer on top. Beware the ‘One Prompt to Rule Them All’ salesmen parading their killer prompts on LinkedIn.

Here’s a simple principle from this research:
If the task needs logic, truth, or judgement, double-check it. The same way you checked your own work at uni before turning in a paper, or the way you check your kid's homework today.
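If you want to operationalise that principle, here is a minimal sketch: tag each task with whether it hinges on logic, truth, or judgement, and route those outputs to a human check before anyone relies on them. The task labels and the `needs_review` helper are illustrative assumptions, not a prescribed workflow from the research.

```python
# Minimal sketch of the "double-check it" principle: anything that needs
# logic, truth, or judgement gets held for human review before it is used.
from dataclasses import dataclass

# Task types where the output hinges on being right, not just sounding right.
# These labels are illustrative assumptions, not an official taxonomy.
NEEDS_DOUBLE_CHECK = {"logic", "fact", "legal", "judgement"}


@dataclass
class ModelOutput:
    task_type: str  # e.g. "draft", "fact", "legal"
    text: str       # whatever the LLM produced


def needs_review(output: ModelOutput) -> bool:
    """True if a human should verify this output before anyone relies on it."""
    return output.task_type in NEEDS_DOUBLE_CHECK


def handle(output: ModelOutput) -> None:
    if needs_review(output):
        print(f"[HOLD] Route to a human reviewer: {output.text[:60]!r}")
    else:
        print(f"[OK] Fine to use as a starting draft: {output.text[:60]!r}")


if __name__ == "__main__":
    handle(ModelOutput("draft", "Here is a first pass at the client email..."))
    handle(ModelOutput("fact", "The contract allows termination with 30 days notice."))
```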

What’s next?

This paper won’t be the last to challenge assumptions about how “smart” LLMs really are. But it’s an important moment for business leaders, content teams, educators, and anyone trying to adopt AI responsibly.

It also reinforces why we need:

  • AI policies inside organisations aligned to global standards like ISO
  • Human oversight in critical workflows
  • Better tools to check accuracy, not just fluency

It’s easy to be dazzled by how human these models sound.

But the real question isn’t: “Can it talk?” It’s: “Does it understand?”

Right now, the answer is: not always.

Want to build AI into your workflows without getting caught out?

Join our next AI Fundamentals Masterclass
Or start with our AI Readiness Assessment to see where you stand