AI Models Still Can’t Think. And Now We Can Prove It.

Jun 14, 2025 | By Ryan Flanagan

TL;DR
Even the smartest AI models today don’t really reason. They remember. They summarise. They sound fluent.
But when logic is buried in a long, messy story? They fail.

A new paper by Alex Pan (who I worked with at IBM in 2017) and Mary-Anne Williams introduces Verbose ListOps (VLO), a test with a cool name that shows exactly where and how today's top models break. It explains why bigger models and longer context windows aren't enough, and what businesses need to check before trusting AI with decision-making.

What’s the real problem with LLMs?

Large Language Models like GPT, Claude, or Gemini can write fluent emails, summarise PDFs, and answer questions at speed.
That fluency is often mistaken for understanding.

But here’s the catch.
Fluent text is not the same as reasoned thinking.
And pulling facts from a document doesn’t mean the model understands the argument inside it.

How do you know if AI can really think?

Benchmarking is how AI labs test model performance.
They ask things like:

  • Can it solve basic logic puzzles?
  • Can it summarise across documents?
  • Can it find a fact hidden in long content?

These tests get reported as impressive scores. For example, GPT-4 might score 98 percent on Grade 8 science problems. But these scores come from clean inputs with clean answers.

Real work isn’t like that.

Most business logic is buried in the middle of messy, multi-layered information: the copy-paste you drop into Gemini, or the 88-page PDF on how to market to middle-class families!

And that’s where current models start to break.

What does reasoning look like in real business tasks?

Think about the kind of problems your team faces:

  • Reviewing a 15-page contract with inconsistent clauses
  • Analysing a series of customer support emails over time
  • Connecting cause and effect across incident reports
  • Synthesising different versions of events in a compliance audit

These aren’t prompt-response tasks. They require narrative reasoning: understanding what happened, across time, with ambiguity and distraction in play.
Today’s models struggle here, most benchmarks don’t test for it, and most users don’t know how to check for it.

A new kind of benchmark: Verbose ListOps (VLO)

How cool is that name! The legendary Alex Pan and Mary-Anne Williams built VLO to test comprehension inside complex, realistic language. It is a logic benchmark, but with a twist.

VLO embeds a logic task inside a long, human-sounding story.
It simulates the conditions of real-world work:

  • Logic buried in multi-paragraph content
  • Distracting but plausible noise
  • No giveaways or step-by-step prompts
  • Semantic complexity that forces actual reasoning

As Alex and Mary-Anne explain:
“Our benchmark tests the ability of models to infer, calculate, and reason within semantically complex and verbose contexts... This is key to evaluating comprehension, not just context retention.”
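For context, ListOps is a long-standing logic benchmark built from nested operations such as MAX, MIN, and SM (sum modulo 10). Below is a rough sketch of the kind of task that sits underneath VLO before the narrative wrapping goes on; this is my own illustration of the idea, not the authors' generator or data:

```python
# Rough illustration of the logic task underneath VLO.
# NOTE: my own sketch of the idea, not the authors' code or data.

def list_max(*args): return max(args)
def list_min(*args): return min(args)
def sum_mod(*args): return sum(args) % 10   # ListOps "SM": sum modulo 10

# A plain ListOps problem: [SM [MAX 2 9 1] [MIN 4 7] 6]
print(sum_mod(list_max(2, 9, 1), list_min(4, 7), 6))   # (9 + 4 + 6) % 10 = 9

# VLO's twist: the same computation is hidden in a long, plausible story, e.g.
# "...the regional team reported figures of 2, 9 and 1, and only the highest
# was carried forward; the two vendor quotes of 4 and 7 were compared and the
# lower one accepted; head office then added a flat 6 and kept only the last
# digit of the total..."
# The model has to recover that nested structure from prose and answer 9,
# while ignoring paragraphs of realistic but irrelevant detail.
```

Asked as the bare bracketed expression, frontier models get this right almost every time. Asked as the story, the results below show what happens.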

What did the benchmark test reveal?

They tested seven of the most advanced language models available today: GPT-4.1, Claude 3.7 Sonnet, Gemini 2.5, DeepSeek V3, Qwen 3, LLaMA 4, and Grok 3. When asked direct logic questions, most models performed well.
But when the same logic was hidden in a long narrative, results collapsed:

  • GPT-4.1 dropped to 32 percent accuracy
  • DeepSeek V3 fell from 93 to 25 percent
  • Similar drops were seen across Claude and Gemini

The story format didn’t overwhelm the models with length. It tripped them up by requiring actual reasoning.

What does this look like in your workflow?

Imagine (some of you won’t have to imagine, because you’re already doing this) your sales or service team uses GPT to review a CRM thread and asks:

“Did the customer churn because of pricing?”

Here’s what’s in the email thread between the teams:

“The new pricing tier felt expensive.”
“The CFO flagged the cost as a concern.”
“We paused onboarding after the quote came in.”
“In the end, we chose the cheaper option.”

To a human, the logic is simple: pricing caused churn.

But GPT might respond with:

“Uptime was mentioned.”
“Karen questioned ROI.”
“Pricing was a factor, but not the only one.”

It doesn’t misread the content.
It just fails to connect the reasoning chain:

Price increase → hesitation → quote rejected → cheaper competitor chosen

This is the kind of failure VLO reveals. And it’s exactly the kind of reasoning most workflows need. That makes it a big, big problem for decision-making and for automated workflows built on LLM APIs.
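If you want to sanity-check your own workflow before trusting it with decisions, a minimal harness looks something like the sketch below. To be clear, call_llm is a placeholder for whichever model API you actually use, and the thread, question, and pass condition are hypothetical examples rather than a standard test:

```python
# Minimal sketch of a "reasoning under noise" check for your own workflow.
# call_llm is a placeholder for whatever model/API you actually use;
# the thread content and expected answer below are hypothetical examples.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model of choice")

THREAD = """
Uptime has been solid this quarter.
The new pricing tier felt expensive.
Karen questioned the ROI of the analytics add-on.
The CFO flagged the cost as a concern.
We paused onboarding after the quote came in.
In the end, we chose the cheaper option.
"""

QUESTION = ("Did this customer churn because of pricing? "
            "Answer yes or no, then explain the causal chain.")

def run_check() -> bool:
    answer = call_llm(f"{THREAD}\n\n{QUESTION}")
    # Crude pass/fail: the model should commit to "yes" and trace the chain
    # (price increase -> hesitation -> quote rejected -> cheaper option chosen),
    # not hedge or latch onto a distractor like uptime or ROI.
    return answer.strip().lower().startswith("yes")

# Run this across dozens of threads with known ground truth, not just one,
# and track how often the model hedges or picks up a distractor instead.
```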

Why does this matter for your business?

Because the legal exposure is real, and so is the operational risk if you’re using AI for:

  • Churn prediction
  • Lead scoring
  • Customer support triage
  • Legal contract review
  • Policy or compliance analysis

Guess what? If the model can’t reason, it will fail to:

  1. Spot the real cause in client feedback
  2. Track obligations across contracts
  3. Interpret context in complaints
  4. Justify decisions to auditors or regulators

And worse, you may not realise it has failed, because the language still sounds confident and correct.

What VLO proves

VLO doesn’t test if a model can hold more data.
It tests whether it can work through the logic that connects that data.

Questions it answers:

  • Can the model track multiple steps in a story?
  • Can it ignore distractions?
  • Can it compute what isn’t explicitly stated?
  • Can it reason like a human analyst?

If the answer is no, then you should not trust that model in any decision chain where cause, consequence, or traceability matter.

Verbose ListOps (what a name!) is the benchmark the business world didn’t know it needed. It exposes a critical gap: models don’t fail because they forget; they fail because they don’t reason.

That changes how we think about AI in business.

It means:

  • You need to audit your reasoning workflows
  • You need to scaffold logic, not assume it (see the sketch below)
  • You need to simulate messy, multi-step inputs
  • You need assurance before pushing AI into live decision processes
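What does “scaffold logic” look like in practice? One common pattern is to stop asking a single open-ended question and instead force the model through explicit steps: extract the facts, order them, then justify a conclusion against that timeline. A hedged sketch, reusing the hypothetical call_llm placeholder from the earlier example:

```python
# One way to scaffold reasoning instead of assuming it: force the model to
# extract facts, order them, and only then draw a conclusion it must justify.
# call_llm is the same hypothetical placeholder as in the earlier sketch.

def scaffolded_review(thread: str, question: str, call_llm) -> dict:
    facts = call_llm(
        "List every concrete event in this thread as short bullet points, "
        "with no interpretation:\n" + thread
    )
    timeline = call_llm(
        "Put these events in chronological order and note which ones are "
        "plausibly connected:\n" + facts
    )
    verdict = call_llm(
        f"Using only this timeline:\n{timeline}\n\n"
        f"Answer the question '{question}'. Cite the specific events that "
        "support your answer, and say 'insufficient evidence' if they don't."
    )
    return {"facts": facts, "timeline": timeline, "verdict": verdict}

# The intermediate outputs double as an audit trail: a reviewer (or regulator)
# can see which events the model relied on, not just its final answer.
```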

Alex and Mary-Anne didn’t just build a clever test. They issued a warning. And if your business depends on models to think clearly, now is the time to check.

Time to audit your AI?

If your workflows involve risk, logic, or decision-making, you need more than a prompt test. You need to know whether your model can reason under pressure.

We help organisations audit their AI systems against real-world complexity using ISO 42001. We are ISO 42001 Certified Lead Auditors. Make sure your AI decisions hold up under scrutiny.

Explore our ISO 42001 audit and AI assurance services