How AI “Sees” And Why You Shouldn’t Trust It
TLDR: AI doesn’t see images the way humans do. It guesses. It stitches patterns together based on probability, not perception. MIT’s research confirms this, and if your organisation uses AI for anything involving images, diagrams, or reports, you should care. Understanding this difference is how you avoid making a big stuff-up.
Sorry, we’re going to nerd out for a moment, but stay with me. This isn’t academic filler. This is the difference between a system helping your team and a system quietly producing errors that nobody notices until they matter. Like a fire that starts because nobody checked the electricals.
What is “visual knowledge” in AI?
MIT researchers looked at how language models develop “visual understanding”.
The short version: they don’t. Not in the way you think.
- They don’t store mental images.
- They don’t form a picture in their “mind”.
- They don’t recognise objects through perception.
Instead, they build statistical expectations based on:
- text patterns
- labels
- common co-occurrences (“cats sit on laps”)
- millions of multimodal pairings (“apple = red”)
That creates something like a fuzzy internal map of what an object probably looks like. This is why models can describe an object confidently even when it’s not actually present.
It’s not deceit. It’s learnt pattern logic.
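If it helps to see what that “fuzzy map” amounts to, here’s a deliberately toy sketch in Python. It is nothing like a real model’s internals (real models learn dense vector representations, not word counts), and every training snippet in it is invented, but it shows how plain co-occurrence statistics can produce a confident expectation like “apple = red”.

```python
from collections import Counter

# Toy illustration only. Real models learn dense vectors, not explicit counts,
# but the underlying signal is the same kind of co-occurrence statistic.
# All "training snippets" below are made up for the example.
training_snippets = [
    "a red apple on the table",
    "she bit into a red apple",
    "a green apple in the bowl",
    "the red apple rolled off the bench",
    "a cat sitting on her lap",
]

def colour_expectations(word: str, colours=("red", "green", "blue")) -> dict:
    """Rough probability of each colour co-occurring with `word` in the snippets."""
    counts = Counter()
    for snippet in training_snippets:
        tokens = snippet.split()
        if word in tokens:
            for colour in colours:
                if colour in tokens:
                    counts[colour] += 1
    total = sum(counts.values()) or 1
    return {colour: counts[colour] / total for colour in colours}

print(colour_expectations("apple"))
# {'red': 0.75, 'green': 0.25, 'blue': 0.0}
# The "map" says apples are probably red, which is exactly why a model can
# confidently call an apple red even when the one in your photo is green.
```

The point of the toy: the “knowledge” is a tally of what usually appears together, not a memory of any particular apple.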
So what did the MIT computer vision study say?
Here’s a summary without drowning you in maths:
1. AI builds visual knowledge indirectly
If a model has seen the word “thermometer” next to phrases like “thin cylinder”, “reading”, and “glass”, it develops a loose visual template through repetition.
2. Multimodal models add more clues, not true perception
Yes, they read images. No, they don’t understand them.
They compress an image into a high-dimensional representation, an embedding, that serves as just another statistical signal. There’s a small sketch of what that looks like after the next list.
3. AI makes specific, predictable mistakes
MIT found that models confidently:
- hallucinate common objects
- misinterpret diagrams
- confuse spatial arrangements
- “blend” categories
- answer based on likelihood rather than the details actually in front of them
If this sounds familiar, it should.
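Here’s roughly what that compression-and-compare step looks like. This sketch uses the open-source CLIP model via the Hugging Face transformers library; it isn’t the MIT study’s setup or anything we run in production, and the image path and label list are placeholders. The thing to notice is that the model turns the photo and each candidate label into vectors, compares them, and hands back a “winner”, whether or not any of the labels actually fit.

```python
# Minimal sketch: CLIP-style image-text matching via Hugging Face transformers.
# Not the MIT study's method and not our production pipeline; "site_photo.jpg"
# and the label list are placeholders for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("site_photo.jpg")  # placeholder path
labels = [
    "a switchboard with an electrical hotspot",
    "a switchboard operating normally",
    "a kitchen appliance",
]

# The image and each label are compressed into high-dimensional vectors and
# compared. There is no "looking", only similarity scores.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{p:.2f}  {label}")

# The softmax forces the scores to sum to 1, so the model always "picks" a
# winner with apparent confidence, even if none of the labels describe what
# is actually in the photo. That is likelihood, not perception.
```

That last comment is the whole failure mode in miniature: a likely-but-wrong answer can win simply because nothing better was on the list.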
Why should a non-technical leader care?
This research matters because AI “confidence” is not a sign of accuracy.
It’s a sign of probability.
1. AI visual reasoning is probabilistic, not perceptual:
When an AI tool describes an image, it’s not describing what it saw. It’s describing what it expects to see based on training data.
That sounds like semantics until the moment you discover the model:
- added a hazard that wasn’t there
- missed a crack in a wall
- misread a compliance figure
- invented context that fits the pattern
These failures are common, not rare edge cases.
2. This has real operational consequences:
In government, property, healthcare, engineering, insurance, facilities, and safety work, AI often handles imagery or diagrams.
Models can misread:
- thermal hotspots
- electrical heatmaps
- tenancy inspection photos
- structural diagrams
- environmental compliance photos
- OHS hazards in crowded scenes
Not because they’re crap. Because they’re not looking the way your team thinks they are.
3. Explainability matters more than ever:
If your organisation can’t explain how the model formed a judgement, you can’t govern the outcome. Teams need clarity on:
- which features the model used
- where its assumptions came from
- what the likely failure modes are
- how the system behaves with unusual images
This is part literacy, part governance.
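One practical way to get at “which features the model used” is occlusion sensitivity: blank out one patch of the image at a time, re-score it, and see which regions move the prediction. The sketch below is model-agnostic and purely illustrative; `score_image` is a hypothetical stand-in for whatever model or vendor API your team actually uses.

```python
# Occlusion sensitivity sketch: which regions of an image drive the score?
# `score_image` is a hypothetical placeholder, not a real API.
import numpy as np

def score_image(image: np.ndarray) -> float:
    """Placeholder: returns the model's score for 'hotspot present', in [0, 1]."""
    # In a real audit this would call your actual model or vendor API.
    return float(image.mean()) / 255.0

def occlusion_map(image: np.ndarray, patch: int = 32) -> np.ndarray:
    """Score drop per patch when that patch is blanked out. Bigger drop = more influence."""
    base = score_image(image)
    h, w = image.shape[:2]
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            masked = image.copy()
            masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0
            heat[i, j] = base - score_image(masked)
    return heat

# Usage: a random stand-in "image"; in practice, load a real inspection photo.
fake_image = np.random.randint(0, 255, size=(256, 256), dtype=np.uint8)
print(np.round(occlusion_map(fake_image), 3))
# If the influential patches don't line up with what a human expert would
# look at, that's a governance finding, not a technicality.
```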
A quick personal aside, because we might as well acknowledge we’re nerding out
We recently built a multimodal image-recognition system trained on more than 10,000 thermal images in the facilities sector.
The goal was simple: identify electrical hotspots automatically instead of having technicians review every heatmap manually.
It saved about 2,300 hours of analysis time.
But it only worked because everyone understood a basic truth:
The model wasn’t “seeing” anything.
It was recognising statistical patterns that correlate with hotspots.
Once you understand that, you design controls that work. If you don’t, you risk treating probabilistic inference as unquestioned fact.
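To give a flavour of those controls, here’s a simplified sketch of the kind of triage we mean: route each prediction by confidence instead of trusting it outright. It isn’t our production code; the thresholds, field names, and the idea of a single probability score are all simplifications for illustration.

```python
# Sketch of a confidence-based triage control for heatmap predictions.
# Illustrative only: thresholds and fields are assumptions, not our production values.
from dataclasses import dataclass

@dataclass
class HeatmapResult:
    image_id: str
    hotspot_probability: float  # model output in [0, 1]

AUTO_FLAG = 0.90   # near-certain hotspot: raise a work order, still logged
AUTO_CLEAR = 0.05  # near-certain clear: archive, sample-audited later
# Everything in between goes to a technician.

def triage(result: HeatmapResult) -> str:
    """Decide what happens to a single heatmap prediction."""
    if result.hotspot_probability >= AUTO_FLAG:
        return "flag_for_work_order"
    if result.hotspot_probability <= AUTO_CLEAR:
        return "auto_clear_with_sampling"
    return "human_review"

# Usage
for r in [HeatmapResult("hm-001", 0.97),
          HeatmapResult("hm-002", 0.42),
          HeatmapResult("hm-003", 0.01)]:
    print(r.image_id, triage(r))
# hm-001 flag_for_work_order
# hm-002 human_review
# hm-003 auto_clear_with_sampling
```

The design choice that matters is the middle band: the model never gets the final say on anything ambiguous.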
So what does this mean in practice?
1. Use AI for visual assistance:
AI is good at:
- summarising images
- extracting obvious features
- flagging potential attention points
It is not good at:
- final assessment
- safety-critical decisions
- ambiguous scenes
- fine-grained technical interpretation
2. Treat confident answers with suspicion:
Fluent explanations are a veneer, not evidence that the answer is right.
Always ask where the answer came from.
3. Build human-in-the-loop review into every visual workflow:
This catches the model’s tendency to fill gaps with probability (the triage sketch in the aside above is one simple version of this control).
4. Train people before tools:
You don’t need technical depth. You need teams who know when an answer is trustworthy and when it needs checking.
FAQs
Q: Does this mean AI can’t be trusted for visual work?
A: It can be trusted for assistance. Judgement still belongs to a human.
Q: Why do models hallucinate objects that aren’t in the image?
A: Because they predict likely features rather than perceive actual ones.
Q: Should government and regulated sectors care more?
A: Yes. Visual misinterpretation creates compliance risks, not just inconvenience.
Q: Is this too technical for non-AI staff?
A: No. The core idea is simple. AI makes visual guesses. You need to know that.
If you want your team to understand this without becoming researchers
If your organisation wants clear, practical literacy in how AI works, where it fails, and how to use it safely, the AI Fundamentals Masterclass gives non-technical teams the grounding they need to make good decisions.
