Why educators should care that LLMs aren't stochastic parrots
Inside an LLM is a genuine geometric structure, and learning to navigate it is a skill.
The popular way of describing what large language models do is something like this: the model ingests massive amounts of text, compresses the statistical patterns, and when you give it a prompt, it rolls weighted dice to pick the next word. The weights come from frequency patterns in the training data. High-probability words come up more often.
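To make the dice-roll picture concrete, here is a minimal sketch of what "rolling weighted dice for the next word" would look like if it were the whole story. The word list and probabilities are made up for illustration, and `sample_next_word` is a hypothetical helper, not part of any real model or library.

```python
import random

def sample_next_word(probabilities, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probabilities)
    weights = [probabilities[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Made-up next-word probabilities for a prompt like "The cat sat on the".
next_word_probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_word(next_word_probs, rng) for _ in range(10_000)]
counts = {word: draws.count(word) for word in next_word_probs}
print(counts)  # higher-probability words come up more often
```

In this picture the table of probabilities is all there is. The rest of the article is about why that is an incomplete account of what sits inside a trained model.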
We’re told that when ChatGPT or Claude writes something that sounds intelligent, it’s doing a sophisticated version of autocomplete. No understanding, no structure, just pattern-matching against a giant frequency table. “Stochastic parrot” is the famous label, from Emily Bender, Timnit Gebru, and colleagues in their 2021 paper.
There’s something genuinely useful about this framing because it (rightly) pushes against the tendency to see a mind behind every fluent paragraph. And its core observation is accurate: these systems are trained on statistical patterns in text.
The problem is that the parrot framing makes a specific claim about what exists inside the model after training. It claims there’s merely a compressed summary of the training data’s statistics. It’s basically a lookup table and frequency distribution. And everything the model does is mere retrieval from that table, plus some randomness.
But recent research suggests this is very wrong. Understanding why the stochastic parrot model is wrong has important implications for both users and educators.
What recent research is finding in the models
Over the past few years, mechanistic interpretability has started opening up language models the way a biologist opens up an organism. The goal is to figure out what’s actually going on inside an LLM.
Researchers first noticed a phenomenon called “grokking,” identified by Alethea Power and colleagues in 2022. A small Transformer model trained on modular addition (basically, clock arithmetic) would memorize the training data, perform terribly on new examples, and then suddenly snap into perfect generalization long after it seemed to have finished learning. Something was happening under the surface. Neel Nanda and collaborators then reverse-engineered what the model had actually learned. They found three phases.
Phase 1. First, the model memorized input-output pairs. Pure lookup table. The parrot story seemed right.
Phase 2. Then, underneath the memorization, the model quietly developed an entirely different solution: it learned to represent numbers as rotations on a circle and compose them using trigonometric identities, suggesting a genuine algorithm.
Phase 3. The final phase was cleanup: the memorization scaffolding fell away and the algorithmic solution took over.
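The "rotations on a circle" idea from phase two can be sketched in a few lines. This is a simplified illustration, not the network's actual circuit: it uses a single rotation frequency, the modulus `P = 113` is a choice made for this sketch, and the function names are hypothetical. Each number becomes a point on the unit circle, two points are combined with the angle-addition identities, and the answer is read off by asking which residue lines up best with the combined rotation.

```python
import math

P = 113  # modulus for the clock arithmetic; chosen for this sketch

def to_rotation(n, p=P):
    """Represent n as a point (cos, sin) on the unit circle."""
    angle = 2 * math.pi * n / p
    return math.cos(angle), math.sin(angle)

def compose(rot_a, rot_b):
    """Add two rotations using the angle-addition identities."""
    cos_a, sin_a = rot_a
    cos_b, sin_b = rot_b
    return (cos_a * cos_b - sin_a * sin_b,   # cos(a + b)
            sin_a * cos_b + cos_a * sin_b)   # sin(a + b)

def read_out(rot, p=P):
    """Return the residue c whose rotation best matches rot."""
    cos_s, sin_s = rot

    def score(c):
        cos_c, sin_c = to_rotation(c, p)
        return cos_s * cos_c + sin_s * sin_c  # cos(angle_sum - angle_c)

    return max(range(p), key=score)

def add_mod(a, b, p=P):
    """Modular addition done entirely with rotations, no % operator."""
    return read_out(compose(to_rotation(a, p), to_rotation(b, p)), p)

print(add_mod(100, 50), (100 + 50) % P)
```

Nothing here is a lookup table of input-output pairs. The same few lines work for every pair of inputs, which is the sense in which the learned solution is an algorithm.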
The parrot framing can describe phase one. It has nothing to say about what came after. If the model were just memorizing patterns, it would have stopped at the lookup table. Instead, it discovered something the training data never explicitly contained: the mathematical structure that generates the patterns.
Perhaps the most important finding came in March 2025, when Anthropic traced the actual computational pathways inside Claude 3.5 Haiku. They found that when writing rhyming couplets, the model picks its end-word before writing the line. At the line break, before a single word of the second line has been generated, features for rhyming sounds and candidate words are already active.
The model has a destination and then writes the line to get there.
This table from Anthropic shows how each injection produces a completely different second line that coherently arrives at the target word. “The model picks its rhyme before writing the line. Researchers proved this by injecting different target words and watching the entire line reshape itself to get there.”
So what?
A next-token predictor, in the parrot sense, has no destination. It picks the most probable next word, then the next, then the next. Anthropic’s model picks the last word first and builds toward it.
The geometry
So what’s actually in there?
It’s increasingly clear that a trained language model should be viewed as a high-dimensional geometric object. Concepts correspond to directions in the model’s activation space. Relationships between concepts are angles. When the model generates a response, it traces a trajectory through this space, and the shape of that space determines what the model can do.
When Anthropic finds that a model’s rhyming features and candidate end-words are directions that activate before a line of poetry is written, that’s geometry. The model’s knowledge is literally stored as a shape. The properties of that shape (the distances, the angles, the subspace structure) are what give the model its capabilities.
The term for this in interpretability literature is the “linear representation hypothesis,” formalized by Park, Choe, and Veitch in 2024. High-level concepts are represented as directions in the activation space, and the relationships between those directions constitute what the model “knows.”
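A toy version of "a concept is a direction" can be built with synthetic data. Everything in this sketch is an assumption made for illustration: the "activations" are random vectors, the concept is a hidden direction added to half of them, and the difference of means is just one simple way to estimate a direction, not the method used in the paper cited above.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return dot(u, v) / (norm(u) * norm(v))

rng = random.Random(0)
DIM = 32  # toy activation size

# Hidden "concept" direction, normalized to unit length.
true_direction = [rng.gauss(0, 1) for _ in range(DIM)]
scale = norm(true_direction)
true_direction = [x / scale for x in true_direction]

def fake_activation(has_concept):
    """Random noise, shifted along the concept direction when it is present."""
    noise = [rng.gauss(0, 1) for _ in range(DIM)]
    shift = 3.0 if has_concept else 0.0
    return [n + shift * d for n, d in zip(noise, true_direction)]

with_concept = [fake_activation(True) for _ in range(500)]
without_concept = [fake_activation(False) for _ in range(500)]

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Estimate the concept direction as the difference between the two group means.
estimated = [a - b for a, b in
             zip(mean_vector(with_concept), mean_vector(without_concept))]

alignment = cosine(estimated, true_direction)
print(round(alignment, 3))  # close to 1.0 when the direction is recovered
```

The point of the toy is only that "having a concept" can show up as movement along one direction in a vector space, and that the angle between directions is something you can measure.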
Here’s an analogy: imagine you have a map, but it only has two directions, north-south and east-west. You can describe any location with just those two coordinates. Now imagine the model needs to track thousands of concepts but only has room for a few hundred dimensions. Anthropic’s work on superposition shows that models solve this by packing concepts into nearly perpendicular directions, the way a city grid packs streets into a compact space. The concepts don’t perfectly avoid each other (there’s some interference), but the arrangement is efficient enough to represent far more ideas than the model technically has room for. The result is a structured geometric packing, not a heap.
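The packing claim is easy to check numerically. The sketch below is an assumption-laden stand-in: it uses random directions rather than anything a model has learned, and the sizes (256 "concepts" in 128 dimensions) are arbitrary. It measures how much pairs of directions overlap, where 0 would mean exactly perpendicular and 1 would mean identical.

```python
import itertools
import math
import random

rng = random.Random(0)
DIM = 128          # available dimensions
N_CONCEPTS = 256   # more "concepts" than dimensions

def random_unit_vector(dim):
    """A random direction, scaled to length 1."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

directions = [random_unit_vector(DIM) for _ in range(N_CONCEPTS)]

# Absolute cosine between every pair of directions.
overlaps = [abs(sum(a * b for a, b in zip(u, v)))
            for u, v in itertools.combinations(directions, 2)]

mean_overlap = sum(overlaps) / len(overlaps)
max_overlap = max(overlaps)
print(f"mean overlap {mean_overlap:.3f}, worst pair {max_overlap:.3f}")
```

Even without any training, twice as many directions as dimensions fit with small average overlap. The nonzero overlaps are the interference mentioned above. A trained model presumably arranges its directions more deliberately than random draws do, so treat this as a lower bar for what is geometrically possible rather than a picture of a real model.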
What’s interesting is that this shape emerges from a probabilistic training process. It’s absolutely constrained by probability. The stochastic view gets that part right. But what the stochastic model gets wrong is confusing that initial optimization goal for the model itself.
A real crystal’s structure is fully determined by atomic forces acting in the solution it grew from, yet we don’t conclude that a crystal is “just” a solution. Where a structure comes from is one question. What the structure is and what it does are different questions entirely.
Why this matters for AI fluency
The geometric framing changes how you think about prompting.
If an LLM is a weighted dice roll, then prompting is about triggering the right frequency band. On this view, you’re trying to put the model in a state where the high-probability tokens happen to be the ones you want. Skill is mostly about tapping into the high-frequency bands that correspond to your need.
If an LLM is a geometric object, on the other hand, then prompting is navigating hidden nodes. You’re positioning a query in a high-dimensional space, and what the model can reach depends on where you’ve placed it. Temperature (the randomness parameter) is one lever, because it expands an exploration radius through structured space, surfacing tokens that are structurally connected to your query but farther from the obvious center. But the bigger lever is the prompt itself. Research on prompt engineering has consistently found that the most effective strategies work by steering the model away from its highest-probability defaults and toward less likely but more precise regions of the space. Chain-of-thought prompting and role-based framing are navigation techniques. They reposition the query to activate structural relationships that a generic prompt wouldn’t reach.
In other words, different framings position the query in different regions of the space where different structural relationships become accessible.
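The temperature lever mentioned above has a simple standard form: divide the model's raw scores by the temperature before turning them into probabilities. The sketch below uses four made-up scores and a hypothetical helper name. It shows only how temperature reshapes a distribution, not how any particular product implements it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for four candidate next words, most obvious first.
logits = [4.0, 2.0, 1.0, 0.5]

cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=1.5)

print([round(p, 3) for p in cool])  # mass concentrated on the top candidate
print([round(p, 3) for p in warm])  # more mass on the less obvious candidates
```

Raising the temperature never changes which candidates exist or how they are ranked. It only changes how far from the top-ranked choice the sampling is likely to wander, which is why the prompt is the bigger lever.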
Implications for training and teaching
This has implications for how we teach AI literacy. If the model is a parrot, the right pedagogical move is to emphasize how mid the outputs are. They’re “just statistical regularities” and AI literacy is mostly about correcting their middling outputs. If the model is a geometric object, then learning to navigate that space is a genuine skill, one that depends on 1) domain expertise and 2) creativity. We absolutely should emphasize basic information literacy and be wary of poor or inaccurate outputs, but knowing how to navigate a model’s terrain successfully is upstream of that.
Also, the worry that AI replaces learning assumes the model provides similar outputs regardless of who’s asking. The geometric view suggests the opposite: the model is most powerful in the hands of someone who knows how to traverse the territory. Learning a subject deeply is what gives you more access to the model’s hidden structure. Most people don’t know how to access anything other than the main highways.
Sure, LLMs provide shortcuts for everyone. That has implications for teaching and academic integrity.
But those with more knowledge will explore so much more within the model and jump far ahead of others. Students and faculty alike don’t want to be left behind. The basic things we already emphasize in teaching (such as creative thinking and domain expertise) are key to unlocking these digital spaces.
The geometric framing doesn’t resolve every question about what these systems are or what they can do.
Part of the parrot critique emphasizes the models’ lack of understanding. I think that still holds up. These aren’t living systems. Nothing in the system relates to that geometry from a perspective within it.
But this is a genuinely new kind of technology that we’re still getting used to. The parrot framing flattens it downward (just statistics). The “alien intelligence” framing inflates it upward (a new kind of mind). We shouldn’t force it into familiar categories.
Further reading
For a rigorous philosophical treatment of these questions, see Millière & Buckner, “A Philosophical Introduction to Language Models” (Part I, Part II). Scott Alexander’s “Next-Token Predictor Is An AI’s Job, Not Its Species” argues a complementary point from a different angle. On the geometry of concepts specifically, see “The Geometry of Categorical and Hierarchical Concepts in Large Language Models.” For the world models debate, Anil Ananthaswamy’s series on Where Machines Think is a good entry point.



Thank you for the article. That was very thought-provoking. I think the stochastic parrot analogy is useful as a starting point for students to understand the basic design of an LLM. For me, the geometric understanding of an LLM points to why its use by students quickly yields diminishing returns. In the hands of someone who is a non-expert in a field, the user’s ability to know how to interface with the model so that it augments rather than replaces his or her abilities is limited. For me, this is why it is key to deeply form and educate students’ “native wetware” (their own minds) first.
I think the geometric view is a good start. Once we end up in n-space, however, the metaphor breaks down. We can't really imagine 600 dimensions, let alone 60,000.
https://meresophistry.substack.com/p/the-generic-abyss-of-artificial-intelligence
https://meresophistry.substack.com/p/the-fraudulent-euphemism-of-ai-thinking