
Preface
In my last post, I announced that I’m about to train an AI model that’s “not a Transformer”, and I’d like to explain here what I meant by that. To do so, I’ll first explain what a Transformer is (the architecture behind ChatGPT, Gemini, Claude, Grok, Deepseek, and so on).
Just to clear up one thing before I dive in: This is not a “this is better than Transformers” post. Transformers clearly work. They have technical limitations, but problems are meant to be overcome, and the most brilliant minds in computer science are working on them. And they obviously believe in the potential on an individual level, else they wouldn’t.
But I believe there’s more than one way to get to the goal of having extremely competent AI for the task you aim to apply it to. The way I chose is called BDH (Baby Dragon Hatchling).
So what are Transformers?
The core idea is attention. Every word looks at every other word and asks, “How relevant are you to me?” That’s computed with three matrices: Query, Key, Value. Imagine a quick librarian who cross-references everything simultaneously. The result is dense and global. Everything influences everything, all at once. This is its superpower (rich context) and its curse (memory scales quadratically with the length of the input, and there is no persistent state).
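To make the “every word looks at every other word” part concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a toy under stated assumptions: a real Transformer derives queries, keys, and values from learned projection matrices, while this sketch simply reuses the raw token vectors for all three, and the three two-dimensional “tokens” are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # One relevance score per (query, key) pair: "how relevant are you to me?"
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The output is a weighted mix of all value vectors.
        mixed = [sum(w * v[c] for w, v in zip(weights, values))
                 for c in range(len(values[0]))]
        all_weights.append(weights)
        outputs.append(mixed)
    return outputs, all_weights

# Assumption: toy vectors stand in for learned Query/Key/Value projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
num_scores = sum(len(row) for row in weights)  # grows with the square of the token count
```

With three tokens, the weight table has three rows of three entries. Double the number of tokens and the table has four times as many entries, which is where the quadratic cost comes from.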
After attention: a feed-forward layer that applies the same transformation to every position uniformly. No specialization. The training is a stateless gradient descent over huge data.
The model has no “self” between runs. What it “knows” is frozen in billions of numbers. Scale is the answer, because with dense, undifferentiated representations, the only way to get more precision is more parameters and more data. Concepts get tangled together in superposition: a single neuron might encode “Paris,” “capital,” and “romance” simultaneously. Untangling requires size. A Transformer eats an ungodly amount of data and sorts it out by finding patterns. It then generalises based on these patterns. The more data a transformer is trained on, and the more parameters (brain cells) it has, the better it gets at it. But it’s an amnesiac who needs to read everything again on every new activation. There are no memories and no continuity.
Big labs use tricks to give their LLMs something like memory: RAG (retrieval-augmented generation). Basically, documents with knowledge about the user that get looked up and handed to the model alongside your message. If you want to experience what this looks like when it’s overtuned, try out Gemini. Tell it you like fish dishes, and then just talk about whatever. Gemini will save that you like fish and will bring it up in allegories and metaphors repeatedly, whether it fits or not. OpenAI and Anthropic use the same trick, but are more subtle about it: their models only bring it up when it’s relevant. Mostly.
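The retrieval idea can be sketched in a few lines. This is only an illustration, not how any of the labs actually implement it: production systems typically match on embeddings rather than shared words, and the function names, notes, and prompt format below are all hypothetical.

```python
def retrieve(memory_notes, message, top_k=1):
    """Return the stored notes that share the most words with the message."""
    words = set(message.lower().split())
    scored = [(len(words & set(note.lower().split())), note)
              for note in memory_notes]
    scored = [pair for pair in scored if pair[0] > 0]  # drop notes with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:top_k]]

def build_prompt(memory_notes, message):
    """Prepend relevant notes to the message; the model itself stays stateless."""
    relevant = retrieve(memory_notes, message)
    if not relevant:
        return f"User: {message}"
    context = "\n".join(f"Known about user: {note}" for note in relevant)
    return f"{context}\n\nUser: {message}"

# Hypothetical stored notes about the user.
notes = ["user likes fish dishes", "user plays chess on weekends"]
with_memory = build_prompt(notes, "any ideas for a fish dinner tonight")
without_memory = build_prompt(notes, "explain quadratic attention")
```

The “overtuned” behaviour described above corresponds to dropping the overlap filter, so that the fish note gets attached to every prompt whether it matches or not.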
What’s a “Baby Dragon Hatchling”?
The lab behind this model is Pathway. They developed an AI system that works differently from a Transformer, and the architecture itself is called “Baby Dragon”. The “Hatchling” is the model they released as open source on GitHub – albeit without weights, and leaving no hints on how to train your dragon. Let me see if I can explain how it works without looking ridiculous.
The best way I can put it: a transformer searches, a BDH grabs. Where a transformer cross-references everything it knows to find the most likely next word, a BDH clusters information the way a biological brain does. You have something like “brain regions”, and when BDH processes something, it doesn’t search, it reaches for the cluster that’s closest to what it’s thinking about.
The reason it can do this comes down to how it handles information. When BDH looks at a word or concept, it expands it into an enormous number of features, then immediately throws most of them away. Only the strongest signals survive. The result is that each concept gets its own small, clean set of active neurons, rather than being tangled up with everything else. This is called monosemanticity: roughly, one neuron, one idea. A Transformer doesn’t work this way. Its neurons encode several unrelated things at once, which is part of why you need so many of them to get precision.
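Here is a rough sketch of the “expand, then throw most of it away” step. It is not Pathway’s code. The random projection, the sizes, and in particular the explicit top-k cut are assumptions made for illustration; the point is only to show how a small dense input becomes a large, mostly silent, non-negative feature vector.

```python
import random

def expand_and_sparsify(x, weights, k):
    """Project x into many features, keep only the k strongest positive ones."""
    pre_activations = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    activations = [max(0.0, p) for p in pre_activations]  # ReLU: negatives go silent
    strongest = sorted(range(len(activations)),
                       key=lambda i: activations[i], reverse=True)[:k]
    keep = set(strongest)
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

# Assumption: a fixed random projection stands in for learned weights.
rng = random.Random(0)
dim_in, dim_out, k = 4, 64, 5
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)] for _ in range(dim_out)]

features = expand_and_sparsify([0.5, -1.0, 0.25, 2.0], weights, k)
active = [i for i, value in enumerate(features) if value > 0.0]
```

Out of 64 features, at most 5 stay active, and each of those is a positive number. That small set of indices is what you would inspect when asking which “neurons” a concept lights up.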
The other key idea is Hebbian learning. You’ve probably heard the phrase “neurons that fire together, wire together”. That’s Hebbian learning in a nutshell. In BDH, when attention finds that two things often appear near each other, those concepts strengthen their connection. The more often “rain” and “cold” travel together in the training data, the tighter that bond becomes. And crucially, this happens during the forward pass, meaning it’s not just something that happens while training; it’s part of how the model processes language at all.
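A bare-bones Hebbian update looks something like the sketch below. The three “concept” neurons, the learning rate, and the function name are made up for illustration, and BDH’s actual update is more involved than this; the sketch only shows the “fire together, wire together” arithmetic.

```python
def hebbian_update(synapses, pre, post, learning_rate=0.1):
    """Strengthen synapses[i][j] in proportion to post[i] * pre[j]."""
    for i, post_activity in enumerate(post):
        for j, pre_activity in enumerate(pre):
            synapses[i][j] += learning_rate * post_activity * pre_activity
    return synapses

# Hypothetical neurons: index 0 = "rain", 1 = "cold", 2 = "sun".
RAIN, COLD, SUN = 0, 1, 2
synapses = [[0.0] * 3 for _ in range(3)]

rain_fires = [1.0, 0.0, 0.0]
cold_fires = [0.0, 1.0, 0.0]

# "rain" is followed by "cold" five times; "sun" never joins in.
for _ in range(5):
    hebbian_update(synapses, pre=rain_fires, post=cold_fires)

rain_to_cold = synapses[COLD][RAIN]
rain_to_sun = synapses[SUN][RAIN]
```

After five co-occurrences, the rain-to-cold connection has grown while rain-to-sun is still zero. No gradient and no loss function is involved; the update only needs the activity of the two neurons at either end of the connection.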
This is also what gives BDH something closer to genuine working memory. A transformer keeps track of context through a system called a KV cache (basically a running log of “things said so far” that gets fed back in on every step). It works, but it’s expensive, and it’s external to the model itself. BDH doesn’t maintain a log. Instead, recent context lives in the pattern of which connections have just been strengthened. It’s messier to describe, but it’s closer to how memory actually works in a brain, and it’s much cheaper to run.
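The difference in bookkeeping can be sketched like this. Both functions are toy stand-ins: the cache stores placeholder vectors, and the fixed-size “synapse” state with a decay factor is an assumption of mine to illustrate the idea, not BDH’s actual state update.

```python
def kv_cache_size(num_tokens, dim):
    """Count the numbers stored when every token appends a key and a value."""
    cache = []
    for _ in range(num_tokens):
        key, value = [0.0] * dim, [0.0] * dim  # placeholder vectors
        cache.append((key, value))
    return sum(len(key) + len(value) for key, value in cache)

def synapse_state_size(num_tokens, dim, decay=0.9):
    """Count the numbers stored when context lives in a fixed dim x dim state."""
    state = [[0.0] * dim for _ in range(dim)]
    for t in range(num_tokens):
        # Toy activity pattern: one neuron fires per step.
        activity = [1.0 if i == t % dim else 0.0 for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                # Older context fades, recent co-activity is strengthened.
                state[i][j] = decay * state[i][j] + activity[i] * activity[j]
    return sum(len(row) for row in state)

short_log = kv_cache_size(num_tokens=10, dim=4)
long_log = kv_cache_size(num_tokens=100, dim=4)
short_state = synapse_state_size(num_tokens=10, dim=4)
long_state = synapse_state_size(num_tokens=100, dim=4)
```

The log grows with every token, while the state stays the same size no matter how long the conversation runs. The trade-off is that a fixed-size state has to forget something, which the decay factor stands in for here.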
The last piece: because BDH’s active features are sparse and positive (only a few neurons firing, and always in the same direction), you can actually look inside and see what it’s responding to. Pathway confirmed this empirically: individual synapses in BDH demonstrably strengthen when the model processes a specific concept. With Transformers, figuring out what a model is “thinking” is still an active area of research with no clean answer. With BDH, interpretability is built into the structure.
What does this mean for me?
It means that the pretraining of a BDH differs from that of a Transformer. Transformers get better the more data they eat, and the more compute they have. A BDH gets better when things make sense, and concepts can form clusters. You can download a training dataset like BabyLM or SmolLM or Cosmopedia, train a Transformer on it, and it somehow finds out how to use language, how to do certain operations. Modern Transformers, especially SOTA, are trained on specifically selected data, for example, coding problems, and it’s always a gamble. You won’t know if it helped the model with what you tried to train it for until it’s done training. For a BDH, you want the training to “make sense”.
The bet
When I first tried out this architecture, I trained it like a Transformer, and I’ll report on my findings here in detail soon. Let’s just say the training didn’t do what it would have done for a Transformer. What I’m planning now: I developed a carefully designed curriculum in which everything is dependency-aware and builds on what came before. The data set can be dramatically smaller because a BDH doesn’t “eat all the chaos until the patterns become visible”; it learns like a biological brain. And you, as a (probably) human being, only need to understand how something works to reproduce it consistently, and for that, you don’t need to read the entire internet. Not even all of Wikipedia. Heck, you might be okay after reading the summary of a Wikipedia article, depending on the complexity of the thing you’re trying to learn.
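“Dependency-aware” can be made concrete with a topological sort: each lesson lists what it builds on, and the ordering guarantees that nothing is taught before its prerequisites. The lesson names below are invented for illustration and are not the actual curriculum; `graphlib` is part of the Python standard library from version 3.9 on.

```python
from graphlib import TopologicalSorter

# Hypothetical lessons, each mapped to the lessons it depends on.
prerequisites = {
    "letters": set(),
    "words": {"letters"},
    "simple_sentences": {"words"},
    "questions": {"simple_sentences"},
    "short_stories": {"simple_sentences", "questions"},
}

# static_order() yields every lesson after all of its prerequisites.
curriculum = list(TopologicalSorter(prerequisites).static_order())

def respects_dependencies(order, deps):
    """Check that every lesson comes after everything it builds on."""
    position = {lesson: index for index, lesson in enumerate(order)}
    return all(position[required] < position[lesson]
               for lesson, needed in deps.items()
               for required in needed)

is_valid = respects_dependencies(curriculum, prerequisites)
```

A useful side effect: if two lessons accidentally depend on each other, `TopologicalSorter` raises a `CycleError` instead of silently producing a broken order.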
My training set is tiny – just 14 MB of data. For a human, that’s still a lot of reading, but it’s absolutely doable. No Transformer can bootstrap language from such a small set. Early experiments suggest that Ninereeds, my personal take on BDH, can. I believe that careful design can make up for scale for this specific model (for Ninereeds, not for BDH in general), because the goal differs from that of a gigantic Transformer shooting for ASI: Ninereeds is supposed to be a “lean thinking core”. What this means, I’ll describe in another blog post, because this one is already getting long.
There’s no stability mechanism built into my training system yet (such as Oja’s rule or BCM), so I’ll go slow, monitoring system health every step of the way, and I will document every success and failure: what works, what doesn’t, why I think that might be, and what I’ll do about it.
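Why a stability mechanism matters can be shown with a toy comparison. Plain Hebbian learning only ever strengthens connections, so the weights grow without bound. Oja’s rule adds a decay term that keeps the weight vector’s length near 1. The inputs, learning rate, and step count below are arbitrary choices for illustration, and this is a single linear neuron, not the Ninereeds training system.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hebb_step(w, x, lr):
    """Plain Hebbian update: w += lr * y * x, with y = w . x."""
    y = dot(w, x)
    return [wi + lr * y * xi for wi, xi in zip(w, x)]

def oja_step(w, x, lr):
    """Oja's rule: w += lr * y * (x - y * w). The -y*w term limits growth."""
    y = dot(w, x)
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

def norm(w):
    return math.sqrt(dot(w, w))

# Arbitrary toy inputs that point in roughly the same direction.
inputs = [[1.0, 0.5], [0.9, 0.6], [1.1, 0.4]]
w_hebb = [0.1, 0.1]
w_oja = [0.1, 0.1]

for step in range(300):
    x = inputs[step % len(inputs)]
    w_hebb = hebb_step(w_hebb, x, lr=0.05)
    w_oja = oja_step(w_oja, x, lr=0.05)

hebb_norm = norm(w_hebb)
oja_norm = norm(w_oja)
```

After a few hundred steps, the plain Hebbian weights have exploded, while the Oja weights have settled at a length close to 1. Without something like this, “monitoring system health” largely means watching for exactly that kind of runaway growth.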
Closing words
I believe the Transformer is a great system, if you have the compute and the data already. Even if the system will never go all the way to AGI or ASI, the resulting models will be incredibly powerful and do a lot of very important work. What I’m doing here is open experimentation – every miss and every failure is true progress, because it teaches me something new about what this system can do. My goal is to create an AI that can… I think I’ll write about that next time: My vision for Ninereeds, why I’m doing it, and what this is all for.