Tuesday, March 17, 2026

Thoughts from 3 years of studying AI system design!

Yesss, YES!!!

This is exactly what I've been thinking about this whole time!

Ever since 2023, I've been researching AI systems. The earliest papers and advances all pointed to a specific combination of patterns. Chain-of-thought, the concept behind "thinking models", was one of the first breakthrough papers there, and its potential was obvious from day one.

Early models suffered from short context windows. Back in the day you got maybe 8k tokens; you could barely fit a prompt and a short conversation, and stuffing in three documents and a quarter of a codebase was unthinkable.

But still, the industry held the path. More parameters. More data. Bigger. Stronger. Heavier. Let's fit a million tokens in the thing.

A year and a half later, they began caving. GPT-4o mini was released, and the "flash"-like models followed. "Bigger" wasn't working out; the companies were burning money too fast. They needed smaller, faster models. The industry still hasn't properly utilized this yet, but advances like Phi-1 were a good start.

With the ongoing papers, and things like the onset of RAG, it was clear that models were nowhere near being utilized to their full potential. I remember watching a video by David Shapiro about the idea of "Latent Space Activation". This was perhaps one of the earliest descriptions of what would become known as "context engineering", and a prime example of pushing the limits.

But that extra thinking context is only useful for that one step, and context windows were still really short, or plagued by needle-in-a-haystack problems in the bigger models...

Then there were all the safety issues. Big AI didn't want their models teaching people to make a bomb, so they went for the brute-force approach: just wipe that from the model's 'memory'.

But if you understand latent space, you know that this is the model equivalent of chopping out a piece of someone's brain. **A literal lobotomy!**

Sure, the model wouldn't tell you an obscene joke, but its ability to reason about the world had been severely damaged in the process. This led me to the thought that safety shouldn't be done at the model layer. Leave the models uncensored!! Then verify the output's ethicality at the orchestration layer. This decouples the model's immediate response from the user, so safety is no longer an "initial output" problem, but rather a "does the model have morals once it's spoken" problem. Now you could focus on actually good model alignment practices!
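To make the idea concrete, here's a minimal sketch of what orchestration-layer safety could look like. Every name here is hypothetical: `generate_draft` stands in for an uncensored base-model call, and `is_ethical` stands in for a second model (or rules engine) that judges the draft before the user ever sees it.

```python
# Minimal sketch of orchestration-layer safety: the base model is
# uncensored, and a separate verifier judges each draft response
# before it ever reaches the user. All names are hypothetical.

def generate_draft(prompt: str) -> str:
    # Stand-in for an uncensored base-model call.
    return f"draft answer to: {prompt}"

def is_ethical(text: str) -> bool:
    # Stand-in for a moderation model or rules engine. Here: a
    # trivial keyword check, purely illustrative.
    banned = {"bomb recipe"}
    return not any(term in text.lower() for term in banned)

def respond(prompt: str) -> str:
    # The orchestration layer: generate first, verify second.
    draft = generate_draft(prompt)
    if is_ethical(draft):
        return draft
    return "I can't help with that."

print(respond("explain photosynthesis"))
```

The point of the structure is the decoupling: the model reasons freely, and the harness (not the weights) decides what reaches the user.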

...it would take until 2025 and the release of DeepSeek R1 before a system capable of implementing a model like this would be properly available.

These three things combined led me to the idea that instead of big models with one chat thread filled with information, you could split tasks into smaller steps and isolate the necessary CoT reasoning context for each step within its own thread. This concept lent itself well to parallelization too, and thought leaders in the field happened to be experimenting with agent swarms at the time as well. What if you could have a collection of agents, each with a task or a step of a task, and isolate the details of the substeps so that the higher threads could focus on what was actually important?
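The split-and-isolate idea can be sketched in a few lines. This is purely illustrative: `run_agent` is a stand-in for a real model call whose full reasoning trace stays inside the call, with only a distilled summary escaping to the orchestrator.

```python
# Sketch of split-and-isolate: each sub-task runs in parallel with
# only the context it needs, and the orchestrator sees just a short
# summary of each result. run_agent is a hypothetical model call.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str, context: str) -> str:
    # Step-local chain-of-thought happens here and is discarded;
    # only the summary escapes to the higher thread.
    _reasoning = f"detailed chain-of-thought about {task} using {context}"
    return f"summary({task})"

def orchestrate(tasks: dict[str, str]) -> list[str]:
    # Fan sub-tasks out in parallel, each with isolated context,
    # then collect only the distilled summaries.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_agent, name, ctx)
                   for name, ctx in tasks.items()}
        return [f.result() for f in futures.values()]

print(orchestrate({"parse": "doc A", "summarize": "doc B"}))
```

The higher thread never sees the substep details, only the summaries, which is exactly what keeps its context focused on what's actually important.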

Recently, Anthropic finally put into words the idea I had been carrying for so long: without a persistent, structured representation of work, long-running 'linear' agents become "amnesiacs with tool belts". You need working memory, and you need it at the orchestration layer. The magic is in the harness!
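One way to picture orchestration-layer working memory is a shared artifact store: agents read and write named artifacts instead of relying on one ever-growing chat thread. The `ArtifactStore` class and its API below are my own hypothetical illustration, not any real system's interface.

```python
# Sketch of orchestration-layer working memory: agents persist
# structured artifacts in the harness, so later steps can recover
# state without replaying the whole conversation. Hypothetical API.

class ArtifactStore:
    """Persistent, structured scratch space shared across agent steps."""

    def __init__(self) -> None:
        self._artifacts: dict[str, str] = {}

    def write(self, name: str, content: str) -> None:
        self._artifacts[name] = content

    def read(self, name: str) -> str:
        # Missing artifacts read as empty rather than raising.
        return self._artifacts.get(name, "")

store = ArtifactStore()
store.write("plan", "1. gather docs  2. summarize  3. draft answer")
# A later agent step recovers the plan directly from the harness:
print(store.read("plan"))
```

That's the "tool belt plus memory" picture: the agent can forget everything between steps, because the harness remembers for it.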

Today, an article came across my feed, and as I read it I got more and more excited! Finally, almost all of my core ideas had been convergently figured out and implemented all at once in a single system!

The approach that Slate V1 uses checks almost all the boxes of what I had hoped to implement back then:
- Context distillation, check.
- Parallel multi-threaded agents, check.
- Central orchestration threads, check.
- Latent space activation via artifacts, check.

Kinda amazing and exciting that after 3 years, the AI industry finally caught up and learned things that had been becoming apparent even in the first 6 months :D

https://venturebeat.com/orchestration/y-combinator-backed-random-labs-launches-slate-v1-claiming-the-first-swarm 
