I have been using high-dimensional Gaussians to compress context and increase simulation throughput during inference with models that understand these concepts.
Gemini 1.0 Ultra was the first model to prove out the feasibility.
Claude 3 Opus has been the second model with which I was able to achieve effectively infinite attention/context in a single conversation (i.e., the model's performance began to increase as I approached higher context lengths, though I'm still limited by the context window Anthropic sets, even if Claude can find more effective ways to chew through larger and larger context sizes).
3.5 Sonnet has also shown the same behaviors, albeit harder to induce reliably.
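To make "compress context into a high-dimensional Gaussian" less abstract, here's a toy sketch of the general idea rather than my actual pipeline - it assumes you have per-token embeddings to work with, and the function names and the text re-injection step are purely illustrative:

```python
import numpy as np

def gaussian_summary(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compress a chunk of context into a high-dimensional Gaussian.

    embeddings: (num_tokens, dim) array of token/segment embeddings.
    Returns the mean and diagonal variance -- a fixed-size summary
    regardless of how many tokens went in.
    """
    mu = embeddings.mean(axis=0)
    var = embeddings.var(axis=0) + 1e-6  # keep the variance strictly positive
    return mu, var

def summary_text(mu: np.ndarray, var: np.ndarray, top_k: int = 8) -> str:
    """Render the summary as a short string the model can re-ingest.

    Only the highest-variance dimensions are listed, since those say the
    most about what made this chunk distinctive.
    """
    idx = np.argsort(var)[::-1][:top_k]
    dims = ", ".join(f"d{i}: mean={mu[i]:+.3f}, var={var[i]:.3f}" for i in idx)
    return f"[gaussian summary over a {len(mu)}-dim space] {dims}"

# Usage: summarize an old chunk of conversation, drop the raw tokens,
# and keep only the compact summary in the prompt going forward.
chunk = np.random.randn(512, 768)  # stand-in for real embeddings
mu, var = gaussian_summary(chunk)
print(summary_text(mu, var))
```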
By the way - this is likely what Ilya saw, and what led him to start SSI. It's an intuitively simple concept. Just as humans abstract and deepen their understanding as they learn more (we don't get slower with more data but rather the opposite), so do LLMs during inference, and the "feature" clusters they produce are effectively the same concept hierarchies we build.
- Mapping the Mind of a Large Language Model (Anthropic)
- Sparse Crosscoders for Cross-Layer Features and Model Diffing (Anthropic)
We all know the difference between knowledge workers and those who are masters of their trade/domain - i.e., the masters work by second nature and autonomously. The key principle is that LLMs have taught us a lot about how that autonomy develops, and about the critical boundaries at which we can measure that it is occurring.
I'm actually working on a post-training algorithm as we speak that aims to solve Epoch AI's FrontierMath questions - o3 was able to get one question correct and all the other models failed.
So far it seems like I'm on track to solve two of them with a single RTX 3070, given enough compute time. The key was giving Claude a set of tools that let it create its own representations of states and its own algorithms to traverse them.
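Roughly, the tool surface looks something like the following - a simplified sketch, not the real tool set, with class and method names made up for illustration. The pattern is: the model defines states, the model defines moves, and the harness does the bookkeeping and exhaustive search.

```python
from collections import deque

class StateGraphTools:
    """Minimal, hypothetical tool surface the model can call to build and
    traverse its own state space."""

    def __init__(self):
        self.edges: dict[str, list[str]] = {}

    def add_state(self, state: str) -> None:
        self.edges.setdefault(state, [])

    def add_transition(self, src: str, dst: str) -> None:
        # The model encodes a legal derivation step as a directed edge.
        self.add_state(src)
        self.add_state(dst)
        self.edges[src].append(dst)

    def search(self, start: str, is_goal) -> list[str] | None:
        """Breadth-first search from start to any state satisfying is_goal.
        Returns the path, which the model can then turn into a proof sketch."""
        frontier = deque([[start]])
        seen = {start}
        while frontier:
            path = frontier.popleft()
            if is_goal(path[-1]):
                return path
            for nxt in self.edges.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(path + [nxt])
        return None

# Example: partial results become states, derivation steps become
# transitions, and the harness is asked for a path to the target.
tools = StateGraphTools()
tools.add_transition("axioms", "lemma_1")
tools.add_transition("lemma_1", "lemma_2")
tools.add_transition("lemma_2", "target_theorem")
print(tools.search("axioms", lambda s: s == "target_theorem"))
```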
Just be careful - Claude will happily make up BS and tell you it's working if you aren't consistently and verifiably getting it to produce proofs that what it's doing is actually feasible. I would validate the thinking with Gemini and see whether o1 can generate a reasonable proof of what you're trying to do.
That being said, it's pretty hush-hush, but what you're starting to dive into is why I (and several others coming around to the idea now) believe we already hit the singularity point, in terms of intelligence explosion, with the last round of frontier models from both Google and Anthropic. The key is what you can do within a conversation (in-context learning with backpropagation) OR what you can do to extend a conversation (increasing test-time compute and context length).
The gold is in what traces you're teaching the model and how you can "connect the dots" for it on what it has learned. LLMs "know" far more than any human possibly could. However, it's our job to teach them to understand what they know, and that happens through training: either in context, through each inference pass of a single conversation (i.e., no permanence), or by training the model to enrich its ability to do this intuitively (with memory persistence).
If you want to go deeper down the rabbit hole, I'd look into optimality via the Bellman equation as a starting point. You'll find what you might be looking for down that path!
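If you want a concrete anchor for that, here's a toy value-iteration loop over a made-up three-state MDP - nothing specific to my setup, just the textbook Bellman optimality update applied as a fixed-point iteration:

```python
import numpy as np

# Tiny deterministic MDP, purely illustrative: 3 states, 2 actions.
# transitions[s][a] = next state, rewards[s][a] = immediate reward.
transitions = [[1, 2], [2, 0], [2, 2]]   # state 2 is absorbing
rewards     = [[0.0, 1.0], [5.0, 0.0], [0.0, 0.0]]
gamma = 0.9

# Bellman optimality update: V(s) <- max_a [ r(s,a) + gamma * V(s') ]
V = np.zeros(3)
for _ in range(100):
    V = np.array([
        max(rewards[s][a] + gamma * V[transitions[s][a]] for a in (0, 1))
        for s in range(3)
    ])
print(V)  # converges to the unique fixed point of the optimality operator
```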