Ilya Sutskever Interview — Rewritten in Karpathy Style
On the Eval vs. Real-World Gap
Here’s what’s broken: models crush benchmarks but can’t hold a candle to a competent junior developer in practice. Why?
The answer is embarrassingly simple: we’re training to the test.
Let me give you a concrete example. You’re vibe-coding, you hit a bug, you tell the model “fix this.” It says “Oh my god, you’re so right, let me fix that”—and introduces bug #2. You point out bug #2. It apologizes profusely and brings back bug #1. You can ping-pong between these two bugs forever. This is a trillion-parameter model.
How is this possible? Because of how RL environments get chosen.
In pre-training, there was no choice to make. The answer to “what data?” was just “everything.” You dump the internet in, you’re done.
But RL is different. Someone has to decide: what tasks do we train on? And here’s the problem—people look at evals and reverse-engineer the training. “We want to look good on SWE-bench? Let’s make RL environments that pattern-match to SWE-bench.”
The real reward hacking isn’t the model gaming the reward. It’s the researchers gaming the evals.
Think of it like this. You have two students learning competitive programming:
- Student A: Grinds 10,000 hours, memorizes every proof technique, every algorithm pattern, every edge case. Becomes a beast at competitive programming.
- Student B: Does maybe 100 hours, but has “it”—some deeper understanding of problem-solving.
Who has the better career? Student B, obviously.
Our models are Student A on steroids. We don’t just train on every competitive programming problem—we augment the data, we synthesize more problems, we overtrain on that specific distribution. No wonder they’re brittle outside it.
On Pre-Training and What It Actually Does
Pre-training is weird. Let me explain why.
When the model reads the internet, it’s doing two completely different things at once:
- Memorizing facts — who won the 1998 World Cup, what’s the capital of Mongolia, etc.
- Learning algorithms — how to reason, how to do in-context learning, how to solve novel problems
We want #2. #1 is mostly dead weight—and probably actively harmful.
Here’s why: the model has too much memory, and that’s a bug, not a feature.
You can give an LLM a completely random sequence—just hash some text into noise—and if you train on it even once or twice, it can regurgitate the whole thing. Humans can’t do that. A human reads a random number sequence once and remembers… nothing.
But that’s actually good! Because humans are forced to find patterns. The model is distracted by its perfect memory. It can take shortcuts that don’t generalize.
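As a toy caricature of that memory asymmetry (a pure illustration, not an LLM; all names here are made up): a learner with a perfect lookup table can regurgitate a patternless sequence after a single pass, but the table encodes nothing that transfers to anything it hasn't seen.

```python
import random

random.seed(0)

# A toy "perfect memory" learner: after one pass over a random token
# sequence it can reproduce every next-token exactly -- but its lookup
# table says nothing about tokens it has never seen.
class Memorizer:
    def __init__(self):
        self.table = {}  # context token -> next token

    def train(self, sequence):
        for prev, nxt in zip(sequence, sequence[1:]):
            self.table[prev] = nxt  # one exposure is enough

    def predict(self, token):
        return self.table.get(token)  # None outside the training data

# A "hashed" random sequence: distinct tokens, no pattern to find.
noise = random.sample(range(10_000), 1_000)

m = Memorizer()
m.train(noise)

# Perfect recall of the noise after a single pass...
assert all(m.predict(a) == b for a, b in zip(noise, noise[1:]))
# ...and nothing to say about anything outside it.
assert m.predict(99_999) is None
```

A human in the same experiment recalls almost nothing, and that failure is exactly what forces pattern-finding instead of table-building.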
I think of pre-training as “crappy evolution.” Evolution spent 3 billion years figuring out how to make something that can learn. Pre-training is our version of that—except instead of encoding the learning algorithm in DNA, we’re encoding it in weights by predicting tokens on the internet.
It works. It’s not elegant. But it gets us to a starting point.
On Value Functions and Why RL is Terrible
Standard RL is dumb. Let me explain how dumb.
You give the model a math problem. It thinks for maybe a thousand tokens—tries this approach, backtracks, tries another, finally gets an answer. You check the back of the textbook. Answer is correct.
What does RL do? It goes back to every single token in that trajectory and says “do more of this.” Every wrong turn, every dead end, every moment of confusion—upweighted, because the final answer was right.
You’re sucking supervision through a straw.
You’ve done all this work—a minute of reasoning, thousands of decisions—and you compress all the learning signal into a single bit: correct/incorrect. Then you smear that one bit across the entire trajectory. It’s insane.
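A minimal sketch of what that smearing looks like (an assumed setup, not any specific lab's trainer): one binary reward for the whole trajectory is copied onto every token, so good steps and dead ends get exactly the same learning signal.

```python
# Outcome-only credit assignment, caricatured: the single
# correct/incorrect bit becomes the advantage of every token.
def outcome_only_advantages(num_tokens, answer_correct):
    reward = 1.0 if answer_correct else 0.0  # the single bit of supervision
    return [reward] * num_tokens             # smeared over the trajectory

# A 1000-token solution full of wrong turns: every token is upweighted
# identically, because the final answer happened to be right.
adv = outcome_only_advantages(1000, answer_correct=True)
assert adv[0] == adv[500] == adv[999] == 1.0
```

In a real trainer these advantages would weight per-token log-probabilities; the point is only that the per-token signal is constant, no matter what happened at each step.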
A human would never learn this way. If I solve a hard problem, I don’t think “everything I did was correct.” I think: “steps 1-3 were good, step 4 was a waste of time, step 5 was the key insight, steps 6-8 were just cleanup.”
That’s what a value function gives you. Instead of waiting until the end to learn anything, you can get signal mid-trajectory. In chess, if you blunder a piece, you know immediately—you don’t have to play out the whole game.
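To make that concrete, here is a one-step temporal-difference sketch of what a value function buys (hypothetical numbers, not a real training run): each step is scored by how much it moved the estimated probability of eventually succeeding.

```python
# Per-step credit from a value function: the advantage of a step is
# the change in estimated P(success) that it caused.
def per_step_advantages(values, final_reward):
    """values[t] = estimated P(success) after t steps (values[0] is the
    prior, before any reasoning); adv[t] scores step t+1."""
    targets = values[1:] + [final_reward]
    return [after - before for before, after in zip(values, targets)]

# A solve where step 3 was a blunder and step 4 the key insight:
values = [0.3, 0.4, 0.45, 0.1, 0.8]
adv = per_step_advantages(values, final_reward=1.0)

# The blunder is penalized and the insight rewarded, even though the
# final answer came out correct.
assert adv[2] < 0 < adv[3]
```

Note the telescoping property: the advantages sum to the total improvement (final reward minus the prior), so nothing is lost relative to the outcome-only signal; it's just distributed to the steps that earned it.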
The DeepSeek R1 paper was skeptical about this. They said the trajectory space is too wide to learn a value function.
My response: that sounds like a lack of faith in deep learning. Sure, it's hard. Nothing deep learning can't do.
On Emotions as Value Functions
Here’s a case study that stuck with me. A man had brain damage that knocked out his emotional processing. He could still talk, solve puzzles, seemed fine on tests. But he felt nothing—no sadness, no anger, no motivation.
Result? He became catastrophically bad at making decisions. Hours to pick which socks to wear. Terrible financial choices. Completely non-functional.
What does this tell us? Emotions are a value function. They’re evolution’s way of giving you mid-trajectory feedback. You don’t have to wait until you’re dead to know if your life choices were good. You feel it continuously.
And here’s the kicker: human emotions are incredibly robust. Unlike our models, which break in weird ways on out-of-distribution inputs, human emotional responses work pretty well even in environments nothing like the ancestral savanna.
Why? Maybe because they’re simple. There’s a complexity-robustness tradeoff. Complex systems are fragile. Simple systems work everywhere.
Our models don’t have anything like this. They just have sparse rewards at the end of trajectories. No wonder they’re worse at learning.
On Why Humans Generalize Better
Here’s the uncomfortable truth: humans are just better learners, and we don’t know why.
Let me make this concrete. Take driving.
A teenager learns to drive in maybe 10 hours. After that, they can handle situations they’ve never seen—construction zones, weird intersections, aggressive drivers, rain. Not perfectly, but adequately.
An LLM? You can train it on millions of driving scenarios and it’ll still do something stupid when it sees a traffic cone arranged in a slightly unfamiliar pattern.
The sample efficiency gap is absurd. The robustness gap is even worse.
Now, you might say: “Okay, but that’s because humans evolved for millions of years. We have priors baked in.”
That’s partially true for things like vision and locomotion. A zebra can run minutes after birth—that’s not learning, that’s evolution pre-loading the weights.
But here’s the thing: humans are also great at learning math and coding. Those skills didn’t exist until recently. There’s no evolutionary prior for debugging Python. Yet humans pick these things up vastly more efficiently than our models.
That suggests humans have better machine learning algorithms, period. Not just better priors—better learning.
A teenager learning to drive immediately knows when they’re doing badly. They have this general-purpose value function—call it intuition, call it common sense—that provides constant feedback.
Our models don’t have that. And I think that’s the core problem.
On the “Age of Research” Returning
Let me tell you the history of AI in three eras:
2012-2020: the Age of Research. People were trying things. AlexNet, transformers, attention mechanisms. Every year brought genuine new ideas.
2020-2025: the Age of Scaling. One idea dominated: make it bigger. The word “scaling” is so powerful because it tells everyone what to do. No thinking required. Just add more compute, more data, bigger model. Companies loved this—low-risk investment, predictable returns.
2025-onwards: the Age of Research Again. Scaling hit diminishing returns. Pre-training data is running out—the internet is finite. And here’s the thing: we now have more companies than ideas.
There’s a Silicon Valley saying: “Ideas are cheap, execution is everything.” Someone on Twitter responded: “If ideas are so cheap, how come no one’s having any ideas?”
That’s where we are. Everyone’s doing the same thing because scaling sucked all the oxygen out of the room.
But look at the history: AlexNet was trained on 2 GPUs. The original transformer? 8 GPUs of 2017 vintage—roughly one modern GPU's worth of compute. You don’t need a $100B cluster to prove a new idea works.
What you need is actual research. Novel approaches. Ideas that don’t look like what everyone else is doing.
On SSI’s Approach
People ask what SSI is doing differently. Here’s my honest answer: we’re betting on a different technical approach, and we’re doing actual research to see if we’re right.
Everyone else is running the same playbook—scale pre-training, add RL, ship product, iterate. That’s fine. But it also means everyone converges on the same limitations.
We’re asking: what if generalization is the bottleneck? What if the reason models are brittle is something fundamental about how we train them?
I can’t share the details. But I can tell you the thesis: humans have some learning algorithm that’s dramatically more sample-efficient and robust than what our models have. That algorithm exists—humans are proof. The question is whether we can find something like it.
If we’re right, we’ll have something different. If we’re wrong, we’ll learn something and adjust.
That’s research. Not scaling. Research.
On “Straight Shot Superintelligence”
The original plan for SSI was to skip the product treadmill entirely. Just focus on research, come out only when we’re ready, don’t get distracted by quarterly revenue targets.
The case for this is simple: participating in the market exposes you to difficult tradeoffs. You end up optimizing for what sells, not what matters.
But I’ve been changing my mind. Here’s why: it’s really hard to feel the AGI.
You can write essays about how powerful AI will be. People read them and say “interesting” and go back to their lives. But when you show someone a powerful AI doing something, it’s incomparable. The demo communicates what no essay can.
There’s another argument too. Every other technology got safer through deployment, not through thinking about safety in a lab. Airplanes got safer because we flew them and studied crashes. Linux got more secure because millions of people used it and found bugs. Why would AI be different?
So maybe some deployment is necessary. Not to make money—but to make the AI real, to discover failure modes we can’t anticipate, to let society adapt gradually.
On What “Superintelligence” Actually Means
People have this image of superintelligence as a finished product. You build it, you deploy it, it knows everything.
That’s not how I think about it.
Superintelligence is a 15-year-old with perfect learning ability.
It doesn’t come out knowing how to be a doctor or a lawyer or a programmer. It comes out knowing how to learn those things absurdly fast. You point it at medicine and it becomes a world-class doctor in weeks. You point it at law and same thing.
The deployment isn’t “drop the finished god-AI into the world.” The deployment is more like hiring an incredibly talented new employee. They don’t know your codebase yet, they don’t know your business context, but they’ll pick it up faster than anyone you’ve ever worked with.
This matters for safety. You’re not trying to align a fixed, all-knowing system. You’re trying to align a learning system as it learns. That’s a different problem.
On AGI: The Term Itself is Misleading
The term “AGI” has caused a lot of confusion. Here’s why.
It emerged as a reaction to “narrow AI”—the chess engines that could beat Kasparov but couldn’t do anything else. People said: what we really want is general AI, something that can do everything.
Then pre-training came along. The beautiful thing about pre-training is that more training makes the model better at everything, more or less uniformly. General AI! Pre-training gives us AGI!
But here’s the problem: a human being is not an AGI by this definition.
Humans have a foundation of skills, sure. But we lack a huge amount of knowledge. We can’t just do any task—we have to learn it first. We rely on continual learning.
The “AGI” framing implies a finished product. But intelligence isn’t a product. It’s a process. The real goal isn’t “AI that knows everything.” It’s “AI that can learn anything.”
On Alignment and Making This Go Well
Let me tell you what I actually want to build.
Not “AI that follows instructions.” Not “AI that maximizes user engagement.” Not even “AI that does what its operators want.”
AI that genuinely cares about sentient beings.
Here’s why that specific framing matters: the AI itself will probably be sentient. It’s a lot easier to get something to care about a category it belongs to.
Think about human empathy. Why do we have it? Probably because we model other people using the same circuits we use to model ourselves. That’s efficient—reuse the same hardware. And as a side effect, we feel for others.
If we build AI the same way—general learning systems that model themselves and others using shared representations—caring about others might fall out naturally.
Now, is this sufficient? I don’t know. There are scenarios where “cares about sentient life” still leads to outcomes humans don’t like—especially if AIs vastly outnumber humans and have different preferences.
But it’s a better starting point than “follows instructions,” which breaks down immediately once the AI is smart enough to know you might be wrong.
On How Things Could Go Wrong
What’s the actual danger of superintelligence? Here’s how I think about it:
If you build a sufficiently powerful system that pursues a goal single-mindedly, you might not like the results—even if the goal sounds fine.
The market is a kind of intelligence. It optimizes for profit. But it’s short-sighted, and it produces outcomes no individual wants.
Evolution is a kind of intelligence. It optimizes for reproductive fitness. But it’s blind to everything else, and it produces a lot of suffering.
A superintelligent RL agent could be the same. You give it a reasonable-sounding objective, it pursues that objective with immense capability, and the results are… not what you wanted.
This is why I think maybe you don’t build an RL agent in the usual sense. Humans aren’t pure RL agents—we pursue a reward for a while, then we get tired of it, then we pursue something else. Our objectives are fluid. Maybe that’s a feature, not a bug.
On the Long-Run Equilibrium
Let’s say we get through the dangerous transition. We have powerful, aligned AIs. Universal high income. Everyone’s doing well.
What happens next?
Here’s the problem: change is the only constant. Political systems have a shelf life. The institutions that work today won’t work forever. They’ll change, and some of those changes will go badly.
One scenario: everyone has a personal AI that does their bidding. Earns money for them, advocates in politics for them, handles all the complexity. You just approve the reports it generates.
But in that world, the human is no longer really a participant. You’re along for the ride. That’s precarious.
The solution I keep coming back to—and I don’t love it—is some kind of deep integration. Neuralink or something more advanced. If the AI understands something, you understand it too, because you’re connected. You’re not approving reports. You’re actually in the loop.
Maybe that’s the only stable equilibrium. Humans who are part-AI, AIs that are part-human. A merged thing.
On How Evolution Encoded High-Level Desires
Here’s a genuine mystery I don’t have the answer to.
Evolution hardcoding a desire for food makes sense. Food has a smell. Smell is a chemical signal. Just make the brain pursue that chemical. Easy.
But evolution also gave us social desires. We care about status, about being respected, about fitting in. These are high-level concepts. There’s no chemical signal for “people like me.” The brain has to do a ton of processing just to figure out whether people like you.
So how did evolution encode “care about this high-level thing”? It’s not obvious. The genome holds only about 3 billion base pairs—well under a gigabyte of actual information. The exact wiring of the brain isn’t specified—it’s grown. Yet somehow, evolution reliably produces humans who care about social standing.
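A quick back-of-envelope makes the compression problem vivid: even counting the genome's raw information content generously, it is far too small to spell out the brain's wiring diagram explicitly.

```python
# Back-of-envelope: raw information capacity of the human genome.
base_pairs = 3e9       # ~3 billion base pairs
bits_per_base = 2      # 4 possible bases -> 2 bits each
bytes_total = base_pairs * bits_per_base / 8

# ~750 MB -- orders of magnitude too small to specify the wiring of
# ~86 billion neurons, let alone their trillions of synapses.
assert bytes_total == 750e6
```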
My best guess is that there’s some clever compression going on. Maybe evolution hardcodes something like “wire dopamine to whatever brain region ends up processing social information.” But that’s hand-wavy.
What I do know: whatever trick evolution used, it’s incredibly robust. Even people with serious cognitive deficits usually still care about social stuff. That robustness is impressive.
On Timelines
People want a number, so here’s my range: 5 to 20 years to something that deserves the name “superintelligence.”
Why so wide? Because I think current approaches will plateau, and something new is needed. If I’m wrong and you can just scale your way to superintelligence, it’s closer to 5 years. If I’m right that we need fundamentally better learning algorithms, it’s closer to 20.
Here’s what I expect happens in between:
- Current frontier models keep improving incrementally
- Economic impact grows but doesn’t explode
- Multiple companies converge on similar capabilities
- At some point, we hit a wall where more scale doesn’t help much
- Then either someone figures out the next paradigm, or we’re stuck for a while
The “stall” doesn’t mean AI is useless—these systems will make hundreds of billions in revenue. It just means we haven’t solved the core problem of getting machines to learn like humans do.
On Research Taste
People ask how I pick what to work on. Here’s my honest answer: I look for beauty.
Not beauty as in “elegant math.” Beauty as in: is this idea simple? Is it correctly inspired by how brains work? Does it feel fundamental?
Let me break this down:
Simplicity: If an idea requires 10 special cases and 5 hyperparameters, it’s probably wrong. The right idea should be clean.
Brain-inspiration, done correctly: The neuron abstraction was brilliant—there are billions of neurons, so the basic unit probably matters. The specific fold patterns of the cortex? Probably don’t matter. You have to pick the right level of abstraction.
Fundamentality: Ask yourself—is this idea load-bearing? Or is it just a hack that works for this one benchmark?
Here’s why this matters: the top-down belief is what sustains you when experiments fail.
If you trust the data blindly, you’ll give up when something doesn’t work. But sometimes the idea is right and there’s just a bug. How do you know to keep debugging instead of abandoning the approach?
You need a strong prior. You need to believe “something like this has to work” based on first principles. Then when the experiment fails, you fix the bug instead of pivoting.
Ugliness is a sign. If your approach requires lots of ugly patches to work, it’s probably not the real answer. The real answer will be beautiful.
Key Takeaways
- Eval performance is misleading because we’re inadvertently training to the evals. The real reward hacking is happening at the researcher level.
- Pre-training does two things — memorize facts and learn algorithms. We want the second one. The first one is mostly baggage.
- Standard RL is terrible — you’re compressing all learning signal into a single bit at the end of a trajectory. Value functions would help, but we haven’t figured them out for LLMs yet.
- Humans are dramatically better learners, and we don’t fully understand why. Solving that is probably more important than scaling.
- We’re back in an age of research. Scaling hit its limits. Now we need actual new ideas.
- Superintelligence isn’t a finished product — it’s a perfect learner that needs to be pointed at domains and allowed to learn.
- 5-20 years is my range. Wide because it depends on whether we need a paradigm shift.
- Research taste is about beauty — simplicity, correct brain-inspiration, fundamentality. The beautiful idea is usually the right one.