Dwarkesh Patel
11:40 · 10/4/25

Some thoughts on the Sutton interview

TL;DR

The speaker challenges Richard Sutton's 'bitter lesson' perspective on AI development, arguing that current LLMs and imitation learning are crucial intermediaries for achieving AGI, even if future systems adopt Sutton's vision of self-directed, continuous learning.

Takeaways

Sutton's 'bitter lesson' critiques LLMs for their inefficiency, reliance on human data, and lack of continuous, on-the-job learning.

Imitation learning with human data is a crucial prior, facilitating the development of advanced AI capabilities through subsequent reinforcement learning.

LLMs show promising signs of developing world representations and could potentially replicate continual learning with architectural or methodological enhancements.

The speaker reflects on Richard Sutton's 'bitter lesson' worldview, which posits that current Large Language Models (LLMs) are inefficient due to their reliance on human data and lack of continuous, on-the-job learning. The speaker largely disagrees with Sutton's sharp distinctions, contending that imitation learning is continuous with and complementary to reinforcement learning (RL), and that human-derived data acts as a necessary prior for kickstarting powerful AI capabilities. While acknowledging the future potential of Sutton's vision, the speaker believes LLMs are on a viable path to AGI.

Sutton's Bitter Lesson Perspective

00:00:19 Richard Sutton's 'bitter lesson' argues against throwing compute away and instead advocates for techniques that effectively leverage it. Current LLMs are inefficient because they only learn during a special training phase, not during deployment, and this training is highly inefficient, relying on human data to build a model of 'what a human would say next' rather than a true world model. This approach is not scalable, and LLMs' inability to learn on the job suggests a new architecture enabling continual learning will eventually supersede the current paradigm, making special training phases obsolete.

Imitation Learning's Role

00:03:13 Imitation learning is continuous with and complementary to reinforcement learning (RL): pre-trained LLMs can serve as a crucial prior for experiential learning. This is analogous to fossil fuels, which were essential intermediaries for human civilization's technological advancement. Although AlphaZero, bootstrapped from scratch, outperformed AlphaGo, which was initialized on human games, that result does not show human data is detrimental; it can still provide a necessary foundation. Human cultural learning itself is a form of imitation learning, underscoring the value of knowledge accumulated across generations.

LLMs and World Models

00:06:33 The critical question is whether imitation learning can help models learn better from ground truth. The success of RL-fine-tuned pre-trained base models in solving unseen math Olympiad questions or coding applications demonstrates that a 'reasonable prior over human data' is essential to kickstart the RL process. Whether this prior is called a 'world model' or a 'model of humans' is semantic; its utility in enabling learning from ground truth is what matters. LLMs develop deep representations of the world because their training incentivizes it, even if they aren't explicitly trained to model how their actions affect the world.
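The claim that a prior over human data "kickstarts" RL can be illustrated with a toy sparse-reward simulation (a sketch of the intuition, not anything from the episode): with a tabula-rasa uniform policy over a huge action space, reward is almost never sampled, so RL gets almost no signal; a pretrained prior that has already narrowed the policy to a plausible region finds reward often enough to learn from. All names and numbers here are made up for illustration.

```python
import random

random.seed(0)

N_ACTIONS = 10_000          # large action space (stand-in for possible outputs)
GOOD = set(range(50))       # tiny set of rewarded actions ("correct answers")

def rl_hits(policy_support, trials=1_000):
    """Count how often sparse reward is sampled, i.e. how much signal RL gets."""
    return sum(random.choice(policy_support) in GOOD for _ in range(trials))

# From-scratch policy: uniform over everything.
scratch = list(range(N_ACTIONS))

# "Imitation prior": pretraining on human data has already concentrated the
# policy on a plausible region that happens to contain the good actions.
prior = list(range(200))

print(rl_hits(scratch), rl_hits(prior))  # prior hits reward far more often
```

The prior's support is 50x richer in rewarded actions, so RL updates start flowing immediately instead of waiting on lucky random exploration.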

Continual Learning

00:08:26 Continual learning, where AI agents learn from environments in a high-throughput way, is necessary for true AGI and does not currently exist with LLMs trained on RL from human feedback. However, straightforward methods might shoehorn continual learning onto LLMs, such as making supervised fine-tuning a tool call for the model, enabling it to effectively teach itself. The emergence of in-context learning, where models demonstrate a form of continual learning within their context windows, suggests that if information could flow across longer windows, models might meta-learn similar flexibility, potentially replicating true continual learning.
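The "supervised fine-tuning as a tool call" idea can be sketched with a toy agent whose "weights" are just a lookup table it is allowed to update on itself. This is a hypothetical illustration of the mechanism described above, not a real fine-tuning API; `SelfTeachingAgent` and `fine_tune` are invented names.

```python
from dataclasses import dataclass, field

@dataclass
class SelfTeachingAgent:
    """Toy agent whose 'weights' are a lookup table it can update via a tool call."""
    memory: dict = field(default_factory=dict)

    def answer(self, question: str) -> str:
        return self.memory.get(question, "I don't know")

    def fine_tune(self, examples: dict) -> None:
        """Hypothetical self-SFT tool: fold new (question, answer) pairs into
        the weights (here, the lookup table) so the lesson persists."""
        self.memory.update(examples)

agent = SelfTeachingAgent()
print(agent.answer("capital of France?"))   # I don't know

# On the job, the agent encounters ground truth and calls its own SFT tool,
# so the knowledge survives beyond any single context window.
agent.fine_tune({"capital of France?": "Paris"})
print(agent.answer("capital of France?"))   # Paris
```

The contrast with in-context learning is the point: context-window "learning" evaporates when the window ends, whereas a tool call that writes into the model's own training loop would make the update permanent.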