Matthew Berman
8:43 · 10/22/25

New DeepSeek just did something crazy...

TLDR

DeepSeek OCR introduces a novel method that compresses text by representing it as images, significantly expanding the effective context window of large language models while maintaining accuracy.

Takeaways

DeepSeek OCR compresses text 10x by representing it as images, maintaining 97% accuracy.

This method overcomes the context window bottleneck in LLMs, allowing them to process significantly more information efficiently.

Experts believe converting text to image inputs could fundamentally enhance LLM capabilities, enabling more powerful and general information processing.

DeepSeek has developed DeepSeek OCR, a new vision language model that uses images to represent text, achieving up to 10 times text compression while maintaining 97% accuracy. This breakthrough addresses the context window bottleneck in large language models (LLMs), allowing them to process substantially more information efficiently. The technology has the potential to make text-based models far more powerful by enabling longer context windows without quadratically increasing compute costs.

DeepSeek OCR Breakthrough

00:00:00 DeepSeek OCR introduces a novel approach to image recognition, enabling the compression of text by representing it within an image format. This method achieves a 10x compression ratio for text while retaining 97% accuracy, addressing the significant bottleneck of context windows in large language models (LLMs). By converting text to images, LLMs can process far more information within the same token budget, thereby enhancing their power and efficiency.
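The token savings claimed above are easy to make concrete. A minimal sketch, using only the 10x figure quoted in the video (the function name and document sizes are illustrative, not from DeepSeek's code):

```python
# Back-of-the-envelope view of the claimed 10x optical compression.
# The 10x ratio comes from the talk; everything else is illustrative.

def vision_token_budget(text_tokens: int, compression_ratio: float = 10.0) -> int:
    """Vision tokens needed to carry `text_tokens` worth of plain text
    at the stated compression ratio."""
    return round(text_tokens / compression_ratio)

doc_tokens = 100_000                    # a long document in ordinary text tokens
needed = vision_token_budget(doc_tokens)
print(needed)                           # -> 10000 vision tokens for the same content
```

In other words, a document that would consume 100,000 text tokens fits in roughly 10,000 vision tokens of the same context window, at the reported 97% accuracy.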

Context Window Bottleneck

00:01:07 The primary limitation for current large language models like Gemini and ChatGPT is the context window, which dictates how many tokens can be processed in a prompt. Scaling up this context window leads to a quadratic increase in compute costs, making it inefficient. DeepSeek OCR offers a solution by allowing 10 times more context to be included in the context window without altering its underlying structure, thereby circumventing the escalating compute demands.
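Because self-attention compares every token with every other token, its cost grows quadratically with sequence length. A short sketch of why a 10x token reduction matters (a simplified cost model, ignoring constant factors and optimizations like FlashAttention):

```python
# Simplified model of self-attention cost: n tokens attend to n tokens,
# so compute scales as n^2. Constants and kernel optimizations are ignored.

def attention_cost(tokens: int) -> int:
    """Pairwise token comparisons in one attention pass."""
    return tokens * tokens

text_tokens = 100_000
vision_tokens = text_tokens // 10       # same content after 10x optical compression

ratio = attention_cost(text_tokens) / attention_cost(vision_tokens)
print(ratio)                            # -> 100.0
```

A 10x reduction in tokens therefore yields roughly a 100x reduction in attention compute for the same content, which is why compressing the input is more attractive than simply scaling the window.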

How DeepSeek OCR Works

00:02:54 DeepSeek OCR processes an image of text, such as a PDF page, by splitting it into 16x16 patches. An 80 million parameter SAM model handles local detail recognition, and a 300 million parameter CLIP model compresses that information into vision tokens for reconstruction. A 3 billion parameter DeepSeek mixture-of-experts model then decodes the compressed vision tokens back into text. This pipeline enables efficient text compression and decompression, achieving 96%+ OCR decoding precision at 9-10x compression, though accuracy degrades at higher compression ratios.
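The three-stage flow can be sketched as a toy pipeline. This is not DeepSeek's implementation: the function names are hypothetical, and each stage is a stand-in that only mimics the data flow and the rough 10:1 reduction described above.

```python
# Toy stand-in for the DeepSeek OCR pipeline described in this section.
# Only the stage order (SAM -> CLIP -> MoE decoder) and the rough 10x
# reduction come from the talk; all names and logic here are illustrative.

def sam_encode(patches):
    """Stage 1 (~80M-param SAM role): one local feature per image patch."""
    return [("feat", p) for p in patches]

def clip_compress(features, ratio=10):
    """Stage 2 (~300M-param CLIP role): crudely keep one vision token
    per `ratio` local features to mimic the compression step."""
    return features[::ratio]

def moe_decode(vision_tokens):
    """Stage 3 (3B-param MoE decoder role): reconstruct text from tokens."""
    return " ".join(str(t[1]) for t in vision_tokens)

patches = list(range(160))              # page split into patches
features = sam_encode(patches)          # 160 local features
tokens = clip_compress(features)        # compressed to 16 vision tokens
text = moe_decode(tokens)               # decoded back to text
print(len(features), len(tokens))       # -> 160 16
```

The point of the sketch is the shape of the pipeline: a cheap local encoder, a compressor that shrinks the token count, and a large decoder that pays its quadratic attention cost only over the compressed tokens.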

Implications and Reactions

00:05:55 Experts like Andrej Karpathy and Brian Roemmele have reacted positively to DeepSeek OCR, highlighting its potential to revolutionize LLM inputs. Karpathy suggests that pixels may be better LLM inputs than text tokens, advocating rendering all text input as images to achieve greater information compression, shorter context windows, and a more general input stream that captures rich formatting beyond plain text. The technology could enable LLMs to process an entire encyclopedia compressed into a single high-resolution image, unlocking new use cases.