
I quite like the new DeepSeek-OCR paper | Andrej Karpathy


The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in.

I dislike the tokenizer. Tokenizers are an ugly, separate, not end-to-end stage. The tokenizer "imports" all the ugliness of Unicode and byte encodings, and it inherits a lot of historical baggage and security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
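To make the "identical to the eye, different inside" point concrete, here is a minimal sketch in Python using only the standard library. It looks at raw UTF-8 bytes, which is an assumption standing in for a real tokenizer: byte-level tokenizers start from these bytes and then merge them, so actual token IDs will differ. The helper names `utf8_bytes` and `is_continuation` are hypothetical, chosen for illustration.

```python
import unicodedata

# Two characters that render almost identically in most fonts.
latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430
smiley = "\U0001F600"    # an emoji, a single code point


def utf8_bytes(s: str) -> list[int]:
    """Return the UTF-8 encoding of s as a list of byte values."""
    return list(s.encode("utf-8"))


def is_continuation(b: int) -> bool:
    """True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0b1100_0000) == 0b1000_0000


for ch in (latin_a, cyrillic_a, smiley):
    data = utf8_bytes(ch)
    n_cont = sum(is_continuation(b) for b in data)
    print(unicodedata.name(ch), [hex(b) for b in data], "continuation bytes:", n_cont)
```

The two look-alike letters come out as unrelated byte sequences of different lengths, and the emoji becomes four bytes, three of them continuation bytes that mean nothing on their own. A model fed rendered pixels would instead see two nearly identical glyphs and an actual face.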

https://github.com/deepseek-ai/DeepSeek-OCR

https://x.com/karpathy/status/1980397031542989305?s=43&t=gAZhA3-2h2DvLb-eSzGa5A
June 5, 2026 at 2:00:26 PM EDT *
ai llm ocr pdf
Shaarli · The personal, minimalist, super fast, database-free, bookmarking service by the Shaarli community · Documentation