Published on 2025-10-21 · CSU-JPG Lab

People See Text, But LLMs Don’t

“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are.” — Davis, Matt (2012)

You can probably read that sentence effortlessly. Despite the chaotic order of letters, your brain automatically reconstructs the intended words — because humans don’t read text letter by letter. We perceive shapes, patterns, and visual words.

1. Humans See Text, LLMs Process Text

1.1 The Visual Nature of Reading

Cognitive neuroscience has long shown that our brain recruits a specialized region called the Visual Word Form Area (VWFA) in the left occipitotemporal cortex. This area recognizes entire word forms as visual objects, not symbolic sequences. That’s why humans can read “Cmabrigde” as Cambridge, or identify words in distorted fonts, mixed scripts, and complex layouts.

Figure: the Visual Word Form Area shows that reading is visual; the brain “sees” words (image source: “The Science Behind LearningRx Reading Programs”).

In essence, people see text. Reading is vision — not just language processing.

1.2 How LLMs Process Text (and Why It’s Different)

Large Language Models (LLMs), in contrast, do not “see” text. They tokenize it — breaking sentences into subword units like “vis”, “ion”, “cent”, “ric”. Each token becomes an integer ID looked up in a vocabulary. This is efficient for symbolic computation, but it destroys the visual and structural continuity of language. As a result, the holistic form of text is lost in translation. Humans can easily read “t3xt” as “text,” but token-based models treat them as unrelated sequences.
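To make the contrast concrete, here is a minimal sketch using the open-source tiktoken tokenizer (the choice of tokenizer is an assumption; the post does not name one). Visually near-identical strings land on unrelated token IDs:

# Minimal sketch: visually similar strings map to unrelated subword IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["text", "t3xt", "Cambridge", "Cmabrigde"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:14} -> ids={ids} pieces={pieces}")

# "text" is typically a single token, while "t3xt" splits into several
# unrelated subwords; the visual similarity between the two strings is
# invisible to the model, which only sees the integer IDs.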

1.3 The Consequences of Tokenization

Interleaved documents, where text, figures, and layout are intertwined, are hard to process with tokenization:
Figure: discrete tokenization vs. visual perception, illustrated with Leonardo da Vinci’s manuscripts.

This is why a model can be shown a paper or a screenshot and still miss its equations, tables, or captions: to the model they are just pixels, disconnected from the token stream it reasons over.

1.4 Text Is Ubiquitous in Web Images

Over 45% of web images contain text (e.g., in LAION-2B; Lin et al., Parrot, ECCV 2024). Documents, UIs, charts, and designs are inherently visual text.

Figure: text prevalence in web images (Parrot); text is ubiquitous in documents, UI, charts, and designs.

2. Early Attempts: Making Models See Text / Unified Models

Several early studies tried to bridge this gap by treating text as an image signal.

These works helped models see text, but they left a key question open: what tangible benefit does transforming text into images actually provide?

3. Recent Attempts: Visual Tokens for Long-Context Compression

Key observations:


Interleaved document-level multimodal pretraining is one ideal setting.
NeurIPS 2024 — “Leveraging Visual Tokens for Extended Text Contexts”: represents long text as compact visual tokens, enabling longer and denser context at training and inference. During pretraining on NVIDIA H100s, this raises the in-context text length from 256 → 2048.

Figure: leveraging visual tokens for extended in-context understanding.
Figure: H100 pretraining, with in-context text length improved from 256 → 2048.
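The core operation behind these works can be sketched in a few lines: rasterize a long text span into an image, then count how many patch tokens a standard vision encoder would need for it. The canvas size, patch size, and merging factor below are illustrative assumptions, not the paper’s actual settings.

# Illustrative sketch (not the paper's code): render distant text context as
# an image, then estimate how many visual tokens a patch-based encoder needs.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(text: str, side: int = 448) -> Image.Image:
    """Rasterize text onto a square white canvas (size and font are assumptions)."""
    img = Image.new("RGB", (side, side), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    y = 4
    for line in textwrap.wrap(text, width=70):
        draw.text((4, y), line, fill="black", font=font)
        y += 14
        if y > side - 14:
            break
    return img

long_context = "word " * 2000          # stand-in for a distant document chunk
img = render_text_to_image(long_context)

# A ViT-style encoder with 14x14 patches sees (448 / 14)**2 = 1024 patches;
# an assumed 4x token merger would leave 256 visual tokens for the whole span,
# independent of how many text tokens the span would otherwise cost.
patches = (448 // 14) ** 2
print(img.size, patches, patches // 4)  # (448, 448) 1024 256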

Long-text understanding in LLMs is another ideal setting.
Xing et al., NeurIPS 2025 — “Vision-Centric Token Compression in Large Language Models” (VIST): inspired by the human slow–fast reading circuit, a fast visual path renders distant context as images for quick semantic extraction, while a slow path feeds the key text to the LLM for deep reasoning.

Figure: VIST (NeurIPS 2025), a fast visual path plus a slow language path.
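A hedged sketch of the slow–fast split: the most recent tokens stay as plain text for the slow path, while the distant remainder is rendered as an image for the fast visual path. The helper names and the 1024-token budget are illustrative, not VIST’s actual configuration.

# Sketch of a slow-fast context split (names and budget are assumptions).
from PIL import Image, ImageDraw

def render_fast_path(text: str, side: int = 448) -> Image.Image:
    """Crude rasterizer standing in for the fast path's text-to-image renderer."""
    img = Image.new("RGB", (side, side), "white")
    ImageDraw.Draw(img).text((4, 4), text[:4000], fill="black")
    return img

def split_slow_fast(tokens: list[str], slow_budget: int = 1024):
    """Recent tokens stay as text (slow path); distant context becomes an image."""
    distant, recent = tokens[:-slow_budget], tokens[-slow_budget:]
    return render_fast_path(" ".join(distant)), recent

doc_tokens = ("lorem ipsum " * 4000).split()      # 8000-token stand-in document
fast_img, slow_text = split_slow_fast(doc_tokens)
print(fast_img.size, len(slow_text))              # (448, 448) 1024

# Downstream (not shown): a vision encoder compresses fast_img into a few
# hundred visual tokens that are prepended to the slow path's text tokens.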

DeepSeek's answer: OCR.

DeepSeek-OCR (Oct 2025): Contextual Optical Compression extends visual-token compression to OCR, compressing thousands of text tokens into a few hundred visual representations and reaching ~97% OCR precision at roughly 10× compression. DeepSeek-OCR is driven by powerful infrastructure and painstaking large-scale data preparation, which lets it scale visual-token compression far beyond prior work.

Figure: DeepSeek-OCR contextual optical compression, reaching ~97% precision at ~10× compression.
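For intuition, here is a back-of-the-envelope version of the trade being made; the page token count, resolution, patch size, and merging factor are assumptions chosen only to land near the reported ~10× figure.

# Back-of-the-envelope optical-compression arithmetic (illustrative numbers,
# not DeepSeek-OCR's actual configuration).
text_tokens_per_page = 3000            # a dense page encoded as ordinary text tokens
image_side, patch_size = 1024, 16      # rendered page resolution and ViT patch size
merge_factor = 16                      # assumed downsampling inside the encoder

vision_tokens = (image_side // patch_size) ** 2 // merge_factor   # 4096 // 16 = 256
ratio = text_tokens_per_page / vision_tokens
print(f"{vision_tokens} vision tokens, ~{ratio:.0f}x compression")  # 256, ~12x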

The convergence of visual perception and language understanding is not a coincidence — it is the next paradigm shift.

4. The Future Ahead: A Vision-centric MLLM

In a truly vision-centric multimodal language model, we may no longer need a traditional tokenizer. The model could read text visually — as humans do — and even generate text as images, unifying perception and generation in the same visual space.

Dense Text Image Generation:

To reach that goal, we must perfect image-based text rendering and long-text visual generation: TextAtlas5M provides a large-scale dense text rendering dataset in which captions, documents, and designs are represented visually.

Figure: the TextAtlas5M dataset for large-scale dense text rendering (arXiv, Feb. 2025).

Beyond Words aims to generate text-heavy, information-dense images from natural prompts, pushing multimodal autoregressive models toward true long-text visual generation.
Figure: Beyond Words, toward long-text visual generation (arXiv, Feb. 2025).

More tasks in LLMs:

Xing et al., arXiv, Oct. 2025 — “See the Text: From Tokenization to Visual Reading”: this work extends the idea to more tasks, including classification and QA.

Figure: See the Text (arXiv, Oct. 2025).

5. Toward the Next Generation of Vision-centric MLLM

The ultimate goal is a model that reads, writes, and sees text the way humans do — through vision.
People see text. Soon, LLMs & LVMs will too.

