People See Text, But LLMs Don't
“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are.” — Davis, Matt (2012)
You can probably read that sentence effortlessly. Despite the chaotic order of letters, your brain automatically reconstructs the intended words — because humans don’t read text letter by letter. We perceive shapes, patterns, and visual words.
1. How Humans See Text vs. How LLMs Process Text
1.1 The Visual Nature of Reading
Cognitive neuroscience has long shown that our brain recruits a specialized region called the Visual Word Form Area (VWFA) in the left occipitotemporal cortex. This area recognizes entire word forms as visual objects, not symbolic sequences. That’s why humans can read “Cmabrigde” as Cambridge, or identify words in distorted fonts, mixed scripts, and complex layouts.
In essence, people see text. Reading is vision — not just language processing.
1.2 How LLMs Process Text (and Why It’s Different)
Large Language Models (LLMs), in contrast, do not “see” text. They tokenize it — breaking sentences into subword units like “vis”, “ion”, “cent”, “ric”. Each token becomes an integer ID looked up in a vocabulary. This is efficient for symbolic computation, but it destroys the visual and structural continuity of language. As a result, the holistic form of text is lost in translation. Humans can easily read “t3xt” as “text,” but token-based models treat them as unrelated sequences.
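To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers package and the GPT-2 tokenizer (exact splits depend on the vocabulary you load, but any subword tokenizer shows the same effect): visually similar strings map to unrelated token sequences.

```python
# Minimal sketch: visually similar strings get unrelated token sequences.
# Assumes the Hugging Face `transformers` package and the GPT-2 tokenizer;
# the exact splits depend on the vocabulary, but the mismatch is generic.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for s in ["text", "t3xt", "Cambridge", "Cmabrigde"]:
    pieces = tokenizer.tokenize(s)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(f"{s!r:14} -> {pieces} -> {ids}")

# A human reads "t3xt" and "Cmabrigde" at a glance; the tokenizer produces
# token sequences that share little or nothing with "text" and "Cambridge".
```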
1.3 The Consequences of Tokenization
- Loss of visual semantics: Fonts, shapes, and layout cues disappear.
- Over-segmentation in multilingual text: Low-resource languages get fragmented into meaningless subwords.
- Inefficiency for long text: A few characters can turn into multiple tokens, inflating context length and cost.
This is also why a model can ingest a paper or a screenshot yet miss its equations, tables, or captions: that content reaches the model as pixels, not as text the tokenizer ever sees.
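The inflation is also easy to measure. The sketch below, again assuming the GPT-2 tokenizer (a multilingual tokenizer shifts the numbers but not the trend), counts tokens per character for an English sentence and for two non-Latin-script greetings.

```python
# Sketch: tokens-per-character as a rough measure of over-segmentation.
# Assumes the GPT-2 tokenizer; non-Latin scripts fall back to many
# byte-level pieces, inflating sequence length and cost.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Reading is vision, not just language processing.",
    "Hindi": "नमस्ते दुनिया",    # "Hello, world"
    "Thai": "สวัสดีชาวโลก",     # "Hello, world"
}

for name, sentence in samples.items():
    n_tokens = len(tokenizer.encode(sentence))
    ratio = n_tokens / len(sentence)
    print(f"{name:8} {len(sentence):3d} chars -> {n_tokens:3d} tokens "
          f"({ratio:.2f} tokens/char)")
```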
1.4 Text Is Everywhere in Web Images
More than 45% of web images contain text (e.g., in LAION-2B; Lin et al., Parrot, ECCV 2024). Documents, UIs, charts, and designs are inherently visual text.
2. Early Attempts: Making Models See Text / Unified Models
Several early studies tried to bridge this gap by treating text as an image signal:
- Visual Text Representations (Salesky et al., EMNLP 2021): replaces discrete subword vocabularies with continuous embeddings derived from rendered text; matches subword baselines in translation quality while being far more robust to noisy or corrupted input.
- PIXEL (Rust et al., ICLR 2023): renders text as images and pretrains a ViT-MAE to reconstruct masked pixels, removing the tokenizer entirely; robust to unseen scripts and orthographic perturbations (a minimal rendering sketch follows this list).
- CLIPPO (Tschannen et al., CVPR 2023): unifies images and text under a single pixel-based encoder; text is rendered as images and trained with a contrastive loss.
- Pix2Struct (Lee et al., ICML 2023): pretrains by parsing web screenshots into simplified HTML, treating documents and layouts as structured visual input.
- PTP (Gao et al., arXiv 2024): trains screenshot language models with a Patch-and-Text Prediction objective, jointly reconstructing masked image patches and masked text.
- PEAP (Lyu et al., arXiv 2025): a unified perception paradigm for agentic language models that interact directly with real-world environments by combining visual and textual information.
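Mechanically, the rendering step these works share is simple to sketch: draw the text into a pixel strip and slice it into fixed-size patches, which then play the role subword tokens play in an ordinary LLM. Here is a minimal version, assuming Pillow and NumPy; the 16×16 patch size and the default bitmap font are illustrative choices, not any paper's exact recipe.

```python
# Sketch of pixel-based "tokenization": render text, slice it into patches.
# Assumes Pillow and NumPy; patch size and font are illustrative, not the
# configuration used by PIXEL, CLIPPO, or Pix2Struct.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_to_patches(text: str, patch: int = 16) -> np.ndarray:
    font = ImageFont.load_default()
    # Measure the rendered width, then draw black text on a white strip.
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    width = int(probe.textlength(text, font=font)) + patch
    width += (-width) % patch                 # pad to a multiple of `patch`
    strip = Image.new("L", (width, patch), color=255)
    ImageDraw.Draw(strip).text((2, 2), text, fill=0, font=font)
    pixels = np.array(strip)                  # shape: (patch, width)
    # Split the strip into non-overlapping (patch x patch) squares.
    return pixels.reshape(patch, -1, patch).transpose(1, 0, 2)

patches = render_to_patches("According to a researcher at Cambridge University...")
print(patches.shape)   # (num_patches, 16, 16); these replace subword tokens
```

Swapping the bitmap font for a proper Unicode font extends the same pipeline to any script, which is the kind of robustness these papers report.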
These works helped models see text, but left a key question open: what tangible benefit does transforming text into images actually provide?
3. Recent Attempts: Visual Tokens for Long-Context Compression
Key observations:
1. Vision encoders are typically much smaller than LLMs (e.g., ~100M parameters for ViT-B vs. 7B+ for LLaMA/Mistral).
2. CLIP-style pretraining yields emergent OCR-like abilities without explicit supervision.
3. Visual patches can encode dense textual content (more characters per unit of spatial area), effectively extending context via spatial compression.
Interleaved, document-level multimodal pretraining is an ideal setting in which to exploit these properties.
“Leveraging Visual Tokens for Extended Text Contexts” (NeurIPS 2024) represents long text as compact visual tokens, enabling longer and denser contexts at both training and inference. This increases the usable in-context text length from 256 to 2,048 during pretraining on NVIDIA H100 GPUs.
Long-text understanding in LLMs is another ideal setting.
VIST (Xing et al., NeurIPS 2025), “Vision-Centric Token Compression in Large Language Models”: inspired by the human slow–fast reading circuit, a fast path renders distant context as images for quick semantic extraction, while a slow path feeds the key text to the LLM for deep reasoning.
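The routing idea can be sketched in a few lines. Everything below is hypothetical: the helper names and the character-based split are illustrative, not VIST's interface, and the grouping stand-in would be replaced by a trained vision encoder over rendered context.

```python
# Hypothetical sketch of slow-fast routing; names and thresholds are
# illustrative only, not VIST's actual API.
def encode_distant_context_visually(text: str, chars_per_visual_token: int = 64) -> list[str]:
    # Stand-in for "render distant context and encode it with a small
    # vision encoder": grouping characters shows the many-to-one
    # compression of the fast path.
    return [text[i:i + chars_per_visual_token]
            for i in range(0, len(text), chars_per_visual_token)]

def route(context: str, question: str, recent_chars: int = 1000):
    distant, recent = context[:-recent_chars], context[-recent_chars:]
    fast_path = encode_distant_context_visually(distant)   # cheap, compressed
    slow_path = recent + "\n\nQuestion: " + question        # full-fidelity text
    return fast_path, slow_path

fast, slow = route("A very long document. " * 500, "What is the main claim?")
print(f"{len(fast)} visual tokens summarize the distant context; "
      f"{len(slow)} characters go to the LLM as text.")
```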
DeepSeek-OCR (Oct 2025): Contextual Optical Compression extends visual-token compression to OCR, compressing thousands of text tokens into a few hundred visual representations and reaching ~97% OCR precision at ~10× compression. DeepSeek-OCR was driven by powerful infrastructure and painstaking large-scale data preparation, enabling it to scale visual-token compression far beyond prior work.
The convergence of visual perception and language understanding is not a coincidence — it is the next paradigm shift.
4. The Future Ahead: A Vision-centric MLLM
In a truly vision-centric multimodal language model, we may no longer need a traditional tokenizer. The model could read text visually — as humans do — and even generate text as images, unifying perception and generation in the same visual space.
Dense Text Image Generation:
To reach that goal, we must perfect image-based text rendering and long-text visual generation:
- TextAtlas5M provides large-scale dense text rendering, in which captions, documents, and designs are represented visually.
- Beyond Words aims to generate text-heavy, information-dense images from natural prompts, pushing multimodal autoregressive models toward true long-text visual generation.
- “See the Text: From Tokenization to Visual Reading” (Xing et al., arXiv 2025) extends this idea to more tasks, including classification and QA.
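At the data level, the kind of dense text image these efforts target is straightforward to produce. Here is a minimal sketch using Pillow (page size, font, and margins are arbitrary choices) that wraps a long passage into a page-like image.

```python
# Sketch: render a long passage into a dense text image of the kind these
# benchmarks target. Assumes Pillow; page width, font, and margins are
# arbitrary illustrative choices.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_dense_page(text: str, width: int = 512, line_height: int = 14) -> Image.Image:
    font = ImageFont.load_default()
    lines = textwrap.wrap(text, width=80)            # roughly 80 characters per line
    page = Image.new("L", (width, line_height * (len(lines) + 2)), color=255)
    draw = ImageDraw.Draw(page)
    for i, line in enumerate(lines):
        draw.text((8, line_height * (i + 1)), line, fill=0, font=font)
    return page

page = render_dense_page("People see text. " * 60)
page.save("dense_text_page.png")                     # one image, hundreds of characters
```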
5. Toward the Next Generation of Vision-centric MLLMs
- Text is visual, not merely symbolic.
- Vision is language, not separate.
- Compression is perception, not just engineering.
The ultimate goal is a model that reads, writes, and sees text the way humans do: through vision.
People see text. Soon, LLMs & LVMs will too.