Chart2Code

From charts to code: a hierarchical benchmark for multimodal models

1 CSU-JPG, Central South University; 2 National University of Singapore; 3 Nanyang Technological University

What's new with the Chart2Code benchmark

TL;DR: We introduce Chart2Code, a new benchmark designed to evaluate the chart generation capabilities of LMMs under progressively challenging conditions. The benchmark comprises five tasks organized into three levels of increasing difficulty.


Overview

Chart2Code is a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models. It is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios while progressively increasing task difficulty.

Chart2Code Overview

To our knowledge, Chart2Code is the first hierarchical benchmark that reflects practical chart-to-code usage while systematically scaling task complexity. It consists of three levels, as illustrated in the figure above (a sketch of one possible task representation follows the list):
Level 1 (Chart Reproduction) asks models to reproduce charts from a reference figure and a user query;
Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements;
Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions.
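
To make the task structure concrete, the following minimal sketch shows one way a single benchmark task could be represented in Python; the Chart2CodeTask class and its field names are illustrative assumptions, not the benchmark's actual data schema.

from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only: the class and field names below are assumptions,
# not the official Chart2Code data format.
@dataclass
class Chart2CodeTask:
    task_id: str
    level: int                       # 1 = Chart Reproduction, 2 = Chart Editing, 3 = Long-Table to Chart Generation
    subtask: str                     # e.g. "direct_reproduction", "customized_raw_data",
                                     # "customized_figure_data", "chart_editing", "long_table"
    chart_type: str                  # one of the 22 chart types
    instruction: str                 # the user query
    reference_figure: Optional[str]  # path to the reference chart image, if the task provides one
    table_data: Optional[str]        # table data in text or figure form, if the task provides it
    ground_truth_code: str           # reference plotting code used for evaluation

def build_prompt(task: Chart2CodeTask) -> str:
    """Assemble a simple generation prompt from a task record (illustrative only)."""
    parts = [task.instruction]
    if task.table_data:
        parts.append("Table data:\n" + task.table_data)
    parts.append("Return complete, runnable matplotlib code that recreates the requested chart.")
    return "\n\n".join(parts)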

Data Statistics

Data Statistics

In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts.
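
As a rough illustration of the code-correctness side of these metrics, the snippet below checks whether a piece of generated matplotlib code executes and produces a figure; the execution rate is then the fraction of tasks that pass this check. This is a simplified sketch under assumed conventions (no sandboxing, a blunt timeout, a forced figure save), not the benchmark's official evaluation harness.

import subprocess
import sys
import tempfile
from pathlib import Path

def executes_successfully(generated_code: str, timeout_s: int = 60) -> bool:
    """Return True if the generated plotting code runs to completion and a figure can be saved.
    Simplified sketch: a real harness would sandbox execution and validate the rendered image."""
    with tempfile.TemporaryDirectory() as tmp:
        out_png = Path(tmp) / "chart.png"
        script = Path(tmp) / "snippet.py"
        # Force a non-interactive backend, run the generated code, then save the current figure.
        script.write_text(
            "import matplotlib\nmatplotlib.use('Agg')\n"
            + generated_code
            + f"\nimport matplotlib.pyplot as plt\nplt.savefig(r'{out_png}')\n"
        )
        try:
            result = subprocess.run([sys.executable, str(script)], capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and out_png.exists()

# Exec. Rate (%) over a set of generations:
# exec_rate = 100 * sum(executes_successfully(code) for code in generated_codes) / len(generated_codes)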

Benchmark Comparison

Benchmark Comparison Table

Table 1: Chart2Code is a unique benchmark featuring a more comprehensive set of tasks that better reflect real-world scenarios.

Comparison with existing chart-to-code benchmarks: Chart2Code is designed to rigorously evaluate the chart generation capabilities of LMMs under progressively challenging conditions. Its hierarchical design reflects real-world usage while steadily increasing difficulty, and its distinctions from prior benchmarks are highlighted in the table above.

Human-Model Performance Comparison

Level 1 (Chart Reproduction) results. The three column groups report Exec. Rate (%), LLM-Score, and LMM-Score on the Direct Reproduction (DR), Customized Raw Data (CRD), and Customized Figure Data (CFD) tasks, respectively.

Model | DR Exec. Rate | DR LLM-Score | DR LMM-Score | CRD Exec. Rate | CRD LLM-Score | CRD LMM-Score | CFD Exec. Rate | CFD LLM-Score | CFD LMM-Score

Proprietary
Gemini-2.5-Pro | 90.4 | 0.6286 | 0.3807 | 100 | 0.6763 | 0.2661 | 87.04 | 0.6145 | 0.2214
Claude-Sonnet-4 | 96.38 | 0.5629 | 0.2553 | 97.2 | 0.4878 | 0.236 | 88.89 | 0.5538 | 0.2273
GPT-5 | 87.48 | 0.6334 | 0.3575 | 94.4 | 0.6070 | 0.2238 | 85.19 | 0.6082 | 0.2382
Seed-1.5-VL | 85.81 | 0.5536 | 0.2341 | 97.2 | 0.6325 | 0.2662 | 65.74 | 0.5756 | 0.1962
Seed-1.6-VL | 84.70 | 0.5237 | 0.8117 | 94.4 | 0.6525 | 0.2503 | 83.96 | 0.5978 | 0.2075

Open-Source LMMs (non-thinking)
LLaVA-OV-Qwen2-7B-SI | 32.82 | 0.1820 | 0.0154 | 11.11 | 0.4225 | 0.1550 | 0 | - | -
LLaVA-OV-Qwen2-7B-OV | 11.13 | 0.2651 | 0.0376 | 5.56 | 0.4213 | 0.0825 | 0 | - | -
DeepSeek-VL-7B | 48.68 | 0.2854 | 0.0431 | 61.11 | 0.5374 | 0.1114 | 10.19 | 0.2539 | 0.0145
kimi-VL-A3B | 68.85 | 0.4409 | 0.1374 | 72.22 | 0.5887 | 0.2081 | 61.11 | 0.4641 | 0.1379
Qwen2-VL-7B | 64.39 | 0.3364 | 0.0664 | 75.00 | 0.5950 | 0.1367 | 30.56 | 0.4235 | 0.0519
Qwen2-VL-72B | 75.66 | 0.4368 | 0.1207 | 80.56 | 0.6082 | 0.1628 | 51.85 | 0.5518 | 0.1373
InternVL-2.5-8B | 66.89 | 0.3348 | 0.0723 | 80.56 | 0.5712 | 0.1183 | 37.74 | 0.5715 | 0.0568
InternVL-2.5-38B | 86.23 | 0.4577 | 0.1463 | 0 | - | - | 0 | - | -
InternVL-3-8B | 66.34 | 0.4371 | 0.1389 | 86.11 | 0.6169 | 0.1732 | 57.41 | 0.4450 | 0.1028
GLM-4V-9B | 72.18 | 0.2881 | 0.0459 | 66.67 | 0.5628 | 0.1183 | 44.74 | 0.2904 | 0.0130
Intern-VL-3.5-8B | 66.34 | 0.4371 | 0.1389 | 86.11 | 0.6169 | 0.1732 | 57.41 | 0.4450 | 0.1028
MiMo-VL-7B-RL | 37.83 | 0.5439 | 0.2316 | 69.44 | 0.6068 | 0.2421 | 41.67 | 0.4962 | 0.1407
MiMo-VL-7B-SFT | 44.65 | 0.4959 | 0.1983 | 69.44 | 0.6237 | 0.1852 | 46.30 | 0.5155 | 0.1732
Qwen2.5-VL-7B | 65.64 | 0.4197 | 0.0994 | 75.00 | 0.5952 | 0.1515 | 44.44 | 0.5952 | 0.0910
Qwen2.5-VL-72B | 65.36 | 0.5118 | 0.1893 | 100 | 0.6273 | 0.1989 | 37.96 | 0.5532 | 0.1688
Molmo-7B-D | 34.77 | 0.2164 | 0.0943 | 4.55 | 0.2400 | 0.4600 | 0.97 | 0.0500 | 0.4100
Qwen3-30B | 64.67 | 0.5293 | 0.2531 | 77.78 | 0.2546 | 0.2368 | 70.37 | 0.2412 | 0.2698

Open-Source LMMs (thinking)
MiMo-VL-7B-RL | 55.77 | 0.5261 | 0.2294 | 69.44 | 0.6053 | 0.2582 | 33.33 | 0.5807 | 0.2172
MiMo-VL-7B-SFT | 50.35 | 0.6555 | 0.2130 | 86.11 | 0.6644 | 0.2248 | 38.89 | 0.5578 | 0.1455
Qwen3-30B | 45.06 | 0.5582 | 0.2730 | 72.22 | 0.3367 | 0.3368 | 39.81 | 0.3185 | 0.2780
Level 2 (Chart Editing) results. Exec. Rate (%) is followed by the code-level metrics (Color, Grid, Layout, Legend, Visual, Data, Text, Type, LLM-Score) and the chart-level LMM-Score.

Model | Exec. Rate | Color | Grid | Layout | Legend | Visual | Data | Text | Type | LLM-Score | LMM-Score

Proprietary
Gemini-2.5-Pro | 90.30 | 0.6217 | 0.8842 | 0.9613 | 0.5093 | 0.5170 | 0.7560 | 0.6330 | 0.9636 | 0.5742 | 0.2459
Claude-Sonnet-4 | 91.19 | 0.5737 | 0.8110 | 0.9587 | 0.4714 | 0.4776 | 0.6736 | 0.5869 | 0.9563 | 0.5317 | 0.2147
GPT-5 | 90.58 | 0.5812 | 0.8467 | 0.9499 | 0.4835 | 0.4815 | 0.7047 | 0.6096 | 0.9581 | 0.5663 | 0.2506
Seed-1.5-VL | 63.17 | 0.5106 | 0.8230 | 0.9538 | 0.4408 | 0.4582 | 0.6983 | 0.7166 | 0.9400 | 0.5126 | 0.1975
Seed-1.6-VL | 72.38 | 0.5277 | 0.8013 | 0.9471 | 0.4714 | 0.4453 | 0.6884 | 0.7312 | 0.9431 | 0.5151 | 0.1863

Open-Source LMMs (non-thinking)
LLaVA-OV-Qwen2-7B-SI | 1.19 | 0.3507 | 0.6964 | 0.7833 | 0.4074 | 0.3002 | 0.5249 | 0.4871 | 0.7889 | 0.3157 | 0.0875
LLaVA-OV-Qwen2-7B-OV | 2.57 | 0.3163 | 0.6013 | 0.6863 | 0.4488 | 0.2030 | 0.5685 | 0.4928 | 0.8154 | 0.3512 | 0.0366
DeepSeek-VL-7B | 21.68 | 0.2523 | 0.6206 | 0.7350 | 0.2436 | 0.1820 | 0.4031 | 0.4538 | 0.7922 | 0.2583 | 0.0433
kimi-VL-A3B | 49.5 | 0.3901 | 0.7270 | 0.9074 | 0.3411 | 0.3196 | 0.5724 | 0.5913 | 0.9033 | 0.3701 | 0.1039
Qwen2-VL-7B | 24.95 | 0.2846 | 0.5825 | 0.7711 | 0.2723 | 0.2385 | 0.4693 | 0.4883 | 0.8141 | 0.3181 | 0.0780
Qwen2-VL-72B | 55.05 | 0.4013 | 0.7704 | 0.9044 | 0.3464 | 0.3345 | 0.6086 | 0.5744 | 0.9098 | 0.3928 | 0.1140
InternVL-2.5-8B | 21.29 | 0.3341 | 0.7002 | 0.8362 | 0.3148 | 0.2955 | 0.5421 | 0.5530 | 0.8536 | 0.3344 | 0.0869
InternVL-2.5-38B | 68.22 | 0.4544 | 0.7902 | 0.9405 | 0.4146 | 0.3745 | 0.6334 | 0.6361 | 0.9338 | 0.4311 | 0.1367
InternVL-3-8B | 4.55 | 0.3491 | 0.5914 | 0.9447 | 0.3389 | 0.3645 | 0.5561 | 0.5421 | 0.8556 | 0.3419 | 0.0943
InternVL-3-38B | 67.43 | 0.4720 | 0.7853 | 0.9410 | 0.4133 | 0.3994 | 0.6525 | 0.6538 | 0.9235 | 0.4528 | 0.1476
GLM-4V-9B | 10.69 | 0.2011 | 0.6910 | 0.7794 | 0.2357 | 0.2196 | 0.4604 | 0.5003 | 0.7472 | 0.2953 | 0.0770
Intern-VL-3.5-8B | 27.23 | 0.4015 | 0.7350 | 0.9056 | 0.3566 | 0.3718 | 0.6121 | 0.6505 | 0.8998 | 0.3964 | 0.1466
MiMo-VL-7B-RL | 20.59 | 0.4378 | 0.8462 | 0.9205 | 0.4201 | 0.4231 | 0.6505 | 0.6666 | 0.9200 | 0.4615 | 0.1573
MiMo-VL-7B-SFT | 21.88 | 0.4325 | 0.7506 | 0.8941 | 0.3823 | 0.4035 | 0.6431 | 0.6564 | 0.9405 | 0.4459 | 0.1399
Qwen2.5-VL-7B | 33.36 | 0.3524 | 0.7374 | 0.8592 | 0.3296 | 0.3302 | 0.5944 | 0.5780 | 0.8887 | 0.3603 | 0.0974
Qwen2.5-VL-72B | 71.49 | 0.5018 | 0.8229 | 0.9509 | 0.4467 | 0.4242 | 0.6673 | 0.6815 | 0.9348 | 0.4739 | 0.1684
Molmo-7B-D | 0.99 | 0.2471 | 0.8152 | 0.5636 | 0.1000 | 0.2275 | 0.3477 | 0.3082 | 0.3476 | 0.3488 | 0.1347
Qwen3-30B | 41.39 | 0.54 | 0.8174 | 0.9587 | 0.4623 | 0.4501 | 0.6911 | 0.7084 | 0.9384 | 0.3611 | 0.2257

Open-Source LMMs (thinking)
MiMo-VL-7B-RL | 27.62 | 0.5076 | 0.7560 | 0.9449 | 0.4109 | 0.4379 | 0.7006 | 0.6859 | 0.9446 | 0.4819 | 0.1737
MiMo-VL-7B-SFT | 24.16 | 0.4562 | 0.7404 | 0.9286 | 0.3686 | 0.3980 | 0.6812 | 0.6617 | 0.9385 | 0.4496 | 0.1774
Qwen3-30B | 42.38 | 0.5213 | 0.8248 | 0.9549 | 0.4718 | 0.4453 | 0.6924 | 0.7046 | 0.9403 | 0.4947 | 0.2362
Level 3 (Long-Table to Chart Generation) results. Exec. Rate (%) is followed by the code-level metrics (Color, Grid, Layout, Legend, Visual, Data, Text, Type, LLM-Score) and the figure-level LMM-Score.

Model | Exec. Rate | Color | Grid | Layout | Legend | Visual | Data | Text | Type | LLM-Score | LMM-Score

Proprietary
Gemini-2.5-Pro | 29.33 | 0.7276 | 0.9733 | 1.0000 | 0.7727 | 0.6701 | 0.7880 | 0.8291 | 0.9470 | 0.3516 | 0.0361
Claude-Sonnet-4 | 38.00 | 0.5676 | 0.7963 | 1.0000 | 0.8148 | 0.3731 | 0.5881 | 0.7175 | 0.9062 | 0.5125 | 0.007
GPT-5 | 38.00 | 0.5676 | 0.7963 | 1.0000 | 0.8148 | 0.3731 | 0.5881 | 0.7175 | 0.9062 | 0.5125 | 0.0362
Seed-1.5-VL | 18.67 | 0.7252 | 0.8929 | 1.0000 | 0.8869 | 0.5502 | 0.7182 | 0.7804 | 0.9690 | 0.0000 | 0.0611
Seed-1.6-VL | 40.00 | 0.7030 | 0.8833 | 1.0000 | 0.7972 | 0.5396 | 0.7956 | 0.8128 | 0.9244 | 0.0000 | 0.0547
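
The LLM-Score and LMM-Score columns above come from model-based judging: the LLM-score assesses the predicted code, while the LMM-score is the chart- or figure-level judgment of the rendered output. The exact judge prompts and scoring protocol are defined by the benchmark, so the snippet below is only a generic sketch of this judging pattern; the JudgeModel interface, the rubric wording, and the 0-1 parsing are assumptions for illustration.

import base64
from typing import Protocol

class JudgeModel(Protocol):
    """Placeholder interface for a multimodal judge; any LMM API could implement it."""
    def complete(self, prompt: str, images: list[str]) -> str: ...

JUDGE_RUBRIC = (
    "You are given a reference chart and a chart rendered from model-generated code. "
    "Rate how faithfully the rendered chart reproduces the reference on a scale from 0 to 1, "
    "considering data values, chart type, colors, legend, layout, and text. Reply with a single number."
)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def lmm_score(judge: JudgeModel, reference_png: str, rendered_png: str) -> float:
    """Ask the judge to compare the two charts and clamp its numeric reply to [0, 1]."""
    reply = judge.complete(JUDGE_RUBRIC, [encode_image(reference_png), encode_image(rendered_png)])
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # an unparsable reply is counted as 0 in this sketch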

Results Analysis

Correlation of model performance

Correlation of model performance (LMM-score) across the manually annotated difficulty tiers (Easy, Medium, Hard) on Levels 1, 2, and 3, respectively.
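
For readers who want to reproduce this kind of analysis on their own results, here is a hedged sketch: given per-model LMM-scores bucketed by annotated difficulty, pairwise rank correlations can be computed with SciPy. The scores in the dictionary are illustrative placeholders, not values from the paper, and Spearman correlation is just one reasonable choice of statistic.

from itertools import combinations
from scipy.stats import spearmanr

# Illustrative placeholder values: per-model LMM-scores grouped by annotated difficulty.
scores_by_difficulty = {
    "Easy":   [0.41, 0.28, 0.35, 0.19, 0.22],
    "Medium": [0.33, 0.21, 0.29, 0.14, 0.18],
    "Hard":   [0.25, 0.15, 0.21, 0.09, 0.12],
}

# Pairwise rank correlation of model performance across difficulty tiers.
for a, b in combinations(scores_by_difficulty, 2):
    rho, p_value = spearmanr(scores_by_difficulty[a], scores_by_difficulty[b])
    print(f"{a} vs {b}: Spearman rho = {rho:.3f} (p = {p_value:.3f})")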

Model generalization analysis

Left: Both proprietary and open-source models generalize well on Level 1 and Level 2 tasks when the LLM-score is used to assess the predicted code. Right: Proprietary models tend to obtain higher LMM-scores on Level 1 than on Level 2, while open-source models perform poorly on both tasks (scores below 0.5).

Performance analysis on different task cases

Analysis of model performance across different task cases, measured by LLM-score and LMM-score.

Examples of Data and Error Cases

An Example of Level 1: Direct Reproduction

Direct Reproduction Example

An Example of Level 1: Customized Text-Format Table Data

Customized Raw Data Example

An Example of Level 1: Figure-Format Table Data

Customized Figure Data Example

An Example of Level 2

Level 2 Example

An Example of Level 3

Level 3 Example

Error Cases Visualization

Error Analysis 1 Error Analysis 2

Citation


@misc{tang2025chartscodehierarchicalbenchmark,
    title={From Charts to Code: A Hierarchical Benchmark for Multimodal Models}, 
    author={Jiahao Tang and Henry Hengyuan Zhao and Lijian Wu and Yifei Tao and Dongxing Mao and Yang Wan and Jingru Tan and Min Zeng and Min Li and Alex Jinpeng Wang},
    year={2025},
    eprint={2510.17932},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2510.17932}, 
}

Acknowledgements: Thanks to Carlos & John for this webpage template. Thanks also to the SWE-bench team and their benchmark: https://www.swebench.com/multimodal.html.

Template Usage: If you would like to use this website template for your own leaderboard, please send Carlos & John an email requesting permission. If granted, please make sure to acknowledge the SWE-bench team and link to this leaderboard on the home page of the website.
