ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation

1Tsinghua University, 2Tencent AI Lab
*Equal contribution. †Corresponding author.
Teaser image.

A real-world example. LMMs assist scientists and researchers in understanding, interpreting, and creating charts while reading and writing academic papers, serving as assistants that enhance the comprehension and presentation of data in scholarly communication.

Abstract

We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering.

ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets, which represent authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics). These charts span 18 regular types and 4 advanced types, covering 201 subcategories.
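To make the task format concrete, below is a minimal sketch of a Direct Mimic-style query, assuming an OpenAI-compatible Python client and a hypothetical local chart image (chart.png). The prompt wording is illustrative, not the benchmark's official instruction.

```python
import base64
from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()

# Hypothetical input chart from a paper; ChartMimic pairs it with a textual instruction.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Generate runnable matplotlib code that reproduces this chart as closely as possible."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# The reply is expected to contain Python code, which is then executed to render a chart.
print(response.choices[0].message.content)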

Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning.
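As a rough illustration of one low-level check, the sketch below computes an F1-style overlap between element sets (here, text labels) extracted from the generated and reference charts. The element names are made up, and the metrics in the paper additionally cover layout, chart type, and color.

```python
def f1_between_sets(pred, gold):
    """F1 overlap between predicted and ground-truth element sets (e.g., text labels)."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: compare text elements extracted from the generated vs. reference chart.
pred_texts = {"Accuracy", "Epoch", "GPT-4o", "LLaVA"}
gold_texts = {"Accuracy", "Epoch", "GPT-4o", "Gemini"}
print(f1_between_sets(pred_texts, gold_texts))  # 0.75
```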

The evaluation of 3 proprietary models and 14 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4o and InternVL2-Llama3-76B achieve average scores of only 82.2 and 61.6, respectively, across the Direct Mimic and Customized Mimic tasks, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

Framework.

The pipeline of ChartMimic. We provide 4,800 human-curated (figure, instruction, code) triplets. We use ChartMimic to evaluate LMMs' proficiency in multimodal chart-to-code generation, yielding both high-level and low-level evaluation results.

Leaderboard

We evaluate 17 LMMs on ChartMimic, including 3 proprietary models and 14 open-weight models.

The ChartMimic leaderboard on the Direct Mimic task. We also report the code execution success rate (Exec. Rate) and model size (Params).

The ChartMimic leaderboard on the Customized Mimic task. We also report the code execution success rate (Exec. Rate) and model size (Params).


Please refer to our GitHub repo to add your model to the leaderboard.

Data

You can download our data directly from Hugging Face datasets. For guidance on how to access and use the data, please consult our instructions on GitHub.
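As a rough sketch, the snippet below fetches the dataset files with huggingface_hub. The repository id is an assumption on our part; see the GitHub instructions for the exact identifier and layout.

```python
from huggingface_hub import snapshot_download

# Repo id assumed to be "ChartMimic/ChartMimic"; check the GitHub instructions for the exact value.
local_dir = snapshot_download(repo_id="ChartMimic/ChartMimic", repo_type="dataset")
print(local_dir)  # local folder containing the (figure, instruction, code) triplets
```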

BibTeX

@article{yang2024chartmimic,
  title={ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation},
  author={Yang, Cheng and Shi, Chufan and Liu, Yaxin and Shui, Bo and Wang, Junjie and Jing, Mohan and Xu, Linran and Zhu, Xinyu and Li, Siheng and Zhang, Yuxiang and others},
  journal={arXiv preprint arXiv:2406.09961},
  year={2024}
}

Contact Us

If you have any inquiries about ChartMimic, feel free to reach out to us at chartmimic@gmail.com or {scf22, yangc21}@mails.tsinghua.edu.cn, or raise an issue on GitHub.