Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.

The paper was accepted to the NeurIPS 2025 Datasets and Benchmarks Track.

AstroVisBench was created by Sebastian Joseph, Syed Murtaza Husain, Stella Offner, Stéphanie Juneau, Paul Torrey, Adam Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, and Junyi Jessy Li. The benchmark draws from expert-curated Jupyter notebooks for astronomy tasks. From these, the team constructed 864 processing and visualization tasks that test a diverse set of visualizations and long-tail API use.

The researchers then prompted LLMs to generate code for these tasks, ran the code, and evaluated each task as follows:

  • Processing tasks: comparing key variable values to those from the ground truth.

  • Visualization tasks: using a VLM judge that compares a visualization’s scientific utility to that of the ground truth. The VLM judge’s ratings correlate well with those of professional astronomers, whose labels were developed collectively over hours of discussion.
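The processing-task check above can be pictured with a minimal sketch. The article does not say which variables are compared or what tolerance is used, so the function names (values_match, score_processing_task), the example variable names, and the relative tolerance below are all illustrative assumptions and not the benchmark's actual implementation.

```python
import math
from numbers import Real


def values_match(generated, reference, rel_tol=1e-6):
    """Return True if a generated value agrees with the reference value.

    Hypothetical helper: numbers are compared with a relative tolerance,
    lists and tuples element by element, everything else by equality.
    """
    # bool is a subclass of int, so handle it before the numeric branch.
    if isinstance(generated, bool) or isinstance(reference, bool):
        return generated == reference
    if isinstance(generated, Real) and isinstance(reference, Real):
        return math.isclose(generated, reference, rel_tol=rel_tol)
    if isinstance(generated, (list, tuple)) and isinstance(reference, (list, tuple)):
        return len(generated) == len(reference) and all(
            values_match(g, r, rel_tol) for g, r in zip(generated, reference)
        )
    return generated == reference


def score_processing_task(generated_vars, reference_vars, rel_tol=1e-6):
    """Fraction of reference variables reproduced by the generated code."""
    if not reference_vars:
        return 0.0
    matched = sum(
        name in generated_vars
        and values_match(generated_vars[name], reference_vars[name], rel_tol)
        for name in reference_vars
    )
    return matched / len(reference_vars)


# Made-up variables standing in for the state left behind by a notebook cell.
reference = {"flux_mean": 2.0, "wavelengths": [4000.0, 5000.0], "band": "r"}
generated = {"flux_mean": 2.0000001, "wavelengths": [4000.0, 5001.0], "band": "r"}
score = score_processing_task(generated, reference)
```

In this made-up example, two of the three variables agree with the reference, so the task would score two thirds under this assumed scheme.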

The team found that even the best LLMs struggle to execute scientific workflows. State-of-the-art models, including Gemini 2.5 Pro, Claude Opus 4, o3-mini, and QwQ, crash 30-60% of the time and produce visualizations without error in fewer than 16% of cases.

This dataset focuses on an important aspect of the scientific workflow that is achievable in the near term and aims to produce tools used by astronomers, rather than replacing or automating all of science.

AstroVisBench is the first scientific coding benchmark that evaluates whether models:

  • aid scientists in the middle of their own workflows, when the scientists do not have a step-by-step procedure and may not know in advance what scientific utility a visualization would bring.

  • have adequate long-tail knowledge, especially of domain-specific API usage and visualization generation.

  • interact with a variety of data formats to create diverse visualizations that comply with expert standards.


Learn more
Website | Paper
