Matbench Discovery: Benchmarking machine learning models for crystal stability predictions
Machine learning (ML) models have transformed various scientific disciplines by efficiently extracting patterns from complex data. Materials science is no exception, witnessing rapid advancements as ML methods increasingly augment traditional computational approaches. One critical application of ML in materials science is predicting the stability of crystalline materials, a task crucial for the accelerated discovery of new, functional inorganic crystals.
However, a systematic approach for benchmarking and evaluating the performance of these ML models has been lacking. A recent paper by Janosh Riebesell and colleagues, published in Nature Machine Intelligence, addresses this gap by introducing "Matbench Discovery"—an evaluation framework specifically designed for assessing the effectiveness of ML models in crystal stability predictions.
Context: Why benchmarking matters
In materials discovery, ML models are often employed as pre-filters for high-cost first-principles calculations such as density functional theory (DFT). While DFT remains a reliable method for predicting material properties, it is computationally expensive, motivating the use of ML models to screen large candidate pools quickly and cheaply. The effectiveness of ML, however, is determined not solely by how accurately it predicts properties like formation energy, but by how well it identifies thermodynamically stable structures: materials that lie on or close to the "convex hull" of known stable phases.
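To make the target concrete, here is a minimal sketch of how a convex hull distance is typically computed with pymatgen's PhaseDiagram; the compositions and energies are illustrative placeholders, not DFT values. An energy above hull of 0 eV/atom means the phase lies on the hull and is stable against decomposition.

    # Minimal sketch: distance to the convex hull with pymatgen.
    # All energies are illustrative placeholders, not DFT values.
    from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram
    from pymatgen.core import Composition

    # Reference phases spanning the Li-O system (PDEntry takes total energy in eV).
    entries = [
        PDEntry(Composition("Li"), 0.0),
        PDEntry(Composition("O2"), 0.0),
        PDEntry(Composition("Li2O"), -6.0),
    ]
    phase_diagram = PhaseDiagram(entries)

    # Candidate phase: a positive hull distance means it is predicted to
    # decompose into some combination of the reference phases.
    candidate = PDEntry(Composition("LiO2"), -1.5)
    print(f"{phase_diagram.get_e_above_hull(candidate):.3f} eV/atom")  # 0.500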
The authors highlight several challenges inherent to current benchmarking methods:
Retrospective versus prospective evaluation: Traditional methods often employ retrospective benchmarks, which might not represent real-world scenarios, potentially overstating model performance.
Relevant targets: Formation energies are common regression targets but do not necessarily reflect true stability; a compound with a favorable formation energy can still decompose into competing phases. The relevant measure is the distance to the convex hull.
Metrics alignment: Common regression metrics (e.g., RMSE, R²) can correlate poorly with actual decision-making, since even small prediction errors on materials near the stability threshold produce costly false positives (see the synthetic sketch after this list).
Scalability: Benchmarks often inadequately evaluate performance at realistic scales, limiting insights into model behavior in large-scale screening campaigns.
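The metrics point is easy to demonstrate with a synthetic simulation (invented numbers, not benchmark data): a model whose errors are small and unbiased by regression standards can still mislabel many of the candidates clustered near the 0 eV/atom threshold, and every such false positive is a wasted DFT follow-up.

    # Synthetic sketch: a small MAE can coexist with costly false positives.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, precision_score

    rng = np.random.default_rng(0)
    # True hull distances: most candidates sit a little above the hull.
    e_true = rng.normal(loc=0.10, scale=0.10, size=100_000)
    # Unbiased model error, small by regression standards.
    e_pred = e_true + rng.normal(scale=0.08, size=100_000)

    print(f"MAE: {mean_absolute_error(e_true, e_pred):.3f} eV/atom")
    # Classification view: stable means on or below the hull (<= 0 eV/atom).
    precision = precision_score(e_true <= 0, e_pred <= 0)
    print(f"Precision: {precision:.2f}")  # 1 - precision = fraction of wasted DFT runs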
Introducing Matbench Discovery
To address these issues, the authors propose "Matbench Discovery," an open-source, Python-based evaluation framework accompanied by a continuously updated online leaderboard. Matbench Discovery provides a realistic and scalable benchmark environment that closely mimics an actual ML-assisted discovery campaign: models receive unrelaxed structures and must predict the convex hull distance of the structure after relaxation, reflecting real-world use, where relaxed geometries are not known in advance. A minimal sketch of this screening protocol follows.
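In pseudocode terms, the protocol reduces to the hypothetical sketch below, where model stands in for any benchmarked method and maps an unrelaxed structure to a predicted post-relaxation hull distance:

    # Hypothetical sketch of the benchmark's screening protocol.
    def screen(model, unrelaxed_structures, threshold=0.0):
        """Select candidates the model predicts to be stable after relaxation.

        model: any callable mapping an unrelaxed structure to a predicted
        energy above the convex hull (eV/atom) of its DFT-relaxed form.
        """
        selected = []
        for idx, structure in enumerate(unrelaxed_structures):
            if model(structure) <= threshold:
                selected.append(idx)  # would be sent on for DFT validation
        return selected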
Key technical contributions
The authors evaluate multiple state-of-the-art ML methodologies within Matbench Discovery, including:
Random forests: These serve as baseline traditional models using engineered descriptors such as Voronoi tessellations and composition-based Magpie features. They provide a performance reference point against more sophisticated ML architectures.
Graph neural networks (GNNs): Models like CGCNN, MEGNet, and ALIGNN explicitly leverage atomic structure information to predict properties directly from crystal structures. These architectures use atomic positions and bonds to build graph representations that capture the essential spatial relationships and interactions within a crystal.
Universal interatomic potentials (UIPs): Including models like M3GNet, CHGNet, MACE, SevenNet, Orb, and EquiformerV2 + DeNS, these methods are particularly advanced in that they predict not just energies but also forces and stress tensors, allowing them to emulate full structure-relaxation paths (see the relaxation sketch after this list). UIPs thus represent a more physically informed approach, capable of better extrapolation into regions of material space not covered extensively by training data.
Bayesian optimizers (BOWSR): Bayesian methods balance exploration and exploitation by iteratively proposing candidate geometries based on posterior uncertainty. BOWSR couples this optimizer with a surrogate ML energy model (MEGNet in this benchmark) to search high-dimensional configuration spaces efficiently.
Coordinate-free methods (Wrenformer): These methods discard precise atomic coordinates, relying instead on symmetry-invariant information encoded in Wyckoff positions. This significantly reduces the search space and computational cost, making it feasible to screen vast combinatorial spaces of candidate structures systematically.
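As referenced above, the relaxation capability of UIPs maps naturally onto ASE's optimizer machinery. The sketch below uses ASE's toy EMT potential purely as a stand-in so the snippet runs without downloading model weights; in practice one would attach a UIP calculator instead (for example MACE's mace_mp() loader or CHGNet's ASE interface). FrechetCellFilter requires ASE 3.23 or newer.

    # Sketch: emulating DFT relaxation with a (stand-in) interatomic potential.
    from ase.build import bulk
    from ase.calculators.emt import EMT  # placeholder for a UIP calculator
    from ase.filters import FrechetCellFilter  # relaxes cell and positions together
    from ase.optimize import FIRE

    atoms = bulk("Cu", "fcc", a=3.9)  # deliberately strained starting cell
    atoms.calc = EMT()  # in practice: e.g. mace_mp() or CHGNet's calculator

    # Forces move the atoms and stresses deform the cell until both converge.
    FIRE(FrechetCellFilter(atoms)).run(fmax=0.05)
    print(f"Relaxed energy: {atoms.get_potential_energy():.3f} eV")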
The benchmark dataset comprises:
Training data: Derived from the latest release of the Materials Project (MP), it includes detailed DFT-calculated information—energies, forces, stresses, and magnetic moments. This comprehensive dataset enables the robust training of sophisticated models like UIPs, which explicitly learn physical interactions from force and stress data.
Test data: The Wang-Botti-Marques (WBM) dataset is employed, featuring structures generated by systematic elemental substitutions of known prototypes and subsequently relaxed via DFT calculations. This ensures that the test set captures realistic extrapolation scenarios, closely mirroring challenges encountered in real-world materials discovery.
Evaluation and metrics
Matbench Discovery employs both regression and classification metrics to thoroughly evaluate model effectiveness. Models are assessed primarily on their ability to accurately classify thermodynamic stability, defined as whether a material lies on or below the convex hull, rather than solely on regression accuracy. Key metrics include precision, recall, F1 scores, and a novel "discovery acceleration factor" (DAF), which measures how effectively a model identifies stable crystals compared to random selection. The DAF provides an intuitive understanding of how significantly an ML model can speed up the discovery process by pinpointing promising candidates more efficiently.
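As defined in the paper, the DAF is simply the hit rate among the model's picks divided by the prevalence of stable materials in the candidate pool, as in this small helper:

    # Discovery acceleration factor: precision relative to random selection.
    import numpy as np

    def discovery_acceleration_factor(true_stable, pred_stable):
        """DAF = hit rate among model picks / hit rate of random picking."""
        true_stable = np.asarray(true_stable, dtype=bool)
        pred_stable = np.asarray(pred_stable, dtype=bool)
        precision = true_stable[pred_stable].mean()
        prevalence = true_stable.mean()
        return precision / prevalence

Because precision cannot exceed 1, the DAF is capped at the inverse of the prevalence; with roughly one candidate in six being stable, as in the WBM test set, the ceiling sits just above 6, which puts the numbers below in context.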
Key findings
The benchmark results demonstrate clear advantages for UIP models. Specifically, EquiformerV2 + DeNS achieved the highest overall performance, reaching a DAF of up to 6.33× on its top-ranked predictions. UIP models consistently outperform energy-only regression methods because they emulate the relaxation path, which sharpens classification accuracy near the stability boundary.
Notably, some models with poor global regression metrics (negative R²) still yielded strong classification outcomes. This underscores that regression accuracy alone may not be a reliable indicator of practical model utility. Models capable of effectively identifying stability thresholds, despite weaker regression metrics, can still provide substantial value in real-world discovery scenarios.
Insights on model reliability
The authors introduce a novel visualization termed the "triangle of peril," which marks the region around the stability threshold where a model's average prediction error exceeds its distance to the classification boundary, making misclassifications likely. UIP models exit this danger zone at smaller distances from the hull, yielding fewer misclassifications and more trustworthy predictions for structures near the critical convex hull threshold.
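A rough sketch of the quantity behind that plot (window size invented for illustration): sort predictions by true hull distance and compute a rolling MAE; wherever that curve exceeds the distance to the 0 eV/atom boundary, average errors are large enough to flip a material's stability label.

    # Sketch: rolling MAE along the hull-distance axis (the basis of the
    # "triangle of peril" plot). Where rolling_mae > |e_true|, average
    # errors are large enough to flip stability classifications.
    import numpy as np
    import pandas as pd

    def rolling_error_curve(e_true, e_pred, window=500):
        abs_err = np.abs(np.asarray(e_pred) - np.asarray(e_true))
        df = pd.DataFrame({"e_true": e_true, "abs_err": abs_err})
        df = df.sort_values("e_true").reset_index(drop=True)
        df["rolling_mae"] = df["abs_err"].rolling(window, center=True).mean()
        return df[["e_true", "rolling_mae"]]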
Future directions and recommendations
The authors advocate expanding ML benchmarks to include additional crucial material properties, such as temperature-dependent behavior, dynamic stability (phonon spectra), and reaction pathways. These areas represent potential future frontiers for UIP models, particularly given sufficient high-quality data and ongoing training improvements. Additionally, they stress the need for generating continuous and scalable datasets at fidelity levels beyond current standards (e.g., beyond PBE-level DFT), highlighting data quality as a fundamental constraint on further advances.
Implications for materials discovery
Matbench Discovery marks a substantial advancement toward standardized, transparent evaluation of ML methods in materials science, enabling clearer methodological comparisons and better-informed decision-making for ML-assisted discovery campaigns. By highlighting UIPs as currently the best available approach and providing comprehensive benchmarking methodologies, this framework sets a robust foundation for future research and continued innovation in ML-driven materials discovery.