Why machine learning fails at mass spectrometry for small molecules
Machine learning approaches are increasingly being used to aid small-molecule structure elucidation from mass spectrometry data. Surprisingly, however, current models often fail to outperform even simple baseline methods. Here we examine why these approaches fall short and propose strategies to overcome their limitations.
The advent of machine learning (ML) has led to transformative breakthroughs in biology. For example, the development of AlphaFold[1](https://www.nature.com/articles/s42255-026-01544-6#ref-CR1 "Jumper, J. et al. Nature 596, 583–589 (2021).") has significantly advanced protein structure prediction and accelerated drug discovery. Within metabolomics, one area positioned to benefit from artificial intelligence (AI) is molecular structure elucidation for small molecules from liquid chromatography–tandem mass spectrometry (LC–MS/MS). Because manual spectral interpretation is time-consuming and labour-intensive, automating this process would enable high-throughput compound identification at scale and fundamentally reshape the field.
Recognizing this potential, the community has devoted extensive effort to developing datasets and ML models for automated structure elucidation[2](https://www.nature.com/articles/s42255-026-01544-6#ref-CR2 "Böcker, S. & Dührkop, K. J. Cheminform. 8, 5 (2016)."),[3](https://www.nature.com/articles/s42255-026-01544-6#ref-CR3 "Bushuiev, R. et al. MassSpecGym: a benchmark for the discovery and identification of molecules. In 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS, 2024)."),[4](https://www.nature.com/articles/s42255-026-01544-6#ref-CR4 "Bushuiev, R. et al. Nat. Biotechnol. 44, 630–640 (2026)."),[5](https://www.nature.com/articles/s42255-026-01544-6#ref-CR5 "Goldman, S. et al. Nat. Mach. Intell. 5, 965–979 (2023)."),[6](https://www.nature.com/articles/s42255-026-01544-6#ref-CR6 "Huber, F. et al. PLOS Comput. Biol. 17, e1008724 (2021)."),[7](https://www.nature.com/articles/s42255-026-01544-6#ref-CR7 "de Jonge, N. F. et al. Metabolomics 18, 103 (2022)."),[8](https://www.nature.com/articles/s42255-026-01544-6#ref-CR8 "Ludwig, M., Fleischauer, M., Dührkop, K. Hoffmann, M. A. & Böcker, S. In Computational Methods and Data Analysis for Metabolomics (ed. Li, S.) 185–207 (Springer, 2020)."),[9](https://www.nature.com/articles/s42255-026-01544-6#ref-CR9 "Wang, F. et al. Anal. Chem. 93, 11692–11700 (2021)."),[10](https://www.nature.com/articles/s42255-026-01544-6#ref-CR10 "Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. J. Cheminform. 13, 84 (2021)."). However, past studies report that ML methods for structure elucidation fall short on this task[7](https://www.nature.com/articles/s42255-026-01544-6#ref-CR7 "de Jonge, N. F. et al. Metabolomics 18, 103 (2022)."),[11](https://www.nature.com/articles/s42255-026-01544-6#ref-CR11 "Kretschmer, F., Seipp, J., Ludwig, M., Klau, G. W. & Böcker, S. Nat. Commun. 16, 554 (2025)."). 
Given continual advances in general AI methodology and access to large spectral datasets, these findings are puzzling. As such, advancing structure elucidation requires understanding the failure modes of current technologies. We begin by summarizing how current methods work and evaluating their performance.
Most current AI approaches for this task share a common two-step pipeline (Fig. 1a). Given an experimental spectrum, an ML model predicts a molecular fingerprint, which is then used to query molecular databases such as PubChem[12](https://www.nature.com/articles/s42255-026-01544-6#ref-CR12 "Kim, S. et al. Nucleic Acids Res. 51, D1373–D1380 (2023).") to retrieve candidate molecules. This approach can potentially identify compounds that have never been previously characterized by mass spectrometry, as it can in principle map any spectrum to a molecular fingerprint. This workflow is a major departure from traditional pre-AI methods, which rely on spectral library matching and can only identify molecules whose spectra already exist and are labelled in reference databases. From an algorithmic perspective, this mapping approach is analogous to machine translation in natural language processing (NLP), whereby a source sentence is encoded into a vector (also known as an embedding) and then decoded into its translation in the target language. In this analogy, spectral peaks serve as ‘words’, collectively forming a ‘sentence’ that is ‘translated’ into a molecular fingerprint. As in machine translation, the ML algorithm is trained on a dataset of experimental spectra paired with molecular fingerprints. Commonly used datasets include NPLIB1 (ref. [13](https://www.nature.com/articles/s42255-026-01544-6#ref-CR13 "Dührkop, K. et al. Nat. Biotechnol. 39, 462–471 (2021).")), MassSpecGym[3](https://www.nature.com/articles/s42255-026-01544-6#ref-CR3 "Bushuiev, R. et al. MassSpecGym: a benchmark for the discovery and identification of molecules. In 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS, 2024).") and the NIST 2023 LC–MS/MS dataset[14](https://www.nature.com/articles/s42255-026-01544-6#ref-CR14 "National Institute of Standards and Technology. 2023. NIST/EPA/NIH Mass Spectral Library, Version 2023. https://www.nist.gov/programs-projects/electron-ionization-library-component-nistepanih-mass-spectral-library-and-nist-gc (2023).").
Fig. 1: Pipeline for chemical structure elucidation using ML models.
a, Step 1: Given an LC–MS/MS spectrum, the model first encodes it into a vector embedding. This embedding is then used by a neural network to predict the molecular fingerprint of the molecule that generated the spectrum. Step 2: The predicted fingerprint is compared against the fingerprints of known molecules in public databases such as PubChem. Candidate molecules are ranked on the basis of their similarity to the predicted fingerprint. The structure of the molecule with the highest similarity score is then assigned as the label for the spectrum. b, Matching performance measured by the binary Jaccard score using Morgan fingerprints (4,096 bits, radius = 2) for all models under different dataset splitting strategies. † indicates zero-shot evaluation (no task-specific training). For the nearest-neighbour and DreaMS baselines, predictions are obtained via top-1 retrieval of the most similar training example. Bolded records indicate the best-performing model.
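Step 2 of the pipeline in Fig. 1a, ranking database candidates against a predicted fingerprint, can be sketched as follows. This is a minimal illustration and not the code of any benchmarked model: fingerprints are represented as sets of on-bit indices, and the candidate names and bit values are made up for the example.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Binary Jaccard (Tanimoto) similarity between two sets of on-bit indices."""
    union = a | b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(union)


def rank_candidates(
    predicted: Set[int], candidates: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Return candidates sorted from most to least similar to the prediction."""
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical predicted fingerprint and a toy candidate database.
predicted_fp = {1, 5, 9, 12}
database = {
    "candidate_A": {1, 5, 9, 12, 30},
    "candidate_B": {1, 5, 40},
    "candidate_C": {2, 3, 4},
}

ranking = rank_candidates(predicted_fp, database)
top_name, top_score = ranking[0]  # the top-ranked structure is assigned to the spectrum
```

In practice the candidate set would come from a database query (for example, by precursor mass or formula) and the fingerprints from a cheminformatics toolkit; both are outside the scope of this sketch.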
Formulating mass spectrometry analysis as machine translation enables the use of powerful architectures developed for machine translation, especially transformer networks[15](https://www.nature.com/articles/s42255-026-01544-6#ref-CR15 "Vaswani, A et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (NIPS, 2017)."). Their strength lies in building a vector space for representing text meaning: they encode words and sentences into embeddings such that semantically similar inputs are geometrically close. The same mechanism enables generation of molecular fingerprints from spectral embeddings. Given the strong performance of transformer networks across many domains, framing spectrum-to-fingerprint prediction as a translation task holds significant promise for the LC–MS/MS setting.
However, treating peaks and spectra naively as words and sentences is suboptimal. While words may have several meanings, the ambiguity of fragment mass peaks is significantly higher as a single peak can correspond to a much larger number of possible substructures. Moreover, spectra are inherently noisy. Some peaks may correspond to other components of the molecular mixture, without contributing useful structural information to the target molecule — unlike words, which generally contribute to sentence meaning.
To address these issues, ML approaches preprocess spectra into representations more amenable to the translation paradigm. Goldman et al.[5](https://www.nature.com/articles/s42255-026-01544-6#ref-CR5 "Goldman, S. et al. Nat. Mach. Intell. 5, 965–979 (2023).") map _m_/_z_ values to candidate chemical formulas, making peaks more word-like. Bushuiev et al.[4](https://www.nature.com/articles/s42255-026-01544-6#ref-CR4 "Bushuiev, R. et al. Nat. Biotechnol. 44, 630–640 (2026).") encode peaks using learnable Fourier features, enabling the model to capture higher-order relationships between masses — analogous to learnt word relationships in NLP. Another concept adopted from NLP is foundation models, which learn input representations from large amounts of raw data using auxiliary objectives (for example, masking). Bushuiev et al.[4](https://www.nature.com/articles/s42255-026-01544-6#ref-CR4 "Bushuiev, R. et al. Nat. Biotechnol. 44, 630–640 (2026).") leverage this strategy to learn spectral embeddings, which are then fine-tuned using paired spectra–fingerprint data. A detailed explanation of key ML terms used in this Comment is provided in Supplementary Table 1.
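To make the idea of Fourier features for peak masses concrete, the sketch below maps a single _m_/_z_ value to sine and cosine components at several frequencies. It is a generic illustration under stated assumptions, not the encoding used by any cited model: the frequencies in `FREQS` are fixed, hypothetical values, whereas a learnable variant would treat them as trainable parameters.

```python
import math
from typing import List

# Hypothetical frequencies (cycles per m/z unit). In a learnable variant
# these would be optimized together with the rest of the network.
FREQS = [1.0, 0.1, 0.01]


def fourier_features(mz: float, frequencies: List[float]) -> List[float]:
    """Encode one m/z value as [sin, cos] pairs, one pair per frequency."""
    features = []
    for freq in frequencies:
        angle = 2.0 * math.pi * freq * mz
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Two peaks exactly one m/z unit apart.
feats_a = fourier_features(100.05, FREQS)
feats_b = fourier_features(101.05, FREQS)
```

With a frequency of 1.0, the first sine and cosine pair depends only on the fractional part of the mass, so the two peaks above share it, while the lower frequencies separate them by nominal mass. Mixing several frequencies is what lets a model relate masses at different scales.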
As with all ML systems, the key question is whether models generalize to unseen inputs, that is, whether they can analyse molecules different from those seen in training. How can we evaluate generalization? The simplest evaluation relies on random splits, in which spectra are divided into non-overlapping training and test sets. While this approach controls for spectral overlap, the same molecule may appear in both sets as different spectra, resulting in data leakage. A more informative and practically useful evaluation uses scaffold splits, in which molecules are separated by chemical scaffold, so that training and test molecules do not share a scaffold. Another way to assess generalization and the practical utility of ML models is by comparing their performance against non-ML approaches used today. A commonly used method is a nearest-neighbour baseline, in which a test spectrum is assigned the molecular fingerprint of its closest training counterpart according to the cosine similarity metric. The performance of all ML approaches is evaluated using the binary Jaccard score between predicted and ground-truth molecular fingerprints.
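The nearest-neighbour baseline and the binary Jaccard score can be sketched in a few lines. This is a simplified illustration rather than the benchmark implementation: spectra are assumed to be already binned into a dictionary from integer _m_/_z_ bin to intensity, fingerprints are sets of on-bit indices, and all values are toy data.

```python
import math
from typing import Dict, List, Set, Tuple

Spectrum = Dict[int, float]  # hypothetical binning: integer m/z bin -> intensity


def cosine(a: Spectrum, b: Spectrum) -> float:
    """Cosine similarity between two binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Binary Jaccard score between two fingerprints given as on-bit sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def nearest_neighbour_fp(
    query: Spectrum, train: List[Tuple[Spectrum, Set[int]]]
) -> Set[int]:
    """Return the fingerprint of the training spectrum most similar to the query."""
    best_spectrum, best_fp = max(train, key=lambda item: cosine(query, item[0]))
    return best_fp


# Toy training set of (spectrum, fingerprint) pairs.
train = [
    ({100: 1.0, 150: 0.5}, {1, 2, 3}),
    ({200: 1.0, 250: 0.2}, {7, 8}),
]
test_spectrum = {100: 0.9, 150: 0.6, 300: 0.1}
true_fp = {1, 2, 4}

pred_fp = nearest_neighbour_fp(test_spectrum, train)
score = jaccard(pred_fp, true_fp)
```

The baseline involves no learning at all, which is what makes it a useful reference point: any trained model is expected to beat it once the test molecules differ from the training molecules.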
Figure 1b shows the performance of two state-of-the-art ML methods: MIST[5](https://www.nature.com/articles/s42255-026-01544-6#ref-CR5 "Goldman, S. et al. Nat. Mach. Intell. 5, 965–979 (2023)."), a fingerprint-prediction model, and DreaMS[4](https://www.nature.com/articles/s42255-026-01544-6#ref-CR4 "Bushuiev, R. et al. Nat. Biotechnol. 44, 630–640 (2026)."), a retrieval-based foundation model, together with the nearest-neighbour retrieval baseline, across three benchmark datasets. A detailed description of the task formulation, evaluation metrics and benchmarked models is provided in Supplementary Notes 1–3, while data attribution methods and additional analyses are described in Supplementary Notes 4 and 5.
The results are surprising. The performance across all ML methods is poor. On the scaffold split, nearest-neighbour outperforms MIST and nears the performance of DreaMS. DreaMS, the top model under the random split, drops sharply under the scaffold split, showing poor generalization. Even more puzzling is the modest performance on an easy setting (the random split) where there is high molecular overlap between training and test sets (51.9% for NPLIB1, 92.3% for MassSpecGym and 99.5% for NIST2023). These findings contradict the hypothesis that low performance arises primarily from insufficient training coverage[11](https://www.nature.com/articles/s42255-026-01544-6#ref-CR11 "Kretschmer, F., Seipp, J., Ludwig, M., Klau, G. W. & Böcker, S. Nat. Commun. 16, 554 (2025)."). Simply adding more data to existing models is unlikely to solve the problem.
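The difference between the two splitting strategies can be checked directly by measuring how many test spectra come from molecules that also occur in training. The sketch below uses made-up records and assumes that overlap is counted per test spectrum; the exact definition behind the percentages quoted above may differ.

```python
from typing import List, Set, Tuple

Record = Tuple[str, str, str]  # (spectrum_id, molecule_id, scaffold_id), all hypothetical

records: List[Record] = [
    ("s1", "molA", "scaf1"),
    ("s2", "molA", "scaf1"),
    ("s3", "molB", "scaf1"),
    ("s4", "molC", "scaf2"),
    ("s5", "molD", "scaf3"),
    ("s6", "molD", "scaf3"),
]


def split_by_key(
    data: List[Record], held_out: Set[str], key_index: int
) -> Tuple[List[Record], List[Record]]:
    """Put every record whose key is in `held_out` into the test set."""
    train = [r for r in data if r[key_index] not in held_out]
    test = [r for r in data if r[key_index] in held_out]
    return train, test


def molecule_overlap(train: List[Record], test: List[Record]) -> float:
    """Fraction of test spectra whose molecule also appears in training."""
    if not test:
        return 0.0
    train_molecules = {r[1] for r in train}
    return sum(r[1] in train_molecules for r in test) / len(test)


# Spectrum-level split: s2 and s6 are held out, but molA and molD stay in training.
random_train, random_test = split_by_key(records, {"s2", "s6"}, key_index=0)
# Scaffold-level split: every spectrum of scaffold scaf3 is held out together.
scaffold_train, scaffold_test = split_by_key(records, {"scaf3"}, key_index=2)

random_overlap = molecule_overlap(random_train, random_test)
scaffold_overlap = molecule_overlap(scaffold_train, scaffold_test)
```

Here the spectrum-level split leaks every test molecule into training, whereas the scaffold-level split leaks none. Deriving real scaffold identifiers would need a cheminformatics toolkit, which this sketch deliberately leaves out.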
To understand these failures, we turn to data-attribution methods, which trace model performance back to the data from which the algorithm learns. These methods identify examples that are hard for the model to reason about and that ultimately drive the performance drop. To ensure robustness, we use two complementary approaches: influence functions and learning-to-split. Influence functions identify training examples that help or hurt specific predictions. Learning-to-split instead divides the full dataset into training and test sets such that test performance is as low as possible. In effect, it selects the test examples that are hardest for the algorithm to predict. Analysing these examples gives a clearer picture of how the algorithms fail. Below we summarize the results of these analyses applied to ML algorithms for LC–MS/MS.
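For orientation, the standard first-order form of an influence function is given below. It is the textbook formulation and is assumed here only for illustration; the exact variant and approximations used in this analysis are described in the Supplementary Notes. The symbols are introduced for this sketch: $z$ is a training example, $z_{\text{test}}$ a test example, $L$ the loss, $\hat{\theta}$ the trained parameters and $n$ the number of training examples.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top}
      H_{\hat{\theta}}^{-1}\,
      \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})
```

Under this sign convention, a negative value means that upweighting $z$ during training would lower the loss on $z_{\text{test}}$ (a helpful example), and a positive value means it would raise it (a harmful example).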
Inability to generalize across experimental conditions
Both learning-to-split and influence functions flag spectra of molecules collected under different conditions as challenging. While current datasets contain data from multiple experimental settings, algorithms fail to model how changes in conditions affect the spectra.
Inability to capture peak intensity
Data attribution methods show that models struggle to distinguish molecules with similar _m_/_z_ distributions but different intensity profiles. In fact, regardless of intensity, the algorithm maps spectra with similar _m_/_z_ patterns into similar vector representations, largely ignoring intensity information.
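A toy example shows why ignoring intensity is costly. The two hypothetical spectra below share the same _m_/_z_ bins but have opposite intensity profiles; a representation that depends only on which peaks are present cannot tell them apart, whereas an intensity-weighted comparison separates them clearly. This is an illustration of the failure mode, not a reproduction of any model's embeddings.

```python
import math
from typing import Dict

Spectrum = Dict[int, float]  # hypothetical binning: integer m/z bin -> relative intensity


def cosine(a: Spectrum, b: Spectrum) -> float:
    """Cosine similarity between two binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def presence_only(spectrum: Spectrum) -> Spectrum:
    """Discard intensities, keeping only which m/z bins contain a peak."""
    return {key: 1.0 for key in spectrum}


# Same peak positions, very different intensity profiles.
spec_x = {100: 1.0, 150: 0.05, 200: 0.05}
spec_y = {100: 0.05, 150: 0.05, 200: 1.0}

presence_sim = cosine(presence_only(spec_x), presence_only(spec_y))
intensity_sim = cosine(spec_x, spec_y)
```

The presence-only similarity is 1 (up to floating-point error), while the intensity-weighted similarity is close to 0.1, so any model whose embedding behaves like the first comparison will map these two spectra to nearly the same point.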
Inability to generalize to new chemical formulas
A substantial portion of hard examples contains spectra with peaks corresponding to molecular fragments unseen during training. Although some models[5](https://www.nature.com/articles/s42255-026-01544-6#ref-CR5 "Goldman, S. et al. Nat. Mach. Intell. 5, 965–979 (2023).") attempt to address this issue algorithmically, they all fail to perform well on new chemical formulas.
Are these flaws driven by data, algorithms, or both? Most likely, it is both. As seen in other ML domains, initial developments for new applications start with homogeneous datasets. For the chemical structure elucidation task, such a dataset should contain data collected under similar conditions (for example, consistent instrument type and molecule class) while avoiding unseen fragment formulas in the test set. Once models can robustly generalize under these conditions, we can gradually introduce broader chemical and experimental diversity. Complementary to dataset improvements, algorithmic advances are needed. For example, the success of AlphaFold builds on a neural architecture that explicitly incorporates domain-specific knowledge. By contrast, current mass spectrometry models largely recycle NLP architectures, which fail to capture domain-specific properties. Moreover, we need to explore alternatives to current formulations, moving beyond fingerprint-based methods. Exploring new architectures requires meaningful benchmarks that evaluate whether new models address the limitations uncovered here.
Data availability
Supporting analyses, dataset statistics and hyperparameters are reported in Supplementary Tables 2–5, and additional visual analyses are shown in Supplementary Figs. 1–3.
Code availability
The code used to generate these findings is publicly available at https://github.com/serenaklm/ML_MS_analysis.
References
- Jumper, J. et al. _Nature_ 596, 583–589 (2021).
- Böcker, S. & Dührkop, K. _J. Cheminform._ 8, 5 (2016).
- Bushuiev, R. et al. MassSpecGym: a benchmark for the discovery and identification of molecules. In _38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track_ (NeurIPS, 2024).
- Bushuiev, R. et al. _Nat. Biotechnol._ 44, 630–640 (2026).
- Goldman, S. et al. _Nat. Mach. Intell._ 5, 965–979 (2023).
- Huber, F. et al. _PLOS Comput. Biol._ 17, e1008724 (2021).
- de Jonge, N. F. et al. _Metabolomics_ 18, 103 (2022).
- Ludwig, M., Fleischauer, M., Dührkop, K., Hoffmann, M. A. & Böcker, S. In _Computational Methods and Data Analysis for Metabolomics_ (ed. Li, S.) 185–207 (Springer, 2020).
- Wang, F. et al. _Anal. Chem._ 93, 11692–11700 (2021).
- Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. _J. Cheminform._ 13, 84 (2021).
- Kretschmer, F., Seipp, J., Ludwig, M., Klau, G. W. & Böcker, S. _Nat. Commun._ 16, 554 (2025).
- Kim, S. et al. _Nucleic Acids Res._ 51, D1373–D1380 (2023).
- Dührkop, K. et al. _Nat. Biotechnol._ 39, 462–471 (2021).
- National Institute of Standards and Technology. 2023. _NIST/EPA/NIH Mass Spectral Library, Version 2023_. https://www.nist.gov/programs-projects/electron-ionization-library-component-nistepanih-mass-spectral-library-and-nist-gc (2023).
- Vaswani, A. et al. Attention is all you need. In _Proc. 31st International Conference on Neural Information Processing Systems_ (NIPS, 2017).
Acknowledgements
We thank D. Sabatini for valuable insights and feedback during the development of this manuscript. We also thank members of R.B.’s group for discussions and feedback on the manuscript. This work was supported by DSO National Laboratories, Singapore, and the MIT Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) Consortium.
Author information
Authors and Affiliations
- Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Ling Min Serena Khoo & Regina Barzilay
Authors
- Ling Min Serena Khoo
- Regina Barzilay
Corresponding author
Correspondence to Ling Min Serena Khoo.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
_Nature Metabolism_ thanks Pierre Baldi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
About this article
Cite this article
Khoo, L.M.S., Barzilay, R. Why machine learning fails at mass spectrometry for small molecules. _Nat Metab_ (2026). https://doi.org/10.1038/s42255-026-01544-6
- Published: 11 June 2026
- Version of record: 11 June 2026
- DOI: https://doi.org/10.1038/s42255-026-01544-6