Deep learning based similarity measure of mass spectrometry data.
A classical way to compare MS/MS mass spectra is to quantify their peak overlap, often using variations of the cosine similarity score. Such measures tend to work well for nearly identical spectra, i.e. cases of very high peak overlap. We recently introduced Spec2Vec, an unsupervised machine learning approach for computing spectrum similarities based on learned relationships between peaks across large training datasets [ref]. Spec2Vec-based similarity scores were observed to correlate more strongly with the actual structural similarities between the underlying compounds than classical cosine-like scores. Additional core advantages are its fast computation, which makes it possible to compare query spectra against very large libraries, and the fact that, as an unsupervised method, it can be trained on non-annotated data. However, the downside of an unsupervised approach is that it does not make use of the labels that we have for a large fraction of the training data. The training data used (MS/MS spectra from GNPS) contains SMILES/InChI annotations and hence allows molecular fingerprints to be created for quantifying structural similarities.
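
To make the two kinds of similarity mentioned above concrete, the following is a minimal sketch, not the implementation used by Spec2Vec or any specific library. It assumes spectra are given as lists of (m/z, intensity) pairs and fingerprints as sets of on-bit indices; the function names `cosine_greedy` and `tanimoto`, the 0.1 tolerance, and the toy data are all hypothetical choices for illustration. In practice, fingerprints would be derived from the SMILES/InChI annotations with a cheminformatics toolkit.

```python
import math


def cosine_greedy(spec_a, spec_b, tolerance=0.1):
    """Cosine-like peak-overlap score for spectra given as [(mz, intensity), ...]."""
    # Collect all peak pairs whose m/z values agree within the tolerance.
    candidates = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tolerance
    ]
    # Greedily keep the highest-product pairs, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product
    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return total / (norm_a * norm_b)


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score for fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Toy data (made up for illustration only).
spectrum_1 = [(100.0, 0.5), (150.0, 1.0)]
spectrum_2 = [(100.05, 0.5), (150.0, 1.0)]   # same peaks within tolerance
spectrum_3 = [(200.0, 1.0), (250.0, 0.3)]    # no shared peaks

print(cosine_greedy(spectrum_1, spectrum_2))
print(cosine_greedy(spectrum_1, spectrum_3))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

The peak-overlap score drops to zero as soon as two spectra share no peaks within the tolerance, even if the underlying compounds are structurally related, whereas the fingerprint-based score is computed from the annotated structures and is therefore only available for labeled spectra.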