You are working with the text-only light edition of "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8".

Transformation of the Data Space
Example: mass spectrometry

An example of the successful transformation of data space comes from mass spectrometry. Although this example is specialised and may be hard to understand for readers without any background in chemistry, it clearly shows the benefits of introducing knowledge into the modeling process.

A well-known feature of the mass spectra (MS) of alkyl compounds is the occurrence of regularly spaced peak groups separated by 14 mass units. These periodic peaks arise because alkyl chains break at any carbon-carbon bond with about the same probability, creating fragments whose masses differ by integral multiples of the mass of a CH2 group (m/e = 14). Substances which do not contain any alkyl groups do not, of course, exhibit these periodicities in their mass spectra. This property can be used to create a classifier that discriminates between alkyl and non-alkyl compounds.

In a first, naive approach, we could train a neural network to detect those periodicities in the mass spectra. But detecting periodicities is particularly hard for neural networks (as it is for any other method), so quite a large network is needed to accomplish this task, with all the disadvantages of large networks (e.g. overfitting). Moreover, the input vector of the network is quite large (the whole mass spectrum, typically several hundred elements per vector), which slows down the training process considerably.

Now let's introduce our knowledge of the fragmentation behavior of alkyl compounds. Since we know that alkyl compounds yield periodic mass spectra, we transform the original data space (the MS vector) into a single variable which reflects the periodic peaks in the mass spectrum: we calculate the value of the autocorrelation function at a shift of 14. This value will be high for spectra which exhibit periodic peaks at a spacing of 14 mass units, and low for spectra which do not. The problem can now easily be solved using only a one-dimensional input and a comparatively small artificial neural network (ANN).
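The calculation of this single variable can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the book: the function name, the mean-centering and the normalisation to the range -1 to 1 are assumptions made here, and both spectra are synthetic vectors invented for the demonstration (index = mass, value = intensity).

```python
import numpy as np

def autocorrelation_at_shift(spectrum, shift=14):
    """Normalised autocorrelation of an intensity vector at one shift.

    Assumption: the spectrum is mean-centred and the result is divided by
    the total sum of squares, so the value lies between -1 and 1.
    """
    x = np.asarray(spectrum, dtype=float)
    if not 0 < shift < x.size:
        raise ValueError("shift must be between 1 and len(spectrum) - 1")
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:          # flat spectrum: no periodicity to measure
        return 0.0
    # Correlate the spectrum with a copy of itself moved by `shift` masses.
    return float(np.dot(x[:-shift], x[shift:]) / denom)

n_masses = 200

# Synthetic "alkyl-like" spectrum: peaks every 14 mass units from mass 29.
alkyl_like = np.zeros(n_masses)
alkyl_like[29::14] = 1.0

# Synthetic "non-alkyl" spectrum: peaks at irregular masses, chosen so that
# no two peaks are exactly 14 mass units apart.
non_alkyl = np.zeros(n_masses)
non_alkyl[[20, 31, 47, 60, 83, 100, 121, 150]] = 1.0

r_alkyl = autocorrelation_at_shift(alkyl_like, shift=14)
r_other = autocorrelation_at_shift(non_alkyl, shift=14)
print(f"alkyl-like: {r_alkyl:.3f}   non-alkyl: {r_other:.3f}")
```

The periodic spectrum gives a value close to 1, the irregular one a value close to 0, so a simple threshold or a very small network on this one input would separate the two synthetic cases. Real spectra are noisier and the separation will be less clean.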

Of course, this concept can be extended and applied to mass spectral data in such a way that all our knowledge is coded into such new variables. By doing this, the original n-dimensional data space (in the case of mass spectrometry, n could well exceed 500) is transformed into a p-dimensional space, with p usually being much lower than n.
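The reduction from n to p dimensions can be illustrated with a short Python sketch. Everything here is hypothetical: the function names are invented, the spectra are random numbers rather than measured data, and the shifts 28 and 42 merely stand in for further knowledge-based variables, which in practice would each encode a specific piece of chemical knowledge.

```python
import numpy as np

def shift_correlation(x, shift):
    """Mean-centred, normalised autocorrelation of one spectrum at one shift."""
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        return 0.0
    return float(np.dot(x[:-shift], x[shift:]) / denom)

def to_feature_space(spectra, shifts=(14,)):
    """Map an (m, n) matrix of spectra to an (m, p) matrix, p = len(shifts).

    Each column is one derived variable; here every variable is an
    autocorrelation value, purely as a placeholder for real features.
    """
    spectra = np.asarray(spectra, dtype=float)
    return np.array([[shift_correlation(row, s) for s in shifts]
                     for row in spectra])

# Five synthetic spectra with n = 500 intensities each (random, not real data).
rng = np.random.default_rng(0)
spectra = rng.random((5, 500))

features = to_feature_space(spectra, shifts=(14, 28, 42))
print(spectra.shape, "->", features.shape)   # n = 500 columns reduced to p = 3
```

A classifier would then be trained on the feature matrix instead of the raw spectra, so its input layer needs p rather than n elements.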

The figure below shows the effect of using knowledge-based features instead of the original mass spectral data: with features, the classification task is solved considerably better.


Last Update: 2002-Nov-03