Self-organizing maps (SOMs) as an AI tool for music analysis and production: Musical timbre

3. How to describe music to a computer

3.1. Musical timbre

Looking at the spectrum is helpful in describing a sound's timbre. The spectrum states how strongly each individual frequency is present in a sound. For example, the spectrum of a plucked nylon string is shown below, together with the analysis of its sound:

Spectrum of a plucked string.
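
As a rough illustration of how such a spectrum can be obtained, the following sketch computes an amplitude spectrum with NumPy's FFT. The signal is not the recording analysed above: it is a hypothetical stand-in built from a few decaying harmonics of 220 Hz, and the function name amplitude_spectrum, the sample rate and the decay constants are assumptions chosen only for this example.

```python
import numpy as np

def amplitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, amplitudes) of a real-valued signal."""
    n = len(signal)
    # rfft keeps only the non-negative frequencies of a real signal
    amplitudes = np.abs(np.fft.rfft(signal)) / n
    frequencies = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return frequencies, amplitudes

# Hypothetical stand-in for a plucked string (assumption, not measured data):
# five harmonics of 220 Hz, higher harmonics weaker and decaying faster.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of samples
signal = sum(
    np.exp(-3 * k * t) * np.sin(2 * np.pi * 220 * k * t) / k
    for k in range(1, 6)
)

frequencies, amplitudes = amplitude_spectrum(signal, sample_rate)
strongest = frequencies[np.argmax(amplitudes)]  # frequency of the largest peak
```

With one second of audio the frequency resolution is 1 Hz, so the largest peak of this toy signal should sit at its fundamental.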

Like the raw recorded audio data, the spectrum contains a lot of information. It can be reduced because, depending on the research question, only limited ranges of the spectrum are important. It can even be reduced to a single value, for example the spectral centroid (SC):

\[ SC = \frac{\sum_{n=1}^{N} f_n \times A_n} {\sum_{n=1}^{N} A_n} \]

where f_n is the n-th of the N recorded frequencies and A_n is its amplitude. The spectral centroid is thus the amplitude-weighted mean frequency and represents the brightness of a sound: in general, a sound is perceived as brighter the higher its spectral centroid is. [2]
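
The formula can be tried out in a few lines of Python. This is a minimal sketch: the function name spectral_centroid and the two toy spectra are made up for illustration and do not come from a real recording.

```python
import numpy as np

def spectral_centroid(frequencies, amplitudes):
    """Amplitude-weighted mean frequency: sum(f_n * A_n) / sum(A_n)."""
    frequencies = np.asarray(frequencies, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    total = amplitudes.sum()
    if total == 0:
        raise ValueError("spectrum has no energy")
    return float((frequencies * amplitudes).sum() / total)

# Two toy spectra (assumed values) with the same partials but opposite weighting
freqs = [200.0, 400.0, 600.0, 800.0]
dull = spectral_centroid(freqs, [1.0, 0.5, 0.25, 0.125])    # energy in low partials
bright = spectral_centroid(freqs, [0.125, 0.25, 0.5, 1.0])  # energy in high partials
```

Both spectra contain the same frequencies; only shifting the amplitudes towards the higher partials raises the centroid, which corresponds to the brighter sound.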

More detail can be captured by adding further parameters. For instance, the spectral spread describes how widely the spectrum is distributed around the spectral centroid, which is characteristic of certain sounds. For example, whistling produces a rather narrow spectrum, but the spectral spread increases with added noise. Spectral flux describes how quickly the spectrum changes and thus captures the temporal development of a given sound.
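
Both descriptors can be sketched in Python. Several definitions are in use, so the choices here are assumptions rather than the definitions used in this course: the spread is taken as the amplitude-weighted standard deviation around the centroid, and the flux as the Euclidean distance between the amplitude spectra of consecutive time frames. The function names and toy values are hypothetical.

```python
import numpy as np

def spectral_spread(frequencies, amplitudes):
    """Amplitude-weighted standard deviation around the spectral centroid."""
    f = np.asarray(frequencies, dtype=float)
    a = np.asarray(amplitudes, dtype=float)
    centroid = (f * a).sum() / a.sum()
    return float(np.sqrt((((f - centroid) ** 2) * a).sum() / a.sum()))

def spectral_flux(frames):
    """Euclidean distance between consecutive spectra (one spectrum per row)."""
    frames = np.asarray(frames, dtype=float)
    return np.sqrt((np.diff(frames, axis=0) ** 2).sum(axis=1))

freqs = np.array([900.0, 1000.0, 1100.0])
narrow = spectral_spread(freqs, [0.0, 1.0, 0.0])  # a single pure component
noisy = spectral_spread(freqs, [0.5, 1.0, 0.5])   # energy leaks to the neighbours

# Three time frames: the spectrum is constant at first, then changes abruptly
flux = spectral_flux([[0, 1, 0], [0, 1, 0], [1, 0, 0]])
```

The pure component has zero spread, the version with energy on neighbouring frequencies has a larger one, and the flux is zero as long as the spectrum stays the same.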

Another feature that characterizes the perception of music is roughness. If two notes are played simultaneously, they may be perceived differently depending on their frequency ratio. Therefore intervals, scales and tuning have a strong impact on musical perception. If two sounds have slightly different frequencies, a periodic loudness fluctuation (beating) is perceived. Detuning the frequencies further increases the beating rate. If the beating becomes sufficiently fast, it is no longer possible to distinguish between loud and quiet points in time; the sound is perceived as rough. According to Helmholtz, this roughness reaches its maximum when the two frequencies differ by 30 to 40 Hz. Further detuning leads to the perception of two different tones [3]. This effect can be experienced systematically using the provided [to do: link] interactive online demo.
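
The beating itself can be reproduced numerically. The sketch below assumes two sine tones of equal amplitude; the function names two_tone and beat_frequency and the example frequencies are illustrative and unrelated to the online demo. It relies on the identity sin(a) + sin(b) = 2 cos((a - b)/2) sin((a + b)/2), which splits the mixture into a fast tone at the mean frequency and a slow loudness envelope that repeats |f1 - f2| times per second.

```python
import numpy as np

def two_tone(f1, f2, duration=1.0, sample_rate=8000):
    """Mix two equal-amplitude sine tones and return their loudness envelope."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    mixture = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
    # Slow envelope from the sum-to-product identity; the fast carrier
    # sin(pi * (f1 + f2) * t) oscillates inside it.
    envelope = 2 * np.abs(np.cos(np.pi * (f1 - f2) * t))
    return t, mixture, envelope

def beat_frequency(f1, f2):
    """Number of loudness maxima per second for two equal-amplitude tones."""
    return abs(f1 - f2)

# Example values (assumed): 440 Hz and 444 Hz give a slow, clearly audible beating
t, mixture, envelope = two_tone(440.0, 444.0)
beats_per_second = beat_frequency(440.0, 444.0)
```

Increasing the second frequency in this sketch raises the beating rate accordingly, up to the range that the text describes as rough.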