
Self-organizing maps (SOMs) as an AI tool for music analysis and production

Dr. Simon Linke

4. Application

4.4. Towards Syntax

When SOMs are used to produce or compose music, they must be able to recreate musical structure and syntax. While the previous applications were practical examples of how SOMs are used in the field of music, this theoretical section departs from music and focuses on words. Once understood, the approaches may be transferred to music in an individually creative way, yielding unique SOMs for specialized musical purposes.

For this experiment, every word of the introduction of a paper by Teuvo Kohonen [7] was used as an input item for a SOM. In the previous sections, we learned how to use colors as input vectors by looking at their amounts of red, green, and blue, and that music can be described by various psychoacoustic parameters. Using words as training data, however, is not straightforward.

As a first approach, each word was described by a 26-dimensional vector, with each dimension representing one letter of the alphabet. A word can then be represented by counting the number of appearances of each letter. The [to do] figure shows the resulting SOM, with 50 arbitrary words of the training data plotted on it.
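A minimal sketch of this encoding, assuming plain lowercase ASCII words; the function name letter_counts and the sample words are illustrative, not taken from the experiment:

```python
import numpy as np

def letter_counts(word: str) -> np.ndarray:
    """Map a word to a 26-dimensional vector of letter frequencies."""
    vec = np.zeros(26)
    for ch in word.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

words = ["tooth", "tough", "map"]          # illustrative training words
training_data = np.array([letter_counts(w) for w in words])
print(training_data.shape)                 # (3, 26)
```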

The words appear somewhat organized, but in a rather unintuitive way. The underlying criteria can hardly be figured out, and overall the result does not seem very helpful. This approach was probably too naive, as it did not take the letters' order of appearance into account.

One could argue that a word by itself is already a feature vector, with its letters being the individual features. The dimension is then given by the length of the word, which can be extended simply by appending empty spaces, so that all words end up with the same length. When comparing them and calculating how similar certain words are, however, [to do] Euclidean distances no longer work (they also cause some issues when dealing with notated music). Nevertheless, as already stated, mathematics offers many other ways of calculating distances.
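The following sketch illustrates the padding step and, under the assumption that letters are encoded as Unicode code points, why a Euclidean distance on such vectors is misleading (all names and sample words are illustrative):

```python
import numpy as np

words = ["tooth", "tough", "to"]
max_len = max(len(w) for w in words)
padded = [w.ljust(max_len) for w in words]     # "to" becomes "to   "

def as_codepoints(word: str) -> np.ndarray:
    return np.array([ord(ch) for ch in word], dtype=float)

# The numeric value of this distance has no linguistic meaning: it treats
# 'a' and 'b' as close and 'a' and 'z' as far apart.
d = np.linalg.norm(as_codepoints(padded[0]) - as_codepoints(padded[1]))
print(padded, d)
```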

For the comparison of words, the Hamming distance has proven to be a suitable choice. Its calculation is straightforward. Consider, for instance, the two words tooth and tough, which sound and look similar. Both words consist of five letters, but they differ in two places: if we replace the second 'o' with 'u' and the second 't' with 'g', tooth becomes tough. To calculate the Hamming distance, the number of differing letters is divided by the length of the word:

\[ \|\vec d \| = \frac{2}{5} = 0.4 \]
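In code, this normalized Hamming distance is a one-liner; the sketch below assumes the words have already been padded to equal length:

```python
def hamming(a: str, b: str) -> float:
    """Normalized Hamming distance between two equal-length strings."""
    assert len(a) == len(b), "pad the words to equal length first"
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(hamming("tooth", "tough"))  # 0.4, as calculated above
```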

If we use Hamming distances when calculating the SOM, the [to do] result appears a little more helpful. Considering, however, that the color of the dots now refers to the length of the words, word length seems to be the dominating factor when sorting the text. This is remarkable, as the length of a word was never a feature used to train the SOM; it only has an indirect influence when the Hamming distances are calculated. Nevertheless, the result is not very useful, because sorting words by their length is trivial and can easily be done without AI.
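To give an impression of how the Hamming distance slots into a SOM, here is a hedged sketch of the best-matching-unit search only. The update step for symbolic data requires a string-specific rule (e.g. a batch scheme that replaces a node's string by a generalized median of the strings mapped to it) and is omitted here; the prototype strings are hypothetical stand-ins for trained node weights:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def best_matching_unit(word, prototypes):
    """Index of the node whose prototype string is closest to the word."""
    return min(range(len(prototypes)),
               key=lambda i: hamming(word, prototypes[i]))

# Four hypothetical prototype strings standing in for trained node weights:
prototypes = ["tooth", "tough", "touch", "torch"]
print(best_matching_unit("torch", prototypes))  # 3
```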

In the next step, this model is improved by looking at the context in which the words appear. Each word is combined with its preceding and following word, i.e., three words are taken as one input item for the SOM. The resulting [to do] map now shows clear clusters of words that no longer depend on their length. Even though some regions are easier to comprehend than others, it is reasonable to assume that the clusters are now based on where the words appear in the original paper [7].
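Building these context items is simple; the sketch below uses an illustrative sentence rather than the actual training text:

```python
# Each training item is now a word together with its neighbours.
text = ("the self organizing map is a widely used "
        "unsupervised neural network model").split()

triples = [(text[i - 1], text[i], text[i + 1])
           for i in range(1, len(text) - 1)]
print(triples[0])   # ('the', 'self', 'organizing')

# Each triple can then be encoded with any representation from above,
# e.g. by concatenating the three padded words into one string and
# comparing the results with the Hamming distance.
```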

As similarities between letters and musical notes, or between words and motifs, have been extensively discussed in the past, it is easy to imagine how the approaches described above may be transferred to music. However, even if some approaches sort the data better than others, this does not guarantee that they will produce more exciting music.



SOM Emoji.


Recall the SOM that was used to sort colors: the single colors of the training data were assigned to their best-matching units (BMUs). On the trained SOM, the BMUs' colors were usually the same as the training colors; in between them, however, transitions from one color to another could be observed. A similar effect can be seen in the picture above, where a SOM was trained to sort different emojis. At the BMUs one can clearly see a distinct emoji, but in between them, interestingly blurred and mixed versions of the training data emerge.

What does this mean for maps trained with musical motifs? In these in-between regions, newly composed motifs would be found. Similarly to the [to do] sorted EDM music, one could also find an interesting path from one BMU to another; taking the in-between nodes into account, the motif would be progressively altered along the way.
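A hedged sketch of such a walk, assuming the motifs are encoded as vectors of MIDI pitches and the in-between prototypes blend the two motifs linearly (all data here is hypothetical, not taken from a trained map):

```python
import numpy as np

# Two hypothetical motifs as MIDI pitch vectors, standing in for the
# trained prototype vectors at two BMUs of the map:
motif_a = np.array([60.0, 62.0, 64.0, 65.0])
motif_b = np.array([60.0, 63.0, 67.0, 70.0])

# Five nodes on a straight path across the map, including both BMUs:
for t in np.linspace(0.0, 1.0, 5):
    node = (1 - t) * motif_a + t * motif_b   # blended in-between prototype
    print(np.round(node, 2))                 # the motif changes gradually
```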

Of course, the musical value and the novelty of the results depend strongly on the training data. Furthermore, the underlying mathematical model has a strong impact: using the Hamming distance, single notes would change to other notes of the scale when moving from one node of the SOM to another, whereas [to do] Euclidean distances would probably produce small, continuous, microtonal changes of the motifs.
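The discrete counterpart of the walk above could look like this: under a Hamming-style model, whole symbols (here, scale degrees) are replaced one at a time rather than shifted continuously (illustrative data again):

```python
# Two hypothetical motifs as sequences of scale degrees:
motif_a = ["C", "D", "E", "F"]
motif_b = ["C", "Eb", "G", "Bb"]

# A Hamming-distance walk changes one differing position per step:
current = list(motif_a)
for i, (x, y) in enumerate(zip(motif_a, motif_b)):
    if x != y:
        current[i] = y
        print(current)   # each step swaps one note for another
```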