
Self-organizing maps (SOMs) as an AI tool for music analysis and production

Website: Hamburg Open Online University
Course: MUTOR: Artificial Intelligence for Music and Multimedia

Description

Dr. Simon Linke

1. Introduction

When talking about music, we are usually biased by our musical experiences, our education and, of course, our tastes. This may lead to fruitful discussions in our daily lives, but it is a serious problem when systematically analyzing music and its perception. Artificial intelligence (AI) may be a solution. Before it is trained, an AI knows nothing about music, a little like the unbiased brain of a newborn baby. If the training data is chosen carefully, it can provide a neutral and objective view on rather emotional musical topics.

Most AI models we talk about nowadays, like ChatGPT, are so-called connectionist models. They process information using complex networks of artificial neurons. Yet while they lead to surprisingly sophisticated results, we do not yet know why they make certain decisions. They may work very well in suggesting music we like, or even composing music for us, but they usually fail when trying to understand the complex physical and psychological foundations of music.

This talk presents an alternative artificial intelligence approach called self-organizing maps (SOMs). Since they receive little attention in today's media, most people are not familiar with them. Nevertheless, depending on the use case, they have great benefits compared to other AI algorithms.

The approach of self-organizing maps differs from connectionist models. The 'map' illustrates how the neuron field organizes itself during learning by placing similar stimuli close to each other. Thus, the learning process becomes transparent and how results are derived can be analyzed. One can judge the influence of each single data parameter. Self-organizing maps are very helpful in data browsing, clustering and classification. In later sections, however, we will also discover a few examples of how these algorithms may be creatively used.

2. How does a SOM work?

Self-organizing maps (SOMs) were introduced in the early 1980s by the Finnish researcher Teuvo Kohonen [1], which is why they are also called Kohonen maps. These maps rely on unsupervised learning: once the system and its training data are defined, the map organizes itself without further supervision, revealing complex patterns and structures.


[Fig: picture of SOM]

This picture shows a SOM. It is often described as a two-dimensional slice through the human brain. Each square represents a single neuron and the colored dots show where specific data is processed. In human brains similar tasks are usually processed in the same region. Similarly, colored dots close to each other on the map refer to similar data.

Each datum that should be used to train such a map must be described as a series of single numbers, the so-called feature vector. When describing music, deriving this vector is not straightforward. Furthermore the vector may be composed of many different values. Thus, it is a high-dimensional vector. All feature vectors used to train the map describe a high-dimensional feature space. A trained Kohonen map provides a mapping from this high-dimensional input layer to a two-dimensional output layer, the so-called unit layer.

2.1. Example: sorting colors

This sounds very mathematical and complicated, so it is helpful to look at a practical example and explore how these algorithms work. On a computer, colors are usually described by their amounts of red, green, and blue. All colors thus form a three-dimensional feature space, and each color can be described by a three-dimensional vector. For example, a SOM that sorts different colors can be trained using this [to do] interactive online demo.

First, a suitable training set has to be chosen by randomly selecting colors with the buttons on the upper right; then the Kohonen map is shown. Each node of the map initially points to a random location in the feature space; in other words, a random color is assigned to each square of the map. To train the map, we identify for each item the node whose pointer is closest to the item's location. That is, we sort our training data onto the squares with the best matching colors.
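The same setup can be sketched in a few lines of Python. The following minimal sketch (using numpy; the sizes of the training set and the map are arbitrary choices, not those of the demo) creates a random set of training colors and a small map whose nodes initially point to random locations in the RGB feature space:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Training set: 20 colors, each described by a 3-dimensional
# feature vector (red, green, blue) with values between 0 and 1.
training_colors = rng.random((20, 3))

# The map: a 10 x 10 grid of nodes. Each node holds a pointer
# into the feature space, i.e. a random color to start with.
map_height, map_width = 10, 10
som = rng.random((map_height, map_width, 3))
```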

2.2. Euclidean distances

While visually finding similar colors on the map can be done rather intuitively, it can also be described in a reliable mathematical way. A feature space is shown below. For a simple visualization we can look at two axes: red and blue. Each color has its specific location and we can determine how similar two colors are by calculating their distance from one another.

[Fig: feature space]

In mathematics there are many ways to calculate such distances. One common choice, which often works well when training self-organizing maps, is the Euclidean norm. Here, the differences between corresponding parameters are squared and summed, and then the square root of this sum is taken:

\[ \| \vec{d} \| = \sqrt{(r_1-r_2)^2+(g_1-g_2)^2+(b_1-b_2)^2} \]

So, when describing colors, the differences in red, green, and blue are calculated individually. They are then squared and summed up before the square root is calculated. Thus it can be determined how similar different colors are.
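Translated into code, this is simply the formula above applied to two feature vectors; a minimal sketch (the variable names are illustrative):

```python
import numpy as np

def euclidean_distance(color_a, color_b):
    """Square the differences per component, sum them, take the root."""
    diff = np.asarray(color_a, dtype=float) - np.asarray(color_b, dtype=float)
    return np.sqrt(np.sum(diff ** 2))

# A pure red and a pure blue are far apart ...
print(euclidean_distance([1, 0, 0], [0, 0, 1]))      # ~1.41
# ... while two similar shades of red are close.
print(euclidean_distance([1, 0, 0], [0.9, 0.1, 0]))  # ~0.14
```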

2.3. Continue the training

The node of the map that has the smallest Euclidean distance to a training item is called the best matching unit (BMU). Once the BMU of a specific item is found, the map adapts to it: the training item drags the node's pointer toward its own location. This means that the pointer is modified to become a weighted mean of the item's location and the original pointer. Put more simply, the color of the BMU is slightly changed towards the color of the training item. The weighting is called the learning coefficient, and it usually decreases over time. The pointers of all neighboring nodes are modified, too: the larger the distance between a node and the winning node, the lower the learning coefficient.

We can easily visualize the result: after the first learning step the color of the map changes and the training items change their BMU. This process is repeated many times. Finally, when no changes can be detected, the training is over. Now the map has different regions assigned to different colors. In between those regions there may be smooth or sudden transitions, depending on the training data.
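Putting the pieces together, the whole training procedure might be sketched as follows. The snippet reuses the map and training colors created above; the Gaussian neighborhood and the exponential decay of learning coefficient and radius are common choices, not the only possible ones:

```python
import numpy as np

def train_som(som, training_colors, epochs=100,
              learning_rate=0.5, neighborhood_radius=3.0):
    """Small SOM training loop: find the BMU for each item and
    drag the BMU and its neighbors towards the item's color."""
    height, width, _ = som.shape
    # Grid coordinates of every node, used for the neighborhood function.
    rows, cols = np.indices((height, width))

    for epoch in range(epochs):
        # Both the learning coefficient and the radius shrink over time.
        decay = np.exp(-epoch / epochs)
        lr = learning_rate * decay
        radius = max(neighborhood_radius * decay, 0.5)

        for item in training_colors:
            # Best matching unit: node with the smallest Euclidean distance.
            distances = np.linalg.norm(som - item, axis=2)
            bmu_row, bmu_col = np.unravel_index(np.argmin(distances),
                                                distances.shape)
            # Gaussian neighborhood: nodes far from the BMU learn less.
            grid_dist_sq = (rows - bmu_row) ** 2 + (cols - bmu_col) ** 2
            influence = np.exp(-grid_dist_sq / (2 * radius ** 2))
            # Drag each node's pointer towards the training color.
            som += lr * influence[..., np.newaxis] * (item - som)
    return som

som = train_som(som, training_colors)
```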

2.4. The u-matrix

Since we trained the map using colors, visualizing the pointer of each node is straightforward, as it is a color itself. Consequently, similarities between the nodes can be directly seen. As soon as different data is chosen, however, and the dimension of the feature vector is larger than three, no direct visualization is possible anymore.

In such cases the u-matrix may be calculated. Even though the result may look rather complicated, it is computed in a straightforward way. Again, Euclidean distances are calculated; this time, however, not between the SOM nodes and the training items, but between each node and its neighboring nodes. The resulting mean value is shown on the map, ranging from black for low mean values to white for high mean values. As a result, if a node is shown in a light color, the distances to its neighboring nodes are large, while dark nodes are very similar to their neighbors.

Comparing both visualizations of the SOM reveals that regions assigned to a single color are shown in black on the u-matrix. The map becomes gray for smooth transitions between those regions, while sudden and chaotic transitions are visualized in white. Of course, a lot of detail is lost on the u-matrix: it is still known that a region is assigned to a single color, but which color remains unclear. This trade-off cannot be avoided when visualizing an entire, high-dimensional feature space with a single map.
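As a sketch, the u-matrix of the color SOM trained above can be computed node by node by averaging the Euclidean distances of each node's pointer to the pointers of its direct neighbors (an illustrative implementation; other neighborhood definitions are possible):

```python
import numpy as np

def u_matrix(som):
    """Mean Euclidean distance of every node to its direct neighbors."""
    height, width, _ = som.shape
    umat = np.zeros((height, width))
    for r in range(height):
        for c in range(width):
            neighbor_distances = []
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width:
                    neighbor_distances.append(
                        np.linalg.norm(som[r, c] - som[nr, nc]))
            umat[r, c] = np.mean(neighbor_distances)
    return umat

# Dark (low) values: the node is similar to its neighbors;
# light (high) values: large distances, i.e. a border between regions.
print(u_matrix(som).round(2))
```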

2.5. Component planes

If more detail is required, different maps are necessary. Often, component planes are used in which a different map is calculated for each feature of the training vector. Each map visualizes how a certain feature is distributed over the map.

Returning to the example of sorting colors, it can be visualized individually how strongly each color (red, green or blue) is present on the map. There may be regions where the amount of red is large while blue and green are hardly present. Therefore a separate map is plotted for each color.

Even though more data must be observed, these component planes allow a systematic analysis of the map and can be further applied to very high-dimensional feature spaces.
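For the color example, the component planes are simply the three channels of the trained map plotted separately, as in this minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

feature_names = ["red", "green", "blue"]
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for i, (ax, name) in enumerate(zip(axes, feature_names)):
    # One map per feature: how much of this feature each node contains.
    ax.imshow(som[:, :, i], cmap="gray", vmin=0, vmax=1)
    ax.set_title(f"Component plane: {name}")
    ax.axis("off")
plt.show()
```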

2.6. Using a trained SOM

Utilizing a SOM is not limited to its training data. If similar data can be described with the same feature vectors, it can also be processed by an already-trained SOM. Again, the Euclidean distances from this data to the nodes of the map can be computed and, as a result, the data gets sorted on the trained map. This reveals similarities and differences between the new data vectors but also between the new and the training data.

This works especially well if the new data is similar to the training data. Otherwise, the distances to all of the map's nodes are rather large; there is, however, always one node that is the closest. If, for example, the SOM was trained using only different shades of blue, it may distinguish them very well but might have problems distinguishing between red and green. On the other hand, if a SOM was trained with many different colors, it can roughly distinguish between all kinds of arbitrary colors but struggles to precisely separate slightly differing shades.
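Sorting new data onto the trained map works exactly like finding the BMU during training; a minimal sketch reusing the map from above:

```python
import numpy as np

def best_matching_unit(som, item):
    """Return the grid position of the node closest to the new item."""
    distances = np.linalg.norm(som - np.asarray(item, dtype=float), axis=2)
    return np.unravel_index(np.argmin(distances), distances.shape)

# A new, previously unseen orange tone gets sorted onto the trained map.
print(best_matching_unit(som, [1.0, 0.5, 0.0]))
```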

3. How to describe music to a computer

Finding suitable parameters is straightforward when using colors, but how can this be transferred to music?

Sometimes music is described as sheet music. Setting aside the fact that sheet music may lack some important parameters, like the actual sound of the instruments or their articulation, this approach is usually limited to Western classical music and cannot be transferred to modern, electronic pop music or most of the world's folk music.

Another approach could be to use the raw waveforms resulting from digital recordings of music as they occur, e.g., on a CD or as seen in modern recording software. Even though this data is very precise, it is usually a large amount (typically 44,100 to 96,000 values per second), which massively increases the computational costs when computing a SOM. But, more importantly, these raw digital waveforms lack musical meaning. Even though a SOM relying on them might sort music successfully, it is impossible to know the reasons for its decisions. Even the enormous number of possible component planes would hardly reveal any helpful information.

A much better solution is to use psychoacoustic parameters. They are directly linked to musical perception and, as such, many aspects of music and its perception can be described with a limited number of these parameters. Following are a few examples of parameters that describe musical timbre, explained in detail. There are many more, however, and a wise decision on which of them are used to train a SOM must be made according to the research question.

3.1. Musical timbre

Looking at the spectrum is helpful in describing a sound's timbre. The spectrum lists the amount of each individual frequency that is present in a sound. For example, the spectrum of a plucked nylon string is shown below, and the analysis of its sound can be found here:

[Audio: analysis of the plucked string sound]

[Fig: spectrum of a plucked string]

Similar to the raw, recorded audio data the spectrum contains a lot of information. It can be reduced because, depending on the research questions, only limited ranges of the spectrum are important. It can even be reduced to a single value, say, when looking at the spectral centroid (SC):

\[ SC = \frac{\sum_{n=0} ^{N} f_n \times A_n} {\sum_{n=0} ^{N} A_n} \]

where \(f_n\) refers to each of the \(N\) recorded frequencies and \(A_n\) is its amplitude. The spectral centroid is the mean of all frequencies, each weighted with its amplitude, and represents the brightness of a sound: in general, a sound is perceived as brighter if its spectral centroid increases [2].
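Assuming a short mono recording is available as a numpy array, the spectral centroid can be computed directly from this formula; a minimal sketch (the file name is a placeholder, not part of the original material):

```python
import numpy as np
from scipy.io import wavfile

def spectral_centroid(samples, sample_rate):
    """Amplitude-weighted mean frequency of the spectrum."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

sample_rate, samples = wavfile.read("plucked_string.wav")  # hypothetical file
if samples.ndim > 1:               # mix a stereo file down to mono
    samples = samples.mean(axis=1)
print(f"Spectral centroid: {spectral_centroid(samples, sample_rate):.1f} Hz")
```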

The amount of detail can be increased when additional parameters are added. For instance, the spectral spread describes how broad the spectrum is around the spectral centroid [to do: check], which is a characteristic of certain sounds. Whistling, for example, produces a rather narrow spectrum, while the spectral spread increases as soon as noise is added. Spectral flux describes how quickly the spectrum changes and, accordingly, the temporal development of a given sound.
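Spectral spread and spectral flux can be sketched in a similar way. Here, spread is taken as the amplitude-weighted standard deviation around the centroid, and flux as the frame-to-frame change of the magnitude spectrum; both are common working definitions and are assumptions of this sketch, not necessarily the exact definitions used in the demos later on:

```python
import numpy as np

def spectral_spread(samples, sample_rate):
    """Amplitude-weighted standard deviation around the spectral centroid."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    variance = np.sum(((freqs - centroid) ** 2) * spectrum) / np.sum(spectrum)
    return np.sqrt(variance)

def spectral_flux(samples, frame_size=2048, hop_size=512):
    """Mean change of the magnitude spectrum between consecutive frames."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size, hop_size)]
    spectra = [np.abs(np.fft.rfft(frame)) for frame in frames]
    flux = [np.sqrt(np.sum((spectra[i] - spectra[i - 1]) ** 2))
            for i in range(1, len(spectra))]
    return float(np.mean(flux))
```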

Another feature that uniquely describes the perception of music is roughness. If two notes are played simultaneously, they may be perceived differently depending on their frequency ratios; therefore intervals, scales and tuning have a strong impact on musical perception. If two sounds have slightly different frequencies, a loudness fluctuation is perceived. Detuning the frequencies further increases the frequency of this beating. If the beating becomes sufficiently fast, it is no longer possible to distinguish between loud and quiet points in time; the sound is perceived as rough. According to Helmholtz, this roughness is at its maximum when the difference between both frequencies is between 30 and 40 Hz. Further detuning leads to the perception of two different tones [3]. This effect can be systematically experienced using the provided [to do: link] interactive online demo.
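The beating effect itself is easy to reproduce: mix two sine tones and listen to what happens as their frequency difference grows. A minimal sketch that writes a few test files (the chosen frequencies and file names are illustrative):

```python
import numpy as np
from scipy.io import wavfile

def two_tone_mix(f1, f2, duration=3.0, sample_rate=44100):
    """Mix two equally loud sine tones with frequencies f1 and f2."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    signal = 0.5 * np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)
    return (signal * 32767 * 0.5).astype(np.int16)

for df in (2, 35, 100):  # slow beating, maximum roughness, two separate tones
    wavfile.write(f"beating_{df}Hz.wav", 44100, two_tone_mix(440, 440 + df))
```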

4. Application

This section shows practical examples of how psychoacoustic parameters derived from sound recordings can be used to train SOMs, and demonstrates how this approach can be extended.

4.1. Harpsichords and Grand Pianos

A SOM was trained to distinguish between the sounds of historical harpsichords, hammer pianos and modern grand pianos. It can be explored in an interactive online demo [to do: link].

This example uses timbral parameters derived from the spectrum, like the spectral spread or the spectral centroid, as input features to train the SOM. When looking at the u-matrix, a big black area in the middle of the map can be seen: the neighboring nodes are similar, and so are the items sorted into this region. A closer look reveals that these items are mainly historical hammer pianos. Clicking on the items confirms that they sound quite similar (apart from differences in tuning), especially when compared to the other items on the map.

In contrast, if the background is lighter, neighboring nodes of the map differ from each other and the assigned items may sound different even though they are sorted close to each other. This can be perceived when clicking on items sorted into the white region on the lower left, where various historical instruments can be found.

The buttons above the map change the background of the SOM and reveal the different component planes. They describe how one specific feature is distributed over the whole Kohonen map. When choosing the sound pressure level, for example, it can be seen that it is much higher in the region on the upper left. This stands to reason, as modern grand pianos can be found here, which one may easily verify by clicking on the individual items.

Another parameter that usually has a big influence on distinguishing between modern and historical pianos is the spectral flux [4]. It describes how the spectrum changes after a piano key is struck: in the beginning the sound is usually bright, and while it decays it becomes darker. This can be perceived when listening to the modern pianos' sounds. The sound of the historical pianos also changes over time, but in a different way. Hence, the instruments can be distinguished by humans as well as by AI.

While the u-matrix reveals similarities and differences between the items, looking at the component planes shows the reasons for these differences.

4.2. Gamelan

In the second example a SOM is used to analyze gamelan music. Gamelan orchestras originate in Indonesia and do not use the Western major and minor scales; they use two distinctive scales called slendro and pelog. Furthermore, gamelan music is played on non-Western instruments. Most instruments are idiophones such as gongs and metallophones, which, in contrast to Western instruments, often do not show harmonic overtone spectra. The tunings of these instruments and, consequently, the intervals of their scales are distinctive and differ slightly between orchestras. It is supposed that the interplay of the tuning and the particular inharmonic overtone spectra is crucial for the characteristic sound of Gamelan music [5].

This interplay can be explored with another [to do] interactive online demo.

In this application a recording of a gamelan tune is played, composed using the slendro scale. Its tuning can be manipulated by moving the four sliders, allowing everyone to design the tuning they prefer. One can also find buttons for directly switching to gamelan or traditional Western tunings.

People who have experimented with the online demo and listened to different tunings of the musical piece often stated they perceived the original slendro scale as the most authentic and appropriate, while Western scales, e.g. the major scale, were described as sounding rather pale or childish. This is striking, as most of the listeners had no or little experience with gamelan music, but were educated with Western music and its own scales and tunings. It cannot be assumed, however, that the interplay between scale tuning and overtone spectra is more important than musical education, as this interplay is also important in Western music. Different spectra may require different scales and tunings.

Perhaps AI, as a neutral, unbiased analysis tool, could help find a suitable ratio of tuning and spectrum for the characteristic sound of gamelan music. In the last few decades some gamelan orchestras have been founded in Germany. Can AI reveal a difference in the approach Western musicians have to Gamelan music? To answer this question two different SOMs were trained and can be explored in an [to do] interactive online demo.

Both maps are based on the same set of training data. The color indicates a specific orchestra while the shape of the markers depicts the orchestra's origin, either Indonesia [diamonds] or Germany [circles].

The SOM on the right was trained by focusing on the overtone spectra. Defined regions for every single orchestra can be easily recognized. These individual spectra may be the reason why every orchestra uses slightly different tunings. Furthermore, no systematic differences between German and Indonesian orchestras can be detected. It is therefore more likely that different tunings are used due to different instrument spectra than due to the cultural roots of the musicians.

The SOM on the left was trained using various timbral features introduced in [to do] Section 3.1. Here, it was analyzed how these features fluctuated during the musical performances. These fluctuations are related to dynamics, expression and articulation, and the different component planes reveal that the Indonesian orchestras performed in a more balanced way, while the German orchestras tended to have either strong fluctuations or no fluctuations at all. Nevertheless, it remains unclear whether this is due to musical structure and composition or whether the fluctuations arise, rather, from musical expression and dynamics.

In either case, the AI did detect differences between Indonesian and German musicians in gamelan performances, while the overtone spectra and tunings were individual to every orchestra, independent of its origin. Further research, and maybe more AI models, is needed to systematically describe the interplay of scales, tuning and overtone spectra.

4.3. Electronic Dance Music

DJs and producers of electronic dance music (EDM) often use the terms "fat" or "fatness" to describe the perception of different music tracks. Even though they usually cannot give a precise definition of these expressions, professionals in this field seem to agree on the meaning of fatness, and increasing the fatness seems to be desirable. For this reason, Lars Schmedecke investigated this topic during his Ph.D. [6].

He asked professional DJs and producers of EDM to judge the fatness of several different tracks ranging from the 1960s to 2020. The results were used to derive a mathematical expression for fatness from already-known musical parameters. As it had thus been shown that fatness is a relevant feature for describing EDM [6], Schmedecke used it, together with other psychoacoustic features, to train a SOM capable of distinguishing different genres of EDM. The results can be explored in an [to do] interactive online demo.

The chart on the right shows the experts' judgments of the fatness of different EDM tracks. As fatness is supposed to be desirable in modern EDM, it is no surprise that it has increased over the last few decades. The left side shows how the trained SOM organizes the tracks and, when looking at the associated component plane, how the calculated fatness is distributed over the map. As both charts rely on the same EDM tracks, the SOM can be directly compared with the experts' opinions. One can decide with whom to agree more, the experts or the AI, or perhaps both are wrong and one has a completely different opinion; there have always been many answers to such subjective questions.

Observing the SOM reveals that the tracks are also clustered by their colors, which refer to different EDM genres. There are small exceptions, however: pieces of music which were often a little ahead of their time and sometimes introduced new genres, and which therefore sit at the edges of the genre clusters.

Besides musical analysis, such a map can also be used in more practical ways, for instance when organizing a personal music library. Adding songs to a playlist that are close to each other on the map will result in a collection of songs that are musically similar, rather than relying on arbitrary genre and sub-genre tags.

Such a SOM may also be a great tool for DJs. Designing a DJ set that should last for the entire night requires, on the one hand, songs rich in variety while, on the other hand, avoiding sudden transitions. On a SOM one simply moves from one point to another, because similar music has been arranged close together. Sudden transitions are avoided as long as the distance between consecutive points on the map is sufficiently small. If the direction of travel is not changed too often, one can move, point after point, from one region of the map to another, resulting in a musically diverse DJ set.

Similar techniques can be used if someone in the audience requests a specific track. The DJ may be willing to fulfill those wishes but, often, this would interrupt the musical flow. Using a SOM, the DJ could find a path on the map that prevents sudden musical changes and finally leads to the requested track. With this in mind, everyone is warmly invited to use the online demo to create unique and astonishing DJ sets.
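The idea of walking across the map can also be sketched in code: given the map coordinates of the available tracks, a greedy walk always picks the nearest not-yet-played track, so consecutive songs stay musically close. The track names and positions below are purely illustrative, and the greedy strategy is only one simple way of building such a path:

```python
import numpy as np

def greedy_dj_set(track_positions, start_track):
    """Order tracks so that each step moves to the nearest remaining track."""
    remaining = dict(track_positions)   # name -> (row, col) on the SOM
    order = [start_track]
    current = np.asarray(remaining.pop(start_track), dtype=float)
    while remaining:
        names = list(remaining)
        positions = np.asarray([remaining[n] for n in names], dtype=float)
        nearest = names[int(np.argmin(np.linalg.norm(positions - current,
                                                     axis=1)))]
        order.append(nearest)
        current = np.asarray(remaining.pop(nearest), dtype=float)
    return order

tracks = {"Track A": (1, 2), "Track B": (1, 3),
          "Track C": (7, 8), "Track D": (2, 4)}
print(greedy_dj_set(tracks, "Track A"))
```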

4.4. Towards Syntax

If SOMs are to be used to produce or compose music, they must be able to recreate musical structure and syntax. While the previous applications were practical examples of how SOMs are used in the field of music, this more theoretical section departs from music and focuses on words. The approaches, once understood, may be transferred to music in an individually creative way, providing unique SOMs for specialized musical purposes.

For this experiment every word of the introduction of a paper by Teuvo Kohonen [7] was used as an input item for a SOM. In the previous sections we learned how to use colors as input vectors, by looking at their amounts of red, green and blue, and that music can be described by various psychoacoustic parameters. Using words as training data, however, is not straightforward.

As a first approach, each word was described by a 26-dimensional vector, in which each dimension represents a letter of the alphabet. A word is then represented by counting the number of appearances of each letter. The resulting SOM is shown in this [to do] figure, with 50 arbitrary words of the training data plotted on it.

The words appear somewhat organized, but in a rather unintuitive way. The underlying criteria can hardly be figured out and, generally, the result does not seem very helpful. This approach was probably too naive, as it did not take into account the order in which the letters appear.
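For reference, this naive letter-count encoding can be sketched in a few lines of Python (the example words are illustrative):

```python
import string
import numpy as np

def letter_count_vector(word):
    """26-dimensional feature vector: occurrences of each letter a-z."""
    word = word.lower()
    return np.array([word.count(letter) for letter in string.ascii_lowercase])

print(letter_count_vector("map"))      # one 'a', one 'm', one 'p'
print(letter_count_vector("mapping"))
```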

One could argue that a word by itself is already a feature vector, with the letters being the single features. The dimension is then given by the length of the longest word; shorter words are simply extended by appending spaces, so that all words have the same length. When comparing them and calculating how similar certain words are, however, [to do] Euclidean distances no longer work (they also have some issues when dealing with notated music). Nevertheless, as already stated, there are many ways of calculating distances in mathematics.

For the comparison of words, the Hamming distance has proven to be a suitable choice. Its calculation is straightforward. Consider, for instance, the two words tooth and tough, which sound and look similar. Both words consist of five letters, but they differ in two places: if we replace the second 'o' with 'u' and the second 't' with 'g', tooth becomes tough. The Hamming distance is the number of differing letters divided by the length of the word:

\[ \|\vec d \| = \frac{2}{5} = 0.4 \]
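The same calculation in code: pad the shorter word with spaces, count the positions at which the letters differ, and divide by the word length (a minimal sketch):

```python
def hamming_distance(word_a, word_b):
    """Fraction of positions at which two (space-padded) words differ."""
    length = max(len(word_a), len(word_b))
    word_a = word_a.ljust(length)
    word_b = word_b.ljust(length)
    differing = sum(a != b for a, b in zip(word_a, word_b))
    return differing / length

print(hamming_distance("tooth", "tough"))   # 2 / 5 = 0.4
```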

If we use Hamming distances when calculating a SOM, the [to do] result appears to be a little more helpful. Considering that the color of the dots now refers to the length of the words, however, word length seems to be the dominating factor when sorting the text. This is remarkable, as the length of a word was never a feature used to train the SOM; it only has an indirect influence when calculating the Hamming distances. Nevertheless, the result is not very useful, because sorting words by their length is trivial and can easily be done without AI.

In the next step, this model is improved by looking at the context in which the words appear. Each word is combined with its preceding and following words, i.e. three words are taken as input for the SOM. The resulting [to do] map now shows clear clusters of words which no longer depend on their length. Even though some regions can be better comprehended than others, it is reasonable to assume that clusters are now based on the appearance of the word in the original paper [7].

As similarities between letters and musical notes, or between words and motifs, have been extensively discussed in the past, it is easy to imagine how the approaches described above may be transferred to music. Though some approaches will sort the data better than others, this does not guarantee that they will also produce more exciting music.


[Fig: SOM trained with emojis]


Recall the SOM that was used to sort colors: the single colors of the training data were assigned to their best matching units, and on the trained SOM the BMUs' colors were usually the same as the training colors. In between, however, transitions from one color to another could be observed. A similar effect can be seen in the picture above. Here, a SOM was trained to sort different emojis. At the BMUs one can clearly see a distinct emoji, but in between them interestingly blurred and mixed versions of the training data were created.

What does this mean for maps trained with musical motifs? In these in-between regions, some newly composed motifs would be found. Similarly to the [to do] sorted EDM music, one could also find an interesting path from one BMU to another and, when taking the in-between nodes into account, the motif would be progressively altered.

Of course, the musical value and the novelty of the results depend strongly on the training data. Furthermore, the underlying mathematical model has a strong impact: using the Hamming distance, single notes would change to different notes of the scale when moving from one node of the SOM to another, while utilizing [to do] Euclidean distances could rather produce small, continuous, microtonal changes of the motifs.

4.5. Sonification

The last example of the application of SOMs turns things upside down. In the previous sections SOMs were used to analyze music and sound; now, sound is used to analyze SOMs. In [to do] section 2.4 some problems with the u-matrix were discussed. A u-matrix is used to present an overview of the entire SOM, even when high-dimensional feature spaces are investigated. As a result a lot of detail gets lost, as it is hardly possible to visualize more than three dimensions at once.

Nevertheless, in music, large numbers of parameters are perceived at the same time and, depending on one's musical training and experience, can be directly analyzed. Thus, every feature of a high-dimensional feature space can be assigned to a specific psychoacoustic parameter. These parameters should be as different as possible so that the features can be clearly distinguished. This approach can be experienced in the provided [to do] interactive online demonstration.

In this example, the first psychoacoustic parameter is chroma, which is similar to pitch: changing the chroma changes the fundamental frequency without affecting the perceived brightness. As a second parameter, roughness has been implemented, similar to the [to do] demo above. Third, by changing the prominence of higher frequencies, the sharpness of the sound can be varied. As the last parameter, a periodic loudness fluctuation has been implemented. All four parameters are applied to different features used to train the SOM. Furthermore, by moving the assigned sliders in the interactive demo, the parameters can be changed individually to get an impression of how they sound. This is especially important when using the tool for the first time, so that the approach can be learned and one can adapt to it.
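The mapping itself can be sketched as a tiny synthesizer that turns the four (normalized) features of a node into a sound: the first feature controls the fundamental frequency, the second adds a detuned partial for roughness, the third boosts higher harmonics for sharpness, and the fourth modulates the loudness. The concrete ranges and synthesis details below are illustrative assumptions, not the parameters of the original demo:

```python
import numpy as np
from scipy.io import wavfile

def sonify_node(features, duration=2.0, sample_rate=44100):
    """Turn a 4-dimensional, normalized node vector into a short sound."""
    chroma, roughness, sharpness, fluctuation = features
    t = np.arange(int(duration * sample_rate)) / sample_rate

    f0 = 110 * 2 ** (chroma * 2)          # fundamental between 110 and 440 Hz
    # A few harmonics; higher ones are boosted with increasing sharpness.
    signal = sum((sharpness ** k) * np.sin(2 * np.pi * f0 * (k + 1) * t)
                 for k in range(6))
    # Roughness: add a partial detuned by up to 35 Hz (maximum roughness).
    signal += roughness * np.sin(2 * np.pi * (f0 + 35 * roughness) * t)
    # Loudness fluctuation: slow amplitude modulation.
    signal *= 1 - 0.5 * fluctuation * (1 + np.sin(2 * np.pi * 8 * fluctuation * t)) / 2
    signal /= np.max(np.abs(signal))
    return (signal * 32767 * 0.5).astype(np.int16)

wavfile.write("node_sound.wav", 44100, sonify_node([0.3, 0.7, 0.5, 0.2]))
```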

The SOM was trained to sort different genres of techno music based on features derived from music production tools. Several black regions can be distinguished and it is known that the nodes inside these black regions are very similar. By relying solely on the u-matrix, however, it is impossible to judge similarities and differences between black areas. This can be done here, however, by simply clicking on the map and moving the mouse to different regions, which changes the sound. While the sound changes only slightly, e.g., when moving between the green dots, it changes drastically when moving away to the violet dots. The sound changes further when moving towards the blue dots at the top.

In this example, the component planes may still be analyzed individually, even though it is no longer necessary to do so, as the entire Kohonen map can be explored using one's ears instead of one's eyes. Nevertheless, when one is not used to the tool, it can be confusing to listen to the changes of all four parameters simultaneously while moving the mouse along the map. In that case, one can activate the '1-D' button while looking at the component planes; only changes related to the selected parameter will then affect the sound. This discards information, of course, but it may be helpful in learning to judge the output of the tool when one is new to it.

The [to do] demo tool also provides an extended version in which a seven-dimensional feature space can be explored. There are no differences to the previous four-dimensional version other than it being slightly more confusing and complicated. Therefore, a modulation matrix is added: each parameter can be removed to reduce complexity when the user is new to this topic or if not all parameters are equally important. Furthermore, some combinations of features and psychoacoustic parameters are perceived to be more intuitive; for example, the tempo of the training data matches well with the beating of the loudness fluctuation. Thus, the tool can be customized using the modulation matrix to allow the most intuitive user experience possible.

The focus of this study was to produce sounds that were as clearly distinguishable as possible. The results are easy to distinguish but also sound rather annoying when using the tool for a long time. It might be worthwhile to map the features of the SOM to more pleasing, musical parameters, even though they might be harder to distinguish.

5. Conclusion

After an introduction to the mathematical fundamentals of SOMs several examples were discussed. In the field of music and musicology, Kohonen maps are used to analyze music and explain specific effects. They are an excellent tool for gaining deeper insight and answering difficult questions in musicology. They also have to be handled carefully, as the right choice of training data and features will dramatically influence the result, e.g., when organizing [to do] words.

It was also shown that SOMs are helpful when producing music. They can be used to organize sounds, e.g., recordings of [to do] historical pianos and harpsichords. Furthermore, entire musical pieces can be sorted for different musical purposes, e.g., to create unique [to do] DJ sets. A lot of the creative potential of SOMs has yet to be explored. It was roughly touched on when dealing with [to do] syntax, but this can be extended further by rethinking or misusing Kohonen maps. This is not straightforward, and the fundamental theoretical and mathematical principles of SOMs must be understood well before they can be manipulated. This, certainly, is one reason why a closer look at the mathematics could not be avoided in this talk.

In going beyond the theoretical observation of SOMs, however, this article discusses and links to many interactive examples, and everyone is encouraged to try them. This is usually a much more intuitive way to get an impression of the benefits of Kohonen maps. How do the maps react? What are the differences between two items arranged at different locations on the map? It is, of course, never wrong to be skeptical: if the artificial intelligence detects differences, you can check whether you perceive them the same way, and if not, why not? In these cases it is often helpful to take a closer look at the component planes. In this fashion a new perspective can be gained, even on already-known music; it may be surprising or educational, but it is usually a rewarding moment.

Analyzing SOMs visually using the u-matrix and the component planes, however, is not intuitive. Therefore, in the final example, the advantages of interpreting the maps acoustically were discussed. Even though artificial intelligence is a great tool for analyzing or describing music, the other way around is also true: music and sound can provide a deeper understanding of the results of artificial intelligence and the underlying algorithms.

Now that you have learned the basics of SOMs and have gained first-hand experience in using them by exploring the provided links, you can form your own opinion about Kohonen maps. Are they a useful tool, or would you rather not rely on them and stick to your own creative mind? Do not forget that if SOMs are trained knowledgeably and responsibly, they may provide an objective view on the emotional topic of music, which can be very helpful, at least for critically double-checking one's own opinions.

References

[1] Kohonen, T.: Self-organizing maps. Springer, Berlin 1995.

[2] Ziemer, T.: Sound Terminology Describing Production and Perception of Sonification. arXiv:2312.00091[cs.SD] 2023.

[3] Schneider, A.: Pitch and Pitch Perception. In: Bader, R. (eds): Springer Handbook of Systematic Musicology, Springer Handbooks. Springer, Berlin, Heidelberg, 2018.

[4] Plath, N. & Bader, R.: Piano Timbre Development Analysis using Machine Learning. arXiv:2112.03214[q-bio.NC] 2021.

[5] Wendt, G. & Bader, R.: Analysis and Perception of Javanese Gamelan Tunings. In: Bader, R. (eds): Computational Phonogram Archiving, Current Research in Systematic Musicology, vol 5. Springer, Cham., 2019.

[6] Schmedecke, L.: Bass Drop! – Der Bass in der modernen, populären Tanzmusik (Bass Drop! – Bass in modern popular dance music). PhD Thesis. Hamburg: UHH. 2021.

[7] Kohonen, T.: The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480, 1990.