Organized Sound Spaces with Machine Learning

Dr. Kıvanç Tatar

2. Latent Spaces

2.1.1 Musical Agents Based On SOMs

Musical Agents Based On Self-Organizing Maps

The higher-level feature calculations in the previous section were part of a system titled Musical Agents based on Self-Organizing Maps (Tatar 2019).

A Self-Organizing Map (SOM) is a machine learning algorithm that visualizes, represents, and clusters high-dimensional input data using a simpler 2D topology. SOM topologies are typically square and contain a finite number of nodes. Each node holds a model vector with the same number of dimensions as the input data.


Fig. RGB colors organized in a latent space using Self-Organizing Maps.

SOMs organize the input data on a 2D similarity grid so that similar data points are located close to each other in the topology. SOMs also cluster the input data by assigning each input vector to the closest node, called the best matching unit (BMU). The figure above shows a SOM with 625 nodes organizing RGB colors. Training is unsupervised, but the designer sets the topology and the number of nodes in it. Each input vector is one training instance. On each training instance, the SOM first finds the BMU of the input vector, then calculates the Euclidean distance between the input vector and the BMU, and then moves the BMU toward the input vector by this distance multiplied by the learning rate. The learning rate is a user-set global parameter in the range [0, 1]; a lower learning rate gives a less adaptive, more history-dependent SOM. Depending on the neighborhood function, the SOM also updates the BMU's neighbors in the direction of the BMU's update, and the update amount decreases as a neighboring node gets further from the BMU. The BMU and its neighboring nodes therefore move closer to the input vectors on each training instance.
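To make the update rule concrete, here is a minimal NumPy sketch of SOM training, not the implementation used in the system above. It assumes a square 25×25 grid (625 nodes, as in the RGB figure), a Gaussian neighborhood function, and a fixed learning rate and neighborhood radius; practical implementations usually decay both over training.

```python
import numpy as np

def train_som(data, grid=25, epochs=20, lr=0.5, sigma=3.0):
    """Train a square SOM: grid x grid nodes, node vectors match input dims."""
    rng = np.random.default_rng(0)
    nodes = rng.random((grid, grid, data.shape[1]))   # one model vector per node
    ys, xs = np.mgrid[0:grid, 0:grid]                 # node coordinates on the 2D grid
    for _ in range(epochs):
        for x in data:
            # 1) find the best matching unit: the node closest to the input vector
            d = np.linalg.norm(nodes - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)
            # 2) Gaussian neighborhood: full update at the BMU, decaying with grid distance
            g = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
            # 3) move the BMU and its neighbors toward the input, scaled by the learning rate
            nodes += lr * g[..., None] * (x - nodes)
    return nodes

# e.g. organizing random RGB colors on a 25x25 grid, as in the figure above
som = train_som(np.random.default_rng(1).random((1000, 3)))
```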


Fig. The training process of Musical Agents based on Self-Organizing Maps.

Without going into too much detail, let's have a brief overview of how it works. In section A of the image above, an audio waveform is processed by an audio segmentation algorithm. The segmentation automatically finds the audio segments, visualized as rectangles in the image. This gives us a dataset of audio samples whose durations range from a fraction of a second to a couple of seconds. In section B, we extract a set of features from that dataset and then train a Self-Organizing Map to create a discrete latent audio space, which is the sound memory of the musical agent. The creation of this sound memory is similar to the color latent space example covered earlier. Using the discrete latent audio space, we go back to the original recording and label each audio sample with its corresponding cluster in the latent space, which gives us a symbolic representation of the audio recording (see the sketch below). Interestingly, even in the short composition by Iannis Xenakis in the image above, this approach reveals musical patterns: the composition starts with the cluster pattern (8,8,5), and (8,8,5) appears again in the middle of the composition. Given such a symbolic representation, we can apply sequence modeling algorithms from machine learning to find recurring patterns. We can also use this approach to perform music: generate a discrete latent audio space with the Self-Organizing Map and combine it with a sequence modeling algorithm, such as the factor oracle or recurrent neural networks, to build a system that reacts to other performers in real time.
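The labeling step can be sketched as follows, reusing the `som` grid from the training sketch above. Here `segment_features` is a hypothetical array with one feature vector per audio segment, in recording order, standing in for the output of the segmentation and feature-extraction steps in sections A and B.

```python
from collections import Counter
import numpy as np

def cluster_id(nodes, x):
    """Flat index of the best matching unit for one segment's feature vector."""
    return int(np.argmin(np.linalg.norm(nodes - x, axis=2)))

# segment_features: hypothetical (n_segments, n_features) array, one row per
# audio segment, ordered as the segments appear in the recording
symbols = [cluster_id(som, f) for f in segment_features]

# recurring patterns such as (8, 8, 5) surface as frequent n-grams
trigrams = Counter(zip(symbols, symbols[1:], symbols[2:]))
print(trigrams.most_common(5))
```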


Fig. The performance setup of Musical Agents based on Self-Organizing Maps.

In the image above, a performer creates an audio output in real time. This audio output is fed into the system, a musical agent. The system first carries out feature extraction and tries to locate the performer's current state in the Self-Organizing Map, that is, in the abstract discrete latent audio space. The system also keeps a history of the performance and tries to match that history to the time-series patterns in the training dataset, such as the pattern (8,8,5) mentioned earlier. For example, assume that the current state of the audio input is cluster 8. The machine then knows that, after cluster 8, there are two clusters it could play: cluster 8 again, or cluster 5. After deciding which cluster to play, choosing a sound from that cluster gives us the audio output, the reaction to the other performer (a simplified sketch of this decision step follows below).
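The actual system uses a sequence model such as the factor oracle or a recurrent neural network; as a simplified stand-in, the decision step can be illustrated with a first-order transition table built from the `symbols` sequence of the previous sketch.

```python
import random
from collections import defaultdict

def build_transitions(symbols):
    """First-order transition table: cluster -> clusters observed right after it."""
    nxt = defaultdict(list)
    for a, b in zip(symbols, symbols[1:]):
        nxt[a].append(b)
    return nxt

def choose_next(nxt, current):
    """Pick a follow-up cluster for the performer's current cluster."""
    options = nxt.get(current)
    return random.choice(options) if options else random.choice(list(nxt))

transitions = build_transitions(symbols)
# e.g. if the training data contains (8, 8, 5), cluster 8 can lead to 8 or 5
next_cluster = choose_next(transitions, current=8)
# the agent would then play a sound drawn from that cluster's audio segments
```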


The video above is an example performance of the musical agent system with a human performer, a work titled A Conversation with Artificial Intelligence (Tatar 2017). The machine output is on one channel, and the human performer is on the other. The video starts with the machine output, so the first channel you hear is generated by the musical agent.

In that example, the musical agent calculated 35 dimensions in its feature extraction, that is, 35 numbers in a vector representing an audio sample. The feature set consisted of timbre features, loudness, the duration of the audio segment, and music emotion recognition features. For the timbre features and loudness, we calculated statistics on those features: because the audio samples vary in duration, for each feature (say, loudness) we compute the mean and standard deviation of the feature across the whole audio sample, which gives two values, mean and standard deviation, to add to the feature vector. Using that feature vector, we describe or define the audio sample in order to train a machine learning model.
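A minimal sketch of this summarization step is below. The feature names and frame counts are placeholders, not the exact MASOM feature list; the point is only that variable-length per-frame tracks collapse into a fixed-length vector via mean and standard deviation, with duration appended as a single value.

```python
import numpy as np

def segment_descriptor(tracks, duration):
    """Collapse variable-length per-frame feature tracks into a fixed vector.

    tracks: dict mapping a feature name to its per-frame values; the number
    of frames varies with segment duration, so each track contributes only
    its mean and standard deviation.
    """
    vec = []
    for name in sorted(tracks):          # fixed order keeps dimensions aligned
        frames = np.asarray(tracks[name])
        vec += [frames.mean(), frames.std()]
    vec.append(duration)                 # duration enters as one value
    return np.asarray(vec)

# hypothetical per-frame tracks for one ~2 s segment (names are placeholders)
rng = np.random.default_rng(0)
tracks = {"loudness": rng.random(173), "mfcc_1": rng.random(173)}
print(segment_descriptor(tracks, duration=2.0))  # 5 values here; MASOM uses 35
```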