Organized Sound Spaces with Machine Learning

Dr. Kıvanç Tatar

2. Latent Spaces

2.1.2 Musical Agents Based On SOMs (cont'd.)


Musical Agents Based On Self-Organizing Maps (continued)




Let's see another example in the video above, in which we have a relatively small self-organizing map. In that video, audio samples have been clustered, and we can hear how that clustering worked by listening to the audio samples within each cluster.
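To make the idea concrete, here is a minimal self-organizing map written from scratch in NumPy. The "audio features" are hypothetical stand-ins (random vectors forming two groups) for real spectral features extracted from audio samples; the training loop itself is the standard SOM update, not the exact implementation used in the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "audio features": two groups of 8-dimensional vectors standing in
# for spectral features of real audio samples (hypothetical data).
features = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 8)),
    rng.normal(1.0, 0.1, size=(20, 8)),
])

# A small self-organizing map: a 4x4 grid of weight vectors.
grid_w, grid_h, dim = 4, 4, features.shape[1]
weights = rng.random((grid_w, grid_h, dim))
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)

def best_matching_unit(x):
    """Grid coordinates of the node whose weights are closest to x."""
    d = np.linalg.norm(weights - x, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d), d.shape))

# Training: pull the winning node and its grid neighbours toward each
# sample, shrinking the learning rate and neighbourhood radius over time.
for t in range(400):
    lr = 0.5 * np.exp(-t / 200)
    radius = 2.0 * np.exp(-t / 200)
    x = features[rng.integers(len(features))]
    bmu = np.array(best_matching_unit(x))
    dist2 = np.sum((coords - bmu) ** 2, axis=-1)
    influence = np.exp(-dist2 / (2 * radius ** 2))[..., None]
    weights += lr * influence * (x - weights)

# After training, samples from the same group map to nearby grid nodes,
# so listening node by node plays similar-sounding samples together.
bmus = [best_matching_unit(x) for x in features]
```

In a real system, each grid node would collect the audio samples mapped to it, which is what lets us audition a cluster by playing its members.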




There are other example systems in the literature that use a similar approach to the architecture of Musical Agents based on Self-Organizing Maps. One of those systems is AudioStellar by Garber and Ciccola (2019), which we can watch in the video above. The authors organize sounds in a 2D space, where each dot represents an audio sample.

One exciting aspect of AudioStellar is the range of user interactions it already offers. You can draw paths through the latent space, or interact with it using generative approaches such as particle simulations or swarms, to use the discrete latent audio space in a musically meaningful way.
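One simple way to picture the path interaction is as a nearest-neighbour lookup along a gesture: as the path moves through the 2D space, the closest audio sample is triggered. The sketch below uses random stand-in positions (in AudioStellar the positions come from the learned projection) and is an illustration of the idea, not the application's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D latent positions for 50 audio samples; in AudioStellar
# these coordinates would come from the trained projection.
positions = rng.random((50, 2))

def samples_along_path(path, positions):
    """For each point on the path, trigger the nearest audio sample,
    skipping immediate repeats so a sample is not retriggered."""
    order = []
    for p in path:
        idx = int(np.argmin(np.linalg.norm(positions - p, axis=1)))
        if not order or order[-1] != idx:
            order.append(idx)
    return order

# A straight-line "gesture" drawn across the latent space.
path = np.linspace([0.0, 0.0], [1.0, 1.0], 20)
playlist = samples_along_path(path, positions)
```

A particle or swarm simulation would work the same way, except that the path points are generated by the simulation rather than drawn by the user.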


Fig. The training of AudioStellar.

Let's have a look at the machine learning pipeline behind AudioStellar in the figure above. We have a dataset of audio files, which go through a feature extraction stage that the authors refer to as preprocessing. Each file is converted to mono, and a spectrogram feature called the Short-Time Fourier Transform (STFT) is calculated, so that each audio file is represented as a matrix.

From that spectrogram representation, the first machine learning algorithm is run: Principal Component Analysis (PCA). Using PCA, we can keep the main features, or the main distribution, of the original dataset while representing it in a lower-dimensional domain, such as three or two dimensions. After that, the authors use a stochastic visualization technique called t-distributed Stochastic Neighbor Embedding (t-SNE) to produce a visualization of the dataset in the 2D domain.

After running t-SNE, we can already observe clusters appearing. Yet we still do not have the exact borders of those clusters. Hence, the authors apply another machine learning approach called DBSCAN. After that pipeline, we have a 2D discrete latent audio space in which we can observe clusters, represented as colours, containing circles, which are the audio samples. We can play with that discrete latent audio space in a musically meaningful way.
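The whole pipeline can be sketched in a few lines with SciPy and scikit-learn. The "audio files" here are hypothetical sine tones at two pitches, and the DBSCAN radius is estimated from nearest-neighbour distances rather than taken from the paper; this is a sketch of the same chain of steps, not AudioStellar's actual configuration.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
sr = 8000

def tone(freq, n=2048):
    """Hypothetical stand-in for an audio file: a noisy sine tone."""
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t) + 0.05 * rng.normal(size=n)

audio = [tone(f) for f in [220] * 15 + [880] * 15]

# Preprocessing: STFT magnitude, flattened to one vector per file.
X = np.array([np.abs(stft(a, fs=sr, nperseg=256)[2]).flatten()
              for a in audio])

# 1) PCA: keep the main variance while reducing dimensionality.
X_pca = PCA(n_components=10).fit_transform(X)

# 2) t-SNE: embed the dataset into 2D for visualization/interaction.
X_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X_pca)

# 3) DBSCAN: find cluster borders in the 2D space. The radius eps is
# estimated from the median nearest-neighbour distance (an assumption,
# not a value from the paper).
nn_dist = NearestNeighbors(n_neighbors=2).fit(X_2d).kneighbors(X_2d)[0][:, 1]
labels = DBSCAN(eps=3 * np.median(nn_dist), min_samples=3).fit_predict(X_2d)
```

Each entry of `labels` assigns an audio sample to a cluster (or to noise, labelled -1), which is what gives the space its exact cluster borders and colours.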