
Organized Sound Spaces with Machine Learning

Website: Hamburg Open Online University
Course: MUTOR: Artificial Intelligence for Music and Multimedia
Book: Organized Sound Spaces with Machine Learning

Description

Dr. Kıvanç Tatar

Preface

Organized Sound Spaces with Machine Learning


In this lecture, we will look into how we can use a few different machine learning approaches to create abstract sound spaces that are organized so that similar sounds are brought closer to each other.

The topics that we will be showcasing in this lecture come in two parts. We will first give a brief introduction to the materiality of music, to understand in which musical context the machine learning algorithms are situated. We will then dive into latent audio spaces, covering both discrete and continuous approaches.

1. Materiality of Music

Materiality of Music

Before diving into machine learning approaches, it is good to talk about the materiality of music and give some background on what we mean by music and on the musical perspective that we are interested in for this lecture. First, we will go through the expansion of musical material at the beginning of the 20th century, and the developments in futurism and beyond. Then, we will look into the mid-20th century, when we started describing music as organized sound.

1.1 Expansion of Musical Material

Expansion of Musical Material – Futurism and Beyond

Let's start with the expansion of musical material. Let's go back to the beginning of the 20th century and try to imagine what music meant back then. For example, this piece by Richard Strauss was composed in 1915, and the video below is a recent performance of that symphony by the Oslo Philharmonic (2020), recorded in 2019.

At the beginning of the 20th century we had a quite concrete understanding of what music is and what kinds of sounds were musical. We had an understanding of what kinds of musical instruments we could use to make musical sounds. Of course, some composers disagreed with that, such as Luigi Russolo. Around the same time, in 1913, Luigi Russolo released his futurist manifesto (1913). The beginning of the 20th century is an interesting time to do that, because at that time our cities were quite different than today. The cities were loud with industrial noises; the industrial revolution had been under way for quite a while by then. We can imagine that the soundscape of daily life at the beginning of the 20th century was quite noisy. As Luigi Russolo writes in the Art of Noise (1913):

Let us cross a large modern capital with our ears more sensitive than our eyes. We will delight in distinguishing the eddying of water, of air or gas in metal pipes, the muttering of motors that breathe and pulse with an indisputable animality, the throbbing of valves, the bustle of pistons, the shrieks of mechanical saws, the starting of trams on the tracks, the cracking whips, the flapping of awnings and flags, we will amuse ourselves by orchestrating together in our imagination the din of rolling shop shutters, the varied hubbub of train stations, iron works, tread mills, printing presses, electrical plants, and subways.

Luigi Russolo suggests further that:

Futurist musicians should substitute for the limited variety of timbres that the orchestra possesses today the infinite variety of timbres in noises, reproduced with appropriate mechanisms.

The video below (BBC Radio 3, 2009) is an example of Luigi Russolo's futurist intonarumori. These are devices made to produce a variety of noises, performed in a musical way. We could call them noise instruments, and those instruments were proposed at the beginning of the 20th century as a way of expanding our sound palette for musical practices.


This was the expansion of our musical material.

1.2.1 Music as Organized Sound

Music as Organized Sound – Varèse

Later, towards the mid-20th century, some composers started to think about what it means to make music if we can use any sound as musical material. One of those composers was Edgard Varèse, and again, I would like to read a paragraph from an article titled The Liberation of Sound by Edgard Varèse (1966).

First of all, I should like you to consider what I believe is the best definition of music, because it is all inclusive: 'the corporealization of the intelligence that is in sound' as proposed by Hoene Wronsky. If you think about it you will realize that, unlike most dictionary definitions, which make use of such subjective terms as beauty, feelings, etc., it covers all music, Eastern or Western, past or present, including the music of our new electronic medium. Although this new music is being gradually accepted, there are still people who, while admitting that it is 'interesting,' say: 'but is it music?' It is a question I am only too familiar with. Until quite recently I used to hear it so often in regard to my own works that, as far back as the twenties, I decided to call my music 'organized sound' and myself, not a musician, but 'a worker in rhythms, frequencies and intensities.' Indeed, to stubbornly conditioned ears, anything new in music has always been called noise. But after all, what is music but organized noises? And a composer, like all artists, is an organizer of disparate elements.

Varèse opens up the idea of music, starting from the expansion of musical material and asking: how can we call something music if any sound can be musical? In his view, it is the organization that matters. Following Varèse's suggestion, other composers have also given us a more comprehensive understanding of the materiality of music:

  • any sound can be used to produce music (Russolo 1913);
  • music is organized sound (Varèse 1966);
  • relationships exist between pitch, noise, timbre, and rhythm involving multiple layers (Stockhausen 1972);
  • sounds exist in a physical 3-D space (ibid.);
  • the time scales of music span multiple levels: infinitesimal, subsample, sample, micro, sound object, meso, macro, supra, and infinite (Roads 2004).

We started from Luigi Russolo's expansion of musical material. Then we came to Varèse's understanding of music, a generalized definition of music. After that, Karlheinz Stockhausen proposed four criteria of electronic music, which he later expanded in the late 20th century. Without getting into the details of those criteria, Stockhausen emphasizes the relationships between pitch, noise, timbre, and rhythm, and how music consists of multiple musical layers. Additionally, Stockhausen emphasizes the physical 3-D space that we can use for musical composition. At the beginning of the 21st century, Curtis Roads (2004) proposed the time scales of music at multiple levels.

1.2.2 Music as Organized Sound

Music as Organized Sound – Derbyshire

Without getting too much into the details of the materiality of music, the video below exemplifies how a composer works with the materiality of music in music production. The composer, Delia Derbyshire, uses any sounds, processes them, and makes a composition out of those sound organizations. In this video, Delia Derbyshire explains what kind of musical materials she used to compose [arrange] the theme [title music] of Doctor Who.

In the video above, Derbyshire (1965) reveals how she worked with any sounds in her compositions:

The first stage in the realization of a piece of music is to construct the individual sounds that we're going to use. To do this, we can, if we like, go to these sound generators here, electronic generators, and we'll listen to three of the basic electronic sounds. First is the simplest sound, which is a sine wave. Particularly on the oscilloscope, it has a very simple form and it has a very pure sound. Now listen to the same note, but with a different quality. This is a square wave. It's very square on the picture and it's perhaps rather harsh to listen to; this is because it has a lot of high harmonics, and that's what gives the corners on the picture. A more complex sound still is white noise. But those basic sounds aren't really interesting in their raw state like this; to make them of value for a musical piece, we have to shape them and mould them. But using all of these, we can build up almost any sound we can possibly imagine. We spend quite a lot of time trying to invent new sounds, I mean, sounds that don't exist already, sounds that can't be produced by musical instruments. But we don't always go to electronic sound generators for our basic sources of sound. If the sound we want exists already in real life, then we can go and record it. The sound I want for the rhythm of this piece needs to be a very short, dry, hollow wooden sound, which I can get from this. And then the sound for the punctuating chords: I want the sound of a short string being plucked. That's at the speed we recorded it in the studio. We can get the lower sounds we need for the rhythm by slowing down the tape, and the higher sounds by speeding up the tape. These particular pitches we can record on this machine here. And then all we have to do is cut the notes to the right length. We can join them together on a loop and listen to them.

In that video, Delia Derbyshire showed us how she was selecting different kinds of audio, and how they relate to each other: for example, how does a sine wave sound? How does a square wave sound? After selecting the musical material, we observed how Delia Derbyshire composes by organizing the selected material in time. Like many other composers, Delia Derbyshire (1965) organized musical material consisting of any sound to compose [arrange] the Doctor Who theme [title music] using that material:

And then all we have to do is cut the notes to the right length. We can join them together on a loop and listen to them. And then with the higher notes of the rhythm, again, we cut them down to a loop and play it in synchronization with the first tape. And over this we can play the sound of the plucked string, which can be either in the form of a loop like this, or in the form of a band on the tape.

edu sharing object


This documentary from the BBC is a great example of how a composer works with latent audio spaces and the temporal organization of sound (figure 1). The latent audio spaces are rooted in our understanding, perception, comprehension, and conceptualization of audio similarity and dissimilarity. It also showcases how we organize sounds in time so that we come up with musical form. Whether it is a performance or a composition, we can map both of those musical organizations to machine learning approaches. For example, to organize sounds in an abstract space of similarity and dissimilarity, we can use the notion of latent space in machine learning and use various machine learning approaches to create those latent spaces. On the other hand, to organize sounds in time, we can use a variety of sequence modeling approaches or machine learning approaches for time series data.

2. Latent Spaces

Latent Spaces in Machine Learning

But what is a latent space? Let's look at an example: a latent space of colours, where we define a colour as three values of RGB: red, green, and blue. Let's think about a way of organizing those colours on a 2D surface, and see an example in which a machine learning algorithm generates a latent space of colours:

The machine learning algorithm in the video above is called a self-organizing map, which has a predefined number of nodes that move so that the map takes the shape of the latent space; in our case, it takes the shape of the colour space towards the end of the training. We can see a variety of colours and how they relate to each other, how similar or dissimilar they are to each other.
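To make the colour example concrete, here is a minimal sketch of training a SOM on random RGB vectors. It assumes the third-party minisom package (an assumption, not the tool used in the video); any SOM implementation would work the same way.

```python
# A minimal sketch of the colour example, assuming the third-party
# "minisom" package (pip install minisom); any SOM implementation would do.
import numpy as np
from minisom import MiniSom

# 1000 random colours, each a 3-dimensional RGB vector in [0, 1].
colours = np.random.rand(1000, 3)

# A 25 x 25 grid of nodes; each node holds a 3-dimensional model vector.
som = MiniSom(25, 25, input_len=3, sigma=2.0, learning_rate=0.5)
som.random_weights_init(colours)
som.train_random(colours, num_iteration=10000)

# After training, the node weights form a 2D latent space of colours:
# neighbouring nodes hold similar RGB values, so similar colours sit
# close to each other on the grid.
latent_colour_map = som.get_weights()   # shape: (25, 25, 3)
```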

Latent Audio Spaces

Now that we have the musical perspective and background to cover latent audio spaces, we can dive into the main topic of this lecture. We will be looking into two types of approaches to latent audio spaces: discrete approaches and continuous approaches.

edu sharing object

Fig. Time scales of music by Curtis Roads (2004).

What is a discrete approach and what is a continuous approach then? Mathematically, both categories that we mention in this section are discrete approaches. However, the continuous approach works in the micro time scale, whereas the discrete approach works in the time scales of the sound object and the mesoscale. That is, in discrete latent spaces, we are organizing audio samples that range from a fraction of a second to a couple of seconds; in an abstract space, we can think of those as sound objects, such as short sound gestures. In continuous latent spaces, we are working in the micro scale, with audio windows that are relatively short, around a few milliseconds to 50 milliseconds. Thus, by putting one audio window after another, we treat an audio signal as time series data where each data point is one audio window. The continuous latent audio space consists of an organization of audio windows, where each data point represents one audio window.

2.1 Discrete Latent Audio Spaces

Discrete Latent Audio Spaces


edu sharing object


To create a discrete latent audio space, we need to do two things. First, we need to find a way to represent an audio sample, and second, we apply a machine learning approach to generate the latent space.

For the first step, we need to find a way to represent an audio sample, because we cannot work directly with the raw audio signal in many machine learning approaches. In some applications, the duration of the audio samples varies from one sample to another, yet we need to come up with a fixed-length series of numbers, that is, a feature vector, where the length of that series is the number of features. We then apply a feature extraction algorithm to create the feature vectors for all audio samples in our dataset. There are a variety of ways to extract features from an audio signal. For example, we could extract spectral features.

Fig. Spectral feature examples: Mel-frequency cepstral coefficients and a constant-Q transform (spectrograms.png).

There are quite a few different types of spectral features, and we can group them into two main categories. The first category is Fourier-transform-based approaches, in which the analysis window is the same length across different frequency bands, such as the Mel-frequency cepstral coefficients in the figure above. On the other hand, we have wavelet-based transforms, such as the constant-Q transform in the figure above, in which the analysis window length changes across frequency bands: for lower frequency bands we have a longer window, and for higher frequency bands a shorter one. While the wavelet-based spectral features are computationally heavy, they can represent relatively low frequencies while keeping latency low in the higher frequency bands.
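As a hedged illustration of these two families of spectral features, the sketch below computes Mel-frequency cepstral coefficients (Fourier-based) and a constant-Q transform (wavelet-like) with the librosa library; the file name is a placeholder.

```python
# A minimal sketch of extracting the two kinds of spectral features mentioned
# above, assuming the librosa library; the file name is a placeholder.
import numpy as np
import librosa

y, sr = librosa.load("example_sound.wav", sr=None)   # hypothetical file

# Fourier-based features: a fixed analysis window across all frequency bands.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, n_frames)

# Wavelet-like features: constant-Q transform, where the effective window
# length changes with frequency (longer for low bands, shorter for high bands).
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))  # (84, n_frames)
```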

edu sharing object

We could use spectral features and calculate the spectrogram of an audio signal, so that we could train a machine learning algorithm on those features. Alternatively, we could also use higher-level features. For example, in the figure above we see a representation of audio samples from Karlheinz Stockhausen's composition titled Kontakte. Each segment is labeled using a two-dimensional affect estimation model. In that affect estimation model, we have the dimensions of arousal and valence. Arousal refers to the eventfulness of a sound, whether it is calm or eventful. Valence refers to the pleasantness of a sound: roughly, whether it is something that could initiate negative emotions versus something that could initiate positive emotions. Our previous work on this has shown that affect rankings are consistent across a variety of people (Fan et al. 2017). Using a dataset of affect ratings on audio samples, we trained a machine learning algorithm to come up with an affect estimation model that gives us a prediction of the pleasantness and eventfulness of an audio recording.

Here, the point is that we can also use higher-level features, instead of audio spectrogram features, and train a machine learning algorithm on those features as well.

Let's see an example in the video above, in which we calculate higher-level features on a dataset of audio samples. The video first starts with a chaotic sound, then continues with a low-arousal, calm sound, then a more pleasant sound, then a more eventful and pleasant sound, and then another chaotic sound followed by a calm sound again.

2.1.1 Musical Agents Based On SOMs

Musical Agents Based On Self-Organizing Maps

Those higher-level feature calculations in the previous section were part of a system titled Musical Agents based on Self-Organizing Maps (Tatar 2019).

The Self-Organizing Map (SOM) is a Machine Learning algorithm to visualize, represent, and cluster high-dimensional input data with a simpler 2D topology. SOM topologies are typically square and include a finite number of nodes. Node vectors have the same number of dimensions as the input data.

edu sharing object

Fig. RGB colors organized in a latent space using Self-Organizing Maps.

SOMs organize the input data on a 2D similarity grid, so that similar data clusters are located closer to each other in the topology. Moreover, SOMs cluster the input data by assigning each input vector to the closest node, called the best matching unit (BMU). The figure above shows a SOM with 625 nodes organizing RGB colors. The training is unsupervised in SOMs, but designers set the topology and the number of nodes in the topology. Each input vector is a training instance of the SOM's learning. On each training instance, SOMs update their nodes using the data of an input vector. First, SOMs find the BMU of the input vector. Second, SOMs calculate the Euclidean distance between the input vector and the BMU. Third, SOMs update the BMU by this distance multiplied by the learning rate. The learning rate is a user-set global parameter in the range [0., 1.]. A lower learning rate corresponds to less adaptive and more history-dependent SOMs. Depending on the neighboring rate, a SOM also updates the neighbors of the BMU, using a neighborhood function, in the direction of the BMU's update. The update amount becomes smaller as the neighboring node is further away from the BMU. Therefore, the BMU and its neighboring nodes move closer to the input vector on each training instance.
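The numbered steps above can be condensed into a single training instance. The sketch below is a from-scratch interpretation in NumPy, not the exact implementation used in the system; the Gaussian neighborhood function and parameter values are assumptions.

```python
# A from-scratch sketch of one SOM training instance, following the steps
# described above (find the BMU, then update it and its neighbours).
import numpy as np

def som_training_step(nodes, x, learning_rate=0.1, neighbour_radius=2.0):
    """nodes: (rows, cols, dims) float array of model vectors; x: (dims,) input vector."""
    rows, cols, _ = nodes.shape

    # 1. Find the best matching unit (BMU): the node closest to the input.
    dists = np.linalg.norm(nodes - x, axis=-1)            # Euclidean distances
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))

    # 2.-3. Move the BMU towards the input by (distance * learning rate),
    # i.e. nodes[bmu] += learning_rate * (x - nodes[bmu]).
    # 4. Update the neighbours in the same direction, with the update
    # shrinking as the grid distance from the BMU grows (Gaussian assumption).
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist2 = (grid_r - bmu[0]) ** 2 + (grid_c - bmu[1]) ** 2
    neighbourhood = np.exp(-grid_dist2 / (2.0 * neighbour_radius ** 2))

    nodes += learning_rate * neighbourhood[..., None] * (x - nodes)
    return nodes
```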

edu sharing object

Fig. The training process of Musical Agents based on Self-organizing Maps.

Without going into too much detail, let's have a brief overview of how it works. In section A of the image above, we have an audio waveform which is processed through an audio segmentation algorithm. Using the audio segmentation, we find different audio segments automatically, which are visualized as rectangles in the image above. The automatic audio segmentation allows us to generate a dataset of audio samples, in which the samples range from a fraction of a second to a couple of seconds. Using that audio dataset, we extract a set of features in section B, and then we train self-organizing maps to create a discrete latent audio space, which is the sound memory of the musical agent. The creation of the sound memory is similar to the colour latent space example that we covered earlier. Using that discrete latent audio space, we go back to the original recording and label each audio sample with its corresponding cluster in the latent space. This process gives us a symbolic representation of the audio recording. It is very interesting that even in this short composition by Iannis Xenakis in the image above, our approach has revealed musical patterns. For example, the composition starts with a cluster pattern of (8,8,5), and then we see (8,8,5) again somewhere in the middle of the composition. Looking at such a symbolic representation, we can use sequence modeling algorithms in machine learning to try to find recurring patterns, and we can use this approach to perform music. We could generate a discrete latent audio space using the self-organizing map approach, and combine that with a sequence modeling algorithm, such as a factor oracle or recurrent neural networks, to come up with a system that reacts to other performers in real time, as sketched in the code below.
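A much simplified sketch of the symbolic-representation step is shown below: each segment's feature vector is mapped to its BMU cluster, and a first-order transition table stands in for the factor oracle or recurrent neural network mentioned above. The helper names are hypothetical.

```python
# A minimal sketch of turning segment features into a symbolic sequence and a
# simple first-order transition table: a much simpler stand-in for the factor
# oracle / RNN sequence models mentioned above. `segment_features` and
# `som_winner` are assumed to come from the previous steps.
from collections import defaultdict

def cluster_sequence(segment_features, som_winner):
    """som_winner maps a feature vector to its BMU cluster, e.g. minisom's som.winner."""
    return [som_winner(f) for f in segment_features]

def transition_table(sequence):
    """Count which clusters follow which, e.g. 8 -> {8: 3, 5: 2, ...}."""
    table = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        table[current][nxt] += 1
    return table
```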

edu sharing object

Fig. The performance setup of Musical Agents based on Self-organizing Maps (masom-performance.png)

In the image above, we have a performer who is creating an audio output in real time. This audio output is fed into the system, a musical agent. The system first carries out feature extraction and tries to locate the other performer's current state in the self-organizing map, that is, in this abstract discrete latent audio space. The system keeps a history of the performance and tries to match that history to the time series patterns in the training dataset, patterns such as (8,8,5) as we mentioned earlier. For example, let's assume that the current state of the audio input is cluster 8. The machine then knows that, after cluster 8, there are two clusters that it could play: either cluster 8 again, or cluster 5. After deciding on which cluster to play, choosing a sound from that cluster gives us the audio output, the reaction to the other performer.
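Continuing that simplified sketch, the reaction step could look like the following, where `transitions` is the table built above and `cluster_to_samples` is a hypothetical mapping from cluster labels to the audio samples stored in the agent's memory.

```python
# A hedged sketch of the reaction step described above: given the cluster of
# the incoming audio, sample a plausible next cluster from the transition
# table and pick one of that cluster's audio samples. `cluster_to_samples`
# is an assumption standing in for the agent's actual sound memory.
import random

def react(current_cluster, transitions, cluster_to_samples):
    candidates = transitions.get(current_cluster, {})
    if not candidates:
        return None
    clusters, counts = zip(*candidates.items())
    next_cluster = random.choices(clusters, weights=counts, k=1)[0]
    return random.choice(cluster_to_samples[next_cluster])

# e.g. if the current cluster is 8 and the memory contains the pattern (8, 8, 5),
# the agent may answer with a sample from cluster 8 or from cluster 5.
```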


The video above is an example performance of the musical agent system with another human performer, a work titled A Conversation with Artificial Intelligence (Tatar 2017). The machine output is in one channel, and we hear the human performer on the other channel. The video starts with the machine output alone, so the first channel that you hear is generated by the musical agent.

In that example, we had a feature extraction step in which the musical agent was calculating 35 dimensions, that is, 35 numbers as a vector to represent an audio sample. The feature set consisted of timbre features, loudness, the duration of the audio segment, and music emotion recognition features. For the timbre features and loudness, we were calculating statistics on those features: because the audio samples vary in duration, per feature, say loudness, we calculate the mean and standard deviation of the feature across the whole audio sample, which gives us two values, mean and standard deviation, to be added to the feature vector. Using that feature vector, we describe or define the audio sample to train a machine learning model.
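The sketch below illustrates the mean-and-standard-deviation idea for variable-length segments. It does not reproduce the actual 35-dimensional feature set; MFCCs and RMS energy are stand-ins for the timbre and loudness features, and librosa is assumed.

```python
# A sketch of building a fixed-length feature vector from a variable-length
# audio segment by taking per-feature statistics, as described above.
import numpy as np
import librosa

def segment_feature_vector(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre stand-in
    rms = librosa.feature.rms(y=y)                       # loudness stand-in
    duration = len(y) / sr
    # Mean and standard deviation across the whole segment, per feature.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        rms.mean(axis=1),  rms.std(axis=1),
        [duration],
    ])   # one fixed-length vector describing the whole segment
```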

2.1.2 Musical Agents Based On SOMs (cont'd)


Musical Agents Based On Self-Organizing Maps (continued)




Let's see another example in the video above, in which we have a relatively small self-organizing map. In that video, we have clustered audio samples, and we can hear how that clustering worked by listening to the audio samples within the clusters.




There are other example systems in the literature that use a similar approach to the architecture of Musical Agents based on Self-Organizing Maps. One of those systems is AudioStellar by Garber and Ciccola (2019), which we can watch in the video above. The authors organize sounds in a 2D space, and each dot represents an audio sample.

One exciting aspect of AudioStellar is the user interaction possibilities that are already available in it. You can create certain paths in the latent space, or interact with the latent space using generative approaches such as particle simulations or swarms, to use the discrete latent audio space in a musically meaningful way.

edu sharing object

Fig. The training of AudioStellar.

Let's have a look at the machine learning pipeline behind AudioStellar in the figure above. We have a dataset of audio files, which go through a feature extraction process that the authors refer to as preprocessing. All audio is converted to mono, and then a spectrogram feature called the Short-time Fourier Transform (STFT) is calculated, so that we end up with a matrix that represents each audio file. From that spectrogram representation, the authors run their first machine learning algorithm, principal component analysis (PCA). Using principal component analysis, we can keep the main features, or the main distribution, of the original dataset while representing the dataset in a lower-dimensional domain, such as three or two dimensions. After that, the authors use a stochastic visualization technique called t-distributed Stochastic Neighbor Embedding (t-SNE) to come up with a visualization of the dataset in the 2D domain. After running t-SNE, we can already observe clusters appearing. Yet, we still do not have the exact borders of those clusters. Hence, the authors apply another machine learning approach called DBSCAN. After that pipeline, we have a 2D discrete latent audio space in which we can observe clusters, represented as colours, containing circles, which are audio samples. We can play with that discrete latent audio space in a musically meaningful way.
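Below is a hedged re-creation of that pipeline using librosa and scikit-learn. The parameter values and the placeholder folder of audio files are assumptions, not AudioStellar's actual settings, and the sketch expects a reasonably large set of files (t-SNE and PCA need more samples than components).

```python
# A hedged re-creation of the pipeline described above: STFT -> PCA -> t-SNE
# -> DBSCAN. Parameter values are assumptions, not AudioStellar's settings.
import glob
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def audio_to_vector(path, sr=22050, seconds=1.0):
    y, _ = librosa.load(path, sr=sr, mono=True)              # mono conversion
    y = librosa.util.fix_length(y, size=int(sr * seconds))   # equal lengths
    return np.abs(librosa.stft(y)).flatten()                 # STFT magnitude

paths = sorted(glob.glob("samples/*.wav"))                   # placeholder folder
X = np.stack([audio_to_vector(p) for p in paths])

X_pca = PCA(n_components=50).fit_transform(X)                # keep the main variance
X_2d = TSNE(n_components=2).fit_transform(X_pca)             # 2D layout
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_2d)    # cluster borders
# Each row of X_2d is an audio sample's position; `labels` gives its cluster (colour).
```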

2.2 Continuous Latent Audio Spaces

Continuous Latent Audio Spaces

Until now we have covered two examples of discrete latent audio spaces. Now, let's dive into continuous latent audio spaces. As we covered earlier in the time scales of music, while the continuous latent audio spaces are not mathematically continuous, the discrete elements that we have are so perceptually short that we can approach them as if they were infinitely small, as audio quanta.

edu sharing object

Fig. The spectrogram of a frequency sweep from 220 Hz to 1000 Hz or so.

In the example above, we have a frequency sweep starting from 220 Hertz and going up to 1000 Hertz or so. If we look at the spectrogram of that sweep in the image above, we can see that it looks almost like a path. The question here is, how can we come up with a more complex 2D representation of an audio space, and in that 2D representation, how can we represent an audio file almost like a path? This is an interesting research question because there are quite interesting musical applications that we could explore using such an approach.

edu sharing object

Fig. Excerpt from the latent space of Latent Timbre Synthesis.

For example, in the image above we see three colours: red, green, and blue. In those three colours, we see some small circles. The red-coloured path and the blue-coloured path are recordings from an audio dataset. Each circle in those paths represents one audio window. Those paths are interesting because they are actually an emergent property of a machine learning approach that I will be talking about later. The latent space in the image above is not a mathematically continuous space; it is still a grid. However, because each audio quantum, or audio window, is quite small, we can almost approach each audio sample as if it were a continuously changing representation.

2.2.1 Variational Autoencoders

Variational Autoencoders

To understand how we could create a continuous latent audio space, we first need to talk about a specific type of machine learning algorithm called the Variational Autoencoder, a type of Deep Learning architecture.

Autoencoders are Deep Learning architectures for generative modelling. The architecture consists of two main modules: an encoder and a decoder. The encoder maps the input data \[ x \in \mathbb{R}^L \] to a latent vector \[ z \in \mathbb{R}^M \] where z = encoder(x), and M < L. The decoder aims to reconstruct the input data from its latent vector, and ideally, decoder(encoder(x)) = x. The Autoencoder architecture encodes the input data vector to a single point, that is, the latent vector. The Variational Autoencoder (VAE) is an improved version of the Autoencoder architecture that converts the input data vector to a stochastic distribution over the latent space; sampling from this distribution in a differentiable way is handled by the so-called ``reparametrization trick'' (Kingma and Welling 2014 and 2019, Sønderby et al. 2016).

In a VAE, the encoder tries to generate a latent space by approximating p(z|x), while the decoder tries to capture the true posterior p(x|z). The vanilla VAE approximates p(z|x) using \[ q(z|x) \in Q \] with the assumption that p(z|x) is in the form of a Gaussian distribution N(0, I). This approximation is referred to in the literature as Variational Inference (Kingma and Welling 2014). Specifically, the encoder outputs the mean \[ \mu_M \] and the standard deviation \[ \sigma_M \] that parametrize the Gaussian distribution \[ N(z; \mu_M, \sigma^2_M I) \] over a latent space with M dimensions. Hence, the encoder approximates p(z|x) using \[ q^*(z|x) = N(z; f(x), g(x)^2 I) \] where \[ \mu_M = f(x), \quad f \in F, \qquad \sigma_M = g(x), \quad g \in G. \] The decoder's input, the latent vector z, is sampled from the latent distribution \[ q(z) = N(z; f(x), g(x)^2 I). \] Hence, the loss function consists of the reconstruction loss and the regularization term of the Kullback-Leibler divergence (KLD) between \[ q^*(z|x) \] and \[ p^*(z) \]:

\[ L_{f,g} = \mathbb{E}_{q^*(z)}[\log p^*(x|z)] - \alpha \cdot D_{KL}[\, q^*(z|x) \,\|\, p^*(z) \,] \]

In the equation above, the KLD multiplier α is one of the training hyper-parameters of VAE architectures.
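As a minimal sketch of the architecture and loss described above, here is a small VAE in PyTorch with the reparametrization trick and the α-weighted KLD term. The layer sizes and the mean-squared-error reconstruction term are assumptions, not the exact choices of any system discussed here.

```python
# A minimal PyTorch sketch of the VAE described above: the encoder outputs a
# mean and log-variance, z is sampled with the reparametrization trick, and
# the loss is reconstruction plus the alpha-weighted KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=1024, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # f(x): mean
        self.logvar = nn.Linear(256, latent_dim)   # log g(x)^2: log-variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, alpha=1.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")                  # reconstruction term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # D_KL term
    return recon + alpha * kld
```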

2.2.2 Latent Timbre Synthesis

Latent Timbre Synthesis

Let's reiterate that, in the sound applications of Machine Learning architectures, two types of latent spaces are utilized: discrete and continuous latent audio spaces. Continuous latent audio spaces encode audio quanta into a latent space using various Machine Learning approaches such as Variational Autoencoders (VAEs). In those types of latent audio spaces, the input is an audio window, which contains a couple of thousand samples and lasts a fraction of a second. The network can either be trained directly on the audio signal (Tatar et al. 2023), or on some type of audio spectrogram, as in the figure below (Tatar et al. 2020).

Latent Timbre Synthesis (Tatar et al. 2020) is a Deep Learning architecture for utilizing an abstract latent timbre space that is generated by training a VAE on the Constant-Q Transform (CQT) spectrograms of audio recordings. Latent Timbre Synthesis allows musicians, composers, and sound designers to synthesize sounds using a latent space of audio that is constrained to the timbre space of the audio recordings in the training set.

edu sharing object

Fig. Latent Timbre Synthesis framework.

In this work titled Latent Timbre Synthesis, we are training a variational autoencoder on a specific type of spectrogram representation called the Constant-Q Transform (Schörkhuber and Klapuri 2010), which is a wavelet-based spectrogram. The pipeline in the figure above starts with the calculation of Constant-Q Transform (CQT) spectrograms from an audio frame. The CQT spectrogram is passed to a Variational Autoencoder, which creates a latent space of spectrograms. The latent vectors are later passed to the decoder, which reconstructs the spectrograms, that is, single spectrogram vectors. The Variational Autoencoder in Latent Timbre Synthesis works in the magnitude domain, and thus it only works with real numbers. Hence, the CQT spectrograms generated by the Variational Autoencoder do not have a phase. Thus, in the last step, we take the magnitude CQT spectrograms, run a phase reconstruction method called the Griffin-Lim algorithm to predict the phase of each CQT magnitude spectrogram, and apply the inverse Constant-Q Transform to generate an audio signal.
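A rough sketch of the analysis and synthesis ends of such a pipeline is shown below, using librosa for the CQT and Griffin-Lim steps. The trained VAE is replaced by an identity stand-in so the sketch stays self-contained; the file name and hop length are placeholders, not the settings of Latent Timbre Synthesis itself.

```python
# A rough sketch of the pipeline described above, assuming librosa. The
# trained VAE is replaced by an identity stand-in so the sketch runs end to end.
import numpy as np
import librosa

def vae_reconstruct(frames):
    # Stand-in for a trained VAE such as the one sketched earlier: encode each
    # CQT frame to a latent vector and decode it back (identity here).
    return frames

y, sr = librosa.load("input.wav", sr=44100)                 # placeholder file
cqt_mag = np.abs(librosa.cqt(y, sr=sr, hop_length=256))     # magnitude CQT

cqt_hat = vae_reconstruct(cqt_mag.T).T                      # reconstructed magnitudes

# The VAE only handles magnitudes, so the phase is estimated with the
# Griffin-Lim algorithm before the inverse CQT produces the audio signal.
y_hat = librosa.griffinlim_cqt(cqt_hat, sr=sr, hop_length=256)
```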

edu sharing object

Fig. The user interface of Latent Timbre Synthesis.

The figure above shows the user interface of Latent Timbre Synthesis. We have two audio files, and using Latent Timbre Synthesis and a continuous latent audio space, we can create interpolation curves that change in time. What is an interpolation curve? If we set the interpolation closer to 1, or closer to the top, the generated audio for that time point, or that audio frame, is going to sound similar to audio 1. As we set the interpolation curve lower and closer to -1 at that particular time point, it is going to be more similar to audio 2. The rest of the framework is about loading datasets and different runs in which you could have different machine learning models, setting up the audio inputs and outputs, selecting different sounds, and selecting different regions in those sounds to generate different interpolation curves in an interactive way.
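A minimal sketch of how such an interpolation curve could act on the latent vectors is given below; the variable names are assumptions and not the actual Latent Timbre Synthesis code.

```python
# A small sketch of the interpolation curve described above: for each audio
# frame, a curve value c in [-1, 1] mixes the latent vectors of audio 1 and
# audio 2 before decoding.
import numpy as np

def interpolate_latents(z1, z2, curve):
    """z1, z2: (n_frames, latent_dim) latent vectors of audio 1 and audio 2;
    curve: (n_frames,) values in [-1, 1], where 1 -> audio 1 and -1 -> audio 2."""
    w = (1.0 + np.asarray(curve))[:, None] / 2.0     # map [-1, 1] to [0, 1]
    return w * z1 + (1.0 - w) * z2                   # frame-wise mix

# The mixed latent vectors can then be decoded and converted back to audio
# as in the pipeline sketch above.
```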

The video above exemplifies the sound design process with Latent Timbre Synthesis.

Here on the left, what we see is a representation of the latent space generated by the variational autoencoder; the examples that we just heard were from this particular trained model. Using t-distributed stochastic neighbour embedding (t-SNE), as in AudioStellar, we visualize the latent space generated by the variational autoencoder. One emergent property here is that, even though we did not give the machine learning algorithm any information about where, or from which audio sample, an audio frame is coming, we can clearly see that certain paths or squiggles appear. If we dive into some of those squiggles and zoom into them on the right, we see again a red path and a blue path. In those, each circle is an audio frame from the original audio recording, and even though we did not explicitly give this to the machine learning pipeline, an emergent property was that the audio frames coming from the same audio file appear as a path in the latent space. Thus this is a continuous latent audio space. The green path that we see is an interesting one: if we take the latent vectors of each audio frame in the blue path and the red path, and find the middle point between them in their original, higher number of dimensions (if I am not mistaken, 64 dimensions here), we end up with the green path. This green path is supposed to be the middle point between those two audio samples, and by using the region between the red path and the blue path and creating a variety of green paths, as we saw in the previous example, we can do sound design. This is how Latent Timbre Synthesis works.

To sum up, in the first part of this lecture we covered the materiality of music. We dived into musical composition and sound studies in the 20th and 21st centuries: first, we covered the expansion of musical material, starting at the beginning of the 20th century with the futurist movement, and then we talked about the new approaches to music as organized sound, to give you an introduction to the materiality of music and to ground our latent audio space approaches in sound studies and musical practices. After that, we dived into two kinds of latent audio spaces. First, we looked at discrete latent audio spaces, in which we worked with audio samples ranging from a fraction of a second to a couple of seconds; then we looked at continuous latent audio spaces, in which we used audio frames ranging from around ten milliseconds up to 50 milliseconds. Thank you so much for joining me in this lecture. I am happy to answer your questions if you have any, so feel free to contact me. Have a great day.

edu sharing object

Fig. 2-dimensional visualization of the latent space generated by an LTS model trained on a dataset, using t-distributed stochastic neighbour embedding (t-SNE).

Continuous latent audio spaces encode audio samples as a continuous path in the latent space, where each point is encoded from one audio window. In the figure above, each red circle represents one CQT spectrogram calculated from a single audio window of 1024 samples, encoded from a single audio file, and the path-like appearance of these circles is an emergent property of the latent audio space approach using VAEs. Hence, the red path is a continuous latent audio space encoding of one audio recording.

Acknowledgements

Acknowledgements

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program – Humanities and Society (WASP-HS) funded by the Marianne and Marcus Wallenberg Foundation and the Marcus and Amalia Wallenberg Foundation.

References

References

BBC Radio 3, dir. 2009. "The Futurist Intonarumori by Russolo - 2", YouTube.

Cox, Christoph, and Daniel Warner. 2004. Audio Culture: Readings in Modern Music. A&C Black.

Fan, Jianyu, Kıvanç Tatar, Miles Thorogood, and Philippe Pasquier. 2017. “Ranking-Based Emotion Recognition for Experimental Music.” In Proceedings of the International Symposium on Music Information Retrieval (ISMIR) 2017. Suzhou, China.

Garber, Leandro, and Tomás Ciccola. 2019. “AudioStellar.” https://audiostellar.xyz/.

Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” arXiv:1312.6114 [Cs, Stat], May. http://arxiv.org/abs/1312.6114.

———. 2019. “An Introduction to Variational Autoencoders.” Foundations and Trends® in Machine Learning 12 (4): 307–92. https://doi.org/10.1561/2200000056.

Oslo Philharmonic, dir. 2020. "An Alpine Symphony / Richard Strauss / Vasily Petrenko / Oslo Philharmonic", YouTube.

Roads, Curtis. 2004. Microsound. Cambridge, Mass.: The MIT Press.

Russolo, Luigi. 1913. The Art of Noise. A Great Bear Pamphlet. http://www.artype.de/Sammlung/pdf/russolo_noise.pdf.

Schörkhuber, Christian, and Anssi Klapuri. 2010. “Constant-Q Transform Toolbox For Music Processing.” In Proceedings of the 7th Sound and Music Computing Conference (SMC 2010), 8. Barcelona, Spain.

Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks.” In Proceedings of the 23rd International Conference on Machine Learning (ICML 2016). Pittsburgh, Pennsylvania: ACM Press.

Stockhausen, Karlheinz. 1958. Kontakte. https://en.wikipedia.org/wiki/Kontakte.

Stockhausen, Karlheinz, dir. 1972. Four Criteria of Electronic Music with Examples from Kontakte. Stockhausen Verlag.

Strauss, Richard. 1915. An Alpine Symphony.

Tatar, Kıvanç. 2017. A Conversation with Artificial Intelligence. https://kivanctatar.com/A-Conversation-with-Artificial-Intelligence.

Tatar, Kıvanç, Daniel Bisig, and Philippe Pasquier. 2020. “Latent Timbre Synthesis.” Neural Computing and Applications, October. https://doi.org/10.1007/s00521-020-05424-2.

Tatar, Kıvanç, Kelsey Cotton, and Daniel Bisig. 2023. “Sound Design Strategies for Latent Audio Space Explorations Using Deep Learning Architectures.”

Varèse, Edgard, and Chou Wen-chung. 1966. “The Liberation of Sound.” Perspectives of New Music 5 (1): 11–19.