Organized Sound Spaces with Machine Learning
2. Latent Spaces
2.2.2 Latent Timbre Synthesis
Let's reiterate that, in sound applications of Machine Learning architectures, two types of latent spaces are utilized: discrete and continuous latent audio spaces. Continuous latent audio spaces encode audio quanta into a latent space using various Machine Learning approaches such as Variational Autoencoders (VAEs). In these latent audio spaces, the input is an audio window that contains a couple of thousand samples and lasts a fraction of a second. The network can be trained either directly on the audio signal (Tatar et al. 2023) or on some type of audio spectrogram, as in the figure below (Tatar et al. 2020).
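As a minimal sketch of what such audio windows look like in practice, the snippet below slices a signal into overlapping 1024-sample frames with librosa; the signal, frame size, and hop size are illustrative choices, not necessarily the exact values used in the papers.

```python
import numpy as np
import librosa

sr = 22050
# One second of a 440 Hz sine as a stand-in for a recording.
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# Slice the signal into overlapping windows ("audio quanta").
# 1024 samples at 22050 Hz is roughly 46 ms of audio.
frames = librosa.util.frame(y, frame_length=1024, hop_length=256)

print(frames.shape)  # (1024, n_frames): one column per audio window
```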
Latent Timbre Synthesis (Tatar et al. 2020) is a Deep Learning architecture that utilizes an abstract latent timbre space, generated by training a VAE on the Constant-Q Transform (CQT) spectrograms of audio recordings. Latent Timbre Synthesis allows musicians, composers, and sound designers to synthesize sounds using a latent space of audio that is constrained to the timbre space of the recordings in the training set.
In this work, titled Latent Timbre Synthesis, we train a Variational Autoencoder on a specific type of spectrogram representation called the Constant-Q Transform (Schörkhuber and Klapuri 2010), which is a wavelet-based spectrogram. The pipeline in the figure above starts with the calculation of Constant-Q Transform (CQT) spectrograms from an audio frame. Each CQT spectrogram is passed to the Variational Autoencoder, which creates a latent space of spectrograms. The latent vectors are later passed to the decoder, which reconstructs the spectrograms, that is, single spectrogram vectors. The Variational Autoencoder in Latent Timbre Synthesis works in the magnitude domain, so it operates only on real numbers. Hence, the CQT spectrograms generated by the Variational Autoencoder do not carry phase information. In the last step, we therefore take the magnitude CQT spectrograms, run a phase reconstruction method called the Griffin-Lim algorithm to estimate the phase of each CQT magnitude spectrogram, and apply the inverse Constant-Q Transform to generate an audio signal.
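The analysis and resynthesis ends of this pipeline can be sketched with librosa. This is not the actual LTS implementation: the VAE in the middle is omitted (the magnitude spectrogram passes straight through as a placeholder), and the CQT parameters are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa

sr = 22050
# Two seconds of a 440 Hz sine as a stand-in for a recording.
y = np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr).astype(np.float32)

# Analysis: magnitude CQT (the phase is discarded, just as the
# VAE in Latent Timbre Synthesis only sees magnitude spectrograms).
C = np.abs(librosa.cqt(y, sr=sr, hop_length=256,
                       n_bins=192, bins_per_octave=48))

# A trained VAE would encode C frame by frame and decode it back;
# here the magnitudes pass through unchanged as a placeholder.
C_hat = C

# Resynthesis: Griffin-Lim estimates the missing phase, and the
# inverse CQT is applied internally, returning a time-domain signal.
y_hat = librosa.griffinlim_cqt(C_hat, sr=sr, hop_length=256,
                               bins_per_octave=48, n_iter=32)
```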
The figure above shows the user interface of Latent Timbre Synthesis. We have two audio files, and using Latent Timbre Synthesis and a continuous latent audio space, we can create interpolation curves that change in time. What is an interpolation curve? If we set the interpolation closer to 1, that is, closer to the top, the generated audio for that time point, or that audio frame, will sound similar to audio 1. As we set the interpolation curve lower and closer to -1 at that particular time point, the result becomes more similar to audio 2. The rest of the framework is about loading datasets and different runs (in which you could have different Machine Learning models), setting up the audio inputs and outputs, selecting different sounds, and selecting different regions in those sounds to generate different interpolation curves in an interactive way.
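The interpolation curve can be sketched as a frame-wise weighted mix of two latent sequences. In the snippet below, `encode` and `decode` are hypothetical placeholders for a trained VAE's encoder and decoder (not the actual LTS model), and the mapping from the curve value in [-1, 1] to a mixing weight is an assumption based on the description above.

```python
import numpy as np

# Hypothetical stand-ins for a trained model's encoder/decoder:
# encode() maps a magnitude CQT frame to a latent vector,
# decode() maps a latent vector back to a magnitude CQT frame.
def encode(cqt_frame):  # placeholder, not the actual LTS encoder
    return cqt_frame[:64]

def decode(z):          # placeholder, not the actual LTS decoder
    return np.concatenate([z, np.zeros(128 - 64)])

def interpolate_frames(frames_a, frames_b, curve):
    """Mix two latent sequences frame by frame.

    curve[t] in [-1, 1]: +1 reproduces audio 1, -1 reproduces
    audio 2, and 0 is the midpoint between the two timbres.
    """
    out = []
    for f_a, f_b, alpha in zip(frames_a, frames_b, curve):
        w = (alpha + 1.0) / 2.0            # map [-1, 1] -> [0, 1]
        z = w * encode(f_a) + (1.0 - w) * encode(f_b)
        out.append(decode(z))
    return np.stack(out)

# Toy usage: 100 frames of length-128 magnitude vectors and a
# curve sweeping from audio 1 (+1) to audio 2 (-1) over time.
frames_a = np.abs(np.random.randn(100, 128))
frames_b = np.abs(np.random.randn(100, 128))
curve = np.linspace(1.0, -1.0, 100)
mixed = interpolate_frames(frames_a, frames_b, curve)
print(mixed.shape)  # (100, 128)
```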
The video above exemplifies the sound design process with Latent Timbre Synthesis.
Here on the left, what we see is a representation of the latent space generated by the Variational Autoencoder; the examples we just heard were from this particular trained model. Using t-distributed Stochastic Neighbor Embedding (t-SNE), as in AudioStellar, we visualized the latent space generated by the Variational Autoencoder. One emerging property here is that, even though we did not give the Machine Learning algorithm any information about where, or from which audio sample, each audio frame comes, we can clearly see certain paths, or certain squiggles, appearing.

If we dive into some of those squiggles and zoom into them, on the right we again see a red path and a blue path. In these, each circle is an audio frame from the original audio recording; even though we did not explicitly give this to the Machine Learning pipeline, an emerging property was that the audio frames coming from the same audio file appear as a path in the latent space. This is what makes it a continuous latent audio space.

The green path we see is an interesting one. If we take the latent vectors of each audio frame in the blue path and the red path, and find the middle point between them in their original, higher number of dimensions (if I am not mistaken, 64 dimensions here), we end up with the green path. This green path is supposed to be the middle point between those two audio samples. Using the region between the red path and the blue path, and creating a variety of green paths as we saw in the previous example, we can do sound design; this is how Latent Timbre Synthesis works. A sketch of this midpoint construction follows after the next paragraph.

To sum up, in the first part of this lecture we covered the materiality of music, diving into musical composition and sound studies in the 20th and 21st centuries. First, we covered the expansion of musical material, starting from the beginning of the 20th century with the Futurism movement, and then we talked about the new approaches to music as organized sound, to give you an introduction to the materiality of music and to ground our latent audio space approaches in sound studies and musical practices. After that, we dived into two kinds of latent audio spaces. First, we explored discrete latent audio spaces, in which we worked with audio samples ranging from a fraction of a second to a couple of seconds. Then we explored continuous latent audio spaces, in which we used audio frames ranging from 10-12 milliseconds up to 50 milliseconds. Thank you so much for joining me in this lecture. I am happy to answer your questions if you have any, and feel free to contact me. Have a great day.
Continuous latent audio spaces encode audio samples as a continuous path in the latent space, where each point is encoded from one audio window. In the figure above, each red circle represents one CQT spectrogram calculated from a single audio window of 1024 samples, where all windows come from a single audio sample file; the path-like appearance of these circles is an emergent property of the latent audio space approach using VAEs. Hence, the red path is the continuous latent audio space encoding of one audio sample recording.
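The following is a minimal sketch of this path structure and of the green midpoint path described above. The random-walk sequences are hypothetical stand-ins for the 64-dimensional latent vectors a trained VAE encoder would produce for consecutive audio windows; the midpoints are computed in the original 64 dimensions before t-SNE (scikit-learn's implementation) projects everything to 2-D for plotting.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical latent sequences: one 64-d vector per audio frame,
# standing in for a trained VAE encoder's output on the CQT frames
# of two recordings (the red and blue paths). Random walks mimic
# the path-like structure of consecutive, overlapping windows.
z_red = np.cumsum(rng.normal(size=(200, 64)), axis=0) * 0.1
z_blue = np.cumsum(rng.normal(size=(200, 64)), axis=0) * 0.1 + 2.0

# Frame-wise midpoints in the original 64 dimensions: the green path.
z_green = 0.5 * (z_red + z_blue)

# Project all latent vectors to 2-D with t-SNE for visualization.
emb = TSNE(n_components=2, random_state=0).fit_transform(
    np.concatenate([z_red, z_blue, z_green]))

plt.scatter(*emb[:200].T, s=5, c="red", label="audio 1")
plt.scatter(*emb[200:400].T, s=5, c="blue", label="audio 2")
plt.scatter(*emb[400:].T, s=5, c="green", label="midpoints")
plt.legend()
plt.show()
```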