Organized Sound Spaces with Machine Learning
2. Latent Spaces
2.1 Discrete Latent Audio Spaces
To create a discrete latent audio space, we need to do two things: first, find a way to represent an audio sample, and second, apply a machine learning approach to generate the latent space.
For the first step, we need a way to represent an audio sample, because many machine learning approaches cannot work directly with the raw audio signal. In some applications the durations of the audio samples vary from one sample to another, so we need to summarize each sample as a fixed-length series of numbers, that is, a feature vector, where the length of that series is the number of features. We then apply a feature extraction algorithm to create the feature vectors for all audio samples in our dataset. There are a variety of ways to extract features from an audio signal. For example, we could extract spectral features.
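As an illustration, here is a minimal sketch of this kind of feature extraction using the librosa library. It assumes each audio file is summarized by the mean of its MFCCs over time, so samples of different durations yield feature vectors of the same length; the file names and the choice of 13 coefficients are only placeholders.

```python
import numpy as np
import librosa

def extract_feature_vector(path, n_mfcc=13):
    """Summarize one audio file as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=22050)                     # load audio at a fixed sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                 # average over time -> shape: (n_mfcc,)

# Hypothetical dataset: a list of audio file paths.
paths = ["sample_01.wav", "sample_02.wav", "sample_03.wav"]
features = np.stack([extract_feature_vector(p) for p in paths])
print(features.shape)  # (number of samples, number of features)
```

Averaging over time is only one possible summary; any pooling that maps a variable-length spectral representation to a fixed-length vector would serve the same purpose here.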
There are quite a few different types of spectral features, and we can cluster them into two main categories. The first category is Fourier-transform-based approaches, in which the analysis window stays the same across different frequency bands, such as the Mel-frequency cepstral coefficients in the figure above. On the other hand, we have wavelet-based transforms such as the constant-Q transform in the figure above, in which the analysis window changes across frequency bands: for lower frequency bands we use a longer window, and for higher frequency bands a shorter one. While wavelet-based spectral features are computationally heavier, they can represent relatively low frequencies while keeping low latency in the higher frequency bands.
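To make the contrast concrete, the sketch below computes both a Fourier-based representation (a mel spectrogram with a fixed analysis window) and a constant-Q transform with librosa; the window sizes and bin counts are illustrative values, not settings taken from the text.

```python
import numpy as np
import librosa

y, sr = librosa.load("sample_01.wav", sr=22050)  # hypothetical input file

# Fourier-based: one fixed analysis window (n_fft) for all frequency bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# Constant-Q: the effective window length varies with frequency
# (longer for low bins, shorter for high bins).
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))

print(mel.shape, cqt.shape)  # (mel bands, frames), (CQT bins, frames)
```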
We could use spectral features and calculate the spectrogram of an audio signal so that we could train a machine learning algorithm on those features. Alternatively, we could use higher-level features. For example, in the figure above we see a representation of audio samples from Karlheinz Stockhausen's composition titled Kontakte. Each segment is labeled by a two-dimensional affect estimation model with the dimensions of arousal and valence. Arousal refers to the eventfulness of a sound, whether it is calm or eventful. Valence refers to the pleasantness of a sound: roughly, whether it is something that could initiate negative emotions versus something that could initiate positive emotions. Our previous work has shown that affect rankings are consistent across a variety of listeners (Fan et al. 2017). Using a dataset of affect ratings on audio samples, we trained a machine learning algorithm to obtain an affect estimation model that predicts the pleasantness and eventfulness of an audio recording.
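The text does not specify which model was used for this step, so the following is only a hedged sketch of how such an affect estimation model could be trained: a multi-output regressor mapping per-sample feature vectors to valence and arousal ratings. The random arrays stand in for an actual ratings dataset such as the one described in Fan et al. (2017), and the choice of a random forest is illustrative rather than the authors' method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: feature vectors and human affect ratings in [-1, 1].
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))             # one feature vector per audio sample
y = rng.uniform(-1.0, 1.0, size=(200, 2))  # columns: [valence, arousal]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A multi-output regressor as the affect estimation model.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

valence, arousal = model.predict(X_test[:1])[0]
print(f"predicted valence={valence:.2f}, arousal={arousal:.2f}")
```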
Here, the point is that we can also use higher-level features, instead of spectrogram features, and train a machine learning algorithm on those features as well.
Let's see an example in the video above, in which we calculate higher-level features on a dataset of audio samples. The video starts with a chaotic sound, then continues with a low-arousal, calm sound, then a more pleasant sound, then a more eventful and pleasant sound, then another chaotic sound followed by a calm sound again.