Organized Sound Spaces with Machine Learning: 2.2.1 Variational Autoencoders

2.2.1 Variational Autoencoders

To understand how we could create a continuous latent audio space, we first need to talk about a specific type of Deep Learning architecture called Variational Autoencoders.

Autoencoders are Deep Learning architectures for generative modelling. The architecture consists of two main modules: an encoder and a decoder. The encoder maps the input data \[ x \in \mathbb{R}^L \] to a latent vector \[ z \in \mathbb{R}^M \], where z = encoder(x) and M < L. The decoder aims to reconstruct the input data from its latent vector; ideally, decoder(encoder(x)) = x. A plain Autoencoder encodes each input vector to a single point in the latent space, the latent vector. The Variational Autoencoder (VAE) extends this architecture by mapping each input vector to a probability distribution over the latent space instead. Sampling a latent vector from that distribution while keeping the model trainable by backpropagation relies on the "reparametrization trick" (Kingma and Welling 2014 and 2019, Sønderby et al. 2016).
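The encoder and decoder roles described above can be sketched with a toy linear model. This is only an illustration under stated assumptions: the weight matrix W, the sizes L = 4 and M = 2, and the example inputs are hand-picked here, whereas a real autoencoder learns non-linear mappings from data.

```python
# Toy linear autoencoder with hand-picked (not learned) weights.
L, M = 4, 2  # input and latent dimensionality, with M < L

# Hypothetical weights: the two rows are orthonormal directions in R^L.
W = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]


def encoder(x):
    """Map x in R^L to a latent vector z in R^M (z = W x)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def decoder(z):
    """Map z in R^M back to R^L (x_hat = W^T z)."""
    return [sum(W[m][l] * z[m] for m in range(M)) for l in range(L)]


def reconstruction_error(x):
    """Squared error between x and decoder(encoder(x))."""
    x_hat = decoder(encoder(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# This input lies in the span of the rows of W, so two numbers describe it fully.
x_in_subspace = [3.0, 1.0, 3.0, 1.0]
# This input does not, so compressing it to M = 2 numbers loses information.
x_generic = [1.0, 0.0, 0.0, 0.0]

print("latent vector:", encoder(x_in_subspace))
print("error (in subspace):", reconstruction_error(x_in_subspace))
print("error (generic):", reconstruction_error(x_generic))
```

Because M < L, decoder(encoder(x)) = x can only hold exactly for inputs that the latent space is able to represent; training an autoencoder amounts to choosing the mappings so that the data of interest falls as close as possible to that set.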

In a VAE, the encoder approximates the posterior p(z|x) over latent vectors, while the decoder models the likelihood p(x|z) of the data given a latent vector. The true posterior is intractable, so the vanilla VAE approximates p(z|x) using \[ q(z|x) \in Q \], where Q is a family of Gaussian distributions, and places a standard Gaussian prior \[ p^* (z) = N(0, I) \] on the latent space. This approximation is referred to in the literature as Variational Inference (Kingma and Welling 2014). Specifically, the encoder outputs the mean \[ \mu _M \] and the standard deviation \[ \sigma _M \] that parametrize the Gaussian distribution \[ N(z; \mu_M, \sigma^2 _M I) \] over a latent space with M dimensions. Hence, the encoder approximates p(z|x) using \[q^* (z|x) = N(z; f(x), g(x)^2 I)\] where \[\mu _M = f(x),\] \[f \in F,\] \[\sigma _ M = g(x)\] and \[g \in G.\] The decoder's input, the latent vector z, is sampled from this latent distribution, \[ q^* (z|x) = N(z; f(x),g(x) ^2 I). \] The loss function therefore consists of a reconstruction term and a regularization term, the Kullback-Leibler divergence (KLD) between \[ q^* (z|x)\] and \[p^* (z):\]

\[L_{f,g} = \mathbb{E} _{q^* (z|x)}[\log p^* (x|z)] - \alpha \cdot D_{KL}[q^* (z|x) \,\|\, p^* (z)]\]
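Under the vanilla-VAE assumption that the prior is the standard Gaussian \[ p^* (z) = N(0, I) \], the KLD term has a closed form, so only the reconstruction term needs to be estimated by sampling. Writing \[ f_m(x) \] and \[ g_m(x) \] for the m-th components of the encoder outputs (a notation introduced here for illustration only):

\[D_{KL}[q^* (z|x) \,\|\, p^* (z)] = \frac{1}{2} \sum_{m=1}^{M} \left( f_m(x)^2 + g_m(x)^2 - 1 - \log g_m(x)^2 \right)\]

This term is zero exactly when f(x) = 0 and g(x) = 1 in every dimension, that is, when the encoder output matches the prior.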

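The KLD multiplier α in \[L_{f,g}\] is one of the training hyper-parameters of VAE architectures: it sets how strongly the latent distribution is pulled towards the prior relative to reconstruction quality. As written, \[L_{f,g}\] is an objective to be maximized; in practice, training minimizes its negative.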
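To make the two terms concrete, the following is a minimal sketch of how the objective could be evaluated for a single input. It is not a full VAE: the function names, the values standing in for f(x) and g(x), and toy_decoder are hypothetical, and it assumes a standard Gaussian prior and a unit-variance Gaussian decoder, neither of which is prescribed by the text above.

```python
import math
import random


def sample_latent(mu, sigma, rng):
    """Reparametrized draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, sigma^2 I) || N(0, I)] for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I). Assumption: unit-variance Gaussian decoder."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)


def vae_objective(x, mu, sigma, decode, alpha, rng, n_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)] - alpha * KL[q(z|x) || p(z)]."""
    reconstruction = 0.0
    for _ in range(n_samples):
        z = sample_latent(mu, sigma, rng)
        reconstruction += gaussian_log_likelihood(x, decode(z))
    reconstruction /= n_samples
    return reconstruction - alpha * kl_to_standard_normal(mu, sigma)


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder network (M = 2 -> L = 3)."""
    return [z[0], z[1], z[0] + z[1]]


x = [1.0, -1.0, 0.0]   # one input vector, L = 3
mu = [0.8, -0.9]       # stands in for the encoder output f(x)
sigma = [0.3, 0.2]     # stands in for the encoder output g(x), must be positive

for alpha in (0.0, 1.0, 4.0):
    rng = random.Random(0)  # same noise for every alpha, so only the KL weight changes
    value = vae_objective(x, mu, sigma, toy_decoder, alpha, rng, n_samples=100)
    print(f"alpha={alpha}: objective={value:.4f}")
```

With the noise held fixed, raising α lowers the objective by α times the KLD term, which is how a larger multiplier pushes the encoder outputs towards the prior during training.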