
AI Composition: New artistic workflows with generative audio tools

Genoël von Lilienstern

IV. The Individual Areas of Work

4.3 Training and 4.4 Generation

4.3 Training

4.3.1 About the term repository

Two different repositories are currently of particular interest for training with audio: mimikit from ktonal and rave from Antoine Caillon (IRCAM).

I call them repositories or repos here because the terms app or program suggest a more or less user-optimised GUI. A repository is simply a place where things are stored, usually source code, documentation and configuration files. This is typical for projects that are still under development but are already being used and tested publicly.

4.3.2 mimikit

The Music Modelling Toolkit (mimikit)9 is a Python package by ktonal for machine learning with audio data. It focuses on training autoregressive neural networks to generate audio. A detailed demonstration of mimikit can be found in the MUTOR instructional video of the same name. In workshops on AI and audio, we consistently find that participants end up coming back to mimikit, as it has the advantage of handling small amounts of data and delivering results quickly.
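To make the idea of autoregressive training concrete, here is a minimal sketch in plain PyTorch. It illustrates the principle only and is not mimikit's actual API: the network learns to predict the next sample from the preceding ones.

import torch
import torch.nn as nn

class TinyAudioRNN(nn.Module):
    # Toy next-sample predictor; real models are considerably larger.
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, 1)
        h, _ = self.rnn(x)
        return self.head(h)               # one prediction per time step

model = TinyAudioRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
wave = torch.randn(1, 16000, 1)           # stand-in for a short training file

pred = model(wave[:, :-1])                # teacher forcing: input shifted by one sample
loss = nn.functional.mse_loss(pred, wave[:, 1:])
loss.backward()
opt.step()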

4.3.3 rave

Rave10 (Realtime Audio Variational autoEncoder) is a deep-learning-based framework for generative audio developed by Antoine Caillon at IRCAM. It trains a neural network model from audio data and enables both fast and high-quality audio waveform synthesis. In Max and Pd, it is accompanied by the nn~ external, which makes it possible to use trained models in real time for applications such as audio generation and timbre transformation/transfer. A demonstration of rave can also be found in the MUTOR instructional video of the same name.
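Once a training run is finished, rave can export the model as a TorchScript file that is usable outside the training environment. A minimal sketch of a reconstruction round trip in Python, assuming such an exported model (the file name is a placeholder):

import torch

torch.set_grad_enabled(False)

model = torch.jit.load("my_model.ts")     # placeholder path to an exported rave model
x = torch.randn(1, 1, 2**16)              # (batch, channels, samples); stand-in for real audio
z = model.encode(x)                       # compress the audio into the latent space
x_hat = model.decode(z)                   # synthesise audio back from the latent codes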

4.3.3.1 Hardware

A basic problem when training neural networks is the available processing power: deep learning benefits greatly from GPUs, which are not available in a suitable form on every computer. It may therefore be worth purchasing a powerful graphics card, such as Nvidia's GeForce RTX 4060. The newer generation of Apple's M2 and M3 processors also appears to be suitable for machine learning. Alternatively, Python code can be executed in Colab notebooks on Google's servers; on a smaller scale, as with mimikit, this is quite practical.
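In PyTorch-based tools, choosing between these options usually comes down to a few lines of device selection, roughly:

import torch

# Pick the fastest available backend: Nvidia GPU (CUDA), Apple silicon (MPS,
# i.e. the M-series chips), or fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"training on: {device}")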

4.3.4 Summary

When using both repositories, it becomes apparent that the parameterisation of the networks plays an important role. With mimikit, you can even make fundamental changes to the dimensionality of the neural network, which is an interesting way of influencing the resulting sound quality. The hop size, for example, is a variable whose variation quickly leads to audible differences. In practice, however, most users of these notebooks tend to look for a setting that works well and then use it over and over again, varying instead what they put into the network, i.e. the dataset, in order to get different results. In the case of rave this is particularly understandable, because the training process can take several days and you don't want to risk having selected a setting that ultimately produces little that is useful.
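A rough illustration of why the hop size matters: it determines how many analysis frames a file is cut into and thus the temporal resolution the network works with. The numbers below are purely illustrative.

sr = 44100                # sample rate in Hz
n_samples = 10 * sr       # a ten-second file
frame = 2048              # analysis window length in samples

for hop in (256, 512, 1024, 2048):
    n_frames = 1 + (n_samples - frame) // hop
    print(f"hop={hop:4d}: {n_frames:5d} frames, {hop / sr * 1000:.1f} ms per step")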


4.4 Generation

4.4.1 Differentiation from training

In the case of mimikit, two different processes run so close to each other that they appear almost like a single one. The training across successive epochs is constantly visible, and in between, four example files are generated every four epochs: the current checkpoint is used to generate new data samples that resemble the training data.
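Schematically, the interleaving looks something like the following sketch. Only the schedule is real here; train_step and render are placeholders for whatever the notebook actually does.

def train_step(epoch):
    pass                                       # placeholder: one epoch of optimisation

def render(epoch, i):
    return f"epoch{epoch}_example{i}.wav"      # placeholder: sample audio from the checkpoint

for epoch in range(1, 13):
    train_step(epoch)                          # training remains continuously visible
    if epoch % 4 == 0:                         # every four epochs ...
        files = [render(epoch, i) for i in range(4)]   # ... four example files appear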

4.4.2 Utilisation of different training levels

The quality of the results varies depending on the level of training. The error rate is particularly high at the beginning, so many repetitive loops, static sounds or simply silence are to be expected.

In the middle stage, at around ten to thirty epochs, the variance is highest. Beyond that, the model tends towards a more stable state, which with longer individual files often approaches overfitting.

Prompting is also important for starting the generation. This refers to the initialisation with a small sound file, which then triggers a sound prediction based on the training data.
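A minimal sketch of what such prompting means for an autoregressive model, again in plain PyTorch rather than mimikit's actual API: the prompt fills the model's context, and generation continues from there one sample at a time.

import torch
import torch.nn as nn

rnn = nn.GRU(1, 64, batch_first=True)     # untrained stand-in for a trained network
head = nn.Linear(64, 1)

prompt = torch.randn(1, 256, 1)           # the small sound file used as initialisation
x = prompt
with torch.no_grad():
    for _ in range(1024):                 # naive loop; real code caches the RNN state
        h, _ = rnn(x)
        nxt = head(h[:, -1:])             # prediction conditioned on everything so far
        x = torch.cat([x, nxt], dim=1)
generated = x[:, prompt.shape[1]:]        # the newly predicted continuation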

4.4.3 Creative use of checkpoints

However, the generation stage also offers interesting possibilities for composing; this already anticipates the final area of work, composition.

Björn Erlach from ktonal has written a notebook in which a SuperCollider pattern script rhythmically accesses different stored checkpoints. This is an elegant way to organically mix different sound materials within a single sound file.

The corresponding notebook by ktonal is called Ensemble Generator.11 In the sound study Prokonsori12 by Dohi Moon, checkpoints from a Prokofiev piano piece and a Korean pansori song are accessed in alternation.
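The idea can be paraphrased in Python as a hedged sketch; this is not the actual Ensemble Generator notebook, and the checkpoint file names are hypothetical. Copies of the same network are loaded from different trainings, and a cyclic pattern decides which one produces each successive segment.

import itertools
import torch
import torch.nn as nn

def make_net():
    return nn.GRU(1, 64, batch_first=True)    # must match the trained architecture

nets = []
for path in ("prokofiev_ep30.pt", "pansori_ep30.pt"):   # hypothetical checkpoints
    net = make_net()
    net.load_state_dict(torch.load(path))
    nets.append(net)

pattern = itertools.cycle(nets)                # A, B, A, B, ... as in Prokonsori
seed = torch.zeros(1, 64, 1)
segments = []
with torch.no_grad():
    for _ in range(8):
        out, _ = next(pattern)(seed)           # stand-in for real audio generation
        segments.append(out)
piece = torch.cat(segments, dim=1)             # alternating materials in one stream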

4.4.4 Latency spaces and live generation

The results of rave's training are so-called latent spaces. A latent space is an n-dimensional space in which the sound characteristics learned during training are stored as potential, so to speak. In this form, they can be used as a resource for live performances. The problem with latent spaces is that they are relative black boxes and inherently lack temporal linearity, so their interpretation can be relatively unpredictable. The reliable retrieval of specific sound coordinates from latent spaces is a field of active research. Exciting examples of latent space interpretations can be found in Moisés Herrera's live DJ sets.13

Latent spaces can be exported in rave as .ts files (TorchScript) and then read in Max/MSP using the nn~ object.14 In this way, they can be used for live electronic experiments. The Intelligent Instruments Lab Reykjavik15 has published a collection of different, already trained .ts files on its Huggingface16 page.
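The same .ts files can also be explored outside Max. A hedged sketch in Python, assuming an exported model (the file name is a placeholder): two points in the latent space are interpolated and decoded, one simple way of travelling between sound regions.

import torch

torch.set_grad_enabled(False)
model = torch.jit.load("model.ts")             # placeholder for an exported rave model

z_a = model.encode(torch.randn(1, 1, 2**16))   # latent codes of two stand-in sounds
z_b = model.encode(torch.randn(1, 1, 2**16))

steps = []
for t in torch.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b                # linear interpolation between coordinates
    steps.append(model.decode(z))              # synthesise audio at each point
audio = torch.cat(steps, dim=-1)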

4.4.5 Timbre transfer

Another way of using latent spaces is to transfer timbres to other sounds. The documentation of the nn~ object in Max/MSP shows how a percussion rhythm is mapped onto the timbre of a female voice: the percussion rhythm is the (potentially live) input, while the female voice is available as a .ts file. The same technique underlies AI covers, where the timbre of person A's voice is often mapped onto the timbre of person B's voice.
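In code terms, this timbre transfer simply means encoding one sound with a model trained on another. A minimal sketch, assuming a model trained on voice recordings (the file name is a placeholder):

import torch

torch.set_grad_enabled(False)
voice = torch.jit.load("voice_model.ts")   # placeholder: model trained on a voice

drums = torch.randn(1, 1, 2**16)           # stand-in for the (potentially live) percussion input
z = voice.encode(drums)                    # the gestures of the drums ...
transferred = voice.decode(z)              # ... rendered with the voice's timbre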


8 https://ultimatevocalremover.com/
9 https://github.com/ktonal/mimikit
10 https://github.com/acids-ircam/RAVE