AI Composition: New, artistic workflows in dealing with generative audio tools

III. Neural Audio Synthesis

3.1 Differentiation between audio data and symbolic notations of sound

This chapter is about neural audio synthesis, or generative audio: in other words, about tools that work with audio data in some form, whether for training, generating or selecting.

What is meant by audio data? Anything that can be heard at the beginning or end of a process as a recorded or artificially generated sound event: the data of a specific sound recording or of artificially generated sounds. The term 'raw audio' is often used in connection with AI and sound applications. This is somewhat problematic, as digital or digitised sound is always encoded in some kind of format and is therefore never raw, i.e. never present in its pure form.
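The point that digital sound is always encoded can be made concrete with a short sketch. The snippet below, using only Python's standard library, generates one second of a sine tone and writes it as a WAV file; the sample rate, bit depth and channel count are illustrative choices, but some such choices must always be made, which is why 'raw audio' is never truly raw.

```python
import math
import struct
import wave

# Illustrative encoding parameters; every digital audio file commits to
# choices like these, so the data is never format-free.
SAMPLE_RATE = 44100   # samples per second
BIT_DEPTH = 16        # bits per sample (16-bit signed PCM)
DURATION = 1.0        # seconds
FREQ = 440.0          # Hz (concert A)

# One second of a sine tone as integer sample values.
amplitude = 2 ** (BIT_DEPTH - 1) - 1
samples = [
    int(amplitude * math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION))
]

# Encode the samples as a mono 16-bit PCM WAV file.
with wave.open("sine_a440.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(BIT_DEPTH // 8)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack("<" + "h" * len(samples), *samples))
```

Changing any of the parameters above produces a different file from the same sound event, which is exactly the sense in which the data is encoded rather than raw.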

Furthermore, audio data is not synonymous with symbolic representations of sound, i.e. notations of music. Digital notations of musical events are found, e.g., in the MIDI or OSC protocols. The area of symbolic representation is also, traditionally, the domain of generative approaches. A fairly well-known, somewhat older example is Magenta.3
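The distinction between symbolic representation and audio data can be illustrated in a few lines: a symbolic note event is a handful of abstract values, while the audio rendering of the same note is tens of thousands of sample values. The event fields below are illustrative, loosely modelled on MIDI concepts, not an actual MIDI implementation.

```python
import math

# Symbolic: one note as an abstract instruction (a few numbers).
# Field names are illustrative, loosely following MIDI conventions.
note_event = {"pitch": 69, "velocity": 100, "duration": 1.0}  # MIDI note 69 = A4

# Audio: the same one-second note as actual samples (tens of thousands of numbers).
SAMPLE_RATE = 44100
freq = 440.0 * 2 ** ((note_event["pitch"] - 69) / 12)  # MIDI pitch number -> Hz
audio = [math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
         for n in range(int(SAMPLE_RATE * note_event["duration"]))]

print(len(note_event), "symbolic fields vs", len(audio), "audio samples")
```

A generative system operating on the symbolic side manipulates the three fields; one operating on audio data must model all 44,100 samples per second.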

Also interesting, in the context of symbolic operations, is SCAMP,4 an algorithmic package for Python by Marc Evanstein in which there are interfaces between composition and machine learning.

But this text is about dealing with AI and audio files. As a composer, I have a particular approach to electroacoustic music, i.e. music that some people still call 'tape music' today but which essentially means music composed for a spatial listening situation with loudspeakers.


3.2 Musical genres in the context of electro-acoustics and sampling

Working with sound files is in the general proximity of the musical technique of sampling. Almost as long as sound recordings have been possible, there has been an interest in reintegrating recorded sounds into a musical framework. The use of sound recordings was particularly pronounced in musique concrète in post-war France, in hip-hop and rap music from the 1980s onwards, in the work of experimental musicians such as John Oswald or Vicki Bennett and, of course, in all kinds of internet art since the 2000s. Working with sounds in the context of machine learning and neural networks is therefore close to this rather postmodern musical tradition of sampling.

Live electronics is another area for which today's presentation is of interest—in other words the live transformation, processing or generation of sound, either in combination with acoustic instruments, in pure form as a live performance or generated from live coding.


3.3 Fields of work in neural audio synthesis: overview

After this introduction we now come to the practicalities of composing with audio tools. The following table divides the process into different steps.

[Embedded table: Acquisition → Preparation → Training → Generation → Selection → Composition → Spatialisation]


In the context of generative AI, the two grey columns (training and generation) are the steps most directly associated with working with neural networks. Training is the phase in which the network works through a prepared database, in this case audio files; generation, once training is complete, is the production of new data, in this case new sound files, on the basis of checkpoints or weights, i.e. calculated probabilities.

These two steps are obviously the ones that can only be carried out by an AI. The other steps, before and after, can be carried out manually, which is often the case.

Acquisition—one could also say collection—refers to the question of how to obtain sound files, where they come from and also how and where they are stored.

Preparation refers to the question of whether and how sound files are specifically prepared, i.e. cleaned, separated, grouped or combined, so that they serve as effectively as possible as a database for training and generation.
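Two of the preparation operations mentioned here, cleaning and separating, can be sketched minimally. The sketch below assumes the audio has already been decoded into a list of float samples in [-1, 1]; the functions and parameters are illustrative, not tied to any particular toolchain.

```python
def normalize(samples):
    """Peak normalisation: scale samples so the loudest peak reaches +/-1.0."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak > 0 else samples

def segment(samples, segment_len):
    """Split a sample list into non-overlapping equal-length segments,
    dropping any remainder, e.g. to build uniform training examples."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples) - segment_len + 1, segment_len)]

# Example: a quiet signal is brought to full scale, then cut into 4-sample chunks.
prepared = segment(normalize([0.0, 0.1, 0.2, 0.1, -0.1, -0.2, -0.1, 0.0, 0.05]), 4)
```

Real preparation pipelines add steps such as resampling, silence removal or source separation, but the principle of turning heterogeneous recordings into uniform, comparable units is the same.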

Anyone who has spent a long time trying to generate a good result and is then faced with a large number of new files, all of seemingly comparable quality, has to deal with the issue of selection. How do you select and organise this amount of data?

And lastly, composition should be mentioned, although each of these steps itself has an impact on the type or character of the composition and is, therefore, actually part of composing. But here composition is meant in the literal sense of putting things together: how do I put the selected results together to form a piece of music, or a performance?
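Composition in this literal sense of putting things together can be sketched as code: the function below joins selected sound fragments with short linear crossfades. It is a minimal sketch, assuming each fragment is a list of float samples at the same sample rate and at least as long as the fade.

```python
def crossfade_concat(fragments, fade_len):
    """Concatenate sound fragments, overlapping each join by fade_len samples
    with a linear crossfade to avoid clicks at the seams."""
    out = list(fragments[0])
    for frag in fragments[1:]:
        for i in range(fade_len):
            t = (i + 1) / fade_len  # fade position, 0..1 across the overlap
            out[-fade_len + i] = out[-fade_len + i] * (1 - t) + frag[i] * t
        out.extend(frag[fade_len:])
    return out

# Three constant-level fragments joined with 4-sample crossfades.
piece = crossfade_concat([[1.0] * 8, [0.0] * 8, [0.5] * 8], fade_len=4)
```

In practice one would of course work at a much larger scale and with more musical fade shapes, but the basic gesture of assembling a piece from selected results is the same.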

In addition, the question of spatialisation can also be an aspect. How are sound sources to be arranged in space?
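In the simplest stereo case, placing a sound between two loudspeakers can be sketched with a constant-power pan law. This is an illustrative sketch, not tied to any particular spatialisation tool; `pan = 0.0` is hard left and `1.0` is hard right.

```python
import math

def pan_mono_to_stereo(samples, pan):
    """Place a mono sample list in the stereo field with a constant-power
    pan law; returns (left, right) channel lists."""
    angle = pan * math.pi / 2        # map pan 0..1 to 0..90 degrees
    left_gain = math.cos(angle)
    right_gain = math.sin(angle)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    return left, right

# Centre position: both channels at ~0.707 of the original level,
# so total power stays constant as the source moves across the field.
left, right = pan_mono_to_stereo([1.0, 0.5, -0.5], pan=0.5)
```

Multichannel loudspeaker setups generalise the same idea, distributing gains over more than two speakers.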

It is now interesting to consider to what extent these five steps, which are grouped around the two "classic" AI operations, can themselves also be carried out using AI.

But why would this be interesting?

One often has to deal with large quantities of individual files. Automating such steps can enable workflows that would otherwise be quantitatively unmanageable. Qualitatively, too, unusual and artistically appealing orderings can emerge, for example by sorting sound files according to certain sonic characteristics using clustering algorithms.
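Sorting sound files by sonic characteristics with a clustering algorithm can be sketched as follows: each sound is described by two simple features (RMS level and zero-crossing rate, a rough brightness/noisiness measure), and a tiny k-means groups the feature vectors. The features, the seeding heuristic and the toy sounds are all illustrative choices; real systems typically use richer features such as spectral descriptors or learned embeddings.

```python
def features(samples):
    """Describe a sound by (RMS level, zero-crossing rate)."""
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / len(samples)
    return (rms, zcr)

def kmeans(points, k, iterations=20):
    """Group 2-D feature points into k clusters; returns a label per point.
    Deterministic farthest-first seeding keeps the sketch reproducible."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

# Two loud, smooth toy sounds and two quiet, noisy ones
# should land in separate clusters.
sounds = [[0.9, 0.8, 0.9, 0.8], [0.8, 0.9, 0.8, 0.9],
          [0.01, -0.01, 0.01, -0.01], [0.02, -0.02, 0.02, -0.02]]
labels = kmeans([features(s) for s in sounds], k=2)
```

The resulting labels can then drive an ordering or selection of the generated files, which is the kind of unusual, data-driven organisation described above.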


3 https://magenta.tensorflow.org/
4 http://scamp.marcevanstein.com/#