AI Composition: New, artistic workflows in dealing with generative audio tools

Genoël von Lilienstern

IV. The Individual Areas of Work

4.1 Acquisition

4.1.1 Problems

Which sounds does one use? Where do they come from?

These questions touch on the main criticism of generative AI: that artistic works or sound recordings by other artists are used, possibly even from culturally vulnerable contexts, without the original creators gaining anything from it.

German copyright law contains a provision on pastiche,5 i.e. the artistically alienating (verfremdend) use of materials and citations. Whether this provision also covers the use of large amounts of data, however, is a difficult question to answer.

4.1.2 Sound sources

4.1.2.1 One's own sound libraries

If you want to avoid this kind of problem altogether and keep a clear conscience, so to speak, you can fall back on your own sounds. Your own sound recordings are, of course, unproblematic. You can search your hard drive for your own music and your old or unfinished projects. It might also be fun to meet up with friends and record music together that you then use as the basis for training.

4.1.2.2 Open source databases and royalty-free music

Another possibility is to use sound databases that have been placed on the internet specifically as training sets. It must be said, however, that their target group is often not artists. They are mostly aimed at data scientists, medical students and linguists. Ecology, with its nature recordings, is also strongly represented. There could be points of contact here if you are planning to compose a piece that deals with these topics.

There is also the option of using royalty-free music. However, as this music is often designed as functional background music, it is not necessarily something that many people want to work with.

4.1.3 Similarity search

I personally find the following case interesting: I have sound material that I like and I want to collect other, similar sounds into a folder to create my database.

If you already have your own sound database with lots of sounds, there are tools such as Sononym that can perform a similarity search. You can then find suitable 'candidates' among your own sounds. Ableton Live 12, for example, also has a similarity search function.

It is more difficult, but also more interesting, if I don't have a stock of potentially similar sounds myself and want to search for them on the internet. In the image sector, there is the possibility of a reverse image search, for example using Google. I have not yet been able to find a comparable function for sounds. Shazam and other music recognition apps have been around for many years. However, these work with music databases, i.e. they identify entire pieces of music and not individual sounds. A reverse search engine for sound could be very exciting to use. If anyone reading this knows of something similar, I'd be happy to hear from you on Discord.
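To make the idea of a similarity search over your own sounds more concrete, here is a minimal sketch in Python. It is not how Sononym or Ableton Live work internally. It assumes that each sound is already available as a list of samples, and it describes each sound with only two crude features (loudness and zero-crossing rate). The function names and the synthetic test tones are hypothetical and only serve as illustration.

```python
import math

def sine(freq, amp, sample_rate=8000, duration=0.25):
    """Synthetic stand-in for a recorded sound (hypothetical test material)."""
    n = int(sample_rate * duration)
    return [amp * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

def describe(samples):
    """Crude feature vector: (RMS loudness, zero crossings per sample)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / len(samples))

def rank_by_similarity(query, library):
    """Return the library names sorted from most to least similar to the query."""
    q = describe(query)
    scored = [(math.dist(q, describe(s)), name) for name, s in library.items()]
    return [name for _, name in sorted(scored)]

# A tiny 'library' of three tones and a query sound close to the middle one.
library = {
    "low_tone": sine(110, 0.5),
    "mid_tone": sine(440, 0.5),
    "high_tone": sine(1760, 0.5),
}
ranking = rank_by_similarity(sine(400, 0.5), library)
print(ranking)
```

Real tools compare much richer descriptions of timbre, but the principle is the same: turn every sound into a list of numbers and rank the candidates by distance to the query.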


4.2 Data Preparation

Once you have decided on a sound file or on many sound files that you want to work with, there are a variety of ways to deal with this material before training.

4.2.1 Segmentation and grouping

An interesting way of pre-structuring sounds is by segmentation and grouping. This is where techniques known as clustering (for example k-means) and nearest-neighbour search come into play. Roughly speaking, this involves splitting many individual sounds, or one longer sound, into segments, for example by cutting at zero crossings, and then arranging these segments on a two-dimensional plane according to similarity. The aim is to determine which sounds are the nearest neighbours of each other sound, for example by comparing criteria such as the sounds' envelopes and timbres.

A very interesting tool that does this is AudioStellar,6 an open-source programme from the Instituto de Investigaciones en Arte y Culture in Ciencia, Argentina. With AudioStellar you can save a cluster of sounds with certain characteristics in a folder and then use it for training.
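The two steps described above, cutting at zero crossings and grouping the segments by similarity, can be sketched in a few lines of Python. This is only an illustration under simple assumptions and not a description of how AudioStellar works internally: the 'recording' is a synthetic signal made of two tones, each segment is described by just two features, and the grouping uses a very basic k-means. All function names are hypothetical.

```python
import math

def tone(freq, amp, n, sample_rate=8000):
    """Synthetic stand-in for a recording: a sine tone of n samples."""
    return [amp * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

def split_at_zero_crossings(samples, min_len):
    """Cut at upward zero crossings once at least min_len samples have passed."""
    segments, start = [], 0
    for i in range(1, len(samples)):
        if samples[i - 1] < 0 <= samples[i] and i - start >= min_len:
            segments.append(samples[start:i])
            start = i
    segments.append(samples[start:])  # remaining tail, may be shorter than min_len
    return segments

def describe(segment):
    """Crude feature vector: (RMS loudness, zero crossings per sample)."""
    rms = math.sqrt(sum(s * s for s in segment) / len(segment))
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / len(segment))

def k_means(points, k, iterations=10):
    """Basic k-means with evenly spaced starting centroids (deterministic)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move every centroid to the mean of its members (keep it if it has none).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# A loud low tone followed by a quiet high tone.
signal = tone(100, 0.8, 4000) + tone(1000, 0.2, 4000)
segments = split_at_zero_crossings(signal, min_len=300)
points = [describe(s) for s in segments]
labels, centroids = k_means(points, k=2)
print(labels)
```

Each label says which group a segment belongs to. Segments with the same label could then be written to one folder and used as a training set.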

4.2.2 Combinations of sounds

With machine learning, there is often the problem of overfitting, and this is the actual reason why we try to give the data a special twist during preparation. In our context, overfitting means that the model generates a result that is very similar or identical to the source material. Of course we want to avoid that. After all, the aim is to create new sound progressions and new sound combinations.

One way to avoid an overfit is to train the network with different sounds at the same time, so that there is no single, complete target sound that can be reproduced as an overfit. The problem with this, however, is that often either sound A or sound B is returned as an overfit. This happens especially when the sounds are clearly different. With a violin melody and percussion sounds, for example, there is little 'possibility of confusion' when the network predicts the frames to be generated next. The sounds are too clearly distinguishable for the neural network.

The search for 'confusable' similarities is part of the exciting, strategic experimentation, as is the question of which tools can help to gather similar sounds.

For example, you can do the following: take two long sounds, A and B, and load them back to back into a clustering tool, which segments the combined file and then places the passages that are similar in both sounds close to each other on the two-dimensional plane. Once you have found an interesting overlap, you can export it as a cluster, which may work well as a database.
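The search for an overlap between sounds A and B can also be expressed as a small calculation. The following Python sketch assumes that a clustering step has already been carried out and that, for every segment, we know which source it came from and which cluster it landed in. The input format, the function name and the threshold `min_share` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def mixed_clusters(assignments, min_share=0.3):
    """Return clusters in which both sources hold at least min_share of the segments.

    assignments: list of (source, cluster_id) pairs, with source "A" or "B".
    The result maps each qualifying cluster id to the share of segments from A.
    """
    counts = defaultdict(Counter)
    for source, cluster in assignments:
        counts[cluster][source] += 1
    result = {}
    for cluster, per_source in counts.items():
        total = sum(per_source.values())
        share_a = per_source["A"] / total
        if min_share <= share_a <= 1 - min_share:
            result[cluster] = round(share_a, 2)
    return result

# Hypothetical clustering result: cluster 0 is mostly A, cluster 2 is only B,
# cluster 1 contains segments from both sounds in equal parts.
assignments = [
    ("A", 0), ("A", 0), ("A", 0), ("B", 0),
    ("A", 1), ("B", 1), ("A", 1), ("B", 1),
    ("B", 2), ("B", 2), ("B", 2),
]
overlap = mixed_clusters(assignments)
print(overlap)
```

Clusters that pass this check contain material from both sounds and are therefore candidates for the 'confusable' database described above.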

4.2.3 Data augmentation

A term from data science that comes up again and again in this context is data augmentation. Data augmentation refers to the expansion of the data pool in various ways. On the one hand, this can be the search for similar data described above. On the other hand, it can consist of transforming the existing data in multiple ways, so that the data pool can be expanded from within.

[Figure: examples of data augmentation from the graphics field]


This figure, which comes from the graphics field, can inspire your imagination. How can an audio file be copied and modified to create a more diverse database? Sounds could be played backwards. The same sound could be used in different transpositions. Individual sound components could be filtered out. Using the same sound at different bit rates might also be effective.
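The transformations mentioned above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the sound is a list of samples between -1 and 1, transposition is done by simple resampling (so pitch and duration change together, as with a tape machine), 'filtering' is a basic one-pole low-pass, and a reduced bit depth is used as a stand-in for lower bit rates. All function names are hypothetical.

```python
import math

def reverse(samples):
    """Play the sound backwards."""
    return samples[::-1]

def transpose(samples, semitones):
    """Resample with linear interpolation; pitch and duration change together."""
    ratio = 2 ** (semitones / 12)
    out = []
    for n in range(int(len(samples) / ratio)):
        pos = n * ratio
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out

def low_pass(samples, alpha=0.2):
    """One-pole low-pass filter: removes part of the high frequencies."""
    out, y = [], 0.0
    for s in samples:
        y += alpha * (s - y)
        out.append(y)
    return out

def reduce_bit_depth(samples, bits):
    """Quantise to a coarser grid (stand-in for a lower-quality encoding)."""
    levels = 2 ** (bits - 1)
    return [round(s * levels) / levels for s in samples]

def augment(samples):
    """Return the original plus several transformed copies."""
    return {
        "original": samples,
        "reversed": reverse(samples),
        "up_7_semitones": transpose(samples, 7),
        "down_12_semitones": transpose(samples, -12),
        "low_passed": low_pass(samples),
        "4_bit": reduce_bit_depth(samples, 4),
    }

# Synthetic test sound: 0.1 seconds of a 220 Hz tone at 8000 samples per second.
sound = [0.8 * math.sin(2 * math.pi * 220 * i / 8000) for i in range(800)]
variants = augment(sound)
print({name: len(v) for name, v in variants.items()})
```

In this way one source file turns into six training files. Whether such variants actually help against overfitting has to be tested for each model and material.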

4.2.4 Source separation

Another technology that has reached a whole new level in the last two years is source separation. Until now, we have been talking about the vertical separation of sound files, i.e. cutting them into segments along the time axis. However, it is also becoming increasingly possible to extract horizontally coherent sounds from sound mixes, i.e. individual voices or instruments that run through the recording.

Voice separators are quite common, for example to separate vocals from a piece of music and then use the remaining instrumental part for karaoke performances.

At the same time, the extracted voices are also used as the basis for AI covers. A longer passage of audio material from a particular voice is used to train a voice model, which can then be used to perform a timbre transfer. The timbre of sound A is 'superimposed' on sound B, so to speak. A common programme for this is RVC.7

However, not only vocals, but all possible instrumental tracks can be separated. These new individual tracks can also be cleaned up in many cases, for example by reducing reverb or removing the remains of harmony voices.

iZotope offers an example of commercial software for this purpose. I have also found Ultimate Vocal Remover8 to be a particularly well-functioning freeware programme.

4.2.5 Summary

To summarise, these different ways of pre-structuring and pre-processing sounds open up an enormous range of experimental fields, and it is not yet possible to foresee what sonic and compositional possibilities could arise from them.



5 https://www.anwalt.de/rechtstipps/was-ist-der-pastiche-im-urheberrecht-211562.html
6 https://audiostellar.xyz/lang/en/index.html
7 "How To Train AI Voice Models ONLINE For FREE (No GPU Needed)"
8 https://ultimatevocalremover.com/