
AI Composition: New, artistic workflows in dealing with generative audio tools

Website: Hamburg Open Online University
Course: MUTOR: Artificial Intelligence for Music and Multimedia
Book: AI Composition: New, artistic workflows in dealing with generative audio tools

Description

Genoël von Lilienstern

Video

Video in German. Following text in English. –Editor


"Neue, künstlerische Arbeitsabläufe im Umgang mit generativen Audiowerkzeugen" (Genöel von Lilienstern, HOOU HfMT, May 2024, YouTube).

I. AI Tools: A Dynamic Field

1.1 Who is this text aimed at?

This presentation is aimed at anyone from the fields of music, composition and sound art who is interested in experimenting with digitised sound.

It is also aimed at anyone who has questions about an artistic approach to AI and would like to know more, e.g.

  • How exactly do people make music and sound art with AI?
  • What are new techniques and compositional concepts when working with AI?
  • What new sound experiences or even new musical styles could result from the use of AI audio tools?

1.2 Working with tools which are under development

We are operating in an extremely dynamic, rapidly changing field. People are often amazed at what is already possible with AI today. There are also voices that fear the imminent takeover by AI of most areas of our lives. It must be clearly stated, however, that many technologies and approaches are still in their infancy and that what is announced in the title—new artistic workflows—is only just beginning to emerge. For now it is more a case of gaining experience, especially in the artistic and creative fields, because most music-related AI tools are still in a state of development. Nevertheless, the aim here will also be to discuss actual applications, namely apps, code, repositories and even finished pieces of music.

At the same time, however, it is often appropriate to say, "You could imagine working with this application in such and such a way," or, "You have to try that out first."

If you want to work with AI on a more fundamental level, it is part of the process, and also part of the special appeal, to be constantly thinking about improving and expanding the artistic-technological tools at the same time. You learn something along the way, both about technology and about music and artistic processes.


1.3 After the AI tsunami

This text was written at the beginning of 2024, which may be helpful to mention given the short half-life of specific AI technologies. We are currently at the provisional end of a period that many have labelled "the AI tsunami." The AI tsunami covers the period from early/mid-2022 to mid-2023 and is characterised by two events in particular: in 2022, the emergence of generative image models such as DALL-E, Bing and, in particular, Midjourney; and, from the end of 2022 onwards, the release of the large language model ChatGPT.

Although it had remained abstract for many, the hot topic of machine learning and artificial intelligence had already been simmering on the side for quite some time. From this point onwards, however, many people began to gain concrete experience with AI models themselves.

The metaphor of a tsunami-like wave is quite apt as the development of AI has been taking place in waves for several decades. Between moments characterised by particularly outstanding events, such as the introduction of the first chess computer, there were times when less progress was made; these down periods are known as AI winters. It must be added, however, that never before has there been anything like the present AI tsunami in the field of artificial intelligence.


1.4 AI versus AIs

It should be clear that there is no such thing as an AI, but rather many areas of application with varying degrees of development for a wide range of tasks. In everyday dialogue there still seems to be an occasional perception of a uniform, overarching AI superpower. This is, of course, a utopian or dystopian fantasy.

Incidentally, the distinctions between artificial intelligence, machine learning, deep learning and neural networks will not be discussed in detail here. There are certainly many good articles on this, including on the MUTOR platform.


1.5 A focus on the dynamics of creative processes

This text is not so much about how the technology works, nor is it about discursive issues such as copyright. It is intended to provide an overview of technological applications and the associated artistic and technical processes.

It may well happen that one becomes interested in neural networks, attends a workshop on the subject, learns about the various possibilities of programming networks and ends up being able to create a functioning model. After all that, one may still not know how, and whether, one can do anything useful with that network.

Or, in another case, you find your way to a lecture on the sensitive social aspects of AI music. The pros and cons are highlighted, but you don't really learn anything about the artistic potential and detailed possibilities of AI, which are perhaps worth taking seriously.

The aim here is to shed some light on this little-discussed area of concrete work and the associated, sometimes novel and unfamiliar, steps when creating an AI music or sound composition.

II. AI Tools In the Context of Society and Collectivity

2.1 Differentiation from the commercial use of large models

It should be noted beforehand which types of AI tools are not included in this discussion. Since it is current, the Suno.ai platform1 will serve as an example. By means of text prompts, it generates songs which you might think were made by humans and could be played on the radio on a Thursday afternoon.

Suno is building a future where anyone can make great music. Whether you're a shower singer or a charting artist, we break barriers between you and the song you dream of making. No instrument needed, just imagination. From your mind to music.

If you spend more time with Suno, however, the limitations and the relative standardisation of the musical results become apparent. In general, the user is someone who operates the top level of a large, fully trained model. This interface is also optimised in such a way that the results meet an expectation of coherence oriented towards a mainstream idiom. Our expectations should be met, and the large models do this very well, especially within a certain standard.

[Figure: "my latest cuisine"]

On the one hand, the question remains whether the music carries a genuine sense of intention. Whether music means and wants something becomes a relevant musical parameter. A convincing avant-garde orchestral work or a touching Delta-style blues that you want to listen to more than once is unlikely to be created if there is no subjective orientation or, to use the somewhat overused term, no agency.

It is important to remember that no completely new sounds or new musical structures are being created here, but that labelled, human-made music forms the basis.

It is also interesting to see who Suno is primarily aimed at. The fact that it allows non-musicians to influence a musical creation by means of text prompts is probably one of the app's main selling points. This form of recompiling pre-trained results is more of an amusement; however, it does offer a practical opportunity to try out musical and especially lyrical ideas. We can certainly expect this type of generation to produce even more differentiated, more astonishing results in the future.

A musician usually doesn't want to just decide between A, B, C or D, but wants to make the most far-reaching decisions possible. This starts with the question of what sounds form the basis of a training. In concrete terms, our own creative work with machine learning and neural networks is about training models ourselves.

The question of artistic control can be viewed from a political or moral perspective. It is often asked whether musicians, whose music or voice is used as training data for commercial models, should share in the profit from the results. There is also an ecological question, which should not be neglected, as training large models requires a lot of energy.

Anyone interested in working with neural networks in the long term should (a) train models themselves and (b) work with small models.


2.2 Collectivity, knowledge exchange in online forums

There is another, social aspect to working with AI-based tools, that of collectivity.

For example, learning on the internet is an activity based on networking. Since programming languages and neural networks are so complex and time-consuming to work with, it can be helpful to organise oneself as a small group or collective in order to benefit from different interests and specialisations.

In the group ktonal,2 which I founded a few years ago with composer friends, we benefited from the fact that some members had stronger mathematical skills and knowledge of networks, while others were very good at giving feedback on possible musical goals. In this way, speculative ideas could be expressed which were then brought down to earth by the technical team. Vice versa, the members with primarily compositional interests could test new ideas in collaborative, online notebooks (e.g. Jupyter) and provide information on whether these would be helpful for sound experiments.

Online forums are generally extremely helpful for solving problems—and problems with code and training are very common. Discord has been particularly helpful. There are channels with various sub-topics for AI projects where you can read about the experiences of others working on similar projects. Often you can also post your own code or error messages and get helpful responses.

I have set up a public Discord channel, MUTOR AI, for this text and the associated video, which is intended to invite an interactive exchange about the content conveyed here.


2.3 Problem solving with the help of language models

Last but not least, ChatGPT and similar language models are a great resource for finding solutions to problems with code or other neural-network questions. Personally, I am a bit sceptical about ChatGPT and its ability to help with human questions. However, when it comes to the machine providing information about other machine problems, such as Python code, it must be admitted that ChatGPT works very well. It is also a useful instrument for putting your own questions into words as precisely as possible.


1 https://suno.com/ but also https://www.udio.com/
2 Website of ktonal. https://ktonal.com/

III. Neural Audio Synthesis

3.1 Differentiation between audio data and symbolic notations of sound

This is about neural audio synthesis or generative audio. In other words, it is about tools that deal with audio data in some form, like training, generating or selecting.

What is meant by audio data? Everything that can be heard at the beginning or end of a process as a recorded or artificially generated sound event: the data of a concrete sound measurement or of artificially generated sounds. The term 'raw audio' is often used in connection with AI and sound applications. This is somewhat problematic, as digital or digitised sound is always encoded in some kind of format and is therefore never truly raw, i.e. never present in its pure form.

Furthermore, audio data is not synonymous with symbolic representations of sound, i.e. notations of music. Digital notations of musical events are found, e.g., in the MIDI or OSC protocols. The area of symbolic representation is also, traditionally, the domain of generative approaches. A fairly well-known, somewhat older example is Magenta.3

Also interesting, in the context of symbolic operations, is SCAMP,4 an algorithmic package for Python by Marc Evanstein in which there are interfaces between composition and machine learning.

But this text is about dealing with AI and audio files. As a composer I have a special approach to electroacoustic music, i.e. music that some people still call 'tape music' today but which essentially means music composed for a spatial listening situation with loudspeakers.


3.2 Musical genres in the context of electro-acoustics and sampling

Working with sound files is in the general proximity of the musical technique of sampling. Almost as long as sound recordings have been possible, there has been an interest in reintegrating recorded sounds into a musical framework. The use of sound recordings was particularly pronounced in musique concrète in post-war France, in hip-hop and rap music from the 1980s onwards, with experimental musicians such as John Oswald or Vicki Bennett and, of course, in all kinds of forms of internet art since the 2000s. Working with sounds in the context of machine learning and neural networks is therefore close to this rather postmodern musical tradition of sampling.

Live electronics is another area for which today's presentation is of interest—in other words the live transformation, processing or generation of sound, either in combination with acoustic instruments, in pure form as a live performance or generated from live coding.


3.3 Fields of work in neural audio synthesis: overview

After this introduction we now come to the practicalities of composing with audio tools. The following table divides the process into different steps.

[Table: fields of work in neural audio synthesis – acquisition | preparation | training | generation | selection | composition]


In the context of generative AI, the two grey columns—training and generation—are the steps most directly associated with working with neural networks. Training refers to the activity of the network on a prepared database (in this case audio files); generation refers to the production of new data (in this case new sound files) after training has been completed, on the basis of checkpoints or weights, i.e. calculated probabilities.

These two areas are obviously those that can only be carried out by an AI. The other steps, before and after, can be carried out manually, which is often the case.

Acquisition—one could also say collection—refers to the question of how to obtain sound files, where they come from and also how and where they are stored.

Preparation refers to the question of whether and how sound files are specifically prepared, i.e. cleaned, separated, grouped or combined, so that they function as a database for training and generation in a particularly successful way.

Anyone who has been trying to generate a good result for a long time and is then faced with a large number of new files, which may all appear to be of comparable quality, has to deal with the issue of selection. How do you select and organise this amount of data?

And lastly, composition should be mentioned, although each of these steps itself has an impact on the type or character of the composition and is, therefore, actually part of composing. But here composition is meant in the literal sense of putting things together: how do I put the selected results together to form a piece of music, or a performance?

In addition, the question of spatialisation can also be an aspect: how are sound sources to be arranged in the space?

It is now interesting to consider to what extent these five steps, which are grouped around the two "classic" AI operations, can themselves also be carried out using AI.

But why would this be interesting?

One often has to deal with large quantities of individual files. Automating such steps can enable workflows that would otherwise be quantitatively impossible to manage. Qualitatively, too, unusual and artistically appealing orderings can be created, for example by sorting sound files according to certain sound characteristics using clustering algorithms, as sketched below.
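As a minimal sketch of such an automated ordering (not a reconstruction of any specific tool mentioned in this text), the following Python example groups a folder of sound files by simple timbral features using k-means clustering. It assumes librosa and scikit-learn are installed; the folder name "sounds/" and the number of clusters are placeholders.

    # Sketch: group sound files by spectral similarity using k-means clustering.
    # Assumptions: librosa and scikit-learn are installed; "sounds/" holds at least four WAV files.
    import glob
    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    files = sorted(glob.glob("sounds/*.wav"))
    features = []
    for path in files:
        y, sr = librosa.load(path, sr=22050, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # rough timbre profile
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # brightness
        features.append(np.concatenate([mfcc.mean(axis=1), centroid.mean(axis=1)]))

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(np.array(features))
    for path, label in zip(files, labels):
        print(label, path)  # e.g. copy each file into one folder per cluster afterwards

Each resulting cluster could then serve, for example, as a separate training database.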


3 https://magenta.tensorflow.org/
4 http://scamp.marcevanstein.com/#

IV. The Individual Areas of Work

IV. The Individual Areas of Work

4.1 Acquisition

4.1.1 Problems

Which sounds does one use? Where do they come from?

The issue related to these questions is one on which the main criticism of generative AI is based, namely that artistic works or even sound recordings by other artists are used, perhaps even from a culturally vulnerable area, without the original creators having anything to gain from it.

German copyright law contains a passage on pastiche,5 i.e. the artistically alienating (verfremdend) use of materials and quotations. The extent to which this is applicable to large amounts of data, however, is a difficult question to answer.

4.1.2 Sound sources

4.1.2.1 One's own sound libraries

If you don't want to have anything to do with this kind of problem and want to keep a clear conscience, so to speak, you can fall back on your own sounds. Your own sound recordings are, of course, unproblematic. You can search your hard drive for your own music and your old or unfinished projects. It might also be fun to meet up with friends to record music together that you then use as the basis for training.

4.1.2.2 Open source databases and royalty-free music

Another possibility is the use of sound databases placed on the internet specifically for use as training sets. It must be said, however, that their target group is often not artistic: they are mostly aimed at data scientists, medical students and linguists. The ecological field, with nature recordings, is also strongly represented. There could perhaps be points of contact here if you are planning a piece that deals with these topics.

There is also the option of using royalty-free music. However, as this often has a functional background character, it is not necessarily something that many people want to work with.

4.1.3 Similarity search

I personally find the following case interesting: I have sound material that I like and I want to collect other, similar sounds into a folder to create my database.

If you already have your own sound database with lots of sounds, there are tools such as Sononym that can perform a similarity search. You can then find suitable 'candidates' among your own sounds. Ableton Live 12, for example, also has a similarity search function.

It is more difficult, but also more interesting, if I don't have a stock of potentially similar sounds myself and want to search for them on the internet. In the image sector, there is the possibility of a reverse photo search, for example using Google. I have not yet been able to find a comparable function for sounds. Shazam and other music recognition apps have been around for many years. However, these refer to music databases, i.e. entire pieces of music and not individual sounds. A reverse search browser for sound could be very exciting to use. If anyone reading this knows of something similar, I'd be happy to hear from you on Discord.
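For one's own library, at least, a crude approximation of such a similarity search can be improvised from standard audio features. The following sketch ranks the files in a hypothetical folder "library/" by their distance to a chosen reference sound; it assumes librosa and SciPy are installed and has nothing to do with how Sononym or Ableton actually implement their search.

    # Sketch: rank the sounds in a personal library by similarity to one reference sound.
    # Assumptions: librosa and SciPy installed; "reference.wav" and "library/" are placeholders.
    import glob
    import numpy as np
    import librosa
    from scipy.spatial.distance import cosine

    def fingerprint(path, sr=22050):
        y, sr = librosa.load(path, sr=sr, mono=True)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    ref = fingerprint("reference.wav")
    ranked = sorted(glob.glob("library/*.wav"),
                    key=lambda p: cosine(ref, fingerprint(p)))
    print(ranked[:10])  # the ten closest candidates for the training folder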


4.2 Data Preparation

Once you have decided on a sound file or on many sound files that you want to work with, there are a variety of ways to deal with this material before training.

4.2.1 Segmentation and grouping

An interesting way of pre-structuring sounds is segmentation and grouping. This is where clustering techniques, such as k-means or nearest-neighbour methods, come into play. Roughly speaking, this involves splitting many individual sounds, or one longer sound, into segments, for example using a zero-crossing function, and then arranging these sound segments on a two-dimensional plane according to similarity. The aim is to determine which sound is the nearest neighbour of each other sound, for example by paying attention to criteria such as the sounds' envelope curves and timbres. A rough sketch of this idea follows below.
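The following Python sketch illustrates the principle on a small scale, assuming librosa and scikit-learn are installed. Instead of raw zero crossings it segments a longer file at detected onsets, and it projects the segments onto a two-dimensional plane with PCA; "long_take.wav" is a placeholder.

    # Sketch: split a long recording into segments and lay them out on a 2-D plane by similarity.
    # Assumptions: librosa and scikit-learn installed; "long_take.wav" is a placeholder file.
    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    y, sr = librosa.load("long_take.wav", sr=22050, mono=True)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
    bounds = np.concatenate([[0], onsets, [len(y)]])
    segments = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b - a > sr // 10]

    feats = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                      for s in segments])
    coords = PCA(n_components=2).fit_transform(feats)  # one (x, y) point per segment
    for i, (px, py) in enumerate(coords):
        print(f"segment {i}: ({px:.2f}, {py:.2f})")    # similar segments end up close together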

A very interesting tool that does this is AudioStellar,6 an open-source programme from the Instituto de Investigaciones en Arte y Cultura in Argentina. With AudioStellar you can save a cluster of certain sound characteristics in a folder and then use it for training.

4.2.2 Combinations of sounds

With machine learning, there is often the problem of overfitting, and this is in fact the real reason why we try to give the data a special twist during preparation. Overfitting means that we generate a result that is similar or identical to our source material. Of course we want to avoid that; after all, the aim is to create new sound progressions and new sound combinations.

One way to avoid an overfit is to train the network with different sounds at the same time, so that there is no single, whole target sound that can be reproduced as an overfit. The problem with this, however, is that often either sound A or sound B is returned as an overfit. This happens especially when the sounds are clearly different. With a violin melody and percussion sounds, for example, there is no good 'possibility of confusion' when predicting the frames to be generated next. The sounds are too clearly distinguishable for the neural network.

The search for 'confusable' similarities is part of the exciting, strategic experimentation and the consideration of which tools can be helpful to accumulate similar sounds.

For example, you can do the following: take two long sounds, A and B, and load them both together into a clusterer, which segments the overall file and then arranges the areas that are similar in both sounds on the two-dimensional plane. Once you have found an interesting overlap, you can export it as a cluster, which may work well as a database.

4.2.3 Data augmentation

A term from data science that comes up again and again in this context is data augmentation. Data augmentation refers to expanding the data pool in various ways. On the one hand, this can be the search for similar data described above. On the other, it can consist of multiplicative transformations of the existing data, so that the data pool expands out of itself.

[Figure: examples of data augmentation from the image domain]


This figure, which comes from the graphics field, can inspire your imagination. How can an audio file be copied and modified to create a more diverse database? Sounds could be played backwards. The same sound can be used in different transpositions. Individual sound components could be filtered out. Also effective, perhaps, would be sounds with different bit rates.
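A few of these transformations are easy to script. The following sketch, assuming librosa, SciPy and soundfile are installed and using "source.wav" as a placeholder, derives several augmented variants (reversed, transposed, low-pass filtered) from a single file.

    # Sketch: simple audio data augmentation, deriving several variants from one source file.
    # Assumptions: librosa, SciPy and soundfile installed; "source.wav" is a placeholder.
    import librosa
    import soundfile as sf
    import scipy.signal

    y, sr = librosa.load("source.wav", sr=44100, mono=True)

    variants = {
        "reversed": y[::-1].copy(),                                           # played backwards
        "up_3_semitones": librosa.effects.pitch_shift(y=y, sr=sr, n_steps=3),
        "down_5_semitones": librosa.effects.pitch_shift(y=y, sr=sr, n_steps=-5),
        "lowpass": scipy.signal.sosfilt(
            scipy.signal.butter(4, 2000, btype="low", fs=sr, output="sos"), y),
    }
    for name, v in variants.items():
        sf.write(f"augmented_{name}.wav", v, sr)  # each variant enlarges the training pool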

4.2.4 Source separation

Another technology that has reached a whole new level in the last two years is source separation. Until now, we have been talking about the vertical separation of sound files. However, it is also becoming increasingly possible to extract horizontally coherent sounds from sound mixes.

Voice separators are quite common, for example to separate vocals from a piece of music and then use the remaining instrumental part for karaoke performances.

At the same time, the extracted voices are also used as the basis for AI covers. A longer passage of audio material from a particular voice is used to train a voice model, which can then be used to perform a timbre transfer. The timbre of sound A is 'superimposed' on sound B, so to speak. A common programme for this is RVC.7

However, not only vocals, but all possible instrumental tracks can be separated. These new individual tracks can also be cleaned up in many cases, for example by reducing reverb or removing the remains of harmony voices.

iZotope offers an example of commercial software for realising this. I have also found Ultimate Vocal Remover8 to be a particularly well-functioning freeware programme.
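As an open-source alternative for scripting such separations (not one of the tools named above), the Demucs model can be called directly from Python. The sketch below assumes Demucs has been installed via pip; "song.wav" is a placeholder.

    # Sketch: vocal/instrumental separation with the open-source Demucs model.
    # Assumptions: `pip install demucs`; "song.wav" is a placeholder file.
    import demucs.separate

    # Writes a "vocals" stem and a "no_vocals" (accompaniment) stem into ./separated/...
    demucs.separate.main(["--two-stems", "vocals", "song.wav"])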

4.2.5 Summary

To summarise, these different ways of pre-structuring and pre-processing sounds open up an immense range of experimental fields, and it is not yet possible to foresee what sonic and compositional possibilities could arise from them.



5 https://www.anwalt.de/rechtstipps/was-ist-der-pastiche-im-urheberrecht-211562.html
6 https://audiostellar.xyz/lang/en/index.html

7 "How To Train AI Voice Models ONLINE For FREE (No GPU Needed)"

8 https://ultimatevocalremover.com/

4.3 Training and 4.4 Generation

4.3 Training

4.3.1 About the term repository

Two different repositories are currently of particular interest for training with audio: mimikit from ktonal and RAVE from Antoine Caillon (IRCAM).

I call them repositories, or repos, here because the terms app or program suggest more or less user-optimised GUI interfaces. A repository simply means a place where things are stored: usually the source code, documentation and configuration files. This is typical for projects that are still under development but are already being used and tested publicly.

4.3.2 mimikit

The Music Modelling Toolkit (mimikit)9 is a Python package from ktonal that performs machine learning with audio data. It focuses on training autoregressive neural networks to generate audio. A detailed demonstration of mimikit can be found in the MUTOR instructional video of the same name. In workshops on AI and audio, we always find that participants end up coming back to mimikit, as it has the advantage of being able to handle small amounts of data and deliver results quickly.

4.3.3 RAVE

RAVE10 (Realtime Audio Variational autoEncoder) is a deep learning–based generative audio processing platform developed by Antoine Caillon at IRCAM. RAVE is a learning framework for generating a neural network model from audio data, and it enables both fast and high-quality audio waveform synthesis. In Max and Pd, it is accompanied by the nn~ decoder, which makes it possible to use these models in real time for various applications such as audio generation and timbre transformation/transfer. A demonstration of RAVE can also be found in the MUTOR instructional video of the same name.

4.3.3.1 Hardware

A basic problem when training neural networks and deep learning models is the available processing power. Graphics processors (GPUs) are particularly suitable, but they are not available in a suitable form on every computer. It may therefore be worth purchasing a powerful GPU, such as an Nvidia GeForce RTX 4060 card. The newer generation of Apple M2 and M3 processors in MacBooks now also appears to be suitable for machine learning. Alternatively, Python code can be executed in Colab notebooks on Google's servers. On a smaller scale, as with mimikit, this is quite practical.
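A quick way to check which of these options is available on a given machine is sketched below; it assumes PyTorch is installed (the framework used by RAVE, for example).

    # Sketch: check which compute device is available before starting a training run.
    # Assumption: PyTorch is installed.
    import torch

    if torch.cuda.is_available():              # Nvidia GPU (e.g. a GeForce RTX card)
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():    # Apple Silicon (M-series) GPU
        device = torch.device("mps")
    else:
        device = torch.device("cpu")           # fallback; training will be much slower
    print(f"training would run on: {device}")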

4.3.4 Summary

When using both repositories, it becomes apparent that the parameterisation of the networks plays an important role. With mimikit, you can even make fundamental changes to the dimensionality of the neural network. This is an interesting area for influencing the resulting sound quality; varying the hop size, for example, is a variable that quickly leads to sonic differences. In practice, however, most users of these notebooks tend to look for a setting that works well and then use it over and over again, varying instead what they put into the network, i.e. which database, in order to obtain results. In the case of RAVE this is particularly obvious, because the calculation process can take several days and you don't want to risk having selected a setting that ultimately produces little of use.
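To get a feeling for what the hop size means in practice, the short sketch below shows how different hop lengths change the temporal resolution of an STFT analysis, i.e. how many frames the network has to model per second. It assumes librosa; "input.wav" is a placeholder.

    # Sketch: how the hop size changes the time resolution of an STFT analysis.
    # Assumption: librosa installed; "input.wav" is a placeholder file.
    import librosa

    y, sr = librosa.load("input.wav", sr=22050, mono=True)
    for hop in (128, 256, 512, 1024):
        n_frames = librosa.stft(y, n_fft=2048, hop_length=hop).shape[1]
        print(f"hop_length={hop:4d} -> {n_frames} frames "
              f"({hop / sr * 1000:.1f} ms between frames)")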


4.4 Generation

4.4.1 Differentiation from training

In the case of mimikit, two different processes are visible relatively close to each other, making them appear almost like a single process. The training across the successive epochs is constantly visible, and in between, four example files are generated every four epochs. The current checkpoints are used to generate new data samples based on their similarity to the training data.

4.4.2 Utilisation of different training levels

The quality of the results varies depending on the level of training. The error rate is particularly high at the beginning. This means that many standing loops, static sounds or simply silence are to be expected.

In the middle stage, at around ten to thirty epochs, the variance is highest. Beyond that, the model often moves towards a more stable state, which frequently approaches an overfit with longer single files.

Prompting is also important for starting the generation. This refers to the initialisation with a small sound file, which then triggers a sound prediction based on the training data.
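The following sketch is purely schematic and does not reproduce the mimikit API: it only illustrates how a short prompt file seeds an autoregressive generation loop. The "model" is a dummy stand-in (a decaying echo of the last sample) so that the loop actually runs; in practice this would be a trained network, and "prompt.wav" is a placeholder.

    # Schematic sketch of prompted, sample-by-sample generation (not the actual mimikit API).
    # Assumptions: librosa and numpy installed; "prompt.wav" is a placeholder; the "model"
    # below is a dummy stand-in for a trained autoregressive network.
    import numpy as np
    import librosa

    def dummy_predict_next(context: np.ndarray) -> float:
        # Stand-in for a trained network: here simply a decayed copy of the last sample.
        return 0.9 * float(context[-1])

    prompt, sr = librosa.load("prompt.wav", sr=22050, mono=True)
    context = prompt[-2048:].copy()        # the model only "sees" a limited context window

    generated = []
    for _ in range(sr // 2):               # roughly half a second of "continuation"
        nxt = dummy_predict_next(context)
        generated.append(nxt)
        context = np.append(context[1:], nxt)

    print(len(generated), "new samples generated from the prompt")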

4.4.3 Creative use of checkpoints

However, the generation also offers interesting possibilities for composing. This already anticipates the final area of composition.

Björn Erlach from ktonal has written a notebook in which a SuperCollider pattern script can rhythmically access different stored checkpoints. This is an elegant way to organically mix different sound materials within a sound file.

The corresponding notebook from ktonal is the Ensemble Generator.11 In the sound study Prokonsori12 by Dohi Moon, checkpoints from a Prokofiev piano piece and a Korean pansori song are accessed alternately.

4.4.4 Latency spaces and live generation

The results of RAVE's training are so-called latent spaces. A latent space is an n-dimensional space in which the weights are stored as potential, so to speak. In this form, they could be used as a resource for live performances. The problem with latent spaces is that they are relative black boxes and inherently lack temporal linearity, so their interpretation can be relatively unpredictable. The more reliable retrieval of specific sound coordinates from latent spaces is a field of active research. Exciting examples of latent space interpretations can be found in Moisés Herrera's live DJ sets.13

Latent spaces can be exported from RAVE as .ts files and then read in Max/MSP using the nn~ object.14 In this way, they can be used for live electronic experiments. The Intelligent Instruments Lab Reykjavik15 has published a collection of different, already trained .ts files on its Hugging Face page.16
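Such an exported .ts file can also be loaded directly in Python with PyTorch, which is useful for offline experiments. The sketch below assumes that the exported model exposes the encode and decode methods described in the RAVE documentation; "model.ts" is a placeholder and the input here is simply noise.

    # Sketch: use a RAVE model exported as a TorchScript (.ts) file directly from Python.
    # Assumptions: PyTorch installed; "model.ts" is a placeholder; encode/decode follow
    # the RAVE documentation.
    import torch

    torch.set_grad_enabled(False)
    model = torch.jit.load("model.ts")

    x = torch.randn(1, 1, 2**16)          # one channel of audio (here: plain noise)
    z = model.encode(x)                   # position(s) in the latent space
    z = z + 0.5 * torch.randn_like(z)     # wander around that position
    y = model.decode(z)                   # back to audio
    print(x.shape, z.shape, y.shape)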

4.4.5 Timbre transfer

Another way of using latent spaces is to transfer timbres to other sounds. The documentation of the nn~ object in Max/MSP shows how a percussion rhythm is mapped onto the timbre of a female voice: the percussion rhythm is the (potentially live) input, while the female voice is available as a .ts file. This is also the same technology used in the field of AI covers, where the timbre of person A's voice is mapped onto the timbre of person B's voice.


9 https://github.com/ktonal/mimikit
10 https://github.com/acids-ircam/RAVE

4.5 Selection and 4.6 Composition

4.5 The Selection of Results

4.5.1 The problem of countless sound files

If you have not, as just described, found a way to use the generation itself as a means for a composition or a live performance, then you are now often confronted with the problem of having a lot of results from training and generation available and having to invest a relatively large amount of time to decide which of them you want to use as the basis for a composition or performance.

4.5.2 Conceptual approaches to outputs

In general, working with generative tools requires the conceptual imagination of the user. Without clear ideas and goals, these tools quickly come to resemble a vehicle with a powerful engine but no steering wheel. In terms of both the selection and the form of sound presentation, ideas are needed to give the work a direction and the compositional result a shape.

Alexander Schubert has found a very elegant way of dealing with this multitude of results as part of his Av3ry17 project. The associated website contains ten thousand folders, each with short pieces of music that Av3ry has generated. Visitors to the website can listen their way through this seemingly unmanageable amount of AI music.

In his piece Despacito Variations (2023) for electric guitar and effect pedals, Malte Giesen had a recording of a short sequence from the guitar hit "Despacito" processed by the SampleRNN18 network over one thousand epochs. The results of this training cannot be heard directly in the piece itself, however. Instead, they form the basis of a transcription characterised by special, glitch-like sounds and strange, otherwise unimaginable loops.

4.5.3 Measuring deviations

At this point, one could well imagine that an automated comparison is made between the source material and the newly generated sounds, i.e. that the degree of deviation is assessed in tests. In his notebooks k-best and longest output,19 Antoine Daurat has tried to find a solution for this.
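A very crude version of such a deviation measure can be improvised with standard tools (this is not Daurat's method): average MFCC vectors are computed for the source and for each generated file, and the distance between them is used to sort the results. The sketch assumes librosa; the file names are placeholders.

    # Sketch: a crude, automated measure of how far generated files deviate from the source.
    # Assumptions: librosa installed; "training_source.wav" and "generated/" are placeholders.
    import glob
    import numpy as np
    import librosa

    def mfcc_profile(path):
        y, sr = librosa.load(path, sr=22050, mono=True)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    source = mfcc_profile("training_source.wav")
    for path in sorted(glob.glob("generated/*.wav")):
        deviation = float(np.linalg.norm(mfcc_profile(path) - source))
        print(f"{path}: deviation {deviation:.1f}")  # sort or filter by this value afterwards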

4.5.4 Back to step two

Of course, we could also bring a clusterer like AudioStellar into play at this point. Here we could imagine a kind of loop in the composition process: we jump back to step two, the preparation, and use the results of the generation as a database for further training.

4.5.5 Live clustering and concatenative synthesis

Another way to utilise sound results live electronically is the IRCAM/Ableton plug-in Coalescence,20 which is based on the principle of concatenative synthesis.

Here you can split one or more sound files into segments, distribute them on a two-dimensional graphical plane and then match them with an incoming live signal. The first step is more or less identical to AudioStellar, with the addition that this matrix can now be used as a live instrument.

In the composition Holometabola21 for the clarinettist Carola Schaal, I arranged clarinet sounds, alienated with mimikit, in four different Coalescence modules, which Carola Schaal then explores by listening and playing during an improvisational passage of the piece.

Another example of live playing with clustered sound segments are the Tomomibot22 sound performances by Tomomi Adachi and the Berlin programmer Andreas Dzialocha.23


4.6 Composition

4.6.1 Reassembling tonal passages

The final assembly of results into a composition is probably the point at which most musicians would like to lend a hand themselves. Nevertheless, it is also exciting to consider whether the arrangement of larger parts could also be done by an AI.

In auto-arrangement apps such as Jamahook, the main aim seems to be to select parts from an associated library that already match each other. It would be more interesting to associate any type of sound with other sounds according to various criteria.

4.6.2 Concatenative synthesis

Another possibility is the use of concatenative synthesis, as implemented in the programme AudioGuide24 by Ben Hackbarth.

In target-based operations, individual sounds are mapped onto a long target sound: not, as in timbre transfer, from a single latent space, but in a mosaic-like fashion, through many matches of individual sound files. Here too, a strategy can be devised to create entire compositions in this way. A rough sketch of this mosaic idea follows below.
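As announced above, here is a toy version of that mosaic principle (not AudioGuide itself): every fixed-size window of a target sound is replaced by the best-matching grain from a corpus of source files. It assumes librosa, NumPy and soundfile; the folder and file names are placeholders.

    # Sketch: toy concatenative synthesis. Replace each window of a target sound with the
    # best-matching grain from a corpus. Assumptions: librosa, numpy, soundfile installed;
    # "corpus/" and "target.wav" are placeholders.
    import glob
    import numpy as np
    import librosa
    import soundfile as sf

    SR, WIN = 22050, 4096

    def feat(chunk):
        return librosa.feature.mfcc(y=chunk, sr=SR, n_mfcc=13).mean(axis=1)

    # Corpus: fixed-size grains cut from the source files.
    grains = []
    for path in glob.glob("corpus/*.wav"):
        y, _ = librosa.load(path, sr=SR, mono=True)
        grains += [y[i:i + WIN] for i in range(0, len(y) - WIN, WIN)]
    grain_feats = np.array([feat(g) for g in grains])

    # Target: replace each window by its nearest corpus grain.
    target, _ = librosa.load("target.wav", sr=SR, mono=True)
    out = []
    for i in range(0, len(target) - WIN, WIN):
        f = feat(target[i:i + WIN])
        best = int(np.argmin(np.linalg.norm(grain_feats - f, axis=1)))
        out.append(grains[best])
    sf.write("mosaic.wav", np.concatenate(out), SR)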


4.7 Autospatialisation

So far, there are few examples of AI-based multichannel design. The idea of an AI-transformed, spatial auditory impression or an automated, spatial audio dramaturgy seems appealing. If any readers have heard of approaches in this direction, please share them in the Discord channel.



11 Ensemble Generator.

12 Prokonsori, Dohi Moon, 2022.

13 KICamp23, AI Concert, Hexorcismus

14 Github, Acids, IRCAM, nn~
15 Intelligent Instruments Lab.
16 Huggingface, Intelligent Instruments, rave models
17 Av3ry
18 ktonal, mimikit
19 Colab, Github, ktonal, mimikit
20 Coalescence
21 Elbphilharmonie, Decoder Ensemble

22 Tomomibot with Anton Bruckner 1

23 adz.garden
24 Audioguide


V. New Qualities in Working with Generative AI Tools

5.1 New artistic workflows with selection and generation tools

It has become clear that there are various areas of application for working with audio data and AI. Similar to new art and music movements of the past, technical practices can establish themselves at various points in these processes, which can lead to the development of new, genuine styles and genres. Examples discussed in this text include generative sound transformations, AI covers, live interpretations of latent spaces, conceptual arrangements of sounds, transcriptions of AI output and improvisation with live concatenative synthesis. But there are many more, and there will be many more ways to work with AI artistically and compositionally in the future.

Each area of the scheme in itself can be the basis or main method of an artistic concept. For example, a conceptual work could be created from the very first step, the systematic and automated search for sounds.

The different areas can be combined; some steps can also be swapped or performed several times in sequence. This results in new, sometimes unusual workflows that can offer new perspectives on sound source materials, on the organisation and production of sound and on forms of composition.

[Figure: scheme of the fields of work and their possible combinations into workflows]


5.2 Automated, but not automatic

However, this also means that we are saying goodbye to the idea that AI composition with audio files is done in one go by a complete automation system.

Occasionally you hear the somewhat smug question of what percentage of a work is AI-generated and what percentage is your own input. One can clearly say that without your own concept, or at least your own idea of what you want to try out, no composition can be created. This is why an AI composition with self-trained sounds is, as of 2024, the composer's own creation as a whole. As already mentioned, this does not apply to large, commercial models, in which the user is more likely to trigger results that are pre-standardised for coherence.

5.3 Resistance and distance

It should also have become clear that we can say goodbye to the idea that anything is easier to achieve with AI. On the contrary: many work steps are difficult to control and you can only really achieve something by trial and error and with conceptual objectives.

This also brings us to the interesting aspect of working with these tools: a dimension is inserted between our intentional actions and their results, creating a distance that can lead us astray but can also, in positive cases, surprise us. In any case, we are forced to think carefully about what we actually want and how we can reach our goal with the means at hand. Whether we ultimately succeed is another question.

5.4 New editing techniques

On the other hand, AI audio tools enable a variety of concrete, new editing techniques that we could only have dreamed of a few years ago. The most important are:

  • sound separation
  • sound continuation
  • sound cleaning (of reverberation, secondary sounds, etc.)
  • segment groupings through clustering
  • timbre transfer

5.5 New types of authorship

New types of collaboration and collective authorship will also be of interest. Take AI covers, for example, where anyone can now try to compose (or would it be better to say "imitate"?) music that matches a particular music group not only in style but also in sound. This certainly raises a lot of questions, but it should not be viewed negatively in all cases; it could also be seen as a fun, collective continuation of cultural narratives.

5.6 Outlook

I assume that more and more artistic disciplines will emerge in the field of AI-based tools in the future. It can be assumed that artistic innovations in this area will primarily come from people who deal with this technology in more detail or perhaps even specialise in it.

Photography, as a technology newly introduced in the 19th century and partly utilised artistically, offers interesting opportunities for comparison: photography is simple. Anyone can take a photograph, but that doesn't automatically make them a photographer. Activating AI models can also be child's play, depending on the interface. A person who generates images with Midjourney today is therefore not seen as someone who achieves anything special artistically. Someone who has specialised in working with Midjourney, on the other hand, and knows exactly how to use special formats and special prompts in order to get something unexpected and above-average out of the model, might be; but then this will most likely come at the cost of more time. The pain of invention does not seem to be reducible.

Photography as an analogy is also interesting because it shows how long it has taken for a new technology to establish itself as an art form in its own right. In the case of AI, however, this will probably not take a hundred years.

It can also be assumed that the impact of AI will be felt in two directions: in the results of using its tools, but also in those areas that consciously refrain from using AI, or that use their own resources to form counterparts to AI paradigms. Every innovation always motivates a series of counter-proposals.