History of AI

6. Contemporary AI Models

6.1 Iamus

Over the last 20 years, the use of AI in music has advanced significantly. Consider, for example, Iamus (2010), a computer cluster developed at the University of Málaga and powered by Melomics technology. Iamus uses genetic algorithms to create scores autonomously. As in natural selection, a random sequence of notes is first generated, then mutated and finally evaluated against a set of rules. These rules are based on music theory and steer the system towards contemporary classical music, such as Opus One and Adsum (2010), the former recorded by the London Symphony Orchestra in 2011.
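The generate, mutate and evaluate cycle described above can be illustrated with a toy genetic algorithm. The following is a minimal sketch, not the Melomics implementation: the scale, the two scoring rules, the mutation rate and the names `fitness`, `mutate` and `evolve` are all assumptions chosen for illustration.

```python
import random

# Assumption: melodies are lists of MIDI pitches drawn from one octave of C major.
SCALE = [60, 62, 64, 65, 67, 69, 71, 72]

def fitness(melody):
    """Score a melody against two illustrative rules (higher is better)."""
    score = 0
    # Rule 1: reward repeated notes and steps of up to a whole tone,
    # penalise leaps larger than a perfect fourth (5 semitones).
    for a, b in zip(melody, melody[1:]):
        interval = abs(a - b)
        score += 1 if interval <= 2 else (-1 if interval > 5 else 0)
    # Rule 2: reward ending on the tonic (any C).
    if melody[-1] % 12 == 0:
        score += 3
    return score

def mutate(melody, rng, rate=0.2):
    """Replace each note by a random scale note with probability `rate`."""
    return [rng.choice(SCALE) if rng.random() < rate else n for n in melody]

def evolve(length=8, population_size=30, generations=50, seed=0):
    """Evolve a population of random melodies and return the fittest one."""
    rng = random.Random(seed)
    # Step 1: generate random sequences of notes.
    population = [[rng.choice(SCALE) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Step 3 (selection): keep the better-scoring half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Step 2 (mutation): refill the population with mutated survivors.
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the best half of each generation survives unchanged, the top score never decreases from one generation to the next. A real system would use far richer rules and musical encodings than the two used here.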

Adsum, Málaga Philharmonic Orchestra.

6.2 Variational Autoencoders: Magenta, Jukebox and RAVE

In 2011 Google inaugurated Google Brain, a company department dedicated to conducting research in the field of AI. In 2016 Google released Magenta, an ecosystem of tools, models and resources designed to support the creation and processing of artistic content through AI (Engel et al. 2017).

One of the main components of Magenta is MusicVAE, short for Music Variational Autoencoder. In brief, MusicVAE maps input data, such as scores, to and from a continuous latent space using two neural networks: an encoder and a decoder. The encoder converts high-dimensional input data into a low-dimensional probability distribution in the latent space, providing a compact mathematical representation. The decoder reconstructs the original data from a point in that space. Training the two networks together allows MusicVAE to learn a compact, structured representation of the data and to generate new musical material consistent with the characteristics of the input. On top of this compress-and-decompress scheme, MusicVAE adds a hierarchical structure to the decoder so that the output reflects the long-term relationships in the input data. MusicVAEs have been used in various musical contexts, including AI-assisted composition, the creation of sound environments for video games and the automatic generation of musical content.
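The encoder and decoder roles described above can be made concrete with a very small numerical sketch. It is not MusicVAE: the linear maps, the hand-picked weights (`w_mu`, `w_logvar`, `w_dec`) and the 4-dimensional toy input are assumptions for illustration, and no training is performed. It only shows the data flow: input, distribution in latent space, sampled point, reconstruction.

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Map an input vector to the mean and log-variance of a Gaussian in latent space."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in w_mu]
    logvar = [sum(w * xi for w, xi in zip(row, x)) for row in w_logvar]
    return mu, logvar

def sample_latent(mu, logvar, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def decode(z, w_dec):
    """Map a point in latent space back to the input space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]

def kl_divergence(mu, logvar):
    """KL divergence between N(mu, sigma^2) and the standard normal prior."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

rng = random.Random(0)
x = [1.0, 0.0, 0.5, 0.25]                                  # toy high-dimensional input
w_mu = [[0.5, -0.2, 0.1, 0.0], [0.0, 0.3, -0.1, 0.4]]      # hypothetical encoder weights
w_logvar = [[-0.1, 0.0, 0.2, 0.0], [0.0, -0.3, 0.0, 0.1]]
w_dec = [[0.6, 0.0], [-0.2, 0.5], [0.1, -0.1], [0.0, 0.7]]  # hypothetical decoder weights

mu, logvar = encode(x, w_mu, w_logvar)       # 4 dimensions -> 2-dimensional distribution
z = sample_latent(mu, logvar, rng)           # one point in the latent space
reconstruction = decode(z, w_dec)            # back to 4 dimensions

# Generation: decode a point drawn from the prior instead of an encoded input.
generated = decode([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], w_dec)
print(mu, kl_divergence(mu, logvar))
```

In a trained VAE the weights are learned by minimising the reconstruction error plus the KL term, which keeps the latent space smooth enough for sampled points to decode into plausible data.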

In addition to MusicVAE, Magenta includes a series of other tools and models for creating and manipulating musical content. One of these is PerformanceRNN, an LSTM-based recurrent neural network designed to model polyphonic music with expressive timing and dynamics. Notably, PerformanceRNN also includes a constraint layer that allows chords and keys to be specified, providing more control over the generated music. Finally, Magenta includes NSynth, a sound synthesis model that generates sound sample by sample rather than with the oscillators and wavetables found in conventional synthesisers.

VAEs are used not only for symbolic music generation, as in Magenta, but also for neural audio synthesis, in which sound is generated sample by sample without relying on oscillators or wavetables. The popular Jukebox, introduced by OpenAI in 2020 (Dhariwal et al. 2020), is based on this family of models. So is RAVE (Realtime Audio Variational autoEncoder), developed specifically for real-time neural synthesis (Caillon & Esling 2021). RAVE is highly versatile: it supports unsupervised learning on extensive audio datasets without the need for labels or annotations, which makes it particularly suitable for performative and research contexts.

From the 2010s onwards, alongside the autoencoder architecture described above, other types of models have emerged that have revolutionised the field of AI. Here is a brief overview of those most frequently used in creative contexts.

6.3 Recurrent Neural Networks: SampleRNN by Dadabots

Recurrent neural networks (RNNs), mentioned above, are widely used for processing sequential data such as text and time series. Because they carry a hidden state from one step to the next, RNNs can capture dependencies between distant parts of a sequence, which makes them well suited to music generation. A case in point is the duo Dadabots (Carr & Zukowski 2018), who in 2016 used a variant of an RNN developed for the occasion, SampleRNN, to generate a parody album, Bot Prownies, based on NOFX's punk rock album Punk in Drublic.

The album Bot Prownies features music generated autoregressively using SampleRNN: each new audio sample is predicted from the samples generated before it. During training, the model "listened to" the NOFX album 26 times, that is, it made 26 passes over the recording. This resulted in 900 minutes of generated audio, from which humans selected 20 minutes, drawn from various training epochs, to compose the album.
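Autoregressive generation can be sketched with a toy recurrent cell whose output is fed back as its next input. This is not SampleRNN: the weights are random and untrained, the output is a single deterministic value rather than a distribution over quantised amplitudes, and the names `rnn_step` and `generate` are assumptions for illustration.

```python
import math
import random

def rnn_step(x, h, w_in, w_rec, w_out):
    """One recurrent step: update the hidden state, then predict the next sample."""
    h_new = [math.tanh(w_in[i] * x + sum(w_rec[i][j] * h[j] for j in range(len(h))))
             for i in range(len(h))]
    y = math.tanh(sum(w_out[i] * h_new[i] for i in range(len(h_new))))
    return y, h_new

def generate(n_samples, hidden_size=4, seed=0):
    """Generate `n_samples` values in [-1, 1], one at a time."""
    rng = random.Random(seed)
    # Hypothetical untrained weights; a real model would learn these from audio.
    w_in = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    w_out = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    x = rng.uniform(-1, 1)  # initial sample that primes the loop
    audio = []
    for _ in range(n_samples):
        # The predicted sample becomes the input of the next step.
        x, h = rnn_step(x, h, w_in, w_rec, w_out)
        audio.append(x)
    return audio

print(generate(8))
```

The hidden state `h` is what lets each prediction depend on everything generated so far, not only on the previous sample.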

6.4 Generative Adversarial Networks: WaveGAN

Generative adversarial networks (GANs), first introduced by Goodfellow et al. (2014), are composed of two neural networks, a generator and a discriminator, which compete against each other. The generator creates synthetic data while the discriminator tries to distinguish between real and synthetic data. This model is typically used to generate images that are both locally and globally coherent. A variant of this model, DCGAN, forms the basis of WaveGAN, a popular model for unsupervised neural audio synthesis (Donahue et al. 2019).
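The competition between the two networks can be expressed through their loss functions. The sketch below assumes the original cross-entropy formulation with the commonly used non-saturating generator loss, which is not necessarily the loss used by WaveGAN. The discriminator outputs are made-up probabilities rather than the output of a real network.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, synthetic samples 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: the generator wants its samples to be scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: probability that each sample is real.
d_real = [0.9, 0.95]
d_fake_detected = [0.1, 0.05]   # discriminator spots the synthetic samples
d_fake_fooled = [0.5, 0.5]      # discriminator cannot tell them apart

print(discriminator_loss(d_real, d_fake_detected), generator_loss(d_fake_detected))
print(discriminator_loss(d_real, d_fake_fooled), generator_loss(d_fake_fooled))
```

When the discriminator detects the synthetic samples its loss is low and the generator's is high. As the generator improves, the balance shifts the other way, and training alternates between updating one network and then the other.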

6.5 Convolutional Neural Networks: Text-to-image Tools

Convolutional neural networks (CNNs) specialise in processing grid-structured data such as images. They slide convolutional filters across an image to detect patterns and features, which makes them effective for object recognition, classification and other image-related tasks. Popular text-to-image models such as Stable Diffusion are based on architectures derived from CNNs, such as U-Net.
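How a convolutional filter detects a pattern can be shown on a tiny greyscale image. This is a minimal sketch in plain Python: the helper `convolve2d`, the image and the hand-written vertical-edge filter are assumptions for illustration, whereas a CNN learns its filter values during training.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The feature map is zero over the uniform regions and high where the filter overlaps the edge, which is the sense in which a filter "detects" a feature.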

6.6 Transformers: Large Language Models

The transformer model, introduced in the paper "Attention Is All You Need" (Vaswani et al. 2017), has revolutionised the field of Natural Language Processing (NLP). Transformers use self-attention mechanisms to capture relationships between words in a text, making them particularly effective for tasks such as machine translation, text generation and natural language understanding. Popular large language model (LLM) products such as ChatGPT and Copilot are based on this type of model. Transformers also find application in the aforementioned Stable Diffusion.
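The self-attention mechanism can be sketched as scaled dot-product attention over a handful of token vectors. This is a simplified illustration: real transformers first apply learned linear projections to obtain queries, keys and values and use several attention heads, whereas here the toy token vectors are used directly for all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted mix of all values."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Three toy token vectors (assumed embeddings, no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how strongly one token attends to every token in the sequence, including itself. It is through these weights that the model relates words to one another regardless of their distance in the text.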