
Introduction to the MUTOR course on Artificial Intelligence in Music and Multimedia

Prof. Dr. Georg Hajdu

Preface

This is an introduction to the third course in the Music Technology Online Repository, which complements the courses on the Science of Music and the History and Practice of Multimedia. Thanks to the initiative taken by Goran Lazarević (coordinator of HOOU at Hamburg University of Music and Drama, HfMT) and generous support by MMKH (the Multimedia Kontor Hamburg), we were able to solicit contributions by leading figures in the field of AI in music and sound as well as teachers and doctoral students at the HfMT. I would like to acknowledge all of those who have helped to get this course off the ground: Xiao Fu (coordinator of the course), Todd Harrop (text editor), Yuri Akbalkan, and Tam Pham (video editor).

The following text entitled "The Work of Art in the Age of Artificial Intelligence" first appeared in the HfMT magazine ZWOELF and aims at situating AI-created art works within the context of the accelerated development of technology and the new realities they create:

Artificial intelligence is all the hype. For instance, the music software maker Ableton devoted itself to this topic in its blog with "AI and Music-Making: The State of Play." Although it has been used for numerous applications for decades, the general public has only recently discovered AI and, as a result of an overheated debate, has split into the camps of supporters and opponents. After all, nobody can be left cold when the end of humanity is once again invoked. I will therefore develop some ideas on the subject in my article, which will neither explain how AI works in detail nor claim to describe all its manifestations. However, it should become clear that humanity is de facto entering a new stage of cultural development due to the disruptive and transformative potential of AI.

To better understand this, it helps to take a look at our cultural past. The Canadian media theorist Marshall McLuhan divided it into four stages: oral culture, which began over a million years ago with the dawn of mankind; written culture, which emerged around 3,000 years ago; the printing press, which was invented 1,150 or 550 years ago (depending on your point of view); and the information age, which began 150 years ago with the telegraph. If one looks at the time intervals between these revolutionary developments, it becomes clear that they follow an exponential curve. The American futurologist Ray Kurzweil analyzed this acceleration in his book The Singularity is Near and identified a fifth epoch at the end of which technology and human intelligence will merge.
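As a rough check of the acceleration described above, the following sketch computes the gaps between the four stages and how much each gap shrinks relative to the previous one. The ages are the approximate figures quoted in the text (using the 550-year figure for the printing press); the function names are illustrative assumptions, and this is a back-of-the-envelope calculation, not a fitted curve.

```python
# Approximate ages of each stage in years before the present, as quoted in the text.
# The printing press uses the 550-year figure; 1150 is the alternative mentioned.
STAGES = [
    ("oral culture", 1_000_000),
    ("written culture", 3_000),
    ("printing press", 550),
    ("information age", 150),
]


def intervals(ages):
    """Years elapsed between each stage and the next one."""
    return [earlier - later for earlier, later in zip(ages, ages[1:])]


def shrink_factors(gaps):
    """How many times shorter each interval is than the one before it."""
    return [previous / current for previous, current in zip(gaps, gaps[1:])]


ages = [age for _, age in STAGES]
gaps = intervals(ages)
factors = shrink_factors(gaps)

for (name, _), gap in zip(STAGES, gaps):
    print(f"{name}: {gap} years until the next stage")
print("each interval is shorter than the previous by factors of",
      [round(f, 1) for f in factors])
```

Every interval is shorter than the one before it by a factor well above one, which is the kind of shrinking spacing the essay describes as exponential.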

And, indeed, we are witnessing an evolutionary paradigm shift: while the human brain developed in the course of genetic evolution in order to depict external reality as accurately as possible as a survival strategy (an inward projection of reality), the “brains” of AI are increasingly used to create new realities (an outward projection of reality). This is obvious in the case of deep fakes: they have a particular significance in the context of the arts, which more or less contributes to resolving the famous dispute between the philosophers Walter Benjamin and Theodor W. Adorno about the art work in the age of its technical reproducibility. The dispute was sparked, among other things, by the concept of the aura of an art work, which—linked to the concept of the genuine, the unique—is lost due to the fact that reproduction technology multiplies reproduction and, according to Benjamin, "replaces its unique occurrence with its mass occurrence." Although Benjamin laments the disintegration of the aura, he also recognizes the potential that lies in technical reproducibility, which in turn prompted criticism from Adorno. Generally skeptical about the use of technology in art, Adorno wrote in the 1960s: "The decrease in effort, the relief, always means a preponderance of dead matter, of elements that have not passed through the subject, that are externally thing-like and ultimately alien to art."


Fake news existed before AI. Stalin had Nikolai Yezhov erased from a photograph after the latter fell out of favor.

Here, one could start a discussion about whether AI and large language models (LLMs) such as ChatGPT display something akin to what Adorno calls the “subject” through which elements (for instance, a collection of computer-generated melodies) have passed. Some would probably affirm this, while others would vehemently reject the notion. In deep fakes, however, the real and unique is replaced by the deceptively real, which has its very own aura. Although the deceptively real already has a history that begins with the retouching of photographs in the Soviet Union of the 1920s, it has taken on a different dimension since the use of AI. A YouTube blogger called “The Gaze” has produced an excellent video about the film “Bataille de boules de neige,” shot by Louis Lumière in 1896 and restored in 2020, in which he discusses the striking aura of the film after it was colorized and stabilized by AI.



A restored movie by the Lumière brothers features an eerie aura analyzed by The Gaze.


Similar experiences are gained in the audio field, where, for example, in a video not publicly available, the voice of Ella Fitzgerald was replaced by a saxophone sound with such nuance using an application called RAVE that it is hard to imagine that the original could not have sounded like this. We achieve this by training deep learning networks that are fed a corpus of recorded saxophone solos from different sources.


The two stages of RAVE – Variational Auto-Encoder for fast and high-quality neural audio synthesis.



A RAVE tutorial by Antoine Caillon.


The data is independently arranged by the network in a latent space in compressed form, from which it is decoded. LLMs such as ChatGPT function in an analogous way and show a tendency to hallucinate in certain contexts. This word means that they produce a reality that is plausible but does not correspond to the facts. When I asked ChatGPT-3 to generate information about me and about my own software, I was able to experience this hallucination firsthand. While the information about the software was impeccable, my CV contained many details taken from similar CVs. The reason for the latter is that a CV is higher-dimensional than a simple factual text, and its data is clustered with similar data in latent space. This explains why, according to ChatGPT, I was born in Hamburg on May 28, 1960 (incidentally György Ligeti's 37th birthday), while I was actually born in Göttingen on June 21, 1960. These inaccuracies seem almost human but should give us pause for thought if we rely too much on ChatGPT. We will have to develop new skills in dealing with the media in order to avoid the echo chambers of fake news.
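The idea of compressing data into a latent space and decoding it again can be illustrated with a deliberately tiny sketch. It is not how RAVE or ChatGPT work internally: instead of a trained deep network it uses a closed-form principal axis for 2-D points, and all names and data values here are made up for illustration. Each point is compressed to a single latent number and then decoded; decoding a latent value that lies between stored items yields a point that looks plausible but was never in the data, a loose analogy to the hallucinations described above.

```python
import math


def fit_axis(points):
    """Find the mean and main axis of 2-D points (closed-form PCA, not a neural network)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the principal axis
    return (mx, my), (math.cos(theta), math.sin(theta))


def encode(point, mean, axis):
    """Compress a 2-D point into one latent coordinate along the axis."""
    return (point[0] - mean[0]) * axis[0] + (point[1] - mean[1]) * axis[1]


def decode(z, mean, axis):
    """Map a latent coordinate back to a 2-D point on the axis."""
    return (mean[0] + z * axis[0], mean[1] + z * axis[1])


# Hypothetical toy data lying roughly along the line y = 2x.
data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9)]
mean, axis = fit_axis(data)

codes = [encode(p, mean, axis) for p in data]        # the "latent space"
reconstructed = [decode(z, mean, axis) for z in codes]

# Decode a latent value halfway between the first and last item.
blend = decode((codes[0] + codes[-1]) / 2, mean, axis)

for original, approx in zip(data, reconstructed):
    print(original, "->", tuple(round(v, 2) for v in approx))
print("decoded in-between point:", tuple(round(v, 2) for v in blend))
```

The reconstructions are close to, but not identical with, the originals, because the compression discards detail. The in-between point fits the overall pattern of the data without corresponding to any item that was actually stored.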

Of course, there is also the question of how intelligent AI really is. It is currently able to close gaps in human cognition through interpolation in latent spaces or to imitate them, albeit within limits, as is the case with autonomous driving. However, it remains to be seen whether it is capable of extrapolation, i.e. creative leaps such as those made by biological systems like the human brain. After all, a single human brain has around a hundred times as many connections as the 1,000 billion parameters of ChatGPT-4 (Cobus Greyling, "What Are Realistic GPT-4 Expectations?" Medium, 16 Jan. 2023), and these figures do not consider the substantial differences in the respective architectures. Nonetheless, as will become obvious in the units of this HOOU course, we can’t deny the great potential of AI in its application to the arts, whether as a source of inspiration or as assistance systems leveraging its power in music composition, classification, analysis and sound design.



The deep fake software used by Denis Połeć applies his face to all members of a symphony orchestra with astonishing accuracy. The photo from REMIX (2023) looks deceptively real.