
Interactive Machine Learning for Music

Website: Hamburg Open Online University
Course: MUTOR: Artificial Intelligence for Music and Multimedia
Book: Interactive Machine Learning for Music

Description

Prof. Rebecca Fiebrink

Synopsis

Making new digital musical instruments requires an instrument builder to design a "mapping" specifying how a musician's actions, sensed with sensors, should translate into control over sound. Creating mappings can be difficult and time consuming. I discuss how interactive machine learning can be used to make mappings from examples, without programming. I introduce a software tool for making these mappings, provide some examples of instruments and pieces made with interactive machine learning, and discuss how using interactive machine learning can benefit creators making musical instruments and other realtime creative interactions with sensors.

1. Motivation: Why IML for music?

One of my main motivations for beginning to explore machine learning (ML) for music in the early 2000s was my excitement around the possibilities for making new digital musical instruments (or “DMIs”). By that time, decades of work had already been done by researchers and musicians such as Max Mathews (1991), Laetitia Sonami (Bongers 2000), and Michael Waisvisz (Bongers 2000) (see also videos below) to explore the new sonic, musical, and interactive possibilities arising from DMIs. Digital sound synthesis and processing techniques make it possible for such instruments to produce sounds never before heard—in theory, any sound imaginable! Further, a musician’s actions or gestures can be sensed with a huge variety of technologies—not only buttons and knobs, but also cameras, microphones, force or touch or magnetic sensors, physiological sensors capturing brain wave or heart rate information, or any number of others. There is thus huge flexibility in how an instrument maker might link a musician’s actions to the resulting instrument sound: sensor data is fed into a computer or microprocessor, and code—not physics!—determines how sound is produced in response.


Max Mathews demonstrating the Radio Baton.


Laetitia Sonami discussing the Lady’s Glove.


Michael Waisvisz playing The Hands.


(If you are interested in learning more about DMIs, the research behind them, and the music being made with them, a great place to start is the proceedings of the New Interfaces for Musical Expression—or NIME—conference.)

1.1 The "mapping problem"

The computational processing that determines how a DMI’s sound should be influenced by its sensor values is often referred to as the instrument’s “mapping.” Hunt et al. (2003) describe this mapping as defining “the very essence of an instrument”; the mapping strongly influences what sounds can be played, what musician gestures may be used to play them, how comfortable a musician is, how a musician looks to an audience, and how easily a musician can learn or repeat particular musical material.

Yet, designing effective (e.g., musically satisfying) mappings can be very difficult in practice. For instance, an instrument may use many sensors to capture a musician’s actions, and/or an individual sensor may produce many dimensions of numbers (e.g., we might want the x, y, and z coordinates of a point tracked in 3D space, or to compute the energy in multiple frequency bands of a signal). Likewise, our sound synthesis or processing algorithm(s) may require many parameters to be set simultaneously to enable rich control over the sound. This can present an instrument designer with a very difficult task: determining a mapping function (Figure 1) which may take dozens (or even hundreds or more) of inputs and use these to compute dozens (or more) of sound parameters, while also ensuring that the instrument is capable of making musically appropriate sounds with feasible, comfortable (enough), and aesthetically appropriate movements. This task becomes even harder when using any of the many sensors that produce noisy signals; for instance, cameras, microphones, and many types of hardware sensors inevitably produce slightly different values at different times, even for the same musician actions.


Figure 1: A mapping translates sensor input values into sound synthesis or processing parameters.


As a result, implementing a mapping function by writing code is often a painstaking process, even for instrument builders who are expert programmers. For decades, researchers have explored alternative mechanisms for designing these functions, including using clever mathematical operations (e.g., Bencina 2005, Bevilacqua et al. 2005) as well as machine learning, as we will discuss in the next section.

Of course, this view of the technological components of an instrument as inputs, a mapping, and outputs is quite simplistic, and in practice instruments may have many such mappings, pieced together in different ways, as well as other processes that do not fit into this framework. Nevertheless, mappings are a good starting point for thinking and reasoning about DMI design, and other interactions that you might find in live performance can align well with this framework too (e.g., visuals or robots that respond to live audio or gesture).

2. Machine learning: A good tool for creating mappings

Supervised learning is a type of machine learning which can be understood as inferring relationships between two types of phenomena, which we can think of as the “inputs” to a mathematical or programming function, and the “output” of that function. Often, this function is referred to as a “model” (figure 2).


Figure 2: A model is a function that computes an output from one or more input values.

2.1 Two relevant types of supervised learning

For instance, classification is a type of supervised learning technique in which the goal is to build a function or model that is capable of outputting a label or category in response to some input values. Outside of music, classification has been very important in, for instance, e-mail spam classification: an email service typically computes some numbers related to the types of words present in an email, then it predicts whether the email is spam or not. In music performance, we might build a classifier that uses some numbers computed from a webcam to determine in real-time whether a DMI performer is waving her hand or not, or which of five hand gestures a performer is making. The output of this classifier—the current “category” of hand gesture—can then be used as a simple trigger, for instance turning samples on or off.
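To make the classification-as-trigger idea concrete, here is a minimal sketch (not any particular instrument’s implementation) using scikit-learn’s k-nearest-neighbour classifier. The webcam-derived feature values, the gesture labels, and the sampler interface are all invented for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Training data: each row is one demonstration (two numbers computed from a
# webcam frame), labelled with the gesture it shows. Values are placeholders.
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = ["waving", "waving", "still", "still"]

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

def on_new_frame(features, sampler):
    """Called once per webcam frame; toggles a drum sample while waving."""
    gesture = model.predict([features])[0]   # the classifier's category output
    if gesture == "waving":
        sampler.start_loop()                 # hypothetical sampler interface
    else:
        sampler.stop_loop()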

Regression is another type of supervised learning. Its goal is to build a model that is capable of outputting a number in response to some set of input values. In principle, this number could take on any real value (e.g., integers, fractional values like 1.5, even negative values like -22.1). For instance, a very simple regression model might be the function “output = 5 x input”, in which case the model would output -10 for the input value -2, and it would output 5.5 for the input 1.1. In music, a regression model could be used to set any number of synthesis or processing values, for instance setting the length of a virtual string in a physical model of a violin, or setting the EQ of a filter, or the amplitude of an oscillator. (In practice, of course, an instrument builder would typically build in some safeguards, ensuring that the values used to control the music are reasonable, such as ensuring gain remains between 0 and 1). In the common case that an instrument requires many such parameters to be controlled simultaneously, one simple approach is to use several regression models in parallel, each one controlling one musical parameter.
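As a sketch of the “several regression models in parallel” approach (an illustration, not a prescribed implementation), the following trains one small regressor per synthesis parameter and clamps each prediction to a safe range; the sensor snapshots and parameter names are placeholders.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder training data: each row is one snapshot of two sensor values,
# paired with the desired value of each synthesis parameter.
X = np.array([[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]])
targets = {
    "string_length": [0.2, 0.5, 0.9],
    "gain":          [0.1, 0.6, 1.0],
}

# One regression model per parameter, all trained on the same inputs.
models = {name: MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                             random_state=0).fit(X, y)
          for name, y in targets.items()}

def map_sensors(sensor_values):
    """Compute one value per synthesis parameter, clamped to [0, 1]."""
    x = np.array([sensor_values])
    return {name: float(np.clip(m.predict(x)[0], 0.0, 1.0))
            for name, m in models.items()}

print(map_sensors([0.7, 0.6]))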

It is worth noting that today, machine learning has become almost synonymous with “neural networks”—a type of learning algorithm which is loosely inspired by the way that neurons in the brain work. Neural networks can be used for either classification or regression (among other things); however, historically, a great variety of learning algorithms have been proposed and used, and sometimes other algorithms may be preferable.

2.2 No need for programming models by hand

The most exciting thing about using supervised learning to build a mapping function is that a learning algorithm (i.e., a classification or regression algorithm) can produce the mapping function for us, rather than requiring a programmer to write the function by hand (figure 3). Specifically, the learning algorithm will use a set of pairs of inputs and outputs, called the training data, and it will do its best to infer a model function that accurately captures the relationship between those inputs and outputs.


Figure 3: A supervised learning algorithm uses training data to build a model. It attempts to build a model that captures the relationships between inputs and outputs in the training dataset.

For instance, as shown in Figure 4, if a regression algorithm sees the example inputs “4, -11, 13” paired with the example outputs “5, -10, 14”, it may produce the model function “output = 1 + input”, reflecting the relationships between the inputs and outputs in these training data. If this model sees the value “20” as an input, it will thus produce the value “21” as its output.


Figure 4: This regression algorithm has produced the model function “output = 1 + input” using these training examples.
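As a quick check of this worked example, an ordinary least-squares fit on those three training pairs recovers exactly the function “output = 1 + input”; a small sketch using scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# The Figure 4 training examples: inputs (4, -11, 13) paired with outputs (5, -10, 14).
X = np.array([[4.0], [-11.0], [13.0]])
y = np.array([5.0, -10.0, 14.0])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # both ~1.0, i.e. output = 1 + input
print(model.predict([[20.0]]))            # ~[21.0]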


An immediately obvious advantage of this approach is that, if we can supply examples of musician actions—each paired with the sound synthesis or processing parameters we would like to correspond to those actions—the learning algorithm can produce the mapping function for us, without the instrument designer needing to grapple with writing code that appropriately handles high-dimensional and noisy data.

Mapping creation was first identified as a potential use case for supervised learning by Lee et al. in 1991. However, using machine learning was far from straightforward at this time, and their proposed technique never took off. For one thing, while they showed that mappings created with a neural network could be run in real-time performance, they mention that training the network was “computationally intensive” (which I assume means that it required hours, at minimum). Further, while they published a mathematical description of their approach, neither they nor others created ML software tools for others to use in instrument building, so there remained a large technical barrier for others to experiment with ML mappings.

3. How can we support instrument designers in using ML in practice?

When I began working in this space in 2007 as a PhD student at Princeton, laptops were far more capable than when Lee et al. had first proposed using neural networks for mappings, and I could train many usable neural networks in a few minutes—if not a few seconds—fast enough to even use in live performance. Yet still, almost nobody was building instruments using ML; the exceptions tended to be people with computer science or engineering PhDs who had the expertise and inclination to code up their own ML systems and figure out how to connect them to sensors and music.

I therefore set out to explore how we could make ML tools for instrument creators. I did this through a combination of approaches: endlessly tinkering on my own, discovering what seemed like it might work; learning all I could about DMI design through the NIME community as well as the work of graduate students and professors in the Music department at Princeton; and ultimately working with those colleagues and professors to explore a series of ML software prototypes, critique them, and make music with them. By 2010, this work culminated in a lot of learning about how to make ML usable by and useful to instrument creators; a piece of software called Wekinator (which has been downloaded over 50,000 times and which is still used around the world in teaching and creative practice); and my PhD thesis (Fiebrink 2011).

3.1 ML for building creative interactions is different from other ML

One of the most important and repeated lessons I learned through this work was that using ML to build creative interactions, such as new musical instruments and other performance interfaces, is fundamentally different from more conventional uses of ML, and the tools that we use to support this work must therefore look and behave differently as well.

Three key areas of difference, as I discuss below, pertain to the training data, the method for evaluating a model, and the methods for improving or changing a model.

3.1.1 What is the training data?

If I am a musical instrument builder using ML to build a mapping, one obvious source of training data is me. I can simply record demonstrations of the way I might want to manipulate or move with my instrument, and pair these with examples of the sound synthesis or processing parameters I would like to be associated with those actions, and these can become my training data. Or, if I am building an instrument for someone else to play, I could record examples of that person moving or manipulating the instrument.

This ability for a ML system builder to generate suitable training examples is unusual in the landscape of ML applications, where typically a person may use ML because they do not fully understand the phenomenon they are modeling (e.g., whether a medical patient is likely to respond to a treatment, or whether the price of a stock is likely to go up or down).

3.1.2 How should we evaluate whether a model is good, or better than another?

Conventionally, a machine learning model is evaluated based on its ability to “generalise”—that is, to compute appropriate outputs for inputs that it has not seen in the training set. The simplest way to estimate this is to reserve some examples (of inputs paired with outputs), putting them into a “test set” rather than the “training set.” Then generalisation accuracy can be estimated by computing how closely the model produces the desired outputs for the test set examples (e.g., for how many test examples does a classifier produce the right label output?). There are a number of variations on this, but in practice these approaches will give you some simple metrics that you can use to compare one model against another, or to decide whether a model is good enough to use in practice.
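For readers who have not met this procedure before, here is a minimal sketch of estimating generalisation accuracy with a held-out test set (using scikit-learn; the “gesture” data are synthetic placeholders):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic placeholder data: 100 examples of 3 sensor features, two classes.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

# Reserve a quarter of the examples as a test set, train on the rest,
# then estimate accuracy on the held-out examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = KNeighborsClassifier().fit(X_train, y_train)
print("estimated accuracy:", accuracy_score(y_test, model.predict(X_test)))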

But what does it mean if my DMI hand gesture classifier is 90% accurate? Will it make mistakes on gesture variations I can learn to avoid in performance, or on gestures that are crucial to the opening bars of my piece of music? How could a number possibly tell me if my instrument is something I can play comfortably, or repeatably, or musically? Numbers may have a place in some applications of ML to instrument building (e.g., I probably would prefer a 99%-accurate classifier to a 40%-accurate one), but an instrument maker cannot form a full opinion of whether a model is suitable without actually playing with it!

3.1.3 How should we improve or change a model?

Conventionally, a lot of the human work of applied machine learning goes into finding a good modeling approach for a given dataset. This can involve trying different learning algorithms, for instance swapping a neural network for a random forest (Breiman 2001), or changing the configuration or type of neurons in a neural network. This can also involve computing different representations, or features, of the data to be used as inputs into the model. For instance, when building models for musical audio analysis, it is common practice to use spectral representations rather than audio waveforms as inputs to the models, because these can result in better accuracy.
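Because common ML libraries expose different algorithms behind the same training interface, this kind of experimentation often amounts to swapping one estimator for another over a fixed dataset. A sketch of the idea with scikit-learn and placeholder data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# A fixed, placeholder dataset: 200 examples of 8 features (e.g., spectral band energies).
rng = np.random.default_rng(1)
X = rng.random((200, 8))
y = (X[:, :4].sum(axis=1) > 2.0).astype(int)

# Try two different learning algorithms on the same data and compare
# cross-validated accuracy.
for model in (MLPClassifier(max_iter=2000, random_state=0),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))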

Often, such work proceeds under the assumption that the training dataset is sufficient and thorough—that it is a good enough representation of the modeling task to be undertaken, and so effort focuses on the engineering and experimentation work of trying to find the best approach to modeling the patterns in that dataset.

When building an instrument, however, the barrier to providing additional training examples may be very low—particularly if the designer herself is the one demonstrating examples. When a designer tries out a model and discovers that it makes mistakes on particular variants of a hand gesture that is crucial to an intended performance—perhaps it makes mistakes when her wrist is a bit rotated, or when her hand is less well lit—it can be straightforward to remedy those model mistakes by simply providing additional training examples which illustrate the variants causing the mistakes. The learning algorithm can then build a new model from this augmented training set, and there is some reasonable hope that the new model will not make those same mistakes.

Further, it is notable that model mistakes are far from the only reason an instrument designer may wish to change a model. I may build a model, play with it, and wish to make it more complicated by adding capacity to respond to new movements. Or I may wish to change the sound associated with one of my movements to see if it feels more musically satisfying. Or I may wish to try adding a new sensor to my instrument to see if I can more accurately capture aspects of my movement. In all these cases, the appropriate starting point for modifying the model is to modify the training dataset.

If we also keep in mind that an instrument designer rarely knows exactly the form they want the instrument and mapping to take before they begin, but rather discovers these over time in a process of experimentation with the evolving instrument, the requirement to be able to dynamically retrain a model using many variations of a training set becomes even clearer.

3.2 Interactive machine learning

In 2003, Fails and Olsen described an approach to enabling users to build simple image classifiers, in which users could iteratively improve a model through edits to its training set. They term this approach interactive machine learning, in contrast to more conventional approaches to machine learning in which a person focuses their effort on changing the learning algorithm or features rather than the training data. Interactive machine learning (IML) therefore seems like a potentially good fit for musical instrument building, given the observations above around the ease with which instrument builders may be able to provide new training examples, and the importance of enabling designers to iteratively change the training set and experiment with the resulting model variants.
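The loop that Fails and Olsen describe can be reduced to a small skeleton: edit the training set, retrain, try out the model, and repeat. The sketch below is purely illustrative (scikit-learn stands in for whatever algorithm a tool might use), with invented feature values and labels.

from sklearn.neighbors import KNeighborsClassifier

examples = []   # (feature_vector, label) pairs, grown and edited interactively

def add_example(features, label):
    examples.append((features, label))

def train():
    X = [f for f, _ in examples]
    y = [lab for _, lab in examples]
    return KNeighborsClassifier(n_neighbors=min(3, len(examples))).fit(X, y)

# Iteration 1: a first handful of demonstrations, then play with the model.
add_example([0.1, 0.9], "open_hand")
add_example([0.9, 0.1], "fist")
model = train()

# After playing: the model confuses a rotated fist, so demonstrate that
# variant, retrain, and try again.
add_example([0.8, 0.3], "fist")
model = train()
print(model.predict([[0.85, 0.25]]))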

3.3 The Wekinator: An IML tool for music

The tool I built to enable instrument builders to build mappings is called the Wekinator. Wekinator allows users to employ an IML workflow, illustrated in figure 5.


Figure 5: The IML workflow supported in Wekinator.

Wekinator has the following key capabilities:

  • It allows you to make mappings using classification or regression, using a few different algorithms. (It also supports a slightly more complex approach to identifying temporal gestures, called dynamic time warping.) It allows you to build multiple models in parallel; for instance, if you want to build a mapping that controls 10 real-valued synthesis parameters, you can build 10 regression models simultaneously.
  • It allows you to record new training examples in realtime, from demonstrations.
  • It can receive inputs (e.g., sensor values) from anywhere using OpenSoundControl messages. For instance, people have controlled Wekinator using sensors attached to Arduinos, microphones, webcams, game controllers, Leap Motion controllers, and many other devices. (A minimal sketch of this OSC communication appears after this list.)
  • It can send the models’ output values to any other software using OpenSoundControl. For instance, people have used it to control sound in Max/MSP, SuperCollider, Ableton, ChucK, and JavaScript, as well as for controlling game engines, animation software, lighting systems, web apps, physical computing systems built with microcontrollers like Arduino, and other processes.
  • The software itself can also be controlled using OpenSoundControl messages, allowing behaviours like training or loading new models to be triggered by messages sent from other software.
  • It allows you to play with models in realtime immediately after they have been trained, enabling you to try out a new interaction or instrument and decide if you want to change anything about it.
  • It supports an interactive machine learning approach, in which you can change or improve models by immediately and iteratively adding (or removing) training examples.
  • The algorithms Wekinator employs are chosen and configured to generally work well on small datasets. You can sometimes get away with just a few examples (under 10) if you have a straightforward mapping problem. But Wekinator is also fast—you should always be able to run trained models at high rates, and training can take just a few seconds or less for many datasets.
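As mentioned in the list above, other programs talk to Wekinator via OpenSoundControl. Here is a minimal sketch using the python-osc library; the addresses and ports shown (sending features to /wek/inputs on port 6448, receiving model outputs at /wek/outputs on port 12000) are what I understand to be Wekinator’s defaults, but treat them as assumptions and check them against your own Wekinator settings.

from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Send one frame of input features (e.g., two sensor values) to Wekinator.
# Port and address are assumed defaults; verify against your configuration.
client = SimpleUDPClient("127.0.0.1", 6448)
client.send_message("/wek/inputs", [0.42, 0.17])

# Receive Wekinator's model outputs and hand them on (e.g., to a synth).
def on_outputs(address, *values):
    print("model outputs:", values)

dispatcher = Dispatcher()
dispatcher.map("/wek/outputs", on_outputs)
BlockingOSCUDPServer(("127.0.0.1", 12000), dispatcher).serve_forever()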

The following videos demonstrate how Wekinator can be used to make two simple instruments. The first uses a very simple webcam program to capture a performer’s posture, and uses this to choose which drum sequences to play in ChucK:

Using Wekinator to build a posture classifier to control a drum machine.

And this next video demonstrates how we can build a more expressive, complex instrument using regression. Here, the input comes from a GameTrak “Real World Golf” controller, and 9 regression models in Wekinator each control a parameter of the Blotar (Van Stiefel et al. 2004) physical model in Max/MSP:

Using Wekinator to build 9 regression models to control a physical model in Max/MSP with a GameTrak controller.

The Wekinator website has detailed instructions on how to run the software, as well as examples for how to hook it up to numerous sensors and software environments, including the two examples above. If you are interested in the details of how it was designed (and why) you can refer to the original NIME publication (Fiebrink et al. 2009).

4. Instruments and performances made with Wekinator

Wekinator has been used in numerous performances across the world. Here are some examples:

4.1 From the Waters

In the first section of this striking piece—one of the first created with Wekinator—composer Anne Hege has very intentionally designed performers’ movement sequences. Here, machine learning allowed her to fairly easily match poses (captured by GameTrak controllers) within these sequences to sounds, resulting in a mapping that allows the music to flow naturally along with performers’ movements.

Anne Hege’s From the Waters.

4.2 The MARtLET

Michelle Nagai’s MARtLET, another early instrument created with Wekinator, contains 28 light sensors embedded in a wearable piece of tree bark. Here, ML made it feasible to construct mappings that translated these 28 noisy values to up to 15 Max/MSP synthesis parameters, where they control the pitch range of the generated sounds, as well as the pulse period, amplitude, and filter Q parameters of a vocoder.

Michelle Nagai’s MARtLET.

4.3 Spring Spyre

Digital instrument pioneer Laetitia Sonami has been using Wekinator in her Spring Spyre instrument since 2012. Audio pickups attached to the instrument’s three springs act as gesture sensors; these audio signals are filtered in Max/MSP using five biquad filters with different centre frequencies, and the instantaneous amplitudes of the filter outputs are sent to Wekinator as the features capturing the springs’ current states. Wekinator’s outputs are sent back to Max/MSP to control sound synthesis. Sonami also uses a modified controller with 16 faders and buttons to control aspects of the sound synthesis as well as to change which of the spring filter outputs will currently be used in Wekinator’s model computations (see Fiebrink and Sonami 2020 for more details on the instrument construction and use). Sonami uses Spring Spyre to play a variety of pieces, each one employing different models.
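As a rough illustration of this kind of feature extraction (not Sonami’s actual Max/MSP patch), the sketch below computes the amplitude of a few bandpass-filtered copies of an audio block, which could then be sent on as model inputs; the centre frequencies, Q, and audio are placeholders.

import numpy as np
from scipy.signal import iirpeak, lfilter

SR = 44100
audio_block = np.random.default_rng(0).normal(size=1024)   # stand-in for pickup audio

def band_amplitudes(block, centre_freqs_hz, q=10.0):
    """RMS amplitude of the block after each peaking (biquad) filter."""
    feats = []
    for f0 in centre_freqs_hz:
        b, a = iirpeak(f0, q, fs=SR)          # second-order peaking filter
        filtered = lfilter(b, a, block)
        feats.append(float(np.sqrt(np.mean(filtered ** 2))))
    return feats

print(band_amplitudes(audio_block, [150, 400, 900, 2000, 5000]))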

In this video, Laetitia Sonami plays Spring Spyre in a collaborative performance with Zeena Parkins.


In this video of our NIME 2020 presentation, I talk with Laetitia about her use of Wekinator in Spring Spyre and about what we’ve both learned through our years of working together.

4.4 Gabriel Vigliensoni's experiments with IML for realtime control of generative models

Much of the excitement around ML in the last decade has centred on generative ML algorithms, those capable of outputting not just a label (like classification) or number (like regression), but media content itself, such as images, video, or music. In a live musical performance context, a key question is how to exercise effective musical control over a generative model. My collaborator Gabriel Vigliensoni has been exploring how IML (using both Wekinator and Flucoma) can be used to build instruments that allow gestural control over generative models in realtime (Vigliensoni and Fiebrink 2023). The following short video demonstrates how this is achieved:

Using IML to build an instrument controlling the latent space of a generative model, specifically RAVE by Caillon and Esling (2021).


The following videos show Vigliensoni using this technique in performance:

Vigliensoni performing with dedosmuertos, in Paris in 2022.


Vigliensoni performing at CMMAS in 2022.

5. Musical and creative benefits of IML

When I ask others why they’ve chosen to use Wekinator in their work, and when I consider why I’ve chosen to use it in my own instrument building, the answers reveal some exciting musical and creative benefits. For instance:

5.1 ML makes working with sensors and data easier

As illustrated in the videos above, ML can make it much easier to work with high-dimensional and/or noisy signals, such as those that come from sensors, audio, and video. Making good mappings for From the Waters, MARtLET, or Spring Spyre by writing code by hand would be incredibly difficult and time consuming, and perhaps impossible. Supervised learning algorithms are designed to be able to infer functional relationships between inputs and outputs in a training set, even in the presence of noise, and they can often do this faster and more accurately than a person.

5.2 Data and ML are better than math and code for communicating embodied practices to a computer

A programming language is one “interface” through which people can communicate to a computer how they would like it to behave. A training dataset containing examples of what a computer should do in response to particular human actions is an alternative “interface” for accomplishing this, and one that is arguably more natural for many contexts. When a violin teacher instructs a student how to move their bow arm, or a conductor instructs an orchestra about the sound they would like to get, they do not communicate these things through mathematical functions or programming code; often, they don’t even use language. Rather, they typically use demonstrations of movement and sound—these are often the most natural ways for us to communicate about music to other people. ML allows us to use these modalities to communicate to a computer.

This capability can be key for instrument builders, as illustrated by Michelle Nagai’s statement: “I have never before been able to work with a musical interface … that allowed me to really ‘feel’ the music as I was playing it and developing it. The Wekinator allowed me to approach composing with electronics and the computer more in the way I might if I was writing a piece for cello, where I would actually sit down with a cello and try things out” ([qtd. in] Fiebrink 2011, p. 258).

5.3 IML can allow creators to prototype and explore many ideas

Prototyping—the instantiation and exploration of many ideas, before settling on a final design—is important in any creative practice. Creative domains including music composition and instrument building can be seen as entailing “wicked design problems” (Dahl 2012) wherein creators must simultaneously navigate many unknowns. Experimentation with prototypes helps creators understand more deeply what they are trying to make, as well as the practicalities of how to make it.

For many Wekinator users, ML is valuable in that it allows them to create and try mappings more quickly than making mappings by hand. This means that they are able to experiment with a wider variety of mapping ideas in the time they have available, potentially allowing them to identify an instrument design that is ultimately more satisfying.

5.4 Designing with data can allow more people to become creators

ML allows a creator to communicate through examples what actions they’d like a musician to make, and what sounds they’d like a computer to make in response. This means that it is no longer necessary to write programming code (at least for designing the mapping), potentially opening up the instrument design process to people who lack programming and other technical expertise.

One project that explored this possibility was Sound Control (Figure 6; Parke-Wolfe et al. 2019), a collaboration with music teachers and therapists working with children with a large variety of disabilities. We built a standalone software tool which allows anyone to select an input modality (e.g., webcam colour tracking, microphone input, GameTrak) and a sound-making module (e.g., looper, FM synthesis, sample mixer) from drop-down menus. They can then demonstrate a few examples of how actions captured with those inputs should relate to sounds, and a Wekinator-style IML process trains a model and produces a playable instrument. This project was particularly exciting because not only could teachers and therapists who knew nothing about ML or programming make completely new instruments for the children they worked with, but they also explored a much wider range of musical interactions than if they had depended on a programmer to implement their initial ideas of the sorts of instruments they thought they wanted to make.


Figure 6: A music teacher and music therapist using Sound Control to make custom interfaces for a child.

5.5 IML allows new musical outcomes, and new creative relationships between people and machines

Instruments like Spring Spyre would arguably be impossible to make without the use of ML; the mappings involved are just too complex for a human programmer to manage. Other instruments, perhaps like those in From the Waters, might feasibly be created with some painstaking programming without ML; however, the question arises whether an instrument builder would commit the time and effort needed to build the desired mapping, or whether they might just decide to build something simpler. Undoubtedly, IML enables people to make instruments and sounds they would not make otherwise.

Furthermore, using IML can arguably change one’s creative process, compared to writing mappings by hand. I’ve already discussed above how designing from demonstrated examples can allow instrument makers to take a more embodied approach to design, and focusing on movement and sound rather than programming can make the creation process more enjoyable. But using IML can also introduce fruitful surprises into the design process, particularly when using regression: because regression models are capable of producing output values that have not been present in the training set, giving these models inputs that are unlike those in the training set can lead to unexpected behaviours. In instrument building, these unexpected behaviours usually arise in the form of new sounds—some of which may be undesired, but some of which may be compelling, and which an instrument builder may want to reinforce in their instrument by adding new training examples that include those sounds. This is very different from the “surprises” one usually encounters during programming (e.g., compiler errors, no sound happening, things just not working).

6. Moving forward with IML


6.1 IML is not always the right approach

IML as described above can be an ideal approach to building instruments and other real-time interactions under certain circumstances:

  • You understand the behaviour you want the model to take on, and you are capable of providing training examples that illustrate that behaviour, evaluating whether a model is satisfactory, and identifying model mistakes.
  • The representation of the input data (e.g., numbers used to represent sensor values) does not have too complicated a relationship with the output. (Of course, “too complicated” is hard to define, but something like feeding in a raw audio waveform as an input will certainly be too complicated. In general, the more complex this relationship, and the greater the number of input dimensions, the more training examples may be needed and the more sophisticated the learning algorithm may need to be.)

In these circumstances, IML allows you to craft training sets that accurately communicate your intention, and you are able to build suitable models from a relatively small number of examples (e.g., dozens, or maybe hundreds of examples) without taking too much time to source examples or wait for training.

On the other hand, IML may not be a good fit when:

  • You already have a training set which you are confident represents the patterns between inputs and outputs very well, and this is unlikely to be improved.
  • You want to build a model that accommodates a diverse variety of users, e.g. doing accurate gesture recognition on a wide variety of body types. (In this case, using training data only from yourself may give you a model that fails unexpectedly for other users, and you should probably spend time collecting a substantial dataset from diverse users.)
  • You suspect you may have a more complicated relationship between inputs and outputs than can be supported by simple (fast) learning algorithms applied to small datasets (e.g., because you’ve tried IML and it hasn’t worked).

In these cases, you may be better off using a more conventional approach to ML, with tooling (e.g., Python libraries and offline experimentation) that focuses your effort on finding the best “feature representation” and learning algorithm configuration for a particular dataset.

6.2 Resources for using IML and learning more


If you are interested in using IML yourself, here are some tools you might find helpful:

  • The Wekinator website contains free downloads, sample code to connect Wekinator to many sensors and music environments, and tutorials to get you started.
  • The Flucoma project includes a number of ML tools in Max/MSP, SuperCollider, and PureData.
  • Teachable Machine by Google uses an approach to interactive machine learning based on Wekinator to allow anyone to build simple image, audio, and pose classifier models in the browser.
  • ml5.js is a set of JavaScript machine learning libraries that is compatible with the very popular P5.js environment for programming animations and interactive visuals. It also allows you to load and interact with models you’ve made using Teachable Machine.
  • MIMIC is a web-based creative coding platform that I worked on, with colleagues from the Creative Computing Institute, University of Sussex, and Durham University. It contains high-level libraries and many examples for using IML and generative ML in the browser.
  • InteractML is an IML platform for the Unity and Unreal game engines, created in partnership with Phoenix Perry and other collaborators.

If you want to learn more about IML for musical instrument building or other creative pursuits, these are the resources I recommend:

References

Bencina, Ross. “The metasurface: applying natural neighbour interpolation to two-to-many mapping.” In Proceedings of the 2005 Conference on New Interfaces for Musical Expression, pp. 101-104. 2005.

Bevilacqua, Frédéric, Rémy Müller, and Norbert Schnell. “MnM: a Max/MSP mapping toolbox.” In Proceedings of the International Conference on New Interfaces for Musical Expression, pp. 85-88. 2005.

Bongers, Bert. “Physical interfaces in the electronic arts.” Trends in gestural control of music (2000): 41-70.

Breiman, Leo. “Random forests.” Machine learning 45 (2001): 5-32.

Caillon, Antoine, and Philippe Esling. “RAVE: A variational autoencoder for fast and high-quality neural audio synthesis.” arXiv preprint arXiv:2111.05011 (2021).

Dahl, Luke. “Wicked problems and design considerations in composing for laptop orchestra.” In Proceedings of the International Conference on New Interfaces for Musical Expression. 2012.

Fails, Jerry Alan, and Dan R. Olsen Jr. “Interactive machine learning.” In Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 39-45. 2003.

Fiebrink, Rebecca. Real-time human interaction with supervised learning algorithms for music composition and performance. PhD Thesis. Princeton University, 2011.

Fiebrink, Rebecca, and Laetitia Sonami. “Reflections on eight years of instrument creation with machine learning.” In Proceedings of the International Conference on New Interfaces for Musical Expression. 2020.

Fiebrink, Rebecca, Daniel Trueman, and Perry R. Cook. “A meta-instrument for interactive, on-the-fly machine learning.” In Proceedings of the International Conference on New Interfaces for Musical Expression. 2009.

Hunt, Andy, Marcelo M. Wanderley, and Matthew Paradis. “The importance of parameter mapping in electronic instrument design.” Journal of New Music Research 32, no. 4 (2003): 429-440.

Lee, Michael, Adrian Freed, and David Wessel. “Real-time neural network processing of gestural and acoustic signals.” In Proceedings of the International Computer Music Conference, pp. 277-277. 1991.

Mathews, Max V. “The radio baton and conductor program, or: Pitch, the most important and least expressive part of music.” Computer Music Journal 15, no. 4 (1991): 37-46.

Parke-Wolfe, Samuel Thompson, Hugo Scurto, and Rebecca Fiebrink. “Sound control: Supporting custom musical interface design for children with disabilities.” In Proceedings of the International Conference on New Interfaces for Musical Expression. 2019.

Stiefel, Van, Dan Trueman, and Perry Cook. “Re-coupling: the uBlotar Synthesis Instrument and the sHowl Speaker-feedback Controller.” In Proceedings of the International Computer Music Conference. 2004.

Vigliensoni, Gabriel, and Rebecca Fiebrink. “Steering latent audio models through interactive machine learning.” In Proceedings of the International Conference on Computational Creativity. 2023.