Interactive Machine Learning for Music

Prof. Rebecca Fiebrink

3. How can we support instrument designers in using ML in practice?

3.1.2 How should we evaluate whether a model is good, or better than another?

Conventionally, a machine learning model is evaluated on its ability to “generalise”, that is, to compute appropriate outputs for inputs it has not seen in the training set. The simplest way to estimate this is to reserve some examples (of inputs paired with outputs), putting them into a “test set” rather than the “training set.” Generalisation accuracy can then be estimated by measuring how closely the model’s outputs match the desired outputs for the test set examples (e.g., for how many test examples does a classifier produce the correct label?). There are a number of variations on this, but these approaches will give you some simple metrics that you can use to compare one model against another, or to decide whether a model is good enough to use in practice.

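To make the hold-out idea concrete, here is a minimal sketch (not part of the course materials) of estimating generalisation accuracy with a reserved test set, written with scikit-learn. The data here is randomly generated stand-in data rather than real gesture recordings, and the feature count, class count, split size, and choice of a k-nearest-neighbour classifier are all illustrative assumptions.

```python
# A minimal sketch of hold-out evaluation: train on one subset of examples,
# estimate generalisation accuracy on examples the model has never seen.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: 200 examples of 6 sensor features, each labelled with one
# of 4 gesture classes. In practice these would come from your own recordings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 4, size=200)

# Reserve 25% of the examples as a test set that the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An illustrative classifier choice; any classifier could stand in here.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Estimated generalisation accuracy: the fraction of test examples labelled correctly.
print("Estimated accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Because the stand-in data is random, the printed accuracy will hover around chance; with real gesture features the same procedure yields the kind of single-number accuracy estimate discussed above.
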
But what does it mean if my DMI hand gesture classifier is 90% accurate? Will it make mistakes on gesture variations I can learn to avoid in performance, or on gestures that are crucial to the opening bars of my piece of music? How could a number possibly tell me if my instrument is something I can play comfortably, or repeatably, or musically? Numbers may have a place in some applications of ML to instrument building (e.g., I probably would prefer a 99%-accurate classifier to a 40%-accurate one), but an instrument maker cannot form a full opinion of whether a model is suitable without actually playing with it!