Sound Gesture Intelligence

Dr. Greg Beller

Gesture recognition and following

The mapping-by-demonstration approach

The mapping-by-demonstration approach involves an interaction loop with two phases: a training phase, in which the user defines mappings, and a performance phase, in which the user controls sound synthesis through the previously defined mappings. The XMM library is designed to make this interaction loop transparent to the user, without requiring expert knowledge of machine learning algorithms and methods.

Workflow of a Mapping-by-Demonstration System. Source: Françoise 2014.
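
To make the two phases concrete outside of Max, here is a minimal Python sketch of the loop. It uses hmmlearn's GaussianHMM as a stand-in for the models provided by mubu.xmm (an assumption made purely for illustration, not the library used in the patches): the training phase fits one model per demonstrated gesture, and the performance phase scores new input against every model.

    # Sketch of the mapping-by-demonstration loop: one HMM per demonstrated
    # gesture class. hmmlearn stands in for mubu.xmm / XMM here, as an
    # illustrative assumption rather than the tool used in the course patches.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_models(demonstrations, n_states=8):
        """Training phase: fit one HMM per gesture label.
        `demonstrations` maps a label to a list of (frames x features) arrays."""
        models = {}
        for label, phrases in demonstrations.items():
            X = np.vstack(phrases)                   # concatenate example phrases
            lengths = [len(p) for p in phrases]      # phrase boundaries for hmmlearn
            m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
            m.fit(X, lengths)
            models[label] = m
        return models

    def recognize(models, gesture):
        """Performance phase: return the label whose model best explains the input."""
        scores = {label: m.score(gesture) for label, m in models.items()}
        return max(scores, key=scores.get), scores

In the exercises below, mubu.xmm handles both phases directly inside Max.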


In the following examples, the Multimodal Hierarchical Hidden Markov Model algorithm from the XMM library simultaneously enables gesture recognition and gesture following.


Multimodal Hierarchical Hidden Markov Model for real-time gesture recognition and tracking (Françoise 2014).
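
The hierarchical multimodal model itself is beyond a short example, but its two outputs, which gesture is being performed and how far along it is, can be imitated with ordinary left-to-right HMMs: the per-class likelihood answers the recognition question, and the posterior over the left-to-right states gives a rough time position. The sketch below assumes the training code above, with each model built by left_to_right_hmm() before fitting; re-scoring the whole buffer at every frame is a shortcut, whereas XMM decodes incrementally in real time.

    # Recognition + following with plain left-to-right HMMs, a simplified
    # stand-in for the Multimodal Hierarchical HMM. Assumes train_models()
    # from the previous sketch, changed to build each model with
    # left_to_right_hmm() before calling fit().
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def left_to_right_hmm(n_states=8):
        """HMM whose states can only stay or advance, so the decoded state
        index tracks progress through the recorded gesture."""
        m = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, init_params="mc", params="mct")
        m.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]
        T = np.zeros((n_states, n_states))
        for i in range(n_states - 1):
            T[i, i] = T[i, i + 1] = 0.5              # stay, or move one state forward
        T[-1, -1] = 1.0                              # absorbing final state
        m.transmat_ = T
        return m

    def follow(models, frames_so_far):
        """Per-frame decoding: re-score the growing buffer against every class."""
        results = {}
        for label, m in models.items():
            loglik = m.score(frames_so_far)                      # recognition evidence
            posterior = m.predict_proba(frames_so_far)[-1]       # state posterior now
            progress = (posterior @ np.arange(m.n_components)) / (m.n_components - 1)
            results[label] = (loglik, progress)
        best = max(results, key=lambda k: results[k][0])
        return best, results[best][1]                # recognized gesture, progress in [0, 1]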


Wired Gestures dynamically links voice to gesture in an artificial way. The machine simultaneously records a vocal gesture and a manual gesture [19] and learns the temporal relationship between the two. It then reproduces the voice when the performer repeats the same gesture. The nuances of timing in the gesture are heard as prosodic variations of the voice, and it becomes possible to break down expressivity.

Wired Gestures with Richard Dubelski @ Scene 44, Marseille, France, January 2014.
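
Reusing follow() from the sketch above, the core of this idea fits in a few lines: assuming the voice was recorded in sync with the demonstrated gesture, the follower's progress estimate drives the read position in the voice buffer, so that timing nuances in the gesture become timing, and therefore prosodic, variations in the replayed voice. Actual playback (for example granular or phase-vocoder resynthesis) is left out; the sketch only computes a read position for each incoming motion frame.

    # Gesture-driven voice playback position: a sketch of the Wired Gestures
    # idea, reusing follow() from the sketch above. Resynthesis is omitted.
    import numpy as np

    def voice_read_positions(models, gesture_stream, voice_num_samples):
        """For each incoming motion frame, map the follower's progress to a
        sample index in the voice recorded during the demonstration."""
        positions, frames = [], []
        for frame in gesture_stream:                 # motion features, frame by frame
            frames.append(frame)
            _, progress = follow(models, np.vstack(frames))
            positions.append(int(progress * (voice_num_samples - 1)))
        return positions                             # feed these to a granular player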

Exercise – Wired Gestures

Using motion data from the smartphone (CoMo.te) and the mubu.xmm object, create a patch to reproduce the Wired Gestures instrument.

Gesture Scape jointly records voice, gesture and video. New gestures then reactivate this memory: the performer animates the video by reproducing the same gesture, or by repeating the associated sound.

Gesture Scape with Jean-Charles Gaume @ Scene 44, Marseille, France, April 2014.
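
The same read-position idea extends from the voice buffer to the recorded video: the follower's progress selects a frame, so re-performing the gesture scrubs through the clip. A minimal sketch, using OpenCV as an assumed stand-in for the video side of the patch (the file name is hypothetical):

    # Progress -> video frame: a sketch of the Gesture Scape idea.
    import cv2

    def show_frame_for_progress(cap, progress):
        """Scrub the recorded clip to the position given by the follower."""
        num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(progress * (num_frames - 1)))
        ok, frame = cap.read()
        if ok:
            cv2.imshow("gesture scape", frame)
            cv2.waitKey(1)

    cap = cv2.VideoCapture("demonstration.mp4")      # hypothetical joint recording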

Exercise – Gesture Scape

Let's add our webcam to the setup. Create a patch to reproduce the Gesture Scape instrument.

Vocal gesture only

While motion data can be derived from hand tracking, it is also possible to track vocal gestures by modeling them with Mel-Frequency Cepstral Coefficients (MFCCs). In this way, we can track and recognize vocal gestures, whether spoken or sung, without using any motion data.

Exercise – Syllable recognition and tracking

Record 3 syllables ("ba," "mo" and "te," for instance) and use the mubu.xmm object to recognize and follow them based on their MFCC description only.
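
For an offline prototype of this exercise outside of Max, librosa can extract the MFCC frames and the training and recognition functions from the first sketch can model the three syllables. This is only an assumed equivalent, not the mubu.xmm workflow itself, and the file names below are hypothetical.

    # MFCC-based syllable models: librosa for the description, the HMMs from
    # the first sketch for recognition (an illustrative stand-in for the MFCC
    # analysis and mubu.xmm inside Max).
    import librosa

    def mfcc_frames(path, n_mfcc=13):
        """Load a recording and return its MFCC frames as (n_frames, n_mfcc)."""
        y, sr = librosa.load(path, sr=None, mono=True)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    # Training phase: one demonstration per syllable (hypothetical file names).
    demos = {syl: [mfcc_frames(f"{syl}.wav")] for syl in ("ba", "mo", "te")}
    models = train_models(demos, n_states=6)

    # Performance phase: recognize a new utterance from its MFCCs alone.
    label, scores = recognize(models, mfcc_frames("test.wav"))
    print("recognized:", label)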