
Sound Gesture Intelligence

Website: Hamburg Open Online University
Course: MUTOR: Artificial Intelligence for Music and Multimedia
Book: Sound Gesture Intelligence

Description

Dr. Greg Beller

Preface

About the author

Greg Beller works as an artist, researcher, teacher and computer designer for the contemporary arts. Founder of the Synekine project, he invents new musical instruments combining sound and movement, which he uses in comprovisation situations with various performers or in computer-assisted composition, notably in his opera The Fault. At the Ligeti Center, while preparing a second doctorate on “Natural Interfaces for Computer Music”, he is a research assistant in the innovation-lab and teaches in the Multimedia Composition department at Hamburg’s HfMT University for Music and Drama. At the nexus of the arts and sciences at IRCAM, he has successively been a doctoral student working on generative models of expressivity and their applications to speech and music, a computer-aided music designer, director of the Research/Creation Interfaces department and product manager of the IRCAM Forum.

This unit

In this unit, you will discover some links between voice and gesture. You will use this natural proximity to create new intuitive musical instruments that allow you to manipulate sound with your hands. The structure of this unit will enable you to tackle, with increasing complexity, the notion of machine learning to model the temporal relationship between multimodal data. Various physical gesture sensors are introduced and compared. Libraries for gesture processing and machine learning are presented. Their uses are demonstrated in different artistic contexts for pedagogical purposes.

The exercises enable you to make them your own, and incorporate them into your artist's studio.


Introduction

For the past ten years, through residencies and artistic creations, the Synekine Project has invited performers to question the intimate relationship between vocal gestures and manual gestures through the manipulation of scenic devices based on new technologies. Metaphorically, the pre-eminent neuromotor link between voice and gesture is closed by “creative prostheses” joining the capture of movement to the transformation of the voice by artificial intelligence. Linking space and time through movement then transforms the search for sound into a scenic exploration. This unit provides a retrospective panorama of this experimental research in a narrative oscillating between theoretical questions, technological innovations and artistic productions.

What is a gesture?

A gesture is a form of communication in which bodily actions communicate particular messages [1].

Manual gestures are most commonly broken down into four distinct categories:

  • Symbolic (Emblematic): a handwave used for "hello."
  • Deictic (Indexical): a pointing finger showing “this” or “that.”
  • Motor (Beat): accompanying the prosody of speech.
  • Lexical (Iconic): a gesture that depicts the act of throwing may be synchronous with the utterance, "He threw the ball right into the window."

Speech can be described as audible movements, a series of vocal gestures (Löfqvist 1990). By varying the positions and trajectories of the lips, the jaw, the tongue, the velum and the glottis, a speaker creates variations in air pressure and airflow in the vocal tract. These variations in pressure and flow produce the acoustic signal that we hear when listening to speech. “Speech is rather a set of movements made audible than a set of sounds produced by movements” (Stetson 1905).

Language is thought by some scholars to have evolved in Homo sapiens from an earlier system consisting of manual gestures (Corballis 2005).


A neurological link

Indeed, gesture processing takes place in areas of the brain, such as Broca's and Wernicke's areas, that are also used by speech and sign language (Xu 2009). The faculty of speaking with the hands appears to result not only from cultural origins but also from a deep neuronal relation connecting speech to hand gestures (Iverson 1998). There is ample evidence for the ubiquitous link between manual motor and speech systems, in infant development, in deictic pointing, and in repetitive tapping and speaking tasks (Parrel 2014).


Exercise: clapping while speaking

For example, you can break your speech into syllables without even thinking about it, simply by clapping your hands as you speak. You don't need any special training, simply because the motor control of your vocal apparatus is neurologically linked with the motor control of your hands.

The Synekine project

By analogy to synesthesia, a phenomenon by which two or more senses of perception are associated, the neologism “synekinesia” would reflect our ability to associate two or more motor senses. In the Synekine project, the natural link between voice and gesture is enhanced by "creative prostheses". Technology allows the transformation of the voice by capturing the gesture. The voice is conventionally captured by a microphone, while the positions and dynamics of the hands are informed by different motion capture sensors.

Between the two, different computer programs based on artificial intelligence allow the direct manipulation of sound with the hands. The performer can record and arrange her or his voice in space, sample and play vocal percussion, and design a complete sound stage. This technology makes it possible to establish a link between voice and gesture, between sound and movement, between time and space. This link will be referred to as mapping in the rest of this unit. The following table gives a chronological overview of the work carried out as part of the Synekine project, the different instruments developed and a list of the various technical devices used.


Table: Chronology of the works realized within the Synekine project, presenting the instruments developed and the technical devices used.


Linking space and time through movement then transforms the search for sound into a scenic exploration. These instruments have been used to create performances and installations at the crossroads of theater, dance and music.

Direct mapping between voice and gesture

During my thesis on generative models of expressivity and their applications for speech and music, an artificial intelligence algorithm based on a corpus of expressive sentences allowed me to generate an “emotional” speech by modulating the prosody of a “neutral” utterance (Beller 2009a, Beller 2009b, Beller 2010). Several times during the development of this vocal emotion synthesizer I felt the desire to control prosody by gesture. After all, gesture seems to naturally accompany speech, so why not the other way around?

At the same time, another research team at IRCAM was developing one of the first instrumental gesture sensors allowing the measurement of the dynamics of a bow by integrating small accelerometers and gyroscopes (6 Dof, degrees of freedom). The data related to movement was transmitted in real time by WiFi to a computer which triggered sounds and modulated effects according to the dynamics of the gesture (Bevilacqua 2006).

In the rest of this unit, we will show which sensors to use to obtain hand dynamics data, and how to transform this data into sound triggering parameters to create an aerial percussion instrument.

Dynamics of the gesture

[This page is under construction. –Editor]

To play aerial percussion, you will need a controller embedding dynamic sensors such as accelerometers or gyroscopes: R-IoT accelerometers, Nintendo Wii controller, Genki Wave Ring or your smartphone.

Controllers embedding dynamic sensors such as accelerometers or gyroscopes. Top left, an XBee sensor held between thumb and forefinger; bottom left, an inside view of an R-IoT sensor; top right, a Genki Wave ring; bottom right, an inside view of a Nintendo Wii remote controller (Tinoco 2016).

Depending on the sensor used, you'll have access to 6 or 9 DoF (degrees of freedom).


Nine DoF (degrees of freedom) sensor made of a 3-axis accelerometer (left), 3-axis gyroscope (centre) and 3-axis magnetometer (right).


Your smartphone, tablet or smartwatch contains these sensors, and you can use them to play aerial percussion. The easiest way to do this is to use the CoMo.te app, available for Android or iOS, in conjunction with the CoMo.te Max package.


CoMo.te app, available for Android or iOS, in conjunction with the CoMo.te Max package.


Aerial percussion

Once the data from your smartphone's accelerometers and gyroscopes has been received in your Max patch, it needs to be processed to transform it into sound-triggering events. To do this, we use the Gestural Sound Toolkit v2 (Caramiaux 2015).


Gestural Sound Toolkit v2 – Toolkit in Max/MSP for gesture-to-sound scenario prototyping (Caramiaux 2015).


With this combination of an application and two packages, you can trigger sounds in Max from your smartphone.
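Outside Max, the same idea can be sketched in a few lines of Python: receive the accelerometer frames over OSC and emit a trigger whenever the acceleration magnitude crosses a threshold. This is only a minimal sketch using the python-osc library; the OSC address "/accel" and the port are assumptions to be adapted to whatever your sender actually transmits.

```python
# Minimal sketch: turn incoming accelerometer frames into "kick" trigger events.
# Assumptions: frames arrive as three floats (x, y, z, in g) on the OSC address
# "/accel" at port 8000 -- adapt both to your actual sender.
import math
import time
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

THRESHOLD = 2.5        # acceleration magnitude (in g) above which we call it a kick
REFRACTORY = 0.15      # seconds to wait before allowing the next trigger
last_trigger = 0.0

def on_accel(address, x, y, z):
    """Handle one accelerometer frame and print a trigger on strong peaks."""
    global last_trigger
    magnitude = math.sqrt(x * x + y * y + z * z)
    now = time.time()
    if magnitude > THRESHOLD and (now - last_trigger) > REFRACTORY:
        last_trigger = now
        print("kick! -> trigger a sound here")

dispatcher = Dispatcher()
dispatcher.map("/accel", on_accel)
server = BlockingOSCUDPServer(("0.0.0.0", 8000), dispatcher)
server.serve_forever()
```

In a Max-based setup, the Gestural Sound Toolkit performs this detection for you; the sketch only makes the underlying thresholding logic explicit.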


Exercise: The Skatphone

Build a Max patch that triggers vocal sounds with a smartphone using the CoMo.te app, the CoMo.te Max package and the Gestural Sound Toolkit v2.

Air Theremin

While gyroscope and magnetometer data are static and can therefore be used to control sound processes continuously (a value can be mapped directly to a slider), this is not the case for accelerometer data, which only take on values when the sensor is moved. However, with the sensor at rest, a constant, non-zero value appears on the accelerometer's Z axis, due to terrestrial gravitation. In the absence of a magnetometer, this gravity offset, projected onto the X, Y and Z axes, can be used to estimate static rotation angles, much as a magnetometer would. At the beginning of Babil-on, this acceleration offset due to gravity, projected onto the vertical axis of the hand, provides a continuous controller for a sound effect (in this case, the transposition and gain of a voice sample).

Babil-on, with Richard Dubelski, CMMR2013, Marseille France, October 2013.
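As a complement to the description above, here is a hedged sketch of how the gravity offset on a resting accelerometer can be turned into static tilt angles and then into a continuous control value. The axis conventions and parameter ranges are assumptions for illustration, not the patch used in Babil-on.

```python
# Sketch: estimating static tilt from the gravity offset of a resting
# accelerometer, and mapping it to a continuous control value.
import math

def tilt_angles(ax, ay, az):
    """Return (pitch, roll) in degrees from one accelerometer frame (in g)."""
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll

def map_range(value, in_lo, in_hi, out_lo, out_hi):
    """Linearly rescale value from [in_lo, in_hi] to [out_lo, out_hi], clipped."""
    t = max(0.0, min(1.0, (value - in_lo) / (in_hi - in_lo)))
    return out_lo + t * (out_hi - out_lo)

# Example frame: hand roughly flat -> az close to 1 g, angles near zero.
pitch, roll = tilt_angles(0.05, -0.02, 0.98)
transposition = map_range(pitch, -90.0, 90.0, -12.0, 12.0)  # semitones (assumed range)
gain = map_range(roll, -90.0, 90.0, 0.0, 1.0)               # linear gain (assumed range)
print(pitch, roll, transposition, gain)
```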

SpokHands

In Luna Park, a musical theater work by Georges Aperghis, I integrated dynamic sensors into gloves and developed a first instrument called SpokHands, which literally made it possible to speak with the hands (Beller 2011a, b, c, d). SpokHands allows the triggering and modulation of voice samples by aerial percussion and hand elevation. Like a vocal Theremin, SpokHands offers the performer the option of three-voice polyphony (her/his own voice and both hands) or control of text-to-speech parameters. In this case, the natural division of a conductor's brain is used: the left brain (right hand) for the segmental part, and the right brain (left hand) for expressivity. The percussive gestures of the right hand trigger pre-selected syllables whose pitch and intensity are modulated by continuous gestures of the left hand.

Rehearsal video of Luna Park, 2010, Georges Aperghis, showing Richard Dubelski discovering SpokHands (Beller 2011a, b, c, d) @ IRCAM, Paris, France.

Probabilistic mapping

In this case, the mapping is direct between the detection of a movement pulse or "kick" and the triggering of a sound segment belonging to a buffer, which can be selected:

  • incrementally within the original sequence (to preserve text, for example),
  • inversely or palindromically to the original sequence,
  • probabilistically, by randomly selecting the index of the segment to be played:
    • among all segments each time ([random] in Max)
    • among segments not yet selected ([urn] in Max)
    • among segments according to their probability of appearance ([hist] and [proba] in Max)

In Luna Park, a probabilistic segment/text generation interface was used (see the following figure). The probability distribution (bottom) used to draw the next segment can be manually edited. The initial distribution (middle) corresponds to the frequency histogram of each labeled segment within a buffer (top). In this case, we can speak of a probabilistic mapping paradigm.


The figure shows a recording segmented into syllables (top), a histogram presenting the frequency of appearance of each symbol (middle), and the same histogram edited to guide generation (bottom), for example by forcing equal appearance of the syllables “ni”, “naR” and “noR”.
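The three selection strategies listed above can be sketched outside Max in plain Python. The segment labels and the editable weights below are invented for illustration; the Max objects mentioned earlier are only approximated here.

```python
# Sketch of incremental, palindromic and probabilistic segment selection.
import random

segments = ["ni", "naR", "noR", "ba", "te"]   # labels of segments in a buffer (made up)
weights  = [3, 1, 1, 2, 2]                    # editable "histogram" of appearance

def incremental(i):
    """Next index in the original order (preserves the text)."""
    return (i + 1) % len(segments)

def palindromic(i, direction):
    """Bounce back and forth over the original sequence."""
    j = i + direction
    if j < 0 or j >= len(segments):
        direction = -direction
        j = i + direction
    return j, direction

def probabilistic():
    """Draw an index according to the probability of appearance of each segment."""
    return random.choices(range(len(segments)), weights=weights, k=1)[0]

i, direction = 0, 1
for _ in range(8):
    i, direction = palindromic(i, direction)
    print("palindromic:", segments[i], " probabilistic:", segments[probabilistic()])
```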

Hand Sampling

Now that we can trigger speech sounds such as syllables, let's record and segment them in real time using a button (one in the CoMo.te app, for instance). In Babil-on V2, a pair of button-rings was added to the sensor-gloves, allowing voice samples to be picked up and erased on the fly. Thus, SpokHands and the triggering of pre-made sounds evolved into Hand Sampling, in which the vocal flow is cut up and recombined through percussive motor gestures.

Hand Sampling allows the performer to cut her/his voice in real time and recombine it immediately through gesture. It involves percussive gestures that segment and trigger vocal fragments. The length of these fragments can vary from a syllable to a sentence. The order of the replayed segments can be sequential, random or palindromic, which allows different playing modes. In addition, the quality of the gesture influences the quality of the perceived sound, making the instrument expressive.

Hand Sampling, with Richard Dubelski @ Scene44, Marseille, France, January 2014.

While accelerometric sensors are ideal for applications linked to dynamic motor gestures (beat), as their response time is very fast (on the order of milliseconds), they are unable to estimate static gestures and their positions in space (only angles), and therefore to process symbolic and deictic gestures. This requires the addition of video sensors that can operate in the visible or infrared range.

Hand Tracking

Various cameras can be used to estimate the absolute position of the joints in the skeleton of the hand. From a simple webcam, to a network of distributed infrared cameras, to depth-sensor cameras, here's an overview of the different sensors that enable you to track hands and fingers in space.


Accessing Theater: Theater und Multimedia – tools - presented by Dr. Greg Beller.

On some XR devices it is possible to get fully articulated information about the user’s hands when they are used as input sources. The WebXR Hand Input module expands the WebXR Device API with the functionality to track articulated hand poses. This API exposes the poses of each of the users' hand skeleton joints. This can be used to do gesture detection or to render a hand model in VR scenarios.

edu sharing object

WebXR Device API: Hand skeleton joints

Video-based tracking

If you only have a webcam available, you can use a solution based on image processing and AI such as n4m-handpose or mmhuman3d. This will work for slow, continuous control movements.


n4m-handpose: Wraps MediaPipe Handpose inside electron and serves the detected parts via MaxAPI.
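If you prefer working in Python rather than Electron, a comparable webcam hand-tracking loop can be sketched with MediaPipe Hands and OpenCV, forwarding for example the thumb-to-index distance to Max over OSC. This is not the n4m-handpose implementation, just a rough equivalent; the OSC address and port are assumptions.

```python
# Sketch: webcam hand tracking with MediaPipe, sending a pinch distance to Max via OSC.
import cv2
import mediapipe as mp
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7400)   # Max can listen with [udpreceive 7400]
hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        thumb, index = lm[4], lm[8]            # thumb tip and index fingertip landmarks
        pinch = ((thumb.x - index.x) ** 2 + (thumb.y - index.y) ** 2) ** 0.5
        client.send_message("/hand/pinch", float(pinch))   # hypothetical address
cap.release()
```

This runs at webcam rate, so it suits slow, continuous control rather than percussive triggering.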

Depth-sensing cameras

The fast capture of gesture dynamics by the accelerometer gloves can be complemented by the relatively slow capture of the absolute position of the hands in space using depth cameras. The body skeleton can be tracked using a Microsoft Kinect (Kean 2011), while the hand skeleton is tracked using a Leap Motion, for example. This makes it possible to obtain, in addition to the fine temporal precision of percussive-type triggering, continuous control of sound processes according to the posture and spatial position of the hands and fingers.

Microsoft Kinect

Leap Motion


On the other hand, the depth sensor brings other constraints, such as a reduced playing area, support for only a single performer, the need for a calibration and skeleton-detection phase, and the risk of infrared interference from stage lights.

LIDAR sensor and XR headset

The most advanced hand-tracking systems are now directly integrated into VR and XR headsets. In Unity or Unreal, you can use an XR SDK such as the WebXR framework, the XR Interaction Toolkit or the Meta XR Interaction SDK to obtain the joints of the hand skeletons. You can then send this data via OSC to Max to build a new instrument. Alternatively, using Graham Wakefield's Max VR package, you can easily integrate static position and dynamic acceleration data from the controllers of a Meta Quest or HTC Vive headset.
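Whatever SDK provides the joints, the "send via OSC to Max" step itself is simple. Here is a sketch with python-osc in which the address pattern, port and joint names are assumptions, and the XR SDK is only mocked by a hard-coded frame.

```python
# Sketch: one OSC message per hand joint, carrying its xyz position.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7400)   # Max can listen with [udpreceive 7400]

def send_hand(side, joints):
    """joints: dict mapping joint name -> (x, y, z) position in meters."""
    for name, (x, y, z) in joints.items():
        client.send_message(f"/hand/{side}/{name}", [float(x), float(y), float(z)])

# Fake frame standing in for data coming from the XR SDK:
send_hand("left", {"wrist": (0.1, 1.2, -0.3), "index_tip": (0.15, 1.25, -0.35)})
```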

In the Spatial Sampler XR, sounds are arranged in the surrounding space using a Meta Quest 2 thanks to a Max patch based on the VR package. In the same way that a sampler is an empty keyboard that is filled with sounds, Spatial Sampler XR uses hand tracking to transform the surrounding physical space into a key zone for indexing, placing and replaying samples. With Spatial Sampler XR, the musician spreads sound around him/her through gesture, creating a spatialized and interactive sound scene. The 3D immersion greatly facilitates the organization of the sounds and increases the precision of the interaction. Several playing modes are possible, the Sound Space, the Spatial Trigger and the Spatial Looper. The interaction modalities also vary according to the type of performance, in solo, duo or with several people. Movement links time (sound) and space. This makes Spatial Sampler XR suitable for movement artists as well, and for various applications.

The Spatial Sampler XR adds to the Sound Space the possibility to visualize sounds in mixed reality.


Infrared cameras

If you have fast cameras, as in the OptiTrack motion capture system (up to 120 fps), you can use the position of your hands (skeleton-tracking mode) or that of rigid objects you hold in your hand. You can use the custom "motion-tracking" middleware that sends OSC messages from Motive to Max.

The Fault - Introduction to Optitrack Motion Capture System @ Forum, HfMT Hamburg, Germany.

Absolute vs. relative mapping

Once we have the hand's positions in space, we can use these spatial coordinates in an absolute, physical frame of reference (that of the room, for example), or in a relative, perceptual frame of reference (in relation to the coordinates of the torso or head, for example). In the latter case, we are more concerned with the construction of an instrument attached to its user than to the room. The user can move around in space while playing the instrument, without the mapping between gesture and sound effect changing.
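A minimal sketch of this change of frame of reference, assuming a y-up coordinate system and a torso heading given as a yaw angle (both assumptions):

```python
# Sketch: express a hand position relative to the torso position and heading,
# so the gesture-to-sound mapping stays the same wherever the performer stands.
import numpy as np

def to_body_frame(hand_xyz, torso_xyz, torso_yaw_rad):
    """Return the hand position in a torso-centred, torso-oriented frame."""
    offset = np.asarray(hand_xyz) - np.asarray(torso_xyz)
    c, s = np.cos(-torso_yaw_rad), np.sin(-torso_yaw_rad)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return rot_y @ offset

# The same arm posture gives the same relative coordinates anywhere in the room:
print(to_body_frame([1.5, 1.4, 2.0], torso_xyz=[1.0, 1.3, 2.0], torso_yaw_rad=0.0))
```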

Intermediate physical mapping

The Body Choir uses the hand position to control a choir effect and the Hyper Ball to control a granular synthesizer. In both cases, a physical mapping paradigm between effect parameters and relative hand position is based on the manipulation of a virtual ball whose diameter is defined by the distance between the hands.
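A sketch of such an intermediate physical mapping: derive the virtual ball's centre and diameter from the two hand positions, then rescale them to effect parameters. The parameter names and ranges below are invented for illustration, not those of the Body Choir or Hyper Ball patches.

```python
# Sketch: a virtual ball between the hands as an intermediate mapping object.
import numpy as np

def virtual_ball(left_hand, right_hand):
    """Return the centre and diameter of a ball spanned by the two hands."""
    left, right = np.asarray(left_hand), np.asarray(right_hand)
    center = (left + right) / 2.0
    diameter = float(np.linalg.norm(right - left))
    return center, diameter

def ball_to_params(center, diameter):
    """Hypothetical mapping: ball height -> pitch, ball diameter -> grain density."""
    pitch = 48.0 + 24.0 * np.clip(center[1] / 2.0, 0.0, 1.0)     # MIDI note
    density = 5.0 + 95.0 * np.clip(diameter / 1.5, 0.0, 1.0)     # grains per second
    return pitch, density

center, diameter = virtual_ball([-0.3, 1.2, 0.4], [0.3, 1.5, 0.4])
print(ball_to_params(center, diameter))
```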

The Body Choir transforms a singer into a choir. This virtual choir accompanies the singer according to her/his gestures and the postures s/he adopts. Singing involves movement of the body. This movement is captured and used to magnify the singer's musical intentions. The posture of the body and the sung note modulate in real time the harmony, the number of voices, or the spatial density of the choir.


Body Choir - with Dalila Khatir @ IRCAM, Paris, France, the 21st of May 2014.

The Hyper Ball takes the form of a virtual sound ball, which the participant feeds with her/his voice and modulates with her/his gesture. The position, size and orientation of the ball influence the pitch, density and volume of the generated sound. This type of musical activity, by its very nature, gives rise to choreographic movements.

Hyper Ball, with Lenny Barouk, Jean-Pierre Drouet, Martin Seigneur, Stéfany Ganachaud and Richard Dubelski @ IRCAM, Paris, France, March 2014.

Exercise: Pinch controller

In Max, with n4m-handpose, create a voice-processing effect (reverb, delay or pitch-shifting) whose parameters are controlled by the distance between the index finger and thumb.

Gesture recognition and following

We have seen that direct mapping is possible between gesture data (in an absolute or relative frame of reference) and sound data, enabling, for example, the triggering of pre-recorded or real-time-recorded sounds, or the continuous control of sound effects. Hand tracking already provides a wealth of information for the creation of new instruments that can be voice-based.

Several mappings are possible, based on a direct relationship between the position and dynamics of the hands or fingers, a relationship mediated by the manipulation of virtual objects, or the addition of random processes at the heart of the hand-sound relationship.

For gestures, it is necessary to take into account temporality, as well as the relative position of the hand to the body. This is made possible by merging data from dynamic (temporally precise) and static (spatially precise) sensors.

Between vocal and manual gestures, various computer programs based on artificial intelligence enable the direct manipulation of sound by gesture, thanks to the learning of temporal relationships from merged data.

Machine learning: Regression vs. classification

Supervised machine learning algorithms are trained with labelled datasets and can perform two tasks: regression and classification. Regression finds correlations between variables; if the variable is time, regression can be used to follow a previously recorded gesture. Classification divides the dataset into classes; if several gestures have been recorded, modeled and labeled, classification enables recognition of a new gesture.


Regression and classification are both supervised learning algorithms: they are trained on labeled datasets and used for prediction in machine learning (source: simplilearn.com).
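A toy illustration of the two tasks in Python with scikit-learn, on purely synthetic gesture-like features: a classifier recognizes which of two recorded gestures a frame belongs to, and a regressor estimates how far along the gesture it is, i.e. follows it in time.

```python
# Toy example: classification (which gesture) vs. regression (where in the gesture).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
# Two fake gesture templates, 50 frames each, 3 features per frame:
t = np.linspace(0.0, 1.0, 50)
gesture_a = np.column_stack([np.sin(2 * np.pi * t), t, 0.2 * rng.standard_normal(50)])
gesture_b = np.column_stack([np.cos(2 * np.pi * t), -t, 0.2 * rng.standard_normal(50)])

X = np.vstack([gesture_a, gesture_b])
labels = np.array([0] * 50 + [1] * 50)     # which gesture (classification target)
progress = np.concatenate([t, t])          # position in time (regression target)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=5).fit(X, progress)

frame = gesture_a[30] + 0.05               # a new, slightly perturbed frame
print("recognized gesture:", clf.predict([frame])[0])
print("estimated progress:", round(float(reg.predict([frame])[0]), 2))
```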

Multimodal data

MuBu for Max is a toolbox for multimodal analysis of sound and motion, interactive sound synthesis and machine learning. It is distributed as a Max package, available in the Max Package Manager and on the IRCAM Forum. It includes:

  • real-time and batch data processing (sound descriptors, motion features, filtering);
  • granular, concatenative and additive synthesis;
  • data visualization;
  • static and temporal recognition;
  • regression and classification algorithms.

The MuBu multi-buffer is a container providing structured memory for real-time processing. The buffers of a MuBu container associate multiple tracks of aligned data into complex data structures such as:

  • segmented audio data with audio descriptors and annotations
  • annotated motion capture data
  • aligned audio and motion capture data

Each track of a MuBu buffer represents a regularly sampled data stream or a sequence of time-tagged events such as audio samples, audio descriptors, motion capture data, markers, segments and musical events.


Different data visualizations using imubu from the MuBu package.
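As a mental model only, and not MuBu's actual implementation, the multi-buffer idea of several aligned, time-tagged tracks in one container can be sketched like this:

```python
# Rough Python analogue of a multi-buffer: named tracks of time-tagged values.
from dataclasses import dataclass, field

@dataclass
class Track:
    name: str
    times: list          # time tags in seconds
    values: list         # one entry (sample, descriptor vector, marker...) per time tag

@dataclass
class MultiBuffer:
    tracks: dict = field(default_factory=dict)

    def add_track(self, track: Track):
        self.tracks[track.name] = track

    def slice(self, t0: float, t1: float):
        """Return, per track, the values whose time tags fall inside [t0, t1)."""
        return {name: [v for t, v in zip(tr.times, tr.values) if t0 <= t < t1]
                for name, tr in self.tracks.items()}

mb = MultiBuffer()
mb.add_track(Track("mfcc", times=[0.00, 0.01, 0.02], values=[[1.0], [0.9], [1.1]]))
mb.add_track(Track("markers", times=[0.015], values=["syllable: ba"]))
print(mb.slice(0.01, 0.02))
```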


Let's dive into the package to find the tools linked to tracking and gesture recognition.



MuBu overview available in the MuBu package.

The mapping-by-demonstration approach

The mapping-by-demonstration approach involves an interaction loop consisting of two phases: a training phase, in which the user defines mappings, and a performance phase, in which the user controls sound synthesis through the previously defined mappings. The XMM library is designed to make this interaction loop transparent to the user, without requiring expert knowledge of machine learning algorithms and methods.


Workflow of a Mapping-by-Demonstration System. Source: Françoise 2014.
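The interaction loop itself can be sketched with a deliberately simple stand-in for the model: record aligned (gesture, sound-parameter) examples during the training phase, then map new gesture frames to parameters during the performance phase. XMM uses probabilistic models rather than the nearest-neighbour regressor used here; this only shows the shape of the loop.

```python
# Simplified mapping-by-demonstration loop (nearest-neighbour stand-in for XMM).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

class MappingByDemonstration:
    def __init__(self):
        self.gestures, self.params = [], []
        self.model = None

    def demonstrate(self, gesture_frame, sound_params):
        """Training phase: store one aligned (gesture, sound) example."""
        self.gestures.append(gesture_frame)
        self.params.append(sound_params)

    def train(self):
        self.model = KNeighborsRegressor(n_neighbors=1).fit(
            np.asarray(self.gestures), np.asarray(self.params))

    def perform(self, gesture_frame):
        """Performance phase: map a new gesture frame to sound parameters."""
        return self.model.predict([gesture_frame])[0]

m = MappingByDemonstration()
m.demonstrate([0.0, 0.1], [440.0, 0.2])   # hand low  -> low pitch, quiet (made-up pairs)
m.demonstrate([0.0, 0.9], [880.0, 0.8])   # hand high -> high pitch, loud
m.train()
print(m.perform([0.0, 0.85]))             # close to the second demonstration
```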


In the following examples, the Multimodal Hierarchical Hidden Markov Model algorithm from the XMM library simultaneously enables gesture recognition and gesture following.



Multimodal Hierarchical Hidden Markov Model for real-time gesture recognition and tracking (Françoise 2014).
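For a rough idea of how HMM-based recognition and following work, here is a sketch with ordinary Gaussian HMMs from the hmmlearn library, on synthetic one-dimensional "gestures". XMM's multimodal hierarchical models add considerably more structure; this only illustrates the two outputs: which gesture, and roughly where we are inside it.

```python
# Sketch: one HMM per gesture template; recognition picks the best-scoring model,
# following reads off the most likely hidden state of that model.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def make_gesture(freq, n=60, seed=0):
    """Fake 1-D gesture template: a noisy sinusoid of a given frequency."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)
    return (np.sin(2 * np.pi * freq * t) + 0.05 * rng.standard_normal(n)).reshape(-1, 1)

templates = {"wave": make_gesture(1.0), "shake": make_gesture(4.0)}
models = {name: GaussianHMM(n_components=5, n_iter=50).fit(seq)
          for name, seq in templates.items()}

observed = make_gesture(1.0, seed=1)[:30]   # the first half of a new "wave" gesture

# Recognition: which model explains the observed prefix best?
best = max(models, key=lambda name: models[name].score(observed))
# Following: with a left-to-right topology the state index tracks progress;
# with this default topology it is only indicative.
state = models[best].predict(observed)[-1]
print("recognized:", best, "| current state:", state, "of", models[best].n_components - 1)
```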


Wired Gestures dynamically links voice to gesture in an artificial way. The machine simultaneously records a vocal gesture and a manual gesture [19]. It learns the temporal relationship between the two. It then reproduces the voice when the performer repeats the same gesture. The nuances of timing in the gesture are heard as prosodic variations of the voice, and it becomes possible to decompose expressivity.

Wired Gestures with Richard Dubelski @ Scene 44, Marseille, France, January 2014.

Exercise – Wired Gestures

Using motion data from the smartphone (CoMo.te) and the mubu.xmm object, create a patch to reproduce the wired gestures instrument.

Gesture Scape jointly records voice, gesture and video. New gestures then reactivate this memory: the performer animates the video by reproducing the same gesture or by repeating the associated sound.

Gesture Scape with Jean-Charles Gaume @ Scene 44, Marseille, France, April 2014.

Exercise – Gesture Scape

Let's add our webcam to the setup. Create a patch to reproduce the Gesture Scape instrument.

Vocal gesture only

While motion data can be derived from hand tracking, it is also possible to track vocal gestures by modeling them with Mel-frequency cepstral coefficients (MFCCs). In this way, we can track and recognize vocal gestures, spoken or sung, without using any motion data.
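A sketch of the feature-extraction step with librosa (the file name is a placeholder); the resulting MFCC frames can feed the same recognition and following models as motion data.

```python
# Sketch: extract an MFCC sequence from a recorded syllable.
import librosa

y, sr = librosa.load("syllable_ba.wav", sr=None, mono=True)   # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=256)
frames = mfcc.T   # shape (n_frames, 13): one 13-dimensional "vocal gesture" frame per hop
print(frames.shape)
```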

Exercise – Syllable recognition and tracking

Record 3 syllables ("ba," "mo" and "te," for instance) and use the mubu.xmm object to recognize and follow them based on their MFCC description only.

Conclusion

In this unit, you discovered some links between voice and gesture. You have used this natural proximity to create new intuitive musical instruments that allow you to manipulate sound with your hands. These instruments have enabled you to tackle, with increasing complexity, the notion of machine learning to model the temporal relationship between multimodal data. Various physical gesture sensors were introduced and compared. Libraries for gesture processing and machine learning were presented. Their uses have been demonstrated in different artistic contexts for pedagogical purposes. The exercises enable you to make them your own, and incorporate them into your artist's studio. Now it's your turn to create with these tools and make them even better!

References

[dAlessandro2011] C. d'Alessandro, A. Rilliard and S. LeBeux, "Chironomic stylization of intonation", JASA, 129(3):1594-1604, 2011.

[Beller2009a] Beller, Grégory (2009). "Analyse et Modèle Génératif de l’Expressivité: application à la parole et à l'interprétation musicale ", Paris 6 – IRCAM.

[Beller2009b] Beller, Grégory (2009). "Transformation of Expressivity in Speech", The Role of Prosody in the Expression of Emotions in English and in French, ed. Peter Lang.

[Beller2010] Beller, Grégory (2010). Expresso: Transformation of Expressivity in Speech, Speech Prosody, Chicago.

[Beller2011a] Beller, Grégory, Aperghis, Georges (2011). Contrôle gestuel de la synthèse concaténative en temps réel dans Luna Park: rapport recherche.

[Beller2011b] Beller, Grégory, Aperghis, Georges (2011). "Gestural Control of Real-Time Concatenative Synthesis in Luna Park", P3S, International Workshop on Performative Speech and Singing Synthesis, Vancouver, pp. 23-28.

[Beller2011c] Beller, Grégory (2011). Gestural Control Of Real Time Concatenative Synthesis, ICPhS, Hong Kong.

[Beller2011d] Beller, Grégory (2011). Gestural Control of Real-Time Speech Synthesis in Luna Park, SMC, Padova.

[Beller2011e] Beller, Grégory (2011). Arcane d'Un mage en été, Théâtre Publique, n° 200.

[Beller2012a] Beller, Grégory (2012). In-vivo: laboratoire de recherche et d'expérimentation autour du son pour le théâtre, Towards a History of Sound in Theatre, Montreal.

[Beller2014a] Beller, Grégory (2014). "The Synekine Project", MOCO 2014, IRCAM, Paris.

[Beller2014c] Beller, Grégory (2014). L'IRCAM et la voix augmentée au théâtre: Les nouvelles technologies sonores au service de la dramaturgie, L'Annuaire théâtral, Numéro 56–57, p. 195–205.

[Beller2015a] Beller, Grégory (2015). Sound Space and Spatial Sampler, MOCO 2015, SFU, Vancouver.

[Beller2017a] Beller, Grégory (2017). Spectacle vivant : des voix imaginaires aux monstres vocaux, InaGlobal, Paris, France.

Bevilacqua, Frederic, Nicolas Rasamimanana, Emmanuel Fléty, Serge Lemouton and Florence Baschet (2006). The augmented violin project: research, composition and performance report. In 6th International Conference on New Interfaces for Musical Expression (NIME 06), Paris

[Caramiaux2015] Caramiaux, Baptiste, Altavilla, Alessandro, Pobiner, Scott and Tanaka, Atau. "Form Follows Sound: Designing Interactions from Sonic Memories". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 3943-3952. 2015 (http://dx.doi.org/10.1145/2702123.2702515).

Corballis, Michael (January–February 2005). "The gestural origins of language". WIREs Cognitive Science. 1 (1): 2–7. doi:10.1002/wcs.2. PMID 26272832. S2CID 22492422.

[Françoise2014] Jules Françoise, Norbert Schnell, Riccardo Borghesi, Frédéric Bevilacqua. Probabilistic Models for Designing Motion and Sound Relationships. Proceedings of the 2014 International Conference on New Interfaces for Musical Expression, Jun 2014, London, UK, United Kingdom. pp.287-292. ⟨hal-01061335⟩

Iverson, Jana M. and Susan Goldin-Meadow (1998). Why people gesture when they speak, Nature 396, 228.

[Kean2011] Kean, S., Hall, J.C., Perry, P. (2011). Microsoft’s Kinect SDK. In: Meet the Kinect. Apress. https://doi.org/10.1007/978-1-4302-3889-8_8

[Laukka2013] Laukka, Petri, Eerola, Tuomas, Thingujam, Nutankumar S., Yamasaki, Teruo, Beller, Grégory (2013). Universal and Culture-Specific Factors in the Recognition and Performance of Musical Affect Expressions, Emotion, American Psychological Association, Vol 13(3), 434-449.

Löfqvist, A. (1990). Speech as Audible Gestures. In: Hardcastle, W.J., Marchal, A. (eds) Speech Production and Speech Modelling. NATO ASI Series, vol 55. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-2037-8_12

[Parrel2014] Benjamin Parrell, Louis Goldstein, Sungbok Lee, Dani Byrd (2014). Spatiotemporal coupling between speech and manual motor actions. Journal of Phonetics, Volume 42, Pages 1-11.

Stetson, Raymond Herbert. ‘A Motor Theory of Rhythm and Discrete Succession’, Psychological Review (1905) 12: 250–70; 293–330.

[Tinoco2016] Tinoco, Hector. (2016). Beam Design for Voice Coil Motors used for Energy Harvesting Purpose with Low Frequency Vibrations: A Finite Element Analysis. International Journal of Modeling, Simulation, and Scientific Computing. 7. 1-17. 10.1142/S1793962316400018.

[Xu2009] Xu, J; Gannon, PJ; Emmorey, K; Smith, JF; Braun, AR (2009). "Symbolic gestures and spoken language are processed by a common neural system". Proc Natl Acad Sci U.S.A. 106 (49): 20664–20669. Bibcode:2009PNAS..10620664X. doi:10.1073/pnas.0909197106. PMC 2779203. PMID 19923436.

Acknowledgment

The author would like to thank The Foundation for Innovation in Higher Education, the Multimedia Office Hamburg for their funding and support, the performers of the Synekine project and the IRCAM Forum libraries.