Multimodal interaction

Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for input and output of data.


Introduction

Multimodal human-computer interaction refers to the "interaction with the virtual and physical environment through natural modes of communication". This implies that multimodal interaction enables freer and more natural communication, interfacing users with automated systems in both input and output. Specifically, multimodal systems can offer a flexible, efficient and usable environment in which users interact through input modalities, such as speech, handwriting, hand gesture and gaze, and receive information from the system through output modalities, such as speech synthesis, smart graphics and other modalities, opportunely combined. A multimodal system therefore has to recognize the inputs from the different modalities, combining them according to temporal and contextual constraints[1] in order to allow their interpretation. This process is known as multimodal fusion, and it has been the object of several research works from the nineties to the present.[2][3][4][5][6][7][8][9]

The fused inputs are interpreted by the system. Naturalness and flexibility can produce more than one interpretation for each modality (channel) and for their simultaneous use, and they can consequently produce multimodal ambiguity,[10] generally due to imprecision, noise or other similar factors. Several methods have been proposed for resolving such ambiguities.[11][12][13][14][15][16] Finally, the system returns outputs to the user through the various modal channels (disaggregated), arranged as consistent feedback (fission).

The pervasive use of mobile devices, sensors and web technologies can offer adequate computational resources to manage the complexity implied by multimodal interaction. "Using cloud for involving shared computational resources in managing the complexity of multimodal interaction represents an opportunity. In fact, cloud computing allows delivering shared scalable, configurable computing resources that can be dynamically and automatically provisioned and released".
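The sketch below is a minimal, hypothetical illustration of the fusion step described above: a spoken deictic word ("that", "there") is paired with the pointing gesture closest to it in time, subject to a temporal constraint. The event structure, the 1.5-second window and the function names are assumptions made for illustration, not part of any published system.

    # Illustrative sketch only: pair a spoken deictic reference with the
    # pointing gesture closest in time, under a temporal constraint.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ModalEvent:
        modality: str      # e.g. "speech" or "gesture"
        content: str       # recognized token, or target of a pointing gesture
        timestamp: float   # seconds since the start of the interaction

    def fuse_deictic(speech: ModalEvent, gestures: List[ModalEvent],
                     max_gap: float = 1.5) -> Optional[str]:
        """Return the target of the gesture temporally closest to the spoken
        word, if one falls within max_gap seconds; otherwise None (the
        ambiguity is left unresolved)."""
        candidates = [g for g in gestures
                      if abs(g.timestamp - speech.timestamp) <= max_gap]
        if not candidates:
            return None
        best = min(candidates, key=lambda g: abs(g.timestamp - speech.timestamp))
        return best.content

    # Example: "put that there" accompanied by two pointing gestures
    that = ModalEvent("speech", "that", 0.8)
    there = ModalEvent("speech", "there", 1.6)
    points = [ModalEvent("gesture", "red block", 0.9),
              ModalEvent("gesture", "table corner", 1.7)]
    print(fuse_deictic(that, points))   # -> red block
    print(fuse_deictic(there, points))  # -> table corner

Real fusion engines replace the simple time-window rule with the statistical and grammar-based techniques discussed in the sections below; the point here is only that temporal alignment constrains which interpretations are admissible.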


Multimodal input

Two major groups of multimodal interfaces have emerged, one concerned with alternate input methods and the other with combined input/output. The first group of interfaces combines various user input modes beyond the traditional keyboard and mouse input/output, such as speech, pen, touch, manual gestures, gaze, and head and body movements. The most common such interface combines a visual modality (e.g. a display, keyboard and mouse) with a voice modality (speech recognition for input, speech synthesis and recorded audio for output). However, other modalities, such as pen-based input or haptic input/output, may be used. Multimodal user interfaces are a research area in human-computer interaction (HCI).

The advantage of multiple input modalities is increased usability: the weaknesses of one modality are offset by the strengths of another. On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say (e.g. Poughkeepsie); a sketch of this complementarity appears at the end of this section. Consider how you would access and search through digital media catalogs from these same devices or set-top boxes. In one real-world example, patient information in an operating room environment is accessed verbally by members of the surgical team to maintain an antiseptic environment, and presented in near real time aurally and visually to maximize comprehension.

Multimodal input user interfaces have implications for accessibility. A well-designed multimodal application can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality with some keypad input. Hearing-impaired users rely on the visual modality with some speech input. Other users will be "situationally impaired" (e.g. wearing gloves in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired. On the other hand, a multimodal application that requires users to be able to operate all modalities is very poorly designed.

The most common form of input multimodality on the market makes use of the XHTML+Voice (also known as X+V) Web markup language, an open specification developed by IBM, Motorola and Opera Software. X+V is currently under consideration by the W3C and combines several W3C Recommendations, including XHTML for visual markup, VoiceXML for voice markup, and XML Events, a standard for integrating XML languages. Multimodal browsers supporting X+V include IBM WebSphere Everyplace Multimodal Environment, Opera for Embedded Linux and Windows, and ACCESS Systems NetFront for Windows Mobile. To develop multimodal applications, software developers may use a software development kit, such as the IBM WebSphere Multimodal Toolkit, based on the open source Eclipse framework, which includes an X+V debugger, editor and simulator.
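The following sketch illustrates, under invented assumptions, how two input modalities can offset each other's weaknesses in the "hard to type, easy to say" scenario above: a few keypad characters filter a speech recognizer's n-best list, and the speech confidences rank whatever remains. The entries, scores and function names are hypothetical and do not describe any particular product.

    # Minimal sketch: combine a typed prefix with a speech n-best list so each
    # modality compensates for the other's weakness. All values are made up.
    def resolve_entry(nbest, typed_prefix):
        """Pick the highest-scoring speech hypothesis consistent with the
        characters the user managed to type on the keypad.

        nbest        -- list of (hypothesis, confidence) pairs from a recognizer
        typed_prefix -- the partial keypad input
        """
        prefix = typed_prefix.strip().lower()
        consistent = [(text, score) for text, score in nbest
                      if text.lower().startswith(prefix)]
        if not consistent:
            return None  # fall back to asking the user (mediation)
        return max(consistent, key=lambda pair: pair[1])[0]

    # A hard-to-type but easy-to-say place name:
    nbest = [("Poughkeepsie", 0.61), ("Parsippany", 0.22), ("Pittsburgh", 0.12)]
    print(resolve_entry(nbest, "pou"))  # -> Poughkeepsie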


Multimodal sentiment analysis


Multimodal output

The second group of multimodal systems presents users with multimedia displays and multimodal output, primarily in the form of visual and auditory cues. Interface designers have also started to make use of other modalities, such as touch and olfaction. Proposed benefits of multimodal output systems include synergy and redundancy. Information that is presented via several modalities is merged and refers to various aspects of the same process. The use of several modalities for processing exactly the same information provides an increased bandwidth of information transfer. Currently, multimodal output is used mainly for improving the mapping between communication medium and content and to support attention management in data-rich environments where operators face considerable visual attention demands.

An important step in multimodal interface design is the creation of natural mappings between modalities and the information and tasks. The auditory channel differs from vision in several aspects. It is omnidirectional, transient and is always reserved. Speech output, one form of auditory information, has received considerable attention, and several guidelines have been developed for the use of speech. Michaelis and Wiggins (1982) suggested that speech output should be used for simple, short messages that will not be referred to later. It was also recommended that speech should be generated in time and require an immediate response; a sketch of such a mapping appears at the end of this section.

The sense of touch was first utilized as a medium for communication in the late 1950s. It is not only a promising but also a unique communication channel. In contrast to vision and hearing, the two traditional senses employed in HCI, the sense of touch is proximal: it senses objects that are in contact with the body, and it is bidirectional in that it supports both perception and acting on the environment.

Examples of auditory feedback include auditory icons in computer operating systems indicating users' actions (e.g. deleting a file, opening a folder, an error), speech output for presenting navigational guidance in vehicles, and speech output for warning pilots in modern airplane cockpits. Examples of tactile signals include vibrations of the turn-signal lever to warn drivers of a car in their blind spot, the vibration of a car seat as a warning to drivers, and the stick shaker on modern aircraft alerting pilots to an impending stall.

Invisible interface spaces became available with sensor technology. Infrared, ultrasound and cameras are all now commonly used. Transparency of interfacing with content is enhanced when an immediate and direct link via a meaningful mapping is in place: the user then has direct and immediate feedback to input, and the content response becomes an interface affordance (Gibson 1979).


Multimodal fusion

The process of integrating information from various input modalities and combining them into a complete command is referred to as multimodal fusion. In the literature, three main approaches to the fusion process have been proposed, according to the main architectural levels (recognition and decision) at which the fusion of the input signals can be performed: recognition-based,[17] decision-based,[18][19][20][21] and hybrid multi-level fusion.[22][23][24][25][26][27]

The recognition-based fusion (also known as early fusion) consists in merging the outcomes of each modal recognizer by using integration mechanisms, such as, for example, statistical integration techniques, agent theory, hidden Markov models, artificial neural networks, etc. Examples of recognition-based fusion strategies are action frame, input vectors and slots.

The decision-based fusion (also known as late fusion) merges the semantic information that is extracted by using specific dialogue-driven fusion procedures to yield the complete interpretation. Examples of decision-based fusion strategies are typed feature structures, melting pots, semantic frames, and time-stamped lattices.

The potential applications for multimodal fusion include learning environments, consumer relations, security/surveillance, computer animation, etc. Individually, modes are easily defined, but difficulty arises in having technology consider them in combination. It is difficult for algorithms to factor in dimensionality: there exist variables outside of current computation abilities. For example, two sentences could have the same lexical meaning but carry different emotional information.

In the hybrid multi-level fusion, the integration of input modalities is distributed among the recognition and decision levels. The hybrid multi-level fusion includes the following three methodologies: finite-state transducers, multimodal grammars and dialogue moves.
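To make the architectural distinction concrete, the hedged sketch below contrasts the two extremes on a toy speech-plus-gesture command. The feature vectors, scores, labels and the "classifiers" are invented for the example; real systems would use the trained recognizers and integration mechanisms listed above.

    # Illustrative contrast between recognition-based (early) and decision-based
    # (late) fusion. All numbers and labels are invented.
    def early_fusion(speech_features, gesture_features):
        """Early fusion: concatenate modality features and classify jointly."""
        joint = speech_features + gesture_features
        # Stand-in "joint classifier": a simple threshold on the feature sum.
        return "delete_object" if sum(joint) > 2.0 else "select_object"

    def late_fusion(speech_label_scores, gesture_label_scores):
        """Late fusion: each recognizer outputs label scores; scores are merged."""
        labels = set(speech_label_scores) | set(gesture_label_scores)
        combined = {label: speech_label_scores.get(label, 0.0)
                            + gesture_label_scores.get(label, 0.0)
                    for label in labels}
        return max(combined, key=combined.get)

    print(early_fusion([0.9, 0.7], [0.8, 0.2]))          # joint decision on raw features
    print(late_fusion({"delete_object": 0.6, "select_object": 0.4},
                      {"delete_object": 0.2, "select_object": 0.5}))

Early fusion lets one modality's raw evidence influence the interpretation of the other before any decision is made, while late fusion keeps the recognizers independent and merges only their semantic-level outputs; hybrid multi-level fusion mixes both.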


Ambiguity

A user's actions or commands produce multimodal inputs (a multimodal message), which have to be interpreted by the system. The multimodal message is the medium that enables communication between users and multimodal systems. It is obtained by merging information conveyed via several modalities, considering the different types of cooperation between modalities, the time relationships among the involved modalities and the relationships between chunks of information connected with these modalities.

The natural mapping between the multimodal input, which is provided by several interaction modalities (the visual and auditory channels and the sense of touch), and information and tasks implies managing the typical problems of human-human communication, such as ambiguity. An ambiguity arises when more than one interpretation of the input is possible. A multimodal ambiguity arises both if an element provided by one modality has more than one interpretation (i.e. ambiguities are propagated to the multimodal level), and/or if elements connected with each modality are univocally interpreted but the information referred to the different modalities is incoherent at the syntactic or semantic level (i.e. a multimodal sentence having different meanings or different syntactic structures).

In "The Management of Ambiguities",[13] the methods for solving ambiguities and for providing the correct interpretation of the user's input are organized into three main classes: prevention, a-posteriori resolution and approximation resolution methods. Prevention methods require users to follow a predefined interaction behaviour, according to a set of transitions between different allowed states of the interaction process. Examples of prevention methods are: the procedural method, reduction of the expressive power of the language grammar, and improvement of the expressive power of the language grammar. The a-posteriori resolution of ambiguities uses a mediation approach. Examples of mediation techniques are: repetition, e.g. repetition by modality, granularity of repair[28] and undo, and choice. The approximation resolution methods do not require any user involvement in the disambiguation process. They can all require the use of some theories, such as fuzzy logic, Markov random fields, Bayesian networks and hidden Markov models.
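As a hedged sketch of the approximation family (not one of the cited models), the example below ranks competing interpretations of an ambiguous multimodal sentence by multiplying per-modality confidences with a context-dependent prior, so the ambiguity is resolved without involving the user. The figures and the naive independence assumption are purely illustrative.

    # Illustrative approximation resolution: pick the interpretation with the
    # highest combined score, with no user involvement. Numbers are invented.
    def resolve_by_approximation(interpretations):
        """interpretations: dicts with per-modality confidences and a prior."""
        def score(interp):
            s = interp["prior"]
            for confidence in interp["modal_confidences"].values():
                s *= confidence  # naive product, as in a simple Bayesian combination
            return s
        return max(interpretations, key=score)

    candidates = [
        {"meaning": "move file to trash",
         "modal_confidences": {"speech": 0.7, "gesture": 0.6}, "prior": 0.5},
        {"meaning": "move file to folder",
         "modal_confidences": {"speech": 0.7, "gesture": 0.3}, "prior": 0.5},
    ]
    print(resolve_by_approximation(candidates)["meaning"])  # -> move file to trash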


See also

* Device independence
* Multimodal biometric system
* Multimodal search
* Speech recognition
* W3C's Multimodal Interaction Activity – an initiative from W3C aiming to provide means (mostly XML) to support multimodal interaction scenarios on the Web
* Web accessibility
* Wired glove
* XHTML+Voice


References

1. Caschera, M. C.; Ferri, F.; Grifoni, P. (2007). "Multimodal interaction systems: information and time features". International Journal of Web and Grid Services (IJWGS), Vol. 3, No. 1, pp. 82–99.
2. D'Ulizia, A.; Ferri, F.; Grifoni, P. (2010). "Generating Multimodal Grammars for Multimodal Dialogue Processing". IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, Vol. 40, No. 6, pp. 1130–1145.
3. D'Ulizia, A. (2009). "Exploring Multimodal Input Fusion Strategies". In: Grifoni, P. (ed.) Handbook of Research on Multimodal Human Computer Interaction and Pervasive Services: Evolutionary Techniques for Improving Accessibility. IGI Publishing, pp. 34–57.
4. Sun, Y.; Shi, Y.; Chen, F.; Chung, V. (2007). "An Efficient Multimodal Language Processor for Parallel Input Strings in Multimodal Input Fusion". In Proceedings of the International Conference on Semantic Computing, pp. 389–396.
5. Russ, G.; Sallans, B.; Hareter, H. (2005). "Semantic Based Information Fusion in a Multimodal Interface". International Conference on Human-Computer Interaction (HCI'05), Las Vegas, Nevada, USA, 20–23 June, pp. 94–100.
6. Corradini, A.; Mehta, M.; Bernsen, N. O.; Martin, J.-C. (2003). "Multimodal Input Fusion in Human-Computer Interaction on the Example of the on-going NICE Project". In Proceedings of the NATO-ASI Conference on Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, Yerevan, Armenia.
7. Pavlovic, V. I.; Berry, G. A.; Huang, T. S. (1997). "Integration of audio/visual information for use in human-computer intelligent interaction". Proceedings of the 1997 International Conference on Image Processing (ICIP '97), Vol. 1, pp. 121–124.
8. Andre, M.; Popescu, V. G.; Shaikh, A.; Medl, A.; Marsic, I.; Kulikowski, C.; Flanagan, J. L. (1998). "Integration of Speech and Gesture for Multimodal Human-Computer Interaction". In Second International Conference on Cooperative Multimodal Communication, 28–30 January, Tilburg, The Netherlands.
9. Vo, M. T.; Wood, C. (1996). "Building an application framework for speech and pen input integration in multimodal learning interfaces". In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96), May 7–10, IEEE Computer Society, Vol. 06, pp. 3545–3548.
10. Caschera, M. C.; Ferri, F.; Grifoni, P. (2013). "From Modal to Multimodal Ambiguities: a Classification Approach". Journal of Next Generation Information Technology (JNIT), Vol. 4, No. 5, pp. 87–109.
11. Caschera, M. C.; Ferri, F.; Grifoni, P. (2013). "InteSe: An Integrated Model for Resolving Ambiguities in Multimodal Sentences". IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol. 43, No. 4, pp. 911–931.
12. Spilker, J.; Klarner, M.; Görz, G. (2000). "Processing Self Corrections in a speech to speech system". COLING 2000, pp. 1116–1120.
13. Caschera, M. C.; Ferri, F.; Grifoni, P. (2007). "The Management of Ambiguities". In Visual Languages for Interactive Computing: Definitions and Formalizations. IGI Publishing, pp. 129–140.
14. Chai, J.; Hong, P.; Zhou, M. X. (2004). "A probabilistic approach to reference resolution in multimodal user interfaces". In Proceedings of the 9th International Conference on Intelligent User Interfaces, Madeira, Portugal, January 2004, pp. 70–77.
15. Dey, A. K.; Mankoff, J. (2005). "Designing mediation for context-aware applications". ACM Transactions on Computer-Human Interaction, 12(1), pp. 53–80.
16. Mankoff, J.; Hudson, S. E.; Abowd, G. D. (2000). "Providing integrated toolkit-level support for ambiguity in recognition-based interfaces". Proceedings of the ACM CHI'00 Conference on Human Factors in Computing Systems, pp. 368–375.
17. Vo, M. T. (1998). "A Framework and Toolkit for the Construction of Multimodal Learning Interfaces". PhD thesis, Carnegie Mellon University, Pittsburgh, USA.
18. Cohen, P. R.; Johnston, M.; McGee, D.; Oviatt, S. L.; Pittman, J.; Smith, I. A.; Chen, L.; Clow, J. (1997). "Quickset: Multimodal interaction for distributed applications". ACM Multimedia, pp. 31–40.
19. Johnston, M. (1998). "Unification-based Multimodal Parsing". Proceedings of COLING-ACL '98, August 10–14, Université de Montréal, Montreal, Quebec, Canada, pp. 624–630.
20. Nigay, L.; Coutaz, J. (1995). "A generic platform for addressing the multimodal challenge". Proceedings of the Conference on Human Factors in Computing Systems, ACM Press.
21. Bouchet, J.; Nigay, L.; Ganille, T. (2004). "ICARE software components for rapidly developing multimodal interfaces". ICMI '04: Proceedings of the 6th International Conference on Multimodal Interfaces, New York, NY, USA, ACM, pp. 251–258.
22. D'Ulizia, A.; Ferri, F.; Grifoni, P. (2007). "A Hybrid Grammar-Based Approach to Multimodal Languages Specification". OTM 2007 Workshop Proceedings, 25–30 November 2007, Vilamoura, Portugal, Springer-Verlag, Lecture Notes in Computer Science 4805, pp. 367–376.
23. Johnston, M.; Bangalore, S. (2000). "Finite-state Multimodal Parsing and Understanding". In Proceedings of the International Conference on Computational Linguistics, Saarbruecken, Germany.
24. Sun, Y.; Chen, F.; Shi, Y. D.; Chung, V. (2006). "A novel method for multi-sensory data fusion in multimodal human computer interaction". In Proceedings of the 20th Conference of the Computer-Human Interaction Special Interest Group (CHISIG) of Australia, Sydney, Australia, pp. 401–404.
25. Shimazu, H.; Takashima, Y. (1995). "Multimodal Definite Clause Grammar". Systems and Computers in Japan, Vol. 26, No. 3, pp. 93–102.
26. Johnston, M.; Bangalore, S. (2005). "Finite-state multimodal integration and understanding". Natural Language Engineering, Vol. 11, No. 2, pp. 159–187.
27. Reitter, D.; Panttaja, E. M.; Cummins, F. (2004). "UI on the fly: Generating a multimodal user interface". In Proceedings of HLT-NAACL 2004, Boston, Massachusetts, USA.
28. Suhm, B.; Myers, B.; Waibel, A. (1999). "Model-based and empirical evaluation of multimodal interactive error correction". In Proceedings of CHI'99, May 1999, pp. 584–591.

External links


* W3C Multimodal Interaction Activity
* XHTML+Voice Profile 1.0, W3C Note, 21 December 2001
* Hoste, Lode; Dumas, Bruno; Signer, Beat: ''Mudra: A Unified Multimodal Interaction Framework''. In Proceedings of the 13th International Conference on Multimodal Interaction (ICMI 2011), Alicante, Spain, November 2011.
* Toselli, Alejandro Héctor; Vidal, Enrique; Casacuberta, Francisco: ''Multimodal Interactive Pattern Recognition and Applications''. Springer, 2011.