Multimodal sentiment analysis adds a new dimension to traditional text-based sentiment analysis: it goes beyond the analysis of text and includes other modalities such as audio and visual data. It can be bimodal, which includes different combinations of two modalities, or trimodal, which incorporates three modalities. With the extensive amount of social media data available online in different forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis, which can be applied in the development of virtual assistants, analysis of YouTube movie reviews, analysis of news videos, and emotion recognition (sometimes known as emotion detection) such as depression monitoring, among others.
Similar to traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which classifies different sentiments into categories such as positive, negative, or neutral. The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion. The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis.
Features
Feature engineering, which involves the selection of features that are fed into machine learning algorithms, plays a key role in sentiment classification performance. In multimodal sentiment analysis, a combination of different textual, audio, and visual features is employed.
Textual features
Similar to conventional text-based sentiment analysis, some of the most commonly used textual features in multimodal sentiment analysis are unigrams and n-grams, that is, single words and contiguous sequences of n words in a given textual document. These features are applied using bag-of-words or bag-of-concepts feature representations, in which words or concepts are represented as vectors in a suitable space.
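As a minimal sketch of the bag-of-words representation described above, the following Python example counts unigrams and bigrams and maps them onto a fixed vocabulary. The whitespace tokenizer, the function names, and the two sample sentences are assumptions made for illustration; they are not taken from any particular toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count every n-gram of length 1..max_n in the text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def vectorize(counts, vocabulary):
    """Map a bag of n-grams onto a fixed vocabulary, giving a count vector."""
    return [counts.get(term, 0) for term in vocabulary]


# Hypothetical sample documents.
docs = ["the movie was good", "the movie was not good"]
bags = [bag_of_ngrams(d) for d in docs]
vocabulary = sorted(set().union(*bags))
vectors = [vectorize(b, vocabulary) for b in bags]
```

The bigram "not good" appears only in the second vector, which is the kind of cue that unigram counts alone would miss.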
Audio features
Sentiment and emotion characteristics are prominent in different phonetic and prosodic properties contained in audio features. Some of the most important audio features employed in multimodal sentiment analysis are mel-frequency cepstral coefficients (MFCC), spectral centroid, spectral flux, beat histogram, beat sum, strongest beat, pause duration, and pitch. OpenSMILE and Praat are popular open-source toolkits for extracting such audio features.
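To make two of these features concrete, the sketch below computes the spectral centroid and one common definition of spectral flux from raw samples, using only the Python standard library. It is an illustration under stated assumptions, not the method used by OpenSMILE or Praat: the naive DFT, the function names, and the synthetic 1000 Hz test tone are all hypothetical choices.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    n = len(frame)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


def spectral_flux(prev_frame, frame):
    """Sum of squared differences between the magnitude spectra of two frames."""
    a, b = magnitude_spectrum(prev_frame), magnitude_spectrum(frame)
    return sum((y - x) ** 2 for x, y in zip(a, b))


# Synthetic test signal (assumption): a pure 1000 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
centroid = spectral_centroid(tone, sample_rate)
```

For a pure tone the centroid sits at the tone's frequency, and the flux between two identical frames is zero; speech frames yield a centroid that shifts with the brightness of the sound.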
Visual features
One of the main advantages of analyzing videos rather than text alone is the presence of rich sentiment cues in visual data. Visual features include facial expressions, which are of paramount importance in capturing sentiments and emotions, as they are a main channel for conveying a person's present state of mind. Specifically, the smile is considered to be one of the most predictive visual cues in multimodal sentiment analysis. OpenFace is an open-source facial analysis toolkit available for extracting and understanding such visual features.
Fusion techniques
Unlike traditional text-based sentiment analysis, multimodal sentiment analysis undergoes a fusion process in which data from different modalities (text, audio, or visual) are fused and analyzed together. The existing approaches to data fusion in multimodal sentiment analysis can be grouped into three main categories: feature-level, decision-level, and hybrid fusion. The performance of the sentiment classification depends on which type of fusion technique is employed.
Feature-level fusion
Feature-level fusion (sometimes known as early fusion) gathers all the features from each
modality (text, audio, or visual) and joins them together into a single feature vector, which is eventually fed into a classification algorithm.
One of the difficulties in implementing this technique is the integration of the heterogeneous features.
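A minimal sketch of feature-level fusion follows. Each modality's feature vector is scaled to unit length and the results are concatenated into one vector. The L2 normalization is an assumption added here as one simple way to keep a modality with large raw values from dominating; in practice features are often standardized over a training set instead. The function names and feature values are hypothetical.

```python
import math


def l2_normalize(features):
    """Scale a feature vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(v * v for v in features))
    return list(features) if norm == 0 else [v / norm for v in features]


def feature_level_fusion(text_feats, audio_feats, visual_feats):
    """Concatenate the normalized per-modality features into a single vector."""
    fused = []
    for feats in (text_feats, audio_feats, visual_feats):
        fused.extend(l2_normalize(feats))
    return fused


# Hypothetical feature values, chosen only for illustration.
text_feats = [1, 0, 2, 0]   # e.g. n-gram counts
audio_feats = [3.0, 4.0]    # e.g. two acoustic descriptors
visual_feats = [0.9]        # e.g. a smile-intensity score
fused = feature_level_fusion(text_feats, audio_feats, visual_feats)
```

The fused vector has one entry per input feature (seven here) and would be passed to a single classifier.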
Decision-level fusion
Decision-level fusion (sometimes known as late fusion) feeds data from each modality (text, audio, or visual) independently into its own classification algorithm, and obtains the final sentiment classification results by fusing each result into a single decision vector. One of the advantages of this fusion technique is that it eliminates the need to fuse heterogeneous data, and each modality can utilize its most appropriate classification algorithm.
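Decision-level fusion can be sketched as follows, assuming that each modality's classifier outputs class probabilities and that the decision vector is formed by a weighted average. Weighted averaging is only one possible fusion rule (majority voting is another), and the label order, weights, and probability values below are hypothetical.

```python
LABELS = ("negative", "neutral", "positive")


def decision_level_fusion(modality_probs, weights=None):
    """Fuse per-modality class probabilities by a weighted average.

    modality_probs maps a modality name to its classifier's probabilities,
    ordered as in LABELS. Returns the fused decision vector and its label.
    """
    if weights is None:
        weights = {name: 1.0 for name in modality_probs}  # equal weights (assumption)
    total = sum(weights[name] for name in modality_probs)
    fused = [
        sum(weights[name] * probs[i] for name, probs in modality_probs.items()) / total
        for i in range(len(LABELS))
    ]
    best = max(range(len(LABELS)), key=lambda i: fused[i])
    return fused, LABELS[best]


# Hypothetical outputs of three independently trained classifiers.
outputs = {
    "text": [0.2, 0.5, 0.3],
    "audio": [0.1, 0.2, 0.7],
    "visual": [0.1, 0.1, 0.8],
}
fused, label = decision_level_fusion(outputs)
```

With equal weights the audio and visual classifiers outvote the text classifier here; raising the text weight enough shifts the fused decision toward the text classifier's prediction.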
Hybrid fusion
Hybrid fusion is a combination of feature-level and decision-level fusion techniques, which exploits complementary information from both methods during the classification process.
It usually involves a two-step procedure wherein feature-level fusion is initially performed between two modalities, and decision-level fusion is then applied as a second step to fuse the initial results from the feature-level fusion with the remaining modality.
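The two-step procedure can be sketched as below. The choice to fuse audio and visual features first and to bring in text at the decision level is an assumption; the text does not say which two modalities are paired. The toy linear classifiers with hand-set weights stand in for trained models, and all names and values are hypothetical.

```python
import math

LABELS = ("negative", "neutral", "positive")


def softmax(scores):
    """Convert raw class scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_classifier(weights):
    """Build a toy classifier: one weight row per label, softmax over dot products."""
    def predict(features):
        return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])
    return predict


def hybrid_fusion(audio, visual, text, av_classifier, text_classifier):
    """Step 1: feature-level fusion of audio and visual features.
    Step 2: decision-level fusion of that result with the text classifier."""
    av_probs = av_classifier(list(audio) + list(visual))          # step 1
    text_probs = text_classifier(text)
    fused = [(a + t) / 2 for a, t in zip(av_probs, text_probs)]   # step 2
    return fused, LABELS[max(range(len(LABELS)), key=lambda i: fused[i])]


# Hypothetical, hand-set weights standing in for trained models.
av_classifier = linear_classifier([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
text_classifier = linear_classifier([[-1.0], [0.0], [1.0]])
fused, label = hybrid_fusion([0.5], [0.9], [0.2], av_classifier, text_classifier)
```

Averaging the two probability vectors is the simplest decision rule; any other decision-level rule could replace it in step 2.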
Applications
Similar to text-based sentiment analysis, multimodal sentiment analysis can be applied in the development of different forms of recommender systems, such as in the analysis of user-generated videos of movie reviews and general product reviews, to predict the sentiments of customers and subsequently create product or service recommendations. Multimodal sentiment analysis also plays an important role in the advancement of virtual assistants through the application of natural language processing (NLP) and machine learning techniques.
In the healthcare domain, multimodal sentiment analysis can be utilized to detect certain medical conditions such as stress, anxiety, or depression.
Multimodal sentiment analysis can also be applied in understanding the sentiments contained in video news programs, which is considered a complicated and challenging domain, as sentiments expressed by reporters tend to be less obvious or neutral.