Multimodal sentiment analysis

Multimodal sentiment analysis extends traditional text-based sentiment analysis to additional modalities such as audio and visual data. It can be bimodal, combining any two modalities, or trimodal, incorporating three modalities. With the extensive amount of social media data available online in different forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis, which can be applied in the development of virtual assistants, the analysis of YouTube movie reviews, the analysis of news videos, and emotion recognition (sometimes known as emotion detection) tasks such as depression monitoring, among others. As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which assigns sentiments to categories such as positive, negative, or neutral. The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion. The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis.


Features

Feature engineering, which involves the selection of features that are fed into machine learning algorithms, plays a key role in sentiment classification performance. In multimodal sentiment analysis, a combination of different textual, audio, and visual features is employed.


Textual features

As in conventional text-based sentiment analysis, some of the most commonly used textual features in multimodal sentiment analysis are unigrams and n-grams, which are sequences of one or more adjacent words in a given textual document. These features are applied using bag-of-words or bag-of-concepts feature representations, in which words or concepts are represented as vectors in a suitable space.
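As a minimal sketch of a bag-of-n-grams representation, the following Python example counts unigrams and bigrams and maps them onto a shared vocabulary. The function names, the whitespace tokenizer, and the two sample sentences are illustrative assumptions, not part of any particular toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Count all 1..max_n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def vectorize(counts, vocabulary):
    """Map n-gram counts onto a fixed vocabulary order, giving a feature vector."""
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical sample documents
docs = ["the movie was great", "the movie was not great"]
bags = [bag_of_ngrams(d) for d in docs]
vocabulary = sorted(set().union(*bags))
vectors = [vectorize(b, vocabulary) for b in bags]
```

Each document becomes a vector of counts with one dimension per n-gram in the vocabulary. The bigram ("not", "great") appears only in the second document, which illustrates why n-grams beyond unigrams can help separate opposite sentiments.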


Audio features

Sentiment and emotion characteristics are prominent in different phonetic and prosodic properties contained in audio features. Some of the most important audio features employed in multimodal sentiment analysis are mel-frequency cepstral coefficients (MFCC), spectral centroid, spectral flux, beat histogram, beat sum, strongest beat, pause duration, and pitch. OpenSMILE and Praat are popular open-source toolkits for extracting such audio features.
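To make one of these features concrete, the sketch below computes the spectral centroid of a single audio frame as the magnitude-weighted mean frequency of its spectrum. It uses a naive discrete Fourier transform from the standard library for illustration only; it is not how OpenSMILE or Praat compute the feature, and the function names, frame length, and test tone are assumptions.

```python
import cmath
import math

def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short example frames)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency (in Hz) of one frame."""
    mags = magnitude_spectrum(signal)
    n = len(signal)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: the centroid is undefined, report 0 by convention
    return sum(f * m for f, m in zip(freqs, mags)) / total

# Hypothetical frame: a pure 1000 Hz tone sampled at 8000 Hz
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
centroid = spectral_centroid(frame, sample_rate)
```

For a pure tone that falls exactly on a frequency bin, the centroid sits at the tone's frequency. Speech frames with more high-frequency energy yield a higher centroid.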


Visual features

One of the main advantages of analyzing videos rather than text alone is the presence of rich sentiment cues in visual data. Visual features include facial expressions, which are of paramount importance in capturing sentiments and emotions, as they are a main channel for conveying a person's present state of mind. In particular, the smile is considered to be one of the most predictive visual cues in multimodal sentiment analysis. OpenFace is an open-source facial analysis toolkit available for extracting and understanding such visual features.


Fusion techniques

Unlike traditional text-based sentiment analysis, multimodal sentiment analysis involves a fusion process in which data from different modalities (text, audio, or visual) are fused and analyzed together. The existing approaches to data fusion in multimodal sentiment analysis can be grouped into three main categories: feature-level, decision-level, and hybrid fusion. The performance of the sentiment classification depends on which type of fusion technique is employed.


Feature-level fusion

Feature-level fusion (sometimes known as early fusion) gathers all the features from each modality (text, audio, or visual) and joins them together into a single feature vector, which is eventually fed into a classification algorithm. One of the difficulties in implementing this technique is the integration of the heterogeneous features.
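A minimal sketch of feature-level fusion is shown below. It scales each modality's feature vector to unit length before concatenating, which is one simple way (assumed here, not prescribed by the text) to keep heterogeneous features on comparable scales. The feature values and function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def early_fusion(text_feats, audio_feats, visual_feats):
    """Normalize each modality separately, then join into one feature vector."""
    fused = []
    for feats in (text_feats, audio_feats, visual_feats):
        fused.extend(l2_normalize(feats))
    return fused

# Hypothetical per-modality features for one video segment
text_feats = [1, 0, 2]        # e.g. n-gram counts
audio_feats = [220.0, 0.35]   # e.g. pitch in Hz, pause duration in seconds
visual_feats = [0.8]          # e.g. smile intensity
fused = early_fusion(text_feats, audio_feats, visual_feats)
```

The resulting single vector would then be passed to one classifier. Without the per-modality scaling, the pitch value in hertz would dominate the other dimensions, which illustrates the integration difficulty mentioned above.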


Decision-level fusion

Decision-level fusion (sometimes known as late fusion) feeds data from each modality (text, audio, or visual) independently into its own classification algorithm, and obtains the final sentiment classification result by fusing the individual results into a single decision vector. One of the advantages of this fusion technique is that it eliminates the need to fuse heterogeneous data, and each modality can utilize its most appropriate classification algorithm.
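The following sketch illustrates decision-level fusion under the assumption that each modality's classifier has already produced a probability distribution over the three sentiment classes. The weighted average used to combine them is only one of several possible fusion rules, and the probabilities, weights, and function name are hypothetical.

```python
LABELS = ("negative", "neutral", "positive")

def late_fusion(modality_probs, weights=None):
    """Fuse per-modality class probabilities into one decision vector by weighted averaging."""
    names = list(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}  # default: equal trust in every modality
    total = sum(weights[name] for name in names)
    fused = [
        sum(weights[name] * modality_probs[name][i] for name in names) / total
        for i in range(len(LABELS))
    ]
    return fused, LABELS[fused.index(max(fused))]

# Hypothetical outputs of three independent classifiers for one video segment
modality_probs = {
    "text": [0.2, 0.3, 0.5],
    "audio": [0.1, 0.2, 0.7],
    "visual": [0.5, 0.3, 0.2],
}
decision_vector, label = late_fusion(modality_probs)
```

Because only the classifier outputs are combined, each modality is free to use a different model, and the weights offer a simple way to express how much each modality is trusted.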


Hybrid fusion

Hybrid fusion is a combination of feature-level and decision-level fusion techniques, which exploits complementary information from both methods during the classification process. It usually involves a two-step procedure in which feature-level fusion is first performed between two modalities, and decision-level fusion is then applied as a second step to fuse the initial results from the feature-level fusion with the remaining modality.
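The two-step procedure can be sketched as follows. The choice of audio and visual as the two modalities fused at the feature level, the toy linear classifier standing in for trained models, and all weights and feature values are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_classifier(features, weight_rows):
    """Toy stand-in for a trained model: one weight row per class, followed by softmax."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weight_rows])

def hybrid_fusion(audio, visual, text, av_weights, text_weights):
    """Feature-level fusion of audio and visual, then decision-level fusion with text."""
    # Step 1: concatenate two modalities and classify the joint feature vector
    av_probs = linear_classifier(audio + visual, av_weights)
    # The remaining modality is classified on its own
    text_probs = linear_classifier(text, text_weights)
    # Step 2: average the two probability vectors into the final decision
    return [(a + t) / 2 for a, t in zip(av_probs, text_probs)]

# Hypothetical features and weights; classes are (negative, positive)
audio, visual, text = [0.4, 0.1], [0.9], [1.0, 0.0]
av_weights = [[-1.0, 0.5, -2.0], [1.0, -0.5, 2.0]]
text_weights = [[-1.0, 1.0], [1.0, -1.0]]
probs = hybrid_fusion(audio, visual, text, av_weights, text_weights)
```

Here both intermediate classifiers favor the positive class, so the fused decision does as well. In practice the two classifiers would be trained models rather than fixed weight rows.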


Applications

Similar to text-based sentiment analysis, multimodal sentiment analysis can be applied in the development of different forms of recommender systems, such as in the analysis of user-generated videos of movie reviews and general product reviews, to predict the sentiments of customers and subsequently create product or service recommendations. Multimodal sentiment analysis also plays an important role in the advancement of virtual assistants through the application of natural language processing (NLP) and machine learning techniques. In the healthcare domain, multimodal sentiment analysis can be utilized to detect certain medical conditions such as stress, anxiety, or depression. Multimodal sentiment analysis can also be applied in understanding the sentiments contained in video news programs, which is considered a complicated and challenging domain, as sentiments expressed by reporters tend to be less obvious or neutral.

