Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term "audio mining" is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.

History

Academic research on audio mining began in the late 1970s at institutions such as Carnegie Mellon University, Columbia University, the Georgia Institute of Technology, and the University of Texas. Audio data indexing and retrieval began to receive attention in the early 1990s, when multimedia content started to develop and the volume of audio content increased significantly. Before audio mining became the mainstream method, written transcripts of audio content were created and analyzed manually.

Process

Audio mining is typically split into four components: audio indexing, speech processing and recognition systems, feature extraction, and audio classification. The audio will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content. This information may either be used immediately in pre-defined searches for keywords or phrases (a real-time "word spotting" system), or the output of the speech recognizer may be stored in an index file. One or more audio mining index files can then be loaded at a later date in order to run searches for keywords or phrases. The results of a search will normally be in terms of hits, which are regions within files that are good matches for the chosen keywords. The user may then be able to listen to the audio corresponding to these hits in order to verify whether a correct match was found.
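The index-then-search flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recognizer output format (word, start time, end time), the file names, and the `max_gap` parameter are hypothetical and not taken from any particular audio mining product.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each recognized word to the (file, start_sec, end_sec) regions where it occurs."""
    index = defaultdict(list)
    for file_name, words in transcripts.items():
        for word, start, end in words:
            index[word.lower()].append((file_name, start, end))
    return index

def search(index, phrase, max_gap=0.5):
    """Return hits: regions where the words of the phrase occur consecutively in one file."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for file_name, start, end in index.get(words[0], []):
        region_end = end
        matched = True
        for word in words[1:]:
            # The next word must begin at most max_gap seconds after the previous one ends.
            candidates = [h for h in index.get(word, [])
                          if h[0] == file_name and 0 <= h[1] - region_end <= max_gap]
            if not candidates:
                matched = False
                break
            region_end = candidates[0][2]
        if matched:
            hits.append((file_name, start, region_end))
    return hits

# Hypothetical recognizer output: (word, start time, end time) for each file.
transcripts = {
    "call_01.wav": [("please", 0.0, 0.4), ("reset", 0.4, 0.8),
                    ("my", 0.8, 0.9), ("password", 0.9, 1.5)],
    "call_02.wav": [("my", 2.0, 2.1), ("password", 2.1, 2.7), ("expired", 2.7, 3.3)],
}
index = build_index(transcripts)
hits = search(index, "my password")
```

Each hit is a (file, start, end) region that a user could play back to verify the match, which mirrors the verification step described in the text.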

Audio Indexing

The main problem in audio is one of information retrieval: there is a need to locate the documents that contain the search key. Unlike a human listener, a computer cannot readily distinguish between different types of audio, such as noise, music, or human speech, or between properties such as speed and mood, so an effective searching method is needed. Audio indexing allows efficient search for information by analyzing an entire file using speech recognition. An index of the content is then produced, recording words and their locations; this is done through content-based audio retrieval, which focuses on extracted audio features. There are two main methods: large vocabulary continuous speech recognition (LVCSR) and phonetic-based indexing.

Large Vocabulary Continuous Speech Recognizers (LVCSR)

In text-based indexing, or large vocabulary continuous speech recognition (LVCSR), the audio file is first broken down into recognizable phonemes. It is then run through a dictionary that can contain several hundred thousand entries and matched with words and phrases to produce a full text transcript. A user can then simply search for a desired term, and the relevant portion of the audio content will be returned. If a word cannot be found in the dictionary, the system chooses the most similar entry it can find. The system uses a language understanding model to assign a confidence level to its matches. If the confidence level is below 100 percent, the system provides all the matches it found as options.
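The dictionary-matching step can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: the four-word pronunciation dictionary is made up, and the similarity ratio from Python's `difflib.SequenceMatcher` is used in place of the language understanding model that a real LVCSR system would use to score its matches.

```python
from difflib import SequenceMatcher

# Hypothetical pronunciation dictionary (ARPAbet-style symbols, for illustration only).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "data": ["D", "EY", "T", "AH"],
}

def match_word(phonemes, dictionary=PRONUNCIATIONS, max_options=3):
    """Rank dictionary words by similarity to the recognized phoneme sequence.

    Returns a single (word, confidence) pair for an exact match, otherwise
    the top candidates so that the user can choose among them.
    """
    scored = []
    for word, pronunciation in dictionary.items():
        confidence = SequenceMatcher(None, phonemes, pronunciation).ratio()
        scored.append((word, round(confidence, 2)))
    # Highest confidence first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    if scored[0][1] == 1.0:
        return [scored[0]]
    return scored[:max_options]

exact = match_word(["K", "AE", "T"])    # phonemes match "cat" exactly
unsure = match_word(["K", "AE", "D"])   # no exact entry: several options are returned
```

When the recognized phonemes match no entry exactly, the function falls back to the most similar entries, which corresponds to the "options of all the found matches" behavior described above.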

Advantages and disadvantages

The main draw of LVCSR is its high accuracy and high searching speed. In LVCSR, statistical methods are used to predict the likelihood of different word sequences, hence the accuracy is much higher than that of the single-word lookup of a phonetic search. If a word can be found, the probability that it was actually spoken is very high. Although the initial processing of audio takes a fair amount of time, searching is quick because only simple text-to-text matching is needed.

On the other hand, LVCSR is susceptible to the common issues of speech recognition. The inherently random nature of audio and external noise both affect the accuracy of text-based indexing. Another problem with LVCSR is its over-reliance on its dictionary database. LVCSR only recognizes words that are found in its dictionary, and these dictionaries are unable to keep up with the constant evolution of new terminology, names and words. Should the dictionary not contain a word, there is no way for the system to identify or predict it, which reduces the accuracy and reliability of the system. This is known as the out-of-vocabulary (OOV) problem. Audio mining systems try to cope with OOV by continuously updating the dictionary and language model used, but the problem remains significant and has prompted a search for alternatives. Additionally, the need to constantly update and maintain task-based knowledge and large training databases in order to cope with the OOV problem incurs high computational costs. This makes LVCSR an expensive approach to audio mining.
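To make the statistical prediction of word sequences and the OOV problem concrete, here is a minimal sketch of a bigram language model. It is one simple choice among many, not necessarily what any given LVCSR system uses, and the three-sentence training corpus is invented for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each word follows each context word (with a start marker)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        contexts.update(tokens[:-1])            # words that have a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sequence_probability(sentence, contexts, bigrams):
    """Maximum-likelihood probability of a word sequence under the bigram model."""
    tokens = ["<s>"] + sentence.lower().split()
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        if contexts[previous] == 0:
            return 0.0                          # context never seen in training
        probability *= bigrams[(previous, word)] / contexts[previous]
    return probability

# Hypothetical training corpus.
corpus = ["recognize speech", "recognize speech quickly", "wreck a nice beach"]
contexts, bigrams = train_bigram(corpus)

p_common = sequence_probability("recognize speech", contexts, bigrams)
p_rare = sequence_probability("wreck a nice beach", contexts, bigrams)
p_oov = sequence_probability("recognize podcast", contexts, bigrams)  # "podcast" is out of vocabulary
```

Without smoothing, any sequence containing an unseen word receives probability zero, which is a small-scale picture of why an out-of-vocabulary word cannot be identified or predicted by the system.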

Phonetic-based Indexing

Phonetic-based indexing also breaks the audio file into recognizable phonemes, but instead of converting them to a text index, it keeps them as they are and analyzes them to create a phonetic-based index. The process of phonetic-based indexing can be split into two phases. The first phase is indexing. It begins by converting the input media into a standard audio representation format (PCM). Then, an acoustic model is applied to the speech. This acoustic model represents characteristics of both an acoustic channel (the environment in which the speech was uttered and the transducer through which it was recorded) and a natural language (in which the input speech was expressed). This produces a corresponding phonetic search track, or phonetic audio track (PAT), a highly compressed representation of the phonetic content of the input media.

The second phase is searching. The user's search query term is parsed into a possible phoneme string using a phonetic dictionary. Then, multiple PAT files can be scanned at high speed during a single search for likely phonetic sequences that closely match the corresponding strings of phonemes in the query term.
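The searching phase can be sketched as approximate matching of a phoneme string against a stored phoneme track. The example below is a simplified illustration: the track is a plain list of symbols rather than a real PAT file, the phonetic dictionary is hypothetical, and edit distance with a `min_score` threshold is an assumed scoring method, not one the source specifies.

```python
# Hypothetical phonetic dictionary; real systems typically also apply
# letter-to-sound rules so that unlisted words can still be searched.
PHONETIC_DICT = {
    "data": ["D", "EY", "T", "AH"],
    "mining": ["M", "AY", "N", "IH", "NG"],
}

def query_to_phonemes(query):
    """Parse a text query into a phoneme string using the phonetic dictionary."""
    phonemes = []
    for word in query.lower().split():
        phonemes.extend(PHONETIC_DICT[word])
    return phonemes

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]

def search_track(track, query, min_score=0.8):
    """Slide over the track and report (position, score) for close phonetic matches."""
    target = query_to_phonemes(query)
    n = len(target)
    hits = []
    for start in range(len(track) - n + 1):
        score = 1 - edit_distance(track[start:start + n], target) / n
        if score >= min_score:
            hits.append((start, round(score, 2)))
    return hits

# "the data mining tool" as a clean track, and with one misrecognized phoneme (N heard as M).
clean = ["DH", "AH", "D", "EY", "T", "AH", "M", "AY", "N", "IH", "NG", "T", "UW", "L"]
noisy = ["DH", "AH", "D", "EY", "T", "AH", "M", "AY", "M", "IH", "NG", "T", "UW", "L"]
clean_hits = search_track(clean, "data mining")
noisy_hits = search_track(noisy, "data mining")
```

Because matching is approximate, the query is still found in the noisy track, at a lower score. Lowering `min_score` would return more candidates, including more false matches.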

Advantages and disadvantages

Phonetic indexing is attractive mainly because it is largely unaffected by linguistic issues such as unrecognized words and spelling errors. Phonetic preprocessing maintains an open vocabulary that does not require updating, which makes it particularly useful for searching specialized terminology or words in foreign languages that do not commonly appear in dictionaries. It is also more effective for searching audio files with disruptive background noise or unclear utterances, as it can compile results based on the sounds it can discern, and the user can, if they wish, search through the options until they find the desired item. Furthermore, in contrast to LVCSR, it can process audio files very quickly, as there are very few unique phonemes between languages.

However, phonemes cannot be indexed as effectively as entire words, so searching on a phonetic-based system is slow. Another issue with phonetic indexing is its low accuracy: phoneme-based searches result in more false matches than text-based indexing. This is especially prevalent for short search terms, which are more likely to sound similar to other words or to be part of longer words. A search may also return irrelevant results from other languages. Unless the system recognizes exactly the entire word, or understands the phonetic sequences of languages, it is difficult for phonetic-based indexing to return accurate findings.

Speech processing and recognition system

Deemed the most critical and complex component of audio mining, speech recognition requires knowledge of the human speech production system and its modeling. To correspond to the human speech production system, the electrical speech production system is developed to consist of:
* Speech generation
* Speech perception
* Voiced and unvoiced speech
* Model of human speech

The electrical speech production system converts an acoustic signal into a corresponding representation of what was spoken, using the acoustic models in its software, in which all phonemes are represented. A statistical language model aids in the process by identifying how likely words are to follow each other in a given language. Together with a complex probability analysis, the speech recognition system is capable of taking an unknown speech signal and transcribing it into words based on the program's dictionary.

An automatic speech recognition (ASR) system includes:
* Acoustic analysis: the input sound waveform is transformed into a feature representation
* Acoustic model: establishes the relationship between the speech signal and phonemes, the pronunciation model and the language model. Training algorithms are applied to the speech database to create a statistical representation of each phoneme, thus generating an acoustic model with a set of phonemes and their probability measures.
* Pronunciation model: phonemes are mapped to specific words
* Language model: words are organized to form meaningful sentences

Applications of speech processing include speech recognition, speech coding, speaker authentication, speech enhancement and speech synthesis.
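The way the ASR components listed above combine is commonly written as a single decoding rule. The formulation below is the standard textbook one, not something stated in the source; the symbols are introduced here for illustration, with X the feature sequence produced by acoustic analysis and W = w_1 ... w_n a candidate word sequence. The bigram factorization of P(W) is one simple modeling assumption.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W}
          \underbrace{P(X \mid W)}_{\text{acoustic and pronunciation models}}
          \;\underbrace{P(W)}_{\text{language model}},
\qquad
P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```

The second equality follows from Bayes' rule, since P(X) is the same for every candidate W and can be dropped from the maximization.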

Feature extraction

As a prerequisite to the entire speech recognition process, feature extraction must first be established within the system. Audio files must be processed from start to end, ensuring that no important information is lost. Sound sources are differentiated through pitch, timbral features, rhythmic features, inharmonicity, autocorrelation and other features based on the signal's predictability, statistical pattern, and dynamic characteristics.

Feature extraction is standardized through the international MPEG-7 standard, in which features for audio or speech signal classification are fixed in terms of the techniques used to analyze and represent raw data.

Standard speech feature extraction techniques:
* Linear Predictive Coding (LPC) estimates the current speech sample by analyzing previous speech samples
* Mel-frequency cepstral coefficients (MFCC) represent the speech signal in parametric form using the mel scale
* Perceptual Linear Prediction (PLP) takes human speech perception into consideration

However, the three techniques are not ideal as non-stationary signals are ignored. Non-stationary signals can be analyzed using Fourier and short-time Fourier transforms, while time-varying signals are analyzed using wavelets and the discrete wavelet transform (DWT).
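Two of the techniques listed above can be illustrated in plain Python. This is a minimal sketch rather than a production feature extractor: `hz_to_mel` uses one common form of the mel-scale formula (several variants exist), and `lpc` uses the autocorrelation method with the Levinson-Durbin recursion, which is one standard way to compute linear prediction coefficients. The test signal is synthetic.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels using one common formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    return [sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
            for lag in range(max_lag + 1)]

def lpc(signal, order):
    """Linear prediction coefficients a[1..order] via Levinson-Durbin.

    The current sample is predicted from previous samples as
    x[n] ~ a[1]*x[n-1] + a[2]*x[n-2] + ... + a[order]*x[n-order].
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error == 0:
            break
        # Reflection coefficient for this model order.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / error
        updated = a[:]
        updated[i] = k_i
        for k in range(1, i):
            updated[k] = a[k] - k_i * a[i - k]
        a = updated
        error *= 1.0 - k_i * k_i
    return a[1:]

# Synthetic signal in which each sample is exactly half of the previous one.
signal = [0.5 ** n for n in range(40)]
coefficients = lpc(signal, 2)
mel_1khz = hz_to_mel(1000.0)
```

For this synthetic signal the first coefficient comes out very close to 0.5 and the second close to 0, reflecting that each sample is fully predicted by the one before it.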

Audio Classification

Audio classification is a form of supervised learning, and involves the analysis of audio recordings. It is split into several categories: acoustic data classification, environmental sound classification, musical classification, and natural language utterance classification. The features often used for this process are pitch, timbral features, rhythmic features, inharmonicity, and audio correlation, although other features may also be used. There are several approaches to audio classification using existing classifiers, such as k-nearest neighbors or the naïve Bayes classifier. Using annotated audio data, machines learn to identify and classify the sounds. There has also been research into using deep neural networks (DNNs) for speech recognition and audio classification, due to their effectiveness in other fields such as image classification. One method of using DNNs is to convert audio files into image files by way of spectrograms in order to perform classification.
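As an illustration of classification with an existing classifier, the sketch below implements k-nearest neighbors over small feature vectors. The two features and all numeric values are hypothetical, chosen only to make the example readable; a real system would extract features such as those listed above from annotated recordings.

```python
import math
from collections import Counter

def knn_classify(training_data, sample, k=3):
    """Label a feature vector by majority vote among its k nearest training examples."""
    neighbors = sorted(training_data,
                       key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical annotated data: (feature vector, label). The two features stand in
# for quantities such as a normalized pitch measure and a timbral measure.
training_data = [
    ((0.12, 0.30), "speech"),
    ((0.15, 0.35), "speech"),
    ((0.18, 0.28), "speech"),
    ((0.44, 0.10), "music"),
    ((0.52, 0.08), "music"),
    ((0.48, 0.12), "music"),
]

label_a = knn_classify(training_data, (0.16, 0.31))
label_b = knn_classify(training_data, (0.50, 0.09))
```

The annotated examples play the role of the supervisory signal: the classifier has no model of speech or music beyond the labeled points it is given.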

Applications of Audio Mining

Audio mining is used in areas such as musical audio mining (also known as music information retrieval), which relates to the identification of perceptually important characteristics of a piece of music, such as melodic, harmonic or rhythmic structure. Searches can then be carried out to find pieces of music that are similar in terms of their melodic, harmonic or rhythmic characteristics.

Within the field of linguistics, audio mining has been used for phonetic processing and semantic analysis. The efficiency of audio mining in processing audio-visual data aids speaker identification and segmentation, as well as text transcription. Through this process, speech can be categorized in order to identify information, or to extract information through keywords spoken in the audio. In particular, this has been used for speech analytics. Call centers have used the technology to conduct real-time analysis by identifying changes in tone, sentiment or pitch, among others, which are then processed by a decision engine or artificial intelligence to take further action. Further use has been seen in speech recognition and text-to-speech applications. It has also been used in conjunction with video mining, in projects such as mining movie data.

External links

Audio Processing and Speech Recognition: Concepts, Techniques and Research Overviews