Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans and to artificial processes of natural language processing.

Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and it cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Coarticulation, a phenomenon that may occur between adjacent words just as easily as within a single word, appears to be the main challenge in speech segmentation across languages; other problems, and the strategies used to address them, are described in the following sections.

This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese, in contrast to writing systems that indicate the boundaries between words with a word divider, such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese).


Lexical recognition

In natural languages, the meaning of a complex spoken sentence can be understood by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning with each segment, and combining those meanings according to the grammar rules of the language. Though lexical recognition is not thought to be used by infants in their first year, because of their highly limited vocabularies, it is one of the major processes involved in speech segmentation for adults. Three main models of lexical recognition exist in current research: first, whole-word access, which argues that words have a whole-word representation in the lexicon; second, decomposition, which argues that morphologically complex words are broken down into their morphemes (roots, stems, inflections, etc.) and then interpreted; and third, the view that whole-word and decomposition models are both used, but that the whole-word model provides some computational advantages and is therefore dominant in lexical recognition (Badecker, William, and Mark Allen. "Morphological Parsing and the Perception of Lexical Identity: A Masked Priming Study of Stem Homographs". Journal of Memory and Language 47.1 (2002): 125–144. Retrieved 27 April 2014).
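
The contrast between whole-word access and decomposition can be sketched as two lookup strategies over a toy lexicon. This is only an illustration, not a model from the cited literature: the word lists, the suffix inventory, and the function names `whole_word_lookup` and `decompose` are all assumptions made for the example.

```python
# Toy lexicons (hypothetical; real mental lexicons are not simple sets).
WHOLE_WORDS = {"cat", "cats", "fall", "falling"}   # every surface form stored
ROOTS = {"cat", "fall"}                            # only roots stored
SUFFIXES = ("ing", "s")                            # assumed suffix inventory


def whole_word_lookup(word):
    """Whole-word access: the full surface form must itself be stored."""
    return word in WHOLE_WORDS


def decompose(word):
    """Decomposition: strip a known suffix and look up the remaining root.

    Returns (root, suffix), or None if no analysis is found.
    """
    if word in ROOTS:
        return word, ""
    for suffix in SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in ROOTS:
            return root, suffix
    return None
```

With these toy lists, `decompose("falling")` gives `("fall", "ing")`, while `whole_word_lookup("falling")` succeeds only because the inflected form was stored explicitly.
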
To give an example, in a whole-word model, the word "cats" might be stored and searched for by letter, first "c", then "ca", "cat", and finally "cats". The same word, in a decompositional model, would likely be stored under the root word "cat" and could be searched for after removing the "s" suffix. "Falling", similarly, would be stored as "fall" and suffixed with the "ing" inflection. Though proponents of the decompositional model recognize that a morpheme-by-morpheme analysis may require significantly more computation, they argue that the unpacking of morphological information is necessary for other processes (such as syntactic structure) which may occur in parallel with lexical searches.

As a whole, research into systems of human lexical recognition is limited because little experimental evidence fully discriminates between the three main models. In any case, lexical recognition likely contributes significantly to speech segmentation through the contextual clues it provides, given that it is a heavily probabilistic system based on the statistical likelihood of certain words or constituents occurring together. For example, one can imagine a situation where a person might say "I bought my dog at a ____ shop" and the missing word's vowel is pronounced as in "net", "sweat", or "pet". While the probability of "netshop" is extremely low, since "netshop" isn't currently a compound or phrase in English, and "sweatshop" also seems contextually improbable, "pet shop" is a good fit because it is a common phrase and is also related to the word "dog".

Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics, which draw on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer.

Lexical recognition is of particular value in the field of computer speech recognition, since the ability to build and search a network of semantically connected ideas would greatly increase the effectiveness of speech-recognition software. Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video subtitling, and linguistic research. Automatic segmentation and alignment software is commercially available.
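
The idea that segmentation is a probabilistic choice among competing splits can be illustrated with a small dynamic-programming sketch. It is a deliberate simplification: letters stand in for speech sounds, words are scored independently (a unigram model), and the probabilities in the usage note are invented for the example. The function name `segment` and the lexicon are assumptions, not part of any particular speech recognition system.

```python
import math


def segment(text, word_probs):
    """Return the most probable split of `text` into known words, or None.

    `word_probs` maps each known word to a probability. Each word is scored
    independently, so the best split maximizes the sum of log-probabilities.
    """
    n = len(text)
    max_len = max(len(w) for w in word_probs)
    # best[i] = (best log-probability for text[:i], start index of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no sequence of known words covers the input
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With the invented lexicon `{"pet": 0.02, "shop": 0.02, "pets": 0.005, "hop": 0.005}`, `segment("petshop", ...)` prefers `["pet", "shop"]` over `["pets", "hop"]` because the first split has the higher combined probability. A more realistic model would also condition each word on its neighbours, which is what lets context such as "dog" favour "pet shop".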


Phonotactic cues

For most spoken languages, the boundaries between lexical units are difficult to identify; phonotactics are one answer to this issue. One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.

The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, may be a relic of alphabetic heritage for some language communities. In fact, the way vowels are produced depends on the surrounding consonants, just as consonants are affected by surrounding vowels; this is called coarticulation. For example, in the word "kit", the [k] is farther forward than when we say "caught". Likewise, the vowel in "kick" is phonetically different from the vowel in "kit", though we normally do not hear this. In addition, there are language-specific changes which occur in casual speech and make it quite different from spelling. For example, in English, the phrase "hit you" could often be more appropriately spelled "hitcha".

From a decompositional perspective, in many cases, phonotactics play a part in letting speakers know where to draw word boundaries. In English, the word "strawberry" is perceived by speakers as consisting (phonetically) of two parts: "straw" and "berry". Other interpretations such as "stra" and "wberry" are inhibited by English phonotactics, which does not allow the cluster "wb" word-initially. Other such examples are "day/dream" and "mile/stone", which are unlikely to be interpreted as "da/ydream" or "mil/estone" due to the phonotactic probability or improbability of certain clusters. The sentence "Five women left", which could be phonetically transcribed as [faɪvwɪmɘnlɛft], is marked, since neither /vw/ in /faɪvwɪmɘn/ nor /nl/ in /wɪmɘnlɛft/ is allowed as a syllable onset or coda in English phonotactics. These phonotactic cues often allow speakers to easily distinguish the boundaries between words.

Vowel harmony in languages like Finnish can also serve to provide phonotactic cues. While the system does not allow front vowels and back vowels to exist together within one morpheme, compounds allow two morphemes to maintain their own vowel harmony while coexisting in a word. Therefore, in compounds such as "selkä/ongelma" ('back problem'), where vowel harmony is distinct between two constituents in a compound, the boundary will be wherever the switch in harmony takes place: between the "ä" and the "o" in this case. Still, there are instances where phonotactics may not aid in segmentation. Words with unclear clusters or uncontrasted vowel harmony, as in "opinto/uudistus" ('student reform'), do not offer phonotactic clues as to how they are segmented. From the perspective of the whole-word model, however, these words are thought to be stored as full words, so the constituent parts wouldn't necessarily be relevant to lexical recognition.
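
The vowel harmony cue can be sketched as a scan for the first point where a word switches between front and back harmonic vowels. This is a minimal illustration working on spelling rather than sound, and it assumes the usual Finnish grouping of "ä", "ö", "y" as front vowels, "a", "o", "u" as back vowels, and "e", "i" as neutral. The function name `harmony_switch` is hypothetical.

```python
import unicodedata

# Assumed vowel classes for Finnish orthography; "e" and "i" are neutral.
FRONT = set(unicodedata.normalize("NFC", "äöy"))
BACK = set("aou")


def harmony_switch(word):
    """Find the first switch between front and back harmonic vowels.

    Returns (i, j), the indices of the last vowel of one class and the first
    vowel of the other, or None if the word never switches. A compound
    boundary, if there is one, lies after index i and no later than index j.
    """
    word = unicodedata.normalize("NFC", word.lower())
    prev_idx, prev_class = None, None
    for idx, ch in enumerate(word):
        if ch in FRONT:
            cls = "front"
        elif ch in BACK:
            cls = "back"
        else:
            continue  # consonants and neutral vowels carry no harmony cue
        if prev_class is not None and cls != prev_class:
            return prev_idx, idx
        prev_idx, prev_class = idx, cls
    return None
```

For "selkäongelma" the function returns `(4, 5)`, the positions of "ä" and "o", so the only possible boundary is "selkä" + "ongelma". For "opintouudistus" it returns `None`, matching the observation that uncontrasted harmony gives no segmentation cue.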


Speech segmentation in infants and non-natives

Infants are one major focus of research in speech segmentation. Since infants have not yet acquired a lexicon capable of providing extensive contextual clues or probability-based word searches within their first year, as mentioned above, they must often rely primarily upon phonotactic and rhythmic cues (with prosody being the dominant cue), all of which are language-specific. Between 6 and 9 months, infants begin to lose the ability to discriminate between sounds not present in their native language and grow sensitive to the sound structure of their native language, with word segmentation abilities appearing around 7.5 months.

Though much more research needs to be done on the exact processes that infants use to begin speech segmentation, current and past studies suggest that English-native infants treat stressed syllables as the beginnings of words. At 7.5 months, infants appear to be able to segment bisyllabic words with strong-weak stress patterns, though weak-strong stress patterns are often misinterpreted, e.g. interpreting "guiTAR is" as "GUI TARis". Infants also seem to show some sophistication in tracking the frequency and probability of words, for instance recognizing that although the syllables "the" and "dog" occur together frequently, "the" also commonly occurs with other syllables, which may lead to the analysis that "dog" is an individual word or concept instead of the interpretation "thedog".

Language learners are another group studied in speech segmentation research. In some ways, learning to segment speech may be more difficult for a second-language learner than for an infant, not only because of the lack of familiarity with sound probabilities and restrictions but particularly because of the overapplication of the native language's patterns. While some patterns may be shared between languages, as in the syllabic segmentation of French and English, they may not work well with languages such as Japanese, which has a mora-based segmentation system. Further, clusters such as /ld/, which phonotactic restrictions make boundary-marking in German or Dutch, are permitted in English without necessarily marking boundaries. Even the relationship between stress and vowel length, which may seem intuitive to speakers of English, may not exist in other languages, so second-language learners face an especially great challenge when learning a language and its segmentation cues (Tyler, Michael D., and Anne Cutler. "Cross-Language Differences in Cue Use for Speech Segmentation". Journal of the Acoustical Society of America 126 (2009): 367–376. Retrieved 27 April 2014).
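
Two of the infant strategies described in this section, tracking how predictably one syllable follows another and treating stressed syllables as word beginnings, can be sketched in a few lines. Both functions are illustrative simplifications rather than models from the cited studies; the names `transitional_probabilities` and `segment_at_strong_syllables`, and the toy input in the usage note, are assumptions.

```python
from collections import Counter


def transitional_probabilities(utterances):
    """Estimate P(next syllable | current syllable) from syllable sequences.

    `utterances` is a list of lists of syllables. Returns a dict mapping
    (current, next) pairs to relative frequencies.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for syllables in utterances:
        for current, following in zip(syllables, syllables[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


def segment_at_strong_syllables(syllables):
    """Start a new word at every strong (stressed) syllable.

    `syllables` is a list of (text, is_strong) pairs.
    """
    words, current = [], ""
    for text, is_strong in syllables:
        if is_strong and current:
            words.append(current)
            current = ""
        current += text
    if current:
        words.append(current)
    return words
```

Given the toy input `[["the", "dog"], ["the", "cat"], ["the", "ba", "by"], ["the", "dog"], ["ba", "by"]]`, the pair ("the", "dog") gets a probability of 0.5 while ("ba", "by") gets 1.0, so "the" is a weak predictor of what follows and a boundary after it is plausible. The stress heuristic reproduces the missegmentation mentioned above: `[("gui", False), ("tar", True), ("is", False)]` becomes `["gui", "taris"]`.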


See also

* Ambiguity
* Speech recognition
* Speech processing
* Hyphenation
* Mondegreen
* Speech perception
* Sentence boundary disambiguation


External links


* "Phonolyze" speech segmentation software
* SPPAS - the automatic annotation and analysis of speech