Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans and to artificial processes of natural language processing.

Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and it cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Coarticulation, a phenomenon which may happen between adjacent words just as easily as within a single word, appears to present the main challenge in speech segmentation across languages; other problems, and the strategies employed in solving them, are described in the following sections.

This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese, in contrast to writing systems which mark the boundaries between words with a word divider, such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese).
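The idea of a probabilistic rather than categorical division can be illustrated on the simpler text-segmentation problem. The following is a minimal sketch, not a description of any particular system: it assumes a hypothetical toy table of word probabilities (`WORD_PROBS`) and uses dynamic programming to pick the most likely split of an unspaced string. The function name `segment` and the `max_len` parameter are illustrative choices.

```python
import math

# Hypothetical toy probabilities; a real system would estimate them from a corpus.
WORD_PROBS = {
    "how": 0.05, "to": 0.08, "recognize": 0.01, "speech": 0.01,
    "wreck": 0.002, "a": 0.09, "nice": 0.01, "beach": 0.005,
}


def segment(text, word_probs=WORD_PROBS, max_len=12):
    """Return the most probable split of `text` into known words, or None."""
    n = len(text)
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is None or word not in word_probs:
                continue
            score = best[start][0] + math.log(word_probs[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return None if best[n] is None else best[n][1]


print(segment("howtorecognizespeech"))
# ['how', 'to', 'recognize', 'speech']
```

When several splits are possible, the one with the highest total log-probability wins. Speech is harder than this sketch suggests, because the input is a continuous, coarticulated signal rather than a clean string of symbols.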

Lexical recognition

In natural languages, the meaning of a complex spoken sentence can be understood by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and combining those meanings according to the grammar rules of the language.

Though lexical recognition is not thought to be used by infants in their first year, due to their highly limited vocabularies, it is one of the major processes involved in speech segmentation for adults. Three main models of lexical recognition exist in current research: first, whole-word access, which argues that words have a whole-word representation in the lexicon; second, decomposition, which argues that morphologically complex words are broken down into their morphemes (roots, stems, inflections, etc.) and then interpreted; and third, the view that whole-word and decomposition models are both used, but that the whole-word model provides some computational advantages and is therefore dominant in lexical recognition (Badecker, William and Mark Allen, "Morphological Parsing and the Perception of Lexical Identity: A Masked Priming Study of Stem Homographs", ''Journal of Memory and Language'' 47.1 (2002): 125–144. Retrieved 27 April 2014).

To give an example, in a whole-word model, the word "cats" might be stored and searched for by letter, first "c", then "ca", "cat", and finally "cats". The same word, in a decompositional model, would likely be stored under the root word "cat" and could be searched for after removing the "s" suffix. "Falling", similarly, would be stored as "fall" and suffixed with the "ing" inflection. Though proponents of the decompositional model recognize that a morpheme-by-morpheme analysis may require significantly more computation, they argue that the unpacking of morphological information is necessary for other processes (such as syntactic structure) which may occur in parallel with lexical searches.

As a whole, research into systems of human lexical recognition is limited, because little experimental evidence fully discriminates between the three main models. In any case, lexical recognition likely contributes significantly to speech segmentation through the contextual clues it provides, given that it is a heavily probabilistic system, based on the statistical likelihood of certain words or constituents occurring together. For example, one can imagine a situation where a person might say "I bought my dog at a ____ shop" and the missing word's vowel is pronounced as in "net", "sweat", or "pet". While the probability of "netshop" is extremely low, since "netshop" isn't currently a compound or phrase in English, and "sweatshop" also seems contextually improbable, "pet shop" is a good fit because it is a common phrase and is also related to the word "dog".

Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics, which draw on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer.

Lexical recognition is of particular value in the field of computer speech recognition, since the ability to build and search a network of semantically connected ideas would greatly increase the effectiveness of speech-recognition software. Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video subtitling, and linguistic research. Automatic segmentation and alignment software is commercially available.
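The contrast between the whole-word and decompositional models described in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a model of human lexical access: the lexicons, the suffix list, and the function names (`build_trie`, `whole_word_lookup`, `decompositional_lookup`) are hypothetical, and the whole-word store is represented as a letter trie so that the search proceeds "c", "ca", "cat", "cats".

```python
# Hypothetical toy lexicons for illustration only.
WHOLE_WORDS = ["cat", "cats", "fall", "falling"]
ROOTS = {"cat", "fall"}
SUFFIXES = ("ing", "s")

END = "$"  # marker meaning "a stored word ends here"


def build_trie(words):
    """Store full word forms letter by letter in nested dictionaries."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return trie


def whole_word_lookup(word, trie):
    """Walk the trie one letter at a time; return (prefixes visited, found)."""
    node, visited = trie, []
    for i, ch in enumerate(word, start=1):
        if ch not in node:
            return visited, False
        node = node[ch]
        visited.append(word[:i])
    return visited, END in node


def decompositional_lookup(word, roots=ROOTS, suffixes=SUFFIXES):
    """Strip a known suffix and look up the remaining root; return (root, suffix) or None."""
    if word in roots:
        return word, ""
    for suffix in suffixes:
        if word.endswith(suffix) and word[: -len(suffix)] in roots:
            return word[: -len(suffix)], suffix
    return None


trie = build_trie(WHOLE_WORDS)
print(whole_word_lookup("cats", trie))    # (['c', 'ca', 'cat', 'cats'], True)
print(decompositional_lookup("cats"))     # ('cat', 's')
print(decompositional_lookup("falling"))  # ('fall', 'ing')
```

The whole-word store needs an entry for every inflected form, while the decompositional store is smaller but does extra work at lookup time, which mirrors the trade-off between the two models.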

Phonotactic cues

For most spoken languages, the boundaries between lexical units are difficult to identify; phonotactics are one answer to this issue. One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.

The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, may be a relic of alphabetic heritage for some language communities. In fact, the way vowels are produced depends on the surrounding consonants, just as consonants are affected by surrounding vowels; this is called coarticulation. For example, in the word "kit", the [k] is farther forward than when we say "caught". But also, the vowel in "kick" is phonetically different from the vowel in "kit", though we normally do not hear this. In addition, there are language-specific changes which occur in casual speech and which make it quite different from spelling. For example, in English, the phrase "hit you" could often be more appropriately spelled "hitcha".

From a decompositional perspective, in many cases, phonotactics play a part in letting speakers know where to draw word boundaries. In English, the word "strawberry" is perceived by speakers as consisting (phonetically) of two parts: "straw" and "berry". Other interpretations such as "stra" and "wberry" are inhibited by English phonotactics, which does not allow the cluster "wb" word-initially. Other such examples are "day/dream" and "mile/stone", which are unlikely to be interpreted as "da/ydream" or "mil/estone" due to the phonotactic probability or improbability of certain clusters.
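As a rough illustration of how a cluster constraint prunes candidate boundaries, the sketch below enumerates the splits of "strawberry" that a tiny, hypothetical inventory of permitted onsets and codas allows. It works on spelling as a stand-in for phonemes, and the sets `LEGAL_ONSETS` and `LEGAL_CODAS` are assumptions made for this example, not a description of English phonotactics.

```python
VOWELS = set("aeiouy")

# Hypothetical, deliberately tiny inventories (letter-based, for illustration only).
LEGAL_ONSETS = {"", "b", "r", "st", "str", "w"}
LEGAL_CODAS = {"", "n", "r", "w"}


def onset(chunk):
    """Consonant letters before the first vowel letter."""
    for i, ch in enumerate(chunk):
        if ch in VOWELS:
            return chunk[:i]
    return chunk


def coda(chunk):
    """Consonant letters after the last vowel letter."""
    for i in range(len(chunk) - 1, -1, -1):
        if chunk[i] in VOWELS:
            return chunk[i + 1:]
    return chunk


def permitted_splits(word):
    """Return two-part splits whose edge clusters are in the toy inventories."""
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        # Each part must contain a vowel to count as a candidate unit.
        if not (VOWELS & set(left) and VOWELS & set(right)):
            continue
        if coda(left) in LEGAL_CODAS and onset(right) in LEGAL_ONSETS:
            splits.append((left, right))
    return splits


splits = permitted_splits("strawberry")
print(("straw", "berry") in splits)   # True
print(("stra", "wberry") in splits)   # False: "wb" is not a permitted onset
```

Phonotactics alone narrows the candidates rather than selecting a single answer; in this sketch "strawber/ry" also survives, and other cues such as the lexicon would be needed to rule it out.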
The sentence "Five women left", which could be phonetically transcribed as [faɪvwɪmɘnlɛft], is marked, since neither /vw/ in /faɪvwɪmɘn/ nor /nl/ in /wɪmɘnlɛft/ is allowed as a syllable onset or coda in English phonotactics. These phonotactic cues often allow speakers to easily distinguish the boundaries between words.

Vowel harmony in languages like Finnish can also serve to provide phonotactic cues. While the system does not allow front vowels and back vowels to exist together within one morpheme, compounds allow two morphemes to maintain their own vowel harmony while coexisting in a word. Therefore, in compounds such as "selkä/ongelma" ('back problem'), where vowel harmony is distinct between two constituents in a compound, the boundary will be wherever the switch in harmony takes place: between the "ä" and the "o" in this case. Still, there are instances where phonotactics may not aid in segmentation. Words with unclear clusters or uncontrasted vowel harmony, as in "opinto/uudistus" ('student reform'), do not offer phonotactic clues as to how they are segmented. From the perspective of the whole-word model, however, these words are thought to be stored as full words, so the constituent parts would not necessarily be relevant to lexical recognition.
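The vowel-harmony cue can be sketched as a scan for the point where the harmony class changes. The example below assumes the usual Finnish grouping of front vowels (ä, ö, y) and back vowels (a, o, u), with e and i treated as neutral; the function name `harmony_switches` is hypothetical, and the code is only a sketch of the cue, not a full segmenter.

```python
import unicodedata

FRONT = set("\u00e4\u00f6y")  # ä, ö, y
BACK = set("aou")             # e and i are neutral and are skipped


def harmony_switches(word):
    """Return (last index of old class, first index of new class) for each switch."""
    word = unicodedata.normalize("NFC", word.lower())
    switches = []
    last_class, last_index = None, None
    for i, ch in enumerate(word):
        if ch in FRONT:
            current = "front"
        elif ch in BACK:
            current = "back"
        else:
            continue  # consonants and neutral vowels carry no harmony information
        if last_class is not None and current != last_class:
            switches.append((last_index, i))
        last_class, last_index = current, i
    return switches


print(harmony_switches("selk\u00e4ongelma"))  # [(4, 5)]: boundary between "ä" and "o"
print(harmony_switches("opintouudistus"))     # []: no switch, so no cue
```

A boundary is only located exactly when the two vowels are adjacent, as in "selkäongelma"; when consonants intervene, the switch narrows the boundary to a span rather than a single position.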

In infants and non-natives

Infants are one major focus of research in speech segmentation. Since infants have not yet acquired, within their first year, a lexicon capable of providing extensive contextual clues or probability-based word searches, as mentioned above, they must often rely primarily upon phonotactic and rhythmic cues (with prosody being the dominant cue), all of which are language-specific. Between 6 and 9 months, infants begin to lose the ability to discriminate between sounds not present in their native language and grow sensitive to the sound structure of their native language, with word segmentation abilities appearing around 7.5 months.

Though much more research needs to be done on the exact processes that infants use to begin speech segmentation, current and past studies suggest that English-native infants treat stressed syllables as the beginnings of words. At 7.5 months, infants appear to be able to segment bisyllabic words with strong-weak stress patterns, though weak-strong stress patterns are often misinterpreted, e.g. interpreting "guiTAR is" as "GUI TARis". Infants also appear to track the frequency and probability of words with some sophistication, for instance recognizing that although the syllables "the" and "dog" occur together frequently, "the" also commonly occurs with other syllables, which may lead to the analysis that "dog" is an individual word or concept instead of the interpretation "thedog".

Language learners are another group studied in speech segmentation research. In some ways, learning to segment speech may be more difficult for a second-language learner than for an infant, not only because of the lack of familiarity with sound probabilities and restrictions but particularly because of the overapplication of the native language's patterns. While some patterns may be shared between languages, as in the syllabic segmentation of French and English, they may not work well with languages such as Japanese, which has a mora-based segmentation system. Further, clusters such as /ld/, which mark a boundary in German or Dutch because of phonotactic restrictions, are permitted in English without necessarily marking a boundary. Even the relationship between stress and vowel length, which may seem intuitive to speakers of English, may not exist in other languages, so second-language learners face an especially great challenge when learning a language and its segmentation cues (Tyler, Michael D. and Anne Cutler, "Cross-Language Differences in Cue Use for Speech Segmentation", ''Journal of the Acoustical Society of America'' 126 (2009): 367–376. Retrieved 27 April 2014).
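The observation about "the" and "dog" in the discussion of infants can be expressed as a simple statistic: how often one syllable is followed by another, relative to how often the first syllable is followed by anything. The sketch below computes this over a hypothetical toy set of syllabified utterances. It is one possible way to formalize the idea, offered as an assumption rather than as the mechanism infants actually use, and the name `transitional_probabilities` is illustrative.

```python
from collections import Counter


def transitional_probabilities(utterances):
    """Map each adjacent syllable pair (a, b) to count(a, b) / count(a followed by anything)."""
    pair_counts = Counter()
    first_counts = Counter()
    for syllables in utterances:
        for a, b in zip(syllables, syllables[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical toy input: each utterance is a list of syllables.
utterances = [
    ["the", "dog"],
    ["the", "ba", "by"],
    ["the", "kit", "ten"],
    ["a", "ba", "by"],
]

tp = transitional_probabilities(utterances)
print(tp[("ba", "by")])    # 1.0: "ba" is always followed by "by"
print(tp[("the", "dog")])  # about 0.33: "the" is followed by several different syllables
```

In this toy data, the low value for "the" followed by "dog" is consistent with a word boundary between them, while the high value for "ba" followed by "by" is consistent with the two syllables belonging to one word.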

References

External links

"Phonolyze" speech segmentation software

SPPAS – the automatic annotation and analysis of speech