Temporal envelope (ENV) and temporal fine structure (TFS) are changes in the amplitude and frequency of sound perceived by humans over time. These temporal changes are responsible for several aspects of auditory perception, including loudness, pitch and timbre perception and spatial hearing.

Complex sounds such as speech or music are decomposed by the peripheral auditory system of humans into narrow frequency bands. The resulting narrow-band signals convey information at different time scales ranging from less than one millisecond to hundreds of milliseconds. A dichotomy between slow "temporal envelope" cues and faster "temporal fine structure" cues has been proposed to study several aspects of auditory perception (e.g., loudness, pitch and timbre perception, auditory scene analysis, sound localization) at two distinct time scales in each frequency band. Over the last decades, a wealth of psychophysical, electrophysiological and computational studies based on this envelope/fine-structure dichotomy have examined the role of these temporal cues in sound identification and communication, how these temporal cues are processed by the peripheral and central auditory system, and the effects of aging and cochlear damage on temporal auditory processing. Although the envelope/fine-structure dichotomy has been debated and questions remain as to how temporal fine structure cues are actually encoded in the auditory system, these studies have led to a range of applications in various fields including speech and audio processing, clinical audiology and rehabilitation of sensorineural hearing loss via hearing aids or cochlear implants.

Definition

Notions of temporal envelope and temporal fine structure may have different meanings in many studies. An important distinction to make is between the physical (i.e., acoustical) and the biological (or perceptual) description of these ENV and TFS cues. Any sound whose frequency components cover a narrow range (called a narrowband signal) can be considered as an envelope (ENV_p, where p denotes the physical signal) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFS_p). Many sounds in everyday life, including speech and music, are broadband; the frequency components spread over a wide range and there is no well-defined way to represent the signal in terms of ENV_p and TFS_p.

However, in a normally functioning cochlea, complex broadband signals are decomposed by the filtering on the basilar membrane (BM) within the cochlea into a series of narrowband signals. Therefore, the waveform at each place on the BM can be considered as an envelope (ENV_BM) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFS_BM). The ENV_BM and TFS_BM depend on the place along the BM. At the apical end, which is tuned to low (audio) frequencies, ENV_BM and TFS_BM vary relatively slowly with time, while at the basal end, which is tuned to high frequencies, both ENV_BM and TFS_BM vary more rapidly with time.

Both ENV_BM and TFS_BM are represented in the time patterns of action potentials in the auditory nerve; these are denoted ENV_n and TFS_n. TFS_n is represented most prominently in neurons tuned to low frequencies, while ENV_n is represented most prominently in neurons tuned to high (audio) frequencies. For a broadband signal, it is not possible to manipulate TFS_p without affecting ENV_BM and ENV_n, and it is not possible to manipulate ENV_p without affecting TFS_BM and TFS_n.
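As a worked illustration of the narrowband decomposition described in this section, a sinusoidally amplitude-modulated tone can be written as an envelope multiplied by a carrier. The symbols m (modulation depth), f_m (modulation rate) and f_c (carrier frequency) are notation assumed here for illustration and do not come from the text; reading the bracketed factor as ENV_p is only appropriate when f_m is much lower than f_c.

```latex
x(t) = \underbrace{\bigl[\,1 + m\cos(2\pi f_m t)\,\bigr]}_{\mathrm{ENV}_p(t)}
       \;\underbrace{\cos(2\pi f_c t)}_{\mathrm{TFS}_p(t)},
\qquad 0 \le m \le 1, \quad f_m \ll f_c
```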

Temporal envelope (ENV) processing

Neurophysiological aspects

The neural representation of stimulus envelope, ENV_n, has typically been studied using well-controlled ENV_p modulations, that is, sinusoidally amplitude-modulated (AM) sounds. Cochlear filtering limits the range of AM rates encoded in individual auditory-nerve fibers. In the auditory nerve, the strength of the neural representation of AM decreases with increasing modulation rate. At the level of the cochlear nucleus, several cell types show an enhancement of ENV_n information. Multipolar cells can show band-pass tuning to AM tones with AM rates between 50 and 1000 Hz. Some of these cells show an excellent response to the ENV_n and provide inhibitory sideband inputs to other cells in the cochlear nucleus, giving a physiological correlate of comodulation masking release, a phenomenon whereby the detection of a signal in a masker is improved when the masker has correlated envelope fluctuations across frequency (see section below).

Responses to the temporal-envelope cues of speech or other complex sounds persist up the auditory pathway, eventually to the various fields of the auditory cortex in many animals. In the primary auditory cortex, responses can encode AM rates by phase-locking up to about 20–30 Hz, while faster rates induce sustained and often tuned responses. A topographical representation of AM rate has been demonstrated in the primary auditory cortex of awake macaques. This representation is approximately perpendicular to the axis of the tonotopic gradient, consistent with an orthogonal organization of spectral and temporal features in the auditory cortex. Combining these temporal responses with the spectral selectivity of A1 neurons gives rise to the spectro-temporal receptive fields that often capture cortical responses to complex modulated sounds well. In secondary auditory cortical fields, responses become temporally more sluggish and spectrally broader, but are still able to phase-lock to the salient features of speech and musical sounds.

Tuning to AM rates below about 64 Hz is also found in the human auditory cortex, as revealed by brain-imaging techniques (fMRI) and cortical recordings in epileptic patients (electrocorticography). This is consistent with neuropsychological studies of brain-damaged patients and with the notion that the central auditory system performs some form of spectral decomposition of the ENV_p of incoming sounds. The ranges over which cortical responses encode the temporal-envelope cues of speech well have been shown to be predictive of the human ability to understand speech. In the human superior temporal gyrus (STG), an anterior-posterior spatial organization of spectro-temporal modulation tuning has been found in response to speech sounds, the posterior STG being tuned for temporally fast varying speech sounds with low spectral modulations and the anterior STG being tuned for temporally slow varying speech sounds with high spectral modulations.

One unexpected aspect of phase locking in the auditory cortex has been observed in the responses elicited by complex acoustic stimuli with spectrograms that exhibit relatively slow envelopes (< 20 Hz), but that are carried by fast modulations that are as high as hundreds of hertz. Speech and music, as well as various modulated noise stimuli, have such temporal structure. For these stimuli, cortical responses phase-lock to ''both'' the envelope and the fine structure induced by interactions between unresolved harmonics of the sound, thus reflecting the pitch of the sound, and exceeding the typical lower limits of cortical phase-locking to envelopes of a few tens of hertz. This paradoxical relation between the slow and fast cortical phase-locking to the carrier “fine structure” has been demonstrated both in the auditory and visual cortices. It has also been shown to be amply manifested in measurements of the spectro-temporal receptive fields of the primary auditory cortex, giving them unexpectedly fine temporal accuracy and selectivity bordering on a 5–10 ms resolution.

The underlying causes of this phenomenon have been attributed to several possible origins, including nonlinear synaptic depression and facilitation, and/or a cortical network of thalamic excitation and cortical inhibition. There are many functionally significant and perceptually relevant reasons for the coexistence of these two complementary dynamic response modes. They include the ability to accurately encode onsets and other rapid ‘events’ in the ENV_p of complex acoustic and other sensory signals, features that are critical for the perception of consonants (speech) and percussive sounds (music), as well as the texture of complex sounds.

Psychoacoustical aspects

The perception of ENV_p depends on which AM rates are contained in the signal. Low rates of AM, in the 1–8 Hz range, are perceived as changes in intensity, that is, loudness fluctuations (a percept that can also be evoked by frequency modulation, FM); at higher rates, AM is perceived as roughness, with the greatest roughness sensation occurring at around 70 Hz; at even higher rates, AM can evoke a weak pitch percept corresponding to the modulation rate. Rainstorms, crackling fire, chirping crickets or galloping horses produce "sound textures" (the collective result of many similar acoustic events), whose perception is mediated by ENV_n statistics.

The auditory detection threshold for AM as a function of AM rate, referred to as the temporal modulation transfer function (TMTF), is best for AM rates in the range from 4 to 150 Hz and worsens outside that range. The cutoff frequency of the TMTF gives an estimate of temporal acuity (temporal resolution) for the auditory system. This cutoff frequency corresponds to a time constant of about 1–3 ms for the auditory system of normal-hearing humans.

Correlated envelope fluctuations across frequency in a masker can aid detection of a pure tone signal, an effect known as comodulation masking release. AM applied to a given carrier can perceptually interfere with the detection of a target AM imposed on the same carrier, an effect termed ''modulation masking''. Modulation-masking patterns are tuned (greater masking occurs for masking and target AMs close in modulation rate), suggesting that the human auditory system is equipped with frequency-selective channels for AM. Moreover, AM applied to spectrally remote carriers can perceptually interfere with the detection of AM on a target sound, an effect termed ''modulation detection interference''. The notion of modulation channels is also supported by the demonstration of selective adaptation effects in the modulation domain. These studies show that AM detection thresholds are selectively elevated above pre-exposure thresholds when the carrier frequency and the AM rate of the adaptor are similar to those of the test tone.

Human listeners are sensitive to relatively slow "second-order" AM cues, which correspond to fluctuations in the strength of AM. These cues arise from the interaction of different modulation rates, previously described as "beating" in the envelope-frequency domain. Perception of second-order AM has been interpreted as resulting from nonlinear mechanisms in the auditory pathway that produce an audible distortion component at the envelope beat frequency in the internal modulation spectrum of the sounds. Interaural time differences in the envelope provide binaural cues even at high frequencies where TFS_n cannot be used.
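The link between a TMTF cutoff frequency and a time constant can be checked with a short calculation. The sketch below assumes the TMTF behaves like a first-order lowpass filter, for which the time constant is 1/(2π × cutoff); that filter shape is an assumption made here for illustration. The example cutoffs of 60 and 150 Hz are the range this article quotes for the lowpass stage of envelope-extraction models, and the function name is hypothetical.

```python
import math

def time_constant_ms(cutoff_hz: float) -> float:
    """Time constant (in ms) of a first-order lowpass filter with the given -3 dB cutoff.

    Assumption: the TMTF is treated as a first-order lowpass, so tau = 1 / (2*pi*fc).
    """
    return 1000.0 / (2.0 * math.pi * cutoff_hz)

# Example cutoffs (Hz) taken from the lowpass range quoted for envelope models.
for cutoff in (60.0, 150.0):
    print(f"cutoff {cutoff:.0f} Hz -> time constant {time_constant_ms(cutoff):.2f} ms")
```

Under this assumption, both example cutoffs give time constants inside the 1–3 ms range stated above.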

Models of normal envelope processing

Diagram of the envelope perception model

The most basic computer model of ENV processing is the ''leaky integrator model''. This model extracts the temporal envelope of the sound (ENV_p) via bandpass filtering, half-wave rectification (which may be followed by fast-acting amplitude compression), and lowpass filtering with a cutoff frequency between about 60 and 150 Hz. The leaky integrator is often used with a decision statistic based on either the resulting envelope power, the max/min ratio, or the crest factor. This model accounts for the loss of auditory sensitivity for AM rates higher than about 60–150 Hz for broadband noise carriers.

Based on the concept of frequency selectivity for AM, the perception model of Torsten Dau incorporates broadly tuned bandpass modulation filters (with a Q value around 1) to account for data from a broad variety of psychoacoustic tasks, particularly AM detection for noise carriers with different bandwidths, taking into account their intrinsic envelope fluctuations. This model has been extended to account for comodulation masking release (see sections above). The shapes of the modulation filters have been estimated, and an “envelope power spectrum model” (EPSM) based on these filters can account for AM masking patterns and AM depth discrimination. The EPSM has been extended to the prediction of speech intelligibility and to account for data from a broad variety of psychoacoustic tasks. A physiologically based processing model simulating brainstem responses has also been developed to account for AM detection and AM masking patterns.
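The leaky integrator stages can be sketched in a few lines of Python. This is a minimal illustration, not a published implementation. It assumes the input is already the output of a single auditory filter (so the bandpass stage is skipped), omits compression, and uses a one-pole lowpass with a 100 Hz cutoff chosen from within the 60–150 Hz range. All function names and stimulus parameters are hypothetical.

```python
import math

def sam_tone(fs, dur_s, carrier_hz, mod_hz, depth):
    """Sinusoidally amplitude-modulated tone (illustrative test stimulus)."""
    n = int(fs * dur_s)
    return [(1.0 + depth * math.sin(2.0 * math.pi * mod_hz * i / fs))
            * math.sin(2.0 * math.pi * carrier_hz * i / fs) for i in range(n)]

def leaky_integrator_envelope(signal, fs, cutoff_hz=100.0):
    """Half-wave rectify, then smooth with a one-pole lowpass (the leaky integrator)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    envelope, state = [], 0.0
    for sample in signal:
        rectified = sample if sample > 0.0 else 0.0  # half-wave rectification
        state += alpha * (rectified - state)         # leaky integration
        envelope.append(state)
    return envelope

def max_min_ratio(envelope):
    """Decision statistic: ratio of envelope maximum to envelope minimum."""
    return max(envelope) / min(envelope)

def crest_factor(envelope):
    """Decision statistic: envelope peak divided by envelope RMS."""
    rms = math.sqrt(sum(v * v for v in envelope) / len(envelope))
    return max(envelope) / rms

FS = 16000
SKIP = FS // 10  # drop the first 100 ms so the filter start-up transient is excluded
slow = leaky_integrator_envelope(sam_tone(FS, 1.0, 2000.0, 8.0, 0.5), FS)[SKIP:]
fast = leaky_integrator_envelope(sam_tone(FS, 1.0, 2000.0, 400.0, 0.5), FS)[SKIP:]
print("max/min ratio, 8 Hz AM:  ", round(max_min_ratio(slow), 2))
print("max/min ratio, 400 Hz AM:", round(max_min_ratio(fast), 2))
```

With the same modulation depth, the slow modulation survives the lowpass stage and gives a larger max/min ratio than the fast one. This is the qualitative behaviour the model uses to explain reduced sensitivity to high AM rates.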

Temporal fine structure (TFS) processing

Neurophysiological aspects

Phase locking recorded from a neuron in the cochlear nucleus

The neural representation of temporal fine structure, TFS_n, has been studied using stimuli with well-controlled TFS_p: pure tones, harmonic complex tones, and frequency-modulated (FM) tones. Auditory-nerve fibers are able to represent low-frequency sounds via their phase-locked discharges (i.e., TFS_n information). The upper frequency limit for phase locking is species dependent. It is about 5 kHz in the cat, 9 kHz in the barn owl and just 4 kHz in the guinea pig. We do not know the upper limit of phase locking in humans, but current, indirect estimates suggest it is about 4–5 kHz. Phase locking is a direct consequence of the transduction process, with an increase in the probability of transduction channel opening occurring with a stretching of the stereocilia and a decrease in channel opening occurring when they are pushed in the opposite direction. This has led some to suggest that phase locking is an epiphenomenon. The upper limit appears to be determined by a cascade of low-pass filters at the level of the inner hair cell and auditory-nerve synapse.

TFS_n information in the auditory nerve may be used to encode the (audio) frequency of low-frequency sounds, including single tones and more complex stimuli such as frequency-modulated tones or steady-state vowels (see role and applications to speech and music). The auditory system goes to some length to preserve this TFS_n information with the presence of giant synapses (endbulbs of Held) in the ventral cochlear nucleus. These synapses contact bushy cells (spherical and globular) and faithfully transmit (or enhance) the temporal information present in the auditory nerve fibers to higher structures in the brainstem. The bushy cells project to the medial superior olive and the globular cells project to the medial nucleus of the trapezoid body (MNTB). The MNTB is also characterized by giant synapses (calyces of Held) and provides precisely timed inhibition to the lateral superior olive. The medial and lateral superior olive and MNTB are involved in the encoding of interaural time and intensity differences.

There is general acceptance that the temporal information is crucial in sound localization, but it is still contentious as to whether the same temporal information is used to encode the frequency of complex sounds. Several problems remain with the idea that the TFS_n is important in the representation of the frequency components of complex sounds. The first problem is that the temporal information deteriorates as it passes through successive stages of the auditory pathway (presumably due to low-pass dendritic filtering). Therefore, the second problem is that the temporal information must be extracted at an early stage of the auditory pathway. No such stage has currently been identified, although there are theories about how temporal information can be converted into rate information (see section Models of normal processing: limitations).
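Phase locking is commonly quantified with the vector strength of the spike train, a measure the text above does not name. The sketch below computes it for simulated spikes. The simulation is a toy: one spike per stimulus cycle with a fixed Gaussian timing jitter of 0.1 ms. That jitter model, its value and the function names are assumptions made for illustration and do not represent the low-pass filter cascade described above.

```python
import math
import random

def vector_strength(spike_times_s, freq_hz):
    """Vector strength: 1.0 if every spike falls at the same stimulus phase,
    near 0.0 if spike phases are spread uniformly (no phase locking)."""
    phases = [2.0 * math.pi * freq_hz * t for t in spike_times_s]
    cos_sum = sum(math.cos(p) for p in phases)
    sin_sum = sum(math.sin(p) for p in phases)
    return math.hypot(cos_sum, sin_sum) / len(spike_times_s)

def jittered_spikes(freq_hz, n_spikes, jitter_s, rng):
    """Toy spike train: one spike per cycle, displaced by Gaussian timing jitter."""
    return [k / freq_hz + rng.gauss(0.0, jitter_s) for k in range(n_spikes)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
JITTER_S = 0.0001        # 0.1 ms timing jitter (illustrative assumption)
vs_low = vector_strength(jittered_spikes(500.0, 2000, JITTER_S, rng), 500.0)
vs_high = vector_strength(jittered_spikes(5000.0, 2000, JITTER_S, rng), 5000.0)
print("vector strength at 500 Hz: ", round(vs_low, 3))
print("vector strength at 5000 Hz:", round(vs_high, 3))
```

In this toy model the same timing jitter that leaves phase locking almost intact at 500 Hz removes most of it at 5 kHz. This gives an intuition for why phase locking has an upper frequency limit, without claiming to reproduce the physiological mechanism.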

Psychoacoustical aspects

It is often assumed that many perceptual capacities rely on the ability of the monaural and binaural auditory system to encode and use TFS_n cues evoked by components in sounds with frequencies below about 1–4 kHz. These capacities include discrimination of frequency, discrimination of the fundamental frequency of harmonic sounds, detection of FM at rates below 5 Hz, melody recognition for sequences of pure tones and complex tones, lateralization and localization of pure tones and complex tones, and segregation of concurrent harmonic sounds (such as speech sounds). It appears that TFS_n cues require correct tonotopic (place) representation to be processed optimally by the auditory system. Moreover, musical pitch perception has been demonstrated for complex tones with all harmonics above 6 kHz, demonstrating that it is not entirely dependent on neural phase locking to TFS_BM (i.e., TFS_n) cues.

As for FM detection, the current view assumes that in the normal auditory system, FM is encoded via TFS_n cues when the FM rate is low (<5 Hz) and when the carrier frequency is below about 4 kHz, and via ENV_n cues when the FM is fast or when the carrier frequency is higher than 4 kHz. This is supported by single-unit recordings in the low brainstem. According to this view, TFS_n cues are not used to detect FM with rates above about 10 Hz because the mechanism decoding the TFS_n information is “sluggish” and cannot track rapid changes in frequency. Several studies have shown that auditory sensitivity to slow FM at low carrier frequency is associated with speech identification for both normal-hearing and hearing-impaired individuals when speech reception is limited by acoustic degradations (e.g., filtering) or concurrent speech sounds. This suggests that robust speech intelligibility is determined by accurate processing of TFS_n cues.

Models of normal processing: limitations

The separation of a sound into ENV_p and TFS_p appears inspired partly by how sounds are synthesized and by the availability of a convenient way to separate an existing sound into ENV and TFS, namely the Hilbert transform. There is a risk that this view of auditory processing is dominated by these physical/technical concepts, similarly to how cochlear frequency-to-place mapping was for a long time conceptualized in terms of the Fourier transform. Physiologically, there is no indication of a separation of ENV and TFS in the auditory system for stages up to the cochlear nucleus. Only at that stage does it appear that parallel pathways, potentially enhancing ENV_n or TFS_n information (or something akin to it), may be implemented through the temporal response characteristics of different cochlear nucleus cell types. It may therefore be useful to better simulate cochlear nucleus cell types to understand the true concepts for parallel processing created at the level of the cochlear nucleus. These concepts may be related to separating ENV and TFS but are unlikely to be realized in the manner of the Hilbert transform.

A computational model of the peripheral auditory system may be used to simulate auditory-nerve fiber responses to complex sounds such as speech, and quantify the transmission (i.e., internal representation) of ENV_n and TFS_n cues. In two simulation studies, the mean-rate and spike-timing information was quantified at the output of such a model to characterize, respectively, the short-term rate of neural firing (ENV_n) and the level of synchronization due to phase locking (TFS_n) in response to speech sounds degraded by vocoders. The best model predictions of vocoded-speech intelligibility were found when both ENV_n and TFS_n cues were included, providing evidence that TFS_n cues are important for intelligibility when the speech ENV_p cues are degraded. At a more fundamental level, similar computational modeling was used to demonstrate that the functional dependence of human just-noticeable frequency differences on pure-tone frequency was not accounted for unless temporal information was included (most notably for mid-to-high frequencies, even above the nominal cutoff in physiological phase locking). However, a caveat of most TFS models is that optimal model performance with temporal information typically over-estimates human performance.

An alternative view is to assume that TFS_n information at the level of the auditory nerve is converted into rate-place (ENV_n) information at a later stage of the auditory system (e.g., the low brainstem). Several modelling studies proposed that the neural mechanisms for decoding TFS_n are based on correlation of the outputs of adjacent places.
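The Hilbert-transform separation mentioned in this section can be sketched with the Python standard library alone. The example builds the analytic signal with a naive discrete Fourier transform, then takes its magnitude as ENV_p and the cosine of its phase as TFS_p. It assumes a narrowband test signal that is exactly periodic within the analysis window. The function names and test parameters are hypothetical, and the slow DFT is used only to keep the sketch free of dependencies.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a naive DFT: keep DC, double positive-frequency bins,
    zero negative-frequency bins, then inverse-transform."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2.0
        elif k > n / 2:
            spectrum[k] = 0.0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def hilbert_decompose(x):
    """Return (ENV_p, TFS_p): magnitude and cosine-of-phase of the analytic signal."""
    z = analytic_signal(x)
    envelope = [abs(v) for v in z]
    fine_structure = [math.cos(cmath.phase(v)) for v in z]
    return envelope, fine_structure

# Test signal: carrier of 32 cycles per window, modulated at 4 cycles per window.
N = 256
true_env = [1.0 + 0.5 * math.cos(2.0 * math.pi * 4 * t / N) for t in range(N)]
signal = [true_env[t] * math.cos(2.0 * math.pi * 32 * t / N) for t in range(N)]

env, tfs = hilbert_decompose(signal)
env_error = max(abs(env[t] - true_env[t]) for t in range(N))
resynthesis_error = max(abs(env[t] * tfs[t] - signal[t]) for t in range(N))
print("largest envelope error:   ", env_error)
print("largest resynthesis error:", resynthesis_error)
```

For this idealized narrowband signal the recovered envelope matches the imposed modulator, and multiplying envelope by fine structure returns the original waveform. For broadband sounds the decomposition is not well defined, which is the limitation discussed above.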

Role in speech and music perception

Role of temporal envelope in speech and music perception

Modulation spectra of English and French

The ENV_p plays a critical role in many aspects of auditory perception, including in the perception of speech and music. Speech recognition is possible using cues related to the ENV_p, even in situations where the original spectral information and TFS_p are highly degraded. Indeed, when the spectrally local TFS_p from one sentence is combined with the ENV_p from a second sentence, only the words of the second sentence are heard. The ENV_p rates most important for speech are those below about 16 Hz, corresponding to fluctuations at the rate of syllables. On the other hand, the fundamental frequency (“pitch”) contour of speech sounds is primarily conveyed via TFS_p cues, although some information on the contour can be perceived via rapid envelope fluctuations corresponding to the fundamental frequency. For music, slow ENV_p rates convey rhythm and tempo information, whereas more rapid rates convey the onset and offset properties of sound (attack and decay, respectively) that are important for timbre perception.

Role of TFS in speech and music perception

The ability to accurately process TFS_p information is thought to play a role in our perception of pitch (i.e., the perceived height of sounds), an important sensation for music perception, as well as our ability to understand speech, especially in the presence of background noise.

Role of TFS in pitch perception

Although pitch retrieval mechanisms in the auditory system are still a matter of debate, TFS_n information may be used to retrieve the pitch of low-frequency pure tones and estimate the individual frequencies of the low-numbered (ca. 1st-8th) harmonics of a complex sound, frequencies from which the fundamental frequency of the sound can be retrieved according to, e.g., pattern-matching models of pitch perception. A role of TFS_n information in pitch perception of complex sounds containing intermediate harmonics (ca. 7th-16th) has also been suggested and may be accounted for by temporal or spectrotemporal models of pitch perception. The degraded TFS_n cues conveyed by cochlear implant devices may also be partly responsible for impaired music perception of cochlear implant recipients.

Role of TFS cues in speech perception

TFS_p cues are thought to be important for the identification of speakers and for tone identification in tonal languages. In addition, several vocoder studies have suggested that TFS_p cues contribute to the intelligibility of speech in quiet and noise. Although it is difficult to isolate TFS_p from ENV_p cues, there is evidence from studies in hearing-impaired listeners that speech perception in the presence of background noise can be partly accounted for by the ability to accurately process TFS_p, although the ability to “listen in the dips” of fluctuating maskers does not seem to depend on periodic TFS_p cues.

Role in environmental sound perception

Environmental sounds can be broadly defined as nonspeech and nonmusical sounds in the listener's environment that can convey meaningful information about surrounding objects and events. Environmental sounds are highly heterogeneous in terms of their acoustic characteristics and source types, and may include human and animal vocalizations, water- and weather-related events, and mechanical and electronic signaling sounds. Given the great variety of sound sources that give rise to environmental sounds, both ENV_p and TFS_p play an important role in their perception. However, the relative contributions of ENV_p and TFS_p can differ considerably for specific environmental sounds. This is reflected in the variety of acoustic measures that correlate with different perceptual characteristics of objects and events.

Early studies highlighted the importance of envelope-based temporal patterning in perception of environmental events. For instance, Warren & Verbrugge demonstrated that constructed sounds of a glass bottle dropped on the floor were perceived as bouncing when high-energy regions in four different frequency bands were temporally aligned, producing amplitude peaks in the envelope. In contrast, when the same spectral energy was distributed randomly across bands, the sounds were heard as breaking. More recent studies using vocoder simulations of cochlear implant processing demonstrated that many temporally patterned sounds can be perceived with little original spectral information, based primarily on temporal cues. Sounds such as footsteps, horse galloping, helicopter flying, ping-pong playing, clapping and typing were identified with a high accuracy of 70% or more with a single channel of envelope-modulated broadband noise or with only two frequency channels. In these studies, envelope-based acoustic measures such as the number of bursts and peaks in the envelope were predictive of listeners’ abilities to identify sounds based primarily on ENV_p cues.
On the other hand, identification of brief environmental sounds without strong temporal patterning in ENV_p may require a much larger number of frequency channels to perceive. Sounds such as a car horn or a train whistle were poorly identified even with as many as 32 frequency channels. Listeners with cochlear implants, which transmit envelope information for specific frequency bands, but do not transmit TFS_p, have considerably reduced abilities in identification of common environmental sounds. In addition, individual environmental sounds are typically heard within the context of larger auditory scenes where sounds from multiple sources may overlap in time and frequency. When heard within an auditory scene, accurate identification of individual environmental sounds is contingent on the ability to segregate them from other sound sources or auditory streams in the auditory scene, which involves further reliance on ENV_p and TFS_p cues (see Role in auditory scene analysis).
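
The single-channel condition and the burst-count measure mentioned above can be illustrated with a minimal sketch. It is not the processing used in the cited studies: the envelope extractor (full-wave rectification followed by a one-pole low-pass filter), the 50 Hz cutoff, the hysteresis thresholds, and all function names are assumptions chosen for illustration.

```python
import math
import random

def extract_envelope(signal, fs, cutoff_hz=50.0):
    """Full-wave rectify, then smooth with a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, state = [], 0.0
    for x in signal:
        state += alpha * (abs(x) - state)
        env.append(state)
    return env

def single_channel_vocoder(signal, fs, cutoff_hz=50.0, seed=0):
    """Impose the broadband envelope of `signal` on a white-noise carrier."""
    rng = random.Random(seed)
    env = extract_envelope(signal, fs, cutoff_hz)
    return [e * rng.uniform(-1.0, 1.0) for e in env]

def count_bursts(env, on_level, off_level):
    """Count envelope bursts; hysteresis keeps ripple from being double-counted."""
    bursts, active = 0, False
    for e in env:
        if not active and e >= on_level:
            bursts += 1
            active = True
        elif active and e < off_level:
            active = False
    return bursts

# Toy temporally patterned sound: three 50-ms tone bursts separated by 100-ms gaps.
fs = 8000
sound = []
for _ in range(3):
    start = len(sound)
    sound += [math.sin(2 * math.pi * 1000 * (start + i) / fs) for i in range(400)]
    sound += [0.0] * 800

vocoded = single_channel_vocoder(sound, fs)
n_bursts = count_bursts(extract_envelope(sound, fs), on_level=0.3, off_level=0.15)
```

The vocoded output keeps the on/off pattern of the three bursts while replacing the 1 kHz fine structure with noise, which is the sense in which such stimuli convey mainly ENV_p cues.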

Role in auditory scene analysis

Auditory scene analysis refers to the ability to perceive separately sounds coming from different sources. Any acoustical difference can potentially lead to auditory segregation, and so any cues based either on ENV_p or TFS_p are likely to assist in segregating competing sound sources. Such cues involve percepts such as pitch. Binaural TFS_p cues producing interaural time differences have not always resulted in clear source segregation, particularly with simultaneously presented sources, although successful segregation of sequential sounds, such as noise or speech, has been reported.

Effects of age and hearing loss on temporal envelope processing

Developmental aspects

In infancy, behavioral AM detection thresholds and forward or backward masking thresholds observed in 3-month-olds are similar to those observed in adults. Electrophysiological studies conducted in 1-month-old infants using 2000 Hz AM pure tones indicate some immaturity in envelope following response (EFR). Although sleeping infants and sedated adults show the same effect of modulation rate on EFR, infants’ estimates were generally poorer than adults’. This is consistent with behavioral studies conducted with school-age children showing differences in AM detection thresholds compared to adults. Children systematically show worse AM detection thresholds than adults until 10–11 years. However, the shape of the TMTF (the cutoff) in children as young as 5 years is similar to that of adults. Sensory versus non-sensory factors for this long maturation are still debated, but the results generally appear to be more dependent on the task or on sound complexity for infants and children than for adults. Regarding the development of speech ENV_p processing, vocoder studies suggest that infants as young as 3 months are able to discriminate a change in consonants when the faster ENV_p information of the syllables is preserved (< 256 Hz) but less so when only the slowest ENV_p is available (< 8 Hz). Older children of 5 years show abilities similar to those of adults to discriminate consonant changes based on ENV_p cues (< 64 Hz).

Neurophysiological aspects

The effects of hearing loss and age on neural coding are generally believed to be smaller for slowly varying envelope responses (i.e., ENV_n) than for rapidly varying temporal fine structure (i.e., TFS_n). Enhanced ENV_n coding following noise-induced hearing loss has been observed in peripheral auditory responses from single neurons and in central evoked responses from the auditory midbrain. The enhancement in ENV_n coding of narrowband sounds occurs across the full range of modulation frequencies encoded by single neurons. For broadband sounds, the range of modulation frequencies encoded in impaired responses is broader than normal (extending to higher frequencies), as expected from reduced frequency selectivity associated with outer-hair-cell dysfunction. The enhancement observed in neural envelope responses is consistent with enhanced auditory perception of modulations following cochlear damage, which is commonly believed to result from loss of cochlear compression that occurs with outer-hair-cell dysfunction due to age or noise overexposure. However, the influence of inner-hair-cell dysfunction (e.g., shallower response growth for mild-moderate damage and steeper growth for severe damage) can confound the effects of outer-hair-cell dysfunction on overall response growth and thus ENV_n coding. Thus, not surprisingly the relative effects of outer-hair-cell and inner-hair-cell dysfunction have been predicted with modeling to create individual differences in speech intelligibility based on the strength of envelope coding of speech relative to noise.

Psychoacoustical aspects

For sinusoidal carriers, which have no intrinsic envelope (ENV_p) fluctuations, the TMTF is roughly flat for AM rates from 10 to 120 Hz, but increases (i.e. threshold worsens) for higher AM rates, provided that spectral sidebands are not audible. The shape of the TMTF for sinusoidal carriers is similar for young and older people with normal audiometric thresholds, but older people tend to have higher detection thresholds overall, suggesting poorer “detection efficiency” for ENV_n cues in older people. Provided that the carrier is fully audible, the ability to detect AM is usually not adversely affected by cochlear hearing loss and may sometimes be better than normal, for both noise carriers and sinusoidal carriers, perhaps because loudness recruitment (an abnormally rapid growth of loudness with increasing sound level) “magnifies” the perceived amount of AM (i.e., ENV_n cues). Consistent with this, when the AM is clearly audible, a sound with a fixed AM depth appears to fluctuate more for an impaired ear than for a normal ear. However, the ability to detect changes in AM depth can be impaired by cochlear hearing loss. Speech processed with a noise vocoder, such that mainly envelope information is delivered in multiple spectral channels, has also been used to investigate envelope processing in hearing impairment. Here, hearing-impaired individuals could not make use of such envelope information as well as normal-hearing individuals, even after audibility factors were taken into account. Additional experiments suggest that age negatively affects the binaural processing of ENV_p, at least at low audio frequencies.
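
As an illustration of the stimuli behind such AM detection measurements, the sketch below generates a sinusoidally amplitude-modulated tone and expresses modulation depth m in decibels as 20 log10(m), a common convention for plotting AM thresholds. The function names and parameter values are assumptions for illustration only. The sideband helper reflects the fact that modulating a sinusoidal carrier at rate fm adds components at fc - fm and fc + fm, which is why sideband audibility matters at high AM rates.

```python
import math

def sam_tone(fc, fm, m, fs, dur):
    """Sinusoidally amplitude-modulated tone: [1 + m*sin(2*pi*fm*t)] * sin(2*pi*fc*t)."""
    n = int(round(fs * dur))
    return [(1.0 + m * math.sin(2 * math.pi * fm * i / fs))
            * math.sin(2 * math.pi * fc * i / fs) for i in range(n)]

def depth_to_db(m):
    """Modulation depth m (0 < m <= 1) expressed in dB; 0 dB is 100% modulation."""
    return 20.0 * math.log10(m)

def sideband_frequencies(fc, fm):
    """Frequencies of the two spectral sidebands created by the modulation."""
    return (fc - fm, fc + fm)

# Hypothetical example: 1 kHz carrier, 40 Hz modulation, 50% depth, 100 ms.
tone = sam_tone(fc=1000, fm=40, m=0.5, fs=16000, dur=0.1)
depth_db = depth_to_db(0.5)
sidebands = sideband_frequencies(1000, 40)
```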

Models of impaired temporal envelope processing

The perception model of ENV processing that incorporates selective (bandpass) AM filters accounts for many perceptual consequences of cochlear dysfunction, including enhanced sensitivity to AM for sinusoidal and noise carriers, abnormal forward masking (the rate of recovery from forward masking being generally slower than normal for impaired listeners), stronger interference effects between AM and FM, and enhanced temporal integration of AM. The model of Torsten Dau has been extended to account for the discrimination of complex AM patterns by hearing-impaired individuals and the effects of noise-reduction systems. The performance of the hearing-impaired individuals was best captured when the model combined the loss of peripheral amplitude compression resulting from the loss of the active mechanism in the cochlea with an increase in internal noise in the ENV_n domain. Phenomenological models simulating the response of the peripheral auditory system showed that impaired AM sensitivity in individuals experiencing chronic tinnitus with clinically normal audiograms could be predicted by substantial loss of auditory-nerve fibers with low spontaneous rates and some loss of auditory-nerve fibers with high spontaneous rates.

Effects of age and hearing loss on TFS processing

Developmental aspects

Very few studies have systematically assessed TFS processing in infants and children. Frequency-following response (FFR), thought to reflect phase-locked neural activity, appears to be adult-like in 1-month-old infants when using a pure tone (centered at 500, 1000 or 2000 Hz) modulated at 80 Hz with 100% modulation depth. As for behavioral data, six-month-old infants require larger frequency transitions than adults to detect an FM change in a 1-kHz tone. However, 4-month-old infants are able to discriminate two different FM sweeps, and they are more sensitive to FM cues swept from 150 Hz to 550 Hz than at lower frequencies. In school-age children, performance in detecting FM change improves between 6 and 10 years, and sensitivity to low modulation rate (2 Hz) is poor until 9 years. For speech sounds, only one vocoder study has explored the ability of school-age children to rely on TFS_p cues to detect consonant changes, showing that 5-year-olds have the same abilities as adults.

Neurophysiological aspects

Psychophysical studies have suggested that degraded TFS processing due to age and hearing loss may underlie some suprathreshold deficits, such as those in speech perception; however, debate remains about the underlying neural correlates. The strength of phase locking to the temporal fine structure of signals (TFS_n) in quiet listening conditions remains normal in peripheral single-neuron responses following cochlear hearing loss. Although these data suggest that the fundamental ability of auditory-nerve fibers to follow the rapid fluctuations of sound remains intact following cochlear hearing loss, deficits in phase-locking strength do emerge in background noise. This finding, which is consistent with the common observation that listeners with cochlear hearing loss have more difficulty in noisy conditions, results from reduced cochlear frequency selectivity associated with outer-hair-cell dysfunction. Although only limited effects of age and hearing loss have been observed in terms of TFS_n coding strength of narrowband sounds, more dramatic deficits have been observed in TFS_n coding quality in response to broadband sounds, which are more relevant for everyday listening. A dramatic loss of tonotopicity can occur following noise-induced hearing loss, where auditory-nerve fibers that should be responding to mid frequencies (e.g., 2–4 kHz) have dominant TFS responses to lower frequencies (e.g., 700 Hz). Notably, the loss of tonotopicity generally occurs only for TFS_n coding but not for ENV_n coding, which is consistent with greater perceptual deficits in TFS processing. This tonotopic degradation is likely to have important implications for speech perception, and can account for degraded coding of vowels following noise-induced hearing loss in which most of the cochlea responds to only the first formant, eliminating the normal tonotopic representation of the second and third formants.

Psychoacoustical aspects

Several psychophysical studies have shown that older people with normal hearing and people with sensorineural hearing loss often show impaired performance for auditory tasks that are assumed to rely on the ability of the monaural and binaural auditory system to encode and use TFS_n cues, such as: discrimination of sound frequency, discrimination of the fundamental frequency of harmonic sounds, detection of FM at rates below 5 Hz, melody recognition for sequences of pure tones and complex sounds, lateralization and localization of pure tones and complex tones, and segregation of concurrent harmonic sounds (such as speech sounds). However, it remains unclear to what extent deficits associated with hearing loss reflect poorer TFS_n processing or reduced cochlear frequency selectivity.

Models of impaired processing

The quality of the representation of a sound in the auditory nerve is limited by refractoriness, adaptation, saturation, and reduced synchronization (phase locking) at high frequencies, as well as by the stochastic nature of action potentials. However, the auditory nerve contains thousands of fibers. Hence, despite these limiting factors, the properties of sounds are reasonably well represented in the ''population'' nerve response over a wide range of levels and audio frequencies (see volley theory). The coding of temporal information in the auditory nerve can be disrupted by two main mechanisms: reduced synchrony and loss of synapses and/or auditory nerve fibers. The impact of disrupted temporal coding on human auditory perception has been explored using physiologically inspired signal-processing tools. The reduction in neural synchrony has been simulated by jittering the phases of the multiple frequency components in speech, although this has undesired effects in the spectral domain. The loss of auditory nerve fibers or synapses has been simulated by assuming (i) that each afferent fiber operates as a stochastic sampler of the sound waveform, with greater probability of firing for higher-intensity and sustained sound features than for lower-intensity or transient features, and (ii) that deafferentation can be modeled by reducing the number of samplers. However, this also has undesired effects in the spectral domain. Both jittering and stochastic undersampling degrade the representation of the TFS_n more than the representation of the ENV_n. Both also impair the recognition of speech in noisy backgrounds without degrading recognition in silence, supporting the argument that TFS_n is important for recognizing speech in noise, and both mimic the effects of aging on speech perception.
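
The stochastic-sampler idea in (i) and (ii) can be conveyed with a toy sketch. This is a deliberately simplified assumption, not the published simulation: each hypothetical sampler "fires" on a given sample with probability proportional to the instantaneous magnitude of the waveform, and the outputs of all samplers are averaged. With this rule the average is an unbiased but noisy estimate of the waveform, and the noise grows as the number of samplers is reduced.

```python
import math
import random

def stochastic_undersample(signal, n_samplers, seed=0):
    """Average the outputs of n_samplers independent stochastic samplers.

    Each sampler emits a signed unit-size event with probability |x| / peak,
    so higher-intensity portions of the waveform are sampled more often.
    """
    rng = random.Random(seed)
    peak = max(abs(x) for x in signal) or 1.0
    total = [0.0] * len(signal)
    for _ in range(n_samplers):
        for i, x in enumerate(signal):
            if rng.random() < abs(x) / peak:
                total[i] += math.copysign(peak, x)
    return [v / n_samplers for v in total]

def rms_error(reference, estimate):
    """Root-mean-square difference between two equal-length sequences."""
    n = len(reference)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n)

# Hypothetical comparison: many samplers ("intact") versus few ("deafferented").
waveform = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
intact = stochastic_undersample(waveform, n_samplers=500, seed=1)
deafferented = stochastic_undersample(waveform, n_samplers=10, seed=1)
err_intact = rms_error(waveform, intact)
err_deafferented = rms_error(waveform, deafferented)
```

In this sketch the sampling noise is added sample by sample, so it perturbs the rapid fluctuations of the waveform far more than any slowly varying average of it, which is loosely analogous to the greater degradation of TFS_n than of ENV_n described above.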

Transmission by hearing aids and cochlear implants

Temporal envelope transmission

Individuals with cochlear hearing loss usually have a smaller than normal dynamic range between the level of the weakest detectable sound and the level at which sounds become uncomfortably loud. To compress the large range of sound levels encountered in everyday life into the small dynamic range of the hearing-impaired person, hearing aids apply amplitude compression, which is also called automatic gain control (AGC). The basic principle of such compression is that the amount of amplification applied to the incoming sound progressively decreases as the input level increases. Usually, the sound is split into several frequency “channels”, and AGC is applied independently in each channel. As a result of compressing the level, AGC reduces the amount of envelope fluctuation in the input signal (ENV_p) by an amount that depends on the rate of fluctuation and the speed with which the amplification changes in response to changes in input sound level. AGC can also change the shape of the envelope of the signal.
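
The dependence of envelope flattening on compression speed can be illustrated with a minimal sketch that operates directly on an envelope rather than on a waveform. The single-time-constant level detector, the 3:1 compression ratio, and the function names are assumptions for illustration; real hearing-aid AGC typically uses separate attack and release times and a compression threshold.

```python
import math

def compress_envelope(env, fs, ratio=3.0, time_constant_s=0.01):
    """Apply a gain that falls as the smoothed level rises (compression ratio `ratio`)."""
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * fs))
    level = env[0]
    out = []
    for e in env:
        level += alpha * (e - level)            # level detector (one-pole smoother)
        gain = max(level, 1e-9) ** (1.0 / ratio - 1.0)
        out.append(e * gain)
    return out

def modulation_depth(env):
    """Modulation index estimated as (max - min) / (max + min)."""
    hi, lo = max(env), min(env)
    return (hi - lo) / (hi + lo)

# Envelope with 50% modulation at 4 Hz, sampled at 1 kHz for 2 s.
fs = 1000
env_in = [1.0 + 0.5 * math.sin(2 * math.pi * 4 * i / fs) for i in range(2 * fs)]

depth_in = modulation_depth(env_in)
depth_fast = modulation_depth(compress_envelope(env_in, fs, time_constant_s=0.005))
depth_slow = modulation_depth(compress_envelope(env_in, fs, time_constant_s=2.0))
```

With the fast time constant the gain follows the 4 Hz fluctuation and the output modulation depth is markedly reduced; with the slow time constant the gain is nearly constant over a modulation cycle and the depth is almost unchanged.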

Cochlear implants are devices that electrically stimulate the auditory nerve, thereby creating the sensation of sound in a person who would otherwise be profoundly or totally deaf. The electrical dynamic range is very small, so cochlear implants usually incorporate AGC prior to the signal being filtered into multiple frequency channels. The channel signals are then subjected to instantaneous compression to map them into the limited dynamic range for each channel.

Cochlear implants differ from hearing aids in that acoustic hearing is entirely replaced with direct electric stimulation of the auditory nerve, achieved via an electrode array placed inside the cochlea. Hence, factors other than device signal processing also contribute strongly to overall hearing, such as etiology, nerve health, electrode configuration and proximity to the nerve, and the overall process of adapting to an entirely new mode of hearing. Almost all information in cochlear implants is conveyed by the envelope fluctuations in the different channels. This is sufficient to give reasonable perception of speech in quiet, but not in noisy or reverberant conditions. The processing in cochlear implants is such that the TFS_p is discarded in favor of fixed-rate pulse trains amplitude-modulated by the ENV_p within each frequency band. Implant users are sensitive to these ENV_p modulations, but performance varies across stimulation site, stimulation level, and across individuals. The TMTF shows a low-pass filter shape similar to that observed in normal-hearing listeners. Voice pitch or musical pitch information, conveyed primarily via weak periodicity cues in the ENV_p, results in a pitch sensation that is not salient enough to support music perception, talker sex identification, lexical tones, or prosodic cues. Listeners with cochlear implants are susceptible to interference in the modulation domain, which likely contributes to difficulties listening in noise.

Temporal fine structure transmission

Hearing aids usually process sounds by filtering them into multiple frequency channels and applying AGC in each channel. Other signal processing in hearing aids, such as noise reduction, also involves filtering the input into multiple channels. The filtering into channels can affect the TFS_p of sounds depending on characteristics such as the phase response and group delay of the filters. However, such effects are usually small. Cochlear implants also filter the input signal into frequency channels. Usually, the ENV_p of the signal in each channel is transmitted to the implanted electrodes in the form of electrical pulses of fixed rate that are modulated in amplitude or duration. Information about TFS_p is discarded. This is justified by the observation that people with cochlear implants have a very limited ability to process TFS_p information, even if it is transmitted to the electrodes, perhaps because of a mismatch between the temporal information and the place in the cochlea to which it is delivered. Reducing this mismatch may improve the ability to use TFS_p information and hence lead to better pitch perception. Some cochlear implant systems transmit information about TFS_p in the channels of the cochlear implants that are tuned to low audio frequencies, and this may improve the pitch perception of low-frequency sounds.
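
For one channel, the envelope-to-pulse mapping described above can be sketched as follows. This is an illustrative assumption rather than any manufacturer's strategy: the envelope is sampled at a fixed pulse rate and mapped by a logarithmic function onto a current range between a hypothetical threshold level (`t_level`) and comfort level (`c_level`), in arbitrary units. The timing of the pulses is fixed, so no TFS_p information is carried.

```python
import math

def envelope_to_pulses(env, fs, pulse_rate_hz, t_level, c_level, floor=1e-3):
    """Return (time_s, amplitude) pairs for a fixed-rate, amplitude-modulated pulse train.

    `env` is a channel envelope normalized to the range 0..1. Values are
    clipped to [floor, 1] and log-mapped so that floor -> t_level and 1 -> c_level.
    """
    step = fs / pulse_rate_hz                   # samples between successive pulses
    pulses = []
    k = 0
    while int(k * step) < len(env):
        i = int(k * step)
        e = min(max(env[i], floor), 1.0)
        frac = math.log(e / floor) / math.log(1.0 / floor)
        pulses.append((i / fs, t_level + frac * (c_level - t_level)))
        k += 1
    return pulses

# Hypothetical channel envelope: 10 Hz fluctuation between 0 and 1, lasting 1 s.
fs = 16000
channel_env = [0.5 * (1.0 + math.sin(2 * math.pi * 10 * i / fs)) for i in range(fs)]
pulse_train = envelope_to_pulses(channel_env, fs, pulse_rate_hz=1000,
                                 t_level=100.0, c_level=200.0)
```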

Training effects and plasticity of temporal-envelope processing

Perceptual learning resulting from training has been reported for various auditory AM detection or discrimination tasks, suggesting that the responses of central auditory neurons to ENV_p cues are plastic and that practice may modify the circuitry of ENV_n processing. The plasticity of ENV_n processing has been demonstrated in several ways. For instance, the ability of auditory-cortex neurons to discriminate voice-onset time cues for phonemes is degraded following moderate hearing loss (20-40 dB HL) induced by acoustic trauma. Interestingly, developmental hearing loss reduces cortical responses to slow, but not fast (100 Hz) AM stimuli, in parallel with behavioral performance. As a matter of fact, a transient hearing loss (15 days) occurring during the "critical period" is sufficient to elevate AM thresholds in adult gerbils. Even non-traumatic noise exposure reduces the phase-locking ability of cortical neurons as well as the animals' behavioral capacity to discriminate between different AM sounds. Behavioral training or pairing protocols involving neuromodulators also alter the ability of cortical neurons to phase lock to AM sounds. In humans, hearing loss may result in an unbalanced representation of speech cues: ENV_n cues are enhanced at the cost of TFS_n cues (see: Effects of age and hearing loss on temporal envelope processing). Auditory training may reduce the representation of speech ENV_n cues for elderly listeners with hearing loss, who may then reach levels comparable to those observed for normal-hearing elderly listeners. Last, intensive musical training induces both behavioral effects such as higher sensitivity to pitch variations (for Mandarin linguistic pitch) and a better synchronization of brainstem responses to the f0-contour of lexical tones for musicians compared with non-musicians.

Clinical evaluation of TFS sensitivity

Fast and easy-to-administer psychophysical tests have been developed to assist clinicians in the screening of TFS-processing abilities and diagnosis of suprathreshold temporal auditory processing deficits associated with cochlear damage and ageing. These tests may also be useful for audiologists and hearing-aid manufacturers to explain and/or predict the outcome of hearing-aid fitting in terms of perceived quality, speech intelligibility or spatial hearing. These tests may eventually be used to recommend the most appropriate compression speed in hearing aids or the use of directional microphones. The need for such tests is corroborated by strong correlations between slow-FM or spectro-temporal modulation detection thresholds and aided speech intelligibility in competing backgrounds for hearing-impaired persons. Clinical tests can be divided into two groups: those assessing monaural TFS processing capacities (TFS1 test) and those assessing binaural capacities (binaural pitch, TFS-LF, TFS-AF). TFS1: this test assesses the ability to discriminate between a harmonic complex tone and its frequency-transposed (and thus, inharmonic) version. Binaural pitch: these tests evaluate the ability to detect and discriminate binaural pitch, and melody recognition using different types of binaural pitch. TFS-LF: this test assesses the ability to discriminate low-frequency pure tones that are identical at the two ears from the same tones differing in interaural phase. TFS-AF: this test assesses the highest audio frequency of a pure tone up to which a change in interaural phase can be discriminated.
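
The two kinds of stimulus contrast described for TFS1 and TFS-LF can be sketched as below. The sketch only shows the core contrasts: a harmonic complex versus the same complex with every component shifted by a fixed number of hertz, and a pure tone presented with or without an interaural phase difference. The parameter values and function names are assumptions, and any filtering, level calibration, or masking that an actual test procedure would involve is left out.

```python
import math

def complex_tone(f0, ranks, shift_hz, fs, dur):
    """Equal-amplitude complex with components at n*f0 + shift_hz for n in ranks."""
    freqs = [n * f0 + shift_hz for n in ranks]
    n_samples = int(round(fs * dur))
    wave = [sum(math.sin(2 * math.pi * f * i / fs) for f in freqs) / len(freqs)
            for i in range(n_samples)]
    return freqs, wave

def dichotic_tone(freq, ipd_rad, fs, dur):
    """Left/right pure tones that differ only by an interaural phase difference."""
    n_samples = int(round(fs * dur))
    left = [math.sin(2 * math.pi * freq * i / fs) for i in range(n_samples)]
    right = [math.sin(2 * math.pi * freq * i / fs + ipd_rad) for i in range(n_samples)]
    return left, right

# Harmonic complex (F0 = 200 Hz) and a version shifted upward by 50 Hz.
harm_freqs, harm_wave = complex_tone(200, range(9, 14), 0, 16000, 0.05)
shift_freqs, shift_wave = complex_tone(200, range(9, 14), 50, 16000, 0.05)

# 500 Hz tone, identical at the two ears versus inverted at one ear.
diotic = dichotic_tone(500, 0.0, 16000, 0.05)
antiphasic = dichotic_tone(500, math.pi, 16000, 0.05)
```

Both complexes have the same component spacing (200 Hz) and therefore the same envelope repetition rate, but the shifted components are no longer integer multiples of 200 Hz, so the two sounds differ in their fine structure.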

Objective measures using envelope and TFS cues

Signal distortion, additive noise, reverberation, and audio processing strategies such as noise suppression and dynamic-range compression can all impact speech intelligibility and speech and music quality. These changes in the perception of the signal can often be predicted by measuring the associated changes in the signal envelope and/or temporal fine structure (TFS). Objective measures of the signal changes, when combined with procedures that associate the signal changes with differences in auditory perception, give rise to auditory performance metrics for predicting speech intelligibility and speech quality. Changes in the TFS can be estimated by passing the signals through a filterbank and computing the coherence between the system input and output in each band. Intelligibility predicted from the coherence is accurate for some forms of additive noise and nonlinear distortion, but works poorly for ideal binary mask (IBM) noise suppression. Speech and music quality for signals subjected to noise and clipping distortion have also been modeled using the coherence or using the coherence averaged across short signal segments. Changes in the signal envelope can be measured using several different procedures. The presence of noise or reverberation will reduce the modulation depth of a signal, and multiband measurement of the envelope modulation depth of the system output is used in the speech transmission index (STI) to estimate intelligibility. While accurate for noise and reverberation applications, the STI works poorly for nonlinear processing such as dynamic-range compression. An extension to the STI estimates the change in modulation by cross-correlating the envelopes of the speech input and output signals. A related procedure, also using envelope cross-correlations, is the short-time objective intelligibility (STOI) measure, which works well for its intended application in evaluating noise suppression, but which is less accurate for nonlinear distortion. 
Envelope-based intelligibility metrics have also been derived using modulation filterbanks and using envelope time-frequency modulation patterns. Envelope cross-correlation is also used for estimating speech and music quality. Envelope and TFS measurements can also be combined to form intelligibility and quality metrics. A family of metrics for speech intelligibility, speech quality, and music quality has been derived using a shared model of the auditory periphery that can represent hearing loss. Using a model of the impaired periphery leads to more accurate predictions for hearing-impaired listeners than using a normal-hearing model, and the combined envelope/TFS metric is generally more accurate than a metric that uses envelope modulation alone.
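
The envelope cross-correlation idea that underlies several of these metrics can be illustrated with a minimal single-band sketch. It is not an implementation of the STI, STOI, or any published metric: the envelope extractor, the 30 Hz smoothing cutoff, the noise level, and the function names are all assumptions, and real metrics operate on many frequency bands and short time segments.

```python
import math
import random

def envelope(signal, fs, cutoff_hz=30.0):
    """Full-wave rectification followed by one-pole low-pass smoothing."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (abs(x) - state)
        out.append(state)
    return out

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# "System input": a 500 Hz tone modulated at 4 Hz. "System output": input plus noise.
fs = 8000
rng = random.Random(0)
clean = [(1.0 + 0.8 * math.sin(2 * math.pi * 4 * i / fs))
         * math.sin(2 * math.pi * 500 * i / fs) for i in range(fs)]
noisy = [x + rng.gauss(0.0, 2.0) for x in clean]

env_clean = envelope(clean, fs)
corr_identity = correlation(env_clean, env_clean)
corr_noisy = correlation(env_clean, envelope(noisy, fs))
```

An undistorted system gives an envelope correlation of one, and additive noise lowers it; a metric built on this quantity would then need a separate mapping from correlation to predicted intelligibility or quality.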
