In linguistics and pedagogy, an interlinear gloss is a gloss (a series of brief explanations, such as definitions or pronunciations) placed between lines, such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more corresponding lines of transcription known as an interlinear text or interlinear glossed text (IGT), or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. In its simplest form, an interlinear gloss is a literal, word-for-word translation of the source text.

History

Interlinear glosses have been used for a variety of purposes over a long period of time. One common use has been to annotate bilingual textbooks for language education. This sort of interlinearization helps make the meaning of a source text explicit without attempting to formally model the structural characteristics of the source language.

Such annotations have occasionally been expressed not through interlinear layout but through enumeration of words in the object language and the metalanguage. One such example is Wilhelm von Humboldt's annotation of Classical Nahuatl:

This "inline" style allows examples to be included within the flow of text, and lets the target-language words be written in an order that approximates target-language syntax. (In the gloss here, ''mache es'' is reordered from the corresponding source order to approximate German syntax more naturally.) Even so, this approach requires readers to "re-align" the correspondences between source and target forms.

More modern 19th- and 20th-century approaches took to glossing vertically, aligning the same sort of word-by-word content so that the metalanguage terms were placed vertically below the source language terms. In this style, the given example might be rendered thus (here with an English gloss):

Note that here word order is determined by the syntax of the object language.

Finally, modern linguists have adopted the practice of using abbreviated grammatical category labels. A 2008 publication which repeats this example labels it as follows:

This approach is denser and also requires effort to read, but it is less reliant on the grammatical structure of the metalanguage for expressing the semantics of the target forms.

In computing, special text markers are provided in the Specials Unicode block to indicate the start and end of interlinear glosses.
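The Unicode markers mentioned above can be inspected from code. The following is a minimal sketch: the code points U+FFF9 to U+FFFB are the interlinear annotation characters of the Specials block, while the helper names `annotate` and `split_annotation` and the sample strings are assumptions made purely for illustration.

```python
import unicodedata

# Interlinear annotation characters from the Unicode Specials block.
ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"


def annotate(base: str, gloss: str) -> str:
    """Wrap a base text and its annotation in the interlinear markers."""
    return f"{ANCHOR}{base}{SEPARATOR}{gloss}{TERMINATOR}"


def split_annotation(marked: str) -> tuple[str, str]:
    """Recover (base, gloss) from a string produced by annotate()."""
    body = marked.removeprefix(ANCHOR).removesuffix(TERMINATOR)
    base, _, gloss = body.partition(SEPARATOR)
    return base, gloss


# Print each marker's code point and official character name.
for ch in (ANCHOR, SEPARATOR, TERMINATOR):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Most renderers do not display these characters, so in practice they are more likely to be useful as an interchange convention than as a layout mechanism.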

Structure

Though there is no formal specification for the IGT format, the Leipzig Glossing Rules are a set of guidelines that aim to standardize the format as much as possible. An interlinear text for linguistics will commonly consist of some or all of the following, usually in this order, from top to bottom:

* the original orthography (typically in ''italic'' or ''bold italic''),
* a conventional transliteration into the Latin alphabet,
* a phonetic transcription,
* a morphophonemic transliteration,
* a word-by-word or morpheme-by-morpheme gloss, where morphemes within a word are separated by hyphens or other punctuation, and finally
* a free translation, which may be placed in a separate paragraph or on the facing page if the structures of the languages are too different for it to follow the text line by line.

As an example, the following Taiwanese clause has been transcribed with five lines of text:

:1. the standard ''pe̍h-ōe-jī'' transliteration,
:2. a gloss using tone numbers for the surface tones,
:3. a gloss showing the underlying tones in citation form (before undergoing tone sandhi),
:4. a morpheme-by-morpheme gloss in English, and
:5. an English translation:

Word-by-word alignment. According to the Leipzig Glossing Rules, it is standard to left-align the words in the object language with the corresponding words in the metalanguage; this alignment can be seen between lines (1-3) and line (4).

Morpheme-by-morpheme correspondence. At the sub-word level, segmentable morphemes are separated by hyphens, both in the example and in the gloss. There should be the same number of hyphens in the example and in the gloss, as shown in the following example:

Grammatical category labels. In ''amuqʼ-da-č'', the stem (''amuqʼ'') is translated into the corresponding English lexeme (''stay''), while the inflectional affixes ''-da'' and ''-č'' mark future tense and negation. These affixes are glossed as ''FUT'' and ''NEG''; a list of standard abbreviations for grammatical categories widely used in linguistics can be found in the Leipzig Glossing Rules.

One-to-many correspondences. When a single object-language element corresponds to several metalanguage elements, these are separated by periods. E.g.,

Non-overt elements. If the morpheme-by-morpheme gloss (middle line) contains an element that does not correspond to an overt element in the example, a standard strategy is to include an overt "ø" in the object-language text, which is separated by a hyphen like an overt element would be:

Reduplication is treated similarly to affixation, but with a tilde (instead of the standard hyphen) connecting the copied element to the stem:
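The alignment and hyphen-count conventions described above lend themselves to a mechanical check. The sketch below is illustrative only: the function names `check_morphemes` and `align` are hypothetical, and the gloss `stay-FUT-NEG` is assembled from the labels given in the text for ''amuqʼ-da-č''.

```python
def check_morphemes(text_line: str, gloss_line: str) -> list[str]:
    """Return a list of problems; empty if the two lines correspond."""
    text_words, gloss_words = text_line.split(), gloss_line.split()
    problems = []
    if len(text_words) != len(gloss_words):
        problems.append(
            f"word count differs: {len(text_words)} vs {len(gloss_words)}"
        )
    # Each word and its gloss should contain the same number of hyphens.
    for t, g in zip(text_words, gloss_words):
        if t.count("-") != g.count("-"):
            problems.append(f"hyphen count differs: {t!r} vs {g!r}")
    return problems


def align(text_line: str, gloss_line: str) -> str:
    """Left-align each object-language word with its gloss."""
    pairs = list(zip(text_line.split(), gloss_line.split()))
    widths = [max(len(t), len(g)) for t, g in pairs]
    top = "  ".join(t.ljust(w) for (t, _), w in zip(pairs, widths))
    bottom = "  ".join(g.ljust(w) for (_, g), w in zip(pairs, widths))
    return top.rstrip() + "\n" + bottom.rstrip()


print(check_morphemes("amuqʼ-da-č", "stay-FUT-NEG"))
print(align("amuqʼ-da-č", "stay-FUT-NEG"))
```

Column widths here are computed with `len()`, which counts code points rather than display cells, so text containing combining characters or wide glyphs may need a more careful width function.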

Punctuation

In interlinear morphological glosses, various forms of punctuation separate the glosses. Typically, the words are aligned with their glosses; within words, a hyphen is used when a boundary is marked in both the text and its gloss, and a period when a boundary appears in only one. That is, there should be the same number of words separated with spaces in the text and its gloss, as well as the same number of hyphenated morphemes within a word and its gloss. This is the basic system, and can be applied universally. For example,

An underscore may be used instead of a period, as in ''go_out-'', when a single word in the source language happens to correspond to a phrase in the glossing language, though a period would still be used for other situations, such as Greek ''oikíais'' house. 'to the houses'.

However, sometimes finer distinctions may be made. For example, clitics may be separated with a double hyphen (or, for ease of typing, an equal sign) rather than a hyphen:

Affixes which cause discontinuity (infixes, circumfixes, transfixes, etc.) may be set off by angle brackets, and reduplication with tildes, rather than with hyphens:

(See affix for other examples.)

Morphemes which cannot be easily separated out, such as umlaut, may be marked with a backslash rather than a period:

A few other conventions which are sometimes seen are illustrated in the Leipzig Glossing Rules.
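To make the role of the different separators concrete, the following sketch splits a single glossed word on the hyphen, equal sign, and tilde and records which kind of boundary precedes each segment. The names `BOUNDARIES`, `segment`, and `labels` are hypothetical, the second input string is a made-up placeholder rather than real language data, and angle brackets, underscores, and backslashes are deliberately left unhandled.

```python
import re

# Boundary markers handled by this sketch and the relation they signal.
BOUNDARIES = {
    "-": "affix",
    "=": "clitic",
    "~": "reduplication",
}


def segment(word: str) -> list[tuple[str, str]]:
    """Split a word into (boundary_type, segment) pairs.

    The first segment has no preceding marker and is tagged 'start'.
    """
    # A capturing group makes re.split keep the separators in the result.
    parts = re.split(r"([-=~])", word)
    result = [("start", parts[0])]
    for sep, seg in zip(parts[1::2], parts[2::2]):
        result.append((BOUNDARIES[sep], seg))
    return result


def labels(gloss_segment: str) -> list[str]:
    """Split a one-to-many gloss element such as 'be.NEG' on periods."""
    return gloss_segment.split(".")


print(segment("stay-FUT-NEG"))
print(segment("x=y~z"))  # placeholder string, not real data
print(labels("be.NEG"))
```

Applying `segment` to a word and to its gloss and comparing the sequences of boundary types is one way to check that the two lines use the same markers in the same places.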

Interlinear gloss resources

Efforts have been undertaken to digitize IGT for hundreds of the world's languages.

Online Database of Interlinear Text

The Online Database of Interlinear Text (ODIN) is a database of over 200,000 instances of interlinear glosses for more than 1,500 languages, extracted from scholarly linguistic research. The database was constructed in two phases: automatic construction followed by manual correction. The automatic construction stage itself was completed in three steps:

# First, search engines (e.g., Google, Bing) were queried to retrieve scholarly documents that were likely to contain interlinear glosses. The queries comprised terms relevant to linguistic research, such as grammatical morpheme labels (e.g., "NOM", shorthand for nominative; "3SG", shorthand for third person singular).
# Second, each line in an extracted document was tagged as belonging to an interlinear gloss or not, using sequence-labeling methods from machine learning.
# Third, each interlinear gloss instance was assigned a language name (e.g., Tagalog) and an ISO 639-3 language ID. Language names and IDs were assigned automatically using coreference resolution models from natural language processing: each interlinear gloss instance was tagged with the language name (and ID) that appears in the scholarly document from which it was extracted.

In the manual correction phase, the database creators manually corrected the boundaries of the interlinear gloss instances discovered by the sequence-labeling method in Step 2 of the automatic construction phase. The creators then verified the language names and language codes in a second and third pass over the data, respectively.
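The line-tagging idea in Step 2 can be illustrated with a much simpler stand-in. The sketch below is not ODIN's method: it replaces the trained sequence labeler with a hand-written heuristic that flags a line when at least half of its tokens contain an all-capitals category label such as "NOM" or "3SG". The regular expression, the function name `looks_like_gloss_line`, and the threshold are all assumptions.

```python
import re

# An optional digit followed by two or more capitals, e.g. NOM, 3SG, FUT.
LABEL = re.compile(r"(?<![A-Za-z])[0-9]?[A-Z]{2,}(?![a-z])")


def looks_like_gloss_line(line: str, threshold: float = 0.5) -> bool:
    """Heuristically flag a line as a gloss line.

    A line qualifies when the share of tokens containing a category
    label reaches the threshold. Empty lines never qualify.
    """
    tokens = line.split()
    if not tokens:
        return False
    labelled = sum(1 for tok in tokens if LABEL.search(tok))
    return labelled / len(tokens) >= threshold


print(looks_like_gloss_line("stay-FUT-NEG"))
print(looks_like_gloss_line("This sentence is ordinary prose."))
```

A heuristic like this treats each line in isolation and will be fooled by acronyms; a sequence labeler can additionally use the context of neighboring lines, which is presumably part of why a learned model was used.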

Automatic processing of interlinear gloss instances

Natural Language Processing models leveraging interlinear gloss resources, such as the Online Database of Interlinear Text, have been developed.

Automatic glossing

Natural language processing systems have, for example, been developed to automatically produce interlinear glosses:

Given the morpheme-segmented line (first line above) and the free translation line (third line above), the task is to produce the middle glossed line, comprising stem translations (e.g., ''mi'':''you'') and the grammatical category labels corresponding to affixes (e.g., ''a'':''ERG.1.PL''). Sequence prediction models from natural language processing have been used to perform this task. Two factors contribute to the difficulty of this task:

# The translation is not necessarily in alignment with the morpheme-segmented line (e.g., ''camel'' is the last word in the translation but the second word in the morpheme-segmented line).
# Some words in the morpheme-segmented line have multiple correspondences in the gloss (e.g., ''anu'':''be.NEG'').
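The shape of the glossing task (segmented line in, gloss line out) can be shown with a deliberately naive baseline. This sketch is not one of the sequence prediction models referred to above: it simply memorizes the most frequent gloss seen for each morpheme and ignores the translation line. The function names are hypothetical, the three morpheme-gloss pairs are the ones quoted in the text, and the input strings merely combine those morphemes for demonstration and are not real sentences.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Build a morpheme -> most frequent gloss lookup from (morpheme, gloss) pairs."""
    counts = defaultdict(Counter)
    for morpheme, gloss in pairs:
        counts[morpheme][gloss] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}


def gloss_line(segmented: str, lexicon, unknown: str = "?") -> str:
    """Gloss a morpheme-segmented line, keeping word and hyphen boundaries."""
    glossed_words = []
    for word in segmented.split():
        glossed = "-".join(lexicon.get(m, unknown) for m in word.split("-"))
        glossed_words.append(glossed)
    return " ".join(glossed_words)


lexicon = train([("mi", "you"), ("a", "ERG.1.PL"), ("anu", "be.NEG")])
print(gloss_line("mi anu", lexicon))   # demonstration input, not a real sentence
print(gloss_line("mi-xyz", lexicon))   # unseen morpheme falls back to "?"
```

A lookup of this kind cannot choose between competing glosses for an ambiguous morpheme and cannot gloss unseen stems, which is where the free translation line becomes useful as an additional source of evidence.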

Automatic discovery of morphological structure from glosses

Researchers have used interlinear glosses to obtain the morphological paradigms of the object language (i.e., the language being glossed). To automatically create morphological paradigms from interlinear glosses, researchers have created a table for every stem in the gloss, with a (possibly empty) slot for every grammatical category (e.g., ERG) in the gloss. For instance, given the glossed sentence below:

There would be a paradigm for the stem ''pobeja'' with slots for ''PFV.PST.SG.FEM'' and ''PFV.PST.SG.MASC'':

The slot for ''PFV.PST.SG.FEM'' would be filled (since it was observed in the interlinear gloss data) but the slot for ''PFV.PST.SG.MASC'' would be empty (assuming that no other interlinear gloss instance contains ''pobeja'' inflected for the ''PFV.PST.SG.MASC'' grammatical category). A statistical machine learning model for morphological inflection can be used to fill in the missing entries.
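The paradigm tables described above can be represented as a nested mapping from stem to slot to observed form. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each gloss word is assumed to have the shape `stem-CATEGORY`, and `<observed form>` is a placeholder standing in for the surface word, which is not reproduced here.

```python
from collections import defaultdict


def split_gloss(gloss_word: str) -> tuple[str, str]:
    """Split a gloss word into (stem, category slot) at the first hyphen."""
    stem, _, slot = gloss_word.partition("-")
    return stem, slot


def build_paradigms(observations, slots):
    """Build one table per stem with a slot for every category.

    observations: iterable of (surface_form, gloss_word) pairs.
    slots: all grammatical categories to tabulate; unobserved ones stay None.
    """
    slots = list(slots)
    tables = defaultdict(lambda: dict.fromkeys(slots))
    for form, gloss_word in observations:
        stem, slot = split_gloss(gloss_word)
        tables[stem][slot] = form
    return dict(tables)


paradigms = build_paradigms(
    [("<observed form>", "pobeja-PFV.PST.SG.FEM")],  # placeholder surface form
    ["PFV.PST.SG.FEM", "PFV.PST.SG.MASC"],
)
print(paradigms)
```

Slots that remain `None` are the entries a morphological inflection model would be asked to predict.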

References

* Haspelmath, Martin (2008). ''Language typology and language universals: an international handbook''. Walter de Gruyter. p. 715. ISBN 978-3-11-011423-2. https://archive.org/details/sprachtypologieu00teil
* Georgi, Ryan (2016). ''From Aari to Zulu: massively multilingual creation of language tools using interlinear glossed text'' (PhD thesis). University of Washington.
* Lehmann, Christian (2004-01-23). "Directions for interlinear morphemic translations". In Geert Booij; Christian Lehmann; Joachim Mugdan; Stavros Skopeteas (eds.). ''Morphologie. Ein internationales Handbuch zur Flexion und Wortbildung''. Handbücher der Sprach- und Kommunikationswissenschaft. Vol. 2. Berlin: W. de Gruyter. pp. 1834–1857.
* Xia, Fei; Lewis, William; Wayne, Michael; Slayden, Glenn; Georgi, Ryan; Crowgey, Joshua; Bender, Emily (2016). "Enriching a massively multilingual database of interlinear glossed text". ''Language Resources and Evaluation''. 50 (2): 321–349. doi:10.1007/s10579-015-9325-4. S2CID 2674996. https://link.springer.com/article/10.1007/s10579-015-9325-4 Retrieved 2021-12-15.
* Bickel, Balthasar; Comrie, Bernard; Haspelmath, Martin (February 2008). "The Leipzig Glossing Rules. Conventions for Interlinear Morpheme by Morpheme Glosses". ''Dept. of Linguistics – Resources – Glossing Rules''. http://www.eva.mpg.de/lingua/tools-at-lingboard/tools.php Retrieved 2010-06-30.
* Zhao, Xingyuan; Ozaki, Satoru; Anastasopoulos, Antonios; Neubig, Graham; Levin, Lori (2020). "Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations". ''Proceedings of the 28th International Conference on Computational Linguistics'' (COLING). pp. 5397–5408. doi:10.18653/v1/2020.coling-main.471. S2CID 227231816. https://aclanthology.org/2020.coling-main.471/ Retrieved 2021-12-15.
* Moeller, Sarah; Liu, Ling; Yang, Changbing; Kann, Katharina; Hulden, Mans (2020). "IG2P: From Interlinear Glossed Texts to Paradigms". ''Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)''. pp. 5251–5262. doi:10.18653/v1/2020.emnlp-main.424. S2CID 226262296. https://aclanthology.org/2020.emnlp-main.424 Retrieved 2021-12-15.

External links

* The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
* Towards a General Model of Interlinear Text (E-MELD)
* Interlinear Morphemic Glosses
* Glossing Ancient Languages and Texts. A forum for recommendations on the interlinear morphemic glossing of ancient languages as attested in ancient manuscripts.
* http://depts.washington.edu/uwcl/odin/ ODIN - The Online Database of INterlinear text
* Latinum Interlinear Method page. Listing of older interlinear and construed texts, mostly from Latin or Ancient Greek and mostly to English.
* Ernest Blum, "The New Old Way of Learning Languages", ''The American Scholar'', Autumn 2008.