Native-language identification

Native-language identification (NLI) is the task of determining an author's native language (L1) based only on their writings in a second language (L2). NLI works by identifying language-usage patterns that are common to specific L1 groups and then applying this knowledge to predict the native language of previously unseen texts. This is motivated in part by applications in second-language acquisition, language teaching and forensic linguistics, amongst others.


Overview

NLI works under the assumption that an author's L1 predisposes them towards particular language production patterns in their L2. This relates to cross-linguistic influence (CLI), a key topic in the field of second-language acquisition (SLA) that analyzes transfer effects from the L1 on later-learned languages. Using large-scale English data, NLI methods achieve over 80% accuracy in predicting the native language of texts written by authors from 11 different L1 backgrounds. This compares with a baseline of about 9% (1 in 11) for choosing randomly.
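The 9% baseline follows from guessing uniformly at random among 11 candidate L1s. A minimal sketch of that arithmetic is shown here; the function name `random_baseline` is hypothetical and used only for illustration.

```python
def random_baseline(num_classes: int) -> float:
    """Expected accuracy when guessing uniformly at random among num_classes labels."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


# 11 L1 backgrounds, as in the figure quoted above.
baseline = random_baseline(11)
print(f"Random-guess baseline: {baseline:.1%}")
```

Uniform guessing has an expected accuracy of 1/K for K classes regardless of how the classes are distributed in the test data, which is why the baseline depends only on the number of L1s.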


Applications


Pedagogy and language transfer

This identification of L1-specific features has been used to study language transfer effects in second-language acquisition. This is useful for developing pedagogical material, teaching methods and L1-specific instruction, and for generating learner feedback that is tailored to the learner's native language.


Forensic linguistics

NLI methods can also be applied in forensic linguistics as a method of performing authorship profiling in order to infer the attributes of an author, including their linguistic background. This is particularly useful in situations where a text, e.g. an anonymous letter, is the key piece of evidence in an investigation and clues about the native language of a writer can help investigators in identifying the source. This has already attracted interest and funding from intelligence agencies.


Methodology

Natural language processing methods are used to extract and identify language usage patterns common to speakers of an L1 group. This is done using language learner data, usually from a learner corpus. Next, machine learning is applied to train classifiers, such as support vector machines, for predicting the L1 of unseen texts. A range of ensemble-based systems have also been applied to the task and shown to improve performance over single-classifier systems. Various linguistic feature types have been applied for this task. These include syntactic features such as constituent parses, grammatical dependencies and part-of-speech tags. Surface-level lexical features such as character, word and lemma n-grams have also been found to be quite useful. Of these, character n-grams appear to be the single best feature type for the task.
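The pipeline described above (extract character n-grams, then train a classifier on labelled learner texts) can be sketched in a few lines. This is an illustrative toy, not a method from the NLI literature: it swaps in a simple nearest-profile classifier with cosine similarity in place of a support vector machine so that it runs with the standard library alone. The function names (`char_ngrams`, `train`, `predict`), the labels `L1-A` and `L1-B`, and the sample sentences are all invented for the example and are not real learner data.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples):
    """Build one summed n-gram profile per L1 label from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text's n-grams."""
    features = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(features, profiles[label]))


# Invented toy strings standing in for a learner corpus (assumption).
toy_corpus = [
    ("I am agree with this opinion", "L1-A"),
    ("She has many informations about it", "L1-B"),
]
profiles = train(toy_corpus)
print(predict(profiles, "I am agree with you"))
```

In practice the same feature counts would typically be weighted (for example with TF-IDF) and passed to a linear classifier trained on thousands of texts per L1; the nearest-profile rule here only illustrates how n-gram counts become a prediction.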


2013 shared task

The Building Educational Applications (BEA) workshop at NAACL 2013 hosted the inaugural NLI shared task (Tetreault et al., "A report on the first native language identification shared task", 2013). The competition resulted in 29 entries from teams across the globe, 24 of which also published a paper describing their systems and approaches.

