The field of language documentation in the modern context involves a complex and ever-evolving set of tools and methods, and the study and development of their use (and, especially, the identification and promotion of best practices) can be considered a sub-field of language documentation proper. Among these are ethical and recording principles, workflows and methods, hardware tools, and software tools.
Principles and workflows
Researchers in language documentation often conduct linguistic fieldwork to gather the data on which their work is based, recording audiovisual files that document language use in traditional contexts. Because the environments in which linguistic fieldwork takes place are often logistically challenging, not every type of recording tool is necessary or ideal, and compromises must often be struck between quality, cost and usability. It is also important to envision one's complete workflow and intended outcomes; for example, if video files are made, the audio component may need to be extracted or converted before it can be processed by different software packages.
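As an illustration of the kind of processing step mentioned above, the following minimal sketch builds a command line for ffmpeg, a widely used command-line media tool that is an assumption here rather than a tool named in this article. The file name `session01.mp4` and the function name are hypothetical. The sketch only constructs the command; running it requires ffmpeg to be installed.

```python
from pathlib import Path


def build_extract_command(video_path, sample_rate=44100):
    """Return an ffmpeg argument list that writes the audio track as 16-bit PCM WAV."""
    video = Path(video_path)
    wav = video.with_suffix(".wav")   # same name, .wav extension
    return [
        "ffmpeg",
        "-i", str(video),          # input video file
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # uncompressed 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        str(wav),
    ]


command = build_extract_command("session01.mp4")  # hypothetical file name
print(" ".join(command))

# To run the command (assumes ffmpeg is on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Extracting a compressed audio track to WAV makes it readable by more tools, but it does not restore quality lost when the camera compressed the audio.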
Ethics
Ethical practices in language documentation have been the focus of much recent discussion and debate. The Linguistic Society of America has prepared an Ethics Statement and maintains an Ethics Discussion Blog, which is primarily focused on ethics in the language documentation context. The morality of ethics protocols has itself been brought into question by George van Driem. Most postgraduate programs that involve some form of language documentation and description require researchers to submit their proposed protocols to an internal Institutional Review Board, which ensures that research is being conducted ethically. Minimally, participants should be informed of the process and the intended use of the recordings, and give recorded audible or written permission for the audiovisual materials to be used for linguistic investigation by the researcher(s). Many participants will want to be named as consultants, but others will not; this will determine whether the data needs to be anonymized or restricted from public access.
Data Formats
Adhering to standards for formats is critical for interoperability between software tools. Many individual archives or data repositories have their own standards and requirements for data deposited on their servers. Knowledge of these requirements ought to inform the data collection strategy and tools used, and should be part of a data management plan developed before the start of research. Some example guidelines from well-used repositories are given below:
* Endangered Languages Archive (ELAR) guidelines
* accepted formats
* Yale University Library audiovisual guidelines
Most current archive standards for video use MPEG-4 (H.264) as an encoding or storage format, which includes an AAC audio stream (generally of up to 320 kbit/s). Audio archive quality is at least WAV 44.1 kHz, 16-bit.
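The audio minimum stated above can be checked automatically before deposit. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV headers; the function names and the demo files are hypothetical, and a given archive's actual requirements should be taken from its own guidelines.

```python
import os
import tempfile
import wave


def meets_archive_minimum(path, min_rate=44100, min_bits=16):
    """Check a WAV file's header against a minimum sample rate and bit depth."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8   # sample width is reported in bytes
    return rate >= min_rate and bits >= min_bits


def write_silence(path, rate, sample_width=2, seconds=1):
    """Write a mono 16-bit WAV file of silence, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * sample_width * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    poor = os.path.join(tmp, "poor.wav")
    write_silence(good, 44100)   # 44.1 kHz, 16-bit
    write_silence(poor, 22050)   # 22.05 kHz, 16-bit
    good_ok = meets_archive_minimum(good)
    poor_ok = meets_archive_minimum(poor)

print(good_ok, poor_ok)  # True False
```

A header check of this kind confirms only the container parameters; it says nothing about whether the file was upsampled from a lower-quality source.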
Principles for recording
Since documentation of languages is often difficult, and many of the languages that linguists work with are endangered (they may not be spoken in the near future), it is recommended to record at the highest quality possible given the limitations of the recorder. For video, this means recording at HD resolution (1080p or 720p) or higher when possible, while for audio this means recording minimally in uncompressed PCM at 44,100 samples per second, 16-bit resolution. Arguably, however, good recording technique (isolation, microphone selection and usage, using a tripod to minimize blur) is more important than resolution. A clear recording of a speaker telling a folktale (high signal-to-noise ratio) in MP3 format (perhaps made on a phone) is better than an extremely noisy recording in WAV format where all that can be heard are cars going by. To ensure that good recordings can be obtained, linguists should practice with their recording devices as much as possible and compare the results to see which techniques work best.
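Recording uncompressed audio has a storage cost that is worth estimating before fieldwork. The following sketch works through the arithmetic for the minimum parameters given above (44,100 samples per second, 16-bit); the function name is hypothetical, and the figures cover raw sample data only, ignoring the small WAV header.

```python
def pcm_bytes(seconds, sample_rate=44100, bit_depth=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    bytes_per_sample = bit_depth // 8
    return seconds * sample_rate * bytes_per_sample * channels


one_hour_mono = pcm_bytes(3600)
one_hour_stereo = pcm_bytes(3600, channels=2)

print(f"{one_hour_mono / 1e6:.1f} MB")    # 317.5 MB
print(f"{one_hour_stereo / 1e6:.1f} MB")  # 635.0 MB
```

At these rates, one hour of stereo audio is well under a gigabyte, so storage is rarely a reason to record in a compressed format.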
Workflows
For many linguists the end-result of making recordings is language analysis, often investigation of a language's phonological or syntactic properties using various software tools. This requires transcription of the audio, generally in collaboration with native speakers of the language in question. For general transcription, media files can be played back on a computer (or other device capable of playback) and paused for transcription in a text editor. Other (cross-platform) tools to assist this process include Audacity and Transcriber, while a program like ELAN (described further below) can also perform this function.
Programs like Toolbox or FLEx are often preferred by linguists who want to be able to interlinearize their texts, as these programs build a dictionary of forms and parsing rules to help speed up analysis. Unfortunately, media files are generally not linked by these programs (as opposed to ELAN, in which linked files are preferred), making it difficult to view or listen back to recordings to check transcriptions. There is currently a workaround for Toolbox that allows timecodes to reference an audio file and enable playback (of a complete text or a referenced sentence) from within Toolbox. In this workflow, time-alignment of text is performed in Transcriber, and then the relevant timecodes and text are converted into a format that Toolbox can read.
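The conversion step in that workflow can be pictured with a minimal sketch. It assumes the time-aligned segments are already available as (start, end, text) tuples and writes them as backslash-marked records of the general kind Toolbox reads. The marker names used here (`\ref`, `\ts`, `\te`, `\tx`) and the sample data are assumptions for illustration only; they are not the markers of the actual workaround, and a real project would use the markers defined in its own Toolbox settings.

```python
def to_toolbox_records(segments, text_id):
    """Render (start, end, text) segments as backslash-marked records."""
    records = []
    for number, (start, end, text) in enumerate(segments, start=1):
        records.append(
            "\n".join([
                f"\\ref {text_id}.{number:03d}",  # record identifier
                f"\\ts {start:.3f}",   # hypothetical marker: segment start in seconds
                f"\\te {end:.3f}",     # hypothetical marker: segment end in seconds
                f"\\tx {text}",        # transcription line
            ])
        )
    # Records are separated by a blank line
    return "\n\n".join(records) + "\n"


# Invented sample data standing in for time-aligned transcription output
segments = [(0.0, 2.35, "first utterance"), (2.35, 5.1, "second utterance")]
output = to_toolbox_records(segments, "story01")
print(output)
```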
Hardware
Video+audio recorders
Recorders that record video typically record audio as well. However, the audio does not always meet the minimal needs and recommended best practices for language documentation (uncompressed WAV format, 44.1 kHz, 16-bit), and is often not useful for linguistic purposes such as phonetic analysis. Many video devices record instead to a compressed audio format such as AAC or MP3, which is combined with the video stream in a wrapper of various kinds. Exceptions to this general rule are the following Video+Audio recorders:
* The Zoom series, particularly the Q8, Q4n and Q2n, which record to multiple video and audio resolutions/formats, most notably WAV (44.1/48/96 kHz, 16/24-bit).
When using a video recorder that does not record audio in WAV format (such as most DSLR cameras), it is recommended to record audio separately on another recorder, following some of the guidelines below. As with the audio recorders described below, many video recorders also accept microphone input of various kinds (generally through a 1/8-inch or TRS connector). This can ensure a high-quality backup audio recording that is in sync with the recorded video, which can be helpful in some cases (e.g. for transcription).
Audio recorders and microphones
Audio-only recorders can be used in scenarios where video is impractical or otherwise undesirable. In most cases it is advantageous to combine the use of an audio-only recorder with one or more external microphones; however, many modern audio recorders include built-in microphones which are usable if cost or setup speed are important concerns. Digital (solid state) recorders are preferred for most language documentation scenarios. Modern digital recorders achieve a very high level of quality at a relatively low price. Some of the most popular field recorders are found in the Zoom range, including the H1, H2, H4, H5 and H6. The H1 is particularly suitable for situations in which cost and user-friendliness are major desiderata. Other popular recorders for situations where size is a factor include the Sony Digital Voice recorders (though in this case, ensure that the device can record to WAV/Linear PCM format).
Several types of microphone can be effectively used in language documentation scenarios, depending on the situation (especially factors such as the number, position and mobility of speakers) and on budget. In general, condenser microphones should be selected rather than dynamic microphones. It is an advantage in most fieldwork situations if a condenser microphone is self-powered (via a battery); however, when power is not a major factor, phantom-powered models can also be used. A stereo microphone setup is needed whenever more than one speaker is involved in a recording; this can be achieved via an array of two mono microphones, or by a dedicated stereo microphone.
Directional microphones should be used in most cases, in order to isolate a speaker's voice from other potential noise sources. However, omnidirectional microphones may be preferred in situations involving larger numbers of speakers arrayed in a relatively large space. Among directional microphones, cardioid microphones are suitable for most applications; however, in some cases a hypercardioid ("shotgun") microphone may be preferred.
Good quality headset microphones are comparatively expensive, but can produce recordings of extremely high quality in controlled situations.
Lavalier or "lapel" microphones may be used in some situations; however, depending on the microphone, they can produce recordings which are inferior to those of a headset microphone for phonetic analysis. They are also subject to some of the same concerns as headset microphones in that they restrict a recording to a single speaker: while other speakers may be audible on the recording, they will be backgrounded in relation to the speaker wearing the lavalier microphone.
Some good quality microphones used for film-making and interviews include the Røde VideoMic shotgun and the Røde lavalier series, Shure headworn mics and Shure lavaliers. Depending on the recorder and microphone, additional cables (XLR, stereo/mono converter or TRRS to TRS adapter) may be necessary.
Other recording tools
Electrical power generation, storage and management
Computer systems
Accessories
Software
There is as yet no single software suite which is designed to or able to handle all aspects of a typical language documentation workflow. Instead, there is a large and increasing number of packages designed to handle various aspects of the workflow, many of which overlap considerably. Some of these packages use standard formats and are inter-operable, whereas others are much less so.
SayMore
SayMore is a language documentation package developed by SIL International in Dallas which primarily focuses on the initial stages in language documentation, and aims for a relatively uncomplicated user experience.
The primary functions of SayMore are: (a) audio recording; (b) file import from recording device (video and/or audio); (c) file organization; (d) metadata entry at session and file levels; (e) association of AV files with evidence of informed consent and other supplementary objects (such as photographs); (f) AV file segmentation; (g) transcription/translation; (h) BOLD-style Careful Speech annotation and Oral Translation.
SayMore files can be further exported for annotation in FLEx, and metadata can be exported in .csv and IMDI formats for archiving.
ELAN
ELAN is developed by The Language Archive at the Max Planck Institute for Psycholinguistics in Nijmegen. ELAN is a full-featured transcription tool, particularly useful for researchers with complex annotation needs/goals.
FLEx
FieldWorks Language Explorer (FLEx) is developed by SIL International (formerly Summer Institute of Linguistics, Inc.) in Dallas. FLEx allows the user to build a "lexicon" of the language, i.e. a word-list with definitions and grammatical information, and also to store texts from the language. Within the texts, each word or part of a word (i.e. a "morpheme") is linked to an entry in the lexicon. For new projects and for students learning for the first time, FLEx is now the best tool for interlinearising and dictionary-making.
Toolbox
Field Linguist's Toolbox (usually called Toolbox) is a precursor of FLEx and has been one of the most widely used language documentation packages for some decades. Previously known as Shoebox, Toolbox has as its primary functions the construction of a lexical database and the interlinearization of texts through interaction with the lexical database. Both lexical database and texts can be exported to a word processing environment, in the case of the lexical database using the Multi-Dictionary Formatter (MDF) conversion tool. It is also possible to use Toolbox as a transcription environment. By comparison with ELAN and FLEx, Toolbox has relatively limited functionality, and is felt by some to have an unintuitive design and interface. However, a large number of projects have been carried out in the Shoebox/Toolbox environment over its lifespan, and its user base continues to enjoy its advantages of familiarity, speed, and community support. Toolbox also has the advantage of working directly with human-readable text files that can be opened in any text editor and easily manipulated and archived. Toolbox files can also be easily converted for storage in XML (recommended for archives), such as with open source Python libraries like Xigt, intended for computational uses of IGT data.
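To show why plain-text, backslash-marked files are easy to manipulate, the following minimal sketch parses a small lexical sample and writes it out as XML using only the Python standard library. It does not use Xigt and does not produce Xigt's format. The sample entries are invented, the markers `\lx`, `\ps` and `\ge` are assumed to stand for lexeme, part of speech and English gloss, and records are assumed to be separated by blank lines with one field per line, which real Toolbox files need not be.

```python
import xml.etree.ElementTree as ET

# Invented sample data; marker meanings are assumptions for illustration
SAMPLE = """\\lx kata
\\ps n
\\ge word

\\lx lari
\\ps v
\\ge run
"""


def parse_records(text):
    """Split backslash-marked text into records: lists of (marker, value) pairs."""
    records = []
    for block in text.strip().split("\n\n"):
        fields = []
        for line in block.splitlines():
            if line.startswith("\\"):
                marker, _, value = line[1:].partition(" ")
                fields.append((marker, value.strip()))
        if fields:
            records.append(fields)
    return records


def records_to_xml(records):
    """Build an XML tree with one <entry> per record and one child per field."""
    root = ET.Element("database")
    for fields in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in fields:
            # Assumes each marker is a valid XML element name
            ET.SubElement(entry, marker).text = value
    return root


root = records_to_xml(parse_records(SAMPLE))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because each field becomes its own element, the resulting XML can be validated, transformed or queried with general-purpose tools rather than Toolbox itself.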
Tools for automating components of the workflow
Language documentation may be partially automated thanks to a number of software tools, including:
* eSpeak
* HTK
* Lingua Libre, a libre online tool that allows a large number of words and phrases to be recorded in a short period (up to 1,000 words/hour with a clean word list and an experienced user). It automates the classic procedure for recording audio and video pronunciation files (for spoken and signed languages). Once the recording is done, the platform automatically uploads clean, well-cut, well-named and app-friendly files directly to Wikimedia Commons (it is possible to download datasets for a specific language).
* Maus
* Prosodylab Aligner
* Sox
Literature
The peer-reviewed journal Language Documentation and Conservation has published a large number of articles focusing on tools and methods in language documentation.
Film
The 2021 Indian documentary film Dreaming of Words traces the life and work of Njattyela Sreedharan, a fourth standard drop-out, who compiled a multilingual dictionary connecting four major Dravidian languages: Malayalam, Kannada, Tamil and Telugu. Travelling across four states and doing extensive research, he spent twenty-five years making this multilingual dictionary.
See also
* LRE Map: Language resources map, searchable by Resource Type, Language(s), Language type, Modality, Resource Use, Availability, Production Status, Conference(s), Resource name
* Richard Littauer's GitHub catalog: A catalog of "open-source code that would be useful for documenting, conserving, developing, preserving, or working with endangered languages".
* RNLD software page: Research Network for Linguistic Diversity's page on linguistic software.