Wiktionary (rhyming with "dictionary") is a multilingual, web-based project to create a free content dictionary of terms (including words, phrases, proverbs, linguistic reconstructions, etc.) in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustration, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of terms into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words ''wiki'' and ''dictionary''. It is available in multiple languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.

Because Wiktionary is not limited by print space considerations, most of Wiktionary's language editions provide definitions and translations of terms from many languages, and some editions offer additional information typically found in thesauri. Wiktionary's data is frequently used in various natural language processing tasks.

History and development

Wiktionary was brought online on December 12, 2002, following a proposal by Daniel Alston and an idea by Larry Sanger, co-founder of Wikipedia. On March 28, 2004, the first non-English Wiktionaries were initiated in French and Polish. Wiktionaries in numerous other languages have since been started. Wiktionary was hosted on a temporary domain name (wiktionary.wikipedia.org) until May 1, 2004, when it switched to the current domain name. Wiktionary features over 30 million articles (and even more entries) across its editions. The largest of the language editions is the English Wiktionary, with over 7.5 million entries, followed by the French Wiktionary with over 4.7 million and the Malagasy Wiktionary with over 3.5 million entries. Forty-three Wiktionary language editions contain over 100,000 entries each.

Many of the definitions at the project's largest language editions were created by bots that found creative ways to generate entries or (rarely) automatically imported thousands of entries from previously published dictionaries. Seven of the 18 bots registered at the English Wiktionary in 2007 (including TheDaveBot, TheCheatBot, Websterbot, PastBot, and NanshuBot) created 163,000 of the entries there. Another of these bots, "ThirdPersBot", was responsible for the addition of a number of third-person conjugations that would not have received their own entries in standard dictionaries; for instance, it defined "smoulders" as the "third-person singular simple present form of smoulder."

Of the 1,269,938 definitions the English Wiktionary provides for 996,450 English words, 478,068 are "form of" definitions of this kind. This means that even without such entries, its coverage of English is significantly larger than that of major monolingual print dictionaries. ''Merriam-Webster's Third New International Dictionary of the English Language, Unabridged'', for instance, has 475,000 entries (with many additional embedded headwords); the ''Oxford English Dictionary'' has 615,000 headwords, but includes Middle English as well, for which the English Wiktionary has an additional 34,234 gloss definitions. Detailed statistics exist to show how many entries of various kinds exist.

The English Wiktionary does not rely on bots to the extent that some other editions do. The French and Vietnamese Wiktionaries, for example, imported large sections of the Free Vietnamese Dictionary Project (FVDP), which provides free content bilingual dictionaries to and from Vietnamese. These imported entries make up virtually all of the Vietnamese edition's contents. Like the English edition, the French Wiktionary has imported approximately 20,000 entries from the Unihan database of Chinese, Japanese, and Korean characters. The French Wiktionary grew rapidly in 2006, thanks in large part to bots that copied many entries from old, freely licensed dictionaries, such as the eighth edition of the ''Dictionnaire de l'Académie française'' (1935, around 35,000 words), and that added words from other Wiktionary editions with French translations. The Russian edition grew by nearly 80,000 entries as "LXbot" added boilerplate entries (with headings, but without definitions) for words in English and German.

As of July 2021, the English Wiktionary has over 791,870 gloss definitions and over 1,269,938 total definitions (including different forms) for English entries alone, with a total of over 9,928,056 definitions across all languages.
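
The figures quoted above fit together arithmetically: removing the "form of" definitions from the total leaves the gloss-definition count. The following is a minimal sketch of that check in Python. The variable names are illustrative only, and the ratios are a rough indication rather than a like-for-like comparison, since the print dictionaries count entries or headwords while the Wiktionary figure counts definitions.

```python
# Figures as quoted in the text (English Wiktionary, English entries only).
total_definitions = 1_269_938
form_of_definitions = 478_068

# Gloss definitions are what remains once "form of" definitions are removed.
gloss_definitions = total_definitions - form_of_definitions

# Entry/headword counts quoted for two print dictionaries.
print_dictionaries = {
    "Webster's Third": 475_000,
    "Oxford English Dictionary": 615_000,
}

# Rough size ratios; the units differ (definitions vs. entries or headwords).
ratios = {name: gloss_definitions / size for name, size in print_dictionaries.items()}

if __name__ == "__main__":
    print(f"gloss definitions: {gloss_definitions:,}")
    for name, ratio in ratios.items():
        print(f"{name}: about {ratio:.2f}x")
```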

Logos

Wiktionary has historically lacked a uniform logo across its numerous language editions. Some editions use logos that depict a dictionary entry about the term "Wiktionary", based on the previous English Wiktionary logo, which was designed by Brooke Vibber, a MediaWiki developer. Because a purely textual logo must vary considerably from language to language, a four-phase contest to adopt a uniform logo was held at the Wikimedia Meta-Wiki from September to October 2006. Some communities adopted the winning entry by "Smurrayinchester", a 3×3 grid of wooden tiles, each bearing a character from a different writing system. However, the poll did not see as much participation from the Wiktionary community as some community members had hoped, and a number of the larger wikis ultimately kept their textual logos.

In April 2009, the issue was resurrected with a new contest. This time, a depiction by "AAEngelman" of an open hardbound dictionary won a head-to-head vote against the 2006 logo, but the process to refine and adopt the new logo then stalled. In the following years, some wikis replaced their textual logos with one of the two newer logos. In 2012, 55 wikis that had been using the English Wiktionary logo received localized versions of the 2006 design by "Smurrayinchester". In July 2016, the English Wiktionary adopted a variant of this logo. A total of 135 wikis, representing 61% of Wiktionary's entries, use a logo based on the 2006 design by "Smurrayinchester", 33 wikis (36%) use a textual logo, and three wikis (3%) use the 2009 design by "AAEngelman".

Multi-lingual

Wiktionary sites exist for many languages; some of these sites are active and others are closed. The counts of active and closed sites are drawn from Wikimedia's API:Sitematrix (retrieved from Data:Wikipedia statistics/meta.tab), and the article counts for the active and closed sites from Wikimedia's API:Siteinfo (retrieved from Data:Wikipedia statistics/data.tab). The projects also record how many users are registered and how many of them are recently active. For a complete list of the language projects with totals, see Wikimedia Statistics.
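
As an illustration of how active and closed sites can be told apart in Sitematrix-style data, here is a minimal Python sketch. It works on a small hand-written sample rather than real API output, and the field names (`site`, `code`, `closed`, and the numeric group keys) reflect an assumption about the response layout that should be checked against the MediaWiki API documentation before use.

```python
# Hand-written sample in the assumed shape of a Sitematrix response.
# The entries are illustrative, not real API output.
SAMPLE = {
    "sitematrix": {
        "count": 3,
        "0": {"code": "aa", "site": [
            {"dbname": "aawiktionary", "code": "wiktionary", "closed": ""},
        ]},
        "1": {"code": "en", "site": [
            {"dbname": "enwiki", "code": "wiki"},
            {"dbname": "enwiktionary", "code": "wiktionary"},
        ]},
        "2": {"code": "fr", "site": [
            {"dbname": "frwiktionary", "code": "wiktionary"},
        ]},
        "specials": [],
    }
}


def count_wiktionaries(matrix):
    """Return (active, closed) counts of sites whose project code is 'wiktionary'."""
    active = closed = 0
    for key, group in matrix.items():
        if not key.isdigit():
            continue  # skip non-language keys such as "count" and "specials"
        for site in group.get("site", []):
            if site.get("code") != "wiktionary":
                continue
            # Assumption: a closed site carries a "closed" marker (any value but False).
            if site.get("closed") not in (None, False):
                closed += 1
            else:
                active += 1
    return active, closed


if __name__ == "__main__":
    print(count_wiktionaries(SAMPLE["sitematrix"]))
```

Keeping the counting logic separate from any network request makes it straightforward to test against saved responses.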

Critical reception

Critical reception of Wiktionary has been mixed. In 2006, Jill Lepore wrote in the article "Noah's Ark" for ''The New Yorker'':

There's no show of hands at ''Wiktionary''. There's not even an editorial staff. "Be your own lexicographer!", might be ''Wiktionary's'' motto. Who needs experts? Why pay good money for a dictionary written by lexicographers when we could cobble one together ourselves? ''Wiktionary'' isn't so much republican or democratic as Maoist. And it's only as good as the copyright-expired books from which it pilfers.

Keir Graff's review for ''Booklist'' was less critical:

Is there a place for Wiktionary? Undoubtedly. The industry and enthusiasm of its many creators are proof that there's a market. And it's wonderful to have another strong source to use when searching the odd terms that pop up in today's fast-changing world and the online environment. But as with so many Web sources (including this column), it's best used by sophisticated users in conjunction with more reputable sources.

References in other publications are fleeting and part of larger discussions of Wikipedia, not progressing beyond a definition, although David Brooks in ''The Nashua Telegraph'' described it as "wild and woolly". One of the impediments to independent coverage of Wiktionary is the continuing confusion that it is merely an extension of Wikipedia. A study measuring the correctness of the inflections for a subset of the Polish words in the English Wiktionary found this grammatical data to be very stable: only 131 out of 4,748 Polish words had had their inflection data corrected. Wiktionary has seen growing use in academia.

Wiktionary data in natural language processing

Wiktionary has semi-structured data. Wiktionary lexicographic data can be converted to machine-readable format in order to be used in natural language processing tasks.

Wiktionary's data mining is a complex task. There are the following difficulties:
* (1) the constant and frequent changes to data and schemata,
* (2) the heterogeneity in Wiktionary language edition schemata, and
* (3) the human-centric nature of a wiki.

There are several parsers for different Wiktionary language editions:
* DBpedia Wiktionary: a subproject of DBpedia; the data are extracted from the English, French, German, and Russian Wiktionaries and include language, parts of speech, definitions, semantic relations and translations. A declarative description of the page schema, regular expressions and finite state transducers are used in order to extract information.
* JWKTL (Java Wiktionary Library): provides access to English Wiktionary and German Wiktionary dumps via a Java Wiktionary API. The data includes language, parts of speech, definitions, quotations, semantic relations, etymologies and translations. JWKTL is distributed under the Apache License.
* wikokit: the parser of English Wiktionary and Russian Wiktionary. The parsed data includes language, parts of speech, definitions, quotations, semantic relations and translations. This is multi-licensed open-source software.
* Etymological entries have been parsed in the Etymological WordNet project.

Examples of natural language processing tasks which have been solved with the help of Wiktionary data include:
* Rule-based machine translation between Dutch and Afrikaans; data of English Wiktionary, Dutch Wiktionary and Wikipedia were used with the Apertium machine translation platform.
* Construction of a machine-readable dictionary by the parser NULEX, which integrates open linguistic resources: English Wiktionary, WordNet, and VerbNet. The parser NULEX scrapes English Wiktionary for tense information (verbs), plural form and parts of speech (nouns).
* Speech recognition and synthesis, where Wiktionary was used to automatically create pronunciation dictionaries. Word-pronunciation pairs were retrieved from 6 Wiktionary language editions (Czech, English, French, Spanish, Polish, and German). Pronunciations are given in terms of the International Phonetic Alphabet. The ASR system based on English Wiktionary has the highest word error rate, where every third phoneme has to be changed.
* Ontology engineering and semantic network construction.
* Ontology matching.
* Text simplification. Medero & Ostendorf assessed vocabulary difficulty (reading level detection) with the help of Wiktionary data. Properties of words extracted from Wiktionary entries (definition length and POS, sense, and translation counts) were investigated. Medero & Ostendorf expected that
** (1) very common words will be more likely to have multiple parts of speech,
** (2) common words will be more likely to have multiple senses,
** (3) common words will be more likely to have been translated into multiple languages.
: These features extracted from Wiktionary entries were useful in distinguishing word types that appear in Simple English Wikipedia articles from words that only appear in the comparable Standard English articles.
* Part-of-speech tagging. Li et al. (2012) built multilingual POS-taggers for eight resource-poor languages on the basis of English Wiktionary and hidden Markov models.
* Sentiment analysis.

"Wikidata:Lexicographical data" was started in 2018 to provide structured data support to Wiktionaries. It stores word data of all languages in a machine-readable data model, under a dedicated "Lexeme" namespace in Wikidata. As of October 2021, the project had amassed over 600,000 lexeme entries in various languages.

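The regular-expression style of extraction mentioned for the parsers above can be sketched in a few lines of Python. This is not the method of DBpedia Wiktionary, JWKTL, or wikokit. It is a simplified illustration that assumes a page layout in which `==Language==` headings contain `===Part of speech===` headings and definitions are lines starting with `# `. The sample entry, the function name `parse_entry`, and the patterns are all hypothetical.

```python
import re

# Hypothetical, simplified wikitext; real entries are far more varied.
SAMPLE = """\
==English==

===Verb===
{{en-verb}}

# First illustrative verb sense.
#* A quotation line, which this sketch ignores.
# Second illustrative verb sense.

===Noun===
# Illustrative noun sense.

==Dutch==

===Noun===
# Illustrative Dutch noun sense.
"""

LANG_RE = re.compile(r"^==([^=]+)==\s*$")       # level-2 heading: language
POS_RE = re.compile(r"^===+([^=]+)===+\s*$")    # level-3+ heading: treated as part of speech
DEF_RE = re.compile(r"^#\s+(.*\S)\s*$")         # "# text" lines: definitions


def parse_entry(wikitext):
    """Return (language, part_of_speech, definition) tuples found in the wikitext."""
    records = []
    language = pos = None
    for line in wikitext.splitlines():
        match = LANG_RE.match(line)
        if match:
            language, pos = match.group(1).strip(), None
            continue
        match = POS_RE.match(line)
        if match:
            pos = match.group(1).strip()
            continue
        match = DEF_RE.match(line)
        if match and language and pos:
            records.append((language, pos, match.group(1)))
    return records


if __name__ == "__main__":
    for record in parse_entry(SAMPLE):
        print(record)
```

A sketch like this also shows why the difficulties listed above matter: any change to the heading or list conventions of a language edition silently breaks the patterns, and each edition would need its own set.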

External links

* Wikipedia:List of Wiktionaries
* List of all Wiktionary editions
* Wiktionary's multilingual statistics (/en.wiktionary.org/wiki/Wiktionary:Multilingual_statistics)
* Wikimedia's page on Wiktionary (including list of all existing Wiktionaries)
* Pages about Wiktionary in Meta