MeCab is an open-source text segmentation library for use with text written in the Japanese language. It was originally developed by the Nara Institute of Science and Technology and is currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. The name derives from the developer's favorite food, mekabu (和布蕪), a Japanese dish made from wakame leaves.

The software was originally based on ChaSen and was developed under the name ChaSenTNG, but it has since been rewritten from scratch and is now developed independently of ChaSen. MeCab's analysis accuracy is comparable to ChaSen's, and its analysis is on average 3–4 times faster. MeCab can segment a sentence into words and identify each word's part of speech. Several dictionaries are available for MeCab; as with ChaSen, IPADIC is the most commonly used.

In 2007, Google used MeCab to generate n-gram data for a large corpus of Japanese text, which it published on its Google Japan blog. MeCab is also used for Japanese input on Mac OS X 10.5 and 10.6, and in iOS since version 2.1.


Example

Input:

Results in:

Besides segmenting the text, MeCab also lists each word's part of speech and, if applicable and present in the dictionary, its pronunciation. In the example, the verb できる (dekiru, "to be able to") is classified as an ichidan (一段) verb (動詞) in its base form (基本形). The word でも (demo) is identified as an adverbial particle (副助詞). Because not every column applies to every word, an asterisk marks a column that does not apply; this makes it possible to treat the information after the word and the tab character as comma-separated values.

MeCab also supports several output formats. One of them outputs tab-separated values in a format that programs written for ChaSen can use. Another, whose name comes from 読む (yomu, "to read"), outputs the pronunciation of the input text as katakana.
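The default line layout described above (the word, a tab character, then comma-separated columns with an asterisk for columns that do not apply) can be read back with a few lines of Python. This is a minimal sketch, not part of MeCab itself: the column names in `FIELDS` are hypothetical labels for the usual IPADIC column order, the two sample lines are illustrative rather than captured output, and the `EOS` end-of-sentence marker is assumed to follow each analyzed sentence. The naive comma split would need extra care for words that themselves contain a comma.

```python
from typing import NamedTuple, Optional

# Hypothetical labels for the assumed IPADIC column order;
# other dictionaries may emit different columns.
FIELDS = (
    "pos", "pos_detail1", "pos_detail2", "pos_detail3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
)


class Token(NamedTuple):
    surface: str    # the word as it appears in the text
    features: dict  # column label -> value, or None where MeCab printed "*"


def parse_line(line: str) -> Optional[Token]:
    """Parse one 'word<TAB>col1,col2,...' line; return None for blank or EOS lines."""
    line = line.rstrip("\n")
    if not line or line == "EOS":
        return None
    surface, _, feature_str = line.partition("\t")
    values = feature_str.split(",")
    # An asterisk means "this column does not apply", so map it to None.
    features = {
        name: (None if value == "*" else value)
        for name, value in zip(FIELDS, values)
    }
    return Token(surface, features)


# Illustrative lines consistent with the classifications described in the text.
SAMPLE = (
    "できる\t動詞,自立,*,*,一段,基本形,できる,デキル,デキル\n"
    "でも\t助詞,副助詞,*,*,*,*,でも,デモ,デモ\n"
    "EOS\n"
)

tokens = [t for t in map(parse_line, SAMPLE.splitlines()) if t is not None]

for token in tokens:
    print(token.surface, token.features.get("pos"), token.features.get("reading"))
```

Mapping the asterisk to `None` keeps "column does not apply" distinct from an empty string, and `features.get(...)` tolerates entries that carry fewer columns than the nine assumed here.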


External links

* Official website: https://taku910.github.io/mecab/