Tehran Monolingual Corpus
The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus suited to language modeling and related research areas in natural language processing. The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri Corpus text was improved for language modeling through a series of tokenization and spell-checking steps. TMC comprises more than 250 million words. The number of unique words with a frequency of two or more is about 300 thousand, which is relatively good for a highly inflectional language like Persian. TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use after obtaining permission from the corpus aggregator.
See also
* Bijankhan Corpus
* Hamshahri Corpus
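The vocabulary figure quoted for TMC counts only word types that occur at least twice. A minimal sketch of that kind of count is given here, assuming the text is already tokenized on whitespace; the function name `vocabulary` and the toy sentence are illustrative assumptions and are not taken from the TMC tooling or data.

```python
from collections import Counter


def vocabulary(tokens, min_freq=2):
    """Return the set of word types that occur at least min_freq times."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_freq}


# Toy, whitespace-tokenized sample (hypothetical data, not from TMC).
sample = "the cat saw the dog and the cat ran".split()

vocab = vocabulary(sample)
print(len(sample), "tokens;", len(vocab), "types with frequency >= 2")
```

Raising `min_freq` filters out rare forms such as misspellings, which is one common reason to report vocabulary size with a frequency threshold.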

Persian Language
Persian, also known by its endonym Farsi, is a Western Iranian language belonging to the Iranian branch of the Indo-Iranian subdivision of the Indo-European languages. Persian is a pluricentric language predominantly spoken and used officially within Iran, Afghanistan, and Tajikistan in three mutually intelligible standard varieties, namely Iranian Persian (officially known as ''Persian''), Dari Persian (officially known as ''Dari'' since 1964) and Tajiki Persian (officially known as ''Tajik'' since 1999) [Siddikzoda, S. "Tajik Language: Farsi or not Farsi?" in ''Media Insight Central Asia #27'', August 2002]. It is also spoken natively in the Tajik variety by a significant population within Uzbekistan, as well as within other regions with a Persianate history in the cultural sphere of Greater Ira ...


Language Model
A language model is a probability distribution over sequences of words. Given any sequence of words of length m, a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers. Language models are useful for a variety of problems in computational linguistics, from initial applications in speech recognition, where they ensure that nonsensical (i.e. low-probability) word sequences are not predicted, to wider use in machine tran ...
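As an illustration of the Markov assumption mentioned above, the following is a minimal sketch of a first-order (bigram) model. It approximates P(w_1,\ldots,w_m) as a product of P(w_i | w_{i-1}) and uses add-one smoothing so that unseen word pairs still receive non-zero probability. The class name `BigramModel`, the boundary markers `<s>` and `</s>`, and the toy sentences are assumptions made for this example; words outside the training vocabulary are not handled properly here.

```python
import math
from collections import Counter


class BigramModel:
    """First-order Markov language model with add-one smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each word is a context
        self.bigram_counts = Counter()   # how often each (prev, word) pair occurs
        for sentence in sentences:
            words = ["<s>"] + list(sentence) + ["</s>"]
            self.context_counts.update(words[:-1])
            self.bigram_counts.update(zip(words[:-1], words[1:]))
        # Everything the model can predict: training words plus end marker.
        self.vocab = {w for s in sentences for w in s} | {"</s>"}

    def prob(self, prev, word):
        """Smoothed estimate of P(word | prev)."""
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + len(self.vocab)
        return numerator / denominator

    def log_prob(self, sentence):
        """Log-probability of a whole sentence under the Markov assumption."""
        words = ["<s>"] + list(sentence) + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(words[:-1], words[1:]))


# Toy training data (hypothetical).
model = BigramModel([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(model.log_prob(["the", "cat", "sat"]))   # word order seen in training
print(model.log_prob(["sat", "cat", "the"]))   # unseen order, lower but finite
```

Add-one smoothing is the simplest choice and is used here only for brevity; other smoothing schemes follow the same interface.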

Natural Language Processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
History
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, ...


Hamshahri Corpus
The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper ''Hamshahri'', one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG). Later, a team headed by Ale Ahmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks. This corpus w ...

Iranian Students News Agency
The Iranian Students' News Agency (ISNA) is a news agency run by Iranian university students.
Position
It covers a variety of national and international topics [Engber, Daniel. "What's With the Iranian Students News Agency?", ''Slate'', 2 February 2006. Retrieved 7 February 2007]. Editors and correspondents are themselves students in a variety of subjects, and many of them (nearly 1000) are volunteers. ISNA is considered by Western media to be one of the most independent and moderate media organizations in Iran, and is often quoted. While taking a reformist view of events, ISNA has managed to remain politically independent. It has, however, maintained its loyalty to the former president and carries a section devoted to "Khatami's perspectives". Although it is generally considered independent, ISNA is financially supported in part by the Iranian government and by ACECR (Academic Center for Education, Culture and Research), another student organization. The agency's ma ...


Tokenization (lexical Analysis)
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a ''lexer'', ''tokenizer'', or ''scanner'', although ''scanner'' is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
Applications
A lexer forms the first phase of a compiler frontend in modern processing. Analysis generally occurs in one pass. In older languages such as ALGOL, the initial stage was instead line reconstruction, which performed unstropping and removed whitespace and comments (and had scannerless parsers, with no separate lexer). These steps are now done as part of the lexer. Lexers and parsers are most often used for compilers, bu ...
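The conversion from characters to lexical tokens described above can be sketched with a small regular-expression lexer. This is only an illustration for a toy arithmetic language: the token kinds (`NUMBER`, `IDENT`, `OP`) and the names `Token`, `TOKEN_SPEC` and `tokenize` are assumptions chosen for the example, not part of any particular compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # the assigned meaning, e.g. "NUMBER"
    text: str   # the matched characters


# Token kinds for a toy arithmetic language (hypothetical).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens from source in a single left-to-right pass."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":   # whitespace is recognized but dropped
            yield Token(match.lastgroup, match.group())
        pos = match.end()


for token in tokenize("x = 3.14 * (y + 2)"):
    print(token.kind, token.text)
```

A parser would then consume this token stream instead of raw characters, which is the usual division of labour between the two stages.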

University Of Tehran
The University of Tehran (Tehran University or UT; Persian: دانشگاه تهران) is the most prominent university located in Tehran, Iran. Based on its historical, socio-cultural, and political pedigree, as well as its research and teaching profile, UT has been nicknamed "The Mother University of Iran" (Persian: دانشگاه مادر). In international rankings, UT has been ranked as one of the best universities in the Middle East and is among the top universities of the world. It is also the premier knowledge-producing institute among all OIC countries. Tehran University of Medical Sciences was ranked 7th in the Islamic World University Ranking in 2021. The university offers more than 111 bachelor's degree programs, 177 master's degree programs, and 156 PhD programs. Many of the departments were absorbed into the University of Tehran from the Dar al-Funun, established in 1851, and the Tehran School of Political Sciences, established in 1899. The main campus of the univers ...




Bijankhan Corpus
The Bijankhan corpus (Persian: پیکرهٔ بی‌جن‌خان) is a tagged corpus that is suitable for natural language processing (NLP) research on the Persian language. The collection is gathered from daily news and common texts. All of its documents are categorized by subject, such as political or cultural, into about 4300 different subject categories. The corpus contains about 2.6 million manually tagged words with a tag set that contains 550 Persian part-of-speech tags. The Bijankhan corpus was created by the Database Research Group at the University of Tehran. The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. The Bijankhan corpus is named after Mahmood Bijankhan, professor of linguistics at the University of Tehran, for his contributions in this area.
See also
* Hamshahri Corpus


Persian Corpora
Persian may refer to:
* People and things from Iran, historically called ''Persia'' in the English language
** Persians, the majority ethnic group in Iran, not to be conflated with the Iranic peoples
** Persian language, an Iranian language of the Indo-European family, native language of ethnic Persians
*** Persian alphabet, a writing system based on the Perso-Arabic script
* People and things from the historical Persian Empire
Other uses
* Persian (patience), a card game
* Persian (roll), a pastry native to Thunder Bay, Ontario
* Persian (wine)
* Persian, Indonesia, on the island of Java
* Persian cat, a long-haired breed of cat characterized by its round face and shortened muzzle
* The Persian, a character from Gaston Leroux's ''The Phantom of the Opera''
* Persian, a generation I Pokémon species
* Alpha Indi, star also known as "The Persian"
See also
* Persian Empire (disambiguation)
* Persian expedition (disambiguation) or Persian campaign
* Persian Gulf (disambiguation)
* ...


Applied Linguistics
Applied linguistics is an interdisciplinary field which identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, psychology, communication research, information science, natural language processing, anthropology, and sociology.
Domain
Applied linguistics is an interdisciplinary field. Major branches of applied linguistics include bilingualism and multilingualism, conversation analysis, contrastive linguistics, language assessment, literacies, discourse analysis, language pedagogy, second language acquisition, language planning and policy, interlinguistics, stylistics, language teacher education, forensic linguistics, and translation.
Journals
Major journals of the field include ''Research Methods in Applied Linguistics'', ''Annual Review of Applied Linguistics'', ''Applied Linguistics'', ''Studies in Second Language Acquisition'', ''Applied Psycholinguistics'', ...


Linguistic Research
Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguistics is concerned with both the cognitive and social aspects of language. It is considered a scientific field as well as an academic discipline; it has been classified as a social science, natural science, cognitive science [Thagard, Paul. "Cognitive Science", ''The Stanford Encyclopedia of Philosophy'' (Fall 2008 Edition), Edward N. Zalta (ed.)], or part of the humanities. Traditional areas of linguistic analysis correspond to phenomena found in human linguistic systems, such as syntax (rules governing the structure of sentences); semantics (meaning); morphology (structure of words); phonetics (speech sounds and equivalent gestures in sign languages); phonology (the abstract sound system of a particular language); and pragmatics (how social con ...