Hamshahri Corpus

The Hamshahri Corpus is a sizable Persian corpus based on the Iranian newspaper ''Hamshahri'', one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG) of the University of Tehran. Later, a team headed by Abolfazl AleAhmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks. The corpus was created by crawling the online news articles from the ''Hamshahri'' website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.
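The step of turning crawled HTML pages into plain text can be illustrated with a minimal sketch. This is not the pipeline used by the corpus authors, whose tooling is not described here; it only shows the general idea using Python's standard `html.parser` module. The class name `TextExtractor`, the helper `html_to_text`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # non-empty text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page; a real crawl would read pages saved from the website.
SAMPLE_PAGE = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>Story text &amp; more</p>"
    "<script>var x = 1;</script></body></html>"
)

plain = html_to_text(SAMPLE_PAGE)
print(plain)
```

A real news site would also need site-specific rules to drop menus, advertisements, and other page furniture, which this sketch does not attempt.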


Version 1.0

The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB. The corpus is available for download in the following formats:
* Tagged text: 560 MB
* SQL Server 2000 tables: 712 MB


Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:
* More news: 323,616 text stories in 3,206 XML files (one file for each day)
* Increased time span: from 22 June 1996 to 13 May 2007
* Larger size: 1.42 GB uncompressed
* Standard container: Unicode XML
* Included images: images have been extracted from the news and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks
* Categorized news: the news stories have been categorized semi-automatically, which is appropriate for text categorization and classification tasks

The corpus is available for download in XML format.
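Because the release is distributed as one Unicode XML file per day, a first step when working with it is often to inspect which elements a file contains. The sketch below streams an XML document with the standard library's `xml.etree.ElementTree.iterparse` and counts element tags without assuming any particular schema. The sample document and its element names (`day`, `story`, `title`, `body`) are hypothetical stand-ins and are not the corpus's actual markup.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream an XML document and count how often each element tag occurs.

    `source` may be a filename or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # free memory held by elements that are already counted
    return counts


# Hypothetical stand-in for one daily file; the tags are NOT the real schema.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<day>
  <story><title>نمونه</title><body>متن خبر</body></story>
  <story><title>نمونه دوم</title><body>متن دیگر</body></story>
</day>
""".encode("utf-8")

tag_counts = count_tags(io.BytesIO(SAMPLE))
for tag, n in sorted(tag_counts.items()):
    print(tag, n)
```

Streaming with `iterparse` keeps memory use low, which matters when a loop of this kind is run over several thousand daily files.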


See also

* Bijankhan Corpus
* Text corpus
* Information retrieval


External links


* Hamshahri Corpus Homepage
* irBlogs Collection Homepage