MAREC
   HOME

TheInfoList



OR:

The MAtrixware REsearch Collection (MAREC) is a standardised patent data corpus available for research purposes. MAREC seeks to represent patent documents of several languages in order to answer specific research questions. It consists of 19 million patent documents in different languages, normalised to a highly specific
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
schema. MAREC is intended as raw material for research in areas such as
information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
,
natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
or
machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...
, which require large amounts of complex documents. The collection contains documents in 19 languages, the majority being English, German and French, and about half of the documents include full text. In MAREC, the documents from different countries and sources are normalised to a common XML format with a uniform patent numbering scheme and citation format. The standardised fields include dates, countries, languages, references, person names, and companies as well as subject classifications such as
IPC IPC may refer to: Businesses and organizations Arts and media * Intellectual Property Committee, a coalition of US corporations with intellectual property interests * International Panorama Council, an international network of specialists in ...
codes. MAREC is a comparable corpus, where many documents are available in similar versions in other languages. A comparable corpus can be defined as consisting of texts that share similar topics – news text from the same time period in different countries, while a parallel corpus is defined as a collection of documents with aligned translations from the source to the target language. Since the patent document refers to the same “invention” or “concept of idea” the text is a translation of the invention, but it does not have to be a direct translation of the text itself – text parts could have been removed or added for clarification reasons. The 19,386,697 XML files measure a total of 621 GB and are hosted by the Information Retrieval Facility. Access and support are free of charge for research purposes.


Use Cases

* MAREC is used in the Patent Language Translations Online (PLuTO) project.


References


External links


User guide and statistics

Information Retrieval Facility
{{Webarchive, url=https://web.archive.org/web/20080522151226/http://www.ir-facility.org/ , date=2008-05-22 Corpora Information retrieval systems Machine translation Natural language processing XML