
General Architecture for Text Engineering (GATE) is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

As of May 28, 2011, 881 people were on the gate-users mailing list at SourceForge.net, and 111,932 downloads from SourceForge had been recorded since the project moved to SourceForge in 2005. The paper "GATE: A framework and graphical development environment for robust NLP tools and applications" has received over 2000 citations since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide, include "Building Search Applications: Lucene, LingPipe, and Gate", by Manu Konchady, and "Introduction to Linguistic Annotation and Text Analytics", by Graham Wilcock. The GATE community and its research have been involved in several European research projects, including Transitioning Applications to Ontologies, SEKT, NeOn, Media-Campaign, Musing, Service-Finder, LIRICS and KnowledgeWeb.


Features

GATE includes an information extraction system called ANNIE (A Nearly-New Information Extraction System), which is a set of modules comprising a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named entities transducer and a coreference tagger. ANNIE can be used as-is to provide basic information extraction functionality, or provide a starting point for more specific tasks.

Languages currently handled in GATE include English, Chinese, Arabic, Bulgarian, French, German, Hindi, Italian, Cebuano, Romanian, Russian and Danish.

Plugins are included for machine learning with Weka, RASP, MAXENT and SVM Light, as well as a LIBSVM integration and an in-house perceptron implementation; for managing ontologies like WordNet; for querying search engines like Google or Yahoo; for part-of-speech tagging with Brill or TreeTagger; and many more. Many external plugins are also available, for handling e.g. tweets.

GATE accepts input in various formats, such as TXT, HTML, XML, Doc and PDF documents, and Java Serial, PostgreSQL, Lucene and Oracle databases with the help of RDBMS storage over JDBC.

JAPE transducers are used within GATE to manipulate annotations on text. Documentation is provided in the GATE User Guide. A tutorial has also been written by Press Association Images.
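The module chain described above can be pictured as a sequence of stages that each read and add stand-off annotations on one shared document. The following is a minimal sketch of that idea in plain Python; it does not use GATE's API or JAPE syntax. The names `Annotation`, `Document`, `TITLES`, `person_rule` and `run_pipeline` are hypothetical and chosen only for illustration, and the single rule stands in very loosely for what a transducer over annotations might do.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int          # character offset where the span begins
    end: int            # character offset just past the span
    type: str           # e.g. "Token", "Lookup", "Person"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def of_type(self, type_name):
        """Return annotations of one type, ordered by start offset."""
        return sorted((a for a in self.annotations if a.type == type_name),
                      key=lambda a: a.start)


def tokenizer(doc):
    """Add a Token annotation for every run of word characters."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(
            Annotation(m.start(), m.end(), "Token", {"string": m.group()}))


# Hypothetical gazetteer list (an assumption): surface string -> category.
TITLES = {"Dr": "title", "Prof": "title"}


def gazetteer(doc):
    """Add a Lookup annotation over each token found in the list."""
    for tok in doc.of_type("Token"):
        category = TITLES.get(tok.features["string"])
        if category:
            doc.annotations.append(
                Annotation(tok.start, tok.end, "Lookup", {"majorType": category}))


def person_rule(doc):
    """Illustrative rule: a title Lookup followed by a capitalised Token
    becomes one Person annotation spanning both."""
    tokens = doc.of_type("Token")
    for lookup in doc.of_type("Lookup"):
        if lookup.features.get("majorType") != "title":
            continue
        following = [t for t in tokens if t.start >= lookup.end]
        if following and following[0].features["string"][0].isupper():
            doc.annotations.append(
                Annotation(lookup.start, following[0].end, "Person",
                           {"rule": "TitleName"}))


def run_pipeline(text, stages):
    """Run each stage in order over one shared document."""
    doc = Document(text)
    for stage in stages:
        stage(doc)
    return doc


doc = run_pipeline("Dr. Smith met Prof. Jones in Sheffield.",
                   [tokenizer, gazetteer, person_rule])
people = [doc.text[a.start:a.end] for a in doc.of_type("Person")]
print(people)  # ['Dr. Smith', 'Prof. Jones']
```

Because every stage only appends annotations and never edits the text, stages can be added, removed or reordered without touching the others, which is one plausible reason a modular design of this kind suits both as-is use and customisation.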


GATE Developer

The screenshot shows the document viewer used to display a document and its annotations. Hyperlink annotations from an HTML file are shown in pink. The list on the right is the annotation sets list, and the table at the bottom is the annotation list. In the center is the annotation editor window.


GATE Mímir

GATE generates vast quantities of information, including natural language text, semantic annotations, and ontological information. Sometimes the data itself is the end product of an application, but often the information would be more useful if it could be efficiently searched. GATE Mímir provides support for indexing and searching the linguistic and semantic information generated by such applications, and allows the information to be queried using arbitrary combinations of text, structural information, and SPARQL.
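To make the idea of querying text and annotations together more concrete, here is a toy in-memory index in Python. It is only a sketch under simplifying assumptions: it is not Mímir's implementation or query language, it works at whole-document granularity, and it leaves out structural and SPARQL constraints entirely. The class name `AnnotationIndex` and the sample documents and annotation types are hypothetical.

```python
from collections import defaultdict


class AnnotationIndex:
    """Toy inverted index over words and annotation types per document."""

    def __init__(self):
        self.by_word = defaultdict(set)   # lowercased word -> document ids
        self.by_type = defaultdict(set)   # annotation type -> document ids

    def add(self, doc_id, text, annotation_types):
        """Index one document's words and the annotation types found in it."""
        for word in text.lower().split():
            self.by_word[word.strip(".,;:!?")].add(doc_id)
        for type_name in annotation_types:
            self.by_type[type_name].add(doc_id)

    def search(self, words=(), types=()):
        """Return ids of documents containing every word and every type."""
        postings = [self.by_word.get(w.lower(), set()) for w in words]
        postings += [self.by_type.get(t, set()) for t in types]
        if not postings:
            return []
        return sorted(set.intersection(*postings))


index = AnnotationIndex()
# Annotation types are supplied by hand here (an assumption); in practice
# they would come from an annotation pipeline.
index.add("d1", "GATE was developed in Sheffield.", ["Location", "Organization"])
index.add("d2", "Sheffield is a city.", ["Location"])
index.add("d3", "GATE has many plugins.", ["Organization"])

hits = index.search(words=["gate"], types=["Location"])
print(hits)  # ['d1']
```

Keeping separate posting sets for words and for annotation types lets a query intersect them freely, which is the basic mechanism behind combining textual and semantic constraints in one search.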


See also

* Unstructured Information Management Architecture (UIMA)
* OpenNLP
* Pheme, a major EU project managed by the GATE group on early detection of false information in social media

