Apache Object Oriented Data Technology (OODT) is an open-source data management system framework managed by the Apache Software Foundation. OODT was originally developed at NASA's Jet Propulsion Laboratory to support the capture, processing, and sharing of data for NASA's scientific archives.
History
The project started as an internal NASA Jet Propulsion Laboratory project initiated by Daniel J. Crichton, Sean Kelly and Steve Hughes. The early focus of the effort was on information integration and search using XML, as described in Crichton et al.'s paper at the CODATA meeting in 2000.
After being deployed to the Planetary Data System and to the National Cancer Institute's Early Detection Research Network (EDRN) project, OODT moved in 2005 into large-scale data processing and management through NASA's Orbiting Carbon Observatory (OCO) project. OODT's role on OCO was to provide a new data management and processing framework that, instead of tens of jobs per day and tens of gigabytes of data, would handle 10,000 jobs per day and hundreds of terabytes of data. Dr. Chris Mattmann at NASA JPL led a team of three to four developers between 2005 and 2009 that completely re-engineered OODT to support these new requirements.
Influenced by the emerging efforts in Apache Nutch and Hadoop, in which Mattmann participated, OODT was given an overhaul that made it a better fit for the Apache Software Foundation's style of project. In addition, Mattmann had a close relationship with Dr. Justin Erenkrantz, who was the Apache Software Foundation President at the time, and the idea of bringing OODT to the Apache Software Foundation emerged. In 2009, Mattmann and his team received approval from NASA and from JPL to bring OODT to Apache, making it the first NASA project to be stewarded by the foundation. Seven years later, the project released version 1.0.
Features
OODT focuses on two canonical use cases: big data processing and information integration. Both were described in Mattmann's ICSE 2006 and SMC-IT 2009 papers. It provides three core services.
File Manager
A File Manager is responsible for tracking file locations, their metadata, and for transferring files from a staging area to controlled access storage.
Workflow Manager
A Workflow Manager captures control flow and data flow for complex processes, and allows for reproducibility and the construction of scientific pipelines.
Resource Manager
A Resource Manager handles allocation of Workflow Tasks and other jobs to underlying resources. For example, Python jobs go to nodes with Python installed on them, and jobs that require a large disk or many CPUs are sent to nodes that fulfill those requirements.
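The matching idea behind the Resource Manager can be illustrated with a minimal sketch. This is not OODT's API: the `Node` and `Job` classes, their fields, and the first-fit `assign` policy are all hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A compute node and what it offers (hypothetical fields)."""
    name: str
    software: set = field(default_factory=set)
    disk_gb: int = 0
    cpus: int = 1


@dataclass
class Job:
    """A job and the minimum requirements it places on a node (hypothetical)."""
    name: str
    needs_software: set = field(default_factory=set)
    min_disk_gb: int = 0
    min_cpus: int = 1


def eligible_nodes(job, nodes):
    """Return the nodes that satisfy every requirement of the job."""
    return [
        n for n in nodes
        if job.needs_software <= n.software  # all required software is installed
        and n.disk_gb >= job.min_disk_gb
        and n.cpus >= job.min_cpus
    ]


def assign(job, nodes):
    """Pick the first eligible node, or None if no node qualifies (first-fit)."""
    candidates = eligible_nodes(job, nodes)
    return candidates[0] if candidates else None


if __name__ == "__main__":
    cluster = [
        Node("node-a", software={"python"}, disk_gb=100, cpus=4),
        Node("node-b", software={"java"}, disk_gb=2000, cpus=32),
    ]
    print(assign(Job("py-task", needs_software={"python"}), cluster))
    print(assign(Job("big-disk", min_disk_gb=1000), cluster))
```

A real resource manager would also track node load and queue jobs that cannot be placed yet; the sketch only covers the requirement check.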
In addition to the three core services, OODT provides several client-oriented frameworks that build on these services.
File Crawler
A File Crawler automatically extracts metadata and uses Apache Tika to identify file types and ingest the associated information into the File Manager.
Catalog and Archive Crawling Framework
A Push/Pull framework acquires remote files and makes them available to the system.
Catalog and Archive Service Production Generation Executive (CAS-PGE)
A scientific algorithm wrapper (called CAS-PGE, for Catalog and Archive Service Production Generation Executive) encapsulates scientific codes and allows them to be executed independently of the environment, while capturing provenance and making the algorithms easy to integrate into a production system.
CAS RESTful Services
A set of RESTful APIs that expose the capabilities of the File Manager, Workflow Manager and Resource Manager components.
OPSUI Monitor Dashboard
A web application that exposes services from the underlying OODT product, workflow, and resource management control systems via the JAX-RS specification. At this stage it is built using Apache Wicket components.
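The crawl-and-ingest pattern described for the File Crawler can be sketched roughly as follows. This is an illustration under stated assumptions, not OODT code: Python's standard `mimetypes` module stands in for Apache Tika (it guesses from the file extension only, whereas Tika inspects content), and the `crawl` and `ingest` functions, the record fields, and the in-memory `catalog` dictionary are hypothetical.

```python
import mimetypes
import os


def crawl(staging_dir):
    """Walk staging_dir and build one metadata record per file."""
    records = []
    for root, _dirs, files in os.walk(staging_dir):
        for filename in sorted(files):
            path = os.path.join(root, filename)
            # Stand-in for Tika: guess the type from the file extension only.
            mime_type, _encoding = mimetypes.guess_type(path)
            records.append({
                "filename": filename,
                "location": path,
                "size_bytes": os.path.getsize(path),
                "mime_type": mime_type or "application/octet-stream",
            })
    return records


def ingest(records, catalog):
    """Stand-in for handing records to a file manager: index them by location."""
    for record in records:
        catalog[record["location"]] = record
    return catalog


if __name__ == "__main__":
    catalog = ingest(crawl("."), {})
    for location, record in catalog.items():
        print(location, record["mime_type"], record["size_bytes"])
```

Separating `crawl` from `ingest` mirrors the split in the description above: one component discovers files and extracts metadata, and another records where each file lives.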
The overall motivation for OODT's re-architecting was described by Mattmann in a 2013 paper in Nature titled "A Vision for Data Science".
OODT is written in Java and, through its REST API, can be used from other languages, including Python.
Notable uses
OODT has recently been highlighted as contributing to NASA missions including Soil Moisture Active Passive and New Horizons. OODT also helps to power the Square Kilometre Array telescope, extending its use from Earth science and planetary science to radio astronomy and other sectors. OODT is also used within bioinformatics and is a part of the Knowledgent Big Data Platform.
External links
* http://oodt.apache.org