
HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL. The public release of HPCC was announced in 2011, after ten years of in-house development (according to LexisNexis). It is an alternative to Hadoop and other big data platforms.


System architecture

The HPCC system architecture includes two distinct cluster processing environments, Thor and Roxie, each of which can be optimized independently for its parallel data processing purpose. The first of these platforms, Thor, is a data refinery whose overall purpose is the general processing of massive volumes of raw data of any type. It is typically used for data cleansing and hygiene, ETL (extract, transform, load) processing of the raw data, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The data refinery name Thor is a reference to the mythical Norse god of thunder, whose large hammer is symbolic of crushing large amounts of raw data into useful information. A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms. Figure 2 shows a representation of a physical Thor processing cluster, which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, auxiliary and common components are needed to implement a complete HPCC processing environment.

The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine. This platform is designed as an online high-performance structured query and analysis platform, or data warehouse, that meets the parallel data access requirements of online applications through Web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times. Roxie utilizes a distributed indexed filesystem to provide parallel processing of queries, using an execution environment and filesystem optimized for high-performance online processing. A Roxie cluster is similar in its function and capabilities to ElasticSearch and to Hadoop with HBase and Hive capabilities added, and provides near-real-time, predictable query latencies. Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity. Figure 3 shows a representation of a physical Roxie processing cluster, which functions as an online query execution engine for high-performance query and data warehousing applications. A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server, which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment.

Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.
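The division of labour between a batch refinery that builds partitioned index files and a query engine that serves keyed lookups from them can be illustrated with a small, single-process sketch. This is not HPCC or ECL code, and it does not model real cluster distribution; the names `NUM_PARTS`, `part_of`, `clean`, `build_index`, and `query` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

NUM_PARTS = 4  # hypothetical number of index partitions (stand-in for cluster nodes)


def part_of(key: str) -> int:
    """Deterministically map a key to the partition that owns it."""
    return sum(key.encode("utf-8")) % NUM_PARTS


def clean(record: dict) -> dict:
    """Batch-side cleansing step: trim and lower-case the name field."""
    return {"id": record["id"], "name": record["name"].strip().lower()}


def build_index(raw_records: list) -> list:
    """Batch phase (the role Thor plays): clean all records, build one keyed index per partition."""
    parts = [defaultdict(list) for _ in range(NUM_PARTS)]
    for rec in map(clean, raw_records):
        parts[part_of(rec["name"])][rec["name"]].append(rec["id"])
    return [dict(p) for p in parts]


def query(index_parts: list, name: str) -> list:
    """Query phase (the role Roxie plays): consult only the partition that owns the key."""
    key = name.strip().lower()
    return sorted(index_parts[part_of(key)].get(key, []))


if __name__ == "__main__":
    raw = [
        {"id": 1, "name": " Ada "},
        {"id": 2, "name": "ada"},
        {"id": 3, "name": "Grace"},
    ]
    index = build_index(raw)      # done once, in batch
    print(query(index, "ADA"))    # [1, 2]
    print(query(index, "grace"))  # [3]
```

The point of the sketch is the separation of concerns: the expensive cleansing and index construction happens once in the batch phase, while each query touches only one pre-built partition.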


Software architecture

The HPCC software architecture incorporates the Thor and Roxie clusters as well as common middleware components, an external communications layer, client interfaces which provide both end-user services and system management tools, and auxiliary components to support monitoring and to facilitate loading and storing of filesystem data from external sources. Usually an HPCC environment includes only Thor clusters, or both Thor and Roxie clusters, although Roxie is occasionally used to build its own indexes. The overall HPCC software architecture is shown in Figure 4.


HPCC Systems

HPCC Systems (High Performance Computing Cluster) is part of LexisNexis Risk Solutions and was formed to promote and sell the HPCC software. In June 2011, it announced the offering of the software under an open source dual license model. HPCC Systems offers both a Community Edition and an Enterprise Edition. The Community Edition is free to download, includes the source code and is released under the Apache License 2.0. The Enterprise Edition is available under a paid commercial license and includes training, support, indemnification and additional modules. In November 2011, HPCC Systems announced the availability of its Thor Data Refinery Cluster on Amazon Web Services. In January 2012, HPCC Systems announced distributed machine learning algorithms.


See also

* Apache Hadoop
* Apache Spark
* Aster Data Systems
* ECL (data-centric programming language)
* ElasticSearch
* Sector/Sphere
* Machine learning
* MapReduce


External links


* Sandia sees data management challenges spiral
* Sandia National Laboratories Leverages the Data Analytics Supercomputer (DAS) by LexisNexis Risk & Information Analytics Group, Which Offers Breakthrough High Performance Computing to Address Data Management and Analysis Challenges
* Programming models for the LexisNexis High Performance Computing Cluster
* LexisNexis Data Analytics Supercomputer
* LexisNexis HPCC Systems
* Reference to the term BORPS (Billions of Records Per Second)
* http://catalog.kennesaw.edu/preview_program.php?catoid=25&poid=3023&returnto=2119 High Performance Computing Clusters (HPCC) and Big Data Analytics Certificate - Stand-Alone
* FAU Receives National Science Foundation Rapid Response Grant to Develop Innovative Computer Model for Ebola Spread
* CPL Online delivers added value for clients through its Big Data Platform
* HPCC Systems