
In-database processing, sometimes referred to as in-database analytics, refers to the integration of data analytics into data warehousing functionality. Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods.


History

Traditional approaches to data analysis require data to be moved out of the database into a separate analytics environment for processing, and then back to the database. (SPSS from IBM is an example of a tool that still does this today.) Doing the analysis in the database, where the data resides, eliminates the costs, time and security issues associated with the old approach by doing the processing in the data warehouse itself.

Though in-database capabilities were first commercially offered in the mid-1990s, as object-relational database systems from vendors including IBM, Illustra/Informix (now IBM) and Oracle, the technology did not begin to catch on until the mid-2000s. The concept of migrating analytics from the analytical workstation into the enterprise data warehouse was first introduced by Thomas Tileston in his presentation “Have Your Cake & Eat It Too! Accelerate Data Mining Combining SAS & Teradata” at the Teradata Partners 2005 “Experience the Possibilities” conference in Orlando, FL, September 18–22, 2005. Mr. Tileston later presented this technique globally in 2006, 2007 and 2008.

By that point, the need for in-database processing had become more pressing, as the amount of data available to collect and analyze continued to grow exponentially (due largely to the rise of the Internet), from megabytes to gigabytes, terabytes and petabytes. This “big data” is one of the primary reasons it has become important to collect, process and analyze data efficiently and accurately. The speed of business has also accelerated to the point where a performance gain of nanoseconds can make a difference in some industries. Additionally, as more people and industries use data to answer important questions, the questions they ask become more complex, demanding more sophisticated tools and more precise results. All of these factors in combination have created the need for in-database processing. The introduction of the column-oriented database, specifically designed for analytics, data warehousing and reporting, has helped make the technology possible.


Types

There are three main types of in-database processing: translating a model into SQL code, loading C or C++ libraries into the database process space as built-in user-defined functions (UDFs), and running out-of-process libraries, typically written in C, C++ or Java, that are registered in the database and called like built-in UDFs from a SQL statement.


Translating models into SQL code

In this type of in-database processing, a predictive model is converted from its source language into SQL that can run in the database, usually in a stored procedure. Many analytic model-building tools have the ability to export their models in either SQL or PMML (Predictive Model Markup Language). Once the SQL is loaded into a stored procedure, values can be passed in through parameters and the model is executed natively in the database. Tools that can use this approach include SAS, SPSS, R and KXEN.
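
A minimal sketch of what such translated SQL can look like, assuming a hypothetical logistic regression model scored against a customers table (the table, column names and coefficients below are illustrative, not taken from any specific tool's export):

    -- Logistic regression scoring expressed as plain SQL; the
    -- coefficients (-4.1, 0.00002, 0.35) stand in for an exported model.
    SELECT customer_id,
           1.0 / (1.0 + EXP(-(-4.1 + 0.00002 * income + 0.35 * tenure)))
               AS churn_score
    FROM customers;

Wrapped in a stored procedure, the same expression can take its inputs as parameters and be invoked wherever a score is needed, without the data ever leaving the database.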


Loading C or C++ libraries into the database process space

With C or C++ UDF libraries that run in process, the functions are typically registered as built-in functions within the database server and called like any other built-in function in a SQL statement. Running in process allows the function to have full access to the database server’s memory, parallelism and processing management capabilities. Because of this, the functions must be well-behaved so as not to negatively impact the database or the engine. This type of UDF gives the highest performance of any method for OLAP, mathematical, statistical, univariate-distribution and data mining algorithms.
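
As one hedged illustration of this pattern, PostgreSQL’s C-language function interface (used here only as a familiar example; the technique is not tied to any particular engine) registers a symbol from a shared library and then exposes it like a built-in function. The library name scoring_lib and symbol score_row are hypothetical:

    -- Load a C function into the server's process space and register it;
    -- it then runs with direct access to the engine's memory and scheduling.
    CREATE FUNCTION score_row(double precision, double precision)
    RETURNS double precision
    AS 'scoring_lib', 'score_row'
    LANGUAGE C STRICT;

    -- Called like any other built-in function in a SQL statement.
    SELECT customer_id, score_row(income, tenure) FROM customers;

Because the function shares the server’s address space, a crash or memory error in score_row can bring down the engine, which is why such UDFs must be well-behaved.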


Out-of-process

Out-of-process UDFs are typically written in C, C++ or Java. By running out of process, they do not pose the same risk to the database or the engine, as they run in their own process space with their own resources. Here, they would not be expected to have the same performance as an in-process UDF. They are still typically registered in the database engine and called through standard SQL, usually in a stored procedure. Out-of-process UDFs are a safe way to extend the capabilities of a database server and are an ideal way to add custom data mining libraries.
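
A hedged sketch of an out-of-process registration, loosely following Db2-style DDL (the class and method names are hypothetical): the FENCED clause asks the engine to execute the routine in a separate process, isolated from the server’s own address space.

    -- Register an external Java routine to run "fenced", i.e. in its own
    -- process with its own resources, outside the database engine.
    CREATE FUNCTION score_row(DOUBLE, DOUBLE)
    RETURNS DOUBLE
    LANGUAGE JAVA
    PARAMETER STYLE JAVA
    FENCED
    EXTERNAL NAME 'com.example.Scoring.scoreRow';

The isolation costs some call overhead relative to an in-process UDF, which is the performance trade-off described above.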


Uses

In-database processing makes data analysis more accessible and relevant for high-throughput, real-time applications including fraud detection, credit scoring, risk management, transaction processing, pricing and margin analysis, usage-based micro-segmenting, behavioral ad targeting and recommendation engines, such as those used by customer service organizations to determine next-best actions.


Vendors

In-database processing is performed and promoted as a feature by many of the major data warehousing vendors, including Teradata (and Aster Data Systems, which it acquired), IBM (with its Netezza, PureData Systems and Db2 Warehouse products), EMC Greenplum, Sybase, ParAccel, SAS, and EXASOL. Some of the products offered by these vendors, such as CWI's MonetDB or IBM's Db2 Warehouse, offer users the means to write their own functions (UDFs) or extensions (UDXs) to enhance the products' capabilities. Fuzzy Logix offers libraries of in-database models used for mathematical, statistical, data mining, simulation, and classification modelling, as well as financial models for equity, fixed income, interest rate, and portfolio optimization. In-DataBase Pioneers collaborates with marketing and IT teams to institutionalize data mining and analytic processes inside the data warehouse for fast, reliable, and customizable consumer-behavior and predictive analytics.


Related technologies

In-database processing is one of several technologies focused on improving data warehousing performance. Others include parallel computing, shared-everything architectures, shared-nothing architectures and massively parallel processing (MPP). It is an important step towards improving predictive analytics capabilities.


External links


EXASOL EXAPowerlytics


References

Tim Manns, "Isn't in-database processing old news yet?", Data Mining Blog, January 8, 2009. http://timmanns.blogspot.com/2009/01/isnt-in-database-processing-old-news.html