Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data and to provide low-latency queries on top of that data (Hemsoth, Nicole, Datanami, 8 November 2012). The name Druid comes from the shapeshifting Druid class in many role-playing games, reflecting that the architecture of the system can shift to solve different types of data problems. Druid is commonly used in business intelligence and OLAP applications to analyze high volumes of real-time and historical data. Druid is used in production by technology companies such as Alibaba, Airbnb, Nielsen, Cisco, eBay, Lyft, Netflix, PayPal, Pinterest, Reddit, Twitter, Walmart, Wikimedia Foundation and Yahoo.


History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky to power the analytics product of Metamarkets. The project was open-sourced under the GPL license in October 2012 (Tschetter, Eric, druid.apache.org, 24 October 2012; Higginbotham, Stacey, GigaOM, 24 October 2012) and moved to an Apache License in February 2015.


Architecture

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture where data is stored redundantly and there is no single point of failure. The cluster includes external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Apache Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.


Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
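
The scatter and gather pattern described above can be illustrated with a small toy model. The following Python sketch is not Druid code: the segment names, node names, row data and the functions `query_node` and `broker_query` are all hypothetical, and the routing table is a plain dictionary standing in for whatever the broker has learned about the cluster.

```python
from collections import defaultdict

# Hypothetical routing table: which data node serves which segment.
SEGMENT_LOCATIONS = {
    "events_day1_p0": "historical-1",
    "events_day1_p1": "historical-2",
    "events_day2_p0": "realtime-1",
}

# Hypothetical rows held by each data node, per segment: (country, count).
NODE_DATA = {
    "historical-1": {"events_day1_p0": [("US", 3), ("DE", 1)]},
    "historical-2": {"events_day1_p1": [("US", 2), ("FR", 4)]},
    "realtime-1": {"events_day2_p0": [("DE", 5)]},
}


def query_node(node, segments):
    """Data-node side: sum counts per country over the requested segments."""
    partial = defaultdict(int)
    for segment in segments:
        for country, count in NODE_DATA[node][segment]:
            partial[country] += count
    return dict(partial)


def broker_query(segments):
    """Broker side: route segments to their nodes, then merge partial results."""
    # Scatter: group the requested segments by the node that serves them.
    by_node = defaultdict(list)
    for segment in segments:
        by_node[SEGMENT_LOCATIONS[segment]].append(segment)

    # Gather: add each node's partial aggregate into one merged result.
    merged = defaultdict(int)
    for node, node_segments in by_node.items():
        for country, count in query_node(node, node_segments).items():
            merged[country] += count
    return dict(merged)


result = broker_query(list(SEGMENT_LOCATIONS))
print(result)
```

In this toy model the client only ever talks to `broker_query`; the fact that the rows for "US" live on two different nodes is hidden by the merge step.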


Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
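
As a rough illustration of how redundant segment placement avoids a single point of failure, the Python sketch below assigns each segment to several nodes and then checks what survives a node failure. It is a toy model under stated assumptions: the least-loaded placement rule, the replica count of two, and the names `assign_segments` and `surviving_segments` are invented for this example and do not describe Druid's actual coordinator logic.

```python
def assign_segments(segments, nodes, replicas=2):
    """Place each segment on `replicas` distinct nodes, least-loaded first."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    load = {node: [] for node in nodes}
    for segment in segments:
        # Pick the nodes holding the fewest segments; ties break by name
        # so that the result is deterministic.
        targets = sorted(nodes, key=lambda n: (len(load[n]), n))[:replicas]
        for node in targets:
            load[node].append(segment)
    return load


def surviving_segments(load, failed_node):
    """Return the set of segments still served after one node fails."""
    return {
        segment
        for node, held in load.items()
        if node != failed_node
        for segment in held
    }


# Hypothetical segment and node names.
placement = assign_segments(
    ["seg-a", "seg-b", "seg-c"], ["hist-1", "hist-2", "hist-3"]
)
for node in placement:
    print(node, "fails ->", sorted(surviving_segments(placement, node)))
```

With two replicas per segment, losing any one of the three nodes in this example still leaves every segment available on some other node.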


Features

* Low latency (streaming) data ingestion.
* Arbitrary slice and dice data exploration.
* Sub-second analytic queries.
* Approximate and exact computations.
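
To make "slice and dice" and the mix of exact and approximate computations more concrete, the following Python sketch builds a query body in the style of Druid's native JSON `groupBy` query. It only constructs and serializes the payload and sends nothing. The data source, dimension names, interval and the helper `build_group_by_query` are hypothetical, and the field names follow the native query format as commonly documented, so they should be checked against the documentation for the Druid version in use.

```python
import json


def build_group_by_query(data_source, dimension, filter_dimension,
                         filter_value, interval):
    """Build a native-style groupBy query as a plain dict."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        # "Slice": keep only rows that match one dimension value.
        "filter": {
            "type": "selector",
            "dimension": filter_dimension,
            "value": filter_value,
        },
        # "Dice": break the remaining rows down by another dimension.
        "dimensions": [dimension],
        "aggregations": [
            # Exact count of stored rows in each group.
            {"type": "count", "name": "rows"},
            # Approximate distinct count over a (hypothetical) user_id column.
            {"type": "cardinality", "name": "approx_users",
             "fields": ["user_id"]},
        ],
    }


query = build_group_by_query(
    data_source="events",            # hypothetical data source
    dimension="country",             # hypothetical dimension
    filter_dimension="device",       # hypothetical dimension
    filter_value="mobile",
    interval="2024-01-01/2024-01-02",
)
payload = json.dumps(query, indent=2)
print(payload)
```

Changing the `filter` and `dimensions` entries is enough to explore the same data from a different angle, which is the sense in which the exploration is arbitrary.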


Performance

In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested in two configurations: “Druid Best”, which used tables with hashed partitions, and “Druid Suboptimal”, which did not. Tests were conducted by running the 13 TPC-H queries at TPC-H Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database). Druid performance was measured as at least 98% faster than Hive and at least 90% faster than Presto in each scenario, even when using the Druid Suboptimal configuration.
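
The text above does not state how "faster" was defined. One common reading, assumed here, is that "N% faster" means an N% reduction in query runtime relative to the baseline system. Under that assumption, the percentage can be converted into a speedup ratio as in the short Python sketch below; the function name `speedup_factor` is illustrative and the numbers are derived only from the percentages quoted, not from the underlying measurements.

```python
def speedup_factor(percent_faster):
    """Convert "N% faster", read as an N% cut in runtime, into a speedup ratio.

    If the baseline takes time t and the faster system takes
    t * (1 - N/100), the speedup is t / (t * (1 - N/100)) = 1 / (1 - N/100).
    """
    if not 0 <= percent_faster < 100:
        raise ValueError("percent_faster must be in [0, 100)")
    return 1.0 / (1.0 - percent_faster / 100.0)


vs_hive = speedup_factor(98)    # roughly a 50-fold speedup under this reading
vs_presto = speedup_factor(90)  # roughly a 10-fold speedup under this reading
print(round(vs_hive, 2), round(vs_presto, 2))
```

Because the ratio grows quickly as the percentage approaches 100, the gap between "90% faster" and "98% faster" is larger than the raw percentages suggest.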


See also

* List of column-oriented DBMSes

