Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. [Hemsoth, Nicole. ''Datanami'', 8 November 2012] The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.
Druid is commonly used in business intelligence-OLAP applications to analyze high volumes of real-time and historical data.
Druid is used in production by technology companies such as Alibaba, Airbnb, Nielsen, Cisco, eBay, Lyft, Netflix, PayPal, Pinterest, Reddit, Twitter, Walmart, Wikimedia Foundation and Yahoo.
History
Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky to power the analytics product of Metamarkets. The project was open-sourced under the GPL license in October 2012, [Tschetter, Eric. ''druid.apache.org'', 24 October 2012][Higginbotham, Stacey. ''GigaOM'', 24 October 2012] and moved to an Apache License in February 2015.
Architecture

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture where data is stored redundantly, and there is no single point of failure. The cluster includes external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.
Query management
Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
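The scatter-gather pattern described above can be sketched in a few lines of Python. This is an illustrative model only, not Druid's implementation: the `SEGMENT_LOCATIONS` map, the node and segment names, and the `query_node` and `broker_query` helpers are all hypothetical, and each data node's query is simulated with in-memory rows.

```python
from collections import Counter

# Hypothetical view a broker might hold: which data node serves which segment.
SEGMENT_LOCATIONS = {
    "events_2024-01-01_p0": "historical-1",
    "events_2024-01-01_p1": "historical-2",
    "events_2024-01-02_p0": "realtime-1",
}

# Simulated segment contents: rows of (country, click count).
SEGMENT_DATA = {
    "events_2024-01-01_p0": [("US", 3), ("DE", 1)],
    "events_2024-01-01_p1": [("US", 2), ("FR", 4)],
    "events_2024-01-02_p0": [("DE", 5)],
}


def query_node(node, segments):
    """Stand-in for a data node: sum clicks per country over its segments."""
    partial = Counter()
    for segment in segments:
        for country, clicks in SEGMENT_DATA[segment]:
            partial[country] += clicks
    return partial


def broker_query(segments):
    """Group segments by owning node, query each node, merge partial results."""
    by_node = {}
    for segment in segments:
        by_node.setdefault(SEGMENT_LOCATIONS[segment], []).append(segment)

    merged = Counter()
    for node, node_segments in by_node.items():
        # Counter.update adds the partial counts key by key.
        merged.update(query_node(node, node_segments))
    return dict(merged)


result = broker_query(list(SEGMENT_DATA))
```

Merging by addition works here because a sum can be combined from per-node partial sums. An aggregate such as an average would need each partial result to carry both a sum and a count so the broker can combine them correctly.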
Cluster management
Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
Features
* Low-latency (streaming) data ingestion.
* Arbitrary slice-and-dice data exploration.
* Sub-second analytic queries.
* Approximate and exact computations.
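As a rough illustration of what "slice and dice" means in the list above, the following Python sketch filters rows on chosen dimension values (the slice) and then sums a metric grouped by any combination of dimensions (the dice). It shows the concept only and does not use Druid's query API; the `EVENTS` rows, the field names, and the `slice_and_dice` helper are made up for this example.

```python
from collections import defaultdict

# Hypothetical event rows; the field names are made up for this sketch.
EVENTS = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 1},
    {"country": "DE", "device": "mobile", "clicks": 2},
    {"country": "DE", "device": "mobile", "clicks": 4},
]


def slice_and_dice(events, group_by, metric, **filters):
    """Keep rows matching every filter, then sum the metric per group key."""
    totals = defaultdict(int)
    for row in events:
        if all(row[dim] == value for dim, value in filters.items()):
            key = tuple(row[dim] for dim in group_by)
            totals[key] += row[metric]
    return dict(totals)


# Slice to mobile traffic, grouped by country.
by_country = slice_and_dice(EVENTS, ["country"], "clicks", device="mobile")

# No filter, grouped by two dimensions at once.
by_country_device = slice_and_dice(EVENTS, ["country", "device"], "clicks")
```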
Performance
In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested using both a “Druid Best” configuration using tables with hashed partitions and a “Druid Suboptimal” configuration which does not use hashed partitions.
Tests were conducted by running the 13 TPC-H queries using TPC-H Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database).
Druid performance was measured as at least 98% faster than Hive and at least 90% faster than Presto in each scenario, even when using the “Druid Suboptimal” configuration.
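One common reading of "X% faster" is a relative reduction in execution time. The text here does not spell out the formula the researchers used, so the sketch below is an assumption about that reading, and the timings in it are placeholders rather than results from the study.

```python
def percent_faster(baseline_seconds, candidate_seconds):
    """Relative reduction in run time of candidate versus baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100.0


# Placeholder timings for illustration only (not measured values).
example = percent_faster(baseline_seconds=100.0, candidate_seconds=2.0)
```

Under this reading, a system that is 98% faster finishes in 2% of the baseline time, which corresponds to a 50-fold speedup.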
See also
* List of column-oriented DBMSes