A data lake is a system or repository of data stored in its natural (raw) format, usually as object blobs or files. A data lake is usually a single store of data that includes raw copies of source system data, sensor data, social data, etc., together with transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, or Google).
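As a rough illustration of how files of these four kinds might be laid out in a lake's raw area, the following Python sketch derives a storage key from a file name. The extension mapping, the `raw/` prefix, the `ingest_date=` partition naming, and the function name `raw_zone_key` are all assumptions made for this example; they are not part of any standard or vendor API.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the categories named in the text.
# Real data lakes may classify data differently or not at all.
CATEGORY_BY_EXTENSION = {
    ".parquet": "structured",      # assumption: exported relational tables
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".wav": "binary",
    ".mp4": "binary",
}


def raw_zone_key(source_system: str, filename: str, ingest_date: str) -> str:
    """Build an object key for an unmodified (raw) copy of a source file.

    The file content is never inspected or transformed; only the name is used
    to pick a category folder. Unrecognized extensions fall under "unknown".
    """
    suffix = PurePosixPath(filename).suffix.lower()
    category = CATEGORY_BY_EXTENSION.get(suffix, "unknown")
    return f"raw/{category}/{source_system}/ingest_date={ingest_date}/{filename}"


if __name__ == "__main__":
    print(raw_zone_key("crm", "orders.csv", "2024-01-15"))
    print(raw_zone_key("camera-01", "clip.MP4", "2024-01-15"))
```

The same key could be used as an object name in a cloud object store or as a path in a distributed file system; the sketch deliberately stops short of any upload call.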
Background
James Dixon, then chief technology officer at Pentaho, coined the term by 2011 to contrast it with the data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing.
PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos". In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Impetus Technologies, Cloudera, MongoDB, and Amazon Web Services all used the term by 2016.
Examples
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as the Apache Hadoop Distributed File System (HDFS).
There is growing academic interest in the concept of data lakes. For example, Personal DataLake at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.
Early data lakes built on Hadoop 1.0 had limited capabilities: batch-oriented processing (MapReduce) was the only processing paradigm associated with them. Interacting with the data lake required expertise in Java with MapReduce, or in higher-level tools such as Apache Pig, Apache Spark, and Apache Hive (which were themselves originally batch-oriented).
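To give a feel for the batch-oriented MapReduce paradigm mentioned above, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It is written in plain Python for readability and does not use Hadoop or its Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole batch job over the full input in one pass."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["raw data lake", "data lake data"]))
```

The whole input is processed as one batch before any result is available, which is the limitation the paragraph above describes. In a real cluster the map and reduce stages run in parallel across machines, which this sketch does not attempt to model.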
Criticism
Poorly managed data lakes have been facetiously called data swamps. In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".
PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:
They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
Another criticism is that the term "data lake" is not useful because it is used in so many different ways. It may refer to, for example: any tools or data management practices that are not data warehouses; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics.
While critiques of data lakes are warranted, in many cases they apply to other data projects as well. For example, the definition of "data warehouse" is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not as a technology outcome.
See also
* Azure Data Lake