
A data lake is a system or repository of data stored in its natural or raw format, usually object blobs or files. A data lake is usually a single store of data that includes raw copies of source system data, sensor data, and social data, as well as transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). A data lake can be established ''on premises'' (within an organization's data centers) or ''in the cloud'' (using cloud services).
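The four kinds of data listed above can be illustrated with a minimal sketch of how an ingestion step might file raw objects without transforming them. The extension sets, the `raw/<source>/<category>/<date>/` key layout, and the function names are assumptions made for illustration; they are not part of any standard or product.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical extension sets, loosely following the examples in the text.
SEMI_STRUCTURED = {".csv", ".log", ".xml", ".json"}
UNSTRUCTURED = {".eml", ".txt", ".doc", ".docx", ".pdf"}
BINARY = {".png", ".jpg", ".wav", ".mp3", ".mp4"}


def classify(filename: str, from_relational_db: bool = False) -> str:
    """Return the kind of data a raw file most likely holds."""
    # Extracts from relational databases count as structured data,
    # whatever file format they are exported in.
    if from_relational_db:
        return "structured"
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in SEMI_STRUCTURED:
        return "semi-structured"
    if suffix in UNSTRUCTURED:
        return "unstructured"
    if suffix in BINARY:
        return "binary"
    return "unknown"


def raw_key(source: str, filename: str, ingested: date,
            from_relational_db: bool = False) -> str:
    """Build an object key that stores the file as-is under a raw zone."""
    category = classify(filename, from_relational_db)
    return f"raw/{source}/{category}/{ingested.isoformat()}/{filename}"
```

The file content is never parsed or reshaped here; only its storage location is derived, which matches the idea of keeping data in its raw format.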
Background
James Dixon, then chief technology officer at Pentaho, coined the term by 2011 to contrast it with a data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos". In their study on data lakes they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
Examples
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as the Apache Hadoop Distributed File System (HDFS).
Academic interest in the concept of data lakes is gradually growing. For example, Personal DataLake at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.
Early data lakes, such as those built on Hadoop 1.0, had limited capabilities because they supported only batch-oriented processing (MapReduce). Interacting with them required expertise in Java, MapReduce, and higher-level tools such as Apache Pig, Apache Spark, and Apache Hive (which were also originally batch-oriented).
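To make the batch-oriented model concrete, the following is a minimal in-memory sketch of the MapReduce pattern (map, shuffle, reduce) applied to a word count. It uses plain Python and not the Hadoop API; the function names are illustrative, and a real job would distribute these phases across a cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the whole batch: no result is available until every line is read."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because the reduce step needs every value for a key, the job has to process the full input before it returns anything, which is why such systems suit batch work better than interactive queries.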
Criticism
Poorly managed data lakes have been facetiously called data swamps.
In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".
PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics, on this point, and they describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
Another criticism is that the term ''data lake'' is not useful because it is used in so many different ways. It may be used to refer to, for example: any tools or data management practices that are not data warehouses; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics.
While critiques of data lakes are warranted, in many cases they apply to other data projects as well. For example, the definition of ''data warehouse'' is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.
Data lakehouses
Data lakehouses are a hybrid approach that can ingest a variety of raw data formats like a data lake, yet provide ACID transactions and enforce data quality like a data warehouse. A data lakehouse architecture attempts to address several criticisms of data lakes by adding data warehouse capabilities such as transaction support, schema enforcement, governance, and support for diverse workloads. According to Oracle, data lakehouses combine the "flexible storage of unstructured data from a data lake and the management features and tools from data warehouses" ("What is a Data Lakehouse?", Oracle).
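Two of the capabilities named above, schema enforcement and transaction support, can be sketched on top of plain files. This is a simplified illustration under stated assumptions: the `SCHEMA`, the `part-NNNNN.json` file naming, and the function names are hypothetical, only a single writer is assumed, and the sketch covers atomicity only, not the full ACID guarantees of a real lakehouse table format.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"user_id": int, "event": str}


def validate(rows: list[dict], schema: dict[str, type]) -> None:
    """Schema enforcement: reject rows with wrong columns or wrong types."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise ValueError(f"row {i}: expected columns {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"row {i}: {column!r} is not {expected.__name__}")


def commit(table_dir: Path, rows: list[dict],
           schema: dict[str, type] = SCHEMA) -> Path:
    """Write one batch so that readers see all of it or none of it."""
    validate(rows, schema)  # fail before anything touches the table
    table_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(table_dir.glob("part-*.json")))
    final = table_dir / f"part-{version:05d}.json"
    # Write to a temporary file first; readers ignore it because it does
    # not match the part-*.json pattern.
    fd, tmp = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle)
    os.replace(tmp, final)  # atomic rename publishes the batch
    return final


def read_table(table_dir: Path) -> list[dict]:
    """Return every committed row, in commit order."""
    rows: list[dict] = []
    for part in sorted(table_dir.glob("part-*.json")):
        rows.extend(json.loads(part.read_text()))
    return rows
```

A batch that fails validation leaves the table untouched, in contrast to a plain data lake, where the malformed file would simply be stored alongside the rest.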
See also
* Azure Data Lake