Data engineering is a software engineering approach to the building of data systems, to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.
History
Around the 1970s/1980s the term ''information engineering methodology'' (IEM) was created to describe database design and the use of software for data analysis and processing. These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian
Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential
Savant Institute report on it with James Martin. Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.
In the early 2000s, data and data tooling were generally held by the information technology (IT) teams in most companies.
Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.
In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase ''data engineer''.
Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. This change in approach was particularly focused on cloud computing. Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.
Tools
Compute
High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include Apache Spark, and the deep learning-specific TensorFlow. More recent implementations, such as Differential/Timely Dataflow, have used incremental computing for much more efficient data processing.
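As an illustration of the dataflow model, here is a minimal PySpark sketch (assuming a local Spark installation; the input file and column names are hypothetical). Each transformation adds a node to the dataflow graph, and nothing runs until an action such as collect() triggers execution:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start a local Spark session (assumes pyspark is installed).
    spark = SparkSession.builder.appName("dataflow-example").getOrCreate()

    # Each transformation below adds a node to the dataflow graph;
    # nothing is computed yet (lazy evaluation).
    events = spark.read.json("events.json")           # hypothetical input file
    clicks = events.filter(F.col("type") == "click")  # filter node
    counts = clicks.groupBy("page").count()           # aggregation node

    # Only an action such as collect() triggers execution of the graph.
    for row in counts.collect():
        print(row["page"], row["count"])

    spark.stop()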
Storage
Data is stored in a variety of ways; one of the key deciding factors is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs, using techniques such as data compression, partitioning, and archiving.
Databases
If the data is structured and some form of online transaction processing is required, then databases are generally used. Originally mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular, since they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.
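To make the ACID guarantees concrete, here is a small sketch using Python's built-in sqlite3 module (the accounts table is hypothetical); either both updates commit together or, on error, both are rolled back:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    conn.commit()

    try:
        # Both statements run inside one transaction (atomicity).
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        conn.commit()  # durability: changes persist only after commit
    except sqlite3.Error:
        conn.rollback()  # on failure, neither update is applied

    print(conn.execute("SELECT id, balance FROM accounts").fetchall())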
Data warehouses
If the data is structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice. They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow, and indeed data often flow from databases into data warehouses. Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.
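Warehouse queries typically aggregate over many rows rather than updating individual records. A sketch of such an OLAP-style query, again using sqlite3 for illustration (the sales table and its columns are assumptions):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("EU", "widget", 10.0), ("EU", "gadget", 20.0), ("US", "widget", 15.0)],
    )

    # An analytical query: aggregate across the region dimension.
    query = """
        SELECT region, SUM(amount) AS revenue, COUNT(*) AS orders
        FROM sales
        GROUP BY region
        ORDER BY revenue DESC
    """
    for region, revenue, orders in conn.execute(query):
        print(region, revenue, orders)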
Data lakes
A data lake is a centralized repository for storing, processing, and securing large volumes of data. A data lake can contain structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be created on premises or in a cloud-based environment using the services from public cloud vendors such as Amazon, Microsoft, or Google.
Files
If the data is less structured, then it is often just stored as files. There are several options:
* File systems represent data hierarchically in nested folders.
* Block storage splits data into regularly sized chunks; this often matches up with (virtual) hard drives or solid-state drives.
* Object storage manages data using metadata; often each file is assigned a key such as a UUID (see the sketch after this list).
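A minimal sketch of the object-storage idea, assuming a toy in-memory store for illustration (real object stores expose a similar put/get interface over the network):

    import uuid

    class ObjectStore:
        """Toy object store: a flat keyspace with per-object metadata."""

        def __init__(self):
            self._objects = {}

        def put(self, data: bytes, metadata: dict) -> str:
            key = str(uuid.uuid4())  # each object is assigned a unique key
            self._objects[key] = (data, metadata)
            return key

        def get(self, key: str):
            return self._objects[key]

    store = ObjectStore()
    key = store.put(b"raw sensor readings", {"content-type": "text/plain"})
    data, meta = store.get(key)
    print(key, meta["content-type"], data)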
Management
The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored.
The tasks are often specified as a
directed acyclic graph (DAG).
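For instance, a minimal Airflow sketch (assuming a recent Airflow 2.x installation; the DAG id, task names, and callables are hypothetical) in which the >> operator wires three tasks into a directed acyclic graph:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def transform():
        print("clean and reshape the data")

    def load():
        print("write the data to the warehouse")

    with DAG(dag_id="example_etl", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)

        t1 >> t2 >> t3  # the edges of the DAG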
Lifecycle
Business planning
Executives set business objectives in strategic business plans, which are refined in tactical business plans and implemented in operational business plans. Most businesses today recognize the fundamental need for a business plan that follows this strategy. These plans are often difficult to implement because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning therefore requires feedback to allow for early correction of problems due to miscommunication and misinterpretation of the business plan.
Systems design
The design of data systems involves several components, such as architecting data platforms and designing data stores.
Data modeling
This is the process of producing a data model, an abstract model to describe the data and the relationships between different parts of the data.
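A data model can be sketched directly in code; the following Python dataclasses (the Customer and Order entities are hypothetical) describe two kinds of data and the one-to-many relationship between them:

    from dataclasses import dataclass, field

    @dataclass
    class Order:
        order_id: int
        amount: float

    @dataclass
    class Customer:
        customer_id: int
        name: str
        # One customer relates to many orders.
        orders: list[Order] = field(default_factory=list)

    alice = Customer(customer_id=1, name="Alice")
    alice.orders.append(Order(order_id=101, amount=25.0))
    print(alice)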
Roles
Data engineer
A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. They will be more familiar with databases, architecture, cloud computing, and Agile software development.
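A minimal sketch of the extract-transform-load pattern that such pipelines implement (the input file, field names, and target table are assumptions for illustration):

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from a source file (hypothetical orders.csv).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: fix types and drop records with no amount.
        return [(int(r["order_id"]), float(r["amount"]))
                for r in rows if r.get("amount")]

    def load(records, conn):
        # Load: write the cleaned records into the target table.
        conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
        conn.commit()

    conn = sqlite3.connect(":memory:")
    load(transform(extract("orders.csv")), conn)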
Data scientist
Data scientists are more focused on the analysis of the data; they will be more familiar with mathematics, algorithms, statistics, and machine learning.
See also
* Big data
* Information technology
* Software engineering
* Computer science
Further reading
* Ian Macdonald (1986). "Information engineering". in: ''Information Systems Design Methodologies''. T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information Engineering Facility". In: ''Computerized Assistance during the Information Systems Life Cycle''.
T.W. Olle et al. (ed.). North-Holland.
* James Martin and Clive Finkelstein (1981). ''Information engineering''. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). ''Information engineering''. (3 volumes), Prentice-Hall Inc.
External links
* The Complex Method IEM
* Enterprise Engineering and Rapid Delivery of Enterprise Architecture