Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing and cleaning.
History
Around the 1970s and 1980s, the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing. These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin. Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data-processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role by revamping IEM as well as helping to design the IEM software product (user-data), which helped automate IEM.
In the early 2000s, the data and data tooling were generally held by the information technology (IT) teams in most companies. Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.
In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. This change in approach was particularly focused on cloud computing. Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.
Tools
Compute
High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include Apache Spark and the deep-learning-specific TensorFlow. More recent implementations such as Differential/Timely Dataflow have used incremental computing for much more efficient data processing.
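As a rough illustration of the dataflow idea, the following Python sketch represents a computation as a directed graph in which each node is an operation and each edge passes one node's output to another node's input. The `Node` class, the `run` function, and the node names are hypothetical and exist only for this example; they are not the API of Apache Spark, TensorFlow, or Timely Dataflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Node:
    """One operation in the dataflow graph (hypothetical helper type)."""
    func: Callable[..., object]      # the operation this node performs
    inputs: Tuple[str, ...] = ()     # names of upstream nodes (incoming edges)


def run(graph: Dict[str, Node], target: str,
        cache: Optional[Dict[str, object]] = None) -> object:
    """Evaluate `target`, evaluating each upstream node at most once."""
    cache = {} if cache is None else cache
    if target not in cache:
        node = graph[target]
        # Data flows along the edges: upstream results become arguments.
        args = [run(graph, name, cache) for name in node.inputs]
        cache[target] = node.func(*args)
    return cache[target]


graph = {
    "source": Node(lambda: [3, 1, 4, 1, 5, 9]),
    "doubled": Node(lambda xs: [2 * x for x in xs], ("source",)),
    "multiples_of_4": Node(lambda xs: [x for x in xs if x % 4 == 0], ("doubled",)),
    "total": Node(sum, ("doubled",)),
    "report": Node(lambda kept, total: {"kept": kept, "total": total},
                   ("multiples_of_4", "total")),
}

result = run(graph, "report")
print(result)  # {'kept': [8], 'total': 46}
```

The shared `cache` means the `doubled` node is computed once even though two downstream nodes consume it. Production dataflow engines build on the same graph structure to schedule and parallelise work, which this sketch does not attempt.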
Storage
Data are stored in a variety of ways; one of the key deciding factors is how the data will be used.
Databases
If the data are structured and some form of online transaction processing is required, then databases are generally used. Originally, mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular, since they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.
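To make the ACID guarantees more concrete, here is a minimal sketch of transactional atomicity using Python's built-in `sqlite3` module as a stand-in for a relational database. The `accounts` table, the `transfer` and `balances` helpers, and the sample data are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final = balances(conn)
print(ok, rejected, final)
```

In the rejected transfer the credit to `bob` runs first and is then rolled back when the debit violates the constraint, so no partial update is ever visible. This is the kind of guarantee that many NoSQL systems relax in exchange for easier horizontal scaling.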
Data Warehouses
If the data are structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice. They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow, and indeed data often flow from databases into data warehouses. Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.
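The analytical access pattern can be sketched with an aggregate SQL query. The example below uses an in-memory SQLite database purely as a small stand-in; a real data warehouse would be a separate, much larger system. The `sales` table, its columns, and the `revenue_by` helper are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, revenue INTEGER)"
)
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("EU", "widget", "Q1", 120),
    ("EU", "widget", "Q2", 150),
    ("EU", "gadget", "Q1", 80),
    ("US", "widget", "Q1", 200),
    ("US", "gadget", "Q2", 90),
])

ALLOWED_DIMENSIONS = {"region", "product", "quarter"}


def revenue_by(conn, dimension):
    """Total revenue grouped by one dimension of the sales table."""
    # Column names cannot be bound as parameters, so check against a whitelist.
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = (f"SELECT {dimension}, SUM(revenue) FROM sales "
             f"GROUP BY {dimension} ORDER BY {dimension}")
    return dict(conn.execute(query))


print(revenue_by(warehouse, "region"))   # {'EU': 350, 'US': 290}
print(revenue_by(warehouse, "product"))  # {'gadget': 170, 'widget': 470}
```

Unlike transactional workloads, which read and write a few rows at a time, queries of this kind scan and summarise many rows across a chosen dimension.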
Files
If the data are less structured, then often they are just stored as files. There are several options:
* File systems represent data hierarchically in nested folders.
* Block storage splits data into regularly sized chunks; this often matches up with (virtual) hard drives or solid state drives.
* Object storage manages data using metadata; often each file is assigned a key such as a UUID.
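The object storage model in the list above can be sketched as a flat key-value store in which every object receives a UUID key and carries its own metadata. The `ObjectStore` class and its `put`, `get`, and `find` methods are hypothetical names for this in-memory illustration and do not correspond to any particular storage service.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A blob of bytes together with its descriptive metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Flat, in-memory object store: no folders, only keys and metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store `data` and return the generated UUID key."""
        key = str(uuid.uuid4())
        self._objects[key] = StoredObject(data, dict(metadata))
        return key

    def get(self, key):
        """Return the stored object for `key` (raises KeyError if absent)."""
        return self._objects[key]

    def find(self, **criteria):
        """Return keys of objects whose metadata matches every criterion."""
        return [
            key for key, obj in self._objects.items()
            if all(obj.metadata.get(name) == value
                   for name, value in criteria.items())
        ]


store = ObjectStore()
csv_key = store.put(b"id,amount\n1,9.99\n", content_type="text/csv", owner="sales")
log_key = store.put(b"started\n", content_type="text/plain", owner="ops")
sales_keys = store.find(owner="sales")
```

Because objects are located by key and metadata rather than by path, there is no directory hierarchy to traverse, which contrasts with the file system option.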
Management
The number of different data processes and storage locations can quickly become overwhelming. This motivates the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored. The tasks are often specified as a directed acyclic graph (DAG).
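A minimal sketch of DAG-based task specification is shown below, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). This is not Airflow's API; the task names, the `dependencies` mapping, and the `run_workflow` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join_tables": {"clean_orders", "extract_customers"},
    "publish_report": {"join_tables"},
}


def run_workflow(dependencies, actions):
    """Run every task after all of its prerequisites; return the run order."""
    order = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()
        order.append(task)
    return order


log = []
# Placeholder actions that only record that they ran.
actions = {name: (lambda name=name: log.append(f"ran {name}"))
           for name in dependencies}
order = run_workflow(dependencies, actions)
print(order)
```

Requiring the graph to be acyclic guarantees that a valid execution order exists. A workflow management system adds scheduling, retries, and monitoring on top of this ordering, none of which is modelled here.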
Lifecycle
Business planning
Business objectives that executives set for the future are defined in strategic business plans, with more detailed definition in tactical business plans and implementation in operational business plans. Most businesses today recognize the fundamental need to develop a business plan that follows this strategy. It is often difficult to implement these plans because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback to allow for early correction of problems caused by miscommunication and misinterpretation of the business plan.
Systems Design
The design of data systems involves several components, such as architecting data platforms and designing data stores.
Data modelling
This is the process of producing a data model, an abstract model to describe the data and relationships between different parts of the data.
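As a small illustration, a data model can be written down as typed records plus the relationships between them. The `Customer` and `Order` entities, their fields, and the `orphaned_orders` check below are hypothetical examples rather than a prescribed modelling method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # relationship: each order belongs to exactly one customer
    total_cents: int


def orphaned_orders(customers, orders):
    """Return orders whose customer_id does not match any known customer."""
    known_ids = {customer.customer_id for customer in customers}
    return [order for order in orders if order.customer_id not in known_ids]


customers = [Customer(1, "Ada"), Customer(2, "Grace")]
orders = [Order(10, 1, 2500), Order(11, 3, 900), Order(12, 2, 400)]
broken = orphaned_orders(customers, orders)
print(broken)  # [Order(order_id=11, customer_id=3, total_cents=900)]
```

Making the relationship explicit in the model is what allows a check like this to be written; in a relational database the same rule would typically be expressed as a foreign key constraint.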
Roles
Data Engineer
A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. They will be more familiar with databases, architecture, cloud computing, and Agile software development.
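The ETL (extract, transform, load) pipelines mentioned above can be sketched at toy scale with the Python standard library. The sample CSV data, the `users` table, and the `extract`, `transform`, and `load` function names are all assumptions for illustration; real pipelines run on far larger data and dedicated infrastructure.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with a missing date, untidy values, and a duplicate.
RAW_CSV = (
    "user_id,signup_date,country\n"
    "1,2023-01-05,de\n"
    "2,,US\n"
    "3,2023-02-11, fr\n"
    "3,2023-02-11, fr\n"
)


def extract(text):
    """Extract: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a date, normalise countries, deduplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue
        record = (int(row["user_id"]), row["signup_date"],
                  row["country"].strip().upper())
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        " user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    with conn:  # commit all inserts together
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)


target = sqlite3.connect(":memory:")
load(target, transform(extract(RAW_CSV)))
loaded = target.execute(
    "SELECT user_id, signup_date, country FROM users ORDER BY user_id"
).fetchall()
print(loaded)  # [(1, '2023-01-05', 'DE'), (3, '2023-02-11', 'FR')]
```

Keeping the three stages as separate functions makes each one testable on its own, which relates to the production-readiness concerns described above.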
Data Scientist
Data scientists are more focused on the analysis of the data; they will be more familiar with mathematics, algorithms, statistics, and machine learning.
See also
* Big data
* Information technology
* Software engineering
* Computer science
Further reading
* John Hares (1992). "Information engineering for the Advanced Practitioner", Wiley.
* Clive Finkelstein (1989). ''An Introduction to Information engineering: From Strategic Planning to Information Systems''. Sydney: Addison-Wesley.
* Clive Finkelstein (1992). "Information Engineering: Strategic Systems Development". Sydney: Addison-Wesley.
* Ian Macdonald (1986). "Information engineering". In: ''Information Systems Design Methodologies''. T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information engineering Facility". In: ''Computerized Assistance during the Information Systems Life Cycle''. T.W. Olle et al. (ed.). North-Holland.
* James Martin and Clive Finkelstein (1981). ''Information engineering''. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). ''Information engineering''. (3 volumes), Prentice-Hall Inc.
* Clive Finkelstein (2006) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". First Edition, Artech House, Norwood MA in hardcover.
* Clive Finkelstein (2011) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". Second Edition in PDF at www.ies.aust.com and as an ibook on the Apple iPad and ebook on the Amazon Kindle.
* Reis, Joe; Housley, Matt (2022) "Fundamentals of Data Engineering". O'Reilly Media, Inc. ISBN 9781098108304