Data engineering is a software engineering approach to the building of data systems, to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.


History

Around the 1970s and 1980s, the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing. These techniques were intended to be used by database administrators (DBAs) and by systems analysts, based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin. Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.

In the early 2000s, data and data tooling were generally held by the information technology (IT) teams in most companies. Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.

In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. This change in approach was particularly focused on cloud computing. Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.


Tools


Compute

High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include Apache Spark and the deep learning-specific TensorFlow. More recent implementations, such as Differential/Timely Dataflow, have used incremental computing for much more efficient data processing.
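
The dataflow model described above can be illustrated with a minimal sketch in plain Python. This is not how Spark or TensorFlow are implemented; the graph layout, the operation names, and the evaluate helper are all assumptions made for illustration. Each node holds an operation, and each edge is expressed by listing the upstream nodes whose outputs feed that operation.

```python
# Hypothetical operations (the nodes of the graph); names are illustrative only.
def source():
    return [1, 2, 3, 4]

def double(values):
    return [2 * v for v in values]

def keep_large(values):
    return [v for v in values if v > 4]

def total(values):
    return sum(values)

# Each node maps to (operation, list of upstream nodes). The upstream lists
# are the edges: they say whose output flows into this operation.
GRAPH = {
    "source": (source, []),
    "double": (double, ["source"]),
    "keep_large": (keep_large, ["double"]),
    "total": (total, ["keep_large"]),
}

def evaluate(graph, node, cache=None):
    """Compute one node by first computing its inputs; each node runs at most once."""
    cache = {} if cache is None else cache
    if node not in cache:
        operation, inputs = graph[node]
        cache[node] = operation(*(evaluate(graph, name, cache) for name in inputs))
    return cache[node]

result = evaluate(GRAPH, "total")
```

Because the computation is declared as data rather than as a fixed sequence of calls, an engine is free to decide how and where each node runs, which is one reason the model suits distributed processing.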


Storage

Data is stored in a variety of ways; one of the key deciding factors is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs. They use data compression, partitioning, and archiving.


Databases

If the data is structured and some form of online transaction processing is required, then databases are generally used. Originally, mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular, since they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.
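
As a small illustration of the transaction guarantees mentioned above, the following sketch uses Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical, and SQLite merely stands in for a relational database here. The point is atomicity: when the second statement of a transfer fails, the first one is rolled back as well.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as a single all-or-nothing transaction."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds
try:
    transfer(conn, "alice", "bob", 500)  # the debit would violate the CHECK constraint
except sqlite3.IntegrityError:
    pass                                 # the credit to bob was rolled back too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls, only the first transfer is reflected in the balances; the failed transfer leaves no partial update behind.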


Data warehouses

If the data is structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice. They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow, and indeed data often flows from databases into data warehouses. Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.
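
To give a flavour of the analytical SQL such users run, here is a minimal sketch of an aggregate query. It uses Python's built-in sqlite3 module purely as a convenient stand-in (SQLite is not a data warehouse), and the sales table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "widget", 10),
        ("north", "gadget", 5),
        ("south", "widget", 7),
        ("south", "widget", 3),
    ],
)

# An analytical query summarizes many rows instead of reading or updating one.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Here revenue_by_region holds one row per region with the total amount and the number of sales, the kind of roll-up that analytical workloads are built around.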


Data lakes

A data lake is a centralized repository for storing, processing, and securing large volumes of data. A data lake can contain structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be created on premises or in a cloud-based environment using the services from public cloud vendors such as Amazon, Microsoft, or Google.


Files

If the data is less structured, then it is often just stored as files. There are several options:
* File systems represent data hierarchically in nested folders.
* Block storage splits data into regularly sized chunks; this often matches up with (virtual) hard drives or solid state drives.
* Object storage manages data using metadata; often each file is assigned a key such as a UUID.
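
The object storage idea can be sketched as a toy in-memory store. The ObjectStore class and its put, get, and find methods are hypothetical names chosen for illustration and do not correspond to any vendor's API. The sketch shows a flat key space with UUID keys and lookup by metadata, in contrast to the nested folders of a file system.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)

class ObjectStore:
    """Toy in-memory object store: a flat key space with no folder hierarchy."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        key = str(uuid.uuid4())  # each object is assigned a random UUID as its key
        self._objects[key] = StoredObject(data, metadata)
        return key

    def get(self, key):
        return self._objects[key]

    def find(self, **criteria):
        """Return the keys of all objects whose metadata matches every criterion."""
        return [
            key
            for key, obj in self._objects.items()
            if all(obj.metadata.get(name) == value for name, value in criteria.items())
        ]

store = ObjectStore()
csv_key = store.put(b"id,name\n1,Ada\n", content_type="text/csv")
png_key = store.put(b"\x89PNG", content_type="image/png")
```

Calling store.find(content_type="text/csv") returns the key of the CSV object, so callers locate data through metadata rather than through a path.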


Management

The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored. The tasks are often specified as a directed acyclic graph (DAG).
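
A minimal sketch of specifying tasks as a DAG follows, using graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later). It does not use or imitate Airflow's API; the task names, the run_workflow helper, and the status strings are assumptions made for illustration. Tasks run only after their dependencies, and a failure causes downstream tasks to be skipped while unrelated tasks still run.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each maps to the set of tasks it depends on.
DAG = {
    "extract": set(),
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "build_report": {"load_warehouse"},
    "archive_raw": {"extract"},
}

def run_workflow(dag, actions):
    """Run every task after its dependencies and record a status for monitoring."""
    status = {}
    # static_order() yields dependencies first and raises CycleError on a cycle.
    for task in TopologicalSorter(dag).static_order():
        if any(status[dep] != "success" for dep in dag[task]):
            status[task] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            actions[task]()
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status

def failing_task():
    raise RuntimeError("simulated failure")

actions = {name: (lambda: None) for name in DAG}
actions["clean"] = failing_task
status = run_workflow(DAG, actions)
```

In this run, clean fails, so load_warehouse and build_report are skipped, while archive_raw still succeeds because it depends only on extract.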


Lifecycle


Business planning

The business objectives that executives set for the future are laid out in strategic business plans, defined in more detail in tactical business plans, and implemented in operational business plans. Most businesses today recognize the fundamental need to develop a business plan that follows this structure. These plans are often difficult to implement because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback so that problems caused by miscommunication and misinterpretation of the business plan can be corrected early.


Systems design

The design of data systems involves several components, such as architecting data platforms and designing data stores.


Data modeling

This is the process of producing a data model, an abstract model to describe the data and relationships between different parts of the data.
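
As a rough illustration of what a data model captures, the sketch below uses Python dataclasses. The Customer, OrderLine, and Order entities and their fields are hypothetical examples, not taken from any particular system. The classes describe the data elements, and the references between them describe the relationships.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str

@dataclass(frozen=True)
class OrderLine:
    sku: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: Customer  # many-to-one: each order belongs to one customer
    lines: list[OrderLine] = field(default_factory=list)  # one-to-many

    def total_quantity(self) -> int:
        return sum(line.quantity for line in self.lines)

alice = Customer(customer_id=1, name="Alice")
order = Order(
    order_id=100,
    customer=alice,
    lines=[OrderLine("A-1", 2), OrderLine("B-7", 3)],
)
```

The same model could equally be expressed as relational tables with foreign keys; the dataclass form is used here only because it is compact.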


Roles


Data engineer

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. They will be more familiar with databases, architecture, cloud computing, and Agile software development.


Data scientist

Data scientists are more focused on the analysis of the data; they will be more familiar with mathematics, algorithms, statistics, and machine learning.


See also

* Big data
* Information technology
* Software engineering
* Computer science


Further reading

* Ian Macdonald (1986). "Information engineering". In: Information Systems Design Methodologies. T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information Engineering Facility". In: Computerized Assistance during the Information Systems Life Cycle. T.W. Olle et al. (ed.). North-Holland.
* James Martin and Clive Finkelstein (1981). Information engineering. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). Information engineering. (3 volumes), Prentice-Hall Inc.


External links


The Complex Method IEM



Enterprise Engineering and Rapid Delivery of Enterprise Architecture