In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration (CIO.com, "Agile Comes to Data Integration", https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html) and data management tasks such as data wrangling, data warehousing, data integration and application integration.

Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps (Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker, "DataXFormer: An Interactive Data Transformation Tool", http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf). Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.

A master data recast is another form of data transformation, in which the entire database of data values is transformed or recast without extracting the data from the database. All data in a well-designed database is directly or indirectly related to a limited set of master database tables by a network of foreign key constraints. Each foreign key constraint is dependent upon a unique database index from the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated. The directly and indirectly related data may also still be viewed in the original form, since the original unique index still exists with the master data. The database recast must also be done in such a way that it does not impact the application architecture software.

When the data mapping is indirect via a mediating data model, the process is also called data mediation.
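The master data recast described above can be illustrated with a small relational sketch. The following is a hypothetical example (the tables, columns and values are invented) using SQLite from Python: an ON UPDATE CASCADE foreign key restates the dependent rows when the master table's unique key is recast in place, while a retained legacy column preserves the original view of the data.

import sqlite3

# Minimal sketch of a master data recast; schema and values are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

con.executescript("""
CREATE TABLE customer (
    customer_code TEXT PRIMARY KEY,   -- unique index the foreign keys depend on
    legacy_code   TEXT UNIQUE,        -- original key kept so the old view still exists
    name          TEXT
);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_code TEXT REFERENCES customer(customer_code)
                  ON UPDATE CASCADE,  -- related data is restated automatically
    amount        REAL
);
""")

con.execute("INSERT INTO customer VALUES ('C-001', 'C-001', 'Acme')")
con.execute("INSERT INTO orders (customer_code, amount) VALUES ('C-001', 99.5)")

# Recast the master key in place; the dependent order row follows via the cascade,
# and no data is extracted from the database.
con.execute("UPDATE customer SET customer_code = 'ACME-2024' WHERE legacy_code = 'C-001'")

print(con.execute("SELECT customer_code, amount FROM orders").fetchall())
# [('ACME-2024', 99.5)]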


Data transformation process

Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required.
* Data discovery
* Data mapping
* Code generation
* Code execution
* Data review

These steps are often the focus of developers or technical data analysts, who may use multiple specialized tools to perform their tasks. The steps can be described as follows (a minimal sketch of the steps in code is shown below):

Data discovery is the first step in the data transformation process. Typically the data is profiled using profiling tools, or sometimes using manually written profiling scripts, to better understand the structure and characteristics of the data and decide how it needs to be transformed.

Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated, etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies used to define the transformation rules (e.g. visual ETL tools, transformation languages).

Code generation is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules. Typically, data transformation technologies generate this code based on the definitions or metadata supplied by the developers.

Code execution is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.

Data review is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data who performs this step. Any anomalies or errors in the data that are found are communicated back to the developer or data analyst as new requirements to be implemented in the transformation process.
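As a minimal illustration of these five steps, the following Python sketch walks through discovery, mapping, generation, execution and review on an invented two-row dataset; the field names and rules are hypothetical and stand in for what an ETL tool or analyst would normally define.

import statistics

# Invented source data; in practice this would come from files, APIs or databases.
source = [
    {"first": "Ada",  "last": "Lovelace", "amount": "100.0"},
    {"first": "Alan", "last": "Turing",   "amount": "250.5"},
]

# 1. Data discovery: profile the source to understand its structure and types.
profile = {field: {type(row[field]).__name__ for row in source} for field in source[0]}
print("profile:", profile)          # every field arrives as text

# 2. Data mapping: declare how source fields map to the desired target fields.
mapping = {
    "full_name": lambda r: f'{r["first"]} {r["last"]}',
    "amount":    lambda r: float(r["amount"]),   # cast text to a number
}

# 3. Code generation: here the "generated code" is just a function built from the
#    mapping; a visual ETL tool would emit SQL or a script instead.
def transform(row):
    return {target: rule(row) for target, rule in mapping.items()}

# 4. Code execution: run the generated transformation against the data.
output = [transform(row) for row in source]

# 5. Data review: check that the output meets the transformation requirements.
assert all(isinstance(r["amount"], float) for r in output)
print("mean amount:", statistics.mean(r["amount"] for r in output))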


Types of data transformation


Batch data transformation

Traditionally, data transformation has been a bulk or batch process (TDWI, "10 Rules for Real-Time Data Integration", https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1), whereby developers write code or implement transformation rules in a data integration tool and then execute that code or those rules on large volumes of data (Tope Omitola, André Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt, "Capturing Interactive Data Transformation Operations using Provenance Workflows", http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf). This process can follow the linear set of steps described in the data transformation process above.

Batch data transformation is the cornerstone of virtually all data integration technologies, such as data warehousing, data migration and application integration.

When data must be transformed and delivered with low latency, the term "microbatch" is often used. This refers to small batches of data (e.g. a small number of rows or a small set of data objects) that can be processed very quickly and delivered to the target system when needed. A small sketch of microbatching is shown below.
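A microbatch loop can be sketched in a few lines of Python; the batch size, stream and delivery target below are invented placeholders rather than any particular tool's API.

from itertools import islice

# Hypothetical microbatch loop: pull small, fixed-size batches from a row stream
# and deliver each one to the target system with low latency.
def microbatches(rows, size=100):
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def deliver(batch):
    # Stand-in for loading into the target (warehouse, message queue, API, ...).
    print(f"delivered {len(batch)} rows")

stream = ({"id": i, "value": i * 2} for i in range(250))
for batch in microbatches(stream, size=100):
    deliver(batch)   # prints 100, 100, then 50 rows delivered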


Benefits of batch data transformation

Traditional data transformation processes have served companies well for decades. The various tools and technologies (data profiling, data visualization, data cleansing, data integration, etc.) have matured, and most (if not all) enterprises transform enormous volumes of data that feed internal and external applications, data warehouses and other data stores ("The Value of Data Transformation").


Limitations of traditional data transformation

This traditional process also has limitations that hamper its overall efficiency and effectiveness. The people who need to use the data (e.g. business users) do not play a direct role in the data transformation process (Kristi Morton, "Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics", https://digital.lib.washington.edu/researchworks/handle/1773/35165). Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to define the transformations and execute them on the data.

This process leaves the bulk of the work of defining the required transformations to the developer. The developer interprets the business user's requirements and implements the related code/logic. This has the potential to introduce errors into the process (through misinterpreted requirements) and also increases the time to arrive at a solution (McKinsey.com, "Using Agile to Accelerate Data Transformation").

This problem has given rise to the need for agility and self-service in data integration, i.e. empowering the users of the data and enabling them to transform the data themselves interactively. There are companies that provide self-service data transformation tools, aiming to efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists. While these companies use traditional batch transformation, their tools enable more interactivity for users through visual platforms and easily repeated scripts. Still, there may be compatibility issues (e.g. new data sources such as IoT devices may not work correctly with older tools) and compliance limitations due to differences in data governance, preparation and audit practices.


Interactive data transformation

Interactive data transformation (IDT) is an emerging capability that allows business analysts and business users to directly interact with large datasets through a visual interface, understand the characteristics of the data (via automated data profiling or visualization), and change or correct the data through simple interactions such as clicking or selecting certain elements of the data.

Although IDT follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion and typically do not require significant technical skills to complete.

A number of companies, primarily start-ups such as Trifacta, Alteryx and Paxata, provide interactive data transformation tools. They aim to efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists.

IDT solutions provide an integrated visual interface that combines the previously disparate steps of data analysis, data mapping, code generation/execution and data inspection. IDT interfaces incorporate visualization to show the user patterns and anomalies in the data so they can identify erroneous or outlying values. Once users have finished transforming the data, the system can generate executable code/logic, which can be executed or applied to subsequent similar data sets; a minimal sketch of this idea follows below. By removing the developer from the process, IDT systems shorten the time needed to prepare and transform the data, eliminate costly errors in interpretation of user requirements, and empower business users and analysts to control their data and interact with it as needed.
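The "generate logic once, reapply to similar data sets" idea behind IDT can be sketched in Python as a recorded recipe of transformation steps. This is a hypothetical illustration, not any vendor's actual API; the data and correction rules are invented.

# Each interactive edit is captured as a step; the whole recipe can later be
# replayed on subsequent data sets with the same shape.
class Recipe:
    def __init__(self):
        self.steps = []

    def record(self, step):
        self.steps.append(step)            # captured while the user clicks around
        return self

    def apply(self, rows):
        out = []
        for row in rows:
            for step in self.steps:
                row = step(row)
            out.append(row)
        return out

# Steps "recorded" from interactive corrections: tidy names, cap an outlier.
recipe = (Recipe()
          .record(lambda r: {**r, "name": str(r["name"]).strip().title()})
          .record(lambda r: {**r, "age": min(int(r["age"]), 120)}))

january  = [{"name": "  ada lovelace ", "age": 36}]
february = [{"name": "ALAN TURING", "age": 999}]   # a later, similar data set

print(recipe.apply(january))    # [{'name': 'Ada Lovelace', 'age': 36}]
print(recipe.apply(february))   # [{'name': 'Alan Turing', 'age': 120}]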


Transformational languages

There are numerous languages available for performing data transformation. Many transformation languages require a grammar to be provided. In many cases, the grammar is structured using something closely resembling Backus–Naur Form (BNF). Such languages vary in their accessibility (cost) and general usefulness. Examples include:

* AWK - one of the oldest and most popular textual data transformation languages;
* Perl - a high-level language with both procedural and object-oriented syntax, capable of powerful operations on binary or text data;
* Template languages - specialized to transform data into documents (see also template processor);
* TXL - prototyping language-based descriptions, used for source code or data transformation;
* XSLT - the standard XML data transformation language (complemented by XQuery in many applications).

Additionally, companies such as Trifacta and Paxata have developed domain-specific transformational languages (DSLs) for servicing and transforming datasets. The development of domain-specific languages has been linked to increased productivity and accessibility for non-technical users. Trifacta's "Wrangle" is an example of such a domain-specific language.

Another advantage of the recent DSL trend is that a DSL can abstract the underlying execution of the logic defined in the DSL, so the same logic can be used by various processing engines, such as Spark, MapReduce, and Dataflow. With a DSL, the transformation language is not tied to the engine.

Although transformational languages are typically best suited for transformation, something as simple as regular expressions can be used to achieve useful transformation. A text editor like vim, emacs or TextPad supports the use of regular expressions with arguments. This allows all instances of a particular pattern to be replaced with another pattern using parts of the original pattern. For example:
foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
could both be transformed into a more compact form like:

foobar("some string", 42, someObj, anotherObj);
foobar("another string", 24, myObj, myOtherObj);

In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation of bar with two arguments, would be replaced with a single function invocation of foobar using some or all of the original set of arguments.

Another advantage to using regular expressions is that they will not fail the null transform test. That is, using your transformational language of choice, run a sample program through a transformation that doesn't perform any transformations. Many transformational languages will fail this test.
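As a concrete sketch of the editor substitution described above, the same merge can be written as a single regular expression with capture groups; Python's re module is used here for illustration, and the pattern assumes exactly the call layout shown in the example.

import re

# Merge each foo(...)/bar(...) pair into one foobar(...) call using capture groups.
source = '''foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
'''

pattern = re.compile(
    r'foo \(("[^"]*"), (\d+), \w+\);\n'   # capture foo's first two arguments
    r'bar \((\w+), (\w+)\);'              # capture both of bar's arguments
)

print(pattern.sub(r'foobar(\1, \2, \3, \4);', source))
# foobar("some string", 42, someObj, anotherObj);
#
# foobar("another string", 24, myObj, myOtherObj);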


See also


* File Formats, Transformation, and Migration (related Wikiversity article)
* Data cleansing
* Data mapping
* Data integration
* Data preparation
* Data wrangling
* Extract, transform, load
* Information integration

