File conversion
   HOME

TheInfoList



OR:

Data conversion is the conversion of computer data from one
format Format may refer to: Printing and visual media * Text formatting, the typesetting of text elements * Paper formats, or paper size standards * Newspaper format, the size of the paper page Computing * File format, particular way that informatio ...
to another. Throughout a computer environment, data is
encoded In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...
in a variety of ways. For example,
computer hardware Computer hardware includes the physical parts of a computer, such as the case, central processing unit (CPU), random access memory (RAM), monitor, mouse, keyboard, computer data storage, graphics card, sound card, speakers and motherboard. ...
is built on the basis of certain standards, which requires that data contains, for example,
parity bit A parity bit, or check bit, is a bit added to a string of binary code. Parity bits are a simple form of error detecting code. Parity bits are generally applied to the smallest units of a communication protocol, typically 8-bit octets (bytes), ...
checks. Similarly, the
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ef ...
is predicated on certain standards for data and file handling. Furthermore, each computer program handles data in a different manner. Whenever any one of these variables is changed, data must be converted in some way before it can be used by a different computer, operating system or program. Even different versions of these elements usually involve different data structures. For example, the changing of
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s from one format to another, usually for the purpose of application interoperability or of the capability of using new features, is merely a data conversion. Data conversions may be as simple as the conversion of a
text file A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operat ...
from one
character encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values tha ...
system to another; or more complex, such as the conversion of office file formats, or the conversion of image formats and
audio file format An audio file format is a file format for storing digital audio data on a computer system. The bit layout of the audio data (excluding metadata) is called the audio coding format and can be uncompressed, or compressed to reduce the file size, ofte ...
s. There are many ways in which data is converted within the computer environment. This may be seamless, as in the case of upgrading to a newer version of a computer program. Alternatively, the conversion may require processing by the use of a special conversion program, or it may involve a complex process of going through intermediary stages, or involving complex "exporting" and "importing" procedures, which may include converting to and from a tab-delimited or comma-separated text file. In some cases, a program may recognize several data file formats at the data input stage and then is also capable of storing the output data in several different formats. Such a program may be used to convert a file format. If the source format or target format is not recognized, then at times a third program may be available which permits the conversion to an intermediate format, which can then be reformatted using the first program. There are many possible scenarios.


Information basics

Before any data conversion is carried out, the user or application programmer should keep a few basics of computing and
information theory Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. ...
in mind. These include: * Information can easily be discarded by the computer, but adding information takes effort. * The computer can add information only in a rule-based fashion. * Upsampling the data or converting to a more
feature-rich In software, the term feature has several definitions. The Institute of Electrical and Electronics Engineers defines the term ''feature'' in IEEE 829 as " distinguishing characteristic of a software item (e.g., performance, portability, or functio ...
format does not add information; it merely makes room for that addition, which usually a human must do. * Data stored in an electronic format can be quickly modified and analyzed. For example, a true color image can easily be converted to grayscale, while the opposite conversion is a painstaking process. Converting a
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, ...
text file to a
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washi ...
(DOS/Windows) text file involves adding characters, but this does not increase the
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...
since it is rule-based; whereas the addition of color information to a grayscale image cannot be reliably done programmatically, as it requires adding new information, so any attempt to add color would require
estimation Estimation (or estimating) is the process of finding an estimate or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is de ...
by the computer based on previous knowledge. Converting a 24-bit PNG to a 48-bit one does not add information to it, it only pads existing
RGB The RGB color model is an additive color model in which the red, green and blue primary colors of light are added together in various ways to reproduce a broad array of colors. The name of the model comes from the initials of the three addi ...
pixel values with zeroes, so that a pixel with a value of FF C3 56, for example, becomes FF00 C300 5600. The conversion makes it possible to change a pixel to have a value of, for instance, FF80 C340 56A0, but the conversion itself does not do that, only further manipulation of the image can. Converting an image or audio file in a
lossy In information technology, lossy compression or irreversible compression is the class of data compression methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data si ...
format (like
JPEG JPEG ( ) is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and imag ...
or
Vorbis Vorbis is a free and open-source software project headed by the Xiph.Org Foundation. The project produces an audio coding format and software reference encoder/decoder ( codec) for lossy audio compression. Vorbis is most commonly used in con ...
) to a
lossless Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistic ...
(like PNG or
FLAC FLAC (; Free Lossless Audio Codec) is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation, and is also the name of the free software project producing the FLAC tools, the reference softwa ...
) or uncompressed (like BMP or WAV) format only wastes space, since the same image with its loss of original information (the artifacts of lossy compression) becomes the target. A JPEG image can never be restored to the quality of the original image from which it was made, no matter how much the user tries the " JPEG Artifact Removal" feature of his or her image manipulation program. Automatic restoration of information that was lost through a
lossy compression In information technology, lossy compression or irreversible compression is the class of data compression methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data si ...
process would probably require important advances in
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech ...
. Because of these realities of computing and information theory, data conversion is often a complex and error-prone process that requires the help of experts.


Pivotal conversion

Data conversion can occur directly from one format to another, but many applications that convert between multiple formats use an
intermediate representation An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" ...
by way of which any source format is converted to its target. For example, it is possible to convert
Cyrillic The Cyrillic script ( ), Slavonic script or the Slavic script, is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic, Turkic, Mongolic, Uralic, Caucasian and Iranic-speaking co ...
text from
KOI8-R KOI8-R (RFC 1489) is an 8-bit character encoding, derived from the KOI-8 encoding by the programmer Andrei Chernov in 1993 and designed to cover Russian, which uses a Cyrillic alphabet. KOI8-R was based on Russian Morse code, which was creat ...
to
Windows-1251 Windows-1251 is an 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Ukrainian, Belarusian, Bulgarian, Serbian Cyrillic, Macedonian and other languages. On the web, it is the second most-used ...
using a lookup table between the two encodings, but the modern approach is to convert the KOI8-R file to
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
first and from that to Windows-1251. This is a more manageable approach; rather than needing lookup tables for all possible pairs of character encodings, an application needs only one lookup table for each character set, which it uses to convert to and from Unicode, thereby scaling the number of tables down from hundreds to a few tens. Pivotal conversion is similarly used in other areas. Office applications, when employed to convert between office file formats, use their internal, default file format as a pivot. For example, a
word processor A word processor (WP) is a device or computer program that provides for input, editing, formatting, and output of text, often with some additional features. Early word processors were stand-alone devices dedicated to the function, but current ...
may convert an RTF file to a WordPerfect file by converting the RTF to
OpenDocument The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations and graphics and using ZIP-compressed XML files. It was develope ...
and then that to WordPerfect format. An image conversion program does not convert a
PCX PCX, standing for ''PiCture eXchange'', was an image file format developed by the now-defunct ZSoft Corporation of Marietta, Georgia, United States. It was the native file format for PC Paintbrush and became one of the first widely accepted DOS ...
image to PNG directly; instead, when loading the PCX image, it decodes it to a simple bitmap format for internal use in memory, and when commanded to convert to PNG, that memory image is converted to the target format. An audio converter that converts from
FLAC FLAC (; Free Lossless Audio Codec) is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation, and is also the name of the free software project producing the FLAC tools, the reference softwa ...
to AAC decodes the source file to raw
PCM Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the am ...
data in memory first, and then performs the lossy AAC compression on that memory image to produce the target file.


Lost and inexact data conversion

The objective of data conversion is to maintain all of the data, and as much of the embedded information as possible. This can only be done if the target format supports the same features and data structures present in the source file. Conversion of a word processing document to a plain text file necessarily involves loss of formatting information, because plain text format does not support word processing constructs such as marking a word as boldface. For this reason, conversion from one format to another which does not support a feature that is important to the user is rarely carried out, though it may be necessary for interoperability, e.g. converting a file from one version of
Microsoft Word Microsoft Word is a word processor, word processing software developed by Microsoft. It was first released on October 25, 1983, under the name ''Multi-Tool Word'' for Xenix systems. Subsequent versions were later written for several other pla ...
to an earlier version to enable transfer and use by other users who do not have the same later version of Word installed on their computer. Loss of information can be mitigated by approximation in the target format. There is no way of converting a character like ''ä'' to
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
, since the ASCII standard lacks it, but the information may be retained by approximating the character as ''ae''. Of course, this is not an optimal solution, and can impact operations like searching and copying; and if a language makes a distinction between ''ä'' and ''ae'', then that approximation does involve loss of information. Data conversion can also suffer from inexactitude, the result of converting between formats that are conceptually different. The
WYSIWYG In computing, WYSIWYG ( ), an acronym for What You See Is What You Get, is a system in which editing software allows content to be edited in a form that resembles its appearance when printed or displayed as a finished product, such as a printed d ...
paradigm, extant in word processors and
desktop publishing Desktop publishing (DTP) is the creation of documents using page layout software on a personal ("desktop") computer. It was first used almost exclusively for print publications, but now it also assists in the creation of various forms of online ...
applications, versus the structural-descriptive paradigm, found in
SGML The Standard Generalized Markup Language (SGML; ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates": * Declarative: Markup should ...
,
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
and many applications derived therefrom, like
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaS ...
and
MathML Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content. It aims at integrating mathematical formulae into World Wide W ...
, is one example. Using a WYSIWYG HTML editor conflates the two paradigms, and the result is HTML files with suboptimal, if not nonstandard, code. In the WYSIWYG paradigm a double linebreak signifies a new paragraph, as that is the visual cue for such a construct, but a WYSIWYG HTML editor will usually convert such a sequence to

, which is structurally no new paragraph at all. As another example, converting from
PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
to an editable word processor format is a tough chore, because PDF records the textual information like engraving on stone, with each character given a fixed position and linebreaks hard-coded, whereas word processor formats accommodate text reflow. PDF does not know of a word space character—the space between two letters and the space between two words differ only in quantity. Therefore, a title with ample letter-spacing for effect will usually end up with spaces in the word processor file, for example INTRODUCTION with spacing of 1 em as I N T R O D U C T I O N on the word processor.


Open vs. secret specifications

Successful data conversion requires thorough knowledge of the workings of both source and target formats. In the case where the specification of a format is unknown,
reverse engineering Reverse engineering (also known as backwards engineering or back engineering) is a process or method through which one attempts to understand through deductive reasoning how a previously made device, process, system, or piece of software accompli ...
will be needed to carry out conversion. Reverse engineering can achieve close approximation of the original specifications, but errors and missing features can still result.


Electronics

Data format conversion can also occur at the physical layer of an electronic communication system. Conversion between
line code In telecommunication, a line code is a pattern of voltage, current, or photons used to represent digital data transmitted down a communication channel or written to a storage medium. This repertoire of signals is usually called a constrained ...
s such as NRZ and RZ can be accomplished when necessary.


See also

*
Character encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values tha ...
* Comparison of programming languages (basic instructions)#Data conversions *
Data migration Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. Additionally, the validation of migrated data for completeness and the decommis ...
*
Data transformation In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integrationCIO.com. Agile Comes to Data Integration. Retrieved from: htt ...
*
Data wrangling Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one " raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes ...
*
Transcoding Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for video data files, audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859). This is usually done in cases where a target d ...
*
Distributed Data Management Architecture Distributed Data Management Architecture (DDM) is IBM's open, published software architecture for creating, managing and accessing data on a remote computer. DDM was initially designed to support record-oriented files; it was extended to support ...
(DDM) *
Code conversion (computing) A translator or programming language processor is a generic term that can refer to a compiler, assembler, or interpreter—anything that converts code from one computer language into another. These include translations between high-level an ...
*
Source-to-source translation A source-to-source translator, source-to-source compiler (S2S compiler), transcompiler, or transpiler is a type of translator that takes the source code of a program written in a programming language as its input and produces an equivalent sou ...
*
Presentation layer In the seven-layer OSI model of computer networking, the presentation layer is layer 6 and serves as the data translator for the network. It is sometimes called the syntax layer. Description Within the service layering semantics of the OSI netw ...


References

{{Authority control Computer data