A flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records, so the file itself stays simple. A flat file can be a plain text file (e.g. CSV, TXT or TSV), or a binary file. Relationships can be inferred from the data in the database, but the database format itself does not make those relationships explicit.
The term has generally implied a small database, but very large databases can also be flat.
Overview
Plain text files usually contain one record per line. Examples of flat files include /etc/passwd and /etc/group on Unix-like operating systems. Another example of a flat file is a name-and-address list with the fields ''Name'', ''Address'' and ''Phone Number''.
Flat files are typically either delimiter-separated or fixed-width.
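As a rough illustration of the one-record-per-line convention, the following Python sketch reads text in the colon-delimited layout used by /etc/passwd. The sample data and the names `SAMPLE`, `FIELDS` and `parse_passwd` are made up for this example; the seven field names follow the usual passwd layout and are not part of the file itself.

```python
# Hypothetical sample in the colon-delimited layout of /etc/passwd.
SAMPLE = """\
root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
"""

# Field names are a convention known to the reader of the file;
# the flat file carries no schema of its own.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")


def parse_passwd(text):
    """Return one dict per non-empty, non-comment line."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        values = line.split(":")
        records.append(dict(zip(FIELDS, values)))
    return records


records = parse_passwd(SAMPLE)
for record in records:
    print(record["name"], record["shell"])
```

Every value comes back as a string, which reflects the point that a flat file records no data types; any conversion (for instance of `uid` to an integer) is left to the program reading it.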
Delimiter-separated values
In delimiter-separated values files, the fields are separated by a character or string called the delimiter. Common variants are comma-separated values (CSV), where the delimiter is a comma; tab-separated values (TSV), where the delimiter is the tab character; space-separated values; and vertical-bar-separated values, where the delimiter is |.
If the delimiter is allowed inside a field, there needs to be a way to distinguish delimiter characters or strings that are meant literally from those that separate fields. For example, consider the sentence "If I have to, I'll do it myself." To encode it in CSV, there needs to be a way to prevent the comma from splitting the field. Several strategies to prevent delimiter collision exist.
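One such strategy is quoting, in which a field containing the delimiter is wrapped in quote characters. The sketch below uses Python's standard `csv` module with its default settings to show the difference between a naive split and a quote-aware parse; the variable names and the second field are illustrative only.

```python
import csv
import io

# The sentence from the text, plus an arbitrary second field.
row = ["If I have to, I'll do it myself.", "second field"]

# Write the row as one CSV record; the writer quotes the first field
# because it contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()

# Splitting on every comma ignores the quotes and breaks the sentence in two.
naive = encoded.rstrip("\r\n").split(",")

# A quote-aware reader recovers the original two fields.
parsed = next(csv.reader(io.StringIO(encoded)))

print(encoded, end="")
print(len(naive), "pieces from a naive split")
print(len(parsed), "fields from csv.reader")
```

Quoting is only one approach; others, such as escaping the delimiter or choosing a delimiter that cannot occur in the data, follow the same idea of making literal delimiters distinguishable.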
Fixed-width formats
With fixed-width formats, each field has a fixed length, with extra spaces added as needed. The fixed lengths can be predefined and known ahead of time (i.e. stated in the format's specification), or parsed from a header.
With predefined lengths, fields are limited to a maximum length. The need for longer fields may appear sometime after the format is defined. Possible workarounds include abbreviating phrases, replacing values with links (e.g. a URI pointing to the value), and splitting a file into multiple files.
With delimiter-separated formats, determining the field boundaries requires finding the delimiters, which incurs some computational overhead. This is not needed for fixed-width formats. However, fixed-width formats can lead to unnecessarily large file sizes if fields tend to be shorter than the lengths reserved for them.
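A minimal Python sketch of a fixed-width format with predefined lengths follows. The layout (`LAYOUT`), the field names and the helper functions are assumptions made for illustration and do not correspond to any particular specification.

```python
# Hypothetical layout: (field name, width in characters).
LAYOUT = [("id", 4), ("name", 10), ("team", 6)]


def format_record(record):
    """Pad each value with spaces to its reserved width."""
    parts = []
    for name, width in LAYOUT:
        value = str(record[name])
        if len(value) > width:
            # Predefined lengths cap how long a field can be.
            raise ValueError(f"{name!r} exceeds {width} characters: {value!r}")
        parts.append(value.ljust(width))
    return "".join(parts)


def parse_record(line):
    """Slice fields at known offsets; no delimiter search is needed."""
    record = {}
    offset = 0
    for name, width in LAYOUT:
        record[name] = line[offset:offset + width].rstrip(" ")
        offset += width
    return record


line = format_record({"id": 4, "name": "Richard", "team": "Blues"})
print(repr(line))
print(parse_record(line))
```

Each record occupies 20 characters regardless of content, which shows both the appeal (boundaries are known without scanning) and the cost (padding is stored even for short values). Stripping the padding also means that trailing spaces in a value cannot be told apart from padding in this sketch.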
Declarative notation
Delimiters can be used alongside a notation stating the length of each field. For example, 5apple, 9pineapple specifies the length (5 and 9) of each field. This is called declarative notation. It has low overhead and trivially avoids delimiter collisions, but it is brittle when edited manually.
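The following Python sketch shows one possible way to read and write such length-prefixed fields. It makes two assumptions that the example above does not settle: the delimiter is a comma followed by a space, and no field begins with a digit (otherwise the length prefix would be ambiguous). The function names are hypothetical.

```python
DIGITS = "0123456789"


def encode_fields(fields, delimiter=", "):
    """Prefix each field with its length, then join with the delimiter."""
    for field in fields:
        # Assumption of this sketch: a leading digit would merge with the prefix.
        if field and field[0] in DIGITS:
            raise ValueError(f"field must not start with a digit: {field!r}")
    return delimiter.join(f"{len(field)}{field}" for field in fields)


def decode_fields(text, delimiter=", "):
    """Read a length prefix, take that many characters, skip the delimiter."""
    fields = []
    pos = 0
    while pos < len(text):
        start = pos
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
        if pos == start:
            raise ValueError(f"expected a length prefix at offset {start}")
        length = int(text[start:pos])
        field = text[pos:pos + length]
        if len(field) != length:
            raise ValueError("field is shorter than its declared length")
        fields.append(field)
        pos += length
        if pos < len(text):
            if not text.startswith(delimiter, pos):
                raise ValueError(f"expected delimiter at offset {pos}")
            pos += len(delimiter)
    return fields


print(decode_fields("5apple, 9pineapple"))
print(encode_fields(["a, b", "c"]))
```

Because the decoder counts characters instead of searching for the delimiter, a field such as "a, b" survives a round trip. The same property makes the format brittle: changing a field by hand without updating its length prefix causes the decoder to raise an error or misread the rest of the line.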
History
Herman Hollerith's work for the US Census Bureau, first exercised in the 1890 United States census and involving data tabulated via hole punches in paper cards, is sometimes considered the first computerized flat-file database, as it included no cards indexing other cards, or otherwise relating the individual cards to one another, save by their group membership.
In the 1980s, configurable flat-file database applications were popular on the IBM PC and the Macintosh. These programs were designed to make it easy for individuals to design and use their own databases, and were almost on par with word processors and spreadsheets in popularity. Examples of flat-file database software include early versions of FileMaker, the shareware PC-File and the popular dBase.
Flat-file databases remain common because they are easy to write and edit, and suit myriad purposes in an uncomplicated way.
Modern implementations
Linear stores of NoSQL data, JSON data, primitive spreadsheets (perhaps comma-separated or tab-delimited), and text files can all be seen as flat-file databases because they lack integrated indexes, built-in references between data elements, and complex data types. Programs to manage collections of books or appointments and address books may use single-purpose flat-file databases, storing and retrieving information from flat files unadorned with indexes or pointing systems.
While a user can write a table of contents into a text file, the text file format itself does not include a concept of a table of contents. While a user may write "friends with Kathy" in the "Notes" section for John's contact information, this is interpreted by the user rather than a built-in feature of the database. When a database system begins to recognize and codify relationships between records, it begins to drift away from being "flat," and when it has a detailed system for describing types and hierarchical relationships, it is now too structured to be considered "flat."
Example database
The following example illustrates typical elements of a flat-file database. The data arrangement consists of a series of columns and rows organized into a tabular format. This specific example uses only one table.
The columns include: ''name'' (a person's name, second column); ''team'' (the name of an athletic team supported by the person, third column); and a numeric ''unique ID'' (used to uniquely identify records, first column).
Here is an example textual representation of the described data:
id    name    team
1     Amy     Blues
2     Bob     Reds
3     Chuck   Blues
4     Richard Blues
5     Ethel   Reds
6     Fred    Blues
7     Gilly   Blues
8     Hank    Reds
9     Hank    Blues
This type of data representation is quite standard for a flat-file database, although there are some additional considerations that are not readily apparent from the text:
* Data types: each column in a database table such as the one above is ordinarily restricted to a specific data type. Such restrictions are usually established by convention, but not formally indicated unless the data is transferred to a relational database system.
* Separated columns: In the above example, individual columns are separated using whitespace characters. This is also called indentation or "fixed-width" data formatting. Another common convention is to separate columns using one or more delimiter characters, such as a tab or comma.
* Relational algebra: Each row or record in the above table meets the standard definition of a tuple under relational algebra (the above example depicts a series of 3-tuples). Additionally, the first row specifies the field names that are associated with the values of each row.
* Database management system: Since the formal operations possible with a text file are usually more limited than desired, the text in the above example would ordinarily represent an intermediary state of the data prior to being transferred into a database management system.
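The considerations listed above can be made concrete with a short Python sketch that loads the example table. It treats any run of whitespace as the column separator, applies the integer type of ''id'' by convention, and infers a relationship (which people support which team) that the file itself does not express. The names `TABLE`, `load_table`, `teams` and `hanks` are introduced only for this illustration.

```python
from collections import defaultdict

# The example table, with columns separated by whitespace.
TABLE = """\
id name team
1 Amy Blues
2 Bob Reds
3 Chuck Blues
4 Richard Blues
5 Ethel Reds
6 Fred Blues
7 Gilly Blues
8 Hank Reds
9 Hank Blues
"""


def load_table(text):
    """Turn each data line into a dict keyed by the header's field names."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        # The file stores only text; the numeric type is applied by convention.
        record["id"] = int(record["id"])
        rows.append(record)
    return rows


rows = load_table(TABLE)

# Inferred relationship: team -> supporters. Nothing in the file declares it.
teams = defaultdict(list)
for row in rows:
    teams[row["team"]].append(row["name"])

# Names are not unique, so the id column is what identifies a record.
hanks = [row["id"] for row in rows if row["name"] == "Hank"]

print(dict(teams))
print(hanks)
```

Without an index, each lookup scans every row, which is acceptable for nine records but is one reason such data is often moved into a database management system. Splitting on whitespace also assumes that no value contains a space.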
See also
* /etc/passwd, a commonly used flat file, used to detail users in Unix
* Comma-separated values (CSV file format)
* Tab-separated values (TSV file format)
* Berkeley DB (typical flat-file database)
* Awk (classic flat-file processor)
* Recfiles (plain text database file format)