Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.
The CSV file format is one type of delimiter-separated file format. Delimiters frequently used include the comma, tab, space, and semicolon. Delimiter-separated files are often given a ".csv" extension even when the field separator is not a comma. Many applications or libraries that consume or produce CSV files have options to specify an alternative delimiter.
Because many CSV files do not adhere to the CSV standard RFC 4180, data input software has to support a variety of CSV formats. Despite this drawback, CSV remains widespread in data applications and is supported by a wide variety of software, including common spreadsheet applications such as Microsoft Excel. Benefits cited in favor of CSV include human readability and the simplicity of the format.
Applications
CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data between programs that natively operate on incompatible (often proprietary) formats.
The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:
* MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
* An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
* Each record ''should'' contain the same number of comma-separated fields.
* Any field ''may'' be quoted (with double quotes).
* Fields containing a line break, double quote, or comma ''should'' be quoted. (If they are not, the file will likely be impossible to process correctly.)
* ''If'' double-quotes are used to enclose fields, then a double-quote in a field ''must'' be represented by two double-quote characters.
Most programs that claim to read CSV files can process this format. The exceptions are programs that ''(a)'' do not support line breaks within quoted fields, ''(b)'' confuse the optional header with data or interpret the first data line as a header, or ''(c)'' do not automatically parse double quotes within a field correctly.
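The quoting rules listed above can be sketched with Python's standard `csv` module. This is a minimal illustration, not a conformance test: the helper name `to_rfc4180` and the sample rows are made up for the example, and the writer options are simply set to match the listed requirements (comma separator, CR/LF line endings, doubled double quotes).

```python
import csv
import io

def to_rfc4180(rows):
    """Serialize rows with comma separators, CR/LF endings, and doubled quotes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,           # a quote inside a field is written as ""
        quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
        lineterminator="\r\n",      # CR/LF record terminator
    )
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample data: one field with quotes and a comma,
# one field with an embedded line break.
rows = [
    ["name", "note"],
    ["widget", 'says "hi", twice'],
    ["gadget", "line one\nline two"],
]
text = to_rfc4180(rows)

# Reading the text back recovers the original fields.
round_trip = list(csv.reader(io.StringIO(text, newline="")))
```

Fields without special characters are left unquoted, which the rules permit because quoting is optional for them.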
OKF frictionless tabular data package
In 2011, the Open Knowledge Foundation (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. The Tabular Data Package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1).
The Frictionless Data Initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.
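The missing type information can be seen in a short Python sketch. The `SCHEMA` mapping and the `read_typed` helper are hypothetical and only stand in for the idea of schema metadata; they are not the Tabular Data Package descriptor format.

```python
import csv
import io

# Hypothetical schema: column name -> Python type. An illustrative stand-in,
# not the Tabular Data Package descriptor format.
SCHEMA = {"id": int, "label": str, "price": float}

def read_typed(text, schema):
    """Parse CSV text and cast each field according to the schema."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [
        {name: schema[name](value) for name, value in row.items()}
        for row in reader
    ]

raw = "id,label,price\r\n1,1,3000.00\r\n"

# Without a schema every field is a string, so "1" and 1 look the same.
untyped = next(csv.DictReader(io.StringIO(raw, newline="")))

# With the schema, the id becomes a number while the label stays a string.
typed = read_typed(raw, SCHEMA)[0]
```

Here the same character sequence `1` appears in two columns, and only the external schema says which one is a number.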
W3C tabular data standard
In 2013 the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats. The working group completed its work in February 2016 and was officially closed in March 2016 with the release of a set of documents and W3C recommendations for modeling "Tabular Data", and enhancing CSV with metadata and semantics.
While the well-formedness of CSV data can readily be checked, testing validity and canonical form is less well developed, relative to more precise data formats, such as XML and SQL, which offer richer types and rules-based validation.
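As a rough illustration of a well-formedness check, the following Python sketch reports records whose field count differs from the first record. The function name `field_count_errors` is made up for the example, and the check is structural only: it says nothing about whether the values are valid for their columns.

```python
import csv
import io

def field_count_errors(text, delimiter=","):
    """Return (record_number, field_count) for records whose width differs
    from that of the first record."""
    # strict=True makes the reader raise csv.Error on malformed quoting.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=delimiter, strict=True
    )
    expected = None
    errors = []
    for number, record in enumerate(reader, start=1):
        if expected is None:
            expected = len(record)
        elif len(record) != expected:
            errors.append((number, len(record)))
    return errors

good = "a,b,c\r\n1,2,3\r\n4,5,6\r\n"
bad = "a,b,c\r\n1,2,3\r\n4,5\r\n"
good_errors = field_count_errors(good)
bad_errors = field_count_errors(bad)
```

A check of this kind cannot tell whether `4` is a plausible value for column `a`, which is the sort of rule that typed formats can express.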
Basic rules
Many informal documents exist that describe "CSV" formats. IETF RFC 4180 (summarized above) defines the format for the "text/csv" MIME type registered with the IANA.
Rules typical of these and other "CSV" specifications and implementations are as follows:
Example
The above table of data may be represented in CSV format as follows:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
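The example above exercises embedded commas, doubled double quotes, empty quoted fields, and a line break inside a quoted field. A minimal Python sketch, assuming the data is held in a string with CR/LF line endings, shows how a quote-aware parser reads it:

```python
import csv
import io

SAMPLE = (
    'Year,Make,Model,Description,Price\r\n'
    '1997,Ford,E350,"ac, abs, moon",3000.00\r\n'
    '1999,Chevy,"Venture ""Extended Edition""","",4900.00\r\n'
    '1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00\r\n'
    '1996,Jeep,Grand Cherokee,"MUST SELL!\r\nair, moon roof, loaded",4799.00\r\n'
)

# newline="" leaves line endings untouched so the parser sees them as written.
records = list(csv.DictReader(io.StringIO(SAMPLE, newline="")))

for record in records:
    print(record["Year"], record["Make"], repr(record["Model"]))
```

Although the text spans six physical lines, it contains one header and four data records, because the last record continues across a line break inside its quoted Description field.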
Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):
Year,Make,Model,Length
1997,Ford,E350,2.35
2000,Mercury,Cougar,2.38
Example of an analogous European CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):
Year;Make;Model;Length
1997;Ford;E350;2,35
2000;Mercury;Cougar;2,38
The latter format is not RFC 4180 compliant. Compliance could be achieved by the use of a comma instead of a semicolon as a separator and by quoting all numbers that have a decimal mark.
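The conversion described above can be sketched in Python. The helper names `semicolon_to_comma` and `parse_decimal_comma` are hypothetical; the sketch relies on the standard `csv` writer quoting any field that contains the separator, which is what puts the decimal-comma numbers in quotes.

```python
import csv
import io
from decimal import Decimal

EUROPEAN = (
    "Year;Make;Model;Length\r\n"
    "1997;Ford;E350;2,35\r\n"
    "2000;Mercury;Cougar;2,38\r\n"
)

def semicolon_to_comma(text):
    """Rewrite semicolon-separated text with comma separators."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    out = io.StringIO(newline="")
    writer = csv.writer(
        out, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n"
    )
    writer.writerows(reader)  # fields containing a comma get quoted
    return out.getvalue()

def parse_decimal_comma(value):
    """Interpret a number written with a decimal comma, such as '2,35'."""
    return Decimal(value.replace(",", "."))

converted = semicolon_to_comma(EUROPEAN)
length = parse_decimal_comma("2,35")
```

The converted text keeps the decimal comma inside the quoted field, so interpreting `"2,35"` as a number remains the job of the consuming application.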
Some applications use CSV as a data interchange format for interoperability, exporting and importing CSV. Others use CSV as an ''internal format''.
As a data interchange format, the CSV file format is supported by almost all spreadsheets and database management systems:
* Spreadsheets including Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc. Microsoft Excel also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and the separator is not enforced to be the comma). The LibreOffice Calc CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
* Various relational databases support saving query results to a CSV file. PostgreSQL provides the COPY command, which allows for both saving and loading data to and from a file. For example, a COPY command can save the content of a table articles to a file called /home/wikipedia/file.csv.
* Many utility programs on Unix-style systems (such as cut, join, sort, uniq, awk) can split files on a comma delimiter, and can therefore process simple CSV files. However, this method does not correctly handle commas or new lines within quoted strings, hence it is better to use tools like csvkit or Miller.
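The limitation of splitting on every comma can be seen in a small Python sketch, in which a plain string split stands in for the delimiter-based utilities and the standard `csv` module stands in for a quote-aware tool:

```python
import csv
import io

line = '1997,Ford,E350,"ac, abs, moon",3000.00'

# Naive approach: split on every comma, ignoring quotation marks.
naive = line.split(",")

# Quote-aware approach: commas inside a quoted field are kept as data.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split breaks the quoted description into three pieces and leaves the quotation marks in the data, while the quote-aware parser returns the five intended fields.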
As a (main or optional) internal representation: this can be native or foreign, but it differs from an interchange format ("export/import only") because it is not necessary to create a copy in another format:
* Some spreadsheets, including LibreOffice Calc, offer this option without forcing the user to adopt another format.
* Some relational databases, when using standard SQL, offer a ''foreign-data wrapper'' (FDW). For example, PostgreSQL offers commands to configure any variant of CSV.
* Databases like Apache Hive offer the option to express CSV or .csv.gz as an internal table format.
* The emacs editor can operate on CSV files using csv-nav mode.
CSV format is supported by libraries available for many programming languages. Most provide some way to specify the field delimiter, decimal separator, character encoding, quoting conventions, date format, etc.
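As one example of such options, Python's standard `csv` module lets a caller register a named dialect. In the sketch below the dialect name `pipe_single_quote` and the sample data are hypothetical. This particular module covers the delimiter and quoting conventions; character encoding is handled when the bytes are decoded, and decimal separators and date formats are left to the caller.

```python
import csv
import io

csv.register_dialect(
    "pipe_single_quote",    # hypothetical dialect name
    delimiter="|",          # field delimiter
    quotechar="'",          # quoting convention
    doublequote=True,       # '' inside a quoted field means one '
    skipinitialspace=True,  # ignore spaces right after a delimiter
    lineterminator="\n",
)

# Bytes as they might arrive from a file; decoding is an explicit step
# where the character encoding is chosen.
payload = "name|city\n'Zoë'| 'Malmö'\n'O''Neil'|Cork\n".encode("utf-8")
text = payload.decode("utf-8")

rows = list(
    csv.reader(io.StringIO(text, newline=""), dialect="pipe_single_quote")
)
```

Naming the dialect keeps the format decisions in one place, so the same settings can be reused for both reading and writing.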
Software and row limits
Programs that work with CSV may have limits on the maximum number of rows CSV files can have.
Below is a list of common software and its limitations:
* Microsoft Excel: 1,048,576 row limit;
* Microsoft PowerShell: no row or cell limit (limited by memory);
* Apple Numbers: 1,000,000 row limit;
* Google Sheets: 10,000,000 cell limit (the product of columns and rows);
* OpenOffice and LibreOffice: 1,048,576 row limit;
* Sourcetable: no row limit (spreadsheet-database hybrid; per Sourcetable Inc., 2024, retrieved 2024-11-14);
* Text editors (such as WordPad, TextEdit, Vim, etc.): no row or cell limit;
* Databases (COPY command and FDW): no row or cell limit.
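Before opening a large file in a program with a row limit, the number of records can be counted by streaming the file. The following Python sketch uses hypothetical helper names and takes the 1,048,576 figure from the list above as its default limit.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # row limit quoted in the list above

def count_records(stream):
    """Count CSV records without loading the whole file into memory."""
    return sum(1 for _ in csv.reader(stream))

def fits_row_limit(stream, limit=EXCEL_MAX_ROWS):
    """Return True if the stream holds no more records than the limit."""
    return count_records(stream) <= limit

# Three records on four physical lines: the last record contains a line break.
sample = 'a,b\r\n1,2\r\n3,"x\r\ny"\r\n'
total = count_records(io.StringIO(sample, newline=""))
fits_default = fits_row_limit(io.StringIO(sample, newline=""))
fits_two = fits_row_limit(io.StringIO(sample, newline=""), limit=2)
```

For a file on disk, the same functions could be called with a handle from `open(path, newline="", encoding="utf-8")`, where the path and encoding depend on the file at hand. Counting records rather than physical lines matters because a quoted field may span several lines; the header, if present, is counted as a record.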
See also
* Tab-separated values
* Comparison of data-serialization formats
* Delimiter-separated values
* Delimiter collision
* Flat-file database
* Substitute character, Null character