Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.


Contents

In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents, totaling 2,810,784 bytes.

The University of Canterbury also offers the following corpora. Additional files may be added, so results should be reported only for individual files.

* The Artificial Corpus, a set of files with highly "artificial" data designed to evoke pathological or worst-case behavior. Last updated 2000 (tar timestamp).
* The Large Corpus, a set of large (megabyte-size) files. Contains an E. coli genome, a King James Bible, and the CIA World Factbook. Last updated 1997 (tar timestamp).
* The Miscellaneous Corpus. Contains one million digits of pi. Last updated 2000 (tar timestamp).
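
Because results are meant to be reported per file rather than as a single aggregate, a benchmark run could be sketched roughly as below. This is only an illustration under stated assumptions: Python's standard `zlib` module stands in for whatever algorithm is being tested, the corpus is assumed to have been unpacked into a local directory, and the names `measure` and `benchmark_directory` are hypothetical helpers, not part of the corpus distribution.

```python
import zlib
from pathlib import Path


def measure(name, data, level=9):
    """Compress one file's bytes with zlib and report its sizes."""
    compressed = zlib.compress(data, level)
    # Round-trip check: lossless compression must restore the input exactly.
    assert zlib.decompress(compressed) == data
    original_size = len(data)
    compressed_size = len(compressed)
    # Bits of output per byte of input; lower is better. Empty input gives 0.0.
    bits_per_byte = 8 * compressed_size / original_size if original_size else 0.0
    return {
        "file": name,
        "original": original_size,
        "compressed": compressed_size,
        "bits_per_byte": bits_per_byte,
    }


def benchmark_directory(corpus_dir):
    """Measure every regular file in corpus_dir, one result row per file."""
    rows = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            rows.append(measure(path.name, path.read_bytes()))
    return rows
```

Calling `benchmark_directory` on the directory holding the unpacked files returns one row per file, so each file's original size, compressed size, and bits per byte can be listed separately instead of being summed into a single figure.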


See also

* Data compression

