Silesia Corpus

The Silesia corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 2003 as an alternative to the Canterbury corpus and Calgary corpus, motivated by concerns about how well those corpora represented modern files. It contains various data types, including large text documents, executable files, and databases.


Contents

The corpus consists of 12 files, totaling 211 MB. The files were chosen to represent data types that the author considered likely to grow rapidly in size over time, such as computer programs and databases, along with more traditional compression benchmarks, such as large text files. Because it has a broader and more modern selection of data types, it is considered a better source of test data for compression algorithms than the Calgary corpus.
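As an illustration of how a corpus of this kind is typically used, the following minimal Python sketch computes a compression ratio (original size divided by compressed size) for every file in a directory. It assumes the corpus files have already been downloaded and unpacked into a local directory. The function name `compression_ratios` and the choice of the standard-library codecs `zlib`, `bz2`, and `lzma` are assumptions made for this example and are not prescribed by the corpus.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library codecs chosen only as examples; the corpus does not
# prescribe which algorithms should be tested.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(corpus_dir):
    """Return {file name: {codec name: original size / compressed size}}."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()  # each file is read fully into memory
        if not data:
            continue  # skip empty files to avoid dividing by zero
        results[path.name] = {
            name: len(data) / len(compress(data))
            for name, compress in CODECS.items()
        }
    return results
```

Calling `compression_ratios("silesia")`, where `"silesia"` is a hypothetical directory name, returns one entry per file, so the codecs can be compared per data type rather than only in aggregate. A fuller benchmark would normally also record compression and decompression time.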


See also

* Data compression


External links

* https://sun.aei.polsl.pl//~sdeor/pub/deo03.pdf