
Lossless compression

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy.[1] By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates (and therefore reduced media sizes).

By operation of the pigeonhole principle, no lossless compression algorithm can shrink the size of all possible data: Some data will get longer by at least one symbol or byte.


Compression algorithms are usually effective for human- and machine-readable documents and cannot shrink the size of random data that contain no redundancy. Different algorithms exist that are designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the uncompressed data are likely to contain.
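As a rough illustration of this point (not drawn from any specific reference here), one can compare how a general-purpose compressor such as zlib behaves on highly redundant input versus cryptographically random bytes; the payload sizes below are arbitrary choices.

```python
import os
import zlib

# Highly redundant input: a short phrase repeated many times.
redundant = b"the quick brown fox jumps over the lazy dog " * 5000

# Input with essentially no exploitable redundancy: random bytes of the same length.
random_data = os.urandom(len(redundant))

for label, data in [("redundant text", redundant), ("random bytes", random_data)]:
    compressed = zlib.compress(data, level=9)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes "
          f"(ratio {len(data) / len(compressed):.2f})")

# Typical outcome: the repetitive text shrinks by orders of magnitude, while the
# random bytes stay about the same size (or grow slightly from format overhead),
# reflecting the absence of statistical redundancy.
```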


Lossless data compression is used in many applications. For example, it is used in the ZIP file format and in the GNU tool gzip. It is also often used as a component within lossy data compression technologies (e.g. lossless mid/side joint stereo preprocessing by MP3 encoders and other lossy audio encoders).[2]


Lossless compression is used in cases where it is important that the original and the decompressed data be identical, or where deviations from the original data would be unfavourable. Common examples are executable programs, text documents, and source code. Some image file formats, like PNG or GIF, use only lossless compression, while others like TIFF and MNG may use either lossless or lossy methods. Lossless audio formats are most often used for archiving or production purposes, while smaller lossy audio files are typically used on portable players and in other cases where storage space is limited or exact replication of the audio is unnecessary.
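To make the requirement of exact reconstruction concrete, here is a minimal round-trip check using Python's gzip module; the sample payload is invented purely for illustration.

```python
import gzip

original = b"Executables, source code and text must survive compression bit for bit. " * 100

compressed = gzip.compress(original)   # lossless DEFLATE-based compression
restored = gzip.decompress(compressed)

# A lossless codec guarantees the decompressed output is byte-identical to the input.
assert restored == original
print(f"{len(original)} -> {len(compressed)} bytes, exact round trip: {restored == original}")
```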

Common general-purpose lossless compression techniques include:

ANS – Entropy encoding, used by LZFSE and Zstandard
Arithmetic coding – Entropy encoding
Burrows–Wheeler transform – Reversible transform for making textual data more compressible, used by bzip2
Huffman coding – Entropy encoding, pairs well with other algorithms
Lempel–Ziv compression (LZ77 and LZ78) – Dictionary-based algorithms that form the basis of many other methods
Lempel–Ziv–Markov chain algorithm (LZMA) – Very high compression ratio, used by 7-Zip and xz
Prediction by partial matching (PPM) – Optimized for compressing plain text
Run-length encoding (RLE) – Simple scheme that provides good compression of data containing many runs of the same value (see the sketch after this list)
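As a concrete example of the simplest technique above, the following sketch implements a byte-oriented run-length encoder and decoder; the (value, count) pair format with counts capped at 255 is just one common convention, chosen here for illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode runs as (value, count) pairs with counts capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        out += bytes([value, run])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (value, count) pairs back into the original byte string."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        value, run = encoded[i], encoded[i + 1]
        out += bytes([value]) * run
    return bytes(out)


sample = b"AAAAAABBBCCCCCCCCCD"
packed = rle_encode(sample)
assert rle_decode(packed) == sample             # lossless round trip
print(len(sample), "->", len(packed), "bytes")  # 19 -> 8 bytes for this input

# Note: input with no repeated bytes would double in size under this scheme,
# an instance of the expansion discussed in the counting argument below.
```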

Lossless compression algorithms and their implementations are routinely tested in head-to-head benchmarks. There are a number of better-known compression benchmarks. Some benchmarks cover only the data compression ratio, so winners in these benchmarks may be unsuitable for everyday use due to the slow speed of the top performers. Another drawback of some benchmarks is that their data files are known, so some program writers may optimize their programs for best performance on a particular data set. The winners on these benchmarks often come from the class of context-mixing compression software.

Matt Mahoney, in his February 2010 edition of the free booklet Data Compression Explained, additionally lists the following:[12]

The Calgary Corpus, dating back to 1987, is no longer widely used due to its small size. Matt Mahoney maintained the Calgary Compression Challenge, which was created and run from May 21, 1996, through May 21, 2016, by Leonid A. Broukhis.

The Large Text Compression Benchmark and the similar Hutter Prize both use a trimmed Wikipedia XML UTF-8 data set.[13]

The Generic Compression Benchmark, maintained by Matt Mahoney, tests compression of data generated by random Turing machines.[14]

Sami Runsas (the author of NanoZip) maintained Compression Ratings, a benchmark similar to the Maximum Compression multiple-file test, but with minimum speed requirements. It offered a calculator that allowed the user to weight the importance of speed and compression ratio. The top programs were fairly different due to the speed requirement. In January 2010, the top program was NanoZip, followed by FreeArc, CCM, flashzip, and 7-Zip.

The Monster of Compression benchmark by Nania Francesco Antonio tested compression on 1 GB of public data with a 40-minute time limit. In December 2009, the top-ranked archiver was NanoZip 0.07a and the top-ranked single-file compressor was ccmx 1.30c.

The Compression Ratings website published a chart summary of the "frontier" in compression ratio and time.[15]


The Compression Analysis Tool[16] is a Windows application that enables end users to benchmark the performance characteristics of streaming implementations of LZF4, Deflate, ZLIB, GZIP, BZIP2 and LZMA using their own data. It produces measurements and charts with which users can compare the compression speed, decompression speed and compression ratio of the different compression methods, and examine how the compression level, buffer size and flushing operations affect the results.
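A comparable measurement can be sketched with Python's standard-library codecs, with zlib, bz2 and lzma standing in for the formats named above; the timing loop and test payload are illustrative assumptions, not the tool's actual methodology.

```python
import bz2
import lzma
import time
import zlib

# Illustrative payload; in practice a user would load their own files.
payload = b"benchmarking lossless codecs on repetitive sample data " * 20000

methods = {
    "zlib": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2":  (lambda d: bz2.compress(d, 9),  bz2.decompress),
    "lzma": (lambda d: lzma.compress(d),    lzma.decompress),
}

for name, (compress, decompress) in methods.items():
    t0 = time.perf_counter()
    packed = compress(payload)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == payload  # sanity check: the codec is lossless
    print(f"{name:5s} ratio {len(payload) / len(packed):7.1f}  "
          f"compress {t1 - t0:.3f}s  decompress {t2 - t1:.3f}s")
```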

The claim that no lossless algorithm can shrink every file can be proven with a counting argument. Assume that each file is represented as a string of bits of some arbitrary length.

Suppose that there is a compression algorithm that transforms every file into an output file that is no longer than the original file, and that at least one file will be compressed into an output file that is shorter than the original file.

Let M be the least number such that there is a file F with length M bits that compresses to something shorter. Let N be the length (in bits) of the compressed version of F.

Because N < M, every file of length N keeps its size during compression. There are 2^N such files possible. Together with F, this makes 2^N + 1 files that all compress into one of the 2^N files of length N.

But 2^N is smaller than 2^N + 1, so by the pigeonhole principle there must be some file of length N that is simultaneously the output of the compression function on two different inputs. That file cannot be decompressed reliably (which of the two originals should that yield?), which contradicts the assumption that the algorithm was lossless.

We must therefore conclude that our original hypothesis (that the compression function makes no file longer) is necessarily untrue.
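The counting at the heart of this argument can be checked mechanically for a tiny case; in the sketch below, the lengths M = 4 and N = 3 are arbitrary illustrative choices.

```python
from itertools import product

# Suppose the shortest file F that shrinks has length M = 4 bits and its
# compressed form has length N = 3 bits (values chosen purely for illustration).
M, N = 4, 3

# Every file of length N must compress to a file of length exactly N:
# it cannot grow (by hypothesis) and cannot shrink (M is the minimal shrinking length).
files_of_length_N = list(product("01", repeat=N))
outputs_of_length_N = list(product("01", repeat=N))

inputs_needing_an_output = len(files_of_length_N) + 1  # the 2^N files plus F itself
print(f"inputs to place: {inputs_needing_an_output}, "
      f"distinct outputs available: {len(outputs_of_length_N)}")

# 2^N + 1 inputs cannot map injectively into 2^N outputs, so two different
# inputs share a compressed form and cannot both be decompressed correctly.
assert inputs_needing_an_output > len(outputs_of_length_N)
```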

References

Sayood, Khalid (October 27, 2017). Introduction to Data Compression. The Morgan Kaufmann Series in Multimedia Information and Systems (5 ed.). Morgan Kaufmann. ISBN 978-0-12809474-7. (790 pages)

Sayood, Khalid, ed. (December 18, 2002). Lossless Compression Handbook (Communications, Networking and Multimedia) (1 ed.). Academic Press. ISBN 978-0-12390754-7. (488 pages)

"LZF compression format". GitHub. Retrieved October 17, 2017.

Phamdo, Nam. "Theory of Data Compression". Data Compression. Archived from the original on May 8, 2016. Retrieved October 17, 2017.

"Lossless comparison". Hydrogenaudio Knowledgebase. January 5, 2015. Retrieved October 17, 2017.

"Lossless and lossy audio formats for music". Bobulous Central. November 6, 2003. Retrieved October 17, 2017.

"Image Compression Benchmark"