lrzip -- a compression tool optimised for large files

I have two other articles, "Compression tools on linux -- gzip vs bzip2 vs lzma vs compress" and "parallel compression utilities on linux -- lbzip2, pbzip2 and pigz". I mentioned lrzip in one of them, but did not compare it with the other tools. The reason is that, unlike tools that use a single compression algorithm, lrzip uses an extended version of rzip, which does a first-pass long-distance redundancy reduction. The lrzip modifications make it scale according to memory size. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB in the original rzip) in the input file. The second stage compresses the output of the first stage with a standard compression algorithm: bzip2 in rzip, and LZMA by default in lrzip, with bzip2, gzip, LZO and ZPAQ available as alternatives.
It uses a lot of memory, usually as much as the size of the file being compressed. It is also a parallel compression utility: by default it detects the number of cores and runs that many threads.
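
To make the two-stage idea concrete, here is a toy sketch in Python. It is not how rzip or lrzip is implemented: the fixed 4096-byte chunks, the SHA-1 lookup table and the 'R'/'L' record format are all assumptions made for illustration (rzip itself uses a rolling hash and variable-length matches). It only shows why removing long-distance duplicates first helps a back end such as bzip2, whose blocks are at most 900 kB and so cannot see a repeat 1 MiB away.

```python
import bz2
import hashlib
import os

CHUNK = 4096  # toy fixed chunk size (assumption); rzip matches variable-length runs


def dedupe(data: bytes) -> bytes:
    """Stage 1 (toy): replace repeated chunks with a reference to their first copy."""
    first_seen = {}  # chunk digest -> offset of the first occurrence in the input
    out = bytearray()
    for offset in range(0, len(data), CHUNK):
        chunk = data[offset:offset + CHUNK]
        digest = hashlib.sha1(chunk).digest()
        if len(chunk) == CHUNK and digest in first_seen:
            # 'R' marker + 8-byte offset of the earlier identical chunk
            out += b"R" + first_seen[digest].to_bytes(8, "big")
        else:
            first_seen.setdefault(digest, offset)
            # 'L' marker + 4-byte length + the literal bytes
            out += b"L" + len(chunk).to_bytes(4, "big") + chunk
    return bytes(out)


def restore(encoded: bytes) -> bytes:
    """Inverse of dedupe(): rebuild the original bytes from the records."""
    out = bytearray()
    pos = 0
    while pos < len(encoded):
        if encoded[pos:pos + 1] == b"R":
            start = int.from_bytes(encoded[pos + 1:pos + 9], "big")
            out += out[start:start + CHUNK]  # copy from already restored data
            pos += 9
        else:
            length = int.from_bytes(encoded[pos + 1:pos + 5], "big")
            out += encoded[pos + 5:pos + 5 + length]
            pos += 5 + length
    return bytes(out)


def two_stage_compress(data: bytes) -> bytes:
    """Stage 1 (long-distance dedupe) followed by stage 2 (bzip2)."""
    return bz2.compress(dedupe(data))


def two_stage_decompress(blob: bytes) -> bytes:
    return restore(bz2.decompress(blob))


if __name__ == "__main__":
    block = os.urandom(1 << 20)   # 1 MiB of incompressible data
    data = block + block          # the same block again, 1 MiB later
    print("bzip2 only :", len(bz2.compress(data)), "bytes")
    print("two stages :", len(two_stage_compress(data)), "bytes")
```

On input like this, bzip2 alone has to store both copies of the block, while the two-stage version stores the block once plus a few kilobytes of references.
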

Test condition:

The machine has 2 sockets, 8 Intel(R) Core(TM) i7 CPUs 870 @ 2.93GHz
16GB memory, 2 sets of mirrored local SATA disks.
Linux kernel 2.6.32-358.18.1.el6.x86_64
For the test below I used log files generated by PostgreSQL (similar to text files, but without spaces, tabs, newlines and so on). I built one large 5 GiB file out of many small archive files for this test.

Performance test

The tests below run the lrzip utility with each of its compression algorithms at the default compression level of 7. The ratio is the percentage size reduction of the compressed file (the same definition gzip -v uses).
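
As a rough sketch of how each test run can be started, the Python helper below builds the lrzip command line for a given algorithm. The flags (-b, -g, -l, -z, -L and -p) are the ones listed in lrzip's help output; the helper itself, its name and the file name pg.log are only illustrative, not something that ships with lrzip.

```python
from typing import List, Optional

# LZMA is lrzip's default back end, so it needs no flag.
ALGORITHM_FLAGS = {
    "lzma": None,
    "bzip2": "-b",
    "gzip": "-g",
    "lzo": "-l",
    "zpaq": "-z",
}


def lrzip_command(path: str, algorithm: str = "lzma", level: int = 7,
                  threads: Optional[int] = None) -> List[str]:
    """Build (but do not run) an lrzip command line for one test case."""
    if algorithm not in ALGORITHM_FLAGS:
        raise ValueError(f"unknown algorithm: {algorithm}")
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    argv = ["lrzip"]
    flag = ALGORITHM_FLAGS[algorithm]
    if flag is not None:
        argv.append(flag)
    argv += ["-L", str(level)]          # compression level, 7 is the default
    if threads is not None:
        argv += ["-p", str(threads)]    # override the detected core count
    argv.append(path)
    return argv


if __name__ == "__main__":
    for name in ALGORITHM_FLAGS:
        print(" ".join(lrzip_command("pg.log", name)))
```

The resulting list can be passed to subprocess.run() and timed with the shell's time builtin or time.perf_counter().
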

 algorithm    compress     uncompress   ratio
 bzip2        2m40.056s    2m3.004s     75.62%
 gzip         1m42.475s    1m45.110s    72.80%
 lzo          1m24.291s    2m14.019s    61.46%
 zpaq         21m47.440s   21m26.733s   82.43%
 none (LZMA)  8m49.422s    1m35.528s    81.42%
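
To put these numbers in perspective, the small Python sketch below converts the timings into throughput and the ratios into compressed sizes for the 5 GiB test file. The figures are copied from the table; the function names are made up for this example, and the ratio is assumed to be the plain percentage reduction, (1 - compressed / original) * 100.

```python
import re

FILE_SIZE_MIB = 5 * 1024  # the 5 GiB test file

# algorithm -> (compress time, uncompress time, ratio in percent), from the table
RESULTS = {
    "bzip2": ("2m40.056s", "2m3.004s", 75.62),
    "gzip": ("1m42.475s", "1m45.110s", 72.80),
    "lzo": ("1m24.291s", "2m14.019s", 61.46),
    "zpaq": ("21m47.440s", "21m26.733s", 82.43),
    "lzma": ("8m49.422s", "1m35.528s", 81.42),
}


def to_seconds(stamp: str) -> float:
    """Convert a 'time'-style stamp such as '2m40.056s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp}")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput_mib_s(stamp: str) -> float:
    """MiB of original data processed per second of wall-clock time."""
    return FILE_SIZE_MIB / to_seconds(stamp)


def compressed_size_mib(ratio_percent: float) -> float:
    """Size left after a given percentage reduction."""
    return FILE_SIZE_MIB * (1 - ratio_percent / 100)


if __name__ == "__main__":
    for name, (comp, uncomp, ratio) in RESULTS.items():
        print(f"{name:6s} compress {throughput_mib_s(comp):6.1f} MiB/s  "
              f"uncompress {throughput_mib_s(uncomp):6.1f} MiB/s  "
              f"output {compressed_size_mib(ratio):7.1f} MiB")
```
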
If disk space really matters to you, then this tool is the best for you. Its compression speed is also relatively good because it uses multiple threads.
In general,
ZPAQ gives the best compression ratio, but takes a lot of time on both compression and decompression.
gzip takes almost the same time to compress as LZO, but with a better compression ratio.
LZO is the fastest at compression, although in this test its decompression (2m14s) was slower than gzip, bzip2 and LZMA.
LZMA gives an extremely good compression ratio, close to ZPAQ, and had the fastest decompression in this test.
bzip2 sits in the middle on both compression ratio and speed.


lrzip works only on a single file, so use a wrapper for archive work. The lrzip package ships several companion tools for this: lrztar creates an archive, lrzuntar extracts files from an archive, and lrzcat decompresses a file to STDOUT.
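
A minimal Python sketch of how these wrappers might be called from a script is shown below. The directory and file names are hypothetical, and the helpers only build the command lines; run_if_installed() executes one only when the tool is actually found on the PATH.

```python
import shutil
import subprocess
from typing import List


def archive_command(directory: str) -> List[str]:
    """lrztar: pack a directory into a compressed archive."""
    return ["lrztar", directory]


def extract_command(archive: str) -> List[str]:
    """lrzuntar: extract the files from an archive created by lrztar."""
    return ["lrzuntar", archive]


def cat_command(compressed: str) -> List[str]:
    """lrzcat: decompress a file to STDOUT."""
    return ["lrzcat", compressed]


def run_if_installed(argv: List[str]) -> bool:
    """Run the command if its executable exists; return False otherwise."""
    if shutil.which(argv[0]) is None:
        return False
    subprocess.run(argv, check=True)
    return True


if __name__ == "__main__":
    # 'pg_logs' is a hypothetical directory name used only for illustration.
    for argv in (archive_command("pg_logs"),
                 extract_command("pg_logs.tar.lrz"),
                 cat_command("pg.log.lrz")):
        print(" ".join(argv))
```
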

The major disadvantages:

The main lrzip application only works on single files so it requires the lrztar wrapper to fake a complete archiver.
It requires a lot of memory to get the best performance (as much memory as the size of the data to compress), and is not really usable for compression with less than 256MB. Decompression requires less RAM and works on machines with less memory. Sometimes swap may need to be enabled on these lower-RAM machines for the operating system to be happy.
STDIN/STDOUT works fine on both compression and decompression, but larger files compressed in this manner will end up being less efficiently compressed.