Parallel xz compression

I have been interested in compression from the mid 80’s. First working with zoo because I had to fit my modula-2 source and compiler on a Atari ST floppy disc. Later that decade making the GIF compression module in C++ for Renderstar output by doing a reverse engineering of the source of a display program for the IBM 8514/A

I submitted a patch to SuSE at the end of the 90’s that allowed RPM distribution based on bzip2 instead of gzip, which would have allowed to cut 2-3 CDs from the multi-cd installation of SuSE Linux.

After that I mainly worked with bzip2, switching to pbzip2 when that became available.

I only noticed xz some time ago, because an xz compressed archive of the Python source code was an option for downloading.

I downloaded, installed and compiled xz and a new version of gnu tar on all the machines I was working with (including a server still running Ubuntu 8.04). I tried out the compression on tar files of my client previously compressed with pbzip2 and switched to xz because the resulting files were significantly smaller. That xz was not taking advantage of the multiple CPUs available was not so much of a problem in this setup.

GNU parallel and xz

Not too long after that, I read about recend and pipe in the example Processing a big file using more cores. This combines the -k option to keep blocks in the same order in the output and splits the input. The example:

cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

works with gzip, but you can easily change that to use xz:

# DON'T USE THE FOLLOWING
cat bigfile | parallel --pipe --recend '' -k xz -9 >bigfile.xz

So I changed the line in the Makefile doing the compression on my clients machine and got a near lineair speedup on the multiple cores.

parallel and xz revisited

I keep the files generated on my clients machine. Using ls -lrt in the directory where they are kept, I noticed a rather extreme increase in size at one point in time (normal is a gradual increase of a few kb every day). Thinking back, I realised that must have been around the time I switched from using plain xz to the parallel + xz compression.

So after recompressing one of the parallel + xz files and finding a few percent increase in filesize, I investigated what the cause for this might be.

Doing some benchmarks

I decided to write a small to_xz utility with some test routines and have it automatically run gzip, bzip2, pbzip2, parallel + bzip2 and parallel + xz (the latter with various parameter settings). This was done multiple times on two larger (but not extreme) tar files. I used (-9) compression in all cases.

Here are the results:

original size: 236,431,360
for parallel runs suppressed printing of options "--pipe --recend '' -k"
                                       comp.size    perc       time
[/bin/gzip -9 -c                     ]  44621697  18.87%     23.505s
[/bin/bzip2 -9 -c                    ]  17998988   7.61%     33.202s
[xz -9 -c                            ]   9087260   3.84%     61.988s
[parallel bzip2 -9                   ]  19055224   8.05%      7.463s
[parallel xz -9                      ]  14809380   6.26%     11.695s
[parallel --block-size 8M bzip2 -9   ]  18061279   7.63%      6.816s
[parallel --block-size 8M xz -9      ]  10460876   4.42%     11.390s
[parallel --block-size 16M bzip2 -9  ]  18007478   7.61%      8.699s
[parallel --block-size 16M xz -9     ]   9940076   4.20%     14.116s
[parallel --block-size 32M bzip2 -9  ]  18024888   7.62%     10.860s
[parallel --block-size 32M xz -9     ]   9531560   4.03%     17.012s
[parallel --block-size 64M bzip2 -9  ]  18043517   7.63%     15.170s
[parallel --block-size 64M xz -9     ]   9290488   3.92%     24.645s
[parallel --block-size 128M bzip2 -9 ]  18024121   7.62%     21.631s
[parallel --block-size 128M xz -9    ]   9150320   3.87%     37.506s
[parallel --block-size 256M bzip2 -9 ]  17998988   7.61%     34.701s
[parallel --block-size 256M xz -9    ]   9087260   3.84%     63.408s
[parallel --block-size 512M bzip2 -9 ]  17998988   7.61%     34.697s
[parallel --block-size 512M xz -9    ]   9087260   3.84%     63.417s
[pbzip2 -9 -c                        ]  18922345   8.00%      6.634s
[pbzip2 -9 -c -m256                  ]  18922345   8.00%      6.589s
[pbzip2 -9 -c -p4                    ]  18922345   8.00%      8.872s

original size: 1,891,450,880
for parallel runs suppressed printing of options "--pipe --recend '' -k"
                                       comp.size    perc       time
[/bin/gzip -9 -c                     ] 356968234  18.87%    187.709s
[/bin/bzip2 -9 -c                    ] 144064108   7.61%    265.197s
[xz -9 -c                            ]  72277560   3.82%    496.560s
[parallel bzip2 -9                   ] 152589373   8.06%     57.701s
[parallel xz -9                      ] 118407888   6.26%     92.328s
[parallel --block-size 8M bzip2 -9   ] 144785912   7.65%     50.967s
[parallel --block-size 8M xz -9      ]  84230860   4.45%     80.172s
[parallel --block-size 16M bzip2 -9  ] 144425035   7.63%     53.440s
[parallel --block-size 16M xz -9     ]  79668828   4.21%     84.253s
[parallel --block-size 32M bzip2 -9  ] 144417816   7.63%     56.743s
[parallel --block-size 32M xz -9     ]  76702632   4.05%     92.188s
[parallel --block-size 64M bzip2 -9  ] 144312107   7.62%     62.722s
[parallel --block-size 64M xz -9     ]  74785896   3.95%     96.507s
[parallel --block-size 128M bzip2 -9 ] 144116096   7.61%     59.063s
[parallel --block-size 128M xz -9    ]  73523656   3.88%     95.385s
[parallel --block-size 256M bzip2 -9 ] 143945629   7.61%     53.055s
[parallel --block-size 256M xz -9    ]  72972556   3.85%     89.867s
[parallel --block-size 512M bzip2 -9 ] 143957632   7.61%     76.902s
[parallel --block-size 512M xz -9    ]  72586964   3.83%    150.103s
[parallel --block-size 1024M bzip2 -9] 144027013   7.61%    146.469s
[parallel --block-size 1024M xz -9   ]  72381080   3.82%    282.797s
[pbzip2 -9 -c                        ] 151292869   7.99%     51.491s
[pbzip2 -9 -c -m256                  ] 151292869   7.99%     51.622s
[pbzip2 -9 -c -p4                    ] 151292869   7.99%     70.287s

From the table can be seen that there is a speed penalty for blocksizes that make it impossible for parallel to feed all of the 8 kernels on my 16Gb memory machine. xz performs much better than parallel + bzip2 (at a speed penalty), which in intelf is always performs somewhat better than pbzip2.

After that I adapted my 2xz program to take the (uncompressed) size of its input into account when setting the blocksize. For smaller files that means that not all processors are churning data, but for the large files—where that matters—the whole family can play.

Posted on 2013-08-10.