devsk Advocate
Joined: 24 Oct 2003 Posts: 3003 Location: Bay Area, CA
Posted: Wed Jan 13, 2010 1:48 am Post subject: Lzma - wow!
|
|
So, I was comparing various compression techniques, and I found it amazing that the real entropy of a 210MB+ file was close to just 800KB. Look at this:
The file is a text file created by cat'ing /var/log/messages over and over.
Code: | # lt /var/tmp/mytestfile*
-rw-r--r-- 1 root root 217453638 2010-01-12 00:48 /var/tmp/mytestfile2
-rw-r--r-- 1 root root 20901610 2010-01-12 00:49 /var/tmp/mytestfile2.gz
-rw-r--r-- 1 root root 20831025 2010-01-12 00:49 /var/tmp/mytestfile2.pigz
-rw-r--r-- 1 root root 15097046 2010-01-12 17:24 /var/tmp/mytestfile2.bz2
-rw-r--r-- 1 root root 816164 2010-01-12 17:27 /var/tmp/mytestfile2.lzma
| Here are the times:
Code: |
# time lzma -z -M max -T 8 -c -9 /var/tmp/mytestfile2 > /var/tmp/mytestfile2.lzma
real 0m41.331s
# time bzip2 -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.bz2
real 0m27.679s
# time gzip -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.gz
real 0m2.951s
# time pigz -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.pigz
real 0m0.737s
|
pigz wins on time, but lzma massacres the competition on size: less than 1MB for LZMA while the rest are close to 20MB. I could reduce the time by using a lower compression level for LZMA, but then it's only about as good as bzip2.
Look at the decompression time:
Code: |
# time lzma -d -c /var/tmp/mytestfile2.lzma > /var/tmp/mytestfile3
real 0m0.186s
# time bzip2 -d -c /var/tmp/mytestfile2.bz2 > /var/tmp/mytestfile3
real 0m2.893s
# time gzip -d -c /var/tmp/mytestfile2.gz > /var/tmp/mytestfile3
real 0m0.868s
| That's a massacre by lzma. Decompression is REALLY fast.
So, LZMA seems like the best choice for situations where you compress once and use forever. A livecd is the ideal candidate: I don't mind if it takes 5 minutes to create, as long as it decompresses fast and takes a fraction of the space.
Now, I am patiently waiting for squashfs with LZMA support to land in the kernel.
Notes:
1. The real entropy of the file is bounded by the original 8MB /var/log/messages that I cat'ed over and over to create this 217MB file. LZMA came close to noticing that.
2. All tests were done in RAM to avoid I/O delays (note the folder /var/tmp). This is a pure test of the compression algo.
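For anyone who wants to reproduce this, here is a minimal benchmark sketch, assuming the same four tools are installed and /var/tmp is RAM-backed as above (the loop is just an illustration, not the exact commands I ran):
Code: | #!/bin/bash
# time each compressor on the same input and record the output size
f=/var/tmp/mytestfile2
for c in "gzip -9" "pigz -9" "bzip2 -9" "lzma -9"; do
    tool=${c%% *}                               # e.g. "gzip"
    ( time $c -c "$f" > "$f.$tool" ) 2>&1 | grep real
    ls -l "$f.$tool"
done |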
ppurka Advocate
Joined: 26 Dec 2004 Posts: 3256
Posted: Wed Jan 13, 2010 2:40 am Post subject:
|
|
Very interesting set of tests. Does this have any implication for the use of lzma to compress the kernel?
Personally, I have the kernel's compression set to gzip because "apparently" gzip takes less time to decompress. Now, this "apparently" is in question.
Also, does compressing the original 8MB file with lzma also lead to the same final size of ~800kB? That would mean lzma gets really close to the actual entropy of the source.
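For reference, on a 2.6.30+ kernel tree the choice lives under General setup -> Kernel compression mode, so a quick check (a sketch, assuming the usual /usr/src/linux layout) would be:
Code: | # show which compressor the kernel build will use for the compressed image
grep 'CONFIG_KERNEL_\(GZIP\|BZIP2\|LZMA\)' /usr/src/linux/.config
# CONFIG_KERNEL_LZMA=y means the kernel payload is lzma-compressed |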
Mike Hunt Watchman
Joined: 19 Jul 2009 Posts: 5287
Posted: Wed Jan 13, 2010 2:53 am Post subject:
|
|
I needed to cat /var/log/messages over 12 million times, and boy are my fingers sore!
Code: | # xz -z -M max -T 8 -c -9 /var/tmp/mytestfile2 > /var/tmp/mytestfile2.xz
# ls -l /var/tmp/mytestfile2*
-rw-r--r-- 1 root root 217946214 Jan 12 21:25 /var/tmp/mytestfile2
-rw-r--r-- 1 root root 31852 Jan 12 21:28 /var/tmp/mytestfile2.xz |
Compressing it with xz was about 9 times faster than bzip2; decompression of xz was almost instantaneous.
app-arch/xz-utils is keyword-masked and is blocked by app-arch/lzma-utils because of stable app-portage/eix.
Everything else is fine, though, because of: Code: | DEPEND="${DEPEND}
|| ( app-arch/xz-utils app-arch/lzma-utils )" |
cach0rr0 Bodhisattva
Joined: 13 Nov 2008 Posts: 4123 Location: Houston, Republic of Texas
Posted: Wed Jan 13, 2010 3:19 am Post subject:
|
|
Meaningless as my test is, since this isn't real data, but rather hugely redundant zeros:
Code: |
$ dd if=/dev/zero of=/home/meat/zeros
2421999+0 records in
2421999+0 records out
1240063488 bytes (1.2 GB) copied, 22.5843 s, 54.9 MB/s
|
I wasn't interested in time so much as overall compression.
Code: | lzma -z -M max -T 8 -c -9 zeros > zeros.lzma
bzip2 zeros |
yielded:
Code: |
-rw-r--r-- 1 meat meat 909 Jan 12 21:04 zeros.bz2
-rw-r--r-- 1 meat meat 171K Jan 12 21:07 zeros.lzma
|
I don't know how useful this is in determining the extent to which lzma compresses massively redundant data vs. bzip2, but I thought it worth sharing.
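A temp-file-free variant of the same comparison, assuming GNU head (byte count chosen to match the dd output above):
Code: | # pipe zeros straight through each compressor and count the output bytes
head -c 1240063488 /dev/zero | bzip2 -9 | wc -c
head -c 1240063488 /dev/zero | lzma -9 | wc -c |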
LesCoke n00b
Joined: 02 Jun 2008 Posts: 48 Location: Denton, Tx
Posted: Wed Jan 13, 2010 3:56 am Post subject:
|
|
Compression algorithms use various techniques to reduce size by replacing redundancy with shorthand. I suspect that a single copy of your original log file will compress to very nearly the same size as the file containing multiple copies.
Text compresses very well because each word / phrase can be replaced with a number; frequent words get smaller numbers than less frequent words.
Files containing long sequences of identical bytes can be compressed to a shorthand form: duplicate value XX, N times, ...
I'd be more interested in the results of compressing a large e-book.
Les
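A rough way to see the word-frequency point on any text file, assuming standard coreutils (substitute any large e-book for the log):
Code: | # count word frequencies; a handful of frequent words cover most of the text,
# which is exactly the redundancy that short codes exploit
tr -s '[:space:]' '\n' < /var/log/messages | sort | uniq -c | sort -rn | head |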
devsk Advocate
Joined: 24 Oct 2003 Posts: 3003 Location: Bay Area, CA
Posted: Wed Jan 13, 2010 4:35 am Post subject:
|
|
ppurka wrote: | Very interesting set of tests. Does this have any implication for the use of lzma to compress the kernel?
Personally, I have the kernel's compression set to gzip because "apparently" gzip takes less time to decompress. Now, this "apparently" is in question.
Also, does compressing the original 8MB file with lzma also lead to the same final size of ~800kB? That would mean lzma gets really close to the actual entropy of the source. | Yeah. I had an intermediate 36MB file which I had cat'ed together six times; that 36MB file compressed to 790KB... So, with -9, LZMA is pushing it almost to the limit of the source entropy.
ppurka Advocate
Joined: 26 Dec 2004 Posts: 3256
Posted: Wed Jan 13, 2010 5:19 am Post subject:
|
|
Code: | /var/tmp/portage> time lzma -z -c -9 a > a.lzma
lzma -z -c -9 a > a.lzma 422.27s user 1.12s system 98% cpu 7:11.23 total
/var/tmp/portage> time gzip -c -9 a > a.gz; time bzip2 -c -9 a > a.bz2
gzip -c -9 a > a.gz 24.46s user 0.13s system 98% cpu 24.896 total
bzip2 -c -9 a > a.bz2 54.13s user 0.16s system 98% cpu 55.103 total
/var/tmp/portage> ll
total 253M
-rw-r--r-- 1 root root 215M Jan 13 00:00 a
-rw-r--r-- 1 root root 16M Jan 13 00:14 a.bz2
-rw-r--r-- 1 root root 22M Jan 13 00:13 a.gz
-rw-r--r-- 1 root root 253K Jan 13 00:12 a.lzma
/var/tmp/portage> cp /var/log/emerge.log a-orig
/var/tmp/portage> time lzma -z -c -9 a-orig > a-orig.lzma; time gzip -c -9 a-orig > a-orig.gz; time bzip2 -c -9 a-orig > a-orig.bz2
lzma -z -c -9 a-orig > a-orig.lzma 6.53s user 0.03s system 98% cpu 6.657 total
gzip -c -9 a-orig > a-orig.gz 0.38s user 0.00s system 98% cpu 0.393 total
bzip2 -c -9 a-orig > a-orig.bz2 0.87s user 0.01s system 98% cpu 0.883 total
/var/tmp/portage> ll
total 257M
-rw-r--r-- 1 root root 215M Jan 13 00:00 a
-rw-r----- 1 root root 3.4M Jan 13 00:15 a-orig
-rw-r--r-- 1 root root 260K Jan 13 00:16 a-orig.bz2
-rw-r--r-- 1 root root 349K Jan 13 00:16 a-orig.gz
-rw-r--r-- 1 root root 222K Jan 13 00:16 a-orig.lzma
-rw-r--r-- 1 root root 16M Jan 13 00:14 a.bz2
-rw-r--r-- 1 root root 22M Jan 13 00:13 a.gz
-rw-r--r-- 1 root root 253K Jan 13 00:12 a.lzma
| This is the real comparison (with -9 for both gzip and bzip2).
By the way, you guys have some different version of lzma: mine (from lzma-utils) doesn't support -T or -M. Secondly, you have a seriously fast system there, devsk! My lzma took over 7 minutes! Mike Hunt wrote: | I needed to cat /var/log/messages over 12 million times, and boy are my fingers sore! | For loops to the rescue for me: Code: | for ((i=0;i<=60;i++)); do cat /var/log/emerge.log >> /var/tmp/portage/a; done |
Mike Hunt Watchman
Joined: 19 Jul 2009 Posts: 5287
Posted: Wed Jan 13, 2010 5:46 am Post subject:
|
|
Actually, I used a loop like this: Code: | for i in $(seq 1 1000000); do cat /var/log/messages >> /var/tmp/mytestfile2; done |
Otherwise, I would probably be cat'ing for a couple of months!
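A doubling loop gets there much faster, since each pass doubles the file instead of appending one copy (a sketch, assuming an 8MB starting log; 5 doublings gives ~256MB):
Code: | cp /var/log/messages /var/tmp/mytestfile2
for i in 1 2 3 4 5; do
    # concatenate the file with itself, doubling it each pass
    cat /var/tmp/mytestfile2 /var/tmp/mytestfile2 > /var/tmp/double.tmp
    mv /var/tmp/double.tmp /var/tmp/mytestfile2
done |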
devsk Advocate
Joined: 24 Oct 2003 Posts: 3003 Location: Bay Area, CA
Posted: Wed Jan 13, 2010 5:49 am Post subject:
|
|
My main box has an i7 920 OCed to 4.4GHz... so it just tears through stuff!
I have xz-utils. The -T option currently doesn't do anything; once that gets implemented, I think lzma compression will just fly. Think 8 threads with HT on... yeah baby! Parallel mksquashfs creates a 600MB livecd in under 40 seconds.
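Until -T does something, a do-it-yourself parallel sketch with split and xargs (assuming xz-utils; concatenated .xz streams decompress back to the concatenated input, though the per-chunk dictionaries cost some ratio):
Code: | # compress 32MB chunks on 8 cores, then join the streams
split -b 32m /var/tmp/mytestfile2 /var/tmp/chunk.
ls /var/tmp/chunk.* | xargs -P 8 -n 1 xz -9
cat /var/tmp/chunk.*.xz > /var/tmp/mytestfile2.par.xz
xz -dc /var/tmp/mytestfile2.par.xz | md5sum   # verify the round trip |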
devsk Advocate
Joined: 24 Oct 2003 Posts: 3003 Location: Bay Area, CA
Posted: Wed Jan 13, 2010 8:13 am Post subject:
|
|
I found this little utility called freearc. I am not sure why it is not in Portage; it is extremely fast and extremely efficient at compressing. Get a load of this:
Code: | $ time ./arc create -mt8 -m9 /var/tmp/mytestfile2.arclzma /var/tmp/mytestfile2
FreeArc 0.60 creating archive: /var/tmp/mytestfile2.arclzma
Compressed 1 file, 217,453,638 => 703,401 bytes. Ratio 0.3%
Compression time: cpu 9.43 secs, real 5.66 secs. Speed 38,409 kB/s
All OK
real 0m5.710s
$ cd /
$ \rm /var/tmp/mytestfile2
$ time unarc x /var/tmp/mytestfile2.arclzma
FreeArc 0.60 unpacker. Extracting archive: mytestfile2.arclzma
Extracting var/tmp/mytestfile2 (217453638 bytes)
All OK
real 0m0.692s
$ md5sum /var/tmp/mytestfile2.org /var/tmp/mytestfile2
24696247c934b7d581c156f001f362b6 /var/tmp/mytestfile2.org
24696247c934b7d581c156f001f362b6 /var/tmp/mytestfile2
| So, not only did this program create a file of just 703KB (12% smaller than xz-utils), it did it in 5.71 seconds compared to 40+ seconds for xz-utils. It decompresses slower than xz-utils, but still sub-second, so no big deal. Besides, it's faster than gzip at decompression.
Now that's what I call compression!
mv Watchman
Joined: 20 Apr 2005 Posts: 6780
Posted: Wed Jan 13, 2010 8:54 am Post subject: Re: Lzma - wow!
|
|
devsk wrote: | The file is a text file created by cat'ing /var/log/messages over and over. |
Tests of this kind are bogus: they essentially measure only the size of the compressor's dictionary. If one copy of the file is longer than the dictionary (which is probably the case here for most of the compressors you used, except perhaps lzma-utils/xz-utils with -9), those compressors are of course worse by some factor, since a compressor with a sufficiently large dictionary essentially stores only "now repeat the last thing x times". If the original (repeated) file gets larger, then lzma-utils/xz-utils will also "suddenly" produce results that are larger by some factor.
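The dictionary point is easy to demonstrate with gzip's 32KB window (a sketch, assuming /dev/urandom and coreutils): repeat a chunk that fits inside the window and gzip sees the repetition; repeat a larger chunk and it cannot:
Code: | dd if=/dev/urandom of=small bs=1k count=16 2>/dev/null    # 16KB, fits the window
dd if=/dev/urandom of=large bs=1k count=1024 2>/dev/null  # 1MB, does not
cat small small > small2; cat large large > large2
gzip -9 -c small2 > small2.gz; gzip -9 -c large2 > large2.gz
ls -l small2.gz large2.gz   # ~one copy vs ~two full copies |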
d2_racing Bodhisattva
Joined: 25 Apr 2005 Posts: 13047 Location: Ste-Foy,Canada
Posted: Wed Jan 13, 2010 12:54 pm Post subject:
|
|
ppurka wrote: | Very interesting set of tests. Does this have any implication for the use of lzma to compress the kernel? |
For the record, I know that we can use lzma, but I have never tested it.
Has anyone actually tested it?
mv Watchman
Joined: 20 Apr 2005 Posts: 6780
Posted: Wed Jan 13, 2010 1:46 pm Post subject:
|
|
d2_racing wrote: | ppurka wrote: | Very interesting set of tests. Does this have any implication for the use of lzma to compress the kernel? |
For the record, I know that we can use lzma, but I have never tested it. |
On x86 systems with 512MB RAM, it usually gives an out-of-memory error when booting with grub. On amd64 it works fine. The size difference is, as you would expect, some percent; I do not remember the exact figure at the moment.
d2_racing Bodhisattva
Joined: 25 Apr 2005 Posts: 13047 Location: Ste-Foy,Canada
Posted: Wed Jan 13, 2010 5:00 pm Post subject:
|
|
Out of memory? Well, that's weird.
mikegpitt Advocate
Joined: 22 May 2004 Posts: 3224
Posted: Wed Jan 13, 2010 6:20 pm Post subject:
|
|
LZMA compression time is looong. I was playing around with it on a livecd I was building, and compression took about 1.5+ hours, up from around 15-20 mins with bzip2. Annoying if you are editing and need to rebuild something many times.
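For what it's worth, once squashfs grows lzma/xz support the rebuild should look something like this (a sketch only; it assumes a squashfs-tools build with the xz compressor, which has not landed yet, and a hypothetical livecd root path):
Code: | # build the livecd image with xz compression across 8 processors
mksquashfs /path/to/livecd-root image.squashfs -comp xz -processors 8 |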
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9890 Location: almost Mile High in the USA
Posted: Wed Jan 13, 2010 6:24 pm Post subject:
|
|
Gzip still has its uses, but lzma is pretty nice...
Code: |
doujima:/tmp# time lzma < vmlinux > vmlinux.lzma
real 0m6.763s
user 0m6.596s
sys 0m0.120s
doujima:/tmp# time bzip2 < vmlinux > vmlinux.bz
real 0m4.372s
user 0m4.280s
sys 0m0.043s
doujima:/tmp# time gzip -9 < vmlinux > vmlinux.gz
real 0m0.857s
user 0m0.827s
sys 0m0.030s
doujima:/tmp# ls -l vmlinux*
-rwxr-xr-x 1 root root 3494717 Jan 13 11:19 vmlinux*
-rw-r--r-- 1 root root 1653809 Jan 13 11:20 vmlinux.bz
-rw-r--r-- 1 root root 1685150 Jan 13 11:20 vmlinux.gz
-rw-r--r-- 1 root root 1399599 Jan 13 11:19 vmlinux.lzma
|
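A quick ratio check on those numbers, assuming GNU awk and ls (the first listed file is the uncompressed vmlinux):
Code: | # print each file's size relative to the uncompressed vmlinux
ls -l vmlinux* | awk 'NR==1{orig=$5} {printf "%s %.1f%%\n", $NF, 100*$5/orig}' |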