
[GIT,PULL] Update LZO compression

Message ID 502B8FE3.7080501@oberhumer.com (mailing list archive)
State New, archived

Commit Message

Markus F.X.J. Oberhumer Aug. 15, 2012, 12:02 p.m. UTC
Hi Johannes,

On 2012-08-14 14:39, Johannes Stezenbach wrote:
> On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
>> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
>>>
>>> As stated in the README this version is significantly faster (typically more
>>> than 2 times faster!) than the current version, has been thoroughly tested on
>>> x86_64/i386/powerpc platforms and is intended to get included into the
>>> official Linux 3.6 or 3.7 release.
>>>
>>> I encourage all compression users to test and benchmark this new version,
>>> and I also would ask some official LZO maintainer to convert the updated
>>> source files into a GIT commit and possibly push it to Linus or linux-next.
> 
> Sorry for not reporting earlier, but I didn't have time to do real
> benchmarks, just a quick test on ARM926EJ-S using barebox,
> and found in the new version decompression is slower:
> http://lists.infradead.org/pipermail/barebox/2012-July/008268.html

I can only guess, but maybe your ARM cpu does not have an efficient
implementation of {get,put}_unaligned().

Could you please try the following patch and test if you can see
any significant speed difference?

Thanks,
Markus

> 
> BTW, do you have userspace code matching the old and new
> lzo versions?  It would be easier to benchmark.
> 
> Unfortunately I cannot claim high confidence in my benchmark results
> due to lack of time to do it properly; it would be useful if
> someone else could do some benchmarks on ARM before merging this.
> 
> 
> Johannes

Comments

Johannes Stezenbach Aug. 15, 2012, 2:45 p.m. UTC | #1
Hi Markus,

On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
> On 2012-08-14 14:39, Johannes Stezenbach wrote:
> > On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
> >> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
> >>>
> >>> As stated in the README this version is significantly faster (typically more
> >>> than 2 times faster!) than the current version, has been thoroughly tested on
> >>> x86_64/i386/powerpc platforms and is intended to get included into the
> >>> official Linux 3.6 or 3.7 release.
> >>>
> >>> I encourage all compression users to test and benchmark this new version,
> >>> and I also would ask some official LZO maintainer to convert the updated
> >>> source files into a GIT commit and possibly push it to Linus or linux-next.
> > 
> > Sorry for not reporting earlier, but I didn't have time to do real
> > benchmarks, just a quick test on ARM926EJ-S using barebox,
> > and found in the new version decompression is slower:
> > http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
> 
> I can only guess, but maybe your ARM cpu does not have an efficient
> implementation of {get,put}_unaligned().

Yes, ARMv5 cannot do unaligned accesses.  ARMv6+ can, but
I think the Linux kernel normally traps them for debugging, so
all ARM builds seem to use the generic {get,put}_unaligned()
implementation, which uses byte accesses and shifts.
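
(For reference, that generic byte-access-and-shift fallback amounts to
something like the sketch below; this is an illustration of the approach,
not the exact kernel source.)

/* Sketch of a byte-wise little-endian 32-bit load, in the spirit of the
 * generic get_unaligned() fallback; u8/u32 come from <linux/types.h>. */
static inline u32 get_unaligned_le32_sketch(const u8 *p)
{
	return p[0] | (p[1] << 8) | (p[2] << 16) | ((u32)p[3] << 24);
}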

> Could you please try the following patch and test if you can see
> any significant speed difference?

It doesn't make a significant difference.  I made the attached quick hack userspace code
using ARM kernel headers and barebox unlzop code.
(new == your new code, old == linux-3.5 git, test == new + your suggested change)
(sorry I had no time to clean it up)

I compressed a Linux Image with lzop (lzop <arch/arm/boot/Image >lzoimage)
and timed uncompression:

# time ./unlzopold <lzoimage >/dev/null
real    0m 0.29s
user    0m 0.19s
sys     0m 0.10s
# time ./unlzopold <lzoimage >/dev/null
real    0m 0.29s
user    0m 0.20s
sys     0m 0.09s
# time ./unlzopnew <lzoimage >/dev/null
real    0m 0.41s
user    0m 0.30s
sys     0m 0.10s
# time ./unlzopnew <lzoimage >/dev/null
real    0m 0.40s
user    0m 0.30s
sys     0m 0.10s
# time ./unlzopnew <lzoimage >/dev/null
real    0m 0.40s
user    0m 0.29s
sys     0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real    0m 0.39s
user    0m 0.28s
sys     0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real    0m 0.39s
user    0m 0.27s
sys     0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real    0m 0.39s
user    0m 0.27s
sys     0m 0.11s

FWIW I also checked the sha1sum to confirm the Image uncompressed OK.


Johannes
Markus F.X.J. Oberhumer Aug. 16, 2012, 6:27 a.m. UTC | #2
On 2012-08-15 16:45, Johannes Stezenbach wrote:
> On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
>> On 2012-08-14 14:39, Johannes Stezenbach wrote:
>>> On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
>>>> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
>>>>>
>>>>> As stated in the README this version is significantly faster (typically more
>>>>> than 2 times faster!) than the current version, has been thoroughly tested on
>>>>> x86_64/i386/powerpc platforms and is intended to get included into the
>>>>> official Linux 3.6 or 3.7 release.
>>>>>
>>>>> I encourage all compression users to test and benchmark this new version,
>>>>> and I also would ask some official LZO maintainer to convert the updated
>>>>> source files into a GIT commit and possibly push it to Linus or linux-next.
>>>
>>> Sorry for not reporting earlier, but I didn't have time to do real
>>> benchmarks, just a quick test on ARM926EJ-S using barebox,
>>> and found in the new version decompression is slower:
>>> http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
>>
>> I can only guess, but maybe your ARM cpu does not have an efficient
>> implementation of {get,put}_unaligned().
> 
> Yes, ARMv5 cannot do unaligned accesses.  ARMv6+ can, but
> I think the Linux kernel normally traps them for debugging, so
> all ARM builds seem to use the generic {get,put}_unaligned()
> implementation, which uses byte accesses and shifts.

Hmm - I could imagine that we're wasting a lot of possible speed gain
by not exploiting that feature on ARMv6+.
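
(As an illustration of what exploiting that could look like: on a core and
kernel configuration that permit native unaligned ldr/str, the access can be
expressed through a packed struct so the compiler emits a single load or
store.  A sketch under that assumption, not a proposed patch.)

/* Sketch, assuming native unaligned word accesses are allowed; the
 * compiler can then emit a plain ldr/str.  u32 is from <linux/types.h>. */
struct una_u32 { u32 x; } __attribute__((packed));

static inline u32 load_u32_sketch(const void *p)
{
	return ((const struct una_u32 *)p)->x;
}

static inline void store_u32_sketch(void *p, u32 v)
{
	((struct una_u32 *)p)->x = v;
}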

>> Could you please try the following patch and test if you can see
>> any significant speed difference?
> 
> It doesn't make a significant difference.  I made the attached quick hack userspace code
> using ARM kernel headers and barebox unlzop code.
> (new == your new code, old == linux-3.5 git, test == new + your suggested change)
> (sorry I had no time to clean it up)

My suggested COPY4 replacement probably has a lot of load stalls - maybe some
ARM expert could have a look and suggest a more efficient implementation.
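
(One way to cut down on those stalls, sketched here as a starting point
rather than a tested patch, is to issue all four byte loads before any of
the stores:)

/* Sketch only: read the four source bytes into temporaries first, then
 * store them, so each store is not waiting on the load right before it. */
#define COPY4(dst, src)						\
	do {							\
		unsigned char b0 = (src)[0], b1 = (src)[1];	\
		unsigned char b2 = (src)[2], b3 = (src)[3];	\
		(dst)[0] = b0; (dst)[1] = b1;			\
		(dst)[2] = b2; (dst)[3] = b3;			\
	} while (0)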

In any case, I still would like to see the new code in linux-next because
of the huge improvements on other modern CPUs.

Cheers,
Markus

> 
> I compressed a Linux Image with lzop (lzop <arch/arm/boot/Image >lzoimage)
> and timed uncompression:
> 
> # time ./unlzopold <lzoimage >/dev/null
> real    0m 0.29s
> user    0m 0.19s
> sys     0m 0.10s
> # time ./unlzopold <lzoimage >/dev/null
> real    0m 0.29s
> user    0m 0.20s
> sys     0m 0.09s
> # time ./unlzopnew <lzoimage >/dev/null
> real    0m 0.41s
> user    0m 0.30s
> sys     0m 0.10s
> # time ./unlzopnew <lzoimage >/dev/null
> real    0m 0.40s
> user    0m 0.30s
> sys     0m 0.10s
> # time ./unlzopnew <lzoimage >/dev/null
> real    0m 0.40s
> user    0m 0.29s
> sys     0m 0.11s
> # time ./unlzoptest <lzoimage >/dev/null
> real    0m 0.39s
> user    0m 0.28s
> sys     0m 0.11s
> # time ./unlzoptest <lzoimage >/dev/null
> real    0m 0.39s
> user    0m 0.27s
> sys     0m 0.11s
> # time ./unlzoptest <lzoimage >/dev/null
> real    0m 0.39s
> user    0m 0.27s
> sys     0m 0.11s
> 
> FWIW I also checked the sha1sum to confirm the Image uncompressed OK.
> 
> 
> Johannes
Johannes Stezenbach Aug. 16, 2012, 3:06 p.m. UTC | #3
On Thu, Aug 16, 2012 at 08:27:15AM +0200, Markus F.X.J. Oberhumer wrote:
> On 2012-08-15 16:45, Johannes Stezenbach wrote:
> > 
> > I made the attached quick hack userspace code
> > using ARM kernel headers and barebox unlzop code.
> > (new == your new code, old == linux-3.5 git, test == new + your suggested change)
> > (sorry I had no time to clean it up)
> 
> My suggested COPY4 replacement probably has a lot of load stalls - maybe some
> ARM expert could have a look and suggest a more efficient implementation.
> 
> In any case, I still would like to see the new code in linux-next because
> of the huge improvements on other modern CPUs.

Well, ~2x speedup on x86 is certainly a good achievement, but there
are more ARM based devices than there are PCs, and I guess many
embedded devices use lzo compressed kernels and file systems
while I'm not convinced many PCs rely on lzo in the kernel.

I know everyone's either busy or on vacation, but it would
be so cool if someone could test on a more modern ARM core;
with the userspace test code I posted it should be easy to do.


Johannes
Jeff Garzik Aug. 16, 2012, 3:21 p.m. UTC | #4
On 08/16/2012 02:27 AM, Markus F.X.J. Oberhumer wrote:
> On 2012-08-15 16:45, Johannes Stezenbach wrote:
>> On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
>>> On 2012-08-14 14:39, Johannes Stezenbach wrote:
>>>> On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
>>>>> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
>>>>>>
>>>>>> As stated in the README this version is significantly faster (typically more
>>>>>> than 2 times faster!) than the current version, has been thoroughly tested on
>>>>>> x86_64/i386/powerpc platforms and is intended to get included into the
>>>>>> official Linux 3.6 or 3.7 release.
>>>>>>
>>>>>> I encourage all compression users to test and benchmark this new version,
>>>>>> and I also would ask some official LZO maintainer to convert the updated
>>>>>> source files into a GIT commit and possibly push it to Linus or linux-next.
>>>>
>>>> Sorry for not reporting earlier, but I didn't have time to do real
>>>> benchmarks, just a quick test on ARM926EJ-S using barebox,
>>>> and found in the new version decompression is slower:
>>>> http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
>>>
>>> I can only guess, but maybe your ARM cpu does not have an efficient
>>> implementation of {get,put}_unaligned().
>>
>> Yes, ARMv5 cannot do unaligned accesses.  ARMv6+ can, but
>> I think the Linux kernel normally traps them for debugging, so
>> all ARM builds seem to use the generic {get,put}_unaligned()
>> implementation, which uses byte accesses and shifts.
>
> Hmm - I could imagine that we're wasting a lot of possible speed gain
> by not exploiting that feature on ARMv6+.

Or you could just realize that unaligned accesses are slow in the best 
case, and are simply not supported on some processors.

If you think a little bit, I bet you could come up with a solution that 
operates at cacheline-aligned granularity, something that would be _even 
faster_ than simply fixing the code to do aligned accesses.

	Jeff
Andi Kleen Aug. 16, 2012, 4:20 p.m. UTC | #5
> If you think a little bit, I bet you could come up with a solution that 
> operates at cacheline-aligned granularity, something that would be _even 
> faster_ than simply fixing the code to do aligned accesses.

Cache aligned compression is unlikely to compress anything at all.
Compression algorithms are usually by definition unaligned.

-Andi
Jeff Garzik Aug. 16, 2012, 4:48 p.m. UTC | #6
On 08/16/2012 12:20 PM, Andi Kleen wrote:
>> If you think a little bit, I bet you could come up with a solution that
>> operates at cacheline-aligned granularity, something that would be _even
>> faster_ than simply fixing the code to do aligned accesses.
>
> Cache aligned compression is unlikely to compress anything at all.
> Compression algorithms are usually by definition unaligned.

Sure, it's a bitstream, but that does not imply the impossibility of
reading data in a word-aligned manner.

Maybe cache-aligned is ambitious, because of resultant code bloat, but 
machine-int-aligned is doable and reasonable.
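
(A textbook way to do that, sketched here purely as an illustration and
under the assumptions of little-endian data and that it is safe to read the
aligned word following the last needed byte, is to combine two aligned
loads with shifts:)

/* Sketch: fetch 32 bits from an arbitrary byte address using only
 * aligned 32-bit loads.  Little-endian; u32 is from <linux/types.h>. */
static inline u32 load32_le_aligned_only(const unsigned char *p)
{
	unsigned long a = (unsigned long)p;
	const u32 *base = (const u32 *)(a & ~3UL);
	unsigned int shift = (a & 3) * 8;

	if (shift == 0)
		return base[0];
	return (base[0] >> shift) | (base[1] << (32 - shift));
}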

	Jeff
Johannes Stezenbach Aug. 16, 2012, 5:22 p.m. UTC | #7
On Thu, Aug 16, 2012 at 12:48:49PM -0400, Jeff Garzik wrote:
> On 08/16/2012 12:20 PM, Andi Kleen wrote:
> >>If you think a little bit, I bet you could come up with a solution that
> >>operates at cacheline-aligned granularity, something that would be _even
> >>faster_ than simply fixing the code to do aligned accesses.
> >
> >Cache aligned compression is unlikely to compress anything at all.
> >Compression algorithms are usually by definition unaligned.
> 
> Sure, it's a bitstream, but that does not imply the impossibility of
> reading data in a word-aligned manner.
> 
> Maybe cache-aligned is ambitious, because of resultant code bloat,
> but machine-int-aligned is doable and reasonable.

Well, I for one would be content if the old and new lzo versions
could be merged based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

(Assuming that the slowdown on ARM is due to unaligned access:
since the old version also uses get/put_unaligned, is the new
version actually doing more unaligned accesses?)
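
(As a small-scale sketch of keying behaviour off that symbol, the COPY4
selection in lzodefs.h could look roughly like this; an illustration of the
idea, not a tested patch:)

/* Sketch: use the word-wise copy only where the architecture declares
 * unaligned accesses to be cheap; fall back to byte copies otherwise. */
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
#define COPY4(dst, src)	\
		put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
#else
#define COPY4(dst, src)	\
		do { \
			(dst)[0] = (src)[0]; (dst)[1] = (src)[1]; \
			(dst)[2] = (src)[2]; (dst)[3] = (src)[3]; \
		} while (0)
#endif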


Johannes
Roman Mamedov Aug. 16, 2012, 5:25 p.m. UTC | #8
On Thu, 16 Aug 2012 17:06:47 +0200
Johannes Stezenbach <js@sig21.net> wrote:

> Well, ~2x speedup on x86 is certainly a good achievement, but there
> are more ARM based devices than there are PCs, and I guess many
> embedded devices use lzo compressed kernels and file systems
> while I'm not convinced many PCs rely on lzo in the kernel.

Keep in mind that a major user of LZO is the BTRFS filesystem, and I believe it
is used much more often on larger machines than on ARM; in fact it had problems
operating on ARM at all until quite recently.

> I know everyone's either busy or on vacation, but it would
> be so cool if someone could test on a more modern ARM core;
> with the userspace test code I posted it should be easy to do.

I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
There doesn't seem to be a significant difference between all three variants.

# time for i in {1..20}; do old/unlzop < rnd.lzo >/dev/null ; done

real	0m11.353s
user	0m3.060s
sys	0m8.170s
# time for i in {1..20}; do new/unlzop < rnd.lzo >/dev/null ; done

real	0m11.416s
user	0m3.030s
sys	0m8.200s
# time for i in {1..20}; do test/unlzop < rnd.lzo >/dev/null ; done

real	0m11.310s
user	0m3.100s
sys	0m8.150s
Andi Kleen Aug. 16, 2012, 5:52 p.m. UTC | #9
> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
> There doesn't seem to be a significant difference between all three variants.

I found that in compression benchmarks it depends a lot on the data
compressed.

urandom (which should be essentially incompressible) will be handled
by different code paths in the compressor than other more compressible data.
It becomes a complicated memcpy then.

But then there are I/O benchmarks which only use zeroes, which
also gives an unrealistic picture.

Usually it's best to use some corpus with different data types, from very
compressible to less so; and look at the aggregate.

For my snappy work I usually had at least large executables (medium), some
PDFs (already compressed; low), and uncompressed source code tars (high
compression).

-Andi
Geert Uytterhoeven Aug. 16, 2012, 6:18 p.m. UTC | #10
On Thu, Aug 16, 2012 at 7:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
>> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
>> There doesn't seem to be a significant difference between all three variants.
>
> I found that in compression benchmarks it depends a lot on the data
> compressed.
>
> urandom (which should be essentially incompressible) will be handled
> by different code paths in the compressor than other more compressible data.
> It becomes a complicated memcpy then.

In addition, locking the CPU to 60 MHz may reduce the memory access penalty,
as a cache miss then costs far fewer CPU cycles.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
james northrup Aug. 16, 2012, 6:55 p.m. UTC | #11
Looks like ARM results are inconclusive from a lot of folks without
bandwidth to do a write-up. What about just plain STAGING status for ARM so
the Android tweakers can beat on it for a while?


On Thu, Aug 16, 2012 at 11:18 AM, Geert Uytterhoeven
<geert@linux-m68k.org>wrote:

> On Thu, Aug 16, 2012 at 7:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
> >> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
> >> There doesn't seem to be a significant difference between all three variants.
>
Andi Kleen Aug. 16, 2012, 10:17 p.m. UTC | #12
On Thu, Aug 16, 2012 at 11:55:06AM -0700, james northrup wrote:
> Looks like ARM results are inconclusive from a lot of folks without
> bandwidth to do a write-up. What about just plain STAGING status for ARM so
> the Android tweakers can beat on it for a while?

Staging only really works for new drivers, not for updating existing
library functions like this.

I suppose you could keep both and have the architecture select with a
CONFIG.

-Andi
Mitch Harder Aug. 17, 2012, 1:23 a.m. UTC | #13
On Thu, Aug 16, 2012 at 5:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Thu, Aug 16, 2012 at 11:55:06AM -0700, james northrup wrote:
>> Looks like ARM results are inconclusive from a lot of folks without
>> bandwidth to do a write-up. What about just plain STAGING status for ARM so
>> the Android tweakers can beat on it for a while?
>
> Staging only really works for new drivers, not for updating existing
> library functions like this.
>
> I suppose you could keep both and have the architecture select with a
> CONFIG.
>

I've been doing some rough benchmarking with the updated LZO in btrfs.

My tests primarily consist of timing some typical copying, git
manipulation, and rsync runs using a set of kernel git sources.
Git sources are typically about 50% pack files, which won't compress
very well, with the remainder being mostly highly compressible source
files.

Of course, any underlying speed improvement attributable only to LZO
is not shown by tests like this. But I thought it would be interesting
to see the impact on some typical real-world btrfs operations.

I was seeing a 3-9% improvement in speed with the new LZO.

Copying several directories of git sources showed the most
improvement, ~9%.  Typical git operations, such as a git checkout or
git status, were only showing a 3-5% improvement, which is close to the
noise level of my tests.  Running multiple rsync processes showed a 5%
improvement.

With only 10 trials (5 with each LZO), I can't say I would
statistically hang my hat on these numbers.

Given all the other stuff that is going on in my rough benchmarks, a
3-9% improvement from a single change is probably pretty good.
Andi Kleen Sept. 7, 2012, 9:31 p.m. UTC | #14
james northrup <northrup.james@gmail.com> writes:

> Looks like ARM results are inconclusive from a lot of folks without bandwidth
> to do a write-up. What about just plain STAGING status for ARM so the Android
> tweakers can beat on it for a while?

Again, staging doesn't really work for a core library; it's more
for separate drivers.

FWIW we did some tests and the new library seems to be generally faster
on x86.

-Andi

Patch

diff --git a/lib/lzo/lzodefs.h b/lib/lzo/lzodefs.h
index ddc8db5..efc5714 100644
--- a/lib/lzo/lzodefs.h
+++ b/lib/lzo/lzodefs.h
@@ -12,8 +12,15 @@ 
  */


+#if defined(__arm__)
+#define COPY4(dst, src)        \
+               (dst)[0] = (src)[0]; (dst)[1] = (src)[1]; \
+               (dst)[2] = (src)[2]; (dst)[3] = (src)[3]
+#endif
+#ifndef COPY4
 #define COPY4(dst, src)        \
                put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
+#endif
 #if defined(__x86_64__)
 #define COPY8(dst, src)        \
                put_unaligned(get_unaligned((const u64 *)(src)), (u64 *)(dst))
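
One small nit on the ARM branch above, independent of performance: the macro
body is two statements with no surrounding block, so an invocation such as
"if (cond) COPY4(op, ip);" would only guard the first assignment.  A sketch
of the usual do/while(0) form, offered as a hardening suggestion rather than
part of the posted patch:

/* Sketch: same byte-wise copy, wrapped so the macro behaves as a single
 * statement even inside an unbraced if/else. */
#define COPY4(dst, src)	\
		do { \
			(dst)[0] = (src)[0]; (dst)[1] = (src)[1]; \
			(dst)[2] = (src)[2]; (dst)[3] = (src)[3]; \
		} while (0)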