| Message ID | 502B8FE3.7080501@oberhumer.com (mailing list archive) |
|---|---|
| State | New, archived |
Hi Markus,

On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
> On 2012-08-14 14:39, Johannes Stezenbach wrote:
> > On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
> >> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
> >>>
> >>> As stated in the README this version is significantly faster (typically more
> >>> than 2 times faster!) than the current version, has been thoroughly tested on
> >>> x86_64/i386/powerpc platforms and is intended to get included into the
> >>> official Linux 3.6 or 3.7 release.
> >>>
> >>> I encourage all compression users to test and benchmark this new version,
> >>> and I also would ask some official LZO maintainer to convert the updated
> >>> source files into a GIT commit and possibly push it to Linus or linux-next.
> >
> > Sorry for not reporting earlier, but I didn't have time to do real
> > benchmarks, just a quick test on ARM926EJ-S using barebox,
> > and found in the new version decompression is slower:
> > http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
>
> I can only guess, but maybe your ARM cpu does not have an efficient
> implementation of {get,put}_unaligned().

Yes, ARMv5 cannot do unaligned access. ARMv6+ can, but I think the Linux
kernel normally traps it for debugging, so all ARM variants seem to use the
generic {get,put}_unaligned() implementation, which uses byte accesses and
shifts.

> Could you please try the following patch and test if you can see
> any significant speed difference?

There is no significant difference. I made the attached quick-hack userspace
code using the ARM kernel headers and the barebox unlzop code.
(new == your new code, old == linux-3.5 git, test == new + your suggested change)
(sorry, I had no time to clean it up)

I compressed a Linux Image with lzop (lzop <arch/arm/boot/Image >lzoimage)
and timed decompression:

# time ./unlzopold <lzoimage >/dev/null
real	0m 0.29s
user	0m 0.19s
sys	0m 0.10s
# time ./unlzopold <lzoimage >/dev/null
real	0m 0.29s
user	0m 0.20s
sys	0m 0.09s
# time ./unlzopnew <lzoimage >/dev/null
real	0m 0.41s
user	0m 0.30s
sys	0m 0.10s
# time ./unlzopnew <lzoimage >/dev/null
real	0m 0.40s
user	0m 0.30s
sys	0m 0.10s
# time ./unlzopnew <lzoimage >/dev/null
real	0m 0.40s
user	0m 0.29s
sys	0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real	0m 0.39s
user	0m 0.28s
sys	0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real	0m 0.39s
user	0m 0.27s
sys	0m 0.11s
# time ./unlzoptest <lzoimage >/dev/null
real	0m 0.39s
user	0m 0.27s
sys	0m 0.11s

FWIW I also checked the sha1sum to confirm the Image uncompressed OK.


Johannes
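To illustrate the "byte access and shift" point: on a core without hardware
unaligned loads, a generic get/put_unaligned() pair behaves roughly like the
sketch below. This is only an illustration (helper names are made up, and the
kernel's actual generic helpers are functionally similar but not identical),
assuming little-endian data:

```c
/* Minimal sketch of byte-access-and-shift unaligned helpers for
 * little-endian data.  Illustration only -- not the kernel sources. */
#include <stdint.h>

static inline uint32_t sketch_get_unaligned_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static inline void sketch_put_unaligned_le32(uint32_t v, uint8_t *p)
{
	p[0] = v;
	p[1] = v >> 8;
	p[2] = v >> 16;
	p[3] = v >> 24;
}
```

A COPY4() built on these does four byte loads plus shifts and ORs, then four
shifts and byte stores per 32-bit word, which is plausibly why it can lose to
a plain four-byte copy on an ARM926EJ-S.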
On 2012-08-15 16:45, Johannes Stezenbach wrote:
> On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
>> On 2012-08-14 14:39, Johannes Stezenbach wrote:
>>> On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
>>>> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
>>>>>
>>>>> As stated in the README this version is significantly faster (typically more
>>>>> than 2 times faster!) than the current version, has been thoroughly tested on
>>>>> x86_64/i386/powerpc platforms and is intended to get included into the
>>>>> official Linux 3.6 or 3.7 release.
>>>>>
>>>>> I encourage all compression users to test and benchmark this new version,
>>>>> and I also would ask some official LZO maintainer to convert the updated
>>>>> source files into a GIT commit and possibly push it to Linus or linux-next.
>>>
>>> Sorry for not reporting earlier, but I didn't have time to do real
>>> benchmarks, just a quick test on ARM926EJ-S using barebox,
>>> and found in the new version decompression is slower:
>>> http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
>>
>> I can only guess, but maybe your ARM cpu does not have an efficient
>> implementation of {get,put}_unaligned().
>
> Yes, ARMv5 cannot do unaligned access. ARMv6+ can, but I think the Linux
> kernel normally traps it for debugging, so all ARM variants seem to use the
> generic {get,put}_unaligned() implementation, which uses byte accesses and
> shifts.

Hmm - I could imagine that we're wasting a lot of possible speed gain
by not exploiting that feature on ARMv6+.

>> Could you please try the following patch and test if you can see
>> any significant speed difference?
>
> There is no significant difference. I made the attached quick-hack userspace
> code using the ARM kernel headers and the barebox unlzop code.
> (new == your new code, old == linux-3.5 git, test == new + your suggested change)
> (sorry, I had no time to clean it up)

My suggested COPY4 replacement probably has a lot of load stalls - maybe some
ARM expert could have a look and suggest a more efficient implementation.

In any case, I still would like to see the new code in linux-next because
of the huge improvements on other modern CPUs.

Cheers,
Markus
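One hypothetical way to reduce such load stalls, sketched purely as an
illustration (untested, the macro name is made up, and whether it helps at all
depends on what the compiler can already prove about dst/src aliasing and on
the core's pipeline): read all four source bytes into temporaries before the
first store, so each store does not wait on the load immediately before it.

```c
/* Hypothetical COPY4 variant for cores without fast unaligned access:
 * gather the four source bytes first, then store them.  Untested sketch. */
#define COPY4_GATHER(dst, src)					\
	do {							\
		unsigned char _b0 = (src)[0], _b1 = (src)[1];	\
		unsigned char _b2 = (src)[2], _b3 = (src)[3];	\
		(dst)[0] = _b0; (dst)[1] = _b1;			\
		(dst)[2] = _b2; (dst)[3] = _b3;			\
	} while (0)
```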
On Thu, Aug 16, 2012 at 08:27:15AM +0200, Markus F.X.J. Oberhumer wrote:
> On 2012-08-15 16:45, Johannes Stezenbach wrote:
> >
> > I made the attached quick-hack userspace code
> > using the ARM kernel headers and the barebox unlzop code.
> > (new == your new code, old == linux-3.5 git, test == new + your suggested change)
> > (sorry, I had no time to clean it up)
>
> My suggested COPY4 replacement probably has a lot of load stalls - maybe some
> ARM expert could have a look and suggest a more efficient implementation.
>
> In any case, I still would like to see the new code in linux-next because
> of the huge improvements on other modern CPUs.

Well, ~2x speedup on x86 is certainly a good achievement, but there
are more ARM based devices than there are PCs, and I guess many
embedded devices use lzo compressed kernels and file systems
while I'm not convinced many PCs rely on lzo in the kernel.

I know everyone's either busy or on vacation, but it would
be so cool if someone could test on a more modern ARM core,
with the userspace test code I posted it should be easy to do.


Johannes
On 08/16/2012 02:27 AM, Markus F.X.J. Oberhumer wrote:
> On 2012-08-15 16:45, Johannes Stezenbach wrote:
>> On Wed, Aug 15, 2012 at 02:02:43PM +0200, Markus F.X.J. Oberhumer wrote:
>>> On 2012-08-14 14:39, Johannes Stezenbach wrote:
>>>> On Tue, Aug 14, 2012 at 01:44:02AM +0200, Markus F.X.J. Oberhumer wrote:
>>>>> On 2012-07-16 20:30, Markus F.X.J. Oberhumer wrote:
>>>>>>
>>>>>> As stated in the README this version is significantly faster (typically more
>>>>>> than 2 times faster!) than the current version, has been thoroughly tested on
>>>>>> x86_64/i386/powerpc platforms and is intended to get included into the
>>>>>> official Linux 3.6 or 3.7 release.
>>>>>>
>>>>>> I encourage all compression users to test and benchmark this new version,
>>>>>> and I also would ask some official LZO maintainer to convert the updated
>>>>>> source files into a GIT commit and possibly push it to Linus or linux-next.
>>>>
>>>> Sorry for not reporting earlier, but I didn't have time to do real
>>>> benchmarks, just a quick test on ARM926EJ-S using barebox,
>>>> and found in the new version decompression is slower:
>>>> http://lists.infradead.org/pipermail/barebox/2012-July/008268.html
>>>
>>> I can only guess, but maybe your ARM cpu does not have an efficient
>>> implementation of {get,put}_unaligned().
>>
>> Yes, ARMv5 cannot do unaligned access. ARMv6+ can, but I think the Linux
>> kernel normally traps it for debugging, so all ARM variants seem to use the
>> generic {get,put}_unaligned() implementation, which uses byte accesses and
>> shifts.
>
> Hmm - I could imagine that we're wasting a lot of possible speed gain
> by not exploiting that feature on ARMv6+.

Or you could just realize that unaligned accesses are slow in the best
case, and are simply not supported on some processors.

If you think a little bit, I bet you could come up with a solution that
operates at cacheline-aligned granularity, something that would be _even
faster_ than simply fixing the code to do aligned accesses.

	Jeff
> If you think a little bit, I bet you could come up with a solution that
> operates at cacheline-aligned granularity, something that would be _even
> faster_ than simply fixing the code to do aligned accesses.

Cache aligned compression is unlikely to compress anything at all.
Compression algorithms are usually by definition unaligned.

-Andi
On 08/16/2012 12:20 PM, Andi Kleen wrote:
>> If you think a little bit, I bet you could come up with a solution that
>> operates at cacheline-aligned granularity, something that would be _even
>> faster_ than simply fixing the code to do aligned accesses.
>
> Cache aligned compression is unlikely to compress anything at all.
> Compression algorithms are usually by definition unaligned.

Sure, it's a bitstream, but that does not imply the impossibility of
reading the data in a word-aligned manner.

Maybe cache-aligned is ambitious because of the resultant code bloat,
but machine-int-aligned is doable and reasonable.

	Jeff
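To make the word-aligned idea concrete, here is a rough sketch (not taken from
the LZO code, little-endian only, and the function name is made up) of how an
unaligned 32-bit value can be assembled from two aligned word loads plus
shifts. It reads up to three bytes past the value, so the buffer would need
padding, and whether it beats plain byte loads is very core-specific:

```c
#include <stdint.h>

/* Illustrative only: read a 32-bit little-endian value from an arbitrary
 * byte address while issuing nothing but naturally aligned 32-bit loads.
 * Ignores strict-aliasing concerns; may read up to 3 bytes past the value. */
static inline uint32_t load_le32_aligned_only(const uint8_t *p)
{
	uintptr_t addr = (uintptr_t)p;
	const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int shift = (unsigned int)(addr & 3) * 8;
	uint32_t lo = w[0];

	if (shift == 0)		/* already aligned, avoid the 32-bit shift */
		return lo;
	return (lo >> shift) | (w[1] << (32 - shift));
}
```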
On Thu, Aug 16, 2012 at 12:48:49PM -0400, Jeff Garzik wrote:
> On 08/16/2012 12:20 PM, Andi Kleen wrote:
> >> If you think a little bit, I bet you could come up with a solution that
> >> operates at cacheline-aligned granularity, something that would be _even
> >> faster_ than simply fixing the code to do aligned accesses.
> >
> > Cache aligned compression is unlikely to compress anything at all.
> > Compression algorithms are usually by definition unaligned.
>
> Sure, it's a bitstream, but that does not imply the impossibility of
> reading the data in a word-aligned manner.
>
> Maybe cache-aligned is ambitious because of the resultant code bloat,
> but machine-int-aligned is doable and reasonable.

Well, I for one would be content if the old and new lzo versions could
be merged based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

(Assuming that the slowdown on ARM is due to unaligned access: since the
old version also uses get/put_unaligned, is the new version actually
using more unaligned accesses?)


Johannes
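A rough sketch of what such a merge could look like at the COPY4 level
(illustration only, not a tested patch; CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
is the existing Kconfig symbol, but whether lib/lzo should key its copy
strategy off it is exactly the open question here):

```c
/* Sketch for lib/lzo/lzodefs.h context (u32, put_unaligned, get_unaligned
 * are available there): pick the copy strategy from the existing Kconfig
 * symbol instead of hard-coding per-architecture #ifdefs. */
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
#define COPY4(dst, src) \
	put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
#else
#define COPY4(dst, src) \
	do { \
		(dst)[0] = (src)[0]; (dst)[1] = (src)[1]; \
		(dst)[2] = (src)[2]; (dst)[3] = (src)[3]; \
	} while (0)
#endif
```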
On Thu, 16 Aug 2012 17:06:47 +0200
Johannes Stezenbach <js@sig21.net> wrote:

> Well, ~2x speedup on x86 is certainly a good achievement, but there
> are more ARM based devices than there are PCs, and I guess many
> embedded devices use lzo compressed kernels and file systems
> while I'm not convinced many PCs rely on lzo in the kernel.

Keep in mind that a major user of LZO is the BTRFS filesystem, and I
believe it is used much more often on larger machines than on ARM; in
fact it had problems operating on ARM at all until quite recently.

> I know everyone's either busy or on vacation, but it would
> be so cool if someone could test on a more modern ARM core,
> with the userspace test code I posted it should be easy to do.

I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using
cpufreq-set, and ran your test. rnd.lzo is a 9 MB file from /dev/urandom
compressed with lzo. There doesn't seem to be a significant difference
between all three variants.

# time for i in {1..20}; do old/unlzop < rnd.lzo >/dev/null ; done
real	0m11.353s
user	0m3.060s
sys	0m8.170s
# time for i in {1..20}; do new/unlzop < rnd.lzo >/dev/null ; done
real	0m11.416s
user	0m3.030s
sys	0m8.200s
# time for i in {1..20}; do test/unlzop < rnd.lzo >/dev/null ; done
real	0m11.310s
user	0m3.100s
sys	0m8.150s
> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
> There doesn't seem to be a significant difference between all three variants.

I found that in compression benchmarks it depends a lot on the data
being compressed.

urandom (which should be essentially incompressible) will be handled
by different code paths in the compressor than other, more compressible data.
It becomes a complicated memcpy then.

But then there are IO benchmarks which only do zeroes, which also
gives an unrealistic picture.

Usually it's best to use some corpus with different data types, from very
compressible to less so, and look at the aggregate. For my snappy work I
usually had at least large executables (medium), some PDFs (already
compressed; low), and uncompressed source code tars (high compression).

-Andi
On Thu, Aug 16, 2012 at 7:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
>> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
>> There doesn't seem to be a significant difference between all three variants.
>
> I found that in compression benchmarks it depends a lot on the data
> being compressed.
>
> urandom (which should be essentially incompressible) will be handled
> by different code paths in the compressor than other, more compressible data.
> It becomes a complicated memcpy then.

In addition, locking the CPU to 60 MHz may reduce the relative memory
access penalty, as a cache miss then costs far fewer cycles.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
						-- Linus Torvalds
Looks like the ARM results are inconclusive, from a lot of folks without
bandwidth to do a write-up. What about just plain STAGING status for ARM,
so the Android tweakers can beat on it for a while?

On Thu, Aug 16, 2012 at 11:18 AM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Thu, Aug 16, 2012 at 7:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I have locked the Allwinner A10 CPU in my Mele A2000 to 60 MHz using cpufreq-set,
>> and ran your test. rnd.lzo is a 9 MB file from /dev/urandom compressed with lzo.
>> There doesn't seem to be a significant difference between all three variants.
On Thu, Aug 16, 2012 at 11:55:06AM -0700, james northrup wrote:
> Looks like the ARM results are inconclusive, from a lot of folks without
> bandwidth to do a write-up. What about just plain STAGING status for ARM,
> so the Android tweakers can beat on it for a while?

Staging only really works for new drivers, not for updating existing
library functions like this.

I suppose you could keep both and have the architecture select with a
CONFIG.

-Andi
On Thu, Aug 16, 2012 at 5:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Thu, Aug 16, 2012 at 11:55:06AM -0700, james northrup wrote:
>> Looks like the ARM results are inconclusive, from a lot of folks without
>> bandwidth to do a write-up. What about just plain STAGING status for ARM,
>> so the Android tweakers can beat on it for a while?
>
> Staging only really works for new drivers, not for updating existing
> library functions like this.
>
> I suppose you could keep both and have the architecture select with a
> CONFIG.

I've been doing some rough benchmarking with the updated LZO in btrfs.
My tests primarily consist of timing some typical copying, git
manipulation, and rsync runs using a set of kernel git sources. Git
sources are typically about 50% pack files, which won't compress very
well, with the remainder being mostly highly compressible source files.

Of course, any underlying speed improvement attributable only to LZO is
not directly shown by a test like this, but I thought it would be
interesting to see the impact on some typical real-world btrfs operations.

I was seeing between 3-9% improvement in speed with the new LZO. Copying
several directories of git sources showed the most improvement, ~9%.
Typical git operations, such as a git checkout or git status, were only
showing 3-5% improvement, which is close to the noise level of my tests.
Running multiple rsync processes showed a 5% improvement.

With only 10 trials (5 with each LZO version), I can't say I would
statistically hang my hat on these numbers. But given all the other stuff
that is going on in my rough benchmarks, a 3-9% improvement from a single
change is probably pretty good.
james northrup <northrup.james@gmail.com> writes:
> Looks like the ARM results are inconclusive, from a lot of folks without
> bandwidth to do a write-up. What about just plain STAGING status for ARM,
> so the Android tweakers can beat on it for a while?

Again, staging doesn't really work for a core library; it's more for
separate drivers.

FWIW we did some tests and the new library seems to be generally faster
on x86.

-Andi
diff --git a/lib/lzo/lzodefs.h b/lib/lzo/lzodefs.h
index ddc8db5..efc5714 100644
--- a/lib/lzo/lzodefs.h
+++ b/lib/lzo/lzodefs.h
@@ -12,8 +12,15 @@
  */
 
 
+#if defined(__arm__)
+#define COPY4(dst, src) \
+	(dst)[0] = (src)[0]; (dst)[1] = (src)[1]; \
+	(dst)[2] = (src)[2]; (dst)[3] = (src)[3]
+#endif
+#ifndef COPY4
 #define COPY4(dst, src) \
 	put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
+#endif
 #if defined(__x86_64__)
 #define COPY8(dst, src) \
 	put_unaligned(get_unaligned((const u64 *)(src)), (u64 *)(dst))