Message ID: 20230412212126.3966502-1-j.neuschaefer@gmx.net
Series:     ARM ZSTD boot compression
On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>
> - LZO:  7.2 MiB, 6 seconds
> - ZSTD: 5.6 MiB, 60 seconds

That seems unexpected, as the usual numbers say it's about 25% slower
than LZO. Do you have an idea why it is so much slower here? How long
does it take to decompress the generated arch/arm/boot/Image file in
user space on the same hardware using lzop and zstd?

     Arnd
On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>
>> - LZO:  7.2 MiB, 6 seconds
>> - ZSTD: 5.6 MiB, 60 seconds
>
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do you have an idea why it is so much slower
> here? How long does it take to decompress the
> generated arch/arm/boot/Image file in user space on the same
> hardware using lzop and zstd?

I looked through this a bit more and found two interesting points:

- zstd uses a lot more unaligned loads and stores while decompressing.
  On armv5 those turn into individual byte accesses, while the other
  decompressors can likely use word-aligned accesses. This could make
  a huge difference if caches are disabled during the decompression.

- The sliding window on zstd is much larger, with the kernel using an
  8 MB window (windowLog=23), compared to the normal 32 KiB for deflate
  (couldn't find the default for lzo), so on machines with no L2 cache
  it is much more likely to thrash the small L1 dcache used on most
  ARM9 cores.

     Arnd
On Wed, Apr 12, 2023 at 11:33:15PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> > This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> > Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >
> > - LZO:  7.2 MiB, 6 seconds
> > - ZSTD: 5.6 MiB, 60 seconds
>
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do you have an idea why it is so much slower
> here?

No clear idea. I guess it might be related to caching or unaligned
memory accesses somehow. I suspected CONFIG_CPU_DCACHE_WRITETHROUGH,
which was enabled, but disabling it didn't improve performance.

> How long does it take to decompress the generated arch/arm/boot/Image
> file in user space on the same hardware using lzop and zstd?

Unfortunately, the unzstd userspace tool requires a buffer of 128 MiB
(the window size), which is too big for my usual devboard (which has
about 100 MiB available). I'd have to test on a different board.

Jonathan

---

# uname -a
Linux buildroot 6.3.0-rc6-00020-g023058d50f2f #1212 PREEMPT Fri Apr 14 20:58:21 CEST 2023 armv5tejl GNU/Linux
# ls -lh
total 13M
-rw-r--r--    1 root     root        7.5M Jan  1 00:07 piggy.lzo
-rw-r--r--    1 root     root        5.8M Jan  1 00:07 piggy.zstd
# time lzop -d piggy.lzo -c > /dev/null
lzop: piggy.lzo: warning: ignoring trailing garbage in lzop file
Command exited with non-zero status 2
real    0m 3.38s
user    0m 3.20s
sys     0m 0.18s
# time unzstd piggy.zstd -c > /dev/null
[  858.270000] __vm_enough_memory: pid: 114, comm: unzstd, not enough memory for the allocation
piggy.zstd : Decoding error (36) : Allocation error : not enough memory
Command exited with non-zero status 1
real    0m 0.03s
user    0m 0.01s
sys     0m 0.03s
On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> > On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>
> >> - LZO:  7.2 MiB, 6 seconds
> >> - ZSTD: 5.6 MiB, 60 seconds
> >
> > That seems unexpected, as the usual numbers say it's about 25%
> > slower than LZO. Do you have an idea why it is so much slower
> > here? How long does it take to decompress the
> > generated arch/arm/boot/Image file in user space on the same
> > hardware using lzop and zstd?
>
> I looked through this a bit more and found two interesting points:
>
> - zstd uses a lot more unaligned loads and stores while
>   decompressing. On armv5 those turn into individual byte
>   accesses, while the other decompressors can likely use
>   word-aligned accesses. This could make a huge difference if
>   caches are disabled during the decompression.
>
> - The sliding window on zstd is much larger, with the kernel
>   using an 8 MB window (windowLog=23), compared to the normal
>   32 KiB for deflate (couldn't find the default for lzo), so on
>   machines with no L2 cache it is much more likely to thrash the
>   small L1 dcache used on most ARM9 cores.
>
>      Arnd

Makes sense.

For ZSTD as used in kernel decompression (the zstd22 configuration),
the window is even bigger, 128 MiB. (AFAIU)

Thanks

Jonathan
> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
>
> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>
>>>> - LZO:  7.2 MiB, 6 seconds
>>>> - ZSTD: 5.6 MiB, 60 seconds
>>>
>>> That seems unexpected, as the usual numbers say it's about 25%
>>> slower than LZO. Do you have an idea why it is so much slower
>>> here? How long does it take to decompress the
>>> generated arch/arm/boot/Image file in user space on the same
>>> hardware using lzop and zstd?
>>
>> I looked through this a bit more and found two interesting points:
>>
>> - zstd uses a lot more unaligned loads and stores while
>>   decompressing. On armv5 those turn into individual byte
>>   accesses, while the other decompressors can likely use
>>   word-aligned accesses. This could make a huge difference if
>>   caches are disabled during the decompression.
>>
>> - The sliding window on zstd is much larger, with the kernel
>>   using an 8 MB window (windowLog=23), compared to the normal
>>   32 KiB for deflate (couldn't find the default for lzo), so on
>>   machines with no L2 cache it is much more likely to thrash the
>>   small L1 dcache used on most ARM9 cores.
>>
>>      Arnd
>
> Makes sense.
>
> For ZSTD as used in kernel decompression (the zstd22 configuration),
> the window is even bigger, 128 MiB. (AFAIU)

Sorry, I'm a bit late to the party, I wasn't getting LKML email for
some time...

But this is totally configurable. You can switch compression
configurations at any time. If you believe that the window size is the
issue causing the speed regression, you could configure zstd to use
e.g. a 256 KB window size like this:

    zstd -19 --zstd=wlog=18

This will keep the same algorithm search strength, but limit the
decoder memory usage.
I will also try to get this patchset working on my machine, and try to
debug. The 10x speed difference is not expected, and we see much better
speed in ARM userspace. I suspect it has something to do with the
preboot environment. E.g. when implementing x86-64 zstd kernel
decompression, I noticed that memcpy(dst, src, 16) wasn't getting
inlined properly, causing a massive performance penalty.

Best,
Nick Terrell

> Thanks
>
> Jonathan
On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
> > On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
> > On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> >> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> >>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>>>
> >>>> - LZO:  7.2 MiB, 6 seconds
> >>>> - ZSTD: 5.6 MiB, 60 seconds

[...]

> > For ZSTD as used in kernel decompression (the zstd22 configuration),
> > the window is even bigger, 128 MiB. (AFAIU)
>
> Sorry, I'm a bit late to the party, I wasn't getting LKML email for
> some time...
>
> But this is totally configurable. You can switch compression
> configurations at any time. If you believe that the window size is the
> issue causing the speed regression, you could configure zstd to use
> e.g. a 256 KB window size like this:
>
>     zstd -19 --zstd=wlog=18
>
> This will keep the same algorithm search strength, but limit the
> decoder memory usage.

Noted.

> I will also try to get this patchset working on my machine, and try to
> debug. The 10x speed difference is not expected, and we see much
> better speed in ARM userspace. I suspect it has something to do with
> the preboot environment. E.g. when implementing x86-64 zstd kernel
> decompression, I noticed that memcpy(dst, src, 16) wasn't getting
> inlined properly, causing a massive performance penalty.

In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
think the main culprit here was particularly bad luck in my choice of
test hardware.

The inlining issues are a good point, noted for the next time I work on
this.

Thanks,
Jonathan
> On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <j.neuschaefer@gmx.net> wrote:
>
> On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
>>> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
>>> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>>>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>>>
>>>>>> - LZO:  7.2 MiB, 6 seconds
>>>>>> - ZSTD: 5.6 MiB, 60 seconds
> [...]
>>> For ZSTD as used in kernel decompression (the zstd22 configuration),
>>> the window is even bigger, 128 MiB. (AFAIU)
>>
>> Sorry, I'm a bit late to the party, I wasn't getting LKML email for
>> some time...
>>
>> But this is totally configurable. You can switch compression
>> configurations at any time. If you believe that the window size is
>> the issue causing the speed regression, you could configure zstd to
>> use e.g. a 256 KB window size like this:
>>
>>     zstd -19 --zstd=wlog=18
>>
>> This will keep the same algorithm search strength, but limit the
>> decoder memory usage.
>
> Noted.
>
>> I will also try to get this patchset working on my machine, and try
>> to debug. The 10x speed difference is not expected, and we see much
>> better speed in ARM userspace. I suspect it has something to do with
>> the preboot environment. E.g. when implementing x86-64 zstd kernel
>> decompression, I noticed that memcpy(dst, src, 16) wasn't getting
>> inlined properly, causing a massive performance penalty.
>
> In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> think the main culprit here was particularly bad luck in my choice of
> test hardware.
>
> The inlining issues are a good point, noted for the next time I work
> on this.
I went out and bought a Raspberry Pi 4 to test on. I've done some crude
measurements and see that zstd kernel decompression is just slightly
slower than gzip kernel decompression, and about 2x slower than lzo. In
userspace decompression of the same file (a manually compressed kernel
image), I see that zstd decompression is significantly faster than
gzip. So it is definitely something about the preboot environment, or
how the code is compiled for the preboot environment, that is causing
the issue. My next step is to set up qemu on my Pi to try to get some
perf measurements of the decompression.

One thing I've really been struggling with, and what thwarted my last
attempts at adding ARM zstd kernel decompression, was getting preboot
logs printed. I've figured out that I need CONFIG_DEBUG_LL=y, but I've
yet to actually get any logs. And I can't figure out how to get it
working in qemu. I haven't tried qemu on an ARM host with KVM, but
that's the next thing I will try. Do you happen to have any advice on
how to get preboot logs in qemu? Is it possible only on an ARM host, or
would it also be possible on an x86-64 host?

Thanks,
Nick Terrell

> Thanks,
> Jonathan
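[Editorial note, not part of the thread: CONFIG_DEBUG_LL alone only covers the kernel proper; the decompressor's own messages additionally need CONFIG_DEBUG_UNCOMPRESS, and the DEBUG_LL UART address must match the board. A config sketch, assuming QEMU's ARM "virt" machine with its PL011 UART at physical address 0x09000000 — the addresses and the virtual-address value here are examples that must be checked against the actual machine's memory map and arch/arm/Kconfig.debug:]

```text
# Sketch for QEMU's ARM "virt" machine (PL011 at phys 0x09000000);
# addresses are board-specific -- verify against the machine you boot.
CONFIG_DEBUG_LL=y
CONFIG_DEBUG_LL_UART_PL01X=y
CONFIG_DEBUG_UART_PHYS=0x09000000
# Example virtual address only; the decompressor (DEBUG_UNCOMPRESS)
# runs with the MMU off and uses the physical address.
CONFIG_DEBUG_UART_VIRT=0xf9000000
# Route the decompressor's preboot messages to the DEBUG_LL UART:
CONFIG_DEBUG_UNCOMPRESS=y
```

[With that in place the output appears on qemu's emulated serial port (e.g. -serial stdio), which works under pure emulation (TCG) on an x86-64 host as well; KVM on an ARM host is only a speed optimization, not a requirement for seeing the logs.]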