Message ID: 20230412212126.3966502-1-j.neuschaefer@gmx.net
Series:     ARM ZSTD boot compression
On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>
> - LZO:  7.2 MiB, 6 seconds
> - ZSTD: 5.6 MiB, 60 seconds

That seems unexpected, as the usual numbers say it's about 25% slower
than LZO. Do you have an idea why it is so much slower here? How long
does it take to decompress the generated arch/arm/boot/Image file in
user space on the same hardware using lzop and zstd?

     Arnd
On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>
>> - LZO:  7.2 MiB, 6 seconds
>> - ZSTD: 5.6 MiB, 60 seconds
>
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do you have an idea why it is so much slower
> here? How long does it take to decompress the
> generated arch/arm/boot/Image file in user space on the same
> hardware using lzop and zstd?

I looked through this a bit more and found two interesting points:

- zstd uses a lot more unaligned loads and stores while decompressing.
  On armv5 those turn into individual byte accesses, while the other
  decompressors can likely use word-aligned accesses. This could make
  a huge difference if caches are disabled during the decompression.

- The sliding window on zstd is much larger, with the kernel using an
  8 MB window (windowLog=23), compared to the normal 32 KiB for deflate
  (couldn't find the default for lzo), so on machines with no L2 cache
  it is much more likely to thrash the small L1 dcache used on most
  ARM9 cores.

     Arnd
On Wed, Apr 12, 2023 at 11:33:15PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> > This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> > Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >
> > - LZO:  7.2 MiB, 6 seconds
> > - ZSTD: 5.6 MiB, 60 seconds
>
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do you have an idea why it is so much slower
> here?

No clear idea. I guess it might be related to caching or unaligned
memory accesses somehow. I suspected CONFIG_CPU_DCACHE_WRITETHROUGH,
which was enabled, but disabling it didn't improve performance.

> How long does it take to decompress the generated arch/arm/boot/Image
> file in user space on the same hardware using lzop and zstd?

Unfortunately, the unzstd userspace tool requires a buffer of 128 MiB
(the window size), which is too big for my usual devboard (which has
about 100 MiB available). I'd have to test on a different board.

Jonathan

---

# uname -a
Linux buildroot 6.3.0-rc6-00020-g023058d50f2f #1212 PREEMPT Fri Apr 14 20:58:21 CEST 2023 armv5tejl GNU/Linux
# ls -lh
total 13M
-rw-r--r--    1 root     root        7.5M Jan  1 00:07 piggy.lzo
-rw-r--r--    1 root     root        5.8M Jan  1 00:07 piggy.zstd
# time lzop -d piggy.lzo -c > /dev/null
lzop: piggy.lzo: warning: ignoring trailing garbage in lzop file
Command exited with non-zero status 2
real    0m 3.38s
user    0m 3.20s
sys     0m 0.18s
# time unzstd piggy.zstd -c > /dev/null
[  858.270000] __vm_enough_memory: pid: 114, comm: unzstd, not enough memory for the allocation
piggy.zstd : Decoding error (36) : Allocation error : not enough memory
Command exited with non-zero status 1
real    0m 0.03s
user    0m 0.01s
sys     0m 0.03s
On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> > On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>
> >> - LZO:  7.2 MiB, 6 seconds
> >> - ZSTD: 5.6 MiB, 60 seconds
> >
> > That seems unexpected, as the usual numbers say it's about 25%
> > slower than LZO. Do you have an idea why it is so much slower
> > here? How long does it take to decompress the
> > generated arch/arm/boot/Image file in user space on the same
> > hardware using lzop and zstd?
>
> I looked through this a bit more and found two interesting points:
>
> - zstd uses a lot more unaligned loads and stores while
>   decompressing. On armv5 those turn into individual byte
>   accesses, while the other decompressors can likely use
>   word-aligned accesses. This could make a huge difference if
>   caches are disabled during the decompression.
>
> - The sliding window on zstd is much larger, with the kernel
>   using an 8 MB window (windowLog=23), compared to the normal
>   32 KiB for deflate (couldn't find the default for lzo), so on
>   machines with no L2 cache it is much more likely to thrash the
>   small L1 dcache used on most ARM9 cores.
>
>      Arnd

Makes sense.

For ZSTD as used in kernel decompression (the zstd22 configuration),
the window is even bigger, 128 MiB. (AFAIU)

Thanks

Jonathan
> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
>
> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>
>>>> - LZO:  7.2 MiB, 6 seconds
>>>> - ZSTD: 5.6 MiB, 60 seconds
>>>
>>> That seems unexpected, as the usual numbers say it's about 25%
>>> slower than LZO. Do you have an idea why it is so much slower
>>> here? How long does it take to decompress the
>>> generated arch/arm/boot/Image file in user space on the same
>>> hardware using lzop and zstd?
>>
>> I looked through this a bit more and found two interesting points:
>>
>> - zstd uses a lot more unaligned loads and stores while
>>   decompressing. On armv5 those turn into individual byte
>>   accesses, while the other decompressors can likely use
>>   word-aligned accesses. This could make a huge difference if
>>   caches are disabled during the decompression.
>>
>> - The sliding window on zstd is much larger, with the kernel
>>   using an 8 MB window (windowLog=23), compared to the normal
>>   32 KiB for deflate (couldn't find the default for lzo), so on
>>   machines with no L2 cache it is much more likely to thrash the
>>   small L1 dcache used on most ARM9 cores.
>>
>>      Arnd
>
> Makes sense.
>
> For ZSTD as used in kernel decompression (the zstd22 configuration),
> the window is even bigger, 128 MiB. (AFAIU)

Sorry, I'm a bit late to the party, I wasn't getting LKML email for
some time...

But this is totally configurable. You can switch compression
configurations at any time. If you believe that the window size is the
issue causing the speed regression, you could configure zstd to use
e.g. a 256 KB window size like this:

    zstd -19 --zstd=wlog=18

This will keep the same algorithm search strength, but limit the
decoder memory usage.
I will also try to get this patchset working on my machine, and try to
debug. The 10x speed difference is not expected, and we see much better
speed in ARM userspace. I suspect it has something to do with the
preboot environment. E.g. when implementing x86-64 zstd kernel
decompression, I noticed that memcpy(dst, src, 16) wasn't getting
inlined properly, causing a massive performance penalty.

Best,
Nick Terrell

> Thanks
>
> Jonathan
On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
> > On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
> > On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> >> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> >>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>>>
> >>>> - LZO:  7.2 MiB, 6 seconds
> >>>> - ZSTD: 5.6 MiB, 60 seconds

[...]

> > For ZSTD as used in kernel decompression (the zstd22 configuration),
> > the window is even bigger, 128 MiB. (AFAIU)
>
> Sorry, I'm a bit late to the party, I wasn't getting LKML email for
> some time...
>
> But this is totally configurable. You can switch compression
> configurations at any time. If you believe that the window size is the
> issue causing the speed regression, you could configure zstd to use
> e.g. a 256 KB window size like this:
>
>     zstd -19 --zstd=wlog=18
>
> This will keep the same algorithm search strength, but limit the
> decoder memory usage.

Noted.

> I will also try to get this patchset working on my machine, and try to
> debug. The 10x speed difference is not expected, and we see much
> better speed in ARM userspace. I suspect it has something to do with
> the preboot environment. E.g. when implementing x86-64 zstd kernel
> decompression, I noticed that memcpy(dst, src, 16) wasn't getting
> inlined properly, causing a massive performance penalty.

In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
think the main culprit here was particularly bad luck in my choice of
test hardware.

The inlining issues are a good point, noted for the next time I work on
this.

Thanks,
Jonathan
> On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <j.neuschaefer@gmx.net> wrote:
>
> On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
>>> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
>>> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>>>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>>>
>>>>>> - LZO:  7.2 MiB, 6 seconds
>>>>>> - ZSTD: 5.6 MiB, 60 seconds
> [...]
>>> For ZSTD as used in kernel decompression (the zstd22 configuration),
>>> the window is even bigger, 128 MiB. (AFAIU)
>>
>> Sorry, I'm a bit late to the party, I wasn't getting LKML email for
>> some time...
>>
>> But this is totally configurable. You can switch compression
>> configurations at any time. If you believe that the window size is
>> the issue causing the speed regression, you could configure zstd to
>> use e.g. a 256 KB window size like this:
>>
>>     zstd -19 --zstd=wlog=18
>>
>> This will keep the same algorithm search strength, but limit the
>> decoder memory usage.
>
> Noted.
>
>> I will also try to get this patchset working on my machine, and try
>> to debug. The 10x speed difference is not expected, and we see much
>> better speed in ARM userspace. I suspect it has something to do with
>> the preboot environment. E.g. when implementing x86-64 zstd kernel
>> decompression, I noticed that memcpy(dst, src, 16) wasn't getting
>> inlined properly, causing a massive performance penalty.
>
> In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> think the main culprit here was particularly bad luck in my choice of
> test hardware.
>
> The inlining issues are a good point, noted for the next time I work
> on this.
I went out and bought a Raspberry Pi 4 to test on. I've done some crude
measurements and see that zstd kernel decompression is just slightly
slower than gzip kernel decompression, and about 2x slower than lzo. In
userspace decompression of the same file (a manually compressed kernel
image), I see that zstd decompression is significantly faster than
gzip. So it is definitely something about the preboot environment, or
how the code is compiled for the preboot environment, that is causing
the issue. My next step is to set up qemu on my Pi to try to get some
perf measurements of the decompression.

One thing I've really been struggling with, and what thwarted my last
attempts at adding ARM zstd kernel decompression, was getting preboot
logs printed. I've figured out that I need CONFIG_DEBUG_LL=y, but I've
yet to actually get any logs. And I can't figure out how to get it
working in qemu. I haven't tried qemu on an ARM host with KVM, but
that's the next thing I will try. Do you happen to have any advice on
how to get preboot logs in qemu? Is it possible only on an ARM host, or
would it also be possible on an x86-64 host?

Thanks,
Nick Terrell

> Thanks,
> Jonathan
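[Editorial note, not part of the thread: CONFIG_DEBUG_LL alone only covers the kernel proper; the decompressor's own messages additionally need CONFIG_DEBUG_UNCOMPRESS, and the DEBUG_LL UART address must match the board. A config sketch, assuming QEMU's ARM "virt" machine with its PL011 UART at physical address 0x09000000 — the addresses and the virtual-address value here are examples that must be checked against the actual machine's memory map and arch/arm/Kconfig.debug:]

```text
# Sketch for QEMU's ARM "virt" machine (PL011 at phys 0x09000000);
# addresses are board-specific -- verify against the machine you boot.
CONFIG_DEBUG_LL=y
CONFIG_DEBUG_LL_UART_PL01X=y
CONFIG_DEBUG_UART_PHYS=0x09000000
# Example virtual address only; the decompressor (DEBUG_UNCOMPRESS)
# runs with the MMU off and uses the physical address.
CONFIG_DEBUG_UART_VIRT=0xf9000000
# Route the decompressor's preboot messages to the DEBUG_LL UART:
CONFIG_DEBUG_UNCOMPRESS=y
```

[With that in place the output appears on qemu's emulated serial port (e.g. -serial stdio), which works under pure emulation (TCG) on an x86-64 host as well; KVM on an ARM host is only a speed optimization, not a requirement for seeing the logs.]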