mm/zsmalloc: include sparsemem.h for MAX_PHYSMEM_BITS

Message ID	bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=PHNm=EN=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5B6FD20888 From: Stefan Agner <stefan@agner.ch> To: minchan@kernel.org, ngupta@vflare.org, sergey.senozhatsky.work@gmail.com Cc: akpm@linux-foundation.org, sjenning@linux.vnet.ibm.com, gregkh@linuxfoundation.org, arnd@arndb.de, stefan@agner.ch, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH] mm/zsmalloc: include sparsemem.h for MAX_PHYSMEM_BITS Date: Sat, 7 Nov 2020 16:22:06 +0100 Message-Id: <bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/zsmalloc: include sparsemem.h for MAX_PHYSMEM_BITS \| expand mm/zsmalloc: include sparsemem.h for MAX_PHYSMEM_BITS

Stefan Agner Nov. 7, 2020, 3:22 p.m. UTC

Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
the MAX_PHYSMEM_BITS define on all architectures.

This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
more than 4GB of memory:
  Unable to handle kernel NULL pointer dereference at virtual address 00000000
  pgd = a27bd01c
  [00000000] *pgd=236a0003, *pmd=1ffa64003
  Internal error: Oops: 207 [#1] SMP ARM
  Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
  CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
  Hardware name: BCM2711
  PC is at zs_map_object+0x94/0x338
  LR is at zram_bvec_rw.constprop.0+0x330/0xa64
  pc : [<c0602b38>]    lr : [<c0bda6a0>]    psr: 60000013
  sp : e376bbe0  ip : 00000000  fp : c1e2921c
  r10: 00000002  r9 : c1dda730  r8 : 00000000
  r7 : e8ff7a00  r6 : 00000000  r5 : 02f9ffa0  r4 : e3710000
  r3 : 000fdffe  r2 : c1e0ce80  r1 : ebf979a0  r0 : 00000000
  Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
  Control: 30c5383d  Table: 235c2a80  DAC: fffffffd
  Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
  Stack: (0xe376bbe0 to 0xe376c000)
  ...
  [<c0602b38>] (zs_map_object) from [<c0bda6a0>] (zram_bvec_rw.constprop.0+0x330/0xa64)
  [<c0bda6a0>] (zram_bvec_rw.constprop.0) from [<c0bdaf78>] (zram_submit_bio+0x1a4/0x40c)
  [<c0bdaf78>] (zram_submit_bio) from [<c085806c>] (submit_bio_noacct+0xd0/0x3c8)
  [<c085806c>] (submit_bio_noacct) from [<c08583b0>] (submit_bio+0x4c/0x190)
  [<c08583b0>] (submit_bio) from [<c06496b4>] (submit_bh_wbc+0x188/0x1b8)
  [<c06496b4>] (submit_bh_wbc) from [<c064ce98>] (__block_write_full_page+0x340/0x5e4)
  [<c064ce98>] (__block_write_full_page) from [<c064d3ec>] (block_write_full_page+0x128/0x170)
  [<c064d3ec>] (block_write_full_page) from [<c0591ae8>] (__writepage+0x14/0x68)
  [<c0591ae8>] (__writepage) from [<c0593efc>] (write_cache_pages+0x1bc/0x494)
  [<c0593efc>] (write_cache_pages) from [<c059422c>] (generic_writepages+0x58/0x8c)
  [<c059422c>] (generic_writepages) from [<c0594c24>] (do_writepages+0x48/0xec)
  [<c0594c24>] (do_writepages) from [<c0589330>] (__filemap_fdatawrite_range+0xf0/0x128)
  [<c0589330>] (__filemap_fdatawrite_range) from [<c05894bc>] (file_write_and_wait_range+0x48/0x98)
  [<c05894bc>] (file_write_and_wait_range) from [<c064f3f8>] (blkdev_fsync+0x1c/0x44)
  [<c064f3f8>] (blkdev_fsync) from [<c064408c>] (do_fsync+0x3c/0x70)
  [<c064408c>] (do_fsync) from [<c0400374>] (__sys_trace_return+0x0/0x2c)
  Exception stack(0xe376bfa8 to 0xe376bff0)
  bfa0:                   0003d2e0 b6f7b6f0 00000003 00046e40 00001000 00000000
  bfc0: 0003d2e0 b6f7b6f0 00000000 00000076 00000000 00000000 befcbb20 befcbb28
  bfe0: b6f4e060 befcbad8 b6f23e0c b6dc4a80
  Code: e5927000 e0050391 e0871005 e5918018 (e5983000)

Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
Signed-off-by: Stefan Agner <stefan@agner.ch>
---
 mm/zsmalloc.c | 1 +
 1 file changed, 1 insertion(+)

Andrew Morton Nov. 8, 2020, 12:56 a.m. UTC | #1

On Sat,  7 Nov 2020 16:22:06 +0100 Stefan Agner <stefan@agner.ch> wrote:

> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
> the MAX_PHYSMEM_BITS define on all architectures.
> 
> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
> more than 4GB of memory:
>   Unable to handle kernel NULL pointer dereference at virtual address 00000000

Mysterious.  Presumably without this include, some compilation unit is
picking up the wrong value of MAX_PHYSMEM_BITS?  But I couldn't
actually see where/how this occurs.  Can you please explain further?

Stefan Agner Nov. 8, 2020, 1:16 a.m. UTC | #2

On 2020-11-08 01:56, Andrew Morton wrote:
> On Sat,  7 Nov 2020 16:22:06 +0100 Stefan Agner <stefan@agner.ch> wrote:
> 
>> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
>> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
>> the MAX_PHYSMEM_BITS define on all architectures.
>>
>> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
>> more than 4GB of memory:
>>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
> 
> Mysterious.  Presumably without this include, some compilation unit is
> picking up the wrong value of MAX_PHYSMEM_BITS?  But I couldn't
> actually see where/how this occurs.  Can you please explain further?

Not sure if I got that right, but from what I understand if
MAX_PHYSMEM_BITS is not set in mm/zsmalloc.c it will set
MAX_PHYSMEM_BITS to BITS_PER_LONG. And this is 32-bit, too short when
LPAE is in use...

--
Stefan

Florian Fainelli Nov. 8, 2020, 4:54 a.m. UTC | #3

On 11/7/2020 5:16 PM, Stefan Agner wrote:
> On 2020-11-08 01:56, Andrew Morton wrote:
>> On Sat,  7 Nov 2020 16:22:06 +0100 Stefan Agner <stefan@agner.ch> wrote:
>>
>>> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
>>> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
>>> the MAX_PHYSMEM_BITS define on all architectures.
>>>
>>> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
>>> more than 4GB of memory:
>>>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
>>
>> Mysterious.  Presumably without this include, some compilation unit is
>> picking up the wrong value of MAX_PHYSMEM_BITS?  But I couldn't
>> actually see where/how this occurs.  Can you please explain further?
> 
> Not sure if I got that right, but from what I understand if
> MAX_PHYSMEM_BITS is not set in mm/zsmalloc.c it will set
> MAX_PHYSMEM_BITS to BITS_PER_LONG. And this is 32-bit, too short when
> LPAE is in use...

You should probably include this in your commit message since that is
describing the bug and then your fix makes complete sense.

Mike Rapoport Nov. 8, 2020, 6:46 a.m. UTC | #4

On Sat, Nov 07, 2020 at 04:22:06PM +0100, Stefan Agner wrote:
> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
> the MAX_PHYSMEM_BITS define on all architectures.
> 
> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
> more than 4GB of memory:
>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
>   pgd = a27bd01c
>   [00000000] *pgd=236a0003, *pmd=1ffa64003
>   Internal error: Oops: 207 [#1] SMP ARM
>   Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
>   CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
>   Hardware name: BCM2711
>   PC is at zs_map_object+0x94/0x338
>   LR is at zram_bvec_rw.constprop.0+0x330/0xa64
>   pc : [<c0602b38>]    lr : [<c0bda6a0>]    psr: 60000013
>   sp : e376bbe0  ip : 00000000  fp : c1e2921c
>   r10: 00000002  r9 : c1dda730  r8 : 00000000
>   r7 : e8ff7a00  r6 : 00000000  r5 : 02f9ffa0  r4 : e3710000
>   r3 : 000fdffe  r2 : c1e0ce80  r1 : ebf979a0  r0 : 00000000
>   Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>   Control: 30c5383d  Table: 235c2a80  DAC: fffffffd
>   Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
>   Stack: (0xe376bbe0 to 0xe376c000)
>   ...
>   [<c0602b38>] (zs_map_object) from [<c0bda6a0>] (zram_bvec_rw.constprop.0+0x330/0xa64)
>   [<c0bda6a0>] (zram_bvec_rw.constprop.0) from [<c0bdaf78>] (zram_submit_bio+0x1a4/0x40c)
>   [<c0bdaf78>] (zram_submit_bio) from [<c085806c>] (submit_bio_noacct+0xd0/0x3c8)
>   [<c085806c>] (submit_bio_noacct) from [<c08583b0>] (submit_bio+0x4c/0x190)
>   [<c08583b0>] (submit_bio) from [<c06496b4>] (submit_bh_wbc+0x188/0x1b8)
>   [<c06496b4>] (submit_bh_wbc) from [<c064ce98>] (__block_write_full_page+0x340/0x5e4)
>   [<c064ce98>] (__block_write_full_page) from [<c064d3ec>] (block_write_full_page+0x128/0x170)
>   [<c064d3ec>] (block_write_full_page) from [<c0591ae8>] (__writepage+0x14/0x68)
>   [<c0591ae8>] (__writepage) from [<c0593efc>] (write_cache_pages+0x1bc/0x494)
>   [<c0593efc>] (write_cache_pages) from [<c059422c>] (generic_writepages+0x58/0x8c)
>   [<c059422c>] (generic_writepages) from [<c0594c24>] (do_writepages+0x48/0xec)
>   [<c0594c24>] (do_writepages) from [<c0589330>] (__filemap_fdatawrite_range+0xf0/0x128)
>   [<c0589330>] (__filemap_fdatawrite_range) from [<c05894bc>] (file_write_and_wait_range+0x48/0x98)
>   [<c05894bc>] (file_write_and_wait_range) from [<c064f3f8>] (blkdev_fsync+0x1c/0x44)
>   [<c064f3f8>] (blkdev_fsync) from [<c064408c>] (do_fsync+0x3c/0x70)
>   [<c064408c>] (do_fsync) from [<c0400374>] (__sys_trace_return+0x0/0x2c)
>   Exception stack(0xe376bfa8 to 0xe376bff0)
>   bfa0:                   0003d2e0 b6f7b6f0 00000003 00046e40 00001000 00000000
>   bfc0: 0003d2e0 b6f7b6f0 00000000 00000076 00000000 00000000 befcbb20 befcbb28
>   bfe0: b6f4e060 befcbad8 b6f23e0c b6dc4a80
>   Code: e5927000 e0050391 e0871005 e5918018 (e5983000)
> 
> Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
> Signed-off-by: Stefan Agner <stefan@agner.ch>
> ---
>  mm/zsmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index c36fdff9a371..260bd48aacd0 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -40,6 +40,7 @@
>  #include <linux/string.h>
>  #include <linux/slab.h>
>  #include <linux/pgtable.h>
> +#include <asm/sparsemem.h>

asm/sparsemem.h is not available on some architectures.
It's better to use linux/mmzone.h instead.

>  #include <asm/tlbflush.h>
>  #include <linux/cpumask.h>
>  #include <linux/cpu.h>
> -- 
> 2.29.1
> 
>

Minchan Kim Nov. 9, 2020, 5:04 p.m. UTC | #5

On Sun, Nov 08, 2020 at 02:16:37AM +0100, Stefan Agner wrote:
> On 2020-11-08 01:56, Andrew Morton wrote:
> > On Sat,  7 Nov 2020 16:22:06 +0100 Stefan Agner <stefan@agner.ch> wrote:
> > 
> >> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
> >> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
> >> the MAX_PHYSMEM_BITS define on all architectures.
> >>
> >> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
> >> more than 4GB of memory:
> >>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > 
> > Mysterious.  Presumably without this include, some compilation unit is
> > picking up the wrong value of MAX_PHYSMEM_BITS?  But I couldn't
> > actually see where/how this occurs.  Can you please explain further?
> 
> Not sure if I got that right, but from what I understand if
> MAX_PHYSMEM_BITS is not set in mm/zsmalloc.c it will set
> MAX_PHYSMEM_BITS to BITS_PER_LONG. And this is 32-bit, too short when
> LPAE is in use...

True. It's headache in the zsmalloc.
Somedays I'd really like to fix it via redesigning of metadata
management.

Thanks for the fixing the ancient bug, Stefan.

Stefan Agner Nov. 10, 2020, 9:29 a.m. UTC | #6

On 2020-11-08 07:46, Mike Rapoport wrote:
> On Sat, Nov 07, 2020 at 04:22:06PM +0100, Stefan Agner wrote:
>> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
>> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
>> the MAX_PHYSMEM_BITS define on all architectures.
>>
>> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
>> more than 4GB of memory:
>>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
>>   pgd = a27bd01c
>>   [00000000] *pgd=236a0003, *pmd=1ffa64003
>>   Internal error: Oops: 207 [#1] SMP ARM
>>   Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
>>   CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
>>   Hardware name: BCM2711
>>   PC is at zs_map_object+0x94/0x338
>>   LR is at zram_bvec_rw.constprop.0+0x330/0xa64
>>   pc : [<c0602b38>]    lr : [<c0bda6a0>]    psr: 60000013
>>   sp : e376bbe0  ip : 00000000  fp : c1e2921c
>>   r10: 00000002  r9 : c1dda730  r8 : 00000000
>>   r7 : e8ff7a00  r6 : 00000000  r5 : 02f9ffa0  r4 : e3710000
>>   r3 : 000fdffe  r2 : c1e0ce80  r1 : ebf979a0  r0 : 00000000
>>   Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>>   Control: 30c5383d  Table: 235c2a80  DAC: fffffffd
>>   Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
>>   Stack: (0xe376bbe0 to 0xe376c000)
>>   ...
>>   [<c0602b38>] (zs_map_object) from [<c0bda6a0>] (zram_bvec_rw.constprop.0+0x330/0xa64)
>>   [<c0bda6a0>] (zram_bvec_rw.constprop.0) from [<c0bdaf78>] (zram_submit_bio+0x1a4/0x40c)
>>   [<c0bdaf78>] (zram_submit_bio) from [<c085806c>] (submit_bio_noacct+0xd0/0x3c8)
>>   [<c085806c>] (submit_bio_noacct) from [<c08583b0>] (submit_bio+0x4c/0x190)
>>   [<c08583b0>] (submit_bio) from [<c06496b4>] (submit_bh_wbc+0x188/0x1b8)
>>   [<c06496b4>] (submit_bh_wbc) from [<c064ce98>] (__block_write_full_page+0x340/0x5e4)
>>   [<c064ce98>] (__block_write_full_page) from [<c064d3ec>] (block_write_full_page+0x128/0x170)
>>   [<c064d3ec>] (block_write_full_page) from [<c0591ae8>] (__writepage+0x14/0x68)
>>   [<c0591ae8>] (__writepage) from [<c0593efc>] (write_cache_pages+0x1bc/0x494)
>>   [<c0593efc>] (write_cache_pages) from [<c059422c>] (generic_writepages+0x58/0x8c)
>>   [<c059422c>] (generic_writepages) from [<c0594c24>] (do_writepages+0x48/0xec)
>>   [<c0594c24>] (do_writepages) from [<c0589330>] (__filemap_fdatawrite_range+0xf0/0x128)
>>   [<c0589330>] (__filemap_fdatawrite_range) from [<c05894bc>] (file_write_and_wait_range+0x48/0x98)
>>   [<c05894bc>] (file_write_and_wait_range) from [<c064f3f8>] (blkdev_fsync+0x1c/0x44)
>>   [<c064f3f8>] (blkdev_fsync) from [<c064408c>] (do_fsync+0x3c/0x70)
>>   [<c064408c>] (do_fsync) from [<c0400374>] (__sys_trace_return+0x0/0x2c)
>>   Exception stack(0xe376bfa8 to 0xe376bff0)
>>   bfa0:                   0003d2e0 b6f7b6f0 00000003 00046e40 00001000 00000000
>>   bfc0: 0003d2e0 b6f7b6f0 00000000 00000076 00000000 00000000 befcbb20 befcbb28
>>   bfe0: b6f4e060 befcbad8 b6f23e0c b6dc4a80
>>   Code: e5927000 e0050391 e0871005 e5918018 (e5983000)
>>
>> Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
>> Signed-off-by: Stefan Agner <stefan@agner.ch>
>> ---
>>  mm/zsmalloc.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> index c36fdff9a371..260bd48aacd0 100644
>> --- a/mm/zsmalloc.c
>> +++ b/mm/zsmalloc.c
>> @@ -40,6 +40,7 @@
>>  #include <linux/string.h>
>>  #include <linux/slab.h>
>>  #include <linux/pgtable.h>
>> +#include <asm/sparsemem.h>
> 
> asm/sparsemem.h is not available on some architectures.
> It's better to use linux/mmzone.h instead.
> 

Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
is enabled. However, on ARM at least I can have configurations without
CONFIG_SPARSEMEM and physical address extension on (e.g.
multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).

While sparsemem seems to be a good idea with LPAE it really seems not
required (see also https://lore.kernel.org/patchwork/patch/567589/).

There seem to be also other architectures which define MAX_PHYSMEM_BITS
only when SPARSEMEM is enabled, e.g.
arch/riscv/include/asm/sparsemem.h...

Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
to a compile time define...

--
Stefan

>>  #include <asm/tlbflush.h>
>>  #include <linux/cpumask.h>
>>  #include <linux/cpu.h>
>> --
>> 2.29.1
>>
>>

Mike Rapoport Nov. 10, 2020, 9:58 a.m. UTC | #7

On Tue, Nov 10, 2020 at 10:29:54AM +0100, Stefan Agner wrote:
> On 2020-11-08 07:46, Mike Rapoport wrote:
> > On Sat, Nov 07, 2020 at 04:22:06PM +0100, Stefan Agner wrote:
> >> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
> >> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
> >> the MAX_PHYSMEM_BITS define on all architectures.

...

> >> Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
> >> Signed-off-by: Stefan Agner <stefan@agner.ch>
> >> ---
> >>  mm/zsmalloc.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >> index c36fdff9a371..260bd48aacd0 100644
> >> --- a/mm/zsmalloc.c
> >> +++ b/mm/zsmalloc.c
> >> @@ -40,6 +40,7 @@
> >>  #include <linux/string.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/pgtable.h>
> >> +#include <asm/sparsemem.h>
> > 
> > asm/sparsemem.h is not available on some architectures.
> > It's better to use linux/mmzone.h instead.
> > 
> 
> Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> is enabled. However, on ARM at least I can have configurations without
> CONFIG_SPARSEMEM and physical address extension on (e.g.
> multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> 
> While sparsemem seems to be a good idea with LPAE it really seems not
> required (see also https://lore.kernel.org/patchwork/patch/567589/).
> 
> There seem to be also other architectures which define MAX_PHYSMEM_BITS
> only when SPARSEMEM is enabled, e.g.
> arch/riscv/include/asm/sparsemem.h...
> 
> Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> to a compile time define...

I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of 
arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
by !LPAE and LPAE.

That's what x86 does: 

$ git grep -w MAX_POSSIBLE_PHYSMEM_BITS arch/
arch/x86/include/asm/pgtable-3level_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS   36
arch/x86/include/asm/pgtable_64_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS       52

It seems that actual numbers would be 36 for !LPAE and 40 for LPAE, but
I'm not sure about that.

> --
> Stefan
> 
> >>  #include <asm/tlbflush.h>
> >>  #include <linux/cpumask.h>
> >>  #include <linux/cpu.h>
> >> --
> >> 2.29.1
> >>
> >>

Arnd Bergmann Nov. 10, 2020, 11:21 a.m. UTC | #8

On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > asm/sparsemem.h is not available on some architectures.
> > > It's better to use linux/mmzone.h instead.

Ah, I missed that, too.

> > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> > is enabled. However, on ARM at least I can have configurations without
> > CONFIG_SPARSEMEM and physical address extension on (e.g.
> > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> >
> > While sparsemem seems to be a good idea with LPAE it really seems not
> > required (see also https://lore.kernel.org/patchwork/patch/567589/).
> >
> > There seem to be also other architectures which define MAX_PHYSMEM_BITS
> > only when SPARSEMEM is enabled, e.g.
> > arch/riscv/include/asm/sparsemem.h...
> >
> > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> > to a compile time define...
>
> I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
> arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
> by !LPAE and LPAE.

Good idea. I wonder what other architectures need the same though.
Here are some I found:

$ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
arch/arc/Kconfig
arch/arm/mm/Kconfig
arch/mips/Kconfig
arch/powerpc/platforms/Kconfig.cputype
arch/x86/Kconfig

arch/arc has a CONFIG_ARC_HAS_PAE40 option
arch/riscv has 34-bit addressing in rv32 mode
arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
    supports that

arch/powerpc has this:
config PHYS_64BIT
        bool 'Large physical address support' if E500 || PPC_86xx
        depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx

Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
addressing, but each one has a different page table format.

Microblaze has physical address extensions, but neither those nor
64-bit mode have so far made it into the kernel.

To be on the safe side, we could provoke a compile-time error
when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.

> That's what x86 does:
>
> $ git grep -w MAX_POSSIBLE_PHYSMEM_BITS arch/
> arch/x86/include/asm/pgtable-3level_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS   36

Doesn't x86 also support a 40-bit addressing mode? I suppose
those machines that actually used it are long gone.

> arch/x86/include/asm/pgtable_64_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS       52
>
> It seems that actual numbers would be 36 for !LPAE and 40 for LPAE, but
> I'm not sure about that.

Close enough, yes.

The 36-bit addressing is on !LPAE is only used for early static mappings,
so I think we can pretend it's always 32-bit. I checked the ARMv8 reference,
and it says that ARMv8-Aarch32 actually supports 40 bit physical addressing
both with non-LPAE superpages (short descriptor format) and LPAE (long
descriptor format), but Linux only does 36-bit addressing on superpages
as specified for ARMv6/ARMv7 short descriptors.

       Arnd

Stefan Agner Nov. 10, 2020, 12:22 p.m. UTC | #9

[adding Russell King for ARM]

On 2020-11-10 12:21, Arnd Bergmann wrote:
> On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
>> > >
>> > > asm/sparsemem.h is not available on some architectures.
>> > > It's better to use linux/mmzone.h instead.
> 
> Ah, I missed that, too.
> 
>> > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
>> > is enabled. However, on ARM at least I can have configurations without
>> > CONFIG_SPARSEMEM and physical address extension on (e.g.
>> > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
>> >
>> > While sparsemem seems to be a good idea with LPAE it really seems not
>> > required (see also https://lore.kernel.org/patchwork/patch/567589/).
>> >
>> > There seem to be also other architectures which define MAX_PHYSMEM_BITS
>> > only when SPARSEMEM is enabled, e.g.
>> > arch/riscv/include/asm/sparsemem.h...
>> >
>> > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
>> > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
>> > to a compile time define...
>>
>> I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
>> arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
>> by !LPAE and LPAE.

Hm I see mm/zsmalloc.c really only needs to know how many bits are
potentially used to calculate how many bits it can use for object
indexing.

> 
> Good idea. I wonder what other architectures need the same though.
> Here are some I found:
> 
> $ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
> arch/arc/Kconfig
> arch/arm/mm/Kconfig
> arch/mips/Kconfig
> arch/powerpc/platforms/Kconfig.cputype
> arch/x86/Kconfig
> 
> arch/arc has a CONFIG_ARC_HAS_PAE40 option
> arch/riscv has 34-bit addressing in rv32 mode
> arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
>     supports that
> 
> arch/powerpc has this:
> config PHYS_64BIT
>         bool 'Large physical address support' if E500 || PPC_86xx
>         depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx
> 
> Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
> addressing, but each one has a different page table format.
> 
> Microblaze has physical address extensions, but neither those nor
> 64-bit mode have so far made it into the kernel.
> 
> To be on the safe side, we could provoke a compile-time error
> when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> 
>> That's what x86 does:
>>
>> $ git grep -w MAX_POSSIBLE_PHYSMEM_BITS arch/
>> arch/x86/include/asm/pgtable-3level_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS   36
> 
> Doesn't x86 also support a 40-bit addressing mode? I suppose
> those machines that actually used it are long gone.
> 
>> arch/x86/include/asm/pgtable_64_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS       52
>>
>> It seems that actual numbers would be 36 for !LPAE and 40 for LPAE, but
>> I'm not sure about that.
> 
> Close enough, yes.
> 
> The 36-bit addressing is on !LPAE is only used for early static mappings,
> so I think we can pretend it's always 32-bit. I checked the ARMv8 reference,
> and it says that ARMv8-Aarch32 actually supports 40 bit physical addressing
> both with non-LPAE superpages (short descriptor format) and LPAE (long
> descriptor format), but Linux only does 36-bit addressing on superpages
> as specified for ARMv6/ARMv7 short descriptors.

Oh so, more than 4GB of memory can be supported by !LPAE systems via
superpages? Wasn't aware of that.

Since only ARM_LPAE selects CONFIG_PHYS_ADDR_T_64BIT it really is safe
to assume 32 bits for non-LPAE systems.

I guess that would mean adding a #define MAX_POSSIBLE_PHYSMEM_BITS 32 to
arch/arm/include/asm/pgtable-2level.h and a MAX_POSSIBLE_PHYSMEM_BITS 40
in arch/arm/include/asm/pgtable-3level.h. Seems straight forward and
would solve the problem I had. I can prepare a patch for ARM, not sure
about the other architectures...

--
Stefan

Arnd Bergmann Nov. 10, 2020, 3:19 p.m. UTC | #10

On Tue, Nov 10, 2020 at 1:24 PM Stefan Agner <stefan@agner.ch> wrote:
> On 2020-11-10 12:21, Arnd Bergmann wrote:
> > Good idea. I wonder what other architectures need the same though.
>
> >> That's what x86 does:
> >>
> >> $ git grep -w MAX_POSSIBLE_PHYSMEM_BITS arch/
> >> arch/x86/include/asm/pgtable-3level_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS   36
> >
> > Doesn't x86 also support a 40-bit addressing mode? I suppose
> > those machines that actually used it are long gone.
> >
> >> arch/x86/include/asm/pgtable_64_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS       52
> >>
> >> It seems that actual numbers would be 36 for !LPAE and 40 for LPAE, but
> >> I'm not sure about that.
> >
> > Close enough, yes.
> >
> > The 36-bit addressing is on !LPAE is only used for early static mappings,
> > so I think we can pretend it's always 32-bit. I checked the ARMv8 reference,
> > and it says that ARMv8-Aarch32 actually supports 40 bit physical addressing
> > both with non-LPAE superpages (short descriptor format) and LPAE (long
> > descriptor format), but Linux only does 36-bit addressing on superpages
> > as specified for ARMv6/ARMv7 short descriptors.
>
> Oh so, more than 4GB of memory can be supported by !LPAE systems via
> superpages? Wasn't aware of that.

Not really, we only really use it for MMIO mappings, and only on the
Xscale3 CPU. Support for this was originally added for IXP23xx,
and the same core is present in IOP13xx and PXA3xx, but only the
last one of these is still supported in mainline, and I don't know if it
actually has anything outside of the 32-bit address space
(arch/arm/mach-pxa/include/mach/addr-map.h suggests it does not)

Russell might remember more details here, and if there is a reason
to keep IO_36 support working after we removed IXP23xx back
in 2012 and never enabled it for ARMv6 or v7. I suppose removing
it would only change a few lines in asm/domain.h, as we'd still
want to keep supersection support regardless.

> Since only ARM_LPAE selects CONFIG_PHYS_ADDR_T_64BIT it really is safe
> to assume 32 bits for non-LPAE systems.
>
> I guess that would mean adding a #define MAX_POSSIBLE_PHYSMEM_BITS 32 to
> arch/arm/include/asm/pgtable-2level.h and a MAX_POSSIBLE_PHYSMEM_BITS 40
> in arch/arm/include/asm/pgtable-3level.h. Seems straight forward and
> would solve the problem I had. I can prepare a patch for ARM, not sure
> about the other architectures...

I think I can help with that once we have agreed on a patch for ARM.

     Arnd

Mike Rapoport Nov. 10, 2020, 4:21 p.m. UTC | #11

On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > asm/sparsemem.h is not available on some architectures.
> > > > It's better to use linux/mmzone.h instead.
> 
> Ah, I missed that, too.
> 
> > > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> > > is enabled. However, on ARM at least I can have configurations without
> > > CONFIG_SPARSEMEM and physical address extension on (e.g.
> > > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> > >
> > > While sparsemem seems to be a good idea with LPAE it really seems not
> > > required (see also https://lore.kernel.org/patchwork/patch/567589/).
> > >
> > > There seem to be also other architectures which define MAX_PHYSMEM_BITS
> > > only when SPARSEMEM is enabled, e.g.
> > > arch/riscv/include/asm/sparsemem.h...
> > >
> > > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> > > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> > > to a compile time define...
> >
> > I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
> > arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
> > by !LPAE and LPAE.
> 
> Good idea. I wonder what other architectures need the same though.
> Here are some I found:
> 
> $ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
> arch/arc/Kconfig
> arch/arm/mm/Kconfig
> arch/mips/Kconfig
> arch/powerpc/platforms/Kconfig.cputype
> arch/x86/Kconfig
> 
> arch/arc has a CONFIG_ARC_HAS_PAE40 option
> arch/riscv has 34-bit addressing in rv32 mode
> arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
>     supports that
> 
> arch/powerpc has this:
> config PHYS_64BIT
>         bool 'Large physical address support' if E500 || PPC_86xx
>         depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx
> 
> Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
> addressing, but each one has a different page table format.
> 
> Microblaze has physical address extensions, but neither those nor
> 64-bit mode have so far made it into the kernel.
> 
> To be on the safe side, we could provoke a compile-time error
> when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.

Maybe compile time warning and a runtime error in zs_init() if 32 bit
machine has memory above 4G?

> > That's what x86 does:
> >
> > $ git grep -w MAX_POSSIBLE_PHYSMEM_BITS arch/
> > arch/x86/include/asm/pgtable-3level_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS   36
> 
> Doesn't x86 also support a 40-bit addressing mode? I suppose
> those machines that actually used it are long gone.
> 
> > arch/x86/include/asm/pgtable_64_types.h:#define MAX_POSSIBLE_PHYSMEM_BITS       52
> >
> > It seems that actual numbers would be 36 for !LPAE and 40 for LPAE, but
> > I'm not sure about that.
> 
> Close enough, yes.
> 
> The 36-bit addressing is on !LPAE is only used for early static mappings,
> so I think we can pretend it's always 32-bit. I checked the ARMv8 reference,
> and it says that ARMv8-Aarch32 actually supports 40 bit physical addressing
> both with non-LPAE superpages (short descriptor format) and LPAE (long
> descriptor format), but Linux only does 36-bit addressing on superpages
> as specified for ARMv6/ARMv7 short descriptors.
> 
>        Arnd

Minchan Kim Nov. 10, 2020, 11:36 p.m. UTC | #12

On Tue, Nov 10, 2020 at 06:21:55PM +0200, Mike Rapoport wrote:
> On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > >
> > > > > asm/sparsemem.h is not available on some architectures.
> > > > > It's better to use linux/mmzone.h instead.
> > 
> > Ah, I missed that, too.
> > 
> > > > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> > > > is enabled. However, on ARM at least I can have configurations without
> > > > CONFIG_SPARSEMEM and physical address extension on (e.g.
> > > > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> > > >
> > > > While sparsemem seems to be a good idea with LPAE it really seems not
> > > > required (see also https://lore.kernel.org/patchwork/patch/567589/).
> > > >
> > > > There seem to be also other architectures which define MAX_PHYSMEM_BITS
> > > > only when SPARSEMEM is enabled, e.g.
> > > > arch/riscv/include/asm/sparsemem.h...
> > > >
> > > > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> > > > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> > > > to a compile time define...
> > >
> > > I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
> > > arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
> > > by !LPAE and LPAE.
> > 
> > Good idea. I wonder what other architectures need the same though.
> > Here are some I found:
> > 
> > $ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
> > arch/arc/Kconfig
> > arch/arm/mm/Kconfig
> > arch/mips/Kconfig
> > arch/powerpc/platforms/Kconfig.cputype
> > arch/x86/Kconfig
> > 
> > arch/arc has a CONFIG_ARC_HAS_PAE40 option
> > arch/riscv has 34-bit addressing in rv32 mode
> > arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
> >     supports that
> > 
> > arch/powerpc has this:
> > config PHYS_64BIT
> >         bool 'Large physical address support' if E500 || PPC_86xx
> >         depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx
> > 
> > Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
> > addressing, but each one has a different page table format.
> > 
> > Microblaze has physical address extensions, but neither those nor
> > 64-bit mode have so far made it into the kernel.
> > 
> > To be on the safe side, we could provoke a compile-time error
> > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> 
> Maybe compile time warning and a runtime error in zs_init() if 32 bit
> machine has memory above 4G?

I guess max_pfn will represent maximum pfn configued in the system
and will not be changed in the runtime. If it's true, how about this?
(didn't test at all but just for RFC)

From 2f51a743a08d10c787a36912515a91826693e308 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 10 Nov 2020 15:17:12 -0800
Subject: [PATCH] zsmalloc: do not support if system has too big memory

zsmalloc encode object location into unsigned long so if the system
has too big memory to encode the object location into BITS_PER_LONG,
it should return the error. Otherwise, system will be crash.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 82 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 37 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 662ee420706f..13552f412ca1 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -87,19 +87,8 @@
  * This is made more complicated by various memory models and PAE.
  */
 
-#ifndef MAX_POSSIBLE_PHYSMEM_BITS
-#ifdef MAX_PHYSMEM_BITS
-#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
-#else
-/*
- * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
- * be PAGE_SHIFT
- */
-#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
-#endif
-#endif
-
-#define _PFN_BITS		(MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
+static unsigned int pfn_bits;
+static unsigned int obj_index_bits;
 
 /*
  * Memory for allocating for handle keeps object position by
@@ -119,18 +108,20 @@
  */
 #define OBJ_ALLOCATED_TAG 1
 #define OBJ_TAG_BITS 1
-#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
-#define OBJ_INDEX_MASK	((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
+static unsigned int obj_index_mask;
 
 #define FULLNESS_BITS	2
 #define CLASS_BITS	8
 #define ISOLATED_BITS	3
 #define MAGIC_VAL_BITS	8
 
-#define MAX(a, b) ((a) >= (b) ? (a) : (b))
-/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
-#define ZS_MIN_ALLOC_SIZE \
-	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
+/* zs_min_alloc_size must be multiple of ZS_ALIGN */
+int zs_min_alloc_size(void)
+{
+	return max_t(unsigned long, 32, ZS_MAX_PAGES_PER_ZSPAGE <<
+			PAGE_SHIFT >> obj_index_bits);
+}
+
 /* each chunk includes extra space to keep handle */
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
 
@@ -144,12 +135,11 @@
  *    determined). NOTE: all those class sizes must be set as multiple of
  *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
  *
- *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
+ *  zs_min_alloc_size and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
 #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
-#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
-				      ZS_SIZE_CLASS_DELTA) + 1)
+static unsigned int zs_size_classes;
 
 enum fullness_group {
 	ZS_EMPTY,
@@ -253,7 +243,7 @@ struct link_free {
 struct zs_pool {
 	const char *name;
 
-	struct size_class *size_class[ZS_SIZE_CLASSES];
+	struct size_class **size_class;
 	struct kmem_cache *handle_cachep;
 	struct kmem_cache *zspage_cachep;
 
@@ -541,11 +531,11 @@ static int get_size_class_index(int size)
 {
 	int idx = 0;
 
-	if (likely(size > ZS_MIN_ALLOC_SIZE))
-		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+	if (likely(size > zs_min_alloc_size()))
+		idx = DIV_ROUND_UP(size - zs_min_alloc_size(),
 				ZS_SIZE_CLASS_DELTA);
 
-	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	return min_t(int, zs_size_classes - 1, idx);
 }
 
 /* type can be of enum type zs_stat_type or fullness_group */
@@ -605,7 +595,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 			"obj_allocated", "obj_used", "pages_used",
 			"pages_per_zspage", "freeable");
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < zs_size_classes; i++) {
 		class = pool->size_class[i];
 
 		if (class->index != i)
@@ -845,8 +835,8 @@ static void obj_to_location(unsigned long obj, struct page **page,
 				unsigned int *obj_idx)
 {
 	obj >>= OBJ_TAG_BITS;
-	*page = pfn_to_page(obj >> OBJ_INDEX_BITS);
-	*obj_idx = (obj & OBJ_INDEX_MASK);
+	*page = pfn_to_page(obj >> obj_index_bits);
+	*obj_idx = (obj & obj_index_mask);
 }
 
 /**
@@ -858,8 +848,8 @@ static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
 {
 	unsigned long obj;
 
-	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
-	obj |= obj_idx & OBJ_INDEX_MASK;
+	obj = page_to_pfn(page) << obj_index_bits;
+	obj |= obj_idx & obj_index_mask;
 	obj <<= OBJ_TAG_BITS;
 
 	return obj;
@@ -2202,7 +2192,7 @@ static void async_free_zspage(struct work_struct *work)
 	struct zs_pool *pool = container_of(work, struct zs_pool,
 					free_work);
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < zs_size_classes; i++) {
 		class = pool->size_class[i];
 		if (class->index != i)
 			continue;
@@ -2320,7 +2310,7 @@ unsigned long zs_compact(struct zs_pool *pool)
 	int i;
 	struct size_class *class;
 
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = zs_size_classes - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (!class)
 			continue;
@@ -2366,7 +2356,7 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
 	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
 			shrinker);
 
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = zs_size_classes - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (!class)
 			continue;
@@ -2427,18 +2417,20 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (create_cache(pool))
 		goto err;
 
+	pool->size_class = kmalloc(sizeof(void *) * zs_size_classes, GFP_KERNEL);
+
 	/*
 	 * Iterate reversely, because, size of size_class that we want to use
 	 * for merging should be larger or equal to current size.
 	 */
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = zs_size_classes - 1; i >= 0; i--) {
 		int size;
 		int pages_per_zspage;
 		int objs_per_zspage;
 		struct size_class *class;
 		int fullness = 0;
 
-		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
+		size = zs_min_alloc_size() + i * ZS_SIZE_CLASS_DELTA;
 		if (size > ZS_MAX_ALLOC_SIZE)
 			size = ZS_MAX_ALLOC_SIZE;
 		pages_per_zspage = get_pages_per_zspage(size);
@@ -2528,7 +2520,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 	zs_unregister_migration(pool);
 	zs_pool_stat_destroy(pool);
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < zs_size_classes; i++) {
 		int fg;
 		struct size_class *class = pool->size_class[i];
 
@@ -2547,6 +2539,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 		kfree(class);
 	}
 
+	kfree(pool->size_class);
 	destroy_cache(pool);
 	kfree(pool->name);
 	kfree(pool);
@@ -2557,6 +2550,21 @@ static int __init zs_init(void)
 {
 	int ret;
 
+	/*
+	 * If BITS_PER_LONG is not big enough to encode metadata(e.g., object
+	 * location and some tags), it could't work at this moment.
+	 */
+	pfn_bits = ilog2(max_pfn) + 1;
+	if (BITS_PER_LONG - OBJ_TAG_BITS - pfn_bits < 0) {
+		pr_warn("max_pfn %lu is too large for zsmalloc\n",
+				max_pfn);
+		return -EFBIG;
+	}
+
+	obj_index_bits = BITS_PER_LONG - OBJ_TAG_BITS - pfn_bits;
+	obj_index_mask = (1 << obj_index_bits) - 1;
+	zs_size_classes = (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - zs_min_alloc_size(),  ZS_SIZE_CLASS_DELTA) + 1);
+
 	ret = zsmalloc_mount();
 	if (ret)
 		goto out;

Mike Rapoport Nov. 11, 2020, 6:52 a.m. UTC | #13

Hi,

On Tue, Nov 10, 2020 at 03:36:20PM -0800, Minchan Kim wrote:
> On Tue, Nov 10, 2020 at 06:21:55PM +0200, Mike Rapoport wrote:
> > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > >
> > > > > > asm/sparsemem.h is not available on some architectures.
> > > > > > It's better to use linux/mmzone.h instead.
> > > 
> > > Ah, I missed that, too.
> > > 
> > > > > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> > > > > is enabled. However, on ARM at least I can have configurations without
> > > > > CONFIG_SPARSEMEM and physical address extension on (e.g.
> > > > > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> > > > >
> > > > > While sparsemem seems to be a good idea with LPAE it really seems not
> > > > > required (see also https://lore.kernel.org/patchwork/patch/567589/).
> > > > >
> > > > > There seem to be also other architectures which define MAX_PHYSMEM_BITS
> > > > > only when SPARSEMEM is enabled, e.g.
> > > > > arch/riscv/include/asm/sparsemem.h...
> > > > >
> > > > > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> > > > > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> > > > > to a compile time define...
> > > >
> > > > I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
> > > > arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
> > > > by !LPAE and LPAE.
> > > 
> > > Good idea. I wonder what other architectures need the same though.
> > > Here are some I found:
> > > 
> > > $ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
> > > arch/arc/Kconfig
> > > arch/arm/mm/Kconfig
> > > arch/mips/Kconfig
> > > arch/powerpc/platforms/Kconfig.cputype
> > > arch/x86/Kconfig
> > > 
> > > arch/arc has a CONFIG_ARC_HAS_PAE40 option
> > > arch/riscv has 34-bit addressing in rv32 mode
> > > arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
> > >     supports that
> > > 
> > > arch/powerpc has this:
> > > config PHYS_64BIT
> > >         bool 'Large physical address support' if E500 || PPC_86xx
> > >         depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx
> > > 
> > > Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
> > > addressing, but each one has a different page table format.
> > > 
> > > Microblaze has physical address extensions, but neither those nor
> > > 64-bit mode have so far made it into the kernel.
> > > 
> > > To be on the safe side, we could provoke a compile-time error
> > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > 
> > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > machine has memory above 4G?
> 
> I guess max_pfn will represent maximum pfn configued in the system
> and will not be changed in the runtime. If it's true, how about this?
> (didn't test at all but just for RFC)

Largely, max_pfn is the maximal pfn at boot time but I don't think it
can be used on systems with memory hotplug. Unless I'm missing
something, with memory hotplug we must have compile-time maximum.

> From 2f51a743a08d10c787a36912515a91826693e308 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 10 Nov 2020 15:17:12 -0800
> Subject: [PATCH] zsmalloc: do not support if system has too big memory
> 
> zsmalloc encode object location into unsigned long so if the system
> has too big memory to encode the object location into BITS_PER_LONG,
> it should return the error. Otherwise, system will be crash.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/zsmalloc.c | 82 ++++++++++++++++++++++++++++-----------------------
>  1 file changed, 45 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 662ee420706f..13552f412ca1 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -87,19 +87,8 @@
>   * This is made more complicated by various memory models and PAE.
>   */
>  
> -#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> -#ifdef MAX_PHYSMEM_BITS
> -#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> -#else
> -/*
> - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
> - * be PAGE_SHIFT
> - */
> -#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
> -#endif
> -#endif
> -
> -#define _PFN_BITS		(MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
> +static unsigned int pfn_bits;
> +static unsigned int obj_index_bits;
>  
>  /*
>   * Memory for allocating for handle keeps object position by
> @@ -119,18 +108,20 @@
>   */
>  #define OBJ_ALLOCATED_TAG 1
>  #define OBJ_TAG_BITS 1
> -#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
> -#define OBJ_INDEX_MASK	((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
> +static unsigned int obj_index_mask;
>  
>  #define FULLNESS_BITS	2
>  #define CLASS_BITS	8
>  #define ISOLATED_BITS	3
>  #define MAGIC_VAL_BITS	8
>  
> -#define MAX(a, b) ((a) >= (b) ? (a) : (b))
> -/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
> -#define ZS_MIN_ALLOC_SIZE \
> -	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
> +/* zs_min_alloc_size must be multiple of ZS_ALIGN */
> +int zs_min_alloc_size(void)
> +{
> +	return max_t(unsigned long, 32, ZS_MAX_PAGES_PER_ZSPAGE <<
> +			PAGE_SHIFT >> obj_index_bits);
> +}
> +
>  /* each chunk includes extra space to keep handle */
>  #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
>  
> @@ -144,12 +135,11 @@
>   *    determined). NOTE: all those class sizes must be set as multiple of
>   *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
>   *
> - *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
> + *  zs_min_alloc_size and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
>   *  (reason above)
>   */
>  #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
> -#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> -				      ZS_SIZE_CLASS_DELTA) + 1)
> +static unsigned int zs_size_classes;
>  
>  enum fullness_group {
>  	ZS_EMPTY,
> @@ -253,7 +243,7 @@ struct link_free {
>  struct zs_pool {
>  	const char *name;
>  
> -	struct size_class *size_class[ZS_SIZE_CLASSES];
> +	struct size_class **size_class;
>  	struct kmem_cache *handle_cachep;
>  	struct kmem_cache *zspage_cachep;
>  
> @@ -541,11 +531,11 @@ static int get_size_class_index(int size)
>  {
>  	int idx = 0;
>  
> -	if (likely(size > ZS_MIN_ALLOC_SIZE))
> -		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
> +	if (likely(size > zs_min_alloc_size()))
> +		idx = DIV_ROUND_UP(size - zs_min_alloc_size(),
>  				ZS_SIZE_CLASS_DELTA);
>  
> -	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
> +	return min_t(int, zs_size_classes - 1, idx);
>  }
>  
>  /* type can be of enum type zs_stat_type or fullness_group */
> @@ -605,7 +595,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
>  			"obj_allocated", "obj_used", "pages_used",
>  			"pages_per_zspage", "freeable");
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < zs_size_classes; i++) {
>  		class = pool->size_class[i];
>  
>  		if (class->index != i)
> @@ -845,8 +835,8 @@ static void obj_to_location(unsigned long obj, struct page **page,
>  				unsigned int *obj_idx)
>  {
>  	obj >>= OBJ_TAG_BITS;
> -	*page = pfn_to_page(obj >> OBJ_INDEX_BITS);
> -	*obj_idx = (obj & OBJ_INDEX_MASK);
> +	*page = pfn_to_page(obj >> obj_index_bits);
> +	*obj_idx = (obj & obj_index_mask);
>  }
>  
>  /**
> @@ -858,8 +848,8 @@ static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
>  {
>  	unsigned long obj;
>  
> -	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
> -	obj |= obj_idx & OBJ_INDEX_MASK;
> +	obj = page_to_pfn(page) << obj_index_bits;
> +	obj |= obj_idx & obj_index_mask;
>  	obj <<= OBJ_TAG_BITS;
>  
>  	return obj;
> @@ -2202,7 +2192,7 @@ static void async_free_zspage(struct work_struct *work)
>  	struct zs_pool *pool = container_of(work, struct zs_pool,
>  					free_work);
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < zs_size_classes; i++) {
>  		class = pool->size_class[i];
>  		if (class->index != i)
>  			continue;
> @@ -2320,7 +2310,7 @@ unsigned long zs_compact(struct zs_pool *pool)
>  	int i;
>  	struct size_class *class;
>  
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = zs_size_classes - 1; i >= 0; i--) {
>  		class = pool->size_class[i];
>  		if (!class)
>  			continue;
> @@ -2366,7 +2356,7 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
>  	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
>  			shrinker);
>  
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = zs_size_classes - 1; i >= 0; i--) {
>  		class = pool->size_class[i];
>  		if (!class)
>  			continue;
> @@ -2427,18 +2417,20 @@ struct zs_pool *zs_create_pool(const char *name)
>  	if (create_cache(pool))
>  		goto err;
>  
> +	pool->size_class = kmalloc(sizeof(void *) * zs_size_classes, GFP_KERNEL);
> +
>  	/*
>  	 * Iterate reversely, because, size of size_class that we want to use
>  	 * for merging should be larger or equal to current size.
>  	 */
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = zs_size_classes - 1; i >= 0; i--) {
>  		int size;
>  		int pages_per_zspage;
>  		int objs_per_zspage;
>  		struct size_class *class;
>  		int fullness = 0;
>  
> -		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
> +		size = zs_min_alloc_size() + i * ZS_SIZE_CLASS_DELTA;
>  		if (size > ZS_MAX_ALLOC_SIZE)
>  			size = ZS_MAX_ALLOC_SIZE;
>  		pages_per_zspage = get_pages_per_zspage(size);
> @@ -2528,7 +2520,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  	zs_unregister_migration(pool);
>  	zs_pool_stat_destroy(pool);
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < zs_size_classes; i++) {
>  		int fg;
>  		struct size_class *class = pool->size_class[i];
>  
> @@ -2547,6 +2539,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  		kfree(class);
>  	}
>  
> +	kfree(pool->size_class);
>  	destroy_cache(pool);
>  	kfree(pool->name);
>  	kfree(pool);
> @@ -2557,6 +2550,21 @@ static int __init zs_init(void)
>  {
>  	int ret;
>  
> +	/*
> +	 * If BITS_PER_LONG is not big enough to encode metadata(e.g., object
> +	 * location and some tags), it could't work at this moment.
> +	 */
> +	pfn_bits = ilog2(max_pfn) + 1;
> +	if (BITS_PER_LONG - OBJ_TAG_BITS - pfn_bits < 0) {
> +		pr_warn("max_pfn %lu is too large for zsmalloc\n",
> +				max_pfn);
> +		return -EFBIG;
> +	}
> +
> +	obj_index_bits = BITS_PER_LONG - OBJ_TAG_BITS - pfn_bits;
> +	obj_index_mask = (1 << obj_index_bits) - 1;
> +	zs_size_classes = (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - zs_min_alloc_size(),  ZS_SIZE_CLASS_DELTA) + 1);
> +
>  	ret = zsmalloc_mount();
>  	if (ret)
>  		goto out;
> -- 
> 2.29.2.299.gdc1121823c-goog
>

Arnd Bergmann Nov. 11, 2020, 9:33 a.m. UTC | #14

On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> >
> > To be on the safe side, we could provoke a compile-time error
> > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
>
> Maybe compile time warning and a runtime error in zs_init() if 32 bit
> machine has memory above 4G?

If the fix is as easy as adding a single line in a header, I think a
compile-time
error makes it easier, no need to wait for someone to boot a broken
system before fixing it.

       Arnd

Mike Rapoport Nov. 11, 2020, 10:26 a.m. UTC | #15

On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > >
> > > To be on the safe side, we could provoke a compile-time error
> > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> >
> > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > machine has memory above 4G?
> 
> If the fix is as easy as adding a single line in a header, I think a
> compile-time
> error makes it easier, no need to wait for someone to boot a broken
> system before fixing it.

Not sure it would be as easy as adding a single line in a header for
MIPS with it's diversity.

How about adding a runtime detection like Minchan suggested for
!MEMORY_HOTPLUG (e.g. MIPS and ARC) and a compile time error otherwise
(i386, ppc)?

>        Arnd

Arnd Bergmann Nov. 11, 2020, 10:57 a.m. UTC | #16

On Wed, Nov 11, 2020 at 11:26 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> > On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > >
> > > > To be on the safe side, we could provoke a compile-time error
> > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > >
> > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > machine has memory above 4G?
> >
> > If the fix is as easy as adding a single line in a header, I think a
> > compile-time
> > error makes it easier, no need to wait for someone to boot a broken
> > system before fixing it.
>
> Not sure it would be as easy as adding a single line in a header for
> MIPS with it's diversity.

I looked up the architecture, and found:

- The pre-MIPS32r1 cores only support 32-bit addressing
- octeon selects PHYS_ADDR_T_64BIT but no longer
  supports 32-bit kernels
- Alchemy and netlogic (XLR, XLP) have 36-bit addressing
- CONFIG_XPA implies 40-bit addressing

We should run it by the MIPS maintainers, but I think this patch
is sufficient:

--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -154,6 +154,7 @@ static inline void pmd_clear(pmd_t *pmdp)

 #if defined(CONFIG_XPA)

+#define MAX_POSSIBLE_PHYSMEM_BITS 40
 #define pte_pfn(x)             (((unsigned long)((x).pte_high >>
_PFN_SHIFT)) | (unsigned long)((x).pte_low << _PAGE_PRESENT_SHIFT))
 static inline pte_t
 pfn_pte(unsigned long pfn, pgprot_t prot)
@@ -169,6 +170,7 @@ pfn_pte(unsigned long pfn, pgprot_t prot)

 #elif defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)

+#define MAX_POSSIBLE_PHYSMEM_BITS 35
 #define pte_pfn(x)             ((unsigned long)((x).pte_high >> 6))

 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
@@ -183,6 +185,7 @@ static inline pte_t pfn_pte(unsigned long pfn,
pgprot_t prot)

 #else

+#define MAX_POSSIBLE_PHYSMEM_BITS 32
 #ifdef CONFIG_CPU_VR41XX
 #define pte_pfn(x)             ((unsigned long)((x).pte >> (PAGE_SHIFT + 2)))
 #define pfn_pte(pfn, prot)     __pte(((pfn) << (PAGE_SHIFT + 2)) |
pgprot_val(prot))

     Arnd

Mike Rapoport Nov. 11, 2020, 11:04 a.m. UTC | #17

On Wed, Nov 11, 2020 at 11:57:02AM +0100, Arnd Bergmann wrote:
> On Wed, Nov 11, 2020 at 11:26 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> > > On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > > >
> > > > > To be on the safe side, we could provoke a compile-time error
> > > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > > >
> > > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > > machine has memory above 4G?
> > >
> > > If the fix is as easy as adding a single line in a header, I think a
> > > compile-time
> > > error makes it easier, no need to wait for someone to boot a broken
> > > system before fixing it.
> >
> > Not sure it would be as easy as adding a single line in a header for
> > MIPS with it's diversity.
> 
> I looked up the architecture, and found:
> 
> - The pre-MIPS32r1 cores only support 32-bit addressing
> - octeon selects PHYS_ADDR_T_64BIT but no longer
>   supports 32-bit kernels
> - Alchemy and netlogic (XLR, XLP) have 36-bit addressing
> - CONFIG_XPA implies 40-bit addressing

Fair enough. A typo below.

> We should run it by the MIPS maintainers, but I think this patch
> is sufficient:
> 
> --- a/arch/mips/include/asm/pgtable-32.h
> +++ b/arch/mips/include/asm/pgtable-32.h
> @@ -154,6 +154,7 @@ static inline void pmd_clear(pmd_t *pmdp)
> 
>  #if defined(CONFIG_XPA)
> 
> +#define MAX_POSSIBLE_PHYSMEM_BITS 40
>  #define pte_pfn(x)             (((unsigned long)((x).pte_high >>
> _PFN_SHIFT)) | (unsigned long)((x).pte_low << _PAGE_PRESENT_SHIFT))
>  static inline pte_t
>  pfn_pte(unsigned long pfn, pgprot_t prot)
> @@ -169,6 +170,7 @@ pfn_pte(unsigned long pfn, pgprot_t prot)
> 
>  #elif defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
> 
> +#define MAX_POSSIBLE_PHYSMEM_BITS 35

                                    ^ 36 ?

>  #define pte_pfn(x)             ((unsigned long)((x).pte_high >> 6))
> 
>  static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
> @@ -183,6 +185,7 @@ static inline pte_t pfn_pte(unsigned long pfn,
> pgprot_t prot)
> 
>  #else
> 
> +#define MAX_POSSIBLE_PHYSMEM_BITS 32
>  #ifdef CONFIG_CPU_VR41XX
>  #define pte_pfn(x)             ((unsigned long)((x).pte >> (PAGE_SHIFT + 2)))
>  #define pfn_pte(pfn, prot)     __pte(((pfn) << (PAGE_SHIFT + 2)) |
> pgprot_val(prot))
> 
>      Arnd

Thomas Bogendoerfer Nov. 11, 2020, 1:39 p.m. UTC | #18

On Wed, Nov 11, 2020 at 11:57:02AM +0100, Arnd Bergmann wrote:
> On Wed, Nov 11, 2020 at 11:26 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> > > On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > > >
> > > > > To be on the safe side, we could provoke a compile-time error
> > > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > > >
> > > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > > machine has memory above 4G?
> > >
> > > If the fix is as easy as adding a single line in a header, I think a
> > > compile-time
> > > error makes it easier, no need to wait for someone to boot a broken
> > > system before fixing it.
> >
> > Not sure it would be as easy as adding a single line in a header for
> > MIPS with it's diversity.
> 
> I looked up the architecture, and found:
> 
> - The pre-MIPS32r1 cores only support 32-bit addressing
> - octeon selects PHYS_ADDR_T_64BIT but no longer
>   supports 32-bit kernels
> - Alchemy and netlogic (XLR, XLP) have 36-bit addressing
> - CONFIG_XPA implies 40-bit addressing

MIPS32r5 might have up to 60-bit addressing according to the MIPS32 PRA
docuemnt (MD00090). But there is probably no chip, which implements it
so for now 40-bit addressing is correct.

> 
> We should run it by the MIPS maintainers, but I think this patch
> is sufficient:

Do you want me to run it through mips tree or do you need an
Acked-By from me ?

Thomas.

Arnd Bergmann Nov. 11, 2020, 2:33 p.m. UTC | #19

On Wed, Nov 11, 2020 at 2:39 PM Thomas Bogendoerfer
<tsbogend@alpha.franken.de> wrote:
> On Wed, Nov 11, 2020 at 11:57:02AM +0100, Arnd Bergmann wrote:
> > On Wed, Nov 11, 2020 at 11:26 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> > > > On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > > > >
> > > > > > To be on the safe side, we could provoke a compile-time error
> > > > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > > > >
> > > > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > > > machine has memory above 4G?
> > > >
> > > > If the fix is as easy as adding a single line in a header, I think a
> > > > compile-time
> > > > error makes it easier, no need to wait for someone to boot a broken
> > > > system before fixing it.
> > >
> > > Not sure it would be as easy as adding a single line in a header for
> > > MIPS with it's diversity.
> >
> > I looked up the architecture, and found:
> >
> > - The pre-MIPS32r1 cores only support 32-bit addressing
> > - octeon selects PHYS_ADDR_T_64BIT but no longer
> >   supports 32-bit kernels
> > - Alchemy and netlogic (XLR, XLP) have 36-bit addressing
> > - CONFIG_XPA implies 40-bit addressing
>
> MIPS32r5 might have up to 60-bit addressing according to the MIPS32 PRA
> docuemnt (MD00090). But there is probably no chip, which implements it
> so for now 40-bit addressing is correct.
>
> >
> > We should run it by the MIPS maintainers, but I think this patch
> > is sufficient:
>
> Do you want me to run it through mips tree or do you need an
> Acked-By from me ?

I'll follow up with a proper submission for MIPS and the other
architectures, please apply that one.

       Arnd

Minchan Kim Nov. 12, 2020, 7:49 p.m. UTC | #20

Hi,

On Wed, Nov 11, 2020 at 08:52:00AM +0200, Mike Rapoport wrote:
> Hi,
> 
> On Tue, Nov 10, 2020 at 03:36:20PM -0800, Minchan Kim wrote:
> > On Tue, Nov 10, 2020 at 06:21:55PM +0200, Mike Rapoport wrote:
> > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > > On Tue, Nov 10, 2020 at 10:58 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > > >
> > > > > > > asm/sparsemem.h is not available on some architectures.
> > > > > > > It's better to use linux/mmzone.h instead.
> > > > 
> > > > Ah, I missed that, too.
> > > > 
> > > > > > Hm, linux/mmzone.h only includes asm/sparsemem.h when CONFIG_SPARSEMEM
> > > > > > is enabled. However, on ARM at least I can have configurations without
> > > > > > CONFIG_SPARSEMEM and physical address extension on (e.g.
> > > > > > multi_v7_defconfig + CONFIG_LPAE + CONFIG_ZSMALLOC).
> > > > > >
> > > > > > While sparsemem seems to be a good idea with LPAE it really seems not
> > > > > > required (see also https://lore.kernel.org/patchwork/patch/567589/).
> > > > > >
> > > > > > There seem to be also other architectures which define MAX_PHYSMEM_BITS
> > > > > > only when SPARSEMEM is enabled, e.g.
> > > > > > arch/riscv/include/asm/sparsemem.h...
> > > > > >
> > > > > > Not sure how to get out of this.. Maybe make ZSMALLOC dependent on
> > > > > > SPARSEMEM? It feels a bit silly restricting ZSMALLOC selection only due
> > > > > > to a compile time define...
> > > > >
> > > > > I think we can define MAX_POSSIBLE_PHYSMEM_BITS in one of
> > > > > arch/arm/inclide/asm/pgtable-{2,3}level-*.h headers to values supported
> > > > > by !LPAE and LPAE.
> > > > 
> > > > Good idea. I wonder what other architectures need the same though.
> > > > Here are some I found:
> > > > 
> > > > $ git grep -l PHYS_ADDR_T_64BIT arch | grep Kconfig
> > > > arch/arc/Kconfig
> > > > arch/arm/mm/Kconfig
> > > > arch/mips/Kconfig
> > > > arch/powerpc/platforms/Kconfig.cputype
> > > > arch/x86/Kconfig
> > > > 
> > > > arch/arc has a CONFIG_ARC_HAS_PAE40 option
> > > > arch/riscv has 34-bit addressing in rv32 mode
> > > > arch/mips has up to 40 bits with mips32r3 XPA, but I don't know what
> > > >     supports that
> > > > 
> > > > arch/powerpc has this:
> > > > config PHYS_64BIT
> > > >         bool 'Large physical address support' if E500 || PPC_86xx
> > > >         depends on (44x || E500 || PPC_86xx) && !PPC_83xx && !PPC_82xx
> > > > 
> > > > Apparently all three (4xx, e500v2, mpc86xx/e600) do 36-bit physical
> > > > addressing, but each one has a different page table format.
> > > > 
> > > > Microblaze has physical address extensions, but neither those nor
> > > > 64-bit mode have so far made it into the kernel.
> > > > 
> > > > To be on the safe side, we could provoke a compile-time error
> > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > > 
> > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > machine has memory above 4G?
> > 
> > I guess max_pfn will represent maximum pfn configued in the system
> > and will not be changed in the runtime. If it's true, how about this?
> > (didn't test at all but just for RFC)
> 
> Largely, max_pfn is the maximal pfn at boot time but I don't think it
> can be used on systems with memory hotplug. Unless I'm missing
> something, with memory hotplug we must have compile-time maximum.
> 

Make sense. I am looking forward to seeing Arnd's patches at this
moment.

Thanks!

Thanks for the help!

Minchan Kim Nov. 12, 2020, 7:50 p.m. UTC | #21

On Wed, Nov 11, 2020 at 03:33:18PM +0100, Arnd Bergmann wrote:
> On Wed, Nov 11, 2020 at 2:39 PM Thomas Bogendoerfer
> <tsbogend@alpha.franken.de> wrote:
> > On Wed, Nov 11, 2020 at 11:57:02AM +0100, Arnd Bergmann wrote:
> > > On Wed, Nov 11, 2020 at 11:26 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > On Wed, Nov 11, 2020 at 10:33:29AM +0100, Arnd Bergmann wrote:
> > > > > On Tue, Nov 10, 2020 at 5:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > > On Tue, Nov 10, 2020 at 12:21:11PM +0100, Arnd Bergmann wrote:
> > > > > > >
> > > > > > > To be on the safe side, we could provoke a compile-time error
> > > > > > > when CONFIG_PHYS_ADDR_T_64BIT is set on a 32-bit
> > > > > > > architecture, but MAX_POSSIBLE_PHYSMEM_BITS is not set.
> > > > > >
> > > > > > Maybe compile time warning and a runtime error in zs_init() if 32 bit
> > > > > > machine has memory above 4G?
> > > > >
> > > > > If the fix is as easy as adding a single line in a header, I think a
> > > > > compile-time
> > > > > error makes it easier, no need to wait for someone to boot a broken
> > > > > system before fixing it.
> > > >
> > > > Not sure it would be as easy as adding a single line in a header for
> > > > MIPS with it's diversity.
> > >
> > > I looked up the architecture, and found:
> > >
> > > - The pre-MIPS32r1 cores only support 32-bit addressing
> > > - octeon selects PHYS_ADDR_T_64BIT but no longer
> > >   supports 32-bit kernels
> > > - Alchemy and netlogic (XLR, XLP) have 36-bit addressing
> > > - CONFIG_XPA implies 40-bit addressing
> >
> > MIPS32r5 might have up to 60-bit addressing according to the MIPS32 PRA
> > docuemnt (MD00090). But there is probably no chip, which implements it
> > so for now 40-bit addressing is correct.
> >
> > >
> > > We should run it by the MIPS maintainers, but I think this patch
> > > is sufficient:
> >
> > Do you want me to run it through mips tree or do you need an
> > Acked-By from me ?
> 
> I'll follow up with a proper submission for MIPS and the other
> architectures, please apply that one.

Thanks, Arnd!

Florian Fainelli Nov. 12, 2020, 11:46 p.m. UTC | #22

On 11/7/20 7:22 AM, Stefan Agner wrote:
> Most architectures define MAX_PHYSMEM_BITS in asm/sparsemem.h and don't
> include it in asm/pgtable.h. Include asm/sparsemem.h directly to get
> the MAX_PHYSMEM_BITS define on all architectures.
> 
> This fixes a crash when accessing zram on 32-bit ARM platform with LPAE and
> more than 4GB of memory:
>   Unable to handle kernel NULL pointer dereference at virtual address 00000000
>   pgd = a27bd01c
>   [00000000] *pgd=236a0003, *pmd=1ffa64003
>   Internal error: Oops: 207 [#1] SMP ARM
>   Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
>   CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
>   Hardware name: BCM2711
>   PC is at zs_map_object+0x94/0x338
>   LR is at zram_bvec_rw.constprop.0+0x330/0xa64
>   pc : [<c0602b38>]    lr : [<c0bda6a0>]    psr: 60000013
>   sp : e376bbe0  ip : 00000000  fp : c1e2921c
>   r10: 00000002  r9 : c1dda730  r8 : 00000000
>   r7 : e8ff7a00  r6 : 00000000  r5 : 02f9ffa0  r4 : e3710000
>   r3 : 000fdffe  r2 : c1e0ce80  r1 : ebf979a0  r0 : 00000000
>   Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>   Control: 30c5383d  Table: 235c2a80  DAC: fffffffd
>   Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
>   Stack: (0xe376bbe0 to 0xe376c000)
>   ...
>   [<c0602b38>] (zs_map_object) from [<c0bda6a0>] (zram_bvec_rw.constprop.0+0x330/0xa64)
>   [<c0bda6a0>] (zram_bvec_rw.constprop.0) from [<c0bdaf78>] (zram_submit_bio+0x1a4/0x40c)
>   [<c0bdaf78>] (zram_submit_bio) from [<c085806c>] (submit_bio_noacct+0xd0/0x3c8)
>   [<c085806c>] (submit_bio_noacct) from [<c08583b0>] (submit_bio+0x4c/0x190)
>   [<c08583b0>] (submit_bio) from [<c06496b4>] (submit_bh_wbc+0x188/0x1b8)
>   [<c06496b4>] (submit_bh_wbc) from [<c064ce98>] (__block_write_full_page+0x340/0x5e4)
>   [<c064ce98>] (__block_write_full_page) from [<c064d3ec>] (block_write_full_page+0x128/0x170)
>   [<c064d3ec>] (block_write_full_page) from [<c0591ae8>] (__writepage+0x14/0x68)
>   [<c0591ae8>] (__writepage) from [<c0593efc>] (write_cache_pages+0x1bc/0x494)
>   [<c0593efc>] (write_cache_pages) from [<c059422c>] (generic_writepages+0x58/0x8c)
>   [<c059422c>] (generic_writepages) from [<c0594c24>] (do_writepages+0x48/0xec)
>   [<c0594c24>] (do_writepages) from [<c0589330>] (__filemap_fdatawrite_range+0xf0/0x128)
>   [<c0589330>] (__filemap_fdatawrite_range) from [<c05894bc>] (file_write_and_wait_range+0x48/0x98)
>   [<c05894bc>] (file_write_and_wait_range) from [<c064f3f8>] (blkdev_fsync+0x1c/0x44)
>   [<c064f3f8>] (blkdev_fsync) from [<c064408c>] (do_fsync+0x3c/0x70)
>   [<c064408c>] (do_fsync) from [<c0400374>] (__sys_trace_return+0x0/0x2c)
>   Exception stack(0xe376bfa8 to 0xe376bff0)
>   bfa0:                   0003d2e0 b6f7b6f0 00000003 00046e40 00001000 00000000
>   bfc0: 0003d2e0 b6f7b6f0 00000000 00000076 00000000 00000000 befcbb20 befcbb28
>   bfe0: b6f4e060 befcbad8 b6f23e0c b6dc4a80
>   Code: e5927000 e0050391 e0871005 e5918018 (e5983000)
> 
> Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
> Signed-off-by: Stefan Agner <stefan@agner.ch>

BTW, I can confirm on behalf of a colleague here that this patch fixes
zsmalloc on a 32-bit platform with 4GB of DRAM and with a 3GB user/1GB
kernel split. Thanks!

mm/zsmalloc: include sparsemem.h for MAX_PHYSMEM_BITS

Commit Message

Comments

Patch