Revert "mm/vunmap: add cond_resched() in vunmap_pmd_range"

Message ID	20201105170249.387069-1-minchan@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=qThi=EL=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 299952151B From: Minchan Kim <minchan@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org, linux-mm <linux-mm@kvack.org>, Minchan Kim <minchan@kernel.org>, "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>, Harish Sriram <harish@linux.ibm.com>, stable@vger.kernel.org Subject: [PATCH] Revert "mm/vunmap: add cond_resched() in vunmap_pmd_range" Date: Thu, 5 Nov 2020 09:02:49 -0800 Message-Id: <20201105170249.387069-1-minchan@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Revert "mm/vunmap: add cond_resched() in vunmap_pmd_range" \| expand Revert "mm/vunmap: add cond_resched() in vunmap_pmd_range"

Minchan Kim Nov. 5, 2020, 5:02 p.m. UTC

This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.

While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted. With investigation,
I found below commit calls cond_resched unconditionally so it could
make a problem in atomic context if the task is reschedule.

Revert the original commit for now.

[   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108^M
[   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog^M
[   55.111973] 3 locks held by memhog/946:^M
[   55.112807]  #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160^M
[   55.114151]  #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160^M
[   55.115848]  #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0^M
[   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316^M
[   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014^M
[   55.122540] Call Trace:^M
[   55.122974]  dump_stack+0x8b/0xb8^M
[   55.123588]  ___might_sleep.cold+0xb6/0xc6^M
[   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350^M
[   55.125198]  unmap_kernel_range+0x14/0x30^M
[   55.125920]  zs_unmap_object+0xd5/0xe0^M
[   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0^M
[   55.127462]  zram_rw_page+0x90/0x101^M
[   55.128199]  bdev_write_page+0x92/0xe0^M
[   55.128957]  ? swap_slot_free_notify+0xb0/0xb0^M
[   55.129841]  __swap_writepage+0x94/0x4a0^M
[   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0^M
[   55.131462]  ? _raw_spin_unlock+0x1f/0x30^M
[   55.132261]  ? page_swapcount+0x6c/0x90^M
[   55.133038]  pageout+0xe3/0x3a0^M
[   55.133702]  shrink_page_list+0xb94/0xd60^M
[   55.134626]  shrink_inactive_list+0x158/0x460^M

Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Harish Sriram <harish@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmalloc.c | 2 --
 1 file changed, 2 deletions(-)

Matthew Wilcox Nov. 5, 2020, 5:16 p.m. UTC | #1

On Thu, Nov 05, 2020 at 09:02:49AM -0800, Minchan Kim wrote:
> This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> 
> While I was doing zram testing, I found sometimes decompression failed
> since the compression buffer was corrupted. With investigation,
> I found below commit calls cond_resched unconditionally so it could
> make a problem in atomic context if the task is reschedule.

I don't think you're supposed to call unmap_kernel_range() from
atomic context.  At least vfree() punts to __vfree_deferred() if
in_interrupt() is true.  I forget the original reason for why that is.

Minchan Kim Nov. 5, 2020, 5:33 p.m. UTC | #2

On Thu, Nov 05, 2020 at 05:16:02PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 05, 2020 at 09:02:49AM -0800, Minchan Kim wrote:
> > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > 
> > While I was doing zram testing, I found sometimes decompression failed
> > since the compression buffer was corrupted. With investigation,
> > I found below commit calls cond_resched unconditionally so it could
> > make a problem in atomic context if the task is reschedule.
> 
> I don't think you're supposed to call unmap_kernel_range() from
> atomic context.  At least vfree() punts to __vfree_deferred() if
> in_interrupt() is true.  I forget the original reason for why that is.

It would be desirable, However, the logic was there for several years
and made regression from the commit in stable kernel for now.

Can't we have graceful shutdown if we want to deprecate the API usecase
in atomic context?

Andrew Morton Nov. 7, 2020, 1:59 a.m. UTC | #3

On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:

> This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> 
> While I was doing zram testing, I found sometimes decompression failed
> since the compression buffer was corrupted. With investigation,
> I found below commit calls cond_resched unconditionally so it could
> make a problem in atomic context if the task is reschedule.
> 
> Revert the original commit for now.
> 
> [   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
> [   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
> [   55.111973] 3 locks held by memhog/946:
> [   55.112807]  #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
> [   55.114151]  #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
> [   55.115848]  #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
> [   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
> [   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [   55.122540] Call Trace:
> [   55.122974]  dump_stack+0x8b/0xb8
> [   55.123588]  ___might_sleep.cold+0xb6/0xc6
> [   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
> [   55.125198]  unmap_kernel_range+0x14/0x30
> [   55.125920]  zs_unmap_object+0xd5/0xe0
> [   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
> [   55.127462]  zram_rw_page+0x90/0x101
> [   55.128199]  bdev_write_page+0x92/0xe0
> [   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
> [   55.129841]  __swap_writepage+0x94/0x4a0
> [   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
> [   55.131462]  ? _raw_spin_unlock+0x1f/0x30
> [   55.132261]  ? page_swapcount+0x6c/0x90
> [   55.133038]  pageout+0xe3/0x3a0
> [   55.133702]  shrink_page_list+0xb94/0xd60
> [   55.134626]  shrink_inactive_list+0x158/0x460
>
> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -102,8 +102,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		if (pmd_none_or_clear_bad(pmd))
>  			continue;
>  		vunmap_pte_range(pmd, addr, next, mask);
> -
> -		cond_resched();
>  	} while (pmd++, addr = next, addr != end);
>  }

If this is triggering a warning then why isn't the might_sleep() in
remove_vm_area() also triggering?

Sigh.  I also cannot remember why these vfree() functions have to be so
awkward.  The mutex_trylock(&vmap_purge_lock) isn't permitted in
interrupt context because mutex_trylock() is stupid, but what was the
issue with non-interrupt atomic code?

Minchan Kim Nov. 7, 2020, 8:39 a.m. UTC | #4

Hi Andrew,

On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > 
> > While I was doing zram testing, I found sometimes decompression failed
> > since the compression buffer was corrupted. With investigation,
> > I found below commit calls cond_resched unconditionally so it could
> > make a problem in atomic context if the task is reschedule.
> > 
> > Revert the original commit for now.
> > 
> > [   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
> > [   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
> > [   55.111973] 3 locks held by memhog/946:
> > [   55.112807]  #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
> > [   55.114151]  #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
> > [   55.115848]  #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
> > [   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
> > [   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> > [   55.122540] Call Trace:
> > [   55.122974]  dump_stack+0x8b/0xb8
> > [   55.123588]  ___might_sleep.cold+0xb6/0xc6
> > [   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
> > [   55.125198]  unmap_kernel_range+0x14/0x30
> > [   55.125920]  zs_unmap_object+0xd5/0xe0
> > [   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
> > [   55.127462]  zram_rw_page+0x90/0x101
> > [   55.128199]  bdev_write_page+0x92/0xe0
> > [   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
> > [   55.129841]  __swap_writepage+0x94/0x4a0
> > [   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
> > [   55.131462]  ? _raw_spin_unlock+0x1f/0x30
> > [   55.132261]  ? page_swapcount+0x6c/0x90
> > [   55.133038]  pageout+0xe3/0x3a0
> > [   55.133702]  shrink_page_list+0xb94/0xd60
> > [   55.134626]  shrink_inactive_list+0x158/0x460
> >
> > ...
> >
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -102,8 +102,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  		if (pmd_none_or_clear_bad(pmd))
> >  			continue;
> >  		vunmap_pte_range(pmd, addr, next, mask);
> > -
> > -		cond_resched();
> >  	} while (pmd++, addr = next, addr != end);
> >  }
> 
> If this is triggering a warning then why isn't the might_sleep() in
> remove_vm_area() also triggering?

I don't understand what specific callpath you are talking but if
it's clearly called in atomic context, the reason would be config
combination I met.
    
    CONFIG_PREEMPT_VOLUNTARY + no CONFIG_DEBUG_ATOMIC_SLEEP

It makes preempt_count logic void so might_sleep warning will not work.

> 
> Sigh.  I also cannot remember why these vfree() functions have to be so
> awkward.  The mutex_trylock(&vmap_purge_lock) isn't permitted in
> interrupt context because mutex_trylock() is stupid, but what was the
> issue with non-interrupt atomic code?
> 

Seems like a latency issue.

commit f9e09977671b
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Dec 12 16:44:23 2016 -0800

    mm: turn vmap_purge_lock into a mutex

The purge_lock spinlock causes high latencies with non RT kernel,

Uladzislau Rezki Nov. 9, 2020, 11:33 a.m. UTC | #5

On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> Hi Andrew,
> 
> On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > 
> > > While I was doing zram testing, I found sometimes decompression failed
> > > since the compression buffer was corrupted. With investigation,
> > > I found below commit calls cond_resched unconditionally so it could
> > > make a problem in atomic context if the task is reschedule.
> > > 
> > > Revert the original commit for now.
> > > 
> > > [   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
> > > [   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
> > > [   55.111973] 3 locks held by memhog/946:
> > > [   55.112807]  #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
> > > [   55.114151]  #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
> > > [   55.115848]  #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
> > > [   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
> > > [   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> > > [   55.122540] Call Trace:
> > > [   55.122974]  dump_stack+0x8b/0xb8
> > > [   55.123588]  ___might_sleep.cold+0xb6/0xc6
> > > [   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
> > > [   55.125198]  unmap_kernel_range+0x14/0x30
> > > [   55.125920]  zs_unmap_object+0xd5/0xe0
> > > [   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
> > > [   55.127462]  zram_rw_page+0x90/0x101
> > > [   55.128199]  bdev_write_page+0x92/0xe0
> > > [   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
> > > [   55.129841]  __swap_writepage+0x94/0x4a0
> > > [   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
> > > [   55.131462]  ? _raw_spin_unlock+0x1f/0x30
> > > [   55.132261]  ? page_swapcount+0x6c/0x90
> > > [   55.133038]  pageout+0xe3/0x3a0
> > > [   55.133702]  shrink_page_list+0xb94/0xd60
> > > [   55.134626]  shrink_inactive_list+0x158/0x460
> > >
> > > ...
> > >
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -102,8 +102,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> > >  		if (pmd_none_or_clear_bad(pmd))
> > >  			continue;
> > >  		vunmap_pte_range(pmd, addr, next, mask);
> > > -
> > > -		cond_resched();
> > >  	} while (pmd++, addr = next, addr != end);
> > >  }
> > 
> > If this is triggering a warning then why isn't the might_sleep() in
> > remove_vm_area() also triggering?
> 
> I don't understand what specific callpath you are talking but if
> it's clearly called in atomic context, the reason would be config
> combination I met.
>     
>     CONFIG_PREEMPT_VOLUNTARY + no CONFIG_DEBUG_ATOMIC_SLEEP
> 
> It makes preempt_count logic void so might_sleep warning will not work.
> 
> > 
> > Sigh.  I also cannot remember why these vfree() functions have to be so
> > awkward.  The mutex_trylock(&vmap_purge_lock) isn't permitted in
> > interrupt context because mutex_trylock() is stupid, but what was the
> > issue with non-interrupt atomic code?
> > 
> 
> Seems like a latency issue.
> 
> commit f9e09977671b
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Mon Dec 12 16:44:23 2016 -0800
> 
>     mm: turn vmap_purge_lock into a mutex
> 
> The purge_lock spinlock causes high latencies with non RT kernel,
> 
__purge_vmap_area_lazy() is our bottleneck. The list of areas to be
drained can be quite big, so the process itself is time consuming.
That is why it is protected by the mutex instead of spinlock. Latency
issues.

I proposed to optimize it here: https://lkml.org/lkml/2020/9/7/815
I will send out the patch soon, but it is another topic not related
to this issue.

--
Vlad Rezki

Minchan Kim Nov. 12, 2020, 8:01 p.m. UTC | #6

Hi Andrew,

How should we proceed this problem?

On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> Hi Andrew,
> 
> On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > 
> > > While I was doing zram testing, I found sometimes decompression failed
> > > since the compression buffer was corrupted. With investigation,
> > > I found below commit calls cond_resched unconditionally so it could
> > > make a problem in atomic context if the task is reschedule.
> > > 
> > > Revert the original commit for now.
> > > 
> > > [   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
> > > [   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
> > > [   55.111973] 3 locks held by memhog/946:
> > > [   55.112807]  #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
> > > [   55.114151]  #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
> > > [   55.115848]  #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
> > > [   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
> > > [   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> > > [   55.122540] Call Trace:
> > > [   55.122974]  dump_stack+0x8b/0xb8
> > > [   55.123588]  ___might_sleep.cold+0xb6/0xc6
> > > [   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
> > > [   55.125198]  unmap_kernel_range+0x14/0x30
> > > [   55.125920]  zs_unmap_object+0xd5/0xe0
> > > [   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
> > > [   55.127462]  zram_rw_page+0x90/0x101
> > > [   55.128199]  bdev_write_page+0x92/0xe0
> > > [   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
> > > [   55.129841]  __swap_writepage+0x94/0x4a0
> > > [   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
> > > [   55.131462]  ? _raw_spin_unlock+0x1f/0x30
> > > [   55.132261]  ? page_swapcount+0x6c/0x90
> > > [   55.133038]  pageout+0xe3/0x3a0
> > > [   55.133702]  shrink_page_list+0xb94/0xd60
> > > [   55.134626]  shrink_inactive_list+0x158/0x460
> > >
> > > ...
> > >
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -102,8 +102,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> > >  		if (pmd_none_or_clear_bad(pmd))
> > >  			continue;
> > >  		vunmap_pte_range(pmd, addr, next, mask);
> > > -
> > > -		cond_resched();
> > >  	} while (pmd++, addr = next, addr != end);
> > >  }
> > 
> > If this is triggering a warning then why isn't the might_sleep() in
> > remove_vm_area() also triggering?
> 
> I don't understand what specific callpath you are talking but if
> it's clearly called in atomic context, the reason would be config
> combination I met.
>     
>     CONFIG_PREEMPT_VOLUNTARY + no CONFIG_DEBUG_ATOMIC_SLEEP
> 
> It makes preempt_count logic void so might_sleep warning will not work.
> 
> > 
> > Sigh.  I also cannot remember why these vfree() functions have to be so
> > awkward.  The mutex_trylock(&vmap_purge_lock) isn't permitted in
> > interrupt context because mutex_trylock() is stupid, but what was the
> > issue with non-interrupt atomic code?
> > 
> 
> Seems like a latency issue.
> 
> commit f9e09977671b
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Mon Dec 12 16:44:23 2016 -0800
> 
>     mm: turn vmap_purge_lock into a mutex
> 
> The purge_lock spinlock causes high latencies with non RT kernel,
>

Andrew Morton Nov. 12, 2020, 10:49 p.m. UTC | #7

On Thu, 12 Nov 2020 12:01:01 -0800 Minchan Kim <minchan@kernel.org> wrote:

> 
> On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> > Hi Andrew,
> > 
> > On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > > 
> > > > While I was doing zram testing, I found sometimes decompression failed
> > > > since the compression buffer was corrupted. With investigation,
> > > > I found below commit calls cond_resched unconditionally so it could
> > > > make a problem in atomic context if the task is reschedule.
> > > > 
> > > > Revert the original commit for now.
>
> How should we proceed this problem?
>

(top-posting repaired - please don't).

Well, we don't want to reintroduce the softlockup reports which
e47110e90584a2 fixed, and we certainly don't want to do that on behalf
of code which is using the unmap_kernel_range() interface incorrectly.

So I suggest either

a) make zs_unmap_object() stop doing the unmapping from atomic context or

b) figure out whether the vmalloc unmap code is *truly* unsafe from
   atomic context - perhaps it is only unsafe from interrupt context,
   in which case we can rework the vmalloc.c checks to reflect this, or

c) make the vfree code callable from all contexts.  Which is by far
   the preferred solution, but may be tough.

Or maybe not so tough - if the only problem in the vmalloc code is the
use of mutex_trylock() from irqs then it may be as simple as switching
to old-style semaphores and using down_trylock(), which I think is
irq-safe.

However old-style semaphores are deprecated.  A hackyish approach might
be to use an rwsem always in down_write mode and use
down_write_trylock(), which I think is also callable from interrrupt
context.

But I have a feeling that there are other reasons why vfree shouldn't
be called from atomic context, apart from the mutex_trylock-in-irq
issue.

Minchan Kim Nov. 13, 2020, 4:25 p.m. UTC | #8

On Thu, Nov 12, 2020 at 02:49:19PM -0800, Andrew Morton wrote:
> On Thu, 12 Nov 2020 12:01:01 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > 
> > On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> > > Hi Andrew,
> > > 
> > > On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > > > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > > > 
> > > > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > > > 
> > > > > While I was doing zram testing, I found sometimes decompression failed
> > > > > since the compression buffer was corrupted. With investigation,
> > > > > I found below commit calls cond_resched unconditionally so it could
> > > > > make a problem in atomic context if the task is reschedule.
> > > > > 
> > > > > Revert the original commit for now.
> >
> > How should we proceed this problem?
> >
> 
> (top-posting repaired - please don't).
> 
> Well, we don't want to reintroduce the softlockup reports which
> e47110e90584a2 fixed, and we certainly don't want to do that on behalf
> of code which is using the unmap_kernel_range() interface incorrectly.
> 
> So I suggest either
> 
> a) make zs_unmap_object() stop doing the unmapping from atomic context or

It's not easy since the path already hold several spin_locks as well as
per-cpu context. I could pursuit the direction but it takes several
steps to change entire locking scheme in the zsmalloc, which will
take time(we couldn't leave zsmalloc broken until then) and hard to
land on stable tree.

> 
> b) figure out whether the vmalloc unmap code is *truly* unsafe from
>    atomic context - perhaps it is only unsafe from interrupt context,
>    in which case we can rework the vmalloc.c checks to reflect this, or

I don't get the point. I assume your suggestion would be "let's make the
vunmap code atomic context safe" but how could it solve this problem?
     
The point from e47110e90584a2 was softlockup could be triggered if
vunamp deal with large mapping so need *explict reschedule* point
for CONFIG_PREEMPT_VOLUNTARY. However, CONFIG_PREEMPT_VOLUNTARY
doesn't consider peempt count so even though we could make vunamp
atomic safe to make a call under spin_lock:

spin_lock(&A);
vunmap
  vunmap_pmd_range
    cond_resched <- bang
 
Below options would have same problem, too.
Let me know if I misunderstand something.

> 
> c) make the vfree code callable from all contexts.  Which is by far
>    the preferred solution, but may be tough.
> 
> 
> Or maybe not so tough - if the only problem in the vmalloc code is the
> use of mutex_trylock() from irqs then it may be as simple as switching
> to old-style semaphores and using down_trylock(), which I think is
> irq-safe.
> 
> However old-style semaphores are deprecated.  A hackyish approach might
> be to use an rwsem always in down_write mode and use
> down_write_trylock(), which I think is also callable from interrrupt
> context.
> 
> But I have a feeling that there are other reasons why vfree shouldn't
> be called from atomic context, apart from the mutex_trylock-in-irq
> issue.

Minchan Kim Nov. 16, 2020, 5:53 p.m. UTC | #9

On Fri, Nov 13, 2020 at 08:25:29AM -0800, Minchan Kim wrote:
> On Thu, Nov 12, 2020 at 02:49:19PM -0800, Andrew Morton wrote:
> > On Thu, 12 Nov 2020 12:01:01 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > 
> > > On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> > > > Hi Andrew,
> > > > 
> > > > On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > > > > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > > > > 
> > > > > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > > > > 
> > > > > > While I was doing zram testing, I found sometimes decompression failed
> > > > > > since the compression buffer was corrupted. With investigation,
> > > > > > I found below commit calls cond_resched unconditionally so it could
> > > > > > make a problem in atomic context if the task is reschedule.
> > > > > > 
> > > > > > Revert the original commit for now.
> > >
> > > How should we proceed this problem?
> > >
> > 
> > (top-posting repaired - please don't).
> > 
> > Well, we don't want to reintroduce the softlockup reports which
> > e47110e90584a2 fixed, and we certainly don't want to do that on behalf
> > of code which is using the unmap_kernel_range() interface incorrectly.
> > 
> > So I suggest either
> > 
> > a) make zs_unmap_object() stop doing the unmapping from atomic context or
> 
> It's not easy since the path already hold several spin_locks as well as
> per-cpu context. I could pursuit the direction but it takes several
> steps to change entire locking scheme in the zsmalloc, which will
> take time(we couldn't leave zsmalloc broken until then) and hard to
> land on stable tree.
> 
> > 
> > b) figure out whether the vmalloc unmap code is *truly* unsafe from
> >    atomic context - perhaps it is only unsafe from interrupt context,
> >    in which case we can rework the vmalloc.c checks to reflect this, or
> 
> I don't get the point. I assume your suggestion would be "let's make the
> vunmap code atomic context safe" but how could it solve this problem?
>      
> The point from e47110e90584a2 was softlockup could be triggered if
> vunamp deal with large mapping so need *explict reschedule* point
> for CONFIG_PREEMPT_VOLUNTARY. However, CONFIG_PREEMPT_VOLUNTARY
> doesn't consider peempt count so even though we could make vunamp
> atomic safe to make a call under spin_lock:
> 
> spin_lock(&A);
> vunmap
>   vunmap_pmd_range
>     cond_resched <- bang
>  
> Below options would have same problem, too.
> Let me know if I misunderstand something.
> 
> > 
> > c) make the vfree code callable from all contexts.  Which is by far
> >    the preferred solution, but may be tough.
> > 
> > 
> > Or maybe not so tough - if the only problem in the vmalloc code is the
> > use of mutex_trylock() from irqs then it may be as simple as switching
> > to old-style semaphores and using down_trylock(), which I think is
> > irq-safe.
> > 
> > However old-style semaphores are deprecated.  A hackyish approach might
> > be to use an rwsem always in down_write mode and use
> > down_write_trylock(), which I think is also callable from interrrupt
> > context.
> > 
> > But I have a feeling that there are other reasons why vfree shouldn't
> > be called from atomic context, apart from the mutex_trylock-in-irq
> > issue.

How about this?

From 0733bc468d0072147c2ecf998d7cc3af2234bc87 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 16 Nov 2020 09:38:40 -0800
Subject: [PATCH] mm: unmap_kernel_range_atomic

unmap_kernel_range had been atomic operation and zsmalloc has used
it in atomic context in zs_unmap_object.
However, ("e47110e90584, mm/vunmap: add cond_resched() in vunmap_pmd_range")
changed it into non-atomic operation via adding cond_resched.
It causes zram decompresion failure by corrupting compressed buffer
in atomic context.

This patch introduces unmap_kernel_range_atomic which works for
only range less than PMD_SIZE to prevent cond_resched call.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vmalloc.h |  2 ++
 mm/vmalloc.c            | 23 +++++++++++++++++++++--
 mm/zsmalloc.c           |  2 +-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..36b1ecc2d014 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -180,6 +180,7 @@ int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
 		struct page **pages);
 extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
 extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void unmap_kernel_range_atomic(unsigned long addr, unsigned long size);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
@@ -200,6 +201,7 @@ unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
 {
 }
 #define unmap_kernel_range unmap_kernel_range_noflush
+#define unmap_kernel_range_atomic unmap_kernel_range_noflush
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 }
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d7075ad340aa..714e5425dc45 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -88,6 +88,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd_t *pmd;
 	unsigned long next;
 	int cleared;
+	bool check_resched = (end - addr) > PMD_SIZE;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -102,8 +103,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next, mask);
-
-		cond_resched();
+		if (check_resched)
+			cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }
 
@@ -2024,6 +2025,24 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 	flush_tlb_kernel_range(addr, end);
 }
 
+/**
+ * unmap_kernel_range_atomic - unmap kernel VM area and flush cache and TLB
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Similar to unmap_kernel_range_noflush() but it's atomic. @size should be
+ * less than PMD_SIZE.
+ */
+void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
+{
+	unsigned long end = addr + size;
+
+	flush_cache_vunmap(addr, end);
+	WARN_ON(size > PMD_SIZE);
+	unmap_kernel_range_noflush(addr, size);
+	flush_tlb_kernel_range(addr, end);
+}
+
 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 	struct vmap_area *va, unsigned long flags, const void *caller)
 {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 662ee420706f..9decc7634852 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1154,7 +1154,7 @@ static inline void __zs_unmap_object(struct mapping_area *area,
 {
 	unsigned long addr = (unsigned long)area->vm_addr;
 
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
+	unmap_kernel_range_atomic(addr, PAGE_SIZE * 2);
 }
 
 #else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */

Andrew Morton Nov. 16, 2020, 11:20 p.m. UTC | #10

Let's cc Uladzislau on vmalloc things?

> How about this?

Well, lol, that's a simple approach to avoiding the problem ;)

> unmap_kernel_range had been atomic operation and zsmalloc has used
> it in atomic context in zs_unmap_object.
> However, ("e47110e90584, mm/vunmap: add cond_resched() in vunmap_pmd_range")
> changed it into non-atomic operation via adding cond_resched.
> It causes zram decompresion failure by corrupting compressed buffer
> in atomic context.
> 
> This patch introduces unmap_kernel_range_atomic which works for
> only range less than PMD_SIZE to prevent cond_resched call.
> 
> ...
>
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -180,6 +180,7 @@ int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
>  		struct page **pages);
>  extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
>  extern void unmap_kernel_range(unsigned long addr, unsigned long size);
> +extern void unmap_kernel_range_atomic(unsigned long addr, unsigned long size);
>  static inline void set_vm_flush_reset_perms(void *addr)
>  {
>  	struct vm_struct *vm = find_vm_area(addr);
> @@ -200,6 +201,7 @@ unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
>  {
>  }
>  #define unmap_kernel_range unmap_kernel_range_noflush
> +#define unmap_kernel_range_atomic unmap_kernel_range_noflush
>  static inline void set_vm_flush_reset_perms(void *addr)
>  {
>  }
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d7075ad340aa..714e5425dc45 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -88,6 +88,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	pmd_t *pmd;
>  	unsigned long next;
>  	int cleared;
> +	bool check_resched = (end - addr) > PMD_SIZE;
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> @@ -102,8 +103,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		if (pmd_none_or_clear_bad(pmd))
>  			continue;
>  		vunmap_pte_range(pmd, addr, next, mask);
> -
> -		cond_resched();
> +		if (check_resched)
> +			cond_resched();
>  	} while (pmd++, addr = next, addr != end);
>  }
>  
> @@ -2024,6 +2025,24 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
>  	flush_tlb_kernel_range(addr, end);
>  }
>  
> +/**
> + * unmap_kernel_range_atomic - unmap kernel VM area and flush cache and TLB
> + * @addr: start of the VM area to unmap
> + * @size: size of the VM area to unmap
> + *
> + * Similar to unmap_kernel_range_noflush() but it's atomic. @size should be
> + * less than PMD_SIZE.
> + */
> +void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
> +{
> +	unsigned long end = addr + size;
> +
> +	flush_cache_vunmap(addr, end);
> +	WARN_ON(size > PMD_SIZE);

WARN_ON_ONCE() would be better here - no point in creating a million
warnings where one would suffice.

> +	unmap_kernel_range_noflush(addr, size);
> +	flush_tlb_kernel_range(addr, end);
> +}
> +
>  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
>  	struct vmap_area *va, unsigned long flags, const void *caller)
>  {
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 662ee420706f..9decc7634852 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1154,7 +1154,7 @@ static inline void __zs_unmap_object(struct mapping_area *area,
>  {
>  	unsigned long addr = (unsigned long)area->vm_addr;
>  
> -	unmap_kernel_range(addr, PAGE_SIZE * 2);
> +	unmap_kernel_range_atomic(addr, PAGE_SIZE * 2);
>  }

I suppose we could live with it if no better solutions are forthcoming.

Christoph Hellwig Nov. 17, 2020, 1:56 p.m. UTC | #11

Btw, I remember that the whole vmalloc magic in zsmalloc was only giving
a small benefit for a few niche use cases.  Given that it generally has
very strange interaction with the vmalloc core, including using various
APIs not used by any driver I'm going to ask once again why we can't
just drop the CONFIG_ZSMALLOC_PGTABLE_MAPPING case entirely?

Uladzislau Rezki Nov. 17, 2020, 1:57 p.m. UTC | #12

> 
> Let's cc Uladzislau on vmalloc things?
> 
> > How about this?
> 
> Well, lol, that's a simple approach to avoiding the problem ;)
> 
To me it looks like a specific workaround for a specific one user.

> > unmap_kernel_range had been atomic operation and zsmalloc has used
> > it in atomic context in zs_unmap_object.
> > However, ("e47110e90584, mm/vunmap: add cond_resched() in vunmap_pmd_range")
> > changed it into non-atomic operation via adding cond_resched.
> > It causes zram decompresion failure by corrupting compressed buffer
> > in atomic context.
> > 
> > This patch introduces unmap_kernel_range_atomic which works for
> > only range less than PMD_SIZE to prevent cond_resched call.
> > 
> > ...
> >
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -180,6 +180,7 @@ int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
> >  		struct page **pages);
> >  extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
> >  extern void unmap_kernel_range(unsigned long addr, unsigned long size);
> > +extern void unmap_kernel_range_atomic(unsigned long addr, unsigned long size);
> >  static inline void set_vm_flush_reset_perms(void *addr)
> >  {
> >  	struct vm_struct *vm = find_vm_area(addr);
> > @@ -200,6 +201,7 @@ unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
> >  {
> >  }
> >  #define unmap_kernel_range unmap_kernel_range_noflush
> > +#define unmap_kernel_range_atomic unmap_kernel_range_noflush
> >  static inline void set_vm_flush_reset_perms(void *addr)
> >  {
> >  }
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d7075ad340aa..714e5425dc45 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -88,6 +88,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  	pmd_t *pmd;
> >  	unsigned long next;
> >  	int cleared;
> > +	bool check_resched = (end - addr) > PMD_SIZE;
> >  
> >  	pmd = pmd_offset(pud, addr);
> >  	do {
> > @@ -102,8 +103,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  		if (pmd_none_or_clear_bad(pmd))
> >  			continue;
> >  		vunmap_pte_range(pmd, addr, next, mask);
> > -
> > -		cond_resched();
> > +		if (check_resched)
> > +			cond_resched();
> >  	} while (pmd++, addr = next, addr != end);
> >  }
> >  
> > @@ -2024,6 +2025,24 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
> >  	flush_tlb_kernel_range(addr, end);
> >  }
> >  
> > +/**
> > + * unmap_kernel_range_atomic - unmap kernel VM area and flush cache and TLB
> > + * @addr: start of the VM area to unmap
> > + * @size: size of the VM area to unmap
> > + *
> > + * Similar to unmap_kernel_range_noflush() but it's atomic. @size should be
> > + * less than PMD_SIZE.
> > + */
> > +void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
> > +{
> > +	unsigned long end = addr + size;
> > +
> > +	flush_cache_vunmap(addr, end);
> > +	WARN_ON(size > PMD_SIZE);
> 
> WARN_ON_ONCE() would be better here - no point in creating a million
> warnings where one would suffice.
> 
> > +	unmap_kernel_range_noflush(addr, size);
> > +	flush_tlb_kernel_range(addr, end);
> > +}
> > +
> >  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> >  	struct vmap_area *va, unsigned long flags, const void *caller)
> >  {
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 662ee420706f..9decc7634852 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -1154,7 +1154,7 @@ static inline void __zs_unmap_object(struct mapping_area *area,
> >  {
> >  	unsigned long addr = (unsigned long)area->vm_addr;
> >  
> > -	unmap_kernel_range(addr, PAGE_SIZE * 2);
> > +	unmap_kernel_range_atomic(addr, PAGE_SIZE * 2);
> >  }
> 
> I suppose we could live with it if no better solutions are forthcoming.
>
Maybe solve it on zsmalloc side? For example to add __zs_unmap_object_deferred(),
so it schedules the work that calls unmap_kernel_range() on a list of mapping_area
objects.

--
Vlad Rezki

Minchan Kim Nov. 17, 2020, 8:29 p.m. UTC | #13

On Tue, Nov 17, 2020 at 01:56:32PM +0000, Christoph Hellwig wrote:
> Btw, I remember that the whole vmalloc magic in zsmalloc was only giving
> a small benefit for a few niche use cases.  Given that it generally has
> very strange interaction with the vmalloc core, including using various
> APIs not used by any driver I'm going to ask once again why we can't
> just drop the CONFIG_ZSMALLOC_PGTABLE_MAPPING case entirely?

Yub, I remeber the discussion. 
https://lore.kernel.org/linux-mm/20200416203736.GB50092@google.com/

I wanted to remove it but 30% gain made me think again before
deciding to drop it.
Since it continue to make problems and Linux is approaching to
deprecate the 32bit machines, I think it would be better to drop it
rather than inventing weird workaround.

Ccing Tony since omap2plus have used it by default for just in case.

From fc1b17a120991fd86b9e1153ab22d0b0bdadd8d0 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 17 Nov 2020 11:58:51 -0800
Subject: [PATCH] mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING

Even though this option showed some amount improvement(e.g., 30%)
in some arm32 platforms, it has been headache to maintain since it
have abused APIs[1](e.g., unmap_kernel_range in atomic context).

Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly
it has been not default option in zsmalloc, it's time to drop the
option for better maintainance.

[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

Cc: Tony Lindgren <tony@atomide.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/arm/configs/omap2plus_defconfig |  1 -
 include/linux/zsmalloc.h             |  1 -
 mm/Kconfig                           | 13 -------
 mm/zsmalloc.c                        | 54 ----------------------------
 4 files changed, 69 deletions(-)

diff --git a/arch/arm/configs/omap2plus_defconfig b/arch/arm/configs/omap2plus_defconfig
index 77716f500813..70ee84a314c8 100644
--- a/arch/arm/configs/omap2plus_defconfig
+++ b/arch/arm/configs/omap2plus_defconfig
@@ -81,7 +81,6 @@ CONFIG_PARTITION_ADVANCED=y
 CONFIG_BINFMT_MISC=y
 CONFIG_CMA=y
 CONFIG_ZSMALLOC=m
-CONFIG_ZSMALLOC_PGTABLE_MAPPING=y
 CONFIG_NET=y
 CONFIG_PACKET=y
 CONFIG_UNIX=y
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 0fdbf653b173..4807ca4d52e0 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -20,7 +20,6 @@
  * zsmalloc mapping modes
  *
  * NOTE: These only make a difference when a mapped object spans pages.
- * They also have no effect when ZSMALLOC_PGTABLE_MAPPING is selected.
  */
 enum zs_mapmode {
 	ZS_MM_RW, /* normal read-write mapping */
diff --git a/mm/Kconfig b/mm/Kconfig
index 29904dc16bfc..7af1a55b708e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -707,19 +707,6 @@ config ZSMALLOC
 	  returned by an alloc().  This handle must be mapped in order to
 	  access the allocated space.
 
-config ZSMALLOC_PGTABLE_MAPPING
-	bool "Use page table mapping to access object in zsmalloc"
-	depends on ZSMALLOC=y
-	help
-	  By default, zsmalloc uses a copy-based object mapping method to
-	  access allocations that span two pages. However, if a particular
-	  architecture (ex, ARM) performs VM mapping faster than copying,
-	  then you should select this. This causes zsmalloc to use page table
-	  mapping rather than copying for object mapping.
-
-	  You can check speed with zsmalloc benchmark:
-	  https://github.com/spartacus06/zsmapbench
-
 config ZSMALLOC_STAT
 	bool "Export zsmalloc statistics"
 	depends on ZSMALLOC
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b03bee2e1b5f..7289f502ffac 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -293,11 +293,7 @@ struct zspage {
 };
 
 struct mapping_area {
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-	struct vm_struct *vm; /* vm area for mapping object that span pages */
-#else
 	char *vm_buf; /* copy buffer for objects that span pages */
-#endif
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
 };
@@ -1110,54 +1106,6 @@ static struct zspage *find_get_zspage(struct size_class *class)
 	return zspage;
 }
 
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm)
-		return 0;
-	area->vm = get_vm_area(PAGE_SIZE * 2, 0);
-	if (!area->vm)
-		return -ENOMEM;
-
-	/*
-	 * Populate ptes in advance to avoid pte allocation with GFP_KERNEL
-	 * in non-preemtible context of zs_map_object.
-	 */
-	return apply_to_page_range(&init_mm, (unsigned long)area->vm->addr,
-			PAGE_SIZE * 2, NULL, NULL);
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	if (area->vm)
-		free_vm_area(area->vm);
-	area->vm = NULL;
-}
-
-static inline void *__zs_map_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm->addr;
-
-	BUG_ON(map_kernel_range(addr, PAGE_SIZE * 2, PAGE_KERNEL, pages) < 0);
-	area->vm_addr = area->vm->addr;
-	return area->vm_addr + off;
-}
-
-static inline void __zs_unmap_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm_addr;
-
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
-}
-
-#else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static inline int __zs_cpu_up(struct mapping_area *area)
 {
 	/*
@@ -1238,8 +1186,6 @@ static void __zs_unmap_object(struct mapping_area *area,
 	pagefault_enable();
 }
 
-#endif /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static int zs_cpu_prepare(unsigned int cpu)
 {
 	struct mapping_area *area;

Sergey Senozhatsky Nov. 18, 2020, 2:06 a.m. UTC | #14

On (20/11/17 12:29), Minchan Kim wrote:
> Yub, I remeber the discussion. 
> https://lore.kernel.org/linux-mm/20200416203736.GB50092@google.com/
> 
> I wanted to remove it but 30% gain made me think again before
> deciding to drop it.
> Since it continue to make problems and Linux is approaching to
> deprecate the 32bit machines, I think it would be better to drop it
> rather than inventing weird workaround.
> 
> Ccing Tony since omap2plus have used it by default for just in case.
> 
> From fc1b17a120991fd86b9e1153ab22d0b0bdadd8d0 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 17 Nov 2020 11:58:51 -0800
> Subject: [PATCH] mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
> 
> Even though this option showed some amount improvement(e.g., 30%)
> in some arm32 platforms, it has been headache to maintain since it
> have abused APIs[1](e.g., unmap_kernel_range in atomic context).
> 
> Since we are approaching to deprecate 32bit machines and already made
> the config option available for only builtin build since v5.8, lastly
> it has been not default option in zsmalloc, it's time to drop the
> option for better maintainance.
> 
> [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
> 
> Cc: Tony Lindgren <tony@atomide.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: <stable@vger.kernel.org>
> Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Looks good to me.

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

Tony Lindgren Nov. 19, 2020, 9:29 a.m. UTC | #15

* Sergey Senozhatsky <sergey.senozhatsky@gmail.com> [201118 02:07]:
> On (20/11/17 12:29), Minchan Kim wrote:
> > Yub, I remeber the discussion. 
> > https://lore.kernel.org/linux-mm/20200416203736.GB50092@google.com/
> > 
> > I wanted to remove it but 30% gain made me think again before
> > deciding to drop it.
> > Since it continue to make problems and Linux is approaching to
> > deprecate the 32bit machines, I think it would be better to drop it
> > rather than inventing weird workaround.
> > 
> > Ccing Tony since omap2plus have used it by default for just in case.

Looks like make omap2lus_defconfig is not selecting it anyways and I
never even noticed earlier:

Acked-by: Tony Lindgren <tony@atomide.com>

Revert "mm/vunmap: add cond_resched() in vunmap_pmd_range"

Commit Message

Comments

Patch