
[v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()

Message ID 20241021173455.2691973-1-roman.gushchin@linux.dev (mailing list archive)
State New
Series [v2] mm: page_alloc: move mlocked flag clearance into free_pages_prepare()

Commit Message

Roman Gushchin Oct. 21, 2024, 5:34 p.m. UTC
Syzbot reported a bad page state problem caused by a page
being freed using free_page() still having a mlocked flag at
free_pages_prepare() stage:

  BUG: Bad page state in process syz.0.15  pfn:1137bb
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881137bb870 pfn:0x1137bb
  flags: 0x400000000080000(mlocked|node=0|zone=1)
  raw: 0400000000080000 0000000000000000 dead000000000122 0000000000000000
  raw: ffff8881137bb870 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  page_owner tracks the page as allocated
  page last allocated via order 0, migratetype Unmovable, gfp_mask
  0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 3005, tgid
  3004 (syz.0.15), ts 61546608067, free_ts 61390082085
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x3008/0x31f0 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x292/0x7b0 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x630 mm/mempolicy.c:2265
   kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
   kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
   kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5500 [inline]
   kvm_dev_ioctl+0x13bb/0x2320 virt/kvm/kvm_main.c:5542
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:907 [inline]
   __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0x69/0x110 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  page last free pid 951 tgid 951 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_page+0xcb1/0xf00 mm/page_alloc.c:2638
   vfree+0x181/0x2e0 mm/vmalloc.c:3361
   delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3282
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa5c/0x17a0 kernel/workqueue.c:3310
   worker_thread+0xa2b/0xf70 kernel/workqueue.c:3391
   kthread+0x2df/0x370 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

A reproducer is available here:
https://syzkaller.appspot.com/x/repro.c?x=1437939f980000
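As a triage aid, the flags and gfp_mask values in the dump above can be
decoded with a short sketch. The bit values below are inferred from the
kernel's own annotations in this report (it decoded 0x80000 as mlocked and
0x400dc0 as GFP_KERNEL_ACCOUNT|__GFP_ZERO); page-flag layout is
config-dependent, so treat the constants as illustrative, not authoritative:

```python
# Decode the "flags:" and "gfp_mask" values from the syzbot dump above.
# Bit positions are config-specific; these match the kernel's own decoding
# in this particular report, not a universal layout.

PAGE_FLAGS = 0x400000000080000   # "flags:" line in the dump
PG_MLOCKED_BIT = 19              # 0x80000 == 1 << 19 in this config

mlocked = bool(PAGE_FLAGS & (1 << PG_MLOCKED_BIT))
print("PG_mlocked set:", mlocked)

# gfp_mask recorded by page_owner for the last allocation
GFP_MASK = 0x400dc0
gfp_bits = {                     # common values for this kernel era
    "__GFP_IO": 0x40,
    "__GFP_FS": 0x80,
    "__GFP_ZERO": 0x100,
    "__GFP_DIRECT_RECLAIM": 0x400,
    "__GFP_KSWAPD_RECLAIM": 0x800,
    "__GFP_ACCOUNT": 0x400000,
}
set_flags = [name for name, bit in gfp_bits.items() if GFP_MASK & bit]
print("gfp flags:", "|".join(set_flags))
```

This is only a decoding aid; the authoritative bit definitions live in
include/linux/page-flags.h and include/linux/gfp_types.h.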

The problem was originally introduced by
commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final
clearance"): its handling focused on pagecache
and anonymous memory and wasn't suitable for lower-level
get_page()/free_page() APIs used, for example, by KVM, as with
this reproducer.

Fix it by moving the mlocked flag clearance down to
free_pages_prepare().

The bug itself is fairly old and harmless (aside from generating these
warnings).

Closes: https://syzkaller.appspot.com/x/report.txt?x=169a47d0580000
Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 15 +++++++++++++++
 mm/swap.c       | 14 --------------
 2 files changed, 15 insertions(+), 14 deletions(-)

Comments

Shakeel Butt Oct. 21, 2024, 5:57 p.m. UTC | #1
On Mon, Oct 21, 2024 at 05:34:55PM GMT, Roman Gushchin wrote:
> Syzbot reported a bad page state problem caused by a page
> being freed using free_page() still having a mlocked flag at
> free_pages_prepare() stage:
> 
>   BUG: Bad page state in process syz.0.15  pfn:1137bb
>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881137bb870 pfn:0x1137bb
>   flags: 0x400000000080000(mlocked|node=0|zone=1)
>   raw: 0400000000080000 0000000000000000 dead000000000122 0000000000000000
>   raw: ffff8881137bb870 0000000000000000 00000000ffffffff 0000000000000000
>   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>   page_owner tracks the page as allocated
>   page last allocated via order 0, migratetype Unmovable, gfp_mask
>   0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 3005, tgid
>   3004 (syz.0.15), ts 61546  608067, free_ts 61390082085
>    set_page_owner include/linux/page_owner.h:32 [inline]
>    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
>    prep_new_page mm/page_alloc.c:1545 [inline]
>    get_page_from_freelist+0x3008/0x31f0 mm/page_alloc.c:3457
>    __alloc_pages_noprof+0x292/0x7b0 mm/page_alloc.c:4733
>    alloc_pages_mpol_noprof+0x3e8/0x630 mm/mempolicy.c:2265
>    kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
>    kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
>    kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5500 [inline]
>    kvm_dev_ioctl+0x13bb/0x2320 virt/kvm/kvm_main.c:5542
>    vfs_ioctl fs/ioctl.c:51 [inline]
>    __do_sys_ioctl fs/ioctl.c:907 [inline]
>    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
>    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>    do_syscall_64+0x69/0x110 arch/x86/entry/common.c:83
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   page last free pid 951 tgid 951 stack trace:
>    reset_page_owner include/linux/page_owner.h:25 [inline]
>    free_pages_prepare mm/page_alloc.c:1108 [inline]
>    free_unref_page+0xcb1/0xf00 mm/page_alloc.c:2638
>    vfree+0x181/0x2e0 mm/vmalloc.c:3361
>    delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3282
>    process_one_work kernel/workqueue.c:3229 [inline]
>    process_scheduled_works+0xa5c/0x17a0 kernel/workqueue.c:3310
>    worker_thread+0xa2b/0xf70 kernel/workqueue.c:3391
>    kthread+0x2df/0x370 kernel/kthread.c:389
>    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> 
> A reproducer is available here:
> https://syzkaller.appspot.com/x/repro.c?x=1437939f980000
> 
> The problem was originally introduced by
> commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final
> clearance"): its handling focused on pagecache
> and anonymous memory and wasn't suitable for lower-level
> get_page()/free_page() APIs used, for example, by KVM, as with
> this reproducer.
> 
> Fix it by moving the mlocked flag clearance down to
> free_pages_prepare().
> 
> The bug itself is fairly old and harmless (aside from generating these
> warnings).
> 
> Closes: https://syzkaller.appspot.com/x/report.txt?x=169a47d0580000

Can you open the access to the syzbot report?
Hugh Dickins Oct. 21, 2024, 7:49 p.m. UTC | #2
On Mon, 21 Oct 2024, Roman Gushchin wrote:

> Syzbot reported a bad page state problem caused by a page
> being freed using free_page() still having a mlocked flag at
> free_pages_prepare() stage:
> 
>   BUG: Bad page state in process syz.0.15  pfn:1137bb
>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881137bb870 pfn:0x1137bb
>   flags: 0x400000000080000(mlocked|node=0|zone=1)
>   raw: 0400000000080000 0000000000000000 dead000000000122 0000000000000000
>   raw: ffff8881137bb870 0000000000000000 00000000ffffffff 0000000000000000
>   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>   page_owner tracks the page as allocated
>   page last allocated via order 0, migratetype Unmovable, gfp_mask
>   0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 3005, tgid
>   3004 (syz.0.15), ts 61546  608067, free_ts 61390082085
>    set_page_owner include/linux/page_owner.h:32 [inline]
>    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
>    prep_new_page mm/page_alloc.c:1545 [inline]
>    get_page_from_freelist+0x3008/0x31f0 mm/page_alloc.c:3457
>    __alloc_pages_noprof+0x292/0x7b0 mm/page_alloc.c:4733
>    alloc_pages_mpol_noprof+0x3e8/0x630 mm/mempolicy.c:2265
>    kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
>    kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
>    kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5500 [inline]
>    kvm_dev_ioctl+0x13bb/0x2320 virt/kvm/kvm_main.c:5542
>    vfs_ioctl fs/ioctl.c:51 [inline]
>    __do_sys_ioctl fs/ioctl.c:907 [inline]
>    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
>    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>    do_syscall_64+0x69/0x110 arch/x86/entry/common.c:83
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   page last free pid 951 tgid 951 stack trace:
>    reset_page_owner include/linux/page_owner.h:25 [inline]
>    free_pages_prepare mm/page_alloc.c:1108 [inline]
>    free_unref_page+0xcb1/0xf00 mm/page_alloc.c:2638
>    vfree+0x181/0x2e0 mm/vmalloc.c:3361
>    delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3282
>    process_one_work kernel/workqueue.c:3229 [inline]
>    process_scheduled_works+0xa5c/0x17a0 kernel/workqueue.c:3310
>    worker_thread+0xa2b/0xf70 kernel/workqueue.c:3391
>    kthread+0x2df/0x370 kernel/kthread.c:389
>    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> 
> A reproducer is available here:
> https://syzkaller.appspot.com/x/repro.c?x=1437939f980000
> 
> The problem was originally introduced by
> commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final
> clearance"): its handling focused on pagecache
> and anonymous memory and wasn't suitable for lower-level
> get_page()/free_page() APIs used, for example, by KVM, as with
> this reproducer.
> 
> Fix it by moving the mlocked flag clearance down to
> free_pages_prepare().
> 
> The bug itself is fairly old and harmless (aside from generating these
> warnings).
> 
> Closes: https://syzkaller.appspot.com/x/report.txt?x=169a47d0580000
> Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: <stable@vger.kernel.org>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks Roman - I'd been preparing a similar patch, so agree that this is
the right fix.  I don't think there's any need to change your text, but
let me remind us that any "Bad page" report stops that page from being
allocated again (because it's in an undefined, potentially dangerous
state): so it does amount to a small memory leak even if otherwise harmless.

> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 15 +++++++++++++++
>  mm/swap.c       | 14 --------------
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bc55d39eb372..7535d78862ab 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1044,6 +1044,7 @@ __always_inline bool free_pages_prepare(struct page *page,
>  	bool skip_kasan_poison = should_skip_kasan_poison(page);
>  	bool init = want_init_on_free();
>  	bool compound = PageCompound(page);
> +	struct folio *folio = page_folio(page);
>  
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> @@ -1053,6 +1054,20 @@ __always_inline bool free_pages_prepare(struct page *page,
>  	if (memcg_kmem_online() && PageMemcgKmem(page))
>  		__memcg_kmem_uncharge_page(page, order);
>  
> +	/*
> +	 * In rare cases, when truncation or holepunching raced with
> +	 * munlock after VM_LOCKED was cleared, Mlocked may still be
> +	 * found set here.  This does not indicate a problem, unless
> +	 * "unevictable_pgs_cleared" appears worryingly large.
> +	 */
> +	if (unlikely(folio_test_mlocked(folio))) {
> +		long nr_pages = folio_nr_pages(folio);
> +
> +		__folio_clear_mlocked(folio);
> +		zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> +		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
> +	}
> +
>  	if (unlikely(PageHWPoison(page)) && !order) {
>  		/* Do not let hwpoison pages hit pcplists/buddy */
>  		reset_page_owner(page, order);
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..7cd0f4719423 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -78,20 +78,6 @@ static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
>  		lruvec_del_folio(*lruvecp, folio);
>  		__folio_clear_lru_flags(folio);
>  	}
> -
> -	/*
> -	 * In rare cases, when truncation or holepunching raced with
> -	 * munlock after VM_LOCKED was cleared, Mlocked may still be
> -	 * found set here.  This does not indicate a problem, unless
> -	 * "unevictable_pgs_cleared" appears worryingly large.
> -	 */
> -	if (unlikely(folio_test_mlocked(folio))) {
> -		long nr_pages = folio_nr_pages(folio);
> -
> -		__folio_clear_mlocked(folio);
> -		zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> -		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
> -	}
>  }
>  
>  /*
> -- 
> 2.47.0.105.g07ac214952-goog
> 
>
Matthew Wilcox Oct. 21, 2024, 8:34 p.m. UTC | #3
On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> Fix it by moving the mlocked flag clearance down to
> free_pages_prepare().

Urgh, I don't like this new reference to folio in free_pages_prepare().
It feels like a layering violation.  I'll think about where else we
could put this.
Hugh Dickins Oct. 21, 2024, 9:14 p.m. UTC | #4
On Mon, 21 Oct 2024, Matthew Wilcox wrote:
> On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > Fix it by moving the mlocked flag clearance down to
> > free_pages_prepare().
> 
> Urgh, I don't like this new reference to folio in free_pages_prepare().
> It feels like a layering violation.  I'll think about where else we
> could put this.

I'm glad to see that I guessed correctly when preparing my similar patch:
I expected you to feel that way.  The alternative seems to be to bring
back PageMlocked etc, but I thought you'd find that more distasteful.

Hugh
Roman Gushchin Oct. 22, 2024, 2:11 a.m. UTC | #5
On Mon, Oct 21, 2024 at 10:57:35AM -0700, Shakeel Butt wrote:
> On Mon, Oct 21, 2024 at 05:34:55PM GMT, Roman Gushchin wrote:
> > Syzbot reported a bad page state problem caused by a page
> > being freed using free_page() still having a mlocked flag at
> > free_pages_prepare() stage:
> > 
> >   BUG: Bad page state in process syz.0.15  pfn:1137bb
> >   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881137bb870 pfn:0x1137bb
> >   flags: 0x400000000080000(mlocked|node=0|zone=1)
> >   raw: 0400000000080000 0000000000000000 dead000000000122 0000000000000000
> >   raw: ffff8881137bb870 0000000000000000 00000000ffffffff 0000000000000000
> >   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >   page_owner tracks the page as allocated
> >   page last allocated via order 0, migratetype Unmovable, gfp_mask
> >   0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 3005, tgid
> >   3004 (syz.0.15), ts 61546  608067, free_ts 61390082085
> >    set_page_owner include/linux/page_owner.h:32 [inline]
> >    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
> >    prep_new_page mm/page_alloc.c:1545 [inline]
> >    get_page_from_freelist+0x3008/0x31f0 mm/page_alloc.c:3457
> >    __alloc_pages_noprof+0x292/0x7b0 mm/page_alloc.c:4733
> >    alloc_pages_mpol_noprof+0x3e8/0x630 mm/mempolicy.c:2265
> >    kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
> >    kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
> >    kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5500 [inline]
> >    kvm_dev_ioctl+0x13bb/0x2320 virt/kvm/kvm_main.c:5542
> >    vfs_ioctl fs/ioctl.c:51 [inline]
> >    __do_sys_ioctl fs/ioctl.c:907 [inline]
> >    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
> >    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >    do_syscall_64+0x69/0x110 arch/x86/entry/common.c:83
> >    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   page last free pid 951 tgid 951 stack trace:
> >    reset_page_owner include/linux/page_owner.h:25 [inline]
> >    free_pages_prepare mm/page_alloc.c:1108 [inline]
> >    free_unref_page+0xcb1/0xf00 mm/page_alloc.c:2638
> >    vfree+0x181/0x2e0 mm/vmalloc.c:3361
> >    delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3282
> >    process_one_work kernel/workqueue.c:3229 [inline]
> >    process_scheduled_works+0xa5c/0x17a0 kernel/workqueue.c:3310
> >    worker_thread+0xa2b/0xf70 kernel/workqueue.c:3391
> >    kthread+0x2df/0x370 kernel/kthread.c:389
> >    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> >    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> > 
> > A reproducer is available here:
> > https://syzkaller.appspot.com/x/repro.c?x=1437939f980000
> > 
> > The problem was originally introduced by
> > commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final
> > clearance"): its handling focused on pagecache
> > and anonymous memory and wasn't suitable for lower-level
> > get_page()/free_page() APIs used, for example, by KVM, as with
> > this reproducer.
> > 
> > Fix it by moving the mlocked flag clearance down to
> > free_pages_prepare().
> > 
> > The bug itself is fairly old and harmless (aside from generating these
> > warnings).
> > 
> > Closes: https://syzkaller.appspot.com/x/report.txt?x=169a47d0580000
> 
> Can you open the access to the syzbot report?
> 

Unfortunately I can't, but I asked the syzkaller team to run the reproducer
against upstream again and generate a publicly available report.
Roman Gushchin Oct. 22, 2024, 2:14 a.m. UTC | #6
On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > Fix it by moving the mlocked flag clearance down to
> > free_pages_prepare().
> 
> Urgh, I don't like this new reference to folio in free_pages_prepare().
> It feels like a layering violation.  I'll think about where else we
> could put this.

I agree, but it feels like it needs quite some work to do it in a nicer way,
no way it can be backported to older kernels. As for this fix, I don't
have better ideas...
Roman Gushchin Oct. 22, 2024, 2:16 a.m. UTC | #7
On Mon, Oct 21, 2024 at 12:49:28PM -0700, Hugh Dickins wrote:
> On Mon, 21 Oct 2024, Roman Gushchin wrote:
> 
> > Syzbot reported a bad page state problem caused by a page
> > being freed using free_page() still having a mlocked flag at
> > free_pages_prepare() stage:
> > 
> >   BUG: Bad page state in process syz.0.15  pfn:1137bb
> >   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff8881137bb870 pfn:0x1137bb
> >   flags: 0x400000000080000(mlocked|node=0|zone=1)
> >   raw: 0400000000080000 0000000000000000 dead000000000122 0000000000000000
> >   raw: ffff8881137bb870 0000000000000000 00000000ffffffff 0000000000000000
> >   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >   page_owner tracks the page as allocated
> >   page last allocated via order 0, migratetype Unmovable, gfp_mask
> >   0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 3005, tgid
> >   3004 (syz.0.15), ts 61546  608067, free_ts 61390082085
> >    set_page_owner include/linux/page_owner.h:32 [inline]
> >    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
> >    prep_new_page mm/page_alloc.c:1545 [inline]
> >    get_page_from_freelist+0x3008/0x31f0 mm/page_alloc.c:3457
> >    __alloc_pages_noprof+0x292/0x7b0 mm/page_alloc.c:4733
> >    alloc_pages_mpol_noprof+0x3e8/0x630 mm/mempolicy.c:2265
> >    kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
> >    kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
> >    kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5500 [inline]
> >    kvm_dev_ioctl+0x13bb/0x2320 virt/kvm/kvm_main.c:5542
> >    vfs_ioctl fs/ioctl.c:51 [inline]
> >    __do_sys_ioctl fs/ioctl.c:907 [inline]
> >    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
> >    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >    do_syscall_64+0x69/0x110 arch/x86/entry/common.c:83
> >    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   page last free pid 951 tgid 951 stack trace:
> >    reset_page_owner include/linux/page_owner.h:25 [inline]
> >    free_pages_prepare mm/page_alloc.c:1108 [inline]
> >    free_unref_page+0xcb1/0xf00 mm/page_alloc.c:2638
> >    vfree+0x181/0x2e0 mm/vmalloc.c:3361
> >    delayed_vfree_work+0x56/0x80 mm/vmalloc.c:3282
> >    process_one_work kernel/workqueue.c:3229 [inline]
> >    process_scheduled_works+0xa5c/0x17a0 kernel/workqueue.c:3310
> >    worker_thread+0xa2b/0xf70 kernel/workqueue.c:3391
> >    kthread+0x2df/0x370 kernel/kthread.c:389
> >    ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> >    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> > 
> > A reproducer is available here:
> > https://syzkaller.appspot.com/x/repro.c?x=1437939f980000
> > 
> > The problem was originally introduced by
> > commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final
> > clearance"): its handling focused on pagecache
> > and anonymous memory and wasn't suitable for lower-level
> > get_page()/free_page() APIs used, for example, by KVM, as with
> > this reproducer.
> > 
> > Fix it by moving the mlocked flag clearance down to
> > free_pages_prepare().
> > 
> > The bug itself is fairly old and harmless (aside from generating these
> > warnings).
> > 
> > Closes: https://syzkaller.appspot.com/x/report.txt?x=169a47d0580000
> > Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: <stable@vger.kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> Thanks Roman - I'd been preparing a similar patch, so agree that this is
> the right fix.

Thank you!

> I don't think there's any need to change your text, but
> let me remind us that any "Bad page" report stops that page from being
> allocated again (because it's in an undefined, potentially dangerous
> state): so does amount to a small memory leak even if otherwise harmless.

It looks like I need to post v3 as soon as I get a publicly available
syzkaller report, so I'll add this to the commit log.

Thanks!
Matthew Wilcox Oct. 22, 2024, 3:47 a.m. UTC | #8
On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > Fix it by moving the mlocked flag clearance down to
> > > free_pages_prepare().
> > 
> > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > It feels like a layering violation.  I'll think about where else we
> > could put this.
> 
> I agree, but it feels like it needs quite some work to do it in a nicer way,
> no way it can be backported to older kernels. As for this fix, I don't
> have better ideas...

Well, what is KVM doing that causes this page to get mapped to userspace?
Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
can tell is that it's freed with vfree().

Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
but I'd hate to spend a lot of time on it and then discover I was looking
at the wrong thing.

The reason I'm interested in looking in this direction is that we're
separating pages from folios.  Pages allocated through vmalloc() won't
have refcounts, mapcounts, mlock bits, etc.  So it's quite important to
look at currently existing code and figure out how they can be modified
to work in this new environment.
Roman Gushchin Oct. 22, 2024, 4:33 a.m. UTC | #9
On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > Fix it by moving the mlocked flag clearance down to
> > > > free_pages_prepare().
> > > 
> > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > It feels like a layering violation.  I'll think about where else we
> > > could put this.
> > 
> > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > no way it can be backported to older kernels. As for this fix, I don't
> > have better ideas...
> 
> Well, what is KVM doing that causes this page to get mapped to userspace?
> Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> can tell is that it's freed with vfree().
> 
> Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> but I'd hate to spend a lot of time on it and then discover I was looking
> at the wrong thing.

One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.

Here is the reproducer:

#define _GNU_SOURCE

#include <endian.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_mlock2
#define __NR_mlock2 325
#endif

uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};

#ifndef KVM_CREATE_VM
#define KVM_CREATE_VM 0xae01
#endif

#ifndef KVM_CREATE_VCPU
#define KVM_CREATE_VCPU 0xae41
#endif

int main(void)
{
  syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
          /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  intptr_t res = syscall(__NR_openat, /*fd=*/0xffffff9c, /*file=*/"/dev/kvm",
                /*flags=*/0, /*mode=*/0);
  if (res != -1)
    r[0] = res;
  res = syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/KVM_CREATE_VM, /*type=*/0ul);
  if (res != -1)
    r[1] = res;
  res = syscall(__NR_ioctl, /*fd=*/r[1], /*cmd=*/KVM_CREATE_VCPU, /*id=*/0ul);
  if (res != -1)
    r[2] = res;
  syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0xb36000ul,
          /*prot=PROT_SEM|PROT_WRITE|PROT_READ|PROT_EXEC*/ 0xful,
          /*flags=MAP_FIXED|MAP_SHARED*/ 0x11ul, /*fd=*/r[2], /*offset=*/0ul);
  syscall(__NR_mlock2, /*addr=*/0x20000000ul, /*size=*/0x400000ul,
          /*flags=*/0ul);
  syscall(__NR_mremap, /*addr=*/0x200ab000ul, /*len=*/0x1000ul,
          /*newlen=*/0x1000ul,
          /*flags=MREMAP_DONTUNMAP|MREMAP_FIXED|MREMAP_MAYMOVE*/ 7ul,
          /*newaddr=*/0x20ffc000ul);
  return 0;
}
Yosry Ahmed Oct. 22, 2024, 8:26 a.m. UTC | #10
On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > Fix it by moving the mlocked flag clearance down to
> > > > > free_pages_prepare().
> > > >
> > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > It feels like a layering violation.  I'll think about where else we
> > > > could put this.
> > >
> > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > no way it can be backported to older kernels. As for this fix, I don't
> > > have better ideas...
> >
> > Well, what is KVM doing that causes this page to get mapped to userspace?
> > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > can tell is that it's freed with vfree().
> >
> > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > but I'd hate to spend a lot of time on it and then discover I was looking
> > at the wrong thing.
>
> One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.

Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
returned by KVM_CREATE_VCPU we can access one of the following:
- vcpu->run
- vcpu->arch.pio_data
- vcpu->kvm->coalesced_mmio_ring
- a page returned by kvm_dirty_ring_get_page()

It doesn't seem like any of these are reclaimable, so why is mlock()'ing
them supported to begin with? Even if we don't want mlock() to err in
this case, shouldn't we just do nothing?

I see a lot of checks at the beginning of mlock_fixup() to check
whether we should operate on the vma; perhaps we should also check for
these KVM vmas? or maybe set VM_SPECIAL in kvm_vcpu_mmap()? I am not
sure tbh, but this doesn't seem right.

FWIW, I think moving the mlock clearing from __page_cache_release()
to free_pages_prepare() (or another common function in the page
freeing path) may be the right thing to do in its own right. I am just
wondering why we are not questioning the mlock() on the KVM vCPU
mapping to begin with.

Is there a use case for this that I am missing?

>
> Here is the reproducer:
>
> #define _GNU_SOURCE
>
> #include <endian.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mount.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #ifndef __NR_mlock2
> #define __NR_mlock2 325
> #endif
>
> uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
>
> #ifndef KVM_CREATE_VM
> #define KVM_CREATE_VM 0xae01
> #endif
>
> #ifndef KVM_CREATE_VCPU
> #define KVM_CREATE_VCPU 0xae41
> #endif
>
> int main(void)
> {
>   syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
>           /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>           /*offset=*/0ul);
>   intptr_t res = syscall(__NR_openat, /*fd=*/0xffffff9c, /*file=*/"/dev/kvm",
>                 /*flags=*/0, /*mode=*/0);
>   if (res != -1)
>     r[0] = res;
>   res = syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/KVM_CREATE_VM, /*type=*/0ul);
>   if (res != -1)
>     r[1] = res;
>   res = syscall(__NR_ioctl, /*fd=*/r[1], /*cmd=*/KVM_CREATE_VCPU, /*id=*/0ul);
>   if (res != -1)
>     r[2] = res;
>   syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0xb36000ul,
>           /*prot=PROT_SEM|PROT_WRITE|PROT_READ|PROT_EXEC*/ 0xful,
>           /*flags=MAP_FIXED|MAP_SHARED*/ 0x11ul, /*fd=*/r[2], /*offset=*/0ul);
>   syscall(__NR_mlock2, /*addr=*/0x20000000ul, /*size=*/0x400000ul,
>           /*flags=*/0ul);
>   syscall(__NR_mremap, /*addr=*/0x200ab000ul, /*len=*/0x1000ul,
>           /*newlen=*/0x1000ul,
>           /*flags=MREMAP_DONTUNMAP|MREMAP_FIXED|MREMAP_MAYMOVE*/ 7ul,
>           /*newaddr=*/0x20ffc000ul);
>   return 0;
> }
>
Sean Christopherson Oct. 22, 2024, 3:39 p.m. UTC | #11
On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > > Fix it by moving the mlocked flag clearance down to
> > > > > > free_page_prepare().
> > > > >
> > > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > > It feels like a layering violation.  I'll think about where else we
> > > > > could put this.
> > > >
> > > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > > no way it can be backported to older kernels. As for this fix, I don't
> > > > have better ideas...
> > >
> > > Well, what is KVM doing that causes this page to get mapped to userspace?
> > > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > > can tell is that it's freed with vfree().
> > >
> > > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > > but I'd hate to spend a lot of time on it and then discover I was looking
> > > at the wrong thing.
> >
> > One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.
> 
> Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
> returned by KVM_CREATE_VCPU we can access one of the following:
> - vcpu->run
> - vcpu->arch.pio_data
> - vcpu->kvm->coalesced_mmio_ring
> - a page returned by kvm_dirty_ring_get_page()
> 
> It doesn't seem like any of these are reclaimable,

Correct, these are all kernel allocated pages that KVM exposes to userspace to
facilitate bidirectional sharing of large chunks of data.

> why is mlock()'ing them supported to begin with?

Because no one realized it would be problematic, and KVM would have had to go out
of its way to prevent mlock().

> Even if we don't want mlock() to err in this case, shouldn't we just do
> nothing?

Ideally, yes.

> I see a lot of checks at the beginning of mlock_fixup() to check
> whether we should operate on the vma, perhaps we should also check for
> these KVM vmas?

Definitely not.  KVM may be doing something unexpected, but the VMA certainly
isn't unique enough to warrant mm/ needing dedicated handling.

Focusing on KVM is likely a waste of time.  There are probably other subsystems
and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
even if there aren't any other existing cases, nothing would prevent them from
coming along in the future.

> Maybe we could set VM_SPECIAL in kvm_vcpu_mmap()? I am not
> sure tbh, but this doesn't seem right.

Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
but setting VM_DONTEXPAND could theoretically break userspace, and other than
preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
to fudge around this outside of core mm/ feels kludgy and has the potential to
turn into a game of whack-a-mole.

> FWIW, I think moving the mlock clearing from __page_cache_release ()
> to free_pages_prepare() (or another common function in the page
> freeing path) may be the right thing to do in its own right. I am just
> wondering why we are not questioning the mlock() on the KVM vCPU
> mapping to begin with.
> 
> Is there a use case for this that I am missing?

Not that I know of, I suspect mlock() is allowed simply because it's allowed by
default.
Matthew Wilcox Oct. 22, 2024, 4:59 p.m. UTC | #12
On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> > Even if we don't want mlock() to err in this case, shouldn't we just do
> > nothing?
> 
> Ideally, yes.

Agreed.  There's no sense in having this count against the NR_MLOCK
stats, for example.

> > I see a lot of checks at the beginning of mlock_fixup() to check
> > whether we should operate on the vma, perhaps we should also check for
> > these KVM vmas?
> 
> Definitely not.  KVM may be doing something unexpected, but the VMA certainly
> isn't unique enough to warrant mm/ needing dedicated handling.
> 
> Focusing on KVM is likely a waste of time.  There are probably other subsystems
> and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
> good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
> even if there aren't any other existing cases, nothing would prevent them from
> coming along in the future.

They all need to be fixed.  How to do that is not an answer I have at
this point.  Ideally we can fix them without changing them all immediately
(but they will all need to be fixed eventually because pages will no
longer have a refcount and so get_page() will need to go away ...)

> > Maybe we could set VM_SPECIAL in kvm_vcpu_mmap()? I am not
> > sure tbh, but this doesn't seem right.
> 
> Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
> but setting VM_DONTEXPAND could theoretically break userspace, and other than
> preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
> any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
> to fudge around this outside of core mm/ feels kludgy and has the potential to
> turn into a game of whack-a-mole.

Actually, VM_PFNMAP is probably ideal.  We're not really mapping pages
here (I mean, they are pages, but they're not filesystem pages or
anonymous pages ... there's no rmap to them).  We're mapping blobs of
memory whose refcount is controlled by the vma that maps them.  We don't
particularly want to be able to splice() this memory, or do RDMA to it.
We probably do want gdb to be able to read it (... yes?) which might be
a complication with a PFNMAP VMA.

We've given a lot of flexibility to device drivers about how they
implement mmap() and I think that's now getting in the way of some
important improvements.  I want to see a simpler way of providing the
same functionality, and I'm not quite there yet.
Sean Christopherson Oct. 22, 2024, 7:52 p.m. UTC | #13
On Tue, Oct 22, 2024, Matthew Wilcox wrote:
> On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> > > Maybe we could set VM_SPECIAL in kvm_vcpu_mmap()? I am not
> > > sure tbh, but this doesn't seem right.
> > 
> > Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
> > but setting VM_DONTEXPAND could theoretically break userspace, and other than
> > preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
> > any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
> > to fudge around this outside of core mm/ feels kludgy and has the potential to
> > turn into a game of whack-a-mole.
> 
> Actually, VM_PFNMAP is probably ideal.  We're not really mapping pages
> here (I mean, they are pages, but they're not filesystem pages or
> anonymous pages ... there's no rmap to them).  We're mapping blobs of
> memory whose refcount is controlled by the vma that maps them.  We don't
> particularly want to be able to splice() this memory, or do RDMA to it.
> We probably do want gdb to be able to read it (... yes?)

More than likely, yes.  And we probably want the pages to show up in core dumps,
and be gup()-able.  I think that's the underlying problem with KVM's pages.  In
many cases, we want them to show up as vm_normal_page() pages.  But for a few
things, e.g. mlock(), it's nonsensical because they aren't entirely normal, just
mostly normal.

> which might be a complication with a PFNMAP VMA.
> 
> We've given a lot of flexibility to device drivers about how they
> implement mmap() and I think that's now getting in the way of some
> important improvements.  I want to see a simpler way of providing the
> same functionality, and I'm not quite there yet.
Roman Gushchin Oct. 23, 2024, 2:04 a.m. UTC | #14
On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> > On Mon, Oct 21, 2024 at 9:33 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Oct 22, 2024 at 04:47:19AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 22, 2024 at 02:14:39AM +0000, Roman Gushchin wrote:
> > > > > On Mon, Oct 21, 2024 at 09:34:24PM +0100, Matthew Wilcox wrote:
> > > > > > On Mon, Oct 21, 2024 at 05:34:55PM +0000, Roman Gushchin wrote:
> > > > > > > > Fix it by moving the mlocked flag clearance down to
> > > > > > > > free_pages_prepare().
> > > > > >
> > > > > > Urgh, I don't like this new reference to folio in free_pages_prepare().
> > > > > > It feels like a layering violation.  I'll think about where else we
> > > > > > could put this.
> > > > >
> > > > > I agree, but it feels like it needs quite some work to do it in a nicer way,
> > > > > no way it can be backported to older kernels. As for this fix, I don't
> > > > > have better ideas...
> > > >
> > > > Well, what is KVM doing that causes this page to get mapped to userspace?
> > > > Don't tell me to look at the reproducer as it is 403 Forbidden.  All I
> > > > can tell is that it's freed with vfree().
> > > >
> > > > Is it from kvm_dirty_ring_get_page()?  That looks like the obvious thing,
> > > > but I'd hate to spend a lot of time on it and then discover I was looking
> > > > at the wrong thing.
> > >
> > > One of the pages is vcpu->run, others belong to kvm->coalesced_mmio_ring.
> > 
> > Looking at kvm_vcpu_fault(), it seems like after mmap'ing the fd
> > returned by KVM_CREATE_VCPU we can access one of the following:
> > - vcpu->run
> > - vcpu->arch.pio_data
> > - vcpu->kvm->coalesced_mmio_ring
> > - a page returned by kvm_dirty_ring_get_page()
> > 
> > It doesn't seem like any of these are reclaimable,
> 
> Correct, these are all kernel allocated pages that KVM exposes to userspace to
> facilitate bidirectional sharing of large chunks of data.
> 
> > why is mlock()'ing them supported to begin with?
> 
> Because no one realized it would be problematic, and KVM would have had to go out
> of its way to prevent mlock().
> 
> > Even if we don't want mlock() to err in this case, shouldn't we just do
> > nothing?
> 
> Ideally, yes.
> 
> > I see a lot of checks at the beginning of mlock_fixup() to check
> > whether we should operate on the vma, perhaps we should also check for
> > these KVM vmas?
> 
> Definitely not.  KVM may be doing something unexpected, but the VMA certainly
> isn't unique enough to warrant mm/ needing dedicated handling.
> 
> Focusing on KVM is likely a waste of time.  There are probably other subsystems
> and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
> good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
> even if there aren't any other existing cases, nothing would prevent them from
> coming along in the future.

Yeah, I also think so.
It seems that bpf/ringbuf.c contains another example. There are likely more.

So I think we either have to fix it as proposed or fix it on the mlock() side.
Sean Christopherson Nov. 6, 2024, 1:09 a.m. UTC | #15
On Tue, Oct 22, 2024, Roman Gushchin wrote:
> On Mon, Oct 21, 2024 at 12:49:28PM -0700, Hugh Dickins wrote:
> > On Mon, 21 Oct 2024, Roman Gushchin wrote:
> > I don't think there's any need to change your text, but
> > let me remind us that any "Bad page" report stops that page from being
> > allocated again (because it's in an undefined, potentially dangerous
> > state): so does amount to a small memory leak even if otherwise harmless.
> 
> It looks like I need to post v3 as soon as I get a publicly available
> syzkaller report, so I'll add this to the commit log.

Today is your lucky day :-)

https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com
Roman Gushchin Nov. 6, 2024, 1:32 a.m. UTC | #16
On Tue, Nov 05, 2024 at 05:09:13PM -0800, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Roman Gushchin wrote:
> > On Mon, Oct 21, 2024 at 12:49:28PM -0700, Hugh Dickins wrote:
> > > On Mon, 21 Oct 2024, Roman Gushchin wrote:
> > > I don't think there's any need to change your text, but
> > > let me remind us that any "Bad page" report stops that page from being
> > > allocated again (because it's in an undefined, potentially dangerous
> > > state): so does amount to a small memory leak even if otherwise harmless.
> > 
> > It looks like I need to post v3 as soon as I get a publicly available
> > syzkaller report, so I'll add this to the commit log.
> 
> Today is your lucky day :-)

I've been waiting for it for a long time :)
Thanks for forwarding it my way!

I'm still not sure what the conclusion of our discussion was. My understanding
is that my fix is not that pretty, but there are no better immediate ideas, only
long-term improvement projects. Does that match everybody else's understanding?

If so, I'll prepare a v3 with an updated link. Otherwise, please, let me know.

Thanks!
Hugh Dickins Nov. 6, 2024, 2:19 a.m. UTC | #17
On Wed, 6 Nov 2024, Roman Gushchin wrote:
> On Tue, Nov 05, 2024 at 05:09:13PM -0800, Sean Christopherson wrote:
> > On Tue, Oct 22, 2024, Roman Gushchin wrote:
> > > On Mon, Oct 21, 2024 at 12:49:28PM -0700, Hugh Dickins wrote:
> > > > On Mon, 21 Oct 2024, Roman Gushchin wrote:
> > > > I don't think there's any need to change your text, but
> > > > let me remind us that any "Bad page" report stops that page from being
> > > > allocated again (because it's in an undefined, potentially dangerous
> > > > state): so does amount to a small memory leak even if otherwise harmless.
> > > 
> > > It looks like I need to post v3 as soon as I get a publicly available
> > > syzkaller report, so I'll add this to the commit log.
> > 
> > Today is your lucky day :-)
> 
> I've been waiting for it for a long time :)
> Thanks for forwarding it my way!
> 
> I'm still not sure what the conclusion of our discussion was. My understanding
> is that my fix is not that pretty, but there are no better immediate ideas, only
> long-term improvement projects. Does that match everybody else's understanding?

Yes, that matches my understanding, and my Acked-by stands:
thanks a lot for keeping on this, Roman and Sean.

Hugh

> 
> If so, I'll prepare a v3 with an updated link. Otherwise, please, let me know.
> 
> Thanks!
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc55d39eb372..7535d78862ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1044,6 +1044,7 @@  __always_inline bool free_pages_prepare(struct page *page,
 	bool skip_kasan_poison = should_skip_kasan_poison(page);
 	bool init = want_init_on_free();
 	bool compound = PageCompound(page);
+	struct folio *folio = page_folio(page);
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -1053,6 +1054,20 @@  __always_inline bool free_pages_prepare(struct page *page,
 	if (memcg_kmem_online() && PageMemcgKmem(page))
 		__memcg_kmem_uncharge_page(page, order);
 
+	/*
+	 * In rare cases, when truncation or holepunching raced with
+	 * munlock after VM_LOCKED was cleared, Mlocked may still be
+	 * found set here.  This does not indicate a problem, unless
+	 * "unevictable_pgs_cleared" appears worryingly large.
+	 */
+	if (unlikely(folio_test_mlocked(folio))) {
+		long nr_pages = folio_nr_pages(folio);
+
+		__folio_clear_mlocked(folio);
+		zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
+		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
+	}
+
 	if (unlikely(PageHWPoison(page)) && !order) {
 		/* Do not let hwpoison pages hit pcplists/buddy */
 		reset_page_owner(page, order);
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..7cd0f4719423 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,20 +78,6 @@  static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
 		lruvec_del_folio(*lruvecp, folio);
 		__folio_clear_lru_flags(folio);
 	}
-
-	/*
-	 * In rare cases, when truncation or holepunching raced with
-	 * munlock after VM_LOCKED was cleared, Mlocked may still be
-	 * found set here.  This does not indicate a problem, unless
-	 * "unevictable_pgs_cleared" appears worryingly large.
-	 */
-	if (unlikely(folio_test_mlocked(folio))) {
-		long nr_pages = folio_nr_pages(folio);
-
-		__folio_clear_mlocked(folio);
-		zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
-		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
-	}
 }
 
 /*