diff mbox series

[1/1] mm: disable CONFIG_PER_VMA_LOCK by default until it's fixed

Message ID 20230703182150.2193578-1-surenb@google.com (mailing list archive)
State New
Headers show
Series [1/1] mm: disable CONFIG_PER_VMA_LOCK by default until it's fixed | expand

Commit Message

Suren Baghdasaryan July 3, 2023, 6:21 p.m. UTC
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86.
Disable per-VMA locks config to prevent this issue while the problem is
being investigated. This is expected to be a temporary measure.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Rientjes July 3, 2023, 8:07 p.m. UTC | #1
On Mon, 3 Jul 2023, Suren Baghdasaryan wrote:

> A memory corruption was reported in [1] with bisection pointing to the
> patch [2] enabling per-VMA locks for x86.
> Disable per-VMA locks config to prevent this issue while the problem is
> being investigated. This is expected to be a temporary measure.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
> 
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Acked-by: David Rientjes <rientjes@google.com>

Thanks for the heads up!  The bisect commit [2] is a no-op with 
CONFIG_PER_VMA_LOCK disabled, so this looks good.

Nit: in that patch the "done" label could have been a
"done: __maybe_unused"
to avoid the #ifdef :P

> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..de94b2497600 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
>         def_bool n
>  
>  config PER_VMA_LOCK
> -	def_bool y
> +	bool "Enable per-vma locking during page fault handling."
>  	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
>  	help
>  	  Allow per-vma locking during page fault handling.
> -- 
> 2.41.0.255.g8b1d071c50-goog
> 
>
David Hildenbrand July 3, 2023, 8:30 p.m. UTC | #2
On 03.07.23 20:21, Suren Baghdasaryan wrote:
> A memory corruption was reported in [1] with bisection pointing to the
> patch [2] enabling per-VMA locks for x86.
> Disable per-VMA locks config to prevent this issue while the problem is
> being investigated. This is expected to be a temporary measure.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
> 
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>   mm/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..de94b2497600 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
>          def_bool n
>   
>   config PER_VMA_LOCK
> -	def_bool y
> +	bool "Enable per-vma locking during page fault handling."
>   	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
>   	help
>   	  Allow per-vma locking during page fault handling.

As raised at LSF/MM, I was "surprised" that we can now handle page faults
concurrent to fork() and was expecting something to be broken already.

What probably happens is that we wr-protected the page in the parent process and
COW-shared an anon page with the child using copy_present_pte().

But we only flush the parent MM TLB before we drop the parent MM lock in
dup_mmap().


If we get a write-fault before that TLB flush in the parent, and we end up
replacing that anon page in the parent process in do_wp_page() [because it is COW-shared with the child],
this might be problematic: some stale writable TLB entries can target the wrong (old) page.


We had similar issues in the past with userfaultfd; see the comment at the beginning of do_wp_page():


	if (likely(!unshare)) {
		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return handle_userfault(vmf, VM_UFFD_WP);
		}

		/*
		 * Userfaultfd write-protect can defer flushes. Ensure the TLB
		 * is flushed in this case before copying.
		 */
		if (unlikely(userfaultfd_wp(vmf->vma) &&
			     mm_tlb_flush_pending(vmf->vma->vm_mm)))
			flush_tlb_page(vmf->vma, vmf->address);
	}


We really should not allow page faults concurrent to fork() without further investigation.
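
For illustration, a minimal userspace sketch of the interleaving described
above (a hypothetical stress test, not the bugzilla reproducer; whether it
actually trips the race is timing-dependent). One thread keeps write-faulting
a private anonymous mapping and verifying its own writes while the main
thread forks in a loop:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPAGES 512

static volatile long *buf;
static long words_per_page;

/* Writer: stamp a round counter into every page, then read it back.
 * A write lost to a stale writable TLB entry (landing in the old page
 * after do_wp_page() replaced it) would show up as a mismatch. */
static void *hammer(void *arg)
{
	for (long round = 1; ; round++) {
		for (int i = 0; i < NPAGES; i++)
			buf[i * words_per_page] = round;
		for (int i = 0; i < NPAGES; i++)
			if (buf[i * words_per_page] != round) {
				fprintf(stderr, "corruption: page %d round %ld\n",
					i, round);
				abort();
			}
	}
	return NULL;
}

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	pthread_t t;

	words_per_page = pagesz / sizeof(long);
	buf = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	pthread_create(&t, NULL, hammer, NULL);

	/* fork() wr-protects the parent's PTEs; the writer's faults race
	 * with the deferred TLB flush in dup_mmap(). */
	for (int i = 0; i < 100000; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	return 0;
}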
Suren Baghdasaryan July 4, 2023, 5:39 a.m. UTC | #3
On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> As raised at LSF/MM, I was "surprised" that we can now handle page faults
> concurrent to fork() and was expecting something to be broken already.
>
> What probably happens is that we wr-protected the page in the parent process and
> COW-shared an anon page with the child using copy_present_pte().
>
> But we only flush the parent MM tlb before we drop the parent MM lock in
> dup_mmap().
>
>
> If we get a write-fault before that TLB flush in the parent, and we end up
> replacing that anon page in the parent process in do_wp_page() [because, COW-shared with the child],
> this might be problematic: some stale writable TLB entries can target the wrong (old) page.

Hi David,
Thanks for the detailed explanation. Let me check if this is indeed
what's happening here. If that's indeed the cause, I think we can
write-lock the VMAs being dup'ed until the TLB is flushed and
mmap_write_unlock(oldmm) unlocks them all and lets page faults
proceed. If that works, we will at least know the reason for the memory
corruption.
Thanks,
Suren.

> [...]
Suren Baghdasaryan July 4, 2023, 6:50 a.m. UTC | #4
On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> Hi David,
> Thanks for the detailed explanation. Let me check if this is indeed
> what's happening here. If that's indeed the cause, I think we can
> write-lock the VMAs being dup'ed until the TLB is flushed and
> mmap_write_unlock(oldmm) unlocks them all and lets page faults to
> proceed. If that works we at least will know the reason for the memory
> corruption.

Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:

        for_each_vma(old_vmi, mpnt) {
                struct file *file;

+               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
                        continue;
                }

At least the reproducer at
https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
I wonder if that's the best way to fix this. It's surely simple but
locking every VMA is not free and doing that on every fork might
regress performance.

> Thanks,
> Suren.
>
> >
> >
> > We had similar issues in the past with userfaultfd, see the comment at the beginning of do_wp_page():
> >
> >
> >         if (likely(!unshare)) {
> >                 if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> >                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >                         return handle_userfault(vmf, VM_UFFD_WP);
> >                 }
> >
> >                 /*
> >                  * Userfaultfd write-protect can defer flushes. Ensure the TLB
> >                  * is flushed in this case before copying.
> >                  */
> >                 if (unlikely(userfaultfd_wp(vmf->vma) &&
> >                              mm_tlb_flush_pending(vmf->vma->vm_mm)))
> >                         flush_tlb_page(vmf->vma, vmf->address);
> >         }

If do_wp_page() could identify that vmf->vma is being copied, we could
simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
which would block until dup_mmap() is done... Maybe we could use
mm_tlb_flush_pending() for that? WDYT?
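
A hedged sketch of that idea (not a tested patch): it assumes dup_mmap()
would be changed to publish the pending flush on the parent MM, e.g. via
inc_tlb_flush_pending()/dec_tlb_flush_pending(), which fork() does not do
today. The VMA-locked path could then bail out and let the fault be retried
under mmap_lock, which serializes against dup_mmap():

	/* Hypothetical check early in do_wp_page(): */
	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) &&
	    mm_tlb_flush_pending(vmf->vma->vm_mm)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		vma_end_read(vmf->vma);
		return VM_FAULT_RETRY;	/* retried under mmap_lock */
	}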

> >
> >
> > We really should not allow page faults concurrent to fork() without further investigation.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
David Hildenbrand July 4, 2023, 7:18 a.m. UTC | #5
On 04.07.23 08:50, Suren Baghdasaryan wrote:
> [...]
> 
> Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:
> 
>          for_each_vma(old_vmi, mpnt) {
>                  struct file *file;
> 
> +               vma_start_write(mpnt);
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>                         vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>                          continue;
>                 }
> 
> At least the reproducer at
> https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> I wonder if that's the best way to fix this. It's surely simple but
> locking every VMA is not free and doing that on every fork might
> regress performance.


That would mean that we can possibly still get page faults concurrent to 
fork(), on the yet unprocessed part. While that fixes the issue at hand, 
I cannot reliably tell if this doesn't mess with some other fork() 
corner case.

I'd suggest write-locking all VMAs upfront, before doing any kind of 
fork-mm operation. Just like the old code did. See below.
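
(The code that "See below" refers to was trimmed from this archive.) A
minimal sketch of the upfront variant, using a hypothetical helper that
dup_mmap() would call under mmap_write_lock(oldmm), before copying any VMA:

static void vma_write_lock_all(struct mm_struct *oldmm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, oldmm, 0);

	/* Caller holds mmap_write_lock(oldmm). The per-VMA write locks
	 * are only released by mmap_write_unlock(oldmm), so page faults
	 * stay blocked for the whole duration of dup_mmap(). */
	for_each_vma(vmi, vma)
		vma_start_write(vma);
}
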

> 
>> Thanks,
>> Suren.
>>
>>>
>>>
>>> We had similar issues in the past with userfaultfd, see the comment at the beginning of do_wp_page():
>>>
>>>
>>>          if (likely(!unshare)) {
>>>                  if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>>>                          pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>                          return handle_userfault(vmf, VM_UFFD_WP);
>>>                  }
>>>
>>>                  /*
>>>                   * Userfaultfd write-protect can defer flushes. Ensure the TLB
>>>                   * is flushed in this case before copying.
>>>                   */
>>>                  if (unlikely(userfaultfd_wp(vmf->vma) &&
>>>                               mm_tlb_flush_pending(vmf->vma->vm_mm)))
>>>                          flush_tlb_page(vmf->vma, vmf->address);
>>>          }
> 
> If do_wp_page() could identify that vmf->vma is being copied, we could
> simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
> which would block until dup_mmap() is done... Maybe we could use
> mm_tlb_flush_pending() for that? WDYT?

I'm not convinced that we should be making that code more complicated 
simply to speed up fork() with concurrent page faults.

My gut feeling is that many operations that could possibly take the VMA 
lock in the future (page pinning triggering unsharing) should not run 
concurrently with fork().

So IMHO, keep the old behavior of fork() -- i.e., no concurrent page 
faults -- and unlock that eventually in the future when deemed really 
required (but people should really avoid fork() in performance-sensitive 
applications if not absolutely required).
Suren Baghdasaryan July 4, 2023, 7:34 a.m. UTC | #6
On Tue, Jul 4, 2023 at 12:18 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
> >
> > If do_wp_page() could identify that vmf->vma is being copied, we could
> > simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
> > which would block until dup_mmap() is done... Maybe we could use
> > mm_tlb_flush_pending() for that? WDYT?
>
> I'm not convinced that we should be making that code more complicated
> simply to speed up fork() with concurrent page faults.
>
> My gut feeling is that many operations that could possible take the VMA
> lock in the future (page pinning triggering unsharing) should not run
> concurrent with fork().
>
> So IMHO, keep the old behavior of fork() -- i.e., no concurrent page
> faults -- and unlock that eventually in the future when deemed really
> required (but people should really avoid fork() in performance-sensitive
> applications if not absolutely required).

Thanks for the input, David. Yeah, that sounds reasonable. I'll test
some more tomorrow morning and if everything looks good will post a
patch to lock the VMAs and another one to re-enable
CONFIG_PER_VMA_LOCK.
Thanks for all the help!
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand July 4, 2023, 8:03 a.m. UTC | #7
On 04.07.23 09:34, Suren Baghdasaryan wrote:
> [...]
> 
> Thanks for the input, David. Yeah, that sounds reasonable. I'll test
> some more tomorrow morning and if everything looks good will post a
> patch to lock the VMAs and another one to re-enable
> CONFIG_PER_VMA_LOCK.
> Thanks for all the help!

Fortunately, I spotted fork() in the reproducer and remembered that 
there is something nasty about COW page replacement and TLB flushes :)

Can we avoid temporarily disabling the per-VMA lock with a simple "lock 
all VMAs" patch, or is that patch (here) already upstream/on its way 
upstream?
Thorsten Leemhuis July 4, 2023, 8:12 a.m. UTC | #8
[CCing the regression list]

On 03.07.23 20:21, Suren Baghdasaryan wrote:
> A memory corruption was reported in [1] with bisection pointing to the
> patch [2] enabling per-VMA locks for x86.
> Disable per-VMA locks config to prevent this issue while the problem is
> being investigated. This is expected to be a temporary measure.

I have to wonder: is disabling by default sufficient here? Sure, it's a
new feature, so once this change is merged and backported to 6.4.y it's
not a regression for anyone who does the jump from older kernels to
6.4.y and runs oldconfig.

But how about those that did the jump already or will do so before the
fix is backported (it's possible that Arch Linux and openSUSE Tumbleweed
do this; and there is a certain chance that Fedora already has
CONFIG_PER_VMA_LOCK enabled in their 6.4 configs, too)? This sounds to
me like many of those will have CONFIG_PER_VMA_LOCK enabled in their
configs now. And unless I'm missing something, switching the default
won't turn it off next time they run make oldconfig -- or will it? So
for them the regression won't be fixed (unless they fiddle manually with
their configuration, but they shouldn't have to do that to fix a regression).

Or is my logic mistaken somewhere?

> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624

Side note: that should be a proper Link: or Closes: tag as described in
Documentation/process/submitting-patches.rst or
Documentation/process/5.Posting.rst -- and it should be close to the
Reported-by: tag (checkpatch.pl should have mentioned that).

E.g. like this:

> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
> 
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/lkml/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 [1]
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

> [...]

Ciao, Thorsten
Hans de Goede July 4, 2023, 8:18 a.m. UTC | #9
Hi Suren,

Thank you for your patch.

On 7/3/23 20:21, Suren Baghdasaryan wrote:
> A memory corruption was reported in [1] with bisection pointing to the
> patch [2] enabling per-VMA locks for x86.
> Disable per-VMA locks config to prevent this issue while the problem is
> being investigated. This is expected to be a temporary measure.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
> 
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..de94b2497600 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
>         def_bool n
>  
>  config PER_VMA_LOCK
> -	def_bool y
> +	bool "Enable per-vma locking during page fault handling."
>  	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
>  	help
>  	  Allow per-vma locking during page fault handling.


This does not disable the option; it only makes it user-selectable,
and for users with an existing .config which already has it enabled,
it changes nothing.

IMHO you should add a "depends on BROKEN" here until this is fixed,
so that this really gets disabled.
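
For reference, a sketch of what that could look like in mm/Kconfig (BROKEN
is never set in normal configs, so this would force the option off):

config PER_VMA_LOCK
	def_bool y
	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP && BROKEN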

Or maybe just revert 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
for now?

Regards,

Hans
Hans de Goede July 4, 2023, 8:30 a.m. UTC | #10
Hi,

On 7/4/23 10:12, Linux regression tracking (Thorsten Leemhuis) wrote:
> [CCing the regression list]
> 
> [...]
> 
> I have to wonder: is disabling by default sufficient here? Sure, it's a
> new feature, so once this change is merged and backported to 6.4.y it's
> not regression for anyone that will do the jump from older kernels to
> 6.4.y and run oldconfig.
> 
> But how about those that did the jump already or will do so before the
> fix is backported (it's possible that Arch Linux and openSUSE Tumblweed
> do this; and there is a certain chance that Fedora already has
> CONFIG_PER_VMA_LOCK enabled in their 6.4 configs, too)? This sounds to
> me like many of those will have CONFIG_PER_VMA_LOCK enabled in their
> configs now. And unless I'm missing something switching the default
> won't turn it off next time they run make oldconfig -- or will it? So
> for them the regression won't be fixed (unless they fiddle manually with
> their configuration, but they shouldn't do so to fix a regression).
> 
> Or is my logic mistaken somewhere?

Your logic is correct. I just tried Suren's patch with my local tree and
it did not change a thing.

If you have an existing .config this patch does nothing.

Regards,

Hans
Matthew Wilcox July 4, 2023, 1:07 p.m. UTC | #11
On Tue, Jul 04, 2023 at 09:18:18AM +0200, David Hildenbrand wrote:
> > At least the reproducer at
> > https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> > I wonder if that's the best way to fix this. It's surely simple but
> > locking every VMA is not free and doing that on every fork might
> > regress performance.
> 
> 
> That would mean that we can possibly still get page faults concurrent to
> fork(), on the yet unprocessed part. While that fixes the issue at hand, I
> cannot reliably tell if this doesn't mess with some other fork() corner
> case.
> 
> I'd suggest write-locking all VMAs upfront, before doing any kind of fork-mm
> operation. Just like the old code did. See below.

Calling fork() from a multi-threaded program is fraught with danger.
It's a rare thing to do, and we don't need to optimise for it.  It
does, of course, need to not crash.  But we can slow it down as much as
we want to.  Slowing down single-threaded programs calling fork is
much less acceptable.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
Suren Baghdasaryan July 4, 2023, 3:24 p.m. UTC | #12
On Tue, Jul 4, 2023 at 1:18 AM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi Suren,
>
> Thank you for your patch.
>
> [...]
>
>
> This does not disable the option it only makes it user selectable
> and for users with an existing .config which already has this
> it changes nothing.

Hmm. Yes, I didn't think about the case where the kernel would be updated
and the .config would not...

>
> IMHO you should add a "depends on BROKEN" here until this is fixed,
> so that this really gets disabled.

Agree, that would be a sure way to disable it. I'll use it if the
proper fix still does not work. Thanks!

>
> Or maybe just revert 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> for now?

That would disable it only for x86 while keeping it for all other
supported platforms, so we would have to revert several patches. I
wanted the simplest way to temporarily disable the feature until a fix
is in place.
Let me test it a bit and if that works I'll send patches to fix and
re-enable CONFIG_PER_VMA_LOCK instead of disabling it.
Thanks,
Suren.

>
> Regards,
>
> Hans
>
>
Suren Baghdasaryan July 4, 2023, 5:21 p.m. UTC | #13
On Tue, Jul 4, 2023 at 6:07 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> [...]
>
> Calling fork() from a multi-threaded program is fraught with danger.
> It's a rare thing to do, and we don't need to optimise for it.  It
> does, of course, need to not crash.  But we can slow it down as much as
> we want to.  Slowing down single-threaded programs calling fork is
> much less acceptable.

Hmm. Would you suggest we use different approaches for multi-threaded
vs single-threaded programs?
I think locking VMAs while forking a process which has lots of VMAs
will regress by some amount (we are adding non-zero work). The
question is if that's acceptable or we have to implement something
different. I verified that the solution fixes the issue shown by the
reproducer; now I'm trying to quantify the fork performance
regression I suspect we will introduce.

>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
David Hildenbrand July 4, 2023, 5:36 p.m. UTC | #14
On 04.07.23 19:21, Suren Baghdasaryan wrote:
> [...]
> 
> Hmm. Would you suggest we use different approaches for multi-threaded
> vs single-threaded programs?
> I think locking VMAs while forking a process which has lots of VMAs
> will regress by some amount (we are adding non-zero work). The
> question is if that's acceptable or we have to implement something
> different. I verified that solution fixes the issue shown by the
> reproducer, now I'm trying to quantify this fork performance
> regression I suspect we will introduce.

Well, the design decision that CONFIG_PER_VMA_LOCK made for now is to make 
page faults fast and to make blocking all page faults from happening 
slower (unless there is some easy way that's already built in).

So it wouldn't surprise me if it might affect performance a bit, but 
it's to be quantified if it really matters in comparison to all the page 
table copying and other stuff we do during fork.

Maybe that can be optimized/sped up later. But for now we should fix 
this the straightforward way. That fix will be (and has to be) a NOP for 
!CONFIG_PER_VMA_LOCK, so that one won't be affected.

Maybe this patch in an adjusted form would still make sense (also to be 
backported), to keep the feature inactive by default until it stabilizes 
a bit more.
Matthew Wilcox July 4, 2023, 5:55 p.m. UTC | #15
On Tue, Jul 04, 2023 at 10:21:12AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 4, 2023 at 6:07 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Calling fork() from a multi-threaded program is fraught with danger.
> > It's a rare thing to do, and we don't need to optimise for it.  It
> > does, of course, need to not crash.  But we can slow it down as much as
> > we want to.  Slowing down single-threaded programs calling fork is
> > much less acceptable.
> 
> Hmm. Would you suggest we use different approaches for multi-threaded
> vs single-threaded programs?
> I think locking VMAs while forking a process which has lots of VMAs
> will regress by some amount (we are adding non-zero work). The
> question is if that's acceptable or we have to implement something
> different. I verified that solution fixes the issue shown by the
> reproducer, now I'm trying to quantify this fork performance
> regression I suspect we will introduce.

It might make sense to do that.  Personally, I'd try to quantify it
with a make -jN build of the kernel.  That workload is fork-heavy with
single-threaded processes, and if it doesn't show much difference, I think
we're good.
Suren Baghdasaryan July 4, 2023, 5:56 p.m. UTC | #16
On Tue, Jul 4, 2023 at 10:36 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
> >
> > Hmm. Would you suggest we use different approaches for multi-threaded
> > vs single-threaded programs?
> > I think locking VMAs while forking a process which has lots of VMAs
> > will regress by some amount (we are adding non-zero work). The
> > question is if that's acceptable or we have to implement something
> > different. I verified that solution fixes the issue shown by the
> > reproducer, now I'm trying to quantify this fork performance
> > regression I suspect we will introduce.
>
> Well, the design decision that CONFIG_PER_VMA_LOCK made for now to make
> page faults fast and to make blocking any page faults from happening to
> be slower (unless there is some easy way that's already built in).
>
> So it wouldn't surprise me if it might affect performance a bit, but
> it's to be quantified if it really matters in comparison to all the page
> table copying and other stuff we do during fork.
>
> Maybe that can be optimized/sped up later. But for now we should fix
> this the straightforward way. That fix will be (and has to be) a NOP for
> !CONFIG_PER_VMA_LOCK, so that one won't be affected.
>
> Maybe this patch in an adjusted form would still make sense (also to be
> backported), to keep the feature inactive as default until it stabilized
> a bit more.

Ok, IIUC your suggestion is to use the VMA-lock-on-fork fix even if
fork() regresses and to keep CONFIG_PER_VMA_LOCK disabled by default
until it's more stable. That sounds good to me. With that fix, do we
still need to add the BROKEN dependency? I'm guessing it would be
safer to disable for sure.

>
> --
> Cheers,
>
> David / dhildenb
>
Suren Baghdasaryan July 4, 2023, 5:58 p.m. UTC | #17
On Tue, Jul 4, 2023 at 10:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> [...]
>
> It might make sense to do that.  Personally, I'd try to quantify it
> with a make -jN build of the kernel.  That workload is fork-heavy of
> single threaded processes, and if it doesn't show much difference, I think
> we're good.

That's a good idea. I wrote a test that mmaps a large number of VMAs and
times the forks, but I'll run your suggested test as well. Thanks!
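
A hypothetical sketch of such a micro-benchmark (not Suren's actual test):
map many single-page VMAs with alternating protections so they cannot be
merged, then time a batch of fork()+wait cycles with and without the
vma_start_write() change:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NVMAS 10000
#define NFORKS 100

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;

	/* Alternating protections prevent the kernel from merging the
	 * mappings into a single VMA. */
	for (int i = 0; i < NVMAS; i++) {
		int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;

		if (mmap(NULL, pagesz, prot,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NFORKS; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%d forks with %d VMAs: %.3f s\n", NFORKS, NVMAS,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}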
David Hildenbrand July 4, 2023, 6:01 p.m. UTC | #18
On 04.07.23 09:34, Suren Baghdasaryan wrote:
> [...]
>>
>> That would mean that we can possibly still get page faults concurrent to
>> fork(), on the yet unprocessed part. While that fixes the issue at hand,
>> I cannot reliably tell if this doesn't mess with some other fork()
>> corner case.
>>
>> I'd suggest write-locking all VMAs upfront, before doing any kind of
>> fork-mm operation. Just like the old code did. See below.

Maybe we could get away with not locking VM_MAYSHARE or VM_DONTCOPY 
VMAs. Possibly also when there are no other threads.

But at least to me it feels safer to defer any such optimizations, and 
to see if they are really required.

If there are no other threads, at least there will not be contention on 
the VMA locks. And if there are other threads, we used to have 
contention on the mmap lock already.
David Hildenbrand July 4, 2023, 6:05 p.m. UTC | #19
On 04.07.23 19:56, Suren Baghdasaryan wrote:
> [...]
> 
> Ok, IIUC your suggestion is to use VMA-lock-on-fork fix even if the
> fork() regresses and keep CONFIG_PER_VMA_LOCK disabled by default
> until it's more stable. That sounds good to me. With that fix, do we
> still need to add the BROKEN dependency? I'm guessing it would be
> safer to disable for sure.

With this fixed, I don't think we need a BROKEN dependency.

I'll let you decide if you want to keep it enabled by default; I'd
rather disable it for one release and enable it by default later.

Happy to learn whether taking all VMA locks without any contention
causes a lot of harm.
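
The "NOP for !CONFIG_PER_VMA_LOCK" requirement quoted above is the
usual stub-helper pattern; roughly (an illustrative shape, not the
exact mainline definitions):

#ifdef CONFIG_PER_VMA_LOCK
/* Real implementation: marks the VMA write-locked until mmap_lock is
 * released, so concurrent per-VMA-lock page faults back off. */
void vma_start_write(struct vm_area_struct *vma);
#else
/* Empty stub that compiles away, leaving fork() unchanged. */
static inline void vma_start_write(struct vm_area_struct *vma) {}
#endif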
Suren Baghdasaryan July 4, 2023, 7:11 p.m. UTC | #20
On Tue, Jul 4, 2023 at 11:05 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.07.23 19:56, Suren Baghdasaryan wrote:
> > On Tue, Jul 4, 2023 at 10:36 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 04.07.23 19:21, Suren Baghdasaryan wrote:
> >>> On Tue, Jul 4, 2023 at 6:07 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>>>
> >>>> On Tue, Jul 04, 2023 at 09:18:18AM +0200, David Hildenbrand wrote:
> >>>>>> At least the reproducer at
> >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> >>>>>> I wonder if that's the best way to fix this. It's surely simple but
> >>>>>> locking every VMA is not free and doing that on every fork might
> >>>>>> regress performance.
> >>>>>
> >>>>>
> >>>>> That would mean that we can possibly still get page faults concurrent to
> >>>>> fork(), on the yet unprocessed part. While that fixes the issue at hand, I
> >>>>> cannot reliably tell if this doesn't mess with some other fork() corner
> >>>>> case.
> >>>>>
> >>>>> I'd suggest write-locking all VMAs upfront, before doing any kind of fork-mm
> >>>>> operation. Just like the old code did. See below.
> >>>>
> >>>> Calling fork() from a multi-threaded program is fraught with danger.
> >>>> It's a rare thing to do, and we don't need to optimise for it.  It
> >>>> does, of course, need to not crash.  But we can slow it down as much as
> >>>> we want to.  Slowing down single-threaded programs calling fork is
> >>>> much less acceptable.
> >>>
> >>> Hmm. Would you suggest we use different approaches for multi-threaded
> >>> vs single-threaded programs?
> >>> I think locking VMAs while forking a process which has lots of VMAs
> >>> will regress by some amount (we are adding non-zero work). The
> >>> question is if that's acceptable or we have to implement something
> >>> different. I verified that solution fixes the issue shown by the
> >>> reproducer, now I'm trying to quantify this fork performance
> >>> regression I suspect we will introduce.
> >>
> >> Well, the design decision that CONFIG_PER_VMA_LOCK made for now was to
> >> make page faults fast and to make blocking any page faults from
> >> happening slower (unless there is some easy way that's already built in).
> >>
> >> So it wouldn't surprise me if it might affect performance a bit, but
> >> it's to be quantified if it really matters in comparison to all the page
> >> table copying and other stuff we do during fork.
> >>
> >> Maybe that can be optimized/sped up later. But for now we should fix
> >> this the straightforward way. That fix will be (and has to be) a NOP for
> >> !CONFIG_PER_VMA_LOCK, so that one won't be affected.
> >>
> >> Maybe this patch in an adjusted form would still make sense (also to be
> >> backported), to keep the feature inactive as default until it stabilized
> >> a bit more.
> >
> > Ok, IIUC your suggestion is to use VMA-lock-on-fork fix even if the
> > fork() regresses and keep CONFIG_PER_VMA_LOCK disabled by default
> > until it's more stable. That sounds good to me. With that fix, do we
> > still need to add the BROKEN dependency? I'm guessing it would be
> > safer to disable for sure.
>
> With this fixed, I don't think we need a BROKEN dependency.
>
> I'll let you decide if you want to keep it enabled by default; I'd
> rather disable it for one release and enable it by default later.
>
> Happy to learn whether taking all VMA locks without any contention
> causes a lot of harm.

Ok, the average kernel compilation time almost did not change - 0.3%,
which is well within noise levels.
My fork test, which mmaps 10000 unmergeable vmas and does 5000 forks in
a tight loop, shows a regression of about 5%. The test was specifically
designed to reveal this regression. This does not seem like too much to
me, considering that it's unlikely a program would issue 5000 forks.
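
For reference, a userspace test along these lines might look roughly
like the sketch below (illustrative only, not the actual test; the
sizes match the description above):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int i;

	/* Alternate protections so adjacent mappings cannot merge. */
	for (i = 0; i < 10000; i++) {
		int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;

		if (mmap(NULL, page, prot, MAP_PRIVATE | MAP_ANONYMOUS,
			 -1, 0) == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	/* Each fork() must now duplicate (and lock) 10000 VMAs. */
	for (i = 0; i < 5000; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			_exit(0);	/* child exits immediately */
		waitpid(pid, NULL, 0);
	}
	return 0;
}
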
So, I think the numbers are not bad and I'll prepare the patch to add
VMA locking, but will not add the BROKEN dependency just yet. If someone
is using an old .config, they probably won't notice the regression, and
if they do, the only thing they have to do is disable
CONFIG_PER_VMA_LOCK.
Of course, if we still get reports about memory corruption, then I'll
add the BROKEN dependency to disable the feature even for old .configs.
Let me know if that plan does not work for some reason.
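
For reference, the BROKEN dependency mentioned here would look roughly
like this in mm/Kconfig (a sketch, not a posted patch):

config PER_VMA_LOCK
	def_bool y
	# BROKEN is never enabled in normal builds, so this turns the
	# feature off even for old .configs with CONFIG_PER_VMA_LOCK=y.
	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP && BROKEN
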
Thanks,
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>
Suren Baghdasaryan July 4, 2023, 8:10 p.m. UTC | #21
On Tue, Jul 4, 2023 at 12:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jul 4, 2023 at 11:05 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 04.07.23 19:56, Suren Baghdasaryan wrote:
> > > On Tue, Jul 4, 2023 at 10:36 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 04.07.23 19:21, Suren Baghdasaryan wrote:
> > >>> On Tue, Jul 4, 2023 at 6:07 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >>>>
> > >>>> On Tue, Jul 04, 2023 at 09:18:18AM +0200, David Hildenbrand wrote:
> > >>>>>> At least the reproducer at
> > >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> > >>>>>> I wonder if that's the best way to fix this. It's surely simple but
> > >>>>>> locking every VMA is not free and doing that on every fork might
> > >>>>>> regress performance.
> > >>>>>
> > >>>>>
> > >>>>> That would mean that we can possibly still get page faults concurrent to
> > >>>>> fork(), on the yet unprocessed part. While that fixes the issue at hand, I
> > >>>>> cannot reliably tell if this doesn't mess with some other fork() corner
> > >>>>> case.
> > >>>>>
> > >>>>> I'd suggest write-locking all VMAs upfront, before doing any kind of fork-mm
> > >>>>> operation. Just like the old code did. See below.
> > >>>>
> > >>>> Calling fork() from a multi-threaded program is fraught with danger.
> > >>>> It's a rare thing to do, and we don't need to optimise for it.  It
> > >>>> does, of course, need to not crash.  But we can slow it down as much as
> > >>>> we want to.  Slowing down single-threaded programs calling fork is
> > >>>> much less acceptable.
> > >>>
> > >>> Hmm. Would you suggest we use different approaches for multi-threaded
> > >>> vs single-threaded programs?
> > >>> I think locking VMAs while forking a process which has lots of VMAs
> > >>> will regress by some amount (we are adding non-zero work). The
> > >>> question is if that's acceptable or we have to implement something
> > >>> different. I verified that solution fixes the issue shown by the
> > >>> reproducer, now I'm trying to quantify this fork performance
> > >>> regression I suspect we will introduce.
> > >>
> > >> Well, the design decision that CONFIG_PER_VMA_LOCK made for now was to
> > >> make page faults fast and to make blocking any page faults from
> > >> happening slower (unless there is some easy way that's already built in).
> > >>
> > >> So it wouldn't surprise me if it might affect performance a bit, but
> > >> it's to be quantified if it really matters in comparison to all the page
> > >> table copying and other stuff we do during fork.
> > >>
> > >> Maybe that can be optimized/sped up later. But for now we should fix
> > >> this the straightforward way. That fix will be (and has to be) a NOP for
> > >> !CONFIG_PER_VMA_LOCK, so that one won't be affected.
> > >>
> > >> Maybe this patch in an adjusted form would still make sense (also to be
> > >> backported), to keep the feature inactive as default until it stabilized
> > >> a bit more.
> > >
> > > Ok, IIUC your suggestion is to use VMA-lock-on-fork fix even if the
> > > fork() regresses and keep CONFIG_PER_VMA_LOCK disabled by default
> > > until it's more stable. That sounds good to me. With that fix, do we
> > > still need to add the BROKEN dependency? I'm guessing it would be
> > > safer to disable for sure.
> >
> > With this fixed, I don't think we need a BROKEN dependency.
> >
> > I'll let you decide if you want to keep it enabled by default; I'd
> > rather disable it for one release and enable it by default later.
> >
> > Happy to learn whether taking all VMA locks without any contention
> > causes a lot of harm.
>
> Ok, the average kernel compilation time almost did not change - 0.3%,
> which is well within noise levels.
> My fork test, which mmaps 10000 unmergeable vmas and does 5000 forks in
> a tight loop, shows a regression of about 5%. The test was specifically
> designed to reveal this regression. This does not seem like too much to
> me, considering that it's unlikely a program would issue 5000 forks.
> So, I think the numbers are not bad and I'll prepare the patch to add
> VMA locking, but will not add the BROKEN dependency just yet. If someone
> is using an old .config, they probably won't notice the regression, and
> if they do, the only thing they have to do is disable
> CONFIG_PER_VMA_LOCK.
> Of course, if we still get reports about memory corruption, then I'll
> add the BROKEN dependency to disable the feature even for old .configs.
> Let me know if that plan does not work for some reason.

The fix is posted at
https://lore.kernel.org/all/20230704200656.2526715-1-surenb@google.com/
CC'ing stable for inclusion into 6.4.y stable branch.

Folks who reported the problem, could you please test and verify the fix?
Thanks,
Suren.

> Thanks,
> Suren.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Suren Baghdasaryan July 4, 2023, 10:03 p.m. UTC | #22
On Tue, Jul 4, 2023 at 2:34 PM Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
>
> On 2023-07-04 22:10, Suren Baghdasaryan wrote:
> > The fix is posted at
> > https://lore.kernel.org/all/20230704200656.2526715-1-surenb@google.com/
> > CC'ing stable for inclusion into 6.4.y stable branch.
> >
> > Folks who reported the problem, could you please test and verify the fix?
> > Thanks,
> > Suren.
>
> I applied the fix and did a clean rebuild. The first attempt to boot resulted in
> the following oops, though it kind of continued:
>
> Jul  4 22:35:22 hho kernel: Console: switching to colour frame buffer device 240x67
> Jul  4 22:35:22 hho kernel: amdgpu 0000:06:00.0: [drm] fb0: amdgpudrmfb frame buffer device
> Jul  4 22:35:22 hho kernel: BUG: kernel NULL pointer dereference, address: 0000000000000052
> Jul  4 22:35:22 hho kernel: #PF: supervisor read access in kernel mode
> Jul  4 22:35:22 hho kernel: #PF: error_code(0x0000) - not-present page
> Jul  4 22:35:22 hho kernel: PGD 0 P4D 0
> Jul  4 22:35:22 hho kernel: Oops: 0000 [#1] SMP
> Jul  4 22:35:22 hho kernel: CPU: 10 PID: 1740 Comm: start-stop-daem Not tainted 6.4.1 #1
> Jul  4 22:35:22 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: Call Trace:
> Jul  4 22:35:22 hho kernel:  <TASK>
> Jul  4 22:35:22 hho kernel:  ? __die+0x1f/0x60
> Jul  4 22:35:22 hho kernel:  ? page_fault_oops+0x14d/0x410
> Jul  4 22:35:22 hho kernel:  ? xa_load+0x82/0xa0
> Jul  4 22:35:22 hho kernel:  ? exc_page_fault+0x60/0x100
> Jul  4 22:35:22 hho kernel:  ? asm_exc_page_fault+0x22/0x30
> Jul  4 22:35:22 hho kernel:  ? wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho last message buffered 1 times
> Jul  4 22:35:22 hho kernel:  proc_task_name+0xa4/0xb0
> Jul  4 22:35:22 hho kernel:  ? seq_put_decimal_ull_width+0x96/0x100
> Jul  4 22:35:22 hho kernel:  do_task_stat+0x44b/0xe10
> Jul  4 22:35:22 hho kernel:  proc_single_show+0x4b/0xa0
> Jul  4 22:35:22 hho kernel:  seq_read_iter+0xff/0x410
> Jul  4 22:35:22 hho kernel:  ? generic_fillattr+0x45/0xf0
> Jul  4 22:35:22 hho kernel:  seq_read+0x93/0xb0
> Jul  4 22:35:22 hho kernel:  vfs_read+0x9b/0x2c0
> Jul  4 22:35:22 hho kernel:  ? __do_sys_newfstatat+0x22/0x30
> Jul  4 22:35:22 hho kernel:  ksys_read+0x53/0xc0
> Jul  4 22:35:22 hho kernel:  do_syscall_64+0x35/0x80
> Jul  4 22:35:22 hho kernel:  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> Jul  4 22:35:22 hho kernel: RIP: 0033:0x7f39ddf5877d
> Jul  4 22:35:22 hho kernel: Code: b9 fe ff ff 48 8d 3d 1a 71 0a 00 50 e8 2c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 80 3d 81 4c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> Jul  4 22:35:22 hho kernel: RSP: 002b:00007ffe4b98b6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> Jul  4 22:35:22 hho kernel: RAX: ffffffffffffffda RBX: 00005655194cab40 RCX: 00007f39ddf5877d
> Jul  4 22:35:22 hho kernel: RDX: 0000000000000400 RSI: 00005655194ccd30 RDI: 0000000000000004
> Jul  4 22:35:22 hho kernel: RBP: 00007ffe4b98b760 R08: 00007f39ddff8cb2 R09: 0000000000000001
> Jul  4 22:35:22 hho kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007f39de0324a0
> Jul  4 22:35:22 hho kernel: R13: 00005655194cd140 R14: 0000000000000a68 R15: 00007f39de031ba0
> Jul  4 22:35:22 hho kernel:  </TASK>
> Jul  4 22:35:22 hho kernel: Modules linked in: mousedev sch_fq_codel bpf_preload snd_ctl_led amdgpu iwlmvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi mac80211 pkcs8_key_parser drm_ttm_helper ttm iommu_v2 gpu_sched snd_hda_intel libarc4 i2c_algo_bit snd_intel_dspcfg drm_buddy drm_suballoc_helper uvcvideo snd_hda_codec drm_display_helper edac_mce_amd videobuf2_vmalloc snd_hwdep crct10dif_pclmul videobuf2_memops uvc crc32_pclmul cec snd_hda_core crc32c_intel videobuf2_v4l2 ghash_clmulni_intel lm92 r8169 sha512_ssse3 snd_pcm videodev psmouse thinkpad_acpi iwlwifi drivetemp ledtrig_audio drm_kms_helper rapl videobuf2_common realtek snd_timer serio_raw snd_rn_pci_acp3x wmi_bmof platform_profile cfg80211 mc snd_acp_config k10temp snd syscopyarea mdio_devres ucsi_acpi snd_soc_acpi sysfillrect drm snd_pci_acp3x i2c_piix4 sysimgblt soundcore typec_ucsi ipmi_devintf rfkill roles libphy ipmi_msghandler typec video battery ac wmi i2c_scmi button
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052
> Jul  4 22:35:22 hho kernel: ---[ end trace 0000000000000000 ]---
> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: note: start-stop-daem[1740] exited with irqs disabled
> Jul  4 22:35:22 hho kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
> Jul  4 22:35:22 hho kernel: r8169 0000:02:00.0 eth0: Link is Down
> Jul  4 22:35:24 hho kernel: r8169 0000:02:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> Jul  4 22:35:24 hho kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>
> It then kind of limped along until I rebooted again. This second attempt to boot
> died and locked up completely, again during amdgpu initialization, and is on display here:
> https://imgur.com/a/3ZE66kh
>
> Finally I just edited mm/Kconfig and set config PER_VMA_LOCK to "def_bool n" to override
> any setting in my old config. That made everything work again - it's what I'm using now.

Now I'm completely confused... I've been running my system with this
fix and collecting data the whole morning.
Ok, I'll post the BROKEN dependency patch in the evening and will see
what this is all about. Thanks!

>
> Happy 4th and fireworks or whatever ¯\(ツ)/¯
>
> cheers
> Holger
Matthew Wilcox July 4, 2023, 10:42 p.m. UTC | #23
On Tue, Jul 04, 2023 at 11:34:27PM +0200, Holger Hoffstätte wrote:
> I applied the fix and did a clean rebuild. The first attempt to boot resulted in
> the following oops, though it kind of continued:

It would be helpful to run this through decode_stacktrace.sh

> Jul  4 22:35:22 hho kernel: BUG: kernel NULL pointer dereference, address: 0000000000000052
> Jul  4 22:35:22 hho kernel: #PF: supervisor read access in kernel mode
> Jul  4 22:35:22 hho kernel: #PF: error_code(0x0000) - not-present page
> Jul  4 22:35:22 hho kernel: PGD 0 P4D 0
> Jul  4 22:35:22 hho kernel: Oops: 0000 [#1] SMP
> Jul  4 22:35:22 hho kernel: CPU: 10 PID: 1740 Comm: start-stop-daem Not tainted 6.4.1 #1
> Jul  4 22:35:22 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b

Faulting insn:

   0:	4c 8b 70 48          	mov    0x48(%rax),%r14

and rax is 0xa, so the load from 0x48(%rax) dereferences
0xa + 0x48 = 0x52, which matches the faulting address.

I'm not sure this is related to the VMA patches.  It might be something
unrelated that doesn't often come up?

> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: Call Trace:
> Jul  4 22:35:22 hho kernel:  <TASK>
> Jul  4 22:35:22 hho kernel:  ? __die+0x1f/0x60
> Jul  4 22:35:22 hho kernel:  ? page_fault_oops+0x14d/0x410
> Jul  4 22:35:22 hho kernel:  ? xa_load+0x82/0xa0
> Jul  4 22:35:22 hho kernel:  ? exc_page_fault+0x60/0x100
> Jul  4 22:35:22 hho kernel:  ? asm_exc_page_fault+0x22/0x30
> Jul  4 22:35:22 hho kernel:  ? wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho last message buffered 1 times
> Jul  4 22:35:22 hho kernel:  proc_task_name+0xa4/0xb0
> Jul  4 22:35:22 hho kernel:  ? seq_put_decimal_ull_width+0x96/0x100
> Jul  4 22:35:22 hho kernel:  do_task_stat+0x44b/0xe10
> Jul  4 22:35:22 hho kernel:  proc_single_show+0x4b/0xa0
> Jul  4 22:35:22 hho kernel:  seq_read_iter+0xff/0x410
> Jul  4 22:35:22 hho kernel:  ? generic_fillattr+0x45/0xf0
> Jul  4 22:35:22 hho kernel:  seq_read+0x93/0xb0
> Jul  4 22:35:22 hho kernel:  vfs_read+0x9b/0x2c0
> Jul  4 22:35:22 hho kernel:  ? __do_sys_newfstatat+0x22/0x30
> Jul  4 22:35:22 hho kernel:  ksys_read+0x53/0xc0
> Jul  4 22:35:22 hho kernel:  do_syscall_64+0x35/0x80
> Jul  4 22:35:22 hho kernel:  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> Jul  4 22:35:22 hho kernel: RIP: 0033:0x7f39ddf5877d
> Jul  4 22:35:22 hho kernel: Code: b9 fe ff ff 48 8d 3d 1a 71 0a 00 50 e8 2c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 80 3d 81 4c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> Jul  4 22:35:22 hho kernel: RSP: 002b:00007ffe4b98b6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> Jul  4 22:35:22 hho kernel: RAX: ffffffffffffffda RBX: 00005655194cab40 RCX: 00007f39ddf5877d
> Jul  4 22:35:22 hho kernel: RDX: 0000000000000400 RSI: 00005655194ccd30 RDI: 0000000000000004
> Jul  4 22:35:22 hho kernel: RBP: 00007ffe4b98b760 R08: 00007f39ddff8cb2 R09: 0000000000000001
> Jul  4 22:35:22 hho kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007f39de0324a0
> Jul  4 22:35:22 hho kernel: R13: 00005655194cd140 R14: 0000000000000a68 R15: 00007f39de031ba0
> Jul  4 22:35:22 hho kernel:  </TASK>
> Jul  4 22:35:22 hho kernel: Modules linked in: mousedev sch_fq_codel bpf_preload snd_ctl_led amdgpu iwlmvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi mac80211 pkcs8_key_parser drm_ttm_helper ttm iommu_v2 gpu_sched snd_hda_intel libarc4 i2c_algo_bit snd_intel_dspcfg drm_buddy drm_suballoc_helper uvcvideo snd_hda_codec drm_display_helper edac_mce_amd videobuf2_vmalloc snd_hwdep crct10dif_pclmul videobuf2_memops uvc crc32_pclmul cec snd_hda_core crc32c_intel videobuf2_v4l2 ghash_clmulni_intel lm92 r8169 sha512_ssse3 snd_pcm videodev psmouse thinkpad_acpi iwlwifi drivetemp ledtrig_audio drm_kms_helper rapl videobuf2_common realtek snd_timer serio_raw snd_rn_pci_acp3x wmi_bmof platform_profile cfg80211 mc snd_acp_config k10temp snd syscopyarea mdio_devres ucsi_acpi snd_soc_acpi sysfillrect drm snd_pci_acp3x i2c_piix4 sysimgblt soundcore typec_ucsi ipmi_devintf rfkill roles libphy ipmi_msghandler typec video battery ac wmi i2c_scmi button
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052
> Jul  4 22:35:22 hho kernel: ---[ end trace 0000000000000000 ]---
> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: note: start-stop-daem[1740] exited with irqs disabled
> Jul  4 22:35:22 hho kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
> Jul  4 22:35:22 hho kernel: r8169 0000:02:00.0 eth0: Link is Down
> Jul  4 22:35:24 hho kernel: r8169 0000:02:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> Jul  4 22:35:24 hho kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> 
> It then kind of limped along until I rebooted again. This second attempt to boot
> died and locked up completely, again during amdgpu initialization, and is on display here:
> https://imgur.com/a/3ZE66kh

refill_obj_stock() is also somewhat unrelated to VMA stuff.  This is
all very bizarre.

> Finally I just edited mm/Kconfig and set config PER_VMA_LOCK to "def_bool n" to override
> any setting in my old config. That made everything work again - it's what I'm using now.

Could I ask you to try a few boots with PER_VMA_LOCK set to "n", just
to eliminate the possibility that this is a coincidence?
Suren Baghdasaryan July 5, 2023, 6:46 a.m. UTC | #24
On Tue, Jul 4, 2023 at 4:59 PM Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
>
> On 2023-07-05 00:42, Matthew Wilcox wrote:
> > On Tue, Jul 04, 2023 at 11:34:27PM +0200, Holger Hoffstätte wrote:
> >> I applied the fix and did a clean rebuild. The first attempt to boot resulted in
> >> the following oops, though it kind of continued:
> >
> > It would be helpful to run this through decode_stacktrace.sh
> >
> >> Jul  4 22:35:22 hho kernel: BUG: kernel NULL pointer dereference, address: 0000000000000052
> >> Jul  4 22:35:22 hho kernel: #PF: supervisor read access in kernel mode
> >> Jul  4 22:35:22 hho kernel: #PF: error_code(0x0000) - not-present page
> >> Jul  4 22:35:22 hho kernel: PGD 0 P4D 0
> >> Jul  4 22:35:22 hho kernel: Oops: 0000 [#1] SMP
> >> Jul  4 22:35:22 hho kernel: CPU: 10 PID: 1740 Comm: start-stop-daem Not tainted 6.4.1 #1
> >> Jul  4 22:35:22 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
> >> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> >> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> >
> > Faulting insn:
> >
> >     0:        4c 8b 70 48             mov    0x48(%rax),%r14
> >
> > and rax is 0xa, which matches up with 0x52 as the faulting address.
> >
> > I'm not sure this is related to the VMA patches.  It might be something
> > unrelated that doesn't often come up?
>
> See below for the reveal!
>
> >> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> >> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> >> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> >> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> >> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> >> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> >> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> >> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> >> Jul  4 22:35:22 hho kernel: Call Trace:
> >> Jul  4 22:35:22 hho kernel:  <TASK>
> >> Jul  4 22:35:22 hho kernel:  ? __die+0x1f/0x60
> >> Jul  4 22:35:22 hho kernel:  ? page_fault_oops+0x14d/0x410
> >> Jul  4 22:35:22 hho kernel:  ? xa_load+0x82/0xa0
> >> Jul  4 22:35:22 hho kernel:  ? exc_page_fault+0x60/0x100
> >> Jul  4 22:35:22 hho kernel:  ? asm_exc_page_fault+0x22/0x30
> >> Jul  4 22:35:22 hho kernel:  ? wq_worker_comm+0x63/0xc0
> >> Jul  4 22:35:22 hho last message buffered 1 times
> >> Jul  4 22:35:22 hho kernel:  proc_task_name+0xa4/0xb0
> >> Jul  4 22:35:22 hho kernel:  ? seq_put_decimal_ull_width+0x96/0x100
> >> Jul  4 22:35:22 hho kernel:  do_task_stat+0x44b/0xe10
> >> Jul  4 22:35:22 hho kernel:  proc_single_show+0x4b/0xa0
> >> Jul  4 22:35:22 hho kernel:  seq_read_iter+0xff/0x410
> >> Jul  4 22:35:22 hho kernel:  ? generic_fillattr+0x45/0xf0
> >> Jul  4 22:35:22 hho kernel:  seq_read+0x93/0xb0
> >> Jul  4 22:35:22 hho kernel:  vfs_read+0x9b/0x2c0
> >> Jul  4 22:35:22 hho kernel:  ? __do_sys_newfstatat+0x22/0x30
> >> Jul  4 22:35:22 hho kernel:  ksys_read+0x53/0xc0
> >> Jul  4 22:35:22 hho kernel:  do_syscall_64+0x35/0x80
> >> Jul  4 22:35:22 hho kernel:  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >> Jul  4 22:35:22 hho kernel: RIP: 0033:0x7f39ddf5877d
> >> Jul  4 22:35:22 hho kernel: Code: b9 fe ff ff 48 8d 3d 1a 71 0a 00 50 e8 2c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 80 3d 81 4c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> >> Jul  4 22:35:22 hho kernel: RSP: 002b:00007ffe4b98b6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> >> Jul  4 22:35:22 hho kernel: RAX: ffffffffffffffda RBX: 00005655194cab40 RCX: 00007f39ddf5877d
> >> Jul  4 22:35:22 hho kernel: RDX: 0000000000000400 RSI: 00005655194ccd30 RDI: 0000000000000004
> >> Jul  4 22:35:22 hho kernel: RBP: 00007ffe4b98b760 R08: 00007f39ddff8cb2 R09: 0000000000000001
> >> Jul  4 22:35:22 hho kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007f39de0324a0
> >> Jul  4 22:35:22 hho kernel: R13: 00005655194cd140 R14: 0000000000000a68 R15: 00007f39de031ba0
> >> Jul  4 22:35:22 hho kernel:  </TASK>
> >> Jul  4 22:35:22 hho kernel: Modules linked in: mousedev sch_fq_codel bpf_preload snd_ctl_led amdgpu iwlmvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi mac80211 pkcs8_key_parser drm_ttm_helper ttm iommu_v2 gpu_sched snd_hda_intel libarc4 i2c_algo_bit snd_intel_dspcfg drm_buddy drm_suballoc_helper uvcvideo snd_hda_codec drm_display_helper edac_mce_amd videobuf2_vmalloc snd_hwdep crct10dif_pclmul videobuf2_memops uvc crc32_pclmul cec snd_hda_core crc32c_intel videobuf2_v4l2 ghash_clmulni_intel lm92 r8169 sha512_ssse3 snd_pcm videodev psmouse thinkpad_acpi iwlwifi drivetemp ledtrig_audio drm_kms_helper rapl videobuf2_common realtek snd_timer serio_raw snd_rn_pci_acp3x wmi_bmof platform_profile cfg80211 mc snd_acp_config k10temp snd syscopyarea mdio_devres ucsi_acpi snd_soc_acpi sysfillrect drm snd_pci_acp3x i2c_piix4 sysimgblt soundcore typec_ucsi ipmi_devintf rfkill roles libphy ipmi_msghandler typec video battery ac wmi i2c_scmi button
> >> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052
> >> Jul  4 22:35:22 hho kernel: ---[ end trace 0000000000000000 ]---
> >> Jul  4 22:35:22 hho kernel: RIP: 0010:wq_worker_comm+0x63/0xc0
> >> Jul  4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> >> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> >> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> >> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> >> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> >> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> >> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> >> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> >> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> >> Jul  4 22:35:22 hho kernel: note: start-stop-daem[1740] exited with irqs disabled
> >> Jul  4 22:35:22 hho kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
> >> Jul  4 22:35:22 hho kernel: r8169 0000:02:00.0 eth0: Link is Down
> >> Jul  4 22:35:24 hho kernel: r8169 0000:02:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> >> Jul  4 22:35:24 hho kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> >>
> >> It then kind of limped along until I rebooted again. This second attempt to boot
> >> died and locked up completely, again during amdgpu initialization, and is on display here:
> >> https://imgur.com/a/3ZE66kh
> >
> > refill_obj_stock() is also somewhat unrelated to VMA stuff.  This is
> > all very bizarre.
> >
> >> Finally I just edited mm/Kconfig and set config PER_VMA_LOCK to "def_bool n" to override
> >> any setting in my old config. That made everything work again - it's what I'm using now.
> >
> > Could I ask you to try a few boots with PER_VMA_LOCK set to "n", just
> > to eliminate the possibility that this is a coincidence?
> >
>
> HOLY SMOKES! You are on to something! I wanted to do 10 reboots and didn't expect
> anything to happen since this has been working fine since forever, and I don't boot
> that often since suspend is quite reliable these days. It did 9 without problems and
> then on the 10th reboot it crapped out, again with the xa_load pagefault.

Ok, sounds like the results of the fix are inconclusive. I guess we
should wait for more testing before concluding whether the fix is
valid.
In the meantime, per Andrew's request, I posted the patchset that
includes both the fix and a proper kill switch for the feature at
https://lore.kernel.org/all/20230705063711.2670599-1-surenb@google.com/.
Thanks,
Suren.

>
> Here's the first trace:
>
> holger>/tmp/linux-6.4.1/scripts/decode_stacktrace.sh /boot/kernel-genkernel-x86_64-6.4.1 < /tmp/kern.log
> Jul  4 22:35:22 hho kernel: [drm] Initialized amdgpu 3.52.0 20150101 for 0000:06:00.0 on minor 0
> Jul  4 22:35:22 hho kernel: fbcon: amdgpudrmfb (fb0) is primary device
> Jul  4 22:35:22 hho kernel: [drm] DSC precompute is not needed.
> Jul  4 22:35:22 hho kernel: Console: switching to colour frame buffer device 240x67
> Jul  4 22:35:22 hho kernel: amdgpu 0000:06:00.0: [drm] fb0: amdgpudrmfb frame buffer device
> Jul  4 22:35:22 hho kernel: BUG: kernel NULL pointer dereference, address: 0000000000000052
> Jul  4 22:35:22 hho kernel: #PF: supervisor read access in kernel mode
> Jul  4 22:35:22 hho kernel: #PF: error_code(0x0000) - not-present page
> Jul  4 22:35:22 hho kernel: PGD 0 P4D 0
> Jul  4 22:35:22 hho kernel: Oops: 0000 [#1] SMP
> Jul  4 22:35:22 hho kernel: CPU: 10 PID: 1740 Comm: start-stop-daem Not tainted 6.4.1 #1
> Jul  4 22:35:22 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
> Jul 4 22:35:22 hho kernel: RIP: wq_worker_comm+0x63/0xc0
> Jul 4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> All code
> ========
>     0:  43 2c 20                rex.XB sub $0x20,%al
>     3:  75 1d                   jne    0x22
>     5:  5b                      pop    %rbx
>     6:  5d                      pop    %rbp
>     7:  48 c7 c7 e0 a4 43 82    mov    $0xffffffff8243a4e0,%rdi
>     e:  41 5c                   pop    %r12
>    10:  41 5d                   pop    %r13
>    12:  41 5e                   pop    %r14
>    14:  e9 7e 6b 8b 00          jmp    0x8b6b97
>    19:  5b                      pop    %rbx
>    1a:  5d                      pop    %rbp
>    1b:  41 5c                   pop    %r12
>    1d:  41 5d                   pop    %r13
>    1f:  41 5e                   pop    %r14
>    21:  c3                      ret
>    22:  48 89 df                mov    %rbx,%rdi
>    25:  e8 ad 35 00 00          call   0x35d7
>    2a:* 4c 8b 70 48             mov    0x48(%rax),%r14          <-- trapping instruction
>    2e:  48 89 c3                mov    %rax,%rbx
>    31:  4d 85 f6                test   %r14,%r14
>    34:  74 cf                   je     0x5
>    36:  4c 89 f7                mov    %r14,%rdi
>    39:  e8 29 b6 8b 00          call   0x8bb667
>    3e:  80                      .byte 0x80
>    3f:  7b                      .byte 0x7b
>
> Code starting with the faulting instruction
> ===========================================
>     0:  4c 8b 70 48             mov    0x48(%rax),%r14
>     4:  48 89 c3                mov    %rax,%rbx
>     7:  4d 85 f6                test   %r14,%r14
>     a:  74 cf                   je     0xffffffffffffffdb
>     c:  4c 89 f7                mov    %r14,%rdi
>     f:  e8 29 b6 8b 00          call   0x8bb63d
>    14:  80                      .byte 0x80
>    15:  7b                      .byte 0x7b
> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: Call Trace:
> Jul  4 22:35:22 hho kernel:  <TASK>
> Jul 4 22:35:22 hho kernel: ? __die+0x1f/0x60
> Jul 4 22:35:22 hho kernel: ? page_fault_oops+0x14d/0x410
> Jul 4 22:35:22 hho kernel: ? xa_load+0x82/0xa0
> Jul 4 22:35:22 hho kernel: ? exc_page_fault+0x60/0x100
> Jul 4 22:35:22 hho kernel: ? asm_exc_page_fault+0x22/0x30
> Jul 4 22:35:22 hho kernel: ? wq_worker_comm+0x63/0xc0
> Jul  4 22:35:22 hho last message buffered 1 times
> Jul 4 22:35:22 hho kernel: proc_task_name+0xa4/0xb0
> Jul 4 22:35:22 hho kernel: ? seq_put_decimal_ull_width+0x96/0x100
> Jul 4 22:35:22 hho kernel: do_task_stat+0x44b/0xe10
> Jul 4 22:35:22 hho kernel: proc_single_show+0x4b/0xa0
> Jul 4 22:35:22 hho kernel: seq_read_iter+0xff/0x410
> Jul 4 22:35:22 hho kernel: ? generic_fillattr+0x45/0xf0
> Jul 4 22:35:22 hho kernel: seq_read+0x93/0xb0
> Jul 4 22:35:22 hho kernel: vfs_read+0x9b/0x2c0
> Jul 4 22:35:22 hho kernel: ? __do_sys_newfstatat+0x22/0x30
> Jul 4 22:35:22 hho kernel: ksys_read+0x53/0xc0
> Jul 4 22:35:22 hho kernel: do_syscall_64+0x35/0x80
> Jul 4 22:35:22 hho kernel: entry_SYSCALL_64_after_hwframe+0x46/0xb0
> Jul  4 22:35:22 hho kernel: RIP: 0033:0x7f39ddf5877d
> Jul 4 22:35:22 hho kernel: Code: b9 fe ff ff 48 8d 3d 1a 71 0a 00 50 e8 2c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 80 3d 81 4c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> All code
> ========
>     0:  b9 fe ff ff 48          mov    $0x48fffffe,%ecx
>     5:  8d 3d 1a 71 0a 00       lea    0xa711a(%rip),%edi        # 0xa7125
>     b:  50                      push   %rax
>     c:  e8 2c 12 02 00          call   0x2123d
>    11:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    18:  00 00 00
>    1b:  66 90                   xchg   %ax,%ax
>    1d:  80 3d 81 4c 0e 00 00    cmpb   $0x0,0xe4c81(%rip)        # 0xe4ca5
>    24:  74 17                   je     0x3d
>    26:  31 c0                   xor    %eax,%eax
>    28:  0f 05                   syscall
>    2a:* 48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
>    30:  77 5b                   ja     0x8d
>    32:  c3                      ret
>    33:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    3a:  00 00 00
>    3d:  53                      push   %rbx
>    3e:  48                      rex.W
>    3f:  83                      .byte 0x83
>
> Code starting with the faulting instruction
> ===========================================
>     0:  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
>     6:  77 5b                   ja     0x63
>     8:  c3                      ret
>     9:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    10:  00 00 00
>    13:  53                      push   %rbx
>    14:  48                      rex.W
>    15:  83                      .byte 0x83
> Jul  4 22:35:22 hho kernel: RSP: 002b:00007ffe4b98b6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> Jul  4 22:35:22 hho kernel: RAX: ffffffffffffffda RBX: 00005655194cab40 RCX: 00007f39ddf5877d
> Jul  4 22:35:22 hho kernel: RDX: 0000000000000400 RSI: 00005655194ccd30 RDI: 0000000000000004
> Jul  4 22:35:22 hho kernel: RBP: 00007ffe4b98b760 R08: 00007f39ddff8cb2 R09: 0000000000000001
> Jul  4 22:35:22 hho kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007f39de0324a0
> Jul  4 22:35:22 hho kernel: R13: 00005655194cd140 R14: 0000000000000a68 R15: 00007f39de031ba0
> Jul  4 22:35:22 hho kernel:  </TASK>
> Jul  4 22:35:22 hho kernel: Modules linked in: mousedev sch_fq_codel bpf_preload snd_ctl_led amdgpu iwlmvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi mac80211 pkcs8_key_parser drm_ttm_helper ttm iommu_v2 gpu_sched snd_hda_intel libarc4 i2c_algo_bit snd_intel_dspcfg drm_buddy drm_suballoc_helper uvcvideo snd_hda_codec drm_display_helper edac_mce_amd videobuf2_vmalloc snd_hwdep crct10dif_pclmul videobuf2_memops uvc crc32_pclmul cec snd_hda_core crc32c_intel videobuf2_v4l2 ghash_clmulni_intel lm92 r8169 sha512_ssse3 snd_pcm videodev psmouse thinkpad_acpi iwlwifi drivetemp ledtrig_audio drm_kms_helper rapl videobuf2_common realtek snd_timer serio_raw snd_rn_pci_acp3x wmi_bmof platform_profile cfg80211 mc snd_acp_config k10temp snd syscopyarea mdio_devres ucsi_acpi snd_soc_acpi sysfillrect drm snd_pci_acp3x i2c_piix4 sysimgblt soundcore typec_ucsi ipmi_devintf rfkill roles libphy ipmi_msghandler typec video battery ac wmi i2c_scmi button
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052
> Jul  4 22:35:22 hho kernel: ---[ end trace 0000000000000000 ]---
> Jul 4 22:35:22 hho kernel: RIP: wq_worker_comm+0x63/0xc0
> Jul 4 22:35:22 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 7e 6b 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 29 b6 8b 00 80 7b
> All code
> ========
>     0:  43 2c 20                rex.XB sub $0x20,%al
>     3:  75 1d                   jne    0x22
>     5:  5b                      pop    %rbx
>     6:  5d                      pop    %rbp
>     7:  48 c7 c7 e0 a4 43 82    mov    $0xffffffff8243a4e0,%rdi
>     e:  41 5c                   pop    %r12
>    10:  41 5d                   pop    %r13
>    12:  41 5e                   pop    %r14
>    14:  e9 7e 6b 8b 00          jmp    0x8b6b97
>    19:  5b                      pop    %rbx
>    1a:  5d                      pop    %rbp
>    1b:  41 5c                   pop    %r12
>    1d:  41 5d                   pop    %r13
>    1f:  41 5e                   pop    %r14
>    21:  c3                      ret
>    22:  48 89 df                mov    %rbx,%rdi
>    25:  e8 ad 35 00 00          call   0x35d7
>    2a:* 4c 8b 70 48             mov    0x48(%rax),%r14          <-- trapping instruction
>    2e:  48 89 c3                mov    %rax,%rbx
>    31:  4d 85 f6                test   %r14,%r14
>    34:  74 cf                   je     0x5
>    36:  4c 89 f7                mov    %r14,%rdi
>    39:  e8 29 b6 8b 00          call   0x8bb667
>    3e:  80                      .byte 0x80
>    3f:  7b                      .byte 0x7b
>
> Code starting with the faulting instruction
> ===========================================
>     0:  4c 8b 70 48             mov    0x48(%rax),%r14
>     4:  48 89 c3                mov    %rax,%rbx
>     7:  4d 85 f6                test   %r14,%r14
>     a:  74 cf                   je     0xffffffffffffffdb
>     c:  4c 89 f7                mov    %r14,%rdi
>     f:  e8 29 b6 8b 00          call   0x8bb63d
>    14:  80                      .byte 0x80
>    15:  7b                      .byte 0x7b
> Jul  4 22:35:22 hho kernel: RSP: 0018:ffffc90000fb7bb8 EFLAGS: 00010202
> Jul  4 22:35:22 hho kernel: RAX: 000000000000000a RBX: ffff88810cd43300 RCX: 0001020304050608
> Jul  4 22:35:22 hho kernel: RDX: ffff88811395bfc0 RSI: 7fffffffffffffff RDI: ffff88810cd43300
> Jul  4 22:35:22 hho kernel: RBP: 000000000000000f R08: ffffc90000fb7be8 R09: 0000000000000040
> Jul  4 22:35:22 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90000fb7be8
> Jul  4 22:35:22 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  4 22:35:22 hho kernel: FS:  00007f39dde1c740(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
> Jul  4 22:35:22 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  4 22:35:22 hho kernel: CR2: 0000000000000052 CR3: 0000000112188000 CR4: 0000000000350ee0
> Jul  4 22:35:22 hho kernel: note: start-stop-daem[1740] exited with irqs disabled
> Jul  4 22:35:22 hho kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
> Jul  4 22:35:22 hho kernel: r8169 0000:02:00.0 eth0: Link is Down
> Jul  4 22:35:24 hho kernel: r8169 0000:02:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> Jul  4 22:35:24 hho kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>
> Here is the second one from the reboot bonanza:
>
> holger>/tmp/linux-6.4.1/scripts/decode_stacktrace.sh /boot/kernel-genkernel-x86_64-6.4.1 < /tmp/kern.log
> Jul  5 01:34:20 hho kernel: [drm] Initialized amdgpu 3.52.0 20150101 for 0000:06:00.0 on minor 0
> Jul  5 01:34:20 hho kernel: fbcon: amdgpudrmfb (fb0) is primary device
> Jul  5 01:34:20 hho kernel: [drm] DSC precompute is not needed.
> Jul  5 01:34:20 hho kernel: Console: switching to colour frame buffer device 240x67
> Jul  5 01:34:20 hho kernel: amdgpu 0000:06:00.0: [drm] fb0: amdgpudrmfb frame buffer device
> Jul  5 01:34:20 hho kernel: BUG: kernel NULL pointer dereference, address: 0000000000000052
> Jul  5 01:34:20 hho kernel: #PF: supervisor read access in kernel mode
> Jul  5 01:34:20 hho kernel: #PF: error_code(0x0000) - not-present page
> Jul  5 01:34:20 hho kernel: PGD 0 P4D 0
> Jul  5 01:34:20 hho kernel: Oops: 0000 [#1] SMP
> Jul  5 01:34:20 hho kernel: CPU: 8 PID: 1716 Comm: start-stop-daem Not tainted 6.4.1 #1
> Jul  5 01:34:20 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
> Jul 5 01:34:20 hho kernel: RIP: wq_worker_comm+0x63/0xc0
> Jul 5 01:34:20 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 2e 59 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 d9 a3 8b 00 80 7b
> All code
> ========
>     0:  43 2c 20                rex.XB sub $0x20,%al
>     3:  75 1d                   jne    0x22
>     5:  5b                      pop    %rbx
>     6:  5d                      pop    %rbp
>     7:  48 c7 c7 e0 a4 43 82    mov    $0xffffffff8243a4e0,%rdi
>     e:  41 5c                   pop    %r12
>    10:  41 5d                   pop    %r13
>    12:  41 5e                   pop    %r14
>    14:  e9 2e 59 8b 00          jmp    0x8b5947
>    19:  5b                      pop    %rbx
>    1a:  5d                      pop    %rbp
>    1b:  41 5c                   pop    %r12
>    1d:  41 5d                   pop    %r13
>    1f:  41 5e                   pop    %r14
>    21:  c3                      ret
>    22:  48 89 df                mov    %rbx,%rdi
>    25:  e8 ad 35 00 00          call   0x35d7
>    2a:* 4c 8b 70 48             mov    0x48(%rax),%r14          <-- trapping instruction
>    2e:  48 89 c3                mov    %rax,%rbx
>    31:  4d 85 f6                test   %r14,%r14
>    34:  74 cf                   je     0x5
>    36:  4c 89 f7                mov    %r14,%rdi
>    39:  e8 d9 a3 8b 00          call   0x8ba417
>    3e:  80                      .byte 0x80
>    3f:  7b                      .byte 0x7b
>
> Code starting with the faulting instruction
> ===========================================
>     0:  4c 8b 70 48             mov    0x48(%rax),%r14
>     4:  48 89 c3                mov    %rax,%rbx
>     7:  4d 85 f6                test   %r14,%r14
>     a:  74 cf                   je     0xffffffffffffffdb
>     c:  4c 89 f7                mov    %r14,%rdi
>     f:  e8 d9 a3 8b 00          call   0x8ba3ed
>    14:  80                      .byte 0x80
>    15:  7b                      .byte 0x7b
> Jul  5 01:34:20 hho kernel: RSP: 0018:ffffc90001027bb8 EFLAGS: 00010202
> Jul  5 01:34:20 hho kernel: RAX: 000000000000000a RBX: ffff888111052640 RCX: 0001020304050608
> Jul  5 01:34:20 hho kernel: RDX: ffff88810490b300 RSI: 7fffffffffffffff RDI: ffff888111052640
> Jul  5 01:34:20 hho kernel: RBP: 000000000000000f R08: ffffc90001027be8 R09: 0000000000000040
> Jul  5 01:34:20 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90001027be8
> Jul  5 01:34:20 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  5 01:34:20 hho kernel: FS:  00007f917809a740(0000) GS:ffff8887ef600000(0000) knlGS:0000000000000000
> Jul  5 01:34:20 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  5 01:34:20 hho kernel: CR2: 0000000000000052 CR3: 0000000107562000 CR4: 0000000000350ee0
> Jul  5 01:34:20 hho kernel: Call Trace:
> Jul  5 01:34:20 hho kernel:  <TASK>
> Jul 5 01:34:20 hho kernel: ? __die+0x1f/0x60
> Jul 5 01:34:20 hho kernel: ? page_fault_oops+0x14d/0x410
> Jul 5 01:34:20 hho kernel: ? xa_load+0x82/0xa0
> Jul  5 01:34:20 hho last message buffered 1 times
> Jul 5 01:34:20 hho kernel: ? exc_page_fault+0x60/0x100
> Jul 5 01:34:20 hho kernel: ? asm_exc_page_fault+0x22/0x30
> Jul 5 01:34:20 hho kernel: ? wq_worker_comm+0x63/0xc0
> Jul  5 01:34:20 hho last message buffered 1 times
> Jul 5 01:34:20 hho kernel: proc_task_name+0xa4/0xb0
> Jul 5 01:34:20 hho kernel: ? seq_put_decimal_ull_width+0x96/0x100
> Jul 5 01:34:20 hho kernel: do_task_stat+0x44b/0xe10
> Jul 5 01:34:20 hho kernel: proc_single_show+0x4b/0xa0
> Jul 5 01:34:20 hho kernel: seq_read_iter+0xff/0x410
> Jul 5 01:34:20 hho kernel: ? generic_fillattr+0x45/0xf0
> Jul 5 01:34:20 hho kernel: seq_read+0x93/0xb0
> Jul 5 01:34:20 hho kernel: vfs_read+0x9b/0x2c0
> Jul 5 01:34:20 hho kernel: ? __do_sys_newfstatat+0x22/0x30
> Jul 5 01:34:20 hho kernel: ksys_read+0x53/0xc0
> Jul 5 01:34:20 hho kernel: do_syscall_64+0x35/0x80
> Jul 5 01:34:20 hho kernel: entry_SYSCALL_64_after_hwframe+0x46/0xb0
> Jul  5 01:34:20 hho kernel: RIP: 0033:0x7f91781d677d
> Jul 5 01:34:20 hho kernel: Code: b9 fe ff ff 48 8d 3d 1a 71 0a 00 50 e8 2c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 80 3d 81 4c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> All code
> ========
>     0:  b9 fe ff ff 48          mov    $0x48fffffe,%ecx
>     5:  8d 3d 1a 71 0a 00       lea    0xa711a(%rip),%edi        # 0xa7125
>     b:  50                      push   %rax
>     c:  e8 2c 12 02 00          call   0x2123d
>    11:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    18:  00 00 00
>    1b:  66 90                   xchg   %ax,%ax
>    1d:  80 3d 81 4c 0e 00 00    cmpb   $0x0,0xe4c81(%rip)        # 0xe4ca5
>    24:  74 17                   je     0x3d
>    26:  31 c0                   xor    %eax,%eax
>    28:  0f 05                   syscall
>    2a:* 48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
>    30:  77 5b                   ja     0x8d
>    32:  c3                      ret
>    33:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    3a:  00 00 00
>    3d:  53                      push   %rbx
>    3e:  48                      rex.W
>    3f:  83                      .byte 0x83
>
> Code starting with the faulting instruction
> ===========================================
>     0:  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
>     6:  77 5b                   ja     0x63
>     8:  c3                      ret
>     9:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    10:  00 00 00
>    13:  53                      push   %rbx
>    14:  48                      rex.W
>    15:  83                      .byte 0x83
> Jul  5 01:34:20 hho kernel: RSP: 002b:00007ffe56a8adb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> Jul  5 01:34:20 hho kernel: RAX: ffffffffffffffda RBX: 0000559458207b40 RCX: 00007f91781d677d
> Jul  5 01:34:20 hho kernel: RDX: 0000000000000400 RSI: 0000559458209d30 RDI: 0000000000000004
> Jul  5 01:34:20 hho kernel: RBP: 00007ffe56a8ae20 R08: 00007f9178276cb2 R09: 0000000000000001
> Jul  5 01:34:20 hho kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007f91782b04a0
> Jul  5 01:34:20 hho kernel: R13: 000055945820a140 R14: 0000000000000a68 R15: 00007f91782afba0
> Jul  5 01:34:20 hho kernel:  </TASK>
> Jul  5 01:34:20 hho kernel: Modules linked in: sch_fq_codel bpf_preload mousedev snd_ctl_led iwlmvm snd_hda_codec_realtek amdgpu pkcs8_key_parser snd_hda_codec_generic mac80211 libarc4 drm_ttm_helper snd_hda_codec_hdmi ttm iommu_v2 uvcvideo gpu_sched videobuf2_vmalloc i2c_algo_bit videobuf2_memops snd_hda_intel drm_buddy uvc edac_mce_amd snd_intel_dspcfg crct10dif_pclmul videobuf2_v4l2 drm_suballoc_helper crc32_pclmul lm92 snd_hda_codec drm_display_helper crc32c_intel videodev snd_hwdep ghash_clmulni_intel r8169 drivetemp cec sha512_ssse3 thinkpad_acpi snd_hda_core videobuf2_common psmouse realtek iwlwifi drm_kms_helper rapl ledtrig_audio snd_pcm mc serio_raw snd_rn_pci_acp3x platform_profile syscopyarea wmi_bmof mdio_devres k10temp ipmi_devintf snd_timer snd_acp_config sysfillrect cfg80211 drm ucsi_acpi sysimgblt snd snd_soc_acpi libphy i2c_piix4 ipmi_msghandler snd_pci_acp3x typec_ucsi soundcore rfkill video roles typec battery ac wmi i2c_scmi button
> Jul  5 01:34:20 hho kernel: CR2: 0000000000000052
> Jul  5 01:34:20 hho kernel: ---[ end trace 0000000000000000 ]---
> Jul 5 01:34:20 hho kernel: RIP: wq_worker_comm+0x63/0xc0
> Jul 5 01:34:20 hho kernel: Code: 43 2c 20 75 1d 5b 5d 48 c7 c7 e0 a4 43 82 41 5c 41 5d 41 5e e9 2e 59 8b 00 5b 5d 41 5c 41 5d 41 5e c3 48 89 df e8 ad 35 00 00 <4c> 8b 70 48 48 89 c3 4d 85 f6 74 cf 4c 89 f7 e8 d9 a3 8b 00 80 7b
> All code
> ========
>     0:  43 2c 20                rex.XB sub $0x20,%al
>     3:  75 1d                   jne    0x22
>     5:  5b                      pop    %rbx
>     6:  5d                      pop    %rbp
>     7:  48 c7 c7 e0 a4 43 82    mov    $0xffffffff8243a4e0,%rdi
>     e:  41 5c                   pop    %r12
>    10:  41 5d                   pop    %r13
>    12:  41 5e                   pop    %r14
>    14:  e9 2e 59 8b 00          jmp    0x8b5947
>    19:  5b                      pop    %rbx
>    1a:  5d                      pop    %rbp
>    1b:  41 5c                   pop    %r12
>    1d:  41 5d                   pop    %r13
>    1f:  41 5e                   pop    %r14
>    21:  c3                      ret
>    22:  48 89 df                mov    %rbx,%rdi
>    25:  e8 ad 35 00 00          call   0x35d7
>    2a:* 4c 8b 70 48             mov    0x48(%rax),%r14          <-- trapping instruction
>    2e:  48 89 c3                mov    %rax,%rbx
>    31:  4d 85 f6                test   %r14,%r14
>    34:  74 cf                   je     0x5
>    36:  4c 89 f7                mov    %r14,%rdi
>    39:  e8 d9 a3 8b 00          call   0x8ba417
>    3e:  80                      .byte 0x80
>    3f:  7b                      .byte 0x7b
>
> Code starting with the faulting instruction
> ===========================================
>     0:  4c 8b 70 48             mov    0x48(%rax),%r14
>     4:  48 89 c3                mov    %rax,%rbx
>     7:  4d 85 f6                test   %r14,%r14
>     a:  74 cf                   je     0xffffffffffffffdb
>     c:  4c 89 f7                mov    %r14,%rdi
>     f:  e8 d9 a3 8b 00          call   0x8ba3ed
>    14:  80                      .byte 0x80
>    15:  7b                      .byte 0x7b
> Jul  5 01:34:20 hho kernel: RSP: 0018:ffffc90001027bb8 EFLAGS: 00010202
> Jul  5 01:34:20 hho kernel: RAX: 000000000000000a RBX: ffff888111052640 RCX: 0001020304050608
> Jul  5 01:34:20 hho kernel: RDX: ffff88810490b300 RSI: 7fffffffffffffff RDI: ffff888111052640
> Jul  5 01:34:20 hho kernel: RBP: 000000000000000f R08: ffffc90001027be8 R09: 0000000000000040
> Jul  5 01:34:20 hho kernel: R10: fefefefefefefeff R11: 0000000000000040 R12: ffffc90001027be8
> Jul  5 01:34:20 hho kernel: R13: 0000000000000040 R14: 000000000000000c R15: 0000000000000001
> Jul  5 01:34:20 hho kernel: FS:  00007f917809a740(0000) GS:ffff8887ef600000(0000) knlGS:0000000000000000
> Jul  5 01:34:20 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul  5 01:34:20 hho kernel: CR2: 0000000000000052 CR3: 0000000107562000 CR4: 0000000000350ee0
> Jul  5 01:34:20 hho kernel: note: start-stop-daem[1716] exited with irqs disabled
> Jul  5 01:34:20 hho kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
> Jul  5 01:34:21 hho kernel: r8169 0000:02:00.0 eth0: Link is Down
> Jul  5 01:34:23 hho kernel: r8169 0000:02:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> Jul  5 01:34:23 hho kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>
> The crashing process was openrc's start-stop-daemon starting acpid, though I think
> both are just the victims here.
>
> Hope this helps!
>
> cheers
> Holger
diff mbox series

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..de94b2497600 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1224,7 +1224,7 @@  config ARCH_SUPPORTS_PER_VMA_LOCK
        def_bool n
 
 config PER_VMA_LOCK
-	def_bool y
+	bool "Enable per-vma locking during page fault handling."
 	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
 	help
 	  Allow per-vma locking during page fault handling.