
[MM,Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background

Message ID CAEemH2dMW6oh6Bbm=yqUADF+mDhuQgFTTGYftB+xAhqqdYV3Ng@mail.gmail.com (mailing list archive)
State New, archived
Series [MM,Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background

Commit Message

Li Wang July 29, 2019, 5:17 a.m. UTC
Hi Naoya and Linux-MMers,

The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c

It seems like the retried mmap() triggers SIGBUS while numa_move_pages()
is running in the background. That is very similar to the kernel bug
mentioned in commit 6bc9b56433b76e40d ("mm: fix race on soft-offlining"):
a race condition between soft offline and hugetlb_fault which causes
unexpected process SIGBUS killing.
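
For context, the core of the reproducer is roughly the following (a
simplified userspace sketch based on the test description above, not the
exact LTP code; the real move_pages12.c adds setup, IPC and error handling,
and V3 additionally soft-offlines the pages):

#define _GNU_SOURCE
#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HPAGE_SIZE	(2048UL * 1024)	/* 2MB hugepages, as in the logs */
#define TEST_PAGES	2

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		/* Child: repeatedly map, touch and unmap hugetlb pages. */
		for (;;) {
			char *p = mmap(NULL, TEST_PAGES * HPAGE_SIZE,
				       PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
				       -1, 0);
			if (p == MAP_FAILED)
				continue;	/* retried mmap(); SIGBUS hits on the touch */
			memset(p, 0, TEST_PAGES * HPAGE_SIZE);
			munmap(p, TEST_PAGES * HPAGE_SIZE);
		}
	}

	/*
	 * Parent: keep migrating the child's hugepages between two nodes.
	 * The real test learns the child's mapping addresses via IPC;
	 * that is elided here.
	 */
	for (;;) {
		void *pages[TEST_PAGES] = { 0 };
		int nodes[TEST_PAGES] = { 0, 1 };
		int status[TEST_PAGES];

		move_pages(child, TEST_PAGES, pages, nodes, status,
			   MPOL_MF_MOVE_ALL);
	}
}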

I'm not sure whether the patch below makes sense for mm/memory-failure.c,
but after building a new kernel-5.2.3 with this change, the problem can
NOT be reproduced.

Any comments?

----------------------------------


----- test on kernel-v5.2.3 ------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:251: INFO: Free RAM 194212832 kB
move_pages12.c:269: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:279: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:185: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:114: FAIL: move_pages failed: ESRCH

----- test on kernel-v5.2.3  + above patch------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 64780164 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 7
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 10
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced
--
Regards,
Li Wang

Comments

Mike Kravetz July 29, 2019, 7 p.m. UTC | #1
On 7/28/19 10:17 PM, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> 
> It seems like the retried mmap() triggers SIGBUS while numa_move_pages() is
> running in the background. That is very similar to the kernel bug mentioned
> in commit 6bc9b56433b76e40d ("mm: fix race on soft-offlining"): a race
> condition between soft offline and hugetlb_fault which causes unexpected
> process SIGBUS killing.
> 
> I'm not sure whether the patch below makes sense for mm/memory-failure.c, but
> after building a new kernel-5.2.3 with this change, the problem can NOT be
> reproduced.
> 
> Any comments?

Something seems strange.  I can not reproduce with unmodified 5.2.3

[root@f23d move_pages]# uname -r
5.2.3
[root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:201: INFO: Free RAM 6725424 kB
move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:135: PASS: Bug not reproduced

Summary:
passed   1
failed   0
skipped  0
warnings 0

Also, the soft_offline_huge_page() code should not come into play with
this specific test.
Li Wang July 30, 2019, 6:29 a.m. UTC | #2
Hi Mike,

Thanks for trying this.

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/28/19 10:17 PM, Li Wang wrote:
> > Hi Naoya and Linux-MMers,
> >
> > The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> > https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> >
> > It seems like the retried mmap() triggers SIGBUS while numa_move_pages()
> > is running in the background. That is very similar to the kernel bug
> > mentioned in commit 6bc9b56433b76e40d ("mm: fix race on soft-offlining"):
> > a race condition between soft offline and hugetlb_fault which causes
> > unexpected process SIGBUS killing.
> >
> > I'm not sure whether the patch below makes sense for mm/memory-failure.c,
> > but after building a new kernel-5.2.3 with this change, the problem can
> > NOT be reproduced.
> >
> > Any comments?
>
> Something seems strange.  I can not reproduce with unmodified 5.2.3

It's not 100% reproducible; in ten tries I hit the failure only 4~6 times.

Did you try the test case with patch V3 (in my branch)?
https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6

# git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
# cd ltp.wangli/; make autotools;
# ./configure ; make -j24
# cd testcases/kernel/syscalls/move_pages/
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:249: INFO: Free RAM 64386300 kB
move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:183: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:117: FAIL: move_pages failed: ESRCH

# uname -r
5.2.3

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 16049 MB
node 0 free: 15736 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16123 MB
node 1 free: 15850 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16123 MB
node 2 free: 15989 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16097 MB
node 3 free: 15278 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

> Also, the soft_offline_huge_page() code should not come into play with
> this specific test.

I got the "soft offline xxx.. hugepage failed to isolate" message from
soft_offline_huge_page()
in dmesg log.

=== debug print info ===
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1701,7 +1701,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
         */
        put_hwpoison_page(hpage);
        if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               pr_info("liwang -- soft offline: %#lx hugepage failed to
isolate\n", pfn);
                return -EBUSY;
        }

# dmesg
...
[ 1068.947205] Soft offlining pfn 0x40b200 at process virtual address 0x7f9d8d000000
[ 1068.987054] Soft offlining pfn 0x40ac00 at process virtual address 0x7f9d8d200000
[ 1069.048478] Soft offlining pfn 0x40a800 at process virtual address 0x7f9d8d000000
[ 1069.087413] Soft offlining pfn 0x40ae00 at process virtual address 0x7f9d8d200000
[ 1069.123285] liwang -- soft offline: 0x40ae00 hugepage failed to isolate
[ 1069.160137] Soft offlining pfn 0x80f800 at process virtual address 0x7f9d8d000000
[ 1069.196009] Soft offlining pfn 0x80fe00 at process virtual address 0x7f9d8d200000
[ 1069.243436] Soft offlining pfn 0x40a400 at process virtual address 0x7f9d8d000000
[ 1069.281301] Soft offlining pfn 0x40a600 at process virtual address 0x7f9d8d200000
[ 1069.318171] liwang -- soft offline: 0x40a600 hugepage failed to isolate
Li Wang July 30, 2019, 6:38 a.m. UTC | #3
On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
...
> Something seems strange.  I can not reproduce with unmodified 5.2.3
>
> [root@f23d move_pages]# uname -r
> 5.2.3
> [root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
> tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:201: INFO: Free RAM 6725424 kB
> move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:135: PASS: Bug not reproduced
>
> Summary:
> passed   1
> failed   0
> skipped  0
> warnings 0

FYI:

From your test log, it looks like you were running an old LTP version;
test#2 was added to move_pages12 in the latest master branch.

So, the complete test log should include two parts:

# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 63759028 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced

Summary:
passed   2
failed   0
skipped  0
warnings 0
Mike Kravetz July 31, 2019, 12:44 a.m. UTC | #4
On 7/29/19 11:29 PM, Li Wang wrote:
> It's not 100% reproducible; in ten tries I hit the failure only 4~6 times.
> 
> Did you try the test case with patch V3 (in my branch)?
> https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6
> 

My bad!  I was using an old version of the test without the soft offline
testing.

> # git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
> # cd ltp.wangli/; make autotools;
> # ./configure ; make -j24
> # cd testcases/kernel/syscalls/move_pages/
> # ./move_pages12 
> tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:249: INFO: Free RAM 64386300 kB
> move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:183: PASS: Bug not reproduced
> tst_test.c:1145: BROK: Test killed by SIGBUS!
> move_pages12.c:117: FAIL: move_pages failed: ESRCH

Yes, I can recreate.

When I see this failure, the SIGBUS is the result of a huge page allocation
failure.  The allocation was in response to a page fault.

Note that running the test will deplete memory of the system as huge pages
are marked 'poisoned' and can not be reused.  So, each run of the test will
take additional memory offline.

A SIGBUS is the normal behavior for a hugetlb page fault failure due to
lack of huge pages.  Ugly, but that is the design.  I do not believe this
test should be experiencing this, due to reservations taken at mmap
time.  However, the test is combining faults, soft offline and page
migrations, so there are lots of moving parts.
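
For reference, here is why reservations should cover this: for a mapping
like the one below, the huge page reservation is taken at mmap() time, so
later faults on the mapping are expected to always find a page.  (A minimal
userspace sketch, assuming 2MB hugepages are configured; not taken from the
test itself.)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4 * 2048UL * 1024;	/* four 2MB huge pages */

	/* The huge page reservation is taken here, at mmap() time ... */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {		/* fails if pages can not be reserved */
		perror("mmap");
		return 1;
	}

	/* ... so these faults should always be backed by a page (no SIGBUS). */
	memset(p, 0, len);
	munmap(p, len);
	return 0;
}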

I'll continue to investigate.

Naoya may have more context as he contributed to both the kernel code and
the testcase.
Mike Kravetz Aug. 2, 2019, 12:19 a.m. UTC | #5
On 7/30/19 5:44 PM, Mike Kravetz wrote:
> A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> lack of huge pages.  Ugly, but that is the design.  I do not believe this
> test should be experiencing this, due to reservations taken at mmap
> time.  However, the test is combining faults, soft offline and page
> migrations, so there are lots of moving parts.
> 
> I'll continue to investigate.

There appears to be a race with hugetlb_fault and try_to_unmap_one of
the migration path.

Can you try this patch in your environment?  I am not sure if it will
be the final fix, but just wanted to see if it addresses the issue for you.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3156c5432e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
+			/*
+			 * We could race with page migration (try_to_unmap_one)
+			 * which is modifying page table with lock.  However,
+			 * we are not holding lock here.  Before returning
+			 * error that will SIGBUS caller, get ptl and make
+			 * sure there really is no entry.
+			 */
+			ptl = huge_pte_lock(h, mm, ptep);
+			if (!huge_pte_none(huge_ptep_get(ptep))) {
+				ret = 0;
+				spin_unlock(ptl);
+				goto out;
+			}
+			spin_unlock(ptl);
 			ret = vmf_error(PTR_ERR(page));
 			goto out;
 		}
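
(With this change, when the recheck finds a non-none pte, hugetlb_no_page
returns 0 and the faulting instruction is simply retried; on the retry,
hugetlb_fault should see the migration entry and wait for the migration to
complete instead of killing the task.)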
Naoya Horiguchi Aug. 2, 2019, 3:48 a.m. UTC | #6
On Mon, Jul 29, 2019 at 01:17:27PM +0800, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> 
> It seems like the retried mmap() triggers SIGBUS while numa_move_pages() is
> running in the background. That is very similar to the kernel bug mentioned
> in commit 6bc9b56433b76e40d ("mm: fix race on soft-offlining"): a race
> condition between soft offline and hugetlb_fault which causes unexpected
> process SIGBUS killing.
> 
> I'm not sure whether the patch below makes sense for mm/memory-failure.c, but
> after building a new kernel-5.2.3 with this change, the problem can NOT be
> reproduced.
> 
> Any comments?
> 
> ----------------------------------
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page, int flags)
>         unlock_page(hpage);
> 
>         ret = isolate_huge_page(hpage, &pagelist);
> +       if (!ret) {
> +               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> +               return -EBUSY;
> +       }
> +
>         /*
>          * get_any_page() and isolate_huge_page() takes a refcount each,
>          * so need to drop one here.
>          */
>         put_hwpoison_page(hpage);
> -       if (!ret) {
> -               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> -               return -EBUSY;
> -       }

Sorry for my late response.

This change skips put_hwpoison_page() in the failure path, so soft_offline_page()
would return without releasing hpage's refcount taken by get_any_page(),
which is maybe not what we want.
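
To spell it out, with the reordering the failure path becomes (a sketch of
the resulting control flow):

	ret = isolate_huge_page(hpage, &pagelist);
	if (!ret)
		return -EBUSY;		/* ref from get_any_page() still held */
	put_hwpoison_page(hpage);	/* only reached when isolation succeeded */

i.e. the extra reference taken by get_any_page() leaks whenever isolation
fails.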

- Naoya
Naoya Horiguchi Aug. 2, 2019, 4:15 a.m. UTC | #7
On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should be experiencing this, due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so there are lots of moving parts.
> > 
> > I'll continue to investigate.
> 
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
> 
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..f3156c5432e3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>  		page = alloc_huge_page(vma, haddr, 0);
>  		if (IS_ERR(page)) {
> +			/*
> +			 * We could race with page migration (try_to_unmap_one)
> +			 * which is modifying page table with lock.  However,
> +			 * we are not holding lock here.  Before returning
> +			 * error that will SIGBUS caller, get ptl and make
> +			 * sure there really is no entry.
> +			 */
> +			ptl = huge_pte_lock(h, mm, ptep);
> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> +				ret = 0;
> +				spin_unlock(ptl);
> +				goto out;
> +			}
> +			spin_unlock(ptl);

Thank you for the investigation, Mike.
I tried this change and found no SIGBUS, so it works well.

I'm still not clear about how !huge_pte_none() becomes true here,
because we enter hugetlb_no_page() only when huge_pte_none() is true,
and (racy) try_to_unmap_one() from page migration should convert the
huge_pte into a migration entry, not none.

Thanks,
Naoya Horiguchi
Li Wang Aug. 2, 2019, 9:59 a.m. UTC | #8
Hi Mike,

Thanks for working on this.

On Fri, Aug 2, 2019 at 8:20 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should be experiencing this, due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so there are lots of moving parts.
> >
> > I'll continue to investigate.
>
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
>
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.

It works for me. After rebuilding the kernel with your patch, SIGBUS
does not appear anymore.

And I'm also wondering why the huge_pte is not none here when racing
with page migration (try_to_unmap_one).

--
Regards,
Li Wang
Mike Kravetz Aug. 2, 2019, 5:42 p.m. UTC | #9
On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>> the migration path.
>>
>> Can you try this patch in your environment?  I am not sure if it will
>> be the final fix, but just wanted to see if it addresses issue for you.
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ede7e7f5d1ab..f3156c5432e3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  
>>  		page = alloc_huge_page(vma, haddr, 0);
>>  		if (IS_ERR(page)) {
>> +			/*
>> +			 * We could race with page migration (try_to_unmap_one)
>> +			 * which is modifying page table with lock.  However,
>> +			 * we are not holding lock here.  Before returning
>> +			 * error that will SIGBUS caller, get ptl and make
>> +			 * sure there really is no entry.
>> +			 */
>> +			ptl = huge_pte_lock(h, mm, ptep);
>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>> +				ret = 0;
>> +				spin_unlock(ptl);
>> +				goto out;
>> +			}
>> +			spin_unlock(ptl);
> 
> Thanks you for investigation, Mike.
> I tried this change and found no SIGBUS, so it works well.
> 
> I'm still not clear about how !huge_pte_none() becomes true here,
> because we enter hugetlb_no_page() only when huge_pte_none() is true,
> and (racy) try_to_unmap_one() from page migration should convert the
> huge_pte into a migration entry, not none.

Thanks for taking a look Naoya.

In try_to_unmap_one(), there is this code block:

		/* Nuke the page table entry. */
		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
		if (should_defer_flush(mm, flags)) {
			/*
			 * We clear the PTE but do not flush so potentially
			 * a remote CPU could still be writing to the page.
			 * If the entry was previously clean then the
			 * architecture must guarantee that a clear->dirty
			 * transition on a cached TLB entry is written through
			 * and traps if the PTE is unmapped.
			 */
			pteval = ptep_get_and_clear(mm, address, pvmw.pte);

			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
		} else {
			pteval = ptep_clear_flush(vma, address, pvmw.pte);
		}

That happens before setting the migration entry.  Therefore, for a period
of time the pte is NULL (huge_pte_none() returns true).

try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
returns true, it calls hugetlb_no_page, which is where we try to allocate
a page and fail.

Does that make sense, or am I missing something?

The patch checks for this specific condition: someone changing the pte
from NULL to non-NULL while holding the lock.  I am not sure if this is
the best way to fix it.  But, it may be the easiest.
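
To make the window concrete, the interleaving looks roughly like this (a
sketch of the event ordering, not the literal code; try_to_unmap_one takes
the ptl via page_vma_mapped_walk, shown here as an explicit lock):

	/* migration (try_to_unmap_one)		fault (hugetlb_fault)
	 *
	 * ptl = huge_pte_lock(...);
	 * ptep_clear_flush(...);		huge_pte_none(ptep) -> true
	 *   (pte is now none)			  (unlocked 'optimistic' check)
	 *					hugetlb_no_page()
	 *					  alloc_huge_page() fails
	 *					  -> SIGBUS without the fix
	 * set pte to migration entry
	 * spin_unlock(ptl);
	 */

The proposed check retakes the ptl in the error path, so by the time it
rechecks huge_pte_none() the migration entry is visible and the fault can
safely return and be retried.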
Naoya Horiguchi Aug. 5, 2019, 12:40 a.m. UTC | #10
On Fri, Aug 02, 2019 at 10:42:33AM -0700, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is true,
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not none.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?

Makes sense to me, thanks.

> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Yes, I think so.

- Naoya
Michal Hocko Aug. 5, 2019, 8:57 a.m. UTC | #11
On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is true,
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not none.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?
> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Please add a comment to explain this because this is quite subtle and
tricky. Unlike the regular page fault, hugetlb_no_page is protected by a
large lock so a retry check seems unexpected.

Thanks!
Mike Kravetz Aug. 5, 2019, 5:36 p.m. UTC | #12
On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment?  I am not sure if it will
>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>  		if (IS_ERR(page)) {
>>>> +			/*
>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>> +			 * which is modifying page table with lock.  However,
>>>> +			 * we are not holding lock here.  Before returning
>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>> +			 * sure there really is no entry.
>>>> +			 */
>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +				ret = 0;
>>>> +				spin_unlock(ptl);
>>>> +				goto out;
>>>> +			}
>>>> +			spin_unlock(ptl);
>>>
>>> Thanks you for investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is true,
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not none.
>>
>> Thanks for taking a look Naoya.
>>
>> In try_to_unmap_one(), there is this code block:
>>
>> 		/* Nuke the page table entry. */
>> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> 		if (should_defer_flush(mm, flags)) {
>> 			/*
>> 			 * We clear the PTE but do not flush so potentially
>> 			 * a remote CPU could still be writing to the page.
>> 			 * If the entry was previously clean then the
>> 			 * architecture must guarantee that a clear->dirty
>> 			 * transition on a cached TLB entry is written through
>> 			 * and traps if the PTE is unmapped.
>> 			 */
>> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>
>> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>> 		} else {
>> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>> 		}
>>
>> That happens before setting the migration entry.  Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
>> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
>> returns true, it calls hugetlb_no_page which is where we try to allocate
>> a page and fails.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock.  I am not sure if this is
>> the best way to fix.  But, it may be the easiest.
> 
> Please add a comment to explain this because this is quite subtle and
> tricky. Unlike the regular page fault, hugetlb_no_page is protected by a
> large lock so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'.  There are known issues.
The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue.

However, as mentioned above, it may be better to just patch this issue exposed
by LTP and work on the more comprehensive change in the background.
Mike Kravetz Aug. 7, 2019, 12:07 a.m. UTC | #13
On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>  
>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>  		if (IS_ERR(page)) {
>>>>> +			/*
>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>> +			 * which is modifying page table with lock.  However,
>>>>> +			 * we are not holding lock here.  Before returning
>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>> +			 * sure there really is no entry.
>>>>> +			 */
>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>> +				ret = 0;
>>>>> +				spin_unlock(ptl);
>>>>> +				goto out;
>>>>> +			}
>>>>> +			spin_unlock(ptl);
>>>>
>>>> Thanks you for investigation, Mike.
>>>> I tried this change and found no SIGBUS, so it works well.

Here is another way to address the issue.  Take the hugetlb fault mutex in
the migration code when modifying the page tables.  IIUC, the fault mutex
was introduced to prevent this same issue when there were two page faults
on the same page (and we were unable to allocate an 'extra' page).  The
downside to such an approach is that we add more hugetlbfs specific code
to try_to_unmap_one.

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edf476c8cfb9..df0e74f9962e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -485,6 +485,17 @@ static inline int hstate_index(struct hstate *h)
 	return h - hstates;
 }
 
+/*
+ * Convert the address within this vma to the page offset within
+ * the mapping, in pagecache page units; huge pages here.
+ */
+static inline pgoff_t vma_hugecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+		(vma->vm_pgoff >> huge_page_order(h));
+}
+
 pgoff_t __basepage_index(struct page *page);
 
 /* Return page->index in PAGE_SIZE units */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..959aed5b7969 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -615,17 +615,6 @@ static long region_count(struct resv_map *resv, long f, long t)
 	return chg;
 }
 
-/*
- * Convert the address within this vma to the page offset within
- * the mapping, in pagecache page units; huge pages here.
- */
-static pgoff_t vma_hugecache_offset(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	return ((address - vma->vm_start) >> huge_page_shift(h)) +
-			(vma->vm_pgoff >> huge_page_order(h));
-}
-
 pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..f8c95482c23e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1350,6 +1350,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	u32 hugetlb_hash = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1377,6 +1378,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				min(vma->vm_end, address +
 				    (PAGE_SIZE << compound_order(page))));
 	if (PageHuge(page)) {
+		struct hstate *h = hstate_vma(vma);
+
+		/*
+		 * Take the hugetlb fault mutex so that we do not race with
+		 * page faults while modifying page table.  Mutex must be
+		 * acquired before ptl below.
+		 */
+		hugetlb_hash = hugetlb_fault_mutex_hash(h,
+					vma->vm_file->f_mapping,
+					vma_hugecache_offset(h, vma, address),
+					address);
+		mutex_lock(&hugetlb_fault_mutex_table[hugetlb_hash]);
+
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
@@ -1659,6 +1673,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
+	if (PageHuge(page))
+		mutex_unlock(&hugetlb_fault_mutex_table[hugetlb_hash]);
 
 	return ret;
 }


Michal, Naoya, any preferences on how this should be fixed?  I'll send a
proper patch if we agree on an approach.
Michal Hocko Aug. 7, 2019, 7:39 a.m. UTC | #14
On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
> On 8/5/19 10:36 AM, Mike Kravetz wrote:
> >>>>> Can you try this patch in your environment?  I am not sure if it will
> >>>>> be the final fix, but just wanted to see if it addresses issue for you.
> >>>>>
> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>>> index ede7e7f5d1ab..f3156c5432e3 100644
> >>>>> --- a/mm/hugetlb.c
> >>>>> +++ b/mm/hugetlb.c
> >>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  		page = alloc_huge_page(vma, haddr, 0);
> >>>>>  		if (IS_ERR(page)) {
> >>>>> +			/*
> >>>>> +			 * We could race with page migration (try_to_unmap_one)
> >>>>> +			 * which is modifying page table with lock.  However,
> >>>>> +			 * we are not holding lock here.  Before returning
> >>>>> +			 * error that will SIGBUS caller, get ptl and make
> >>>>> +			 * sure there really is no entry.
> >>>>> +			 */
> >>>>> +			ptl = huge_pte_lock(h, mm, ptep);
> >>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >>>>> +				ret = 0;
> >>>>> +				spin_unlock(ptl);
> >>>>> +				goto out;
> >>>>> +			}
> >>>>> +			spin_unlock(ptl);
> >>>>
> >>>> Thanks you for investigation, Mike.
> >>>> I tried this change and found no SIGBUS, so it works well.
> 
> Here is another way to address the issue.  Take the hugetlb fault mutex in
> the migration code when modifying the page tables.  IIUC, the fault mutex
> was introduced to prevent this same issue when there were two page faults
> on the same page (and we were unable to allocate an 'extra' page).  The
> downside to such an approach is that we add more hugetlbfs specific code
> to try_to_unmap_one.

I would rather go with the hugetlb_no_page approach, which is better isolated.
Mike Kravetz Aug. 7, 2019, 3:10 p.m. UTC | #15
On 8/7/19 12:39 AM, Michal Hocko wrote:
> On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
>> On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>>>  
>>>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>>>  		if (IS_ERR(page)) {
>>>>>>> +			/*
>>>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>>>> +			 * which is modifying page table with lock.  However,
>>>>>>> +			 * we are not holding lock here.  Before returning
>>>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>>>> +			 * sure there really is no entry.
>>>>>>> +			 */
>>>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>>>> +				ret = 0;
>>>>>>> +				spin_unlock(ptl);
>>>>>>> +				goto out;
>>>>>>> +			}
>>>>>>> +			spin_unlock(ptl);
>>>>>>
>>>>>> Thanks you for investigation, Mike.
>>>>>> I tried this change and found no SIGBUS, so it works well.
>>
>> Here is another way to address the issue.  Take the hugetlb fault mutex in
>> the migration code when modifying the page tables.  IIUC, the fault mutex
>> was introduced to prevent this same issue when there were two page faults
>> on the same page (and we were unable to allocate an 'extra' page).  The
>> downside to such an approach is that we add more hugetlbfs specific code
>> to try_to_unmap_one.
> 
> I would rather go with the hugetlb_no_page approach, which is better isolated.

Sounds good.

And, after more thought, modifying try_to_unmap_one will not work.  Why?
It violates lock ordering.  The current ordering is hugetlb fault mutex, page
lock, then page table lock.  The page lock is taken before calling
try_to_unmap_one.  In addition, try_to_unmap unmaps from multiple vmas, so we
can not know the hugetlb hash values before taking the page lock, as the hash
values are vma specific.  So, without many modifications we can not add the
hugetlb fault mutex to this code path.
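
Roughly, the ordering problem looks like this (sketch):

	/* fault path (established order)	try_to_unmap_one (inverted)
	 *
	 * hugetlb fault mutex			lock_page(page)  <- held by caller
	 * lock_page(page)			hugetlb fault mutex  <- too late
	 * page table lock			page table lock
	 */

and since try_to_unmap() walks every vma mapping the page, the vma-specific
mutex hash (mapping + offset) only becomes known after the page lock is
already held.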

Patch

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page, int flags)
        unlock_page(hpage);

        ret = isolate_huge_page(hpage, &pagelist);
+       if (!ret) {
+               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               return -EBUSY;
+       }
+
        /*
         * get_any_page() and isolate_huge_page() takes a refcount each,
         * so need to drop one here.
         */
        put_hwpoison_page(hpage);
-       if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
-               return -EBUSY;
-       }