Message ID | CAEemH2dMW6oh6Bbm=yqUADF+mDhuQgFTTGYftB+xAhqqdYV3Ng@mail.gmail.com (mailing list archive) |
---|---
State | New, archived |
Series | [MM,Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background |
On 7/28/19 10:17 PM, Li Wang wrote:
> Hi Naoya and Linux-MMers,
>
> The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
>
> It seems like the retry mmap() triggers SIGBUS while doing the
> numa_move_pages() in the background. That is very similar to the kernel
> bug mentioned by commit 6bc9b56433b76e40d ("mm: fix race on
> soft-offlining"): a race condition between soft offline and
> hugetlb_fault which causes unexpected process SIGBUS killing.
>
> I'm not sure if the patch below makes sense for memory-failure.c, but
> after building a new kernel-5.2.3 with this change, the problem can NOT
> be reproduced.
>
> Any comments?

Something seems strange. I can not reproduce with unmodified 5.2.3

[root@f23d move_pages]# uname -r
5.2.3
[root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:201: INFO: Free RAM 6725424 kB
move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:135: PASS: Bug not reproduced

Summary:
passed   1
failed   0
skipped  0
warnings 0

Also, the soft_offline_huge_page() code should not come into play with
this specific test.
Hi Mike,

Thanks for trying this.

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/28/19 10:17 PM, Li Wang wrote:
> > Hi Naoya and Linux-MMers,
> >
> > The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> > https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> >
> > It seems like the retry mmap() triggers SIGBUS while doing the
> > numa_move_pages() in the background. That is very similar to the kernel
> > bug mentioned by commit 6bc9b56433b76e40d ("mm: fix race on
> > soft-offlining"): a race condition between soft offline and
> > hugetlb_fault which causes unexpected process SIGBUS killing.
> >
> > I'm not sure if the patch below makes sense for memory-failure.c, but
> > after building a new kernel-5.2.3 with this change, the problem can NOT
> > be reproduced.
> >
> > Any comments?
>
> Something seems strange. I can not reproduce with unmodified 5.2.3

It's not 100% reproducible; I tried ten times and only hit 4~6 failures.

Did you try the test case with patch V3 (in my branch)?
https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6

# git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
# cd ltp.wangli/; make autotools
# ./configure; make -j24
# cd testcases/kernel/syscalls/move_pages/
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:249: INFO: Free RAM 64386300 kB
move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:183: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:117: FAIL: move_pages failed: ESRCH

# uname -r
5.2.3

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 16049 MB
node 0 free: 15736 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16123 MB
node 1 free: 15850 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16123 MB
node 2 free: 15989 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16097 MB
node 3 free: 15278 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

> Also, the soft_offline_huge_page() code should not come into play with
> this specific test.

I got the "soft offline xxx.. hugepage failed to isolate" message from
soft_offline_huge_page() in the dmesg log.

=== debug print info ===
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1701,7 +1701,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	 */
 	put_hwpoison_page(hpage);
 	if (!ret) {
-		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
+		pr_info("liwang -- soft offline: %#lx hugepage failed to isolate\n", pfn);
 		return -EBUSY;
 	}

# dmesg
...
[ 1068.947205] Soft offlining pfn 0x40b200 at process virtual address 0x7f9d8d000000
[ 1068.987054] Soft offlining pfn 0x40ac00 at process virtual address 0x7f9d8d200000
[ 1069.048478] Soft offlining pfn 0x40a800 at process virtual address 0x7f9d8d000000
[ 1069.087413] Soft offlining pfn 0x40ae00 at process virtual address 0x7f9d8d200000
[ 1069.123285] liwang -- soft offline: 0x40ae00 hugepage failed to isolate
[ 1069.160137] Soft offlining pfn 0x80f800 at process virtual address 0x7f9d8d000000
[ 1069.196009] Soft offlining pfn 0x80fe00 at process virtual address 0x7f9d8d200000
[ 1069.243436] Soft offlining pfn 0x40a400 at process virtual address 0x7f9d8d000000
[ 1069.281301] Soft offlining pfn 0x40a600 at process virtual address 0x7f9d8d200000
[ 1069.318171] liwang -- soft offline: 0x40a600 hugepage failed to isolate
On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
...
> Something seems strange. I can not reproduce with unmodified 5.2.3
>
> [root@f23d move_pages]# uname -r
> 5.2.3
> [root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
> tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:201: INFO: Free RAM 6725424 kB
> move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:135: PASS: Bug not reproduced
>
> Summary:
> passed   1
> failed   0
> skipped  0
> warnings 0

FYI: from your test log, it looks like you were running an old LTP
version; test#2 was added to move_pages12 in the latest master branch.
So the complete test log should include two parts:

# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 63759028 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced

Summary:
passed   2
failed   0
skipped  0
warnings 0
On 7/29/19 11:29 PM, Li Wang wrote:
> It's not 100% reproducible; I tried ten times and only hit 4~6 failures.
>
> Did you try the test case with patch V3 (in my branch)?
> https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6

My bad! I was using an old version of the test without the soft offline
testing.

> # git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
> # cd ltp.wangli/; make autotools
> # ./configure; make -j24
> # cd testcases/kernel/syscalls/move_pages/
> # ./move_pages12
> tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:249: INFO: Free RAM 64386300 kB
> move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:183: PASS: Bug not reproduced
> tst_test.c:1145: BROK: Test killed by SIGBUS!
> move_pages12.c:117: FAIL: move_pages failed: ESRCH

Yes, I can recreate. When I see this failure, the SIGBUS is the result
of a huge page allocation failure. The allocation was in response to a
page fault.

Note that running the test will deplete the memory of the system, as
huge pages are marked 'poisoned' and can not be reused. So, each run of
the test will take additional memory offline.

A SIGBUS is the normal behavior for a hugetlb page fault failure due to
lack of huge pages. Ugly, but that is the design. I do not believe this
test should be experiencing this, due to reservations taken at mmap
time. However, the test is combining faults, soft offline and page
migrations, so there are lots of moving parts.

I'll continue to investigate. Naoya may have more context as he
contributed to both the kernel code and the testcase.
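For readers outside mm: the "SIGBUS by design" behavior described above
can be demonstrated with a minimal userspace program. The sketch below is
an illustration, not code from this thread; it assumes a 2MB default huge
page size and uses MAP_NORESERVE to skip the mmap-time reservation, so
the pool shortage only surfaces at fault time.

#define _GNU_SOURCE	/* for MAP_HUGETLB / MAP_ANONYMOUS */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)	/* one huge page; 2MB size is assumed */

int main(void)
{
	/*
	 * MAP_NORESERVE skips the reservation normally taken at mmap
	 * time, so mmap() succeeds even when the hugepage pool is
	 * empty.  The shortage is then only discovered at fault time,
	 * and hugetlb reports it by sending SIGBUS to the faulting
	 * task instead of returning an error code.
	 */
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	p[0] = 1;	/* first touch faults; SIGBUS if no page is free */
	printf("fault succeeded\n");
	return EXIT_SUCCESS;
}

Run with the pool emptied (echo 0 > /proc/sys/vm/nr_hugepages) and the
write is killed by SIGBUS; a normal (non-NORESERVE) mapping would instead
fail the mmap() up front, which is the reservation behavior referred to
above.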
On 7/30/19 5:44 PM, Mike Kravetz wrote:
> A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> lack of huge pages. Ugly, but that is the design. I do not believe this
> test should be experiencing this, due to reservations taken at mmap
> time. However, the test is combining faults, soft offline and page
> migrations, so there are lots of moving parts.
>
> I'll continue to investigate.

There appears to be a race between hugetlb_fault and try_to_unmap_one of
the migration path.

Can you try this patch in your environment? I am not sure if it will be
the final fix, but just wanted to see if it addresses the issue for you.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3156c5432e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	page = alloc_huge_page(vma, haddr, 0);
 	if (IS_ERR(page)) {
+		/*
+		 * We could race with page migration (try_to_unmap_one)
+		 * which is modifying page table with lock.  However,
+		 * we are not holding lock here.  Before returning
+		 * error that will SIGBUS caller, get ptl and make
+		 * sure there really is no entry.
+		 */
+		ptl = huge_pte_lock(h, mm, ptep);
+		if (!huge_pte_none(huge_ptep_get(ptep))) {
+			ret = 0;
+			spin_unlock(ptl);
+			goto out;
+		}
+		spin_unlock(ptl);
 		ret = vmf_error(PTR_ERR(page));
 		goto out;
 	}
On Mon, Jul 29, 2019 at 01:17:27PM +0800, Li Wang wrote:
> Hi Naoya and Linux-MMers,
>
> The LTP/move_pages12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
>
> It seems like the retry mmap() triggers SIGBUS while doing the
> numa_move_pages() in the background. That is very similar to the kernel
> bug mentioned by commit 6bc9b56433b76e40d ("mm: fix race on
> soft-offlining"): a race condition between soft offline and
> hugetlb_fault which causes unexpected process SIGBUS killing.
>
> I'm not sure if the patch below makes sense for memory-failure.c, but
> after building a new kernel-5.2.3 with this change, the problem can NOT
> be reproduced.
>
> Any comments?
>
> ----------------------------------
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	unlock_page(hpage);
>  
>  	ret = isolate_huge_page(hpage, &pagelist);
> +	if (!ret) {
> +		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
> +		return -EBUSY;
> +	}
> +
>  	/*
>  	 * get_any_page() and isolate_huge_page() takes a refcount each,
>  	 * so need to drop one here.
>  	 */
>  	put_hwpoison_page(hpage);
> -	if (!ret) {
> -		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
> -		return -EBUSY;
> -	}

Sorry for my late response.

This change skips put_hwpoison_page() in the failure path, so
soft_offline_page() would return without releasing hpage's refcount
taken by get_any_page(), which is maybe not what we want.

- Naoya
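For reference, the ordering Naoya is defending is the existing upstream
flow, visible in the removed lines of the quoted diff; here it is again
as an annotated sketch:

	ret = isolate_huge_page(hpage, &pagelist);
	/*
	 * get_any_page() and isolate_huge_page() take a refcount each,
	 * so one is dropped here unconditionally.  If isolation failed,
	 * this is the only place the extra reference is released;
	 * returning -EBUSY before this call (as the proposed reordering
	 * does) would leak the refcount and keep the hugepage pinned.
	 */
	put_hwpoison_page(hpage);
	if (!ret) {
		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
		return -EBUSY;
	}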
On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages. Ugly, but that is the design. I do not believe this
> > test should be experiencing this, due to reservations taken at mmap
> > time. However, the test is combining faults, soft offline and page
> > migrations, so there are lots of moving parts.
> >
> > I'll continue to investigate.
>
> There appears to be a race between hugetlb_fault and try_to_unmap_one of
> the migration path.
>
> Can you try this patch in your environment? I am not sure if it will be
> the final fix, but just wanted to see if it addresses the issue for you.
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..f3156c5432e3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>  	page = alloc_huge_page(vma, haddr, 0);
>  	if (IS_ERR(page)) {
> +		/*
> +		 * We could race with page migration (try_to_unmap_one)
> +		 * which is modifying page table with lock.  However,
> +		 * we are not holding lock here.  Before returning
> +		 * error that will SIGBUS caller, get ptl and make
> +		 * sure there really is no entry.
> +		 */
> +		ptl = huge_pte_lock(h, mm, ptep);
> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
> +			ret = 0;
> +			spin_unlock(ptl);
> +			goto out;
> +		}
> +		spin_unlock(ptl);

Thank you for the investigation, Mike.
I tried this change and found no SIGBUS, so it works well.

I'm still not clear about how !huge_pte_none() becomes true here,
because we enter hugetlb_no_page() only when huge_pte_none() is true,
and (racy) try_to_unmap_one() from page migration should convert the
huge_pte into a migration entry, not null.

Thanks,
Naoya Horiguchi
Hi Mike,

Thanks for working on this.

On Fri, Aug 2, 2019 at 8:20 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages. Ugly, but that is the design. I do not believe this
> > test should be experiencing this, due to reservations taken at mmap
> > time. However, the test is combining faults, soft offline and page
> > migrations, so there are lots of moving parts.
> >
> > I'll continue to investigate.
>
> There appears to be a race between hugetlb_fault and try_to_unmap_one of
> the migration path.
>
> Can you try this patch in your environment? I am not sure if it will be
> the final fix, but just wanted to see if it addresses the issue for you.

It works for me. After rebuilding the kernel with your patch, the SIGBUS
does not appear anymore.

I'm also still wondering why the huge_pte is not none here when racing
with page migration (try_to_unmap_one).

--
Regards,
Li Wang
On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>> There appears to be a race between hugetlb_fault and try_to_unmap_one of
>> the migration path.
>>
>> Can you try this patch in your environment? I am not sure if it will be
>> the final fix, but just wanted to see if it addresses the issue for you.
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ede7e7f5d1ab..f3156c5432e3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  
>>  	page = alloc_huge_page(vma, haddr, 0);
>>  	if (IS_ERR(page)) {
>> +		/*
>> +		 * We could race with page migration (try_to_unmap_one)
>> +		 * which is modifying page table with lock.  However,
>> +		 * we are not holding lock here.  Before returning
>> +		 * error that will SIGBUS caller, get ptl and make
>> +		 * sure there really is no entry.
>> +		 */
>> +		ptl = huge_pte_lock(h, mm, ptep);
>> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
>> +			ret = 0;
>> +			spin_unlock(ptl);
>> +			goto out;
>> +		}
>> +		spin_unlock(ptl);
>
> Thank you for the investigation, Mike.
> I tried this change and found no SIGBUS, so it works well.
>
> I'm still not clear about how !huge_pte_none() becomes true here,
> because we enter hugetlb_no_page() only when huge_pte_none() is true,
> and (racy) try_to_unmap_one() from page migration should convert the
> huge_pte into a migration entry, not null.

Thanks for taking a look Naoya.

In try_to_unmap_one(), there is this code block:

	/* Nuke the page table entry. */
	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
	if (should_defer_flush(mm, flags)) {
		/*
		 * We clear the PTE but do not flush so potentially
		 * a remote CPU could still be writing to the page.
		 * If the entry was previously clean then the
		 * architecture must guarantee that a clear->dirty
		 * transition on a cached TLB entry is written through
		 * and traps if the PTE is unmapped.
		 */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);

		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}

That happens before setting the migration entry. Therefore, for a period
of time the pte is NULL (huge_pte_none() returns true).

try_to_unmap_one holds the page table lock, but hugetlb_fault does not
take the lock to 'optimistically' check huge_pte_none(). When
huge_pte_none returns true, it calls hugetlb_no_page, which is where we
try to allocate a page and fail.

Does that make sense, or am I missing something?

The patch checks for this specific condition: someone changing the pte
from NULL to non-NULL while holding the lock. I am not sure if this is
the best way to fix it. But, it may be the easiest.
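To make the race window concrete, here is an interleaving sketch
distilled from Mike's explanation above (the CPU labels and step
granularity are illustrative, not from the thread):

/*
 * CPU 0: migration (try_to_unmap_one)    CPU 1: fault (hugetlb_fault)
 * ------------------------------------   ------------------------------
 * take page table lock (ptl)
 * ptep_clear_flush()
 *   -> pte is now none                   huge_pte_none(ptep)  [no ptl held]
 *                                          -> true: enter hugetlb_no_page()
 * set migration entry
 * release ptl                            alloc_huge_page() fails
 *                                          -> vmf_error() -> SIGBUS
 *
 * The proposed patch closes the window by re-checking the pte under
 * ptl before declaring failure: if the entry is no longer none, the
 * fault returns 0 and is simply retried.
 */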
On Fri, Aug 02, 2019 at 10:42:33AM -0700, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race between hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment? I am not sure if it will be
> >> the final fix, but just wanted to see if it addresses the issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  	page = alloc_huge_page(vma, haddr, 0);
> >>  	if (IS_ERR(page)) {
> >> +		/*
> >> +		 * We could race with page migration (try_to_unmap_one)
> >> +		 * which is modifying page table with lock.  However,
> >> +		 * we are not holding lock here.  Before returning
> >> +		 * error that will SIGBUS caller, get ptl and make
> >> +		 * sure there really is no entry.
> >> +		 */
> >> +		ptl = huge_pte_lock(h, mm, ptep);
> >> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +			ret = 0;
> >> +			spin_unlock(ptl);
> >> +			goto out;
> >> +		}
> >> +		spin_unlock(ptl);
> >
> > Thank you for the investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> >
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is true,
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
>
> Thanks for taking a look Naoya.
>
> In try_to_unmap_one(), there is this code block:
>
> 	/* Nuke the page table entry. */
> 	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 	if (should_defer_flush(mm, flags)) {
> 		/*
> 		 * We clear the PTE but do not flush so potentially
> 		 * a remote CPU could still be writing to the page.
> 		 * If the entry was previously clean then the
> 		 * architecture must guarantee that a clear->dirty
> 		 * transition on a cached TLB entry is written through
> 		 * and traps if the PTE is unmapped.
> 		 */
> 		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> 		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 	} else {
> 		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 	}
>
> That happens before setting the migration entry. Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
>
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not
> take the lock to 'optimistically' check huge_pte_none(). When
> huge_pte_none returns true, it calls hugetlb_no_page, which is where we
> try to allocate a page and fail.
>
> Does that make sense, or am I missing something?

Makes sense to me, thanks.

> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock. I am not sure if this is
> the best way to fix it. But, it may be the easiest.

Yes, I think so.

- Naoya
On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race between hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment? I am not sure if it will be
> >> the final fix, but just wanted to see if it addresses the issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  	page = alloc_huge_page(vma, haddr, 0);
> >>  	if (IS_ERR(page)) {
> >> +		/*
> >> +		 * We could race with page migration (try_to_unmap_one)
> >> +		 * which is modifying page table with lock.  However,
> >> +		 * we are not holding lock here.  Before returning
> >> +		 * error that will SIGBUS caller, get ptl and make
> >> +		 * sure there really is no entry.
> >> +		 */
> >> +		ptl = huge_pte_lock(h, mm, ptep);
> >> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +			ret = 0;
> >> +			spin_unlock(ptl);
> >> +			goto out;
> >> +		}
> >> +		spin_unlock(ptl);
> >
> > Thank you for the investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> >
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is true,
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
>
> Thanks for taking a look Naoya.
>
> That happens before setting the migration entry. Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
>
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not
> take the lock to 'optimistically' check huge_pte_none(). When
> huge_pte_none returns true, it calls hugetlb_no_page, which is where we
> try to allocate a page and fail.
>
> Does that make sense, or am I missing something?
>
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock. I am not sure if this is
> the best way to fix it. But, it may be the easiest.

Please add a comment to explain this, because it is quite subtle and
tricky. Unlike the regular page fault, hugetlb_no_page is protected by a
large lock, so a retry check seems unexpected.

Thanks!
On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race between hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment? I am not sure if it will be
>>>> the final fix, but just wanted to see if it addresses the issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  	page = alloc_huge_page(vma, haddr, 0);
>>>>  	if (IS_ERR(page)) {
>>>> +		/*
>>>> +		 * We could race with page migration (try_to_unmap_one)
>>>> +		 * which is modifying page table with lock.  However,
>>>> +		 * we are not holding lock here.  Before returning
>>>> +		 * error that will SIGBUS caller, get ptl and make
>>>> +		 * sure there really is no entry.
>>>> +		 */
>>>> +		ptl = huge_pte_lock(h, mm, ptep);
>>>> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +			ret = 0;
>>>> +			spin_unlock(ptl);
>>>> +			goto out;
>>>> +		}
>>>> +		spin_unlock(ptl);
>>>
>>> Thank you for the investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is true,
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not null.
>>
>> Thanks for taking a look Naoya.
>>
>> That happens before setting the migration entry. Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not
>> take the lock to 'optimistically' check huge_pte_none(). When
>> huge_pte_none returns true, it calls hugetlb_no_page, which is where we
>> try to allocate a page and fail.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock. I am not sure if this is
>> the best way to fix it. But, it may be the easiest.
>
> Please add a comment to explain this, because it is quite subtle and
> tricky. Unlike the regular page fault, hugetlb_no_page is protected by a
> large lock, so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'. There are known
issues. The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue. However, as
mentioned above, it may be better to just patch this issue exposed by
LTP and work on the more comprehensive change in the background.
On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>> Can you try this patch in your environment? I am not sure if it will be
>>>>> the final fix, but just wanted to see if it addresses the issue for you.
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>  
>>>>>  	page = alloc_huge_page(vma, haddr, 0);
>>>>>  	if (IS_ERR(page)) {
>>>>> +		/*
>>>>> +		 * We could race with page migration (try_to_unmap_one)
>>>>> +		 * which is modifying page table with lock.  However,
>>>>> +		 * we are not holding lock here.  Before returning
>>>>> +		 * error that will SIGBUS caller, get ptl and make
>>>>> +		 * sure there really is no entry.
>>>>> +		 */
>>>>> +		ptl = huge_pte_lock(h, mm, ptep);
>>>>> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>> +			ret = 0;
>>>>> +			spin_unlock(ptl);
>>>>> +			goto out;
>>>>> +		}
>>>>> +		spin_unlock(ptl);
>>>>
>>>> Thank you for the investigation, Mike.
>>>> I tried this change and found no SIGBUS, so it works well.

Here is another way to address the issue. Take the hugetlb fault mutex in
the migration code when modifying the page tables. IIUC, the fault mutex
was introduced to prevent this same issue when there were two page faults
on the same page (and we were unable to allocate an 'extra' page). The
downside to such an approach is that we add more hugetlbfs-specific code
to try_to_unmap_one.

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edf476c8cfb9..df0e74f9962e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -485,6 +485,17 @@ static inline int hstate_index(struct hstate *h)
 	return h - hstates;
 }
 
+/*
+ * Convert the address within this vma to the page offset within
+ * the mapping, in pagecache page units; huge pages here.
+ */
+static inline pgoff_t vma_hugecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+			(vma->vm_pgoff >> huge_page_order(h));
+}
+
 pgoff_t __basepage_index(struct page *page);
 
 /* Return page->index in PAGE_SIZE units */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..959aed5b7969 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -615,17 +615,6 @@ static long region_count(struct resv_map *resv, long f, long t)
 	return chg;
 }
 
-/*
- * Convert the address within this vma to the page offset within
- * the mapping, in pagecache page units; huge pages here.
- */
-static pgoff_t vma_hugecache_offset(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	return ((address - vma->vm_start) >> huge_page_shift(h)) +
-			(vma->vm_pgoff >> huge_page_order(h));
-}
-
 pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..f8c95482c23e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1350,6 +1350,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	u32 hugetlb_hash = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1377,6 +1378,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				min(vma->vm_end, address +
 				    (PAGE_SIZE << compound_order(page))));
 	if (PageHuge(page)) {
+		struct hstate *h = hstate_vma(vma);
+
+		/*
+		 * Take the hugetlb fault mutex so that we do not race with
+		 * page faults while modifying page table.  Mutex must be
+		 * acquired before ptl below.
+		 */
+		hugetlb_hash = hugetlb_fault_mutex_hash(h,
+					vma->vm_file->f_mapping,
+					vma_hugecache_offset(h, vma, address),
+					address);
+		mutex_lock(&hugetlb_fault_mutex_table[hugetlb_hash]);
+
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
@@ -1659,6 +1673,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
+	if (PageHuge(page))
+		mutex_unlock(&hugetlb_fault_mutex_table[hugetlb_hash]);
 
 	return ret;
 }

Michal, Naoya, any preferences on how this should be fixed? I'll send a
proper patch if we agree on an approach.
On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
> On 8/5/19 10:36 AM, Mike Kravetz wrote:
> >>>>> Can you try this patch in your environment? I am not sure if it will be
> >>>>> the final fix, but just wanted to see if it addresses the issue for you.
> >>>>>
> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>>> index ede7e7f5d1ab..f3156c5432e3 100644
> >>>>> --- a/mm/hugetlb.c
> >>>>> +++ b/mm/hugetlb.c
> >>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  	page = alloc_huge_page(vma, haddr, 0);
> >>>>>  	if (IS_ERR(page)) {
> >>>>> +		/*
> >>>>> +		 * We could race with page migration (try_to_unmap_one)
> >>>>> +		 * which is modifying page table with lock.  However,
> >>>>> +		 * we are not holding lock here.  Before returning
> >>>>> +		 * error that will SIGBUS caller, get ptl and make
> >>>>> +		 * sure there really is no entry.
> >>>>> +		 */
> >>>>> +		ptl = huge_pte_lock(h, mm, ptep);
> >>>>> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
> >>>>> +			ret = 0;
> >>>>> +			spin_unlock(ptl);
> >>>>> +			goto out;
> >>>>> +		}
> >>>>> +		spin_unlock(ptl);
> >>>>
> >>>> Thank you for the investigation, Mike.
> >>>> I tried this change and found no SIGBUS, so it works well.
>
> Here is another way to address the issue. Take the hugetlb fault mutex in
> the migration code when modifying the page tables. IIUC, the fault mutex
> was introduced to prevent this same issue when there were two page faults
> on the same page (and we were unable to allocate an 'extra' page). The
> downside to such an approach is that we add more hugetlbfs-specific code
> to try_to_unmap_one.

I would rather go with the hugetlb_no_page approach, which is better
isolated.
On 8/7/19 12:39 AM, Michal Hocko wrote:
> On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
>> On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>>>> Can you try this patch in your environment? I am not sure if it will be
>>>>>>> the final fix, but just wanted to see if it addresses the issue for you.
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>>>  
>>>>>>>  	page = alloc_huge_page(vma, haddr, 0);
>>>>>>>  	if (IS_ERR(page)) {
>>>>>>> +		/*
>>>>>>> +		 * We could race with page migration (try_to_unmap_one)
>>>>>>> +		 * which is modifying page table with lock.  However,
>>>>>>> +		 * we are not holding lock here.  Before returning
>>>>>>> +		 * error that will SIGBUS caller, get ptl and make
>>>>>>> +		 * sure there really is no entry.
>>>>>>> +		 */
>>>>>>> +		ptl = huge_pte_lock(h, mm, ptep);
>>>>>>> +		if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>>>> +			ret = 0;
>>>>>>> +			spin_unlock(ptl);
>>>>>>> +			goto out;
>>>>>>> +		}
>>>>>>> +		spin_unlock(ptl);
>>>>>>
>>>>>> Thank you for the investigation, Mike.
>>>>>> I tried this change and found no SIGBUS, so it works well.
>>
>> Here is another way to address the issue. Take the hugetlb fault mutex in
>> the migration code when modifying the page tables. IIUC, the fault mutex
>> was introduced to prevent this same issue when there were two page faults
>> on the same page (and we were unable to allocate an 'extra' page). The
>> downside to such an approach is that we add more hugetlbfs-specific code
>> to try_to_unmap_one.
>
> I would rather go with the hugetlb_no_page approach, which is better
> isolated.

Sounds good. And, after more thought, modifying try_to_unmap_one will
not work. Why? It violates lock ordering. The current ordering is
hugetlb fault mutex, page lock, then page table lock. The page lock is
taken before calling try_to_unmap_one. In addition, try_to_unmap is
unmapping from multiple vmas, so we can not know the values for the
hugetlb hash before taking the page lock, as the hash values are vma
specific. So, without many modifications we can not add the hugetlb
fault mutex to this code path.
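Summarizing the ordering argument as a sketch (distilled from Mike's
explanation above; not code from the thread):

/*
 * Established hugetlb lock ordering:
 *
 *     hugetlb fault mutex  ->  page lock  ->  page table lock (ptl)
 *
 * try_to_unmap_one() runs with the page lock already held, so taking
 * the fault mutex there would acquire
 *
 *     page lock  ->  hugetlb fault mutex
 *
 * inverting the established order and risking deadlock against a
 * concurrent fault.  The mutex hash also depends on the mapping and
 * the offset within each vma, which are only known after the page
 * lock is taken and each vma is visited.
 */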
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	unlock_page(hpage);
 
 	ret = isolate_huge_page(hpage, &pagelist);
+	if (!ret) {
+		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
+		return -EBUSY;
+	}
+
 	/*
 	 * get_any_page() and isolate_huge_page() takes a refcount each,
 	 * so need to drop one here.
 	 */
 	put_hwpoison_page(hpage);
-	if (!ret) {
-		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
-		return -EBUSY;
-	}