Message ID | 20220901222707.477402-1-shy828301@gmail.com (mailing list archive)
---|---
State | New
Series | mm: gup: fix the fast GUP race against THP collapse
Hi, Yang, On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > sufficient to handle concurrent GUP-fast in all cases, it only handles > traditional IPI-based GUP-fast correctly. If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, I'm kind of confused why it's not sufficient even if with RCU gup? Isn't that'll keep working as long as interrupt disabled (which current fast-gup will still do)? IIUC the issue is you suspect not all archs correctly implemented pmdp_collapse_flush(), or am I wrong? > On architectures that send > an IPI broadcast on TLB flush, it works as expected. But on the > architectures that do not use IPI to broadcast TLB flush, it may have > the below race: > > CPU A CPU B > THP collapse fast GUP > gup_pmd_range() <-- see valid pmd > gup_pte_range() <-- work on pte > pmdp_collapse_flush() <-- clear pmd and flush > __collapse_huge_page_isolate() > check page pinned <-- before GUP bump refcount > pin the page > check PTE <-- no change > __collapse_huge_page_copy() > copy data to huge page > ptep_clear() > install huge pmd for the huge page > return the stale page > discard the stale page > > The race could be fixed by checking whether PMD is changed or not after > taking the page pin in fast GUP, just like what it does for PTE. If the > PMD is changed it means there may be parallel THP collapse, so GUP > should back off. Could the race also be fixed by impl pmdp_collapse_flush() correctly for the archs that are missing? Do you know which arch(s) is broken with it? It's just not clear to me whether this patch is an optimization or a fix, if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would still be needed. Thanks, > > Also update the stale comment about serializing against fast GUP in > khugepaged. > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > Signed-off-by: Yang Shi <shy828301@gmail.com> > --- > mm/gup.c | 30 ++++++++++++++++++++++++------ > mm/khugepaged.c | 10 ++++++---- > 2 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index f3fc1f08d90c..4365b2811269 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > } > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > struct dev_pagemap *pgmap = NULL; > int nr_start = *nr, ret = 0; > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > goto pte_unmap; > } > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > + /* > + * THP collapse conceptually does: > + * 1. Clear and flush PMD > + * 2. Check the base page refcount > + * 3. Copy data to huge page > + * 4. Clear PTE > + * 5. Discard the base page > + * > + * So fast GUP may race with THP collapse then pin and > + * return an old page since TLB flush is no longer sufficient > + * to serialize against fast GUP. > + * > + * Check PMD, if it is changed just back off since it > + * means there may be parallel THP collapse. 
> + */ > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > + unlikely(pte_val(pte) != pte_val(*ptep))) { > gup_put_folio(folio, 1, flags); > goto pte_unmap; > } > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > * get_user_pages_fast_only implementation that can pin pages. Thus it's still > * useful to have gup_huge_pmd even if we can't operate on ptes. > */ > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > return 0; > } > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, > PMD_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) > + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) > return 0; > } while (pmdp++, addr = next, addr != end); > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2d74cf01f694..518b49095db3 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > /* > - * After this gup_fast can't run anymore. This also removes > - * any huge TLB entry from the CPU so we won't allow > - * huge and small TLB entries for the same virtual address > - * to avoid the risk of CPU bugs in that area. > + * This removes any huge TLB entry from the CPU so we won't allow > + * huge and small TLB entries for the same virtual address to > + * avoid the risk of CPU bugs in that area. > + * > + * Parallel fast GUP is fine since fast GUP will back off when > + * it detects PMD is changed. > */ > _pmd = pmdp_collapse_flush(vma, address, pmd); > spin_unlock(pmd_ptl); > -- > 2.26.3 >
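To make the changelog's race and the fix easier to see side by side, below is a condensed sketch of the two paths with this patch applied. It is abridged, not the literal diff: the helpers are the ones used by mm/gup.c and mm/khugepaged.c in this series, but the surrounding loops and error handling are omitted.

	/* CPU B: gup_pte_range(), interrupts disabled, no page table locks */
	pte = ptep_get_lockless(ptep);                   /* snapshot the PTE */
	folio = try_grab_folio(page, 1, flags);          /* take the pin (refcount) */
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||  /* new: PMD changed? */
	    unlikely(pte_val(pte) != pte_val(*ptep))) {  /* existing: PTE changed? */
		gup_put_folio(folio, 1, flags);          /* back off to slow GUP */
		goto pte_unmap;
	}

	/* CPU A: collapse_huge_page(), under mmap_lock and the pmd/pte locks */
	_pmd = pmdp_collapse_flush(vma, address, pmd);   /* clear the PMD and flush */
	/*
	 * __collapse_huge_page_isolate() then refuses to collapse any base
	 * page whose refcount shows an extra (e.g. GUP) reference.
	 */

Either the pin is taken before the PMD is cleared, in which case the isolate step sees the elevated refcount and aborts the collapse, or it is taken afterwards, in which case the new PMD recheck fires and fast GUP drops the pin. Returning a stale page would require both sides to miss each other, which is the window the recheck closes (subject to the memory-ordering question raised later in the thread).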
On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > > Hi, Yang, > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > traditional IPI-based GUP-fast correctly. > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > that'll keep working as long as interrupt disabled (which current fast-gup > will still do)? Actually the wording was copied from David's commit log for his PageAnonExclusive fix. My understanding is the IPI broadcast still works, but it may not be supported by all architectures and not preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > IIUC the issue is you suspect not all archs correctly implemented > pmdp_collapse_flush(), or am I wrong? This is a possible fix, please see below for details. > > > On architectures that send > > an IPI broadcast on TLB flush, it works as expected. But on the > > architectures that do not use IPI to broadcast TLB flush, it may have > > the below race: > > > > CPU A CPU B > > THP collapse fast GUP > > gup_pmd_range() <-- see valid pmd > > gup_pte_range() <-- work on pte > > pmdp_collapse_flush() <-- clear pmd and flush > > __collapse_huge_page_isolate() > > check page pinned <-- before GUP bump refcount > > pin the page > > check PTE <-- no change > > __collapse_huge_page_copy() > > copy data to huge page > > ptep_clear() > > install huge pmd for the huge page > > return the stale page > > discard the stale page > > > > The race could be fixed by checking whether PMD is changed or not after > > taking the page pin in fast GUP, just like what it does for PTE. If the > > PMD is changed it means there may be parallel THP collapse, so GUP > > should back off. > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > the archs that are missing? Do you know which arch(s) is broken with it? Yes, and this was suggested by me in the first place, but per the suggestion from John and David, this is not the preferred way. I think it is because: Firstly, using IPI to serialize against fast GUP is not recommended anymore since fast GUP does check PTE then back off so we should avoid it. Secondly, if checking PMD then backing off could solve the problem, why do we still need broadcast IPI? It doesn't sound performant. > > It's just not clear to me whether this patch is an optimization or a fix, > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > still be needed. It is a fix and the fix will make IPI broadcast not useful anymore. > > Thanks, > > > > > Also update the stale comment about serializing against fast GUP in > > khugepaged. 
> > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > --- > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > mm/khugepaged.c | 10 ++++++---- > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index f3fc1f08d90c..4365b2811269 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > } > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > - unsigned int flags, struct page **pages, int *nr) > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > + unsigned long end, unsigned int flags, > > + struct page **pages, int *nr) > > { > > struct dev_pagemap *pgmap = NULL; > > int nr_start = *nr, ret = 0; > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > goto pte_unmap; > > } > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > + /* > > + * THP collapse conceptually does: > > + * 1. Clear and flush PMD > > + * 2. Check the base page refcount > > + * 3. Copy data to huge page > > + * 4. Clear PTE > > + * 5. Discard the base page > > + * > > + * So fast GUP may race with THP collapse then pin and > > + * return an old page since TLB flush is no longer sufficient > > + * to serialize against fast GUP. > > + * > > + * Check PMD, if it is changed just back off since it > > + * means there may be parallel THP collapse. > > + */ > > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > > + unlikely(pte_val(pte) != pte_val(*ptep))) { > > gup_put_folio(folio, 1, flags); > > goto pte_unmap; > > } > > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > * get_user_pages_fast_only implementation that can pin pages. Thus it's still > > * useful to have gup_huge_pmd even if we can't operate on ptes. > > */ > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > - unsigned int flags, struct page **pages, int *nr) > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > + unsigned long end, unsigned int flags, > > + struct page **pages, int *nr) > > { > > return 0; > > } > > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > > if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, > > PMD_SHIFT, next, flags, pages, nr)) > > return 0; > > - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) > > + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) > > return 0; > > } while (pmdp++, addr = next, addr != end); > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 2d74cf01f694..518b49095db3 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > > /* > > - * After this gup_fast can't run anymore. This also removes > > - * any huge TLB entry from the CPU so we won't allow > > - * huge and small TLB entries for the same virtual address > > - * to avoid the risk of CPU bugs in that area. > > + * This removes any huge TLB entry from the CPU so we won't allow > > + * huge and small TLB entries for the same virtual address to > > + * avoid the risk of CPU bugs in that area. 
> > + * > > + * Parallel fast GUP is fine since fast GUP will back off when > > + * it detects PMD is changed. > > */ > > _pmd = pmdp_collapse_flush(vma, address, pmd); > > spin_unlock(pmd_ptl); > > -- > > 2.26.3 > > > > -- > Peter Xu >
On 02.09.22 01:50, Yang Shi wrote: > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: >> >> Hi, Yang, >> >> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: >>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer >>> sufficient to handle concurrent GUP-fast in all cases, it only handles >>> traditional IPI-based GUP-fast correctly. >> >> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, >> I'm kind of confused why it's not sufficient even if with RCU gup? Isn't >> that'll keep working as long as interrupt disabled (which current fast-gup >> will still do)? > > Actually the wording was copied from David's commit log for his > PageAnonExclusive fix. My understanding is the IPI broadcast still > works, but it may not be supported by all architectures and not > preferred anymore. So we should avoid depending on IPI broadcast IIUC. Right. Not all architectures perform an IPI broadcast on TLB flush. IPI broadcasts will continue working until we use RCU instead of disabling local interrupts in GUP-fast. >>> CPU A CPU B >>> THP collapse fast GUP >>> gup_pmd_range() <-- see valid pmd >>> gup_pte_range() <-- work on pte >>> pmdp_collapse_flush() <-- clear pmd and flush >>> __collapse_huge_page_isolate() >>> check page pinned <-- before GUP bump refcount >>> pin the page >>> check PTE <-- no change >>> __collapse_huge_page_copy() >>> copy data to huge page >>> ptep_clear() >>> install huge pmd for the huge page >>> return the stale page >>> discard the stale page >>> >>> The race could be fixed by checking whether PMD is changed or not after >>> taking the page pin in fast GUP, just like what it does for PTE. If the >>> PMD is changed it means there may be parallel THP collapse, so GUP >>> should back off. >> >> Could the race also be fixed by impl pmdp_collapse_flush() correctly for >> the archs that are missing? Do you know which arch(s) is broken with it? > > Yes, and this was suggested by me in the first place, but per the > suggestion from John and David, this is not the preferred way. I think > it is because: > > Firstly, using IPI to serialize against fast GUP is not recommended > anymore since fast GUP does check PTE then back off so we should avoid > it. > Secondly, if checking PMD then backing off could solve the problem, > why do we still need broadcast IPI? It doesn't sound performant. I'd say, using an IPI is the old-styled way of doing things. Sure, using an IPI broadcast will work (and IMHO it's a lot easier to not-get-wrong). But it somewhat contradicts to the new way of doing things. >> >> It's just not clear to me whether this patch is an optimization or a fix, >> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would >> still be needed. > > It is a fix and the fix will make IPI broadcast not useful anymore. I'd wonder how "easy" adding the IPI broadcast would be -- IOW, if the IPI fix has a real advantage.
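To make this concrete, the interrupt-disabled region that IPI-based serialization leans on is in lockless_pages_from_mm(); the snippet below is a simplified excerpt of the existing mm/gup.c code, not something this patch changes:

	unsigned long flags;
	int nr_pinned = 0;

	/* The whole lockless page table walk runs with local IRQs disabled. */
	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

A TLB flush implemented as a synchronous IPI broadcast cannot complete while any targeted CPU still has interrupts disabled, so pmdp_collapse_flush() on CPU A would wait until CPU B has left the region above. On architectures whose TLB invalidation needs no IPI (and eventually with RCU instead of IRQ disabling, as noted above), nothing waits for CPU B, and the PMD recheck in this patch is what closes the window instead.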
On 02.09.22 00:27, Yang Shi wrote: > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > sufficient to handle concurrent GUP-fast in all cases, it only handles > traditional IPI-based GUP-fast correctly. On architectures that send > an IPI broadcast on TLB flush, it works as expected. But on the > architectures that do not use IPI to broadcast TLB flush, it may have > the below race: > > CPU A CPU B > THP collapse fast GUP > gup_pmd_range() <-- see valid pmd > gup_pte_range() <-- work on pte > pmdp_collapse_flush() <-- clear pmd and flush > __collapse_huge_page_isolate() > check page pinned <-- before GUP bump refcount > pin the page > check PTE <-- no change > __collapse_huge_page_copy() > copy data to huge page > ptep_clear() > install huge pmd for the huge page > return the stale page > discard the stale page > > The race could be fixed by checking whether PMD is changed or not after > taking the page pin in fast GUP, just like what it does for PTE. If the > PMD is changed it means there may be parallel THP collapse, so GUP > should back off. > > Also update the stale comment about serializing against fast GUP in > khugepaged. > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > Signed-off-by: Yang Shi <shy828301@gmail.com> > --- > mm/gup.c | 30 ++++++++++++++++++++++++------ > mm/khugepaged.c | 10 ++++++---- > 2 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index f3fc1f08d90c..4365b2811269 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > } > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > struct dev_pagemap *pgmap = NULL; > int nr_start = *nr, ret = 0; > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > goto pte_unmap; > } > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > + /* > + * THP collapse conceptually does: > + * 1. Clear and flush PMD > + * 2. Check the base page refcount > + * 3. Copy data to huge page > + * 4. Clear PTE > + * 5. Discard the base page > + * > + * So fast GUP may race with THP collapse then pin and > + * return an old page since TLB flush is no longer sufficient > + * to serialize against fast GUP. > + * > + * Check PMD, if it is changed just back off since it > + * means there may be parallel THP collapse. > + */ > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > + unlikely(pte_val(pte) != pte_val(*ptep))) { > gup_put_folio(folio, 1, flags); > goto pte_unmap; > } > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > * get_user_pages_fast_only implementation that can pin pages. Thus it's still > * useful to have gup_huge_pmd even if we can't operate on ptes. 
> */ > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > return 0; > } > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, > PMD_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) > + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) > return 0; > } while (pmdp++, addr = next, addr != end); > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2d74cf01f694..518b49095db3 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > /* > - * After this gup_fast can't run anymore. This also removes > - * any huge TLB entry from the CPU so we won't allow > - * huge and small TLB entries for the same virtual address > - * to avoid the risk of CPU bugs in that area. > + * This removes any huge TLB entry from the CPU so we won't allow > + * huge and small TLB entries for the same virtual address to > + * avoid the risk of CPU bugs in that area. > + * > + * Parallel fast GUP is fine since fast GUP will back off when > + * it detects PMD is changed. > */ > _pmd = pmdp_collapse_flush(vma, address, pmd); > spin_unlock(pmd_ptl); As long as pmdp_collapse_flush() implies a full memory barrier (which I think it does), this should work as expected. Hopefully someone with experience on RCU GUP-fast (Jason, John? :) ) can double-check. To me this sound sane. Acked-by: David Hildenbrand <david@redhat.com>
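For reference, the generic implementation under discussion is an atomic get-and-clear of the PMD followed by the TLB flush; roughly (simplified excerpt of mm/pgtable-generic.c, with the debug assertions dropped):

	pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
				  pmd_t *pmdp)
	{
		pmd_t pmd;

		/* Atomically read and clear the PMD (e.g. an xchg on x86)... */
		pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
		/* ...then flush the TLB for the huge-page-sized range. */
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return pmd;
	}

The barrier question is whether that get-and-clear (plus whatever the architecture's flush does) is guaranteed to order the PMD clear before khugepaged's subsequent refcount checks on every architecture, not just on x86.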
On Thu, Sep 1, 2022 at 11:39 PM David Hildenbrand <david@redhat.com> wrote: > > On 02.09.22 01:50, Yang Shi wrote: > > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > >> > >> Hi, Yang, > >> > >> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > >>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > >>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > >>> sufficient to handle concurrent GUP-fast in all cases, it only handles > >>> traditional IPI-based GUP-fast correctly. > >> > >> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > >> I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > >> that'll keep working as long as interrupt disabled (which current fast-gup > >> will still do)? > > > > Actually the wording was copied from David's commit log for his > > PageAnonExclusive fix. My understanding is the IPI broadcast still > > works, but it may not be supported by all architectures and not > > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > Right. Not all architectures perform an IPI broadcast on TLB flush. > > IPI broadcasts will continue working until we use RCU instead of > disabling local interrupts in GUP-fast. > > > >>> CPU A CPU B > >>> THP collapse fast GUP > >>> gup_pmd_range() <-- see valid pmd > >>> gup_pte_range() <-- work on pte > >>> pmdp_collapse_flush() <-- clear pmd and flush > >>> __collapse_huge_page_isolate() > >>> check page pinned <-- before GUP bump refcount > >>> pin the page > >>> check PTE <-- no change > >>> __collapse_huge_page_copy() > >>> copy data to huge page > >>> ptep_clear() > >>> install huge pmd for the huge page > >>> return the stale page > >>> discard the stale page > >>> > >>> The race could be fixed by checking whether PMD is changed or not after > >>> taking the page pin in fast GUP, just like what it does for PTE. If the > >>> PMD is changed it means there may be parallel THP collapse, so GUP > >>> should back off. > >> > >> Could the race also be fixed by impl pmdp_collapse_flush() correctly for > >> the archs that are missing? Do you know which arch(s) is broken with it? > > > > Yes, and this was suggested by me in the first place, but per the > > suggestion from John and David, this is not the preferred way. I think > > it is because: > > > > Firstly, using IPI to serialize against fast GUP is not recommended > > anymore since fast GUP does check PTE then back off so we should avoid > > it. > > Secondly, if checking PMD then backing off could solve the problem, > > why do we still need broadcast IPI? It doesn't sound performant. > > I'd say, using an IPI is the old-styled way of doing things. Sure, using > an IPI broadcast will work (and IMHO it's a lot easier to > not-get-wrong). But it somewhat contradicts to the new way of doing things. > > >> > >> It's just not clear to me whether this patch is an optimization or a fix, > >> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > >> still be needed. > > > > It is a fix and the fix will make IPI broadcast not useful anymore. > > I'd wonder how "easy" adding the IPI broadcast would be -- IOW, if the > IPI fix has a real advantage. Not sure either, but I guess calling a dummy function via IPI broadcast should just work. Powepc does so. > > > -- > Thanks, > > David / dhildenb >
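The "dummy function via IPI broadcast" trick mentioned here, modelled on powerpc's serialize_against_pte_lookup(), would look roughly like the sketch below; the wrapper name is illustrative, not an existing API:

	static void do_nothing(void *unused)
	{
	}

	/*
	 * Waiting for the broadcast to complete means waiting until every
	 * CPU in the mm's cpumask has interrupts enabled again, i.e. is no
	 * longer inside an IRQ-disabled GUP-fast walk of this mm.
	 */
	static void serialize_against_fast_gup(struct mm_struct *mm)
	{
		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
	}

As the thread concludes, keeping such a broadcast around becomes pointless once fast GUP rechecks the PMD and backs off on its own.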
On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > Hi, Yang, > > > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > > traditional IPI-based GUP-fast correctly. > > > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > > that'll keep working as long as interrupt disabled (which current fast-gup > > will still do)? > > Actually the wording was copied from David's commit log for his > PageAnonExclusive fix. My understanding is the IPI broadcast still > works, but it may not be supported by all architectures and not > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > > > > IIUC the issue is you suspect not all archs correctly implemented > > pmdp_collapse_flush(), or am I wrong? > > This is a possible fix, please see below for details. > > > > > > On architectures that send > > > an IPI broadcast on TLB flush, it works as expected. But on the > > > architectures that do not use IPI to broadcast TLB flush, it may have > > > the below race: > > > > > > CPU A CPU B > > > THP collapse fast GUP > > > gup_pmd_range() <-- see valid pmd > > > gup_pte_range() <-- work on pte > > > pmdp_collapse_flush() <-- clear pmd and flush > > > __collapse_huge_page_isolate() > > > check page pinned <-- before GUP bump refcount > > > pin the page > > > check PTE <-- no change > > > __collapse_huge_page_copy() > > > copy data to huge page > > > ptep_clear() > > > install huge pmd for the huge page > > > return the stale page > > > discard the stale page > > > > > > The race could be fixed by checking whether PMD is changed or not after > > > taking the page pin in fast GUP, just like what it does for PTE. If the > > > PMD is changed it means there may be parallel THP collapse, so GUP > > > should back off. > > > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > > the archs that are missing? Do you know which arch(s) is broken with it? > > Yes, and this was suggested by me in the first place, but per the > suggestion from John and David, this is not the preferred way. I think > it is because: > > Firstly, using IPI to serialize against fast GUP is not recommended > anymore since fast GUP does check PTE then back off so we should avoid > it. > Secondly, if checking PMD then backing off could solve the problem, > why do we still need broadcast IPI? It doesn't sound performant. > > > > > It's just not clear to me whether this patch is an optimization or a fix, > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > > still be needed. > > It is a fix and the fix will make IPI broadcast not useful anymore. How about another patch to remove the ppc impl too? Then it can be a two patches series. So that ppc developers can be copied and maybe it helps to have the ppc people looking at current approach too. Then the last piece of it is the s390 pmdp_collapse_flush(). I'm wondering whether generic pmdp_collapse_flush() would be good enough, since the only addition comparing to the s390 one will be flush_tlb_range() (which is a further __tlb_flush_mm_lazy). David may have some thoughts. 
The patch itself looks good to me, one trivial nit below. > > > > > Thanks, > > > > > > > > Also update the stale comment about serializing against fast GUP in > > > khugepaged. > > > > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > --- > > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > > mm/khugepaged.c | 10 ++++++---- > > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > index f3fc1f08d90c..4365b2811269 100644 > > > --- a/mm/gup.c > > > +++ b/mm/gup.c > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > > } > > > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > - unsigned int flags, struct page **pages, int *nr) > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > > + unsigned long end, unsigned int flags, > > > + struct page **pages, int *nr) > > > { > > > struct dev_pagemap *pgmap = NULL; > > > int nr_start = *nr, ret = 0; > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > goto pte_unmap; > > > } > > > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > > + /* > > > + * THP collapse conceptually does: > > > + * 1. Clear and flush PMD > > > + * 2. Check the base page refcount > > > + * 3. Copy data to huge page > > > + * 4. Clear PTE > > > + * 5. Discard the base page > > > + * > > > + * So fast GUP may race with THP collapse then pin and > > > + * return an old page since TLB flush is no longer sufficient > > > + * to serialize against fast GUP. > > > + * > > > + * Check PMD, if it is changed just back off since it > > > + * means there may be parallel THP collapse. Would you mind rewording this comment a bit? I feel it a bit weird to suddenly mention about thp collapse especially its details. Maybe some statement on the whole history of why check pte, and in what case pmd check is needed (where the thp collapse example can be moved to, imho)? One of my attempt for reference.. /* * Fast-gup relies on pte change detection to avoid * concurrent pgtable operations. * * To pin the page, fast-gup needs to do below in order: * (1) pin the page (by prefetching pte), then (2) check * pte not changed. * * For the rest of pgtable operations where pgtable updates * can be racy with fast-gup, we need to do (1) clear pte, * then (2) check whether page is pinned. * * Above will work for all pte-level operations, including * thp split. * * For thp collapse, it's a bit more complicated because * with RCU pgtable free fast-gup can be walking a pgtable * page that is being freed (so pte is still valid but pmd * can be cleared already). To avoid race in such * condition, we need to also check pmd here to make sure * pmd doesn't change (corresponds to pmdp_collapse_flush() * in the thp collide code path). */ If you agree with the comment change, feel free to add: Acked-by: Peter Xu <peterx@redhat.com> Thanks,
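For the "(1) clear pte, then (2) check whether page is pinned" half of the suggested comment, the check lives in __collapse_huge_page_isolate(), which runs after the caller has already done pmdp_collapse_flush(). A simplified sketch (the real refcount accounting in is_refcount_suitable() also covers mapcount and swap cache references):

	/* (1) The PMD was cleared and flushed by collapse_huge_page(). */

	/*
	 * (2) Any reference beyond the mappings means a concurrent user,
	 *     e.g. a GUP pin taken before the PMD was cleared.
	 */
	if (!is_refcount_suitable(page)) {
		unlock_page(page);
		result = SCAN_PAGE_COUNT;
		goto out;
	}

The ordering is what matters: a pin this check can miss must have been taken after the PMD was cleared, and that fast GUP walker will in turn notice the changed PMD and drop its pin.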
On Fri, Sep 02, 2022 at 11:59:56AM -0400, Peter Xu wrote: > On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > Hi, Yang, > > > > > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > > > traditional IPI-based GUP-fast correctly. > > > > > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > > > that'll keep working as long as interrupt disabled (which current fast-gup > > > will still do)? > > > > Actually the wording was copied from David's commit log for his > > PageAnonExclusive fix. My understanding is the IPI broadcast still > > works, but it may not be supported by all architectures and not > > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > > > > > > > IIUC the issue is you suspect not all archs correctly implemented > > > pmdp_collapse_flush(), or am I wrong? > > > > This is a possible fix, please see below for details. > > > > > > > > > On architectures that send > > > > an IPI broadcast on TLB flush, it works as expected. But on the > > > > architectures that do not use IPI to broadcast TLB flush, it may have > > > > the below race: > > > > > > > > CPU A CPU B > > > > THP collapse fast GUP > > > > gup_pmd_range() <-- see valid pmd > > > > gup_pte_range() <-- work on pte > > > > pmdp_collapse_flush() <-- clear pmd and flush > > > > __collapse_huge_page_isolate() > > > > check page pinned <-- before GUP bump refcount > > > > pin the page > > > > check PTE <-- no change > > > > __collapse_huge_page_copy() > > > > copy data to huge page > > > > ptep_clear() > > > > install huge pmd for the huge page > > > > return the stale page > > > > discard the stale page > > > > > > > > The race could be fixed by checking whether PMD is changed or not after > > > > taking the page pin in fast GUP, just like what it does for PTE. If the > > > > PMD is changed it means there may be parallel THP collapse, so GUP > > > > should back off. > > > > > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > > > the archs that are missing? Do you know which arch(s) is broken with it? > > > > Yes, and this was suggested by me in the first place, but per the > > suggestion from John and David, this is not the preferred way. I think > > it is because: > > > > Firstly, using IPI to serialize against fast GUP is not recommended > > anymore since fast GUP does check PTE then back off so we should avoid > > it. > > Secondly, if checking PMD then backing off could solve the problem, > > why do we still need broadcast IPI? It doesn't sound performant. > > > > > > > > It's just not clear to me whether this patch is an optimization or a fix, > > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > > > still be needed. > > > > It is a fix and the fix will make IPI broadcast not useful anymore. > > How about another patch to remove the ppc impl too? Then it can be a two > patches series. > > So that ppc developers can be copied and maybe it helps to have the ppc > people looking at current approach too. > > Then the last piece of it is the s390 pmdp_collapse_flush(). 
I'm wondering > whether generic pmdp_collapse_flush() would be good enough, since the only > addition comparing to the s390 one will be flush_tlb_range() (which is a > further __tlb_flush_mm_lazy). David may have some thoughts. > > The patch itself looks good to me, one trivial nit below. > > > > > > > > > Thanks, > > > > > > > > > > > Also update the stale comment about serializing against fast GUP in > > > > khugepaged. > > > > > > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > > --- > > > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > > > mm/khugepaged.c | 10 ++++++---- > > > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index f3fc1f08d90c..4365b2811269 100644 > > > > --- a/mm/gup.c > > > > +++ b/mm/gup.c > > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > > > } > > > > > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > - unsigned int flags, struct page **pages, int *nr) > > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > > > + unsigned long end, unsigned int flags, > > > > + struct page **pages, int *nr) > > > > { > > > > struct dev_pagemap *pgmap = NULL; > > > > int nr_start = *nr, ret = 0; > > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > goto pte_unmap; > > > > } > > > > > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > > > + /* > > > > + * THP collapse conceptually does: > > > > + * 1. Clear and flush PMD > > > > + * 2. Check the base page refcount > > > > + * 3. Copy data to huge page > > > > + * 4. Clear PTE > > > > + * 5. Discard the base page > > > > + * > > > > + * So fast GUP may race with THP collapse then pin and > > > > + * return an old page since TLB flush is no longer sufficient > > > > + * to serialize against fast GUP. > > > > + * > > > > + * Check PMD, if it is changed just back off since it > > > > + * means there may be parallel THP collapse. > > Would you mind rewording this comment a bit? I feel it a bit weird to > suddenly mention about thp collapse especially its details. > > Maybe some statement on the whole history of why check pte, and in what > case pmd check is needed (where the thp collapse example can be moved to, > imho)? > > One of my attempt for reference.. > > /* > * Fast-gup relies on pte change detection to avoid > * concurrent pgtable operations. > * > * To pin the page, fast-gup needs to do below in order: > * (1) pin the page (by prefetching pte), then (2) check > * pte not changed. > * > * For the rest of pgtable operations where pgtable updates > * can be racy with fast-gup, we need to do (1) clear pte, > * then (2) check whether page is pinned. > * > * Above will work for all pte-level operations, including > * thp split. Here a slight amendment could be: * Above will work for all pte-level operations, including * thp split, which applies the same logic but only done * all above in the pmd level rather than pte level. To be clearer. > * > * For thp collapse, it's a bit more complicated because > * with RCU pgtable free fast-gup can be walking a pgtable > * page that is being freed (so pte is still valid but pmd > * can be cleared already). 
To avoid race in such > * condition, we need to also check pmd here to make sure > * pmd doesn't change (corresponds to pmdp_collapse_flush() > * in the thp collide code path). > */ > > If you agree with the comment change, feel free to add: > > Acked-by: Peter Xu <peterx@redhat.com> > > Thanks, > > -- > Peter Xu
On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > Hi, Yang, > > > > > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > > > traditional IPI-based GUP-fast correctly. > > > > > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > > > that'll keep working as long as interrupt disabled (which current fast-gup > > > will still do)? > > > > Actually the wording was copied from David's commit log for his > > PageAnonExclusive fix. My understanding is the IPI broadcast still > > works, but it may not be supported by all architectures and not > > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > > > > > > > IIUC the issue is you suspect not all archs correctly implemented > > > pmdp_collapse_flush(), or am I wrong? > > > > This is a possible fix, please see below for details. > > > > > > > > > On architectures that send > > > > an IPI broadcast on TLB flush, it works as expected. But on the > > > > architectures that do not use IPI to broadcast TLB flush, it may have > > > > the below race: > > > > > > > > CPU A CPU B > > > > THP collapse fast GUP > > > > gup_pmd_range() <-- see valid pmd > > > > gup_pte_range() <-- work on pte > > > > pmdp_collapse_flush() <-- clear pmd and flush > > > > __collapse_huge_page_isolate() > > > > check page pinned <-- before GUP bump refcount > > > > pin the page > > > > check PTE <-- no change > > > > __collapse_huge_page_copy() > > > > copy data to huge page > > > > ptep_clear() > > > > install huge pmd for the huge page > > > > return the stale page > > > > discard the stale page > > > > > > > > The race could be fixed by checking whether PMD is changed or not after > > > > taking the page pin in fast GUP, just like what it does for PTE. If the > > > > PMD is changed it means there may be parallel THP collapse, so GUP > > > > should back off. > > > > > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > > > the archs that are missing? Do you know which arch(s) is broken with it? > > > > Yes, and this was suggested by me in the first place, but per the > > suggestion from John and David, this is not the preferred way. I think > > it is because: > > > > Firstly, using IPI to serialize against fast GUP is not recommended > > anymore since fast GUP does check PTE then back off so we should avoid > > it. > > Secondly, if checking PMD then backing off could solve the problem, > > why do we still need broadcast IPI? It doesn't sound performant. > > > > > > > > It's just not clear to me whether this patch is an optimization or a fix, > > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > > > still be needed. > > > > It is a fix and the fix will make IPI broadcast not useful anymore. > > How about another patch to remove the ppc impl too? Then it can be a two > patches series. > > So that ppc developers can be copied and maybe it helps to have the ppc > people looking at current approach too. 
I was thinking about it before sending this patch, but I thought it may be better to wait for ppc developers to confirm some questions. I think Aneesh was copied in another thread. > > Then the last piece of it is the s390 pmdp_collapse_flush(). I'm wondering > whether generic pmdp_collapse_flush() would be good enough, since the only > addition comparing to the s390 one will be flush_tlb_range() (which is a > further __tlb_flush_mm_lazy). David may have some thoughts. > > The patch itself looks good to me, one trivial nit below. > > > > > > > > > Thanks, > > > > > > > > > > > Also update the stale comment about serializing against fast GUP in > > > > khugepaged. > > > > > > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > > --- > > > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > > > mm/khugepaged.c | 10 ++++++---- > > > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index f3fc1f08d90c..4365b2811269 100644 > > > > --- a/mm/gup.c > > > > +++ b/mm/gup.c > > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > > > } > > > > > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > - unsigned int flags, struct page **pages, int *nr) > > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > > > + unsigned long end, unsigned int flags, > > > > + struct page **pages, int *nr) > > > > { > > > > struct dev_pagemap *pgmap = NULL; > > > > int nr_start = *nr, ret = 0; > > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > goto pte_unmap; > > > > } > > > > > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > > > + /* > > > > + * THP collapse conceptually does: > > > > + * 1. Clear and flush PMD > > > > + * 2. Check the base page refcount > > > > + * 3. Copy data to huge page > > > > + * 4. Clear PTE > > > > + * 5. Discard the base page > > > > + * > > > > + * So fast GUP may race with THP collapse then pin and > > > > + * return an old page since TLB flush is no longer sufficient > > > > + * to serialize against fast GUP. > > > > + * > > > > + * Check PMD, if it is changed just back off since it > > > > + * means there may be parallel THP collapse. > > Would you mind rewording this comment a bit? I feel it a bit weird to > suddenly mention about thp collapse especially its details. > > Maybe some statement on the whole history of why check pte, and in what > case pmd check is needed (where the thp collapse example can be moved to, > imho)? > > One of my attempt for reference.. > > /* > * Fast-gup relies on pte change detection to avoid > * concurrent pgtable operations. > * > * To pin the page, fast-gup needs to do below in order: > * (1) pin the page (by prefetching pte), then (2) check > * pte not changed. > * > * For the rest of pgtable operations where pgtable updates > * can be racy with fast-gup, we need to do (1) clear pte, > * then (2) check whether page is pinned. > * > * Above will work for all pte-level operations, including > * thp split. > * > * For thp collapse, it's a bit more complicated because > * with RCU pgtable free fast-gup can be walking a pgtable > * page that is being freed (so pte is still valid but pmd > * can be cleared already). 
To avoid race in such > * condition, we need to also check pmd here to make sure > * pmd doesn't change (corresponds to pmdp_collapse_flush() > * in the thp collide code path). > */ > > If you agree with the comment change, feel free to add: Thanks a lot, it looks good to me, will update the patch to incorporate it. > > Acked-by: Peter Xu <peterx@redhat.com> > > Thanks, > > -- > Peter Xu >
On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > Hi, Yang, > > > > > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > > > traditional IPI-based GUP-fast correctly. > > > > > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > > > that'll keep working as long as interrupt disabled (which current fast-gup > > > will still do)? > > > > Actually the wording was copied from David's commit log for his > > PageAnonExclusive fix. My understanding is the IPI broadcast still > > works, but it may not be supported by all architectures and not > > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > > > > > > > > IIUC the issue is you suspect not all archs correctly implemented > > > pmdp_collapse_flush(), or am I wrong? > > > > This is a possible fix, please see below for details. > > > > > > > > > On architectures that send > > > > an IPI broadcast on TLB flush, it works as expected. But on the > > > > architectures that do not use IPI to broadcast TLB flush, it may have > > > > the below race: > > > > > > > > CPU A CPU B > > > > THP collapse fast GUP > > > > gup_pmd_range() <-- see valid pmd > > > > gup_pte_range() <-- work on pte > > > > pmdp_collapse_flush() <-- clear pmd and flush > > > > __collapse_huge_page_isolate() > > > > check page pinned <-- before GUP bump refcount > > > > pin the page > > > > check PTE <-- no change > > > > __collapse_huge_page_copy() > > > > copy data to huge page > > > > ptep_clear() > > > > install huge pmd for the huge page > > > > return the stale page > > > > discard the stale page > > > > > > > > The race could be fixed by checking whether PMD is changed or not after > > > > taking the page pin in fast GUP, just like what it does for PTE. If the > > > > PMD is changed it means there may be parallel THP collapse, so GUP > > > > should back off. > > > > > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > > > the archs that are missing? Do you know which arch(s) is broken with it? > > > > Yes, and this was suggested by me in the first place, but per the > > suggestion from John and David, this is not the preferred way. I think > > it is because: > > > > Firstly, using IPI to serialize against fast GUP is not recommended > > anymore since fast GUP does check PTE then back off so we should avoid > > it. > > Secondly, if checking PMD then backing off could solve the problem, > > why do we still need broadcast IPI? It doesn't sound performant. > > > > > > > > It's just not clear to me whether this patch is an optimization or a fix, > > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > > > still be needed. > > > > It is a fix and the fix will make IPI broadcast not useful anymore. > > How about another patch to remove the ppc impl too? Then it can be a two > patches series. BTW, I don't think we could remove the ppc implementation since it is different from the generic pmdp_collapse_flush(), particularly for the hash part IIUC. 
The generic version calls flush_tlb_range() -> hash__flush_tlb_range() for hash, but the hash call is actually no-op. The ppc version calls hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does something useful. But, as I mentioned in another thread, we should be able to remove the IPI call, which just calls a dummy function. > > So that ppc developers can be copied and maybe it helps to have the ppc > people looking at current approach too. > > Then the last piece of it is the s390 pmdp_collapse_flush(). I'm wondering > whether generic pmdp_collapse_flush() would be good enough, since the only > addition comparing to the s390 one will be flush_tlb_range() (which is a > further __tlb_flush_mm_lazy). David may have some thoughts. > > The patch itself looks good to me, one trivial nit below. > > > > > > > > > Thanks, > > > > > > > > > > > Also update the stale comment about serializing against fast GUP in > > > > khugepaged. > > > > > > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > > > --- > > > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > > > mm/khugepaged.c | 10 ++++++---- > > > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index f3fc1f08d90c..4365b2811269 100644 > > > > --- a/mm/gup.c > > > > +++ b/mm/gup.c > > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > > > } > > > > > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > - unsigned int flags, struct page **pages, int *nr) > > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > > > + unsigned long end, unsigned int flags, > > > > + struct page **pages, int *nr) > > > > { > > > > struct dev_pagemap *pgmap = NULL; > > > > int nr_start = *nr, ret = 0; > > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > > > goto pte_unmap; > > > > } > > > > > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > > > + /* > > > > + * THP collapse conceptually does: > > > > + * 1. Clear and flush PMD > > > > + * 2. Check the base page refcount > > > > + * 3. Copy data to huge page > > > > + * 4. Clear PTE > > > > + * 5. Discard the base page > > > > + * > > > > + * So fast GUP may race with THP collapse then pin and > > > > + * return an old page since TLB flush is no longer sufficient > > > > + * to serialize against fast GUP. > > > > + * > > > > + * Check PMD, if it is changed just back off since it > > > > + * means there may be parallel THP collapse. > > Would you mind rewording this comment a bit? I feel it a bit weird to > suddenly mention about thp collapse especially its details. > > Maybe some statement on the whole history of why check pte, and in what > case pmd check is needed (where the thp collapse example can be moved to, > imho)? > > One of my attempt for reference.. > > /* > * Fast-gup relies on pte change detection to avoid > * concurrent pgtable operations. > * > * To pin the page, fast-gup needs to do below in order: > * (1) pin the page (by prefetching pte), then (2) check > * pte not changed. > * > * For the rest of pgtable operations where pgtable updates > * can be racy with fast-gup, we need to do (1) clear pte, > * then (2) check whether page is pinned. > * > * Above will work for all pte-level operations, including > * thp split. 
> * > * For thp collapse, it's a bit more complicated because > * with RCU pgtable free fast-gup can be walking a pgtable > * page that is being freed (so pte is still valid but pmd > * can be cleared already). To avoid race in such > * condition, we need to also check pmd here to make sure > * pmd doesn't change (corresponds to pmdp_collapse_flush() > * in the thp collide code path). > */ > > If you agree with the comment change, feel free to add: > > Acked-by: Peter Xu <peterx@redhat.com> > > Thanks, > > -- > Peter Xu >
On Fri, Sep 02, 2022 at 10:45:20AM -0700, Yang Shi wrote:
> > How about another patch to remove the ppc impl too? Then it can be a two
> > patches series.
>
> BTW, I don't think we could remove the ppc implementation since it is
> different from the generic pmdp_collapse_flush(), particularly for the
> hash part IIUC.
>
> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
> for hash, but the hash call is actually no-op. The ppc version calls
> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
> something useful.

One thing I found interesting (and also a bit confusing..) is that the ppc
code used the name flush_tlb_pmd_range() to "flush tlb range in pte level",
which is kind of against the tlb API design.

The generic tlb API has a very close function called flush_pmd_tlb_range()
which is only used to do pmd-level flushing, while here the ppc version of
flush_tlb_pmd_range() is actually flush_tlb_range() in the generic API.

Agreed that it may be worth having the ppc developers take a look.
On 9/2/22 08:59, Peter Xu wrote:
...
> Would you mind rewording this comment a bit? I feel it a bit weird to
> suddenly mention about thp collapse especially its details.
>
> Maybe some statement on the whole history of why check pte, and in what
> case pmd check is needed (where the thp collapse example can be moved to,
> imho)?
>
> One of my attempt for reference..
>
> /*
> * Fast-gup relies on pte change detection to avoid
> * concurrent pgtable operations.
> *
> * To pin the page, fast-gup needs to do below in order:
> * (1) pin the page (by prefetching pte), then (2) check
> * pte not changed.
> *
> * For the rest of pgtable operations where pgtable updates
> * can be racy with fast-gup, we need to do (1) clear pte,
> * then (2) check whether page is pinned.
> *
> * Above will work for all pte-level operations, including
> * thp split.
> *
> * For thp collapse, it's a bit more complicated because
> * with RCU pgtable free fast-gup can be walking a pgtable
> * page that is being freed (so pte is still valid but pmd
> * can be cleared already). To avoid race in such
> * condition, we need to also check pmd here to make sure
> * pmd doesn't change (corresponds to pmdp_collapse_flush()
> * in the thp collide code path).
> */
>
> If you agree with the comment change, feel free to add:
>
> Acked-by: Peter Xu <peterx@redhat.com>
>

This seems fine, but I'd like to additionally request that we move it to
function-level documentation, because it expands the length of the
function, which previously fit neatly on my screen. So I think it's time
to move it up a level.

thanks,
On 9/1/22 15:27, Yang Shi wrote: > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > sufficient to handle concurrent GUP-fast in all cases, it only handles > traditional IPI-based GUP-fast correctly. On architectures that send > an IPI broadcast on TLB flush, it works as expected. But on the > architectures that do not use IPI to broadcast TLB flush, it may have > the below race: > > CPU A CPU B > THP collapse fast GUP > gup_pmd_range() <-- see valid pmd > gup_pte_range() <-- work on pte > pmdp_collapse_flush() <-- clear pmd and flush > __collapse_huge_page_isolate() > check page pinned <-- before GUP bump refcount > pin the page > check PTE <-- no change > __collapse_huge_page_copy() > copy data to huge page > ptep_clear() > install huge pmd for the huge page > return the stale page > discard the stale page Hi Yang, Thanks for taking the trouble to write down these notes. I always forget which race we are dealing with, and this is a great help. :) More... > > The race could be fixed by checking whether PMD is changed or not after > taking the page pin in fast GUP, just like what it does for PTE. If the > PMD is changed it means there may be parallel THP collapse, so GUP > should back off. > > Also update the stale comment about serializing against fast GUP in > khugepaged. > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > Signed-off-by: Yang Shi <shy828301@gmail.com> > --- > mm/gup.c | 30 ++++++++++++++++++++++++------ > mm/khugepaged.c | 10 ++++++---- > 2 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index f3fc1f08d90c..4365b2811269 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > } > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > struct dev_pagemap *pgmap = NULL; > int nr_start = *nr, ret = 0; > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > goto pte_unmap; > } > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > + /* > + * THP collapse conceptually does: > + * 1. Clear and flush PMD > + * 2. Check the base page refcount > + * 3. Copy data to huge page > + * 4. Clear PTE > + * 5. Discard the base page > + * > + * So fast GUP may race with THP collapse then pin and > + * return an old page since TLB flush is no longer sufficient > + * to serialize against fast GUP. > + * > + * Check PMD, if it is changed just back off since it > + * means there may be parallel THP collapse. > + */ As I mentioned in the other thread, it would be a nice touch to move such discussion into the comment header. > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > + unlikely(pte_val(pte) != pte_val(*ptep))) { That should be READ_ONCE() for the *pmdp and *ptep reads. Because this whole lockless house of cards may fall apart if we try reading the page table values without READ_ONCE(). That's a rather vague statement, and in fact, the READ_ONCE() should be paired with a page table write somewhere else, to make that claim more precise. 
> gup_put_folio(folio, 1, flags); > goto pte_unmap; > } > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > * get_user_pages_fast_only implementation that can pin pages. Thus it's still > * useful to have gup_huge_pmd even if we can't operate on ptes. > */ > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > - unsigned int flags, struct page **pages, int *nr) > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > + unsigned long end, unsigned int flags, > + struct page **pages, int *nr) > { > return 0; > } > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, > PMD_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) > + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) > return 0; > } while (pmdp++, addr = next, addr != end); > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2d74cf01f694..518b49095db3 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > /* > - * After this gup_fast can't run anymore. This also removes > - * any huge TLB entry from the CPU so we won't allow > - * huge and small TLB entries for the same virtual address > - * to avoid the risk of CPU bugs in that area. > + * This removes any huge TLB entry from the CPU so we won't allow > + * huge and small TLB entries for the same virtual address to > + * avoid the risk of CPU bugs in that area. > + * > + * Parallel fast GUP is fine since fast GUP will back off when > + * it detects PMD is changed. > */ > _pmd = pmdp_collapse_flush(vma, address, pmd); To follow up on David Hildenbrand's note about this in the nearby thread... I'm also not sure if pmdp_collapse_flush() implies a memory barrier on all arches. It definitely does do an atomic op with a return value on x86, but that's just one arch. thanks,
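To make the READ_ONCE() suggestion concrete, the re-check would become something like the sketch below (pmdp being the parameter added by this patch); whether the explicit annotation is actually needed is what the rest of the thread debates, and note that on 32-bit configurations with 64-bit ptes a plain READ_ONCE() of *ptep can still tear, which is what ptep_get_lockless() is for:

	/* Sketch only: annotated variant of the check added by this patch. */
	if (unlikely(pmd_val(pmd) != pmd_val(READ_ONCE(*pmdp))) ||
	    unlikely(pte_val(pte) != pte_val(READ_ONCE(*ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}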
On 05.09.22 00:29, John Hubbard wrote: > On 9/1/22 15:27, Yang Shi wrote: >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer >> sufficient to handle concurrent GUP-fast in all cases, it only handles >> traditional IPI-based GUP-fast correctly. On architectures that send >> an IPI broadcast on TLB flush, it works as expected. But on the >> architectures that do not use IPI to broadcast TLB flush, it may have >> the below race: >> >> CPU A CPU B >> THP collapse fast GUP >> gup_pmd_range() <-- see valid pmd >> gup_pte_range() <-- work on pte >> pmdp_collapse_flush() <-- clear pmd and flush >> __collapse_huge_page_isolate() >> check page pinned <-- before GUP bump refcount >> pin the page >> check PTE <-- no change >> __collapse_huge_page_copy() >> copy data to huge page >> ptep_clear() >> install huge pmd for the huge page >> return the stale page >> discard the stale page > > Hi Yang, > > Thanks for taking the trouble to write down these notes. I always > forget which race we are dealing with, and this is a great help. :) > > More... > >> >> The race could be fixed by checking whether PMD is changed or not after >> taking the page pin in fast GUP, just like what it does for PTE. If the >> PMD is changed it means there may be parallel THP collapse, so GUP >> should back off. >> >> Also update the stale comment about serializing against fast GUP in >> khugepaged. >> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") >> Signed-off-by: Yang Shi <shy828301@gmail.com> >> --- >> mm/gup.c | 30 ++++++++++++++++++++++++------ >> mm/khugepaged.c | 10 ++++++---- >> 2 files changed, 30 insertions(+), 10 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index f3fc1f08d90c..4365b2811269 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, >> } >> >> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> - unsigned int flags, struct page **pages, int *nr) >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >> + unsigned long end, unsigned int flags, >> + struct page **pages, int *nr) >> { >> struct dev_pagemap *pgmap = NULL; >> int nr_start = *nr, ret = 0; >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> goto pte_unmap; >> } >> >> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { >> + /* >> + * THP collapse conceptually does: >> + * 1. Clear and flush PMD >> + * 2. Check the base page refcount >> + * 3. Copy data to huge page >> + * 4. Clear PTE >> + * 5. Discard the base page >> + * >> + * So fast GUP may race with THP collapse then pin and >> + * return an old page since TLB flush is no longer sufficient >> + * to serialize against fast GUP. >> + * >> + * Check PMD, if it is changed just back off since it >> + * means there may be parallel THP collapse. >> + */ > > As I mentioned in the other thread, it would be a nice touch to move > such discussion into the comment header. > >> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >> + unlikely(pte_val(pte) != pte_val(*ptep))) { > > > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this > whole lockless house of cards may fall apart if we try reading the > page table values without READ_ONCE(). 
I came to the conclusion that the implicit memory barrier when grabbing a reference on the page is sufficient such that we don't need READ_ONCE here. If we still intend to change that code, we should fix up all GUP-fast functions in a similar way. But again, I don't think we need a change here. >> - * After this gup_fast can't run anymore. This also removes >> - * any huge TLB entry from the CPU so we won't allow >> - * huge and small TLB entries for the same virtual address >> - * to avoid the risk of CPU bugs in that area. >> + * This removes any huge TLB entry from the CPU so we won't allow >> + * huge and small TLB entries for the same virtual address to >> + * avoid the risk of CPU bugs in that area. >> + * >> + * Parallel fast GUP is fine since fast GUP will back off when >> + * it detects PMD is changed. >> */ >> _pmd = pmdp_collapse_flush(vma, address, pmd); > > To follow up on David Hildenbrand's note about this in the nearby thread... > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on > all arches. It definitely does do an atomic op with a return value on x86, > but that's just one arch. > I think a ptep/pmdp clear + TLB flush really has to imply a memory barrier, otherwise TLB flushing code might easily mess up with surrounding code. But we had better double-check. s390x executes an IDTE instruction, which performs serialization (-> memory barrier). arm64 seems to use DSB instructions to enforce memory ordering.
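David's argument can be modeled as a small store-buffering litmus test; below is a userspace sketch using C11 atomics as stand-ins (the kernel relies on the barrier implied by the refcount RMW and by the TLB flush rather than on C11 seq_cst, so treat this purely as an illustration of why at least one side must notice the other):

#include <stdatomic.h>

static atomic_long refcount;		/* stands in for the folio refcount */
static atomic_long pmdval = 1;		/* stands in for *pmdp */

static int gup_side(void)
{
	long snap = atomic_load(&pmdval);	/* gup_pmd_range() saw a valid pmd */

	atomic_fetch_add(&refcount, 1);		/* pin the page: fully ordered RMW */

	if (atomic_load(&pmdval) != snap) {	/* re-check after the pin */
		atomic_fetch_sub(&refcount, 1);
		return 0;			/* back off */
	}
	return 1;				/* pin stands */
}

static int collapse_side(void)
{
	atomic_store(&pmdval, 0);		/* pmdp_collapse_flush(): clear pmd */

	return atomic_load(&refcount) == 0;	/* only collapse if unpinned */
}

With both sides fully ordered, the outcome where gup_side() keeps a stale pin while collapse_side() simultaneously sees a zero refcount is forbidden, which is exactly the property the PMD re-check is buying.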
Yang Shi <shy828301@gmail.com> writes: > > On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: >> >> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: >> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: >> > > >> > > Hi, Yang, >> > > >> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: >> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer >> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles >> > > > traditional IPI-based GUP-fast correctly. >> > > >> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, >> > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't >> > > that'll keep working as long as interrupt disabled (which current fast-gup >> > > will still do)? >> > >> > Actually the wording was copied from David's commit log for his >> > PageAnonExclusive fix. My understanding is the IPI broadcast still >> > works, but it may not be supported by all architectures and not >> > preferred anymore. So we should avoid depending on IPI broadcast IIUC. >> > >> > > >> > > IIUC the issue is you suspect not all archs correctly implemented >> > > pmdp_collapse_flush(), or am I wrong? >> > >> > This is a possible fix, please see below for details. >> > >> > > >> > > > On architectures that send >> > > > an IPI broadcast on TLB flush, it works as expected. But on the >> > > > architectures that do not use IPI to broadcast TLB flush, it may have >> > > > the below race: >> > > > >> > > > CPU A CPU B >> > > > THP collapse fast GUP >> > > > gup_pmd_range() <-- see valid pmd >> > > > gup_pte_range() <-- work on pte >> > > > pmdp_collapse_flush() <-- clear pmd and flush >> > > > __collapse_huge_page_isolate() >> > > > check page pinned <-- before GUP bump refcount >> > > > pin the page >> > > > check PTE <-- no change >> > > > __collapse_huge_page_copy() >> > > > copy data to huge page >> > > > ptep_clear() >> > > > install huge pmd for the huge page >> > > > return the stale page >> > > > discard the stale page >> > > > >> > > > The race could be fixed by checking whether PMD is changed or not after >> > > > taking the page pin in fast GUP, just like what it does for PTE. If the >> > > > PMD is changed it means there may be parallel THP collapse, so GUP >> > > > should back off. >> > > >> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for >> > > the archs that are missing? Do you know which arch(s) is broken with it? >> > >> > Yes, and this was suggested by me in the first place, but per the >> > suggestion from John and David, this is not the preferred way. I think >> > it is because: >> > >> > Firstly, using IPI to serialize against fast GUP is not recommended >> > anymore since fast GUP does check PTE then back off so we should avoid >> > it. >> > Secondly, if checking PMD then backing off could solve the problem, >> > why do we still need broadcast IPI? It doesn't sound performant. >> > >> > > >> > > It's just not clear to me whether this patch is an optimization or a fix, >> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would >> > > still be needed. >> > >> > It is a fix and the fix will make IPI broadcast not useful anymore. >> >> How about another patch to remove the ppc impl too? Then it can be a two >> patches series. 
> > BTW, I don't think we could remove the ppc implementation since it is > different from the generic pmdp_collapse_flush(), particularly for the > hash part IIUC. > > The generic version calls flush_tlb_range() -> hash__flush_tlb_range() > for hash, but the hash call is actually a no-op. The ppc version calls > hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does > something useful. > We should actually rename flush_tlb_pmd_range(). It actually flushes the hash page table entries. I will do the below patch for ppc64 to clarify this better: diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h index 8b762f282190..fd30fa20c392 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start, struct mmu_gather; extern void hash__tlb_flush(struct mmu_gather *tlb); -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); #ifdef CONFIG_PPC_64S_HASH_MMU /* Private function for use by PCI IO mapping code */ extern void __flush_hash_table_range(unsigned long start, unsigned long end); -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr); +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr); #else static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { } #endif diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c index ae008b9df0e6..f30131933a01 100644 --- a/arch/powerpc/mm/book3s64/hash_pgtable.c +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres * the __collapse_huge_page_copy can result in copying * the old content. */ - flush_tlb_pmd_range(vma->vm_mm, &pmd, address); + flush_hash_table_pmd_range(vma->vm_mm, &pmd, address); return pmd; } diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c index eb0bccaf221e..a64ea0a7ef96 100644 --- a/arch/powerpc/mm/book3s64/hash_tlb.c +++ b/arch/powerpc/mm/book3s64/hash_tlb.c @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) local_irq_restore(flags); } -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) { pte_t *pte; pte_t *start_pte;
Peter Xu <peterx@redhat.com> writes: > > On Fri, Sep 02, 2022 at 10:45:20AM -0700, Yang Shi wrote: >> > How about another patch to remove the ppc impl too? Then it can be a two >> > patches series. >> >> BTW, I don't think we could remove the ppc implementation since it is >> different from the generic pmdp_collapse_flush(), particularly for the >> hash part IIUC. >> >> The generic version calls flush_tlb_range() -> hash__flush_tlb_range() >> for hash, but the hash call is actually no-op. The ppc version calls >> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does >> something useful. > > One thing I found interesting (and also a bit confusing..) is that the ppc > code used the name flush_tlb_pmd_range() to "flush a tlb range at pte level", > which is kind of against the tlb API design.. > > The generic tlb API has a very close function called flush_pmd_tlb_range() > which is only used to do pmd-level flushing, while here the ppc version of > flush_tlb_pmd_range() is actually flush_tlb_range() in the generic API. > > Agreed that it may be worth having a look from ppc developers. > It is actually flushing the hash page table entries. I will rename flush_tlb_pmd_range to flush_hash_table_pmd_range(). -aneesh
On 9/5/2022 6:29 AM, John Hubbard wrote: > On 9/1/22 15:27, Yang Shi wrote: >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer >> sufficient to handle concurrent GUP-fast in all cases, it only handles >> traditional IPI-based GUP-fast correctly. On architectures that send >> an IPI broadcast on TLB flush, it works as expected. But on the >> architectures that do not use IPI to broadcast TLB flush, it may have >> the below race: >> >> CPU A CPU B >> THP collapse fast GUP >> gup_pmd_range() <-- see valid pmd >> gup_pte_range() <-- work on pte >> pmdp_collapse_flush() <-- clear pmd and flush >> __collapse_huge_page_isolate() >> check page pinned <-- before GUP bump refcount >> pin the page >> check PTE <-- no change >> __collapse_huge_page_copy() >> copy data to huge page >> ptep_clear() >> install huge pmd for the huge page >> return the stale page >> discard the stale page > > Hi Yang, > > Thanks for taking the trouble to write down these notes. I always > forget which race we are dealing with, and this is a great help. :) > > More... > >> >> The race could be fixed by checking whether PMD is changed or not after >> taking the page pin in fast GUP, just like what it does for PTE. If the >> PMD is changed it means there may be parallel THP collapse, so GUP >> should back off. >> >> Also update the stale comment about serializing against fast GUP in >> khugepaged. >> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") >> Signed-off-by: Yang Shi <shy828301@gmail.com> >> --- >> mm/gup.c | 30 ++++++++++++++++++++++++------ >> mm/khugepaged.c | 10 ++++++---- >> 2 files changed, 30 insertions(+), 10 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index f3fc1f08d90c..4365b2811269 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, >> } >> >> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> - unsigned int flags, struct page **pages, int *nr) >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >> + unsigned long end, unsigned int flags, >> + struct page **pages, int *nr) >> { >> struct dev_pagemap *pgmap = NULL; >> int nr_start = *nr, ret = 0; >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> goto pte_unmap; >> } >> >> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { >> + /* >> + * THP collapse conceptually does: >> + * 1. Clear and flush PMD >> + * 2. Check the base page refcount >> + * 3. Copy data to huge page >> + * 4. Clear PTE >> + * 5. Discard the base page >> + * >> + * So fast GUP may race with THP collapse then pin and >> + * return an old page since TLB flush is no longer sufficient >> + * to serialize against fast GUP. >> + * >> + * Check PMD, if it is changed just back off since it >> + * means there may be parallel THP collapse. >> + */ > > As I mentioned in the other thread, it would be a nice touch to move > such discussion into the comment header. > >> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >> + unlikely(pte_val(pte) != pte_val(*ptep))) { > > > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this > whole lockless house of cards may fall apart if we try reading the > page table values without READ_ONCE(). 
> > That's a rather vague statement, and in fact, the READ_ONCE() should > be paired with a page table write somewhere else, to make that claim > more precise. Agree. We also talked about using READ_ONCE() for *ptep (or using ptep_get_lockless()) before in a concurrent scenario of GUP-fast and migration [1]. >>> CPU 0: CPU 1: >>> get_user_pages_fast() >>> lockless_pages_from_mm() >>> local_irq_save() >>> gup_pgd_range() >>> gup_p4d_range() >>> gup_pud_range() >>> gup_pmd_range() >>> gup_pte_range() >>> pte_t pte = ptep_get_lockless(ptep); >>> migrate_vma_collect_pmd() >>> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl) >>> ptep_get_and_clear(mm, addr, ptep); >>> page = pte_page(pte); >>> set_pte_at(mm, addr, ptep, swp_pte); >>> migrate_page_move_mapping() >>> head = try_grab_compound_head(page, 1, flags); [1] https://lore.kernel.org/all/7ec1d098-0021-ae82-8d73-dd9c2eb80dab@linux.alibaba.com/ >> gup_put_folio(folio, 1, flags); >> goto pte_unmap; >> } >> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> * get_user_pages_fast_only implementation that can pin pages. Thus it's still >> * useful to have gup_huge_pmd even if we can't operate on ptes. >> */ >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >> - unsigned int flags, struct page **pages, int *nr) >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >> + unsigned long end, unsigned int flags, >> + struct page **pages, int *nr) >> { >> return 0; >> } >> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo >> if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, >> PMD_SHIFT, next, flags, pages, nr)) >> return 0; >> - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) >> + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) >> return 0; >> } while (pmdp++, addr = next, addr != end); >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 2d74cf01f694..518b49095db3 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, >> >> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ >> /* >> - * After this gup_fast can't run anymore. This also removes >> - * any huge TLB entry from the CPU so we won't allow >> - * huge and small TLB entries for the same virtual address >> - * to avoid the risk of CPU bugs in that area. >> + * This removes any huge TLB entry from the CPU so we won't allow >> + * huge and small TLB entries for the same virtual address to >> + * avoid the risk of CPU bugs in that area. >> + * >> + * Parallel fast GUP is fine since fast GUP will back off when >> + * it detects PMD is changed. >> */ >> _pmd = pmdp_collapse_flush(vma, address, pmd); > > To follow up on David Hildenbrand's note about this in the nearby thread... > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on > all arches. It definitely does do an atomic op with a return value on x86, > but that's just one arch. > > > thanks, >
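For readers following along, ptep_get_lockless() mentioned above is the helper that makes the *ptep read safe even when a pte is wider than a native word (e.g. 32-bit kernels with 64-bit ptes). Paraphrased from memory, the two-word variant looks roughly like the sketch below, so treat field names and barrier placement as approximate rather than authoritative:

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	pte_t pte;

	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

On 64-bit architectures the helper degenerates to a single READ_ONCE()-style load, which is why the discussion in this thread is mostly about ordering rather than tearing.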
On 9/5/2022 3:59 PM, David Hildenbrand wrote: > On 05.09.22 00:29, John Hubbard wrote: >> On 9/1/22 15:27, Yang Shi wrote: >>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>> longer >>> sufficient to handle concurrent GUP-fast in all cases, it only handles >>> traditional IPI-based GUP-fast correctly. On architectures that send >>> an IPI broadcast on TLB flush, it works as expected. But on the >>> architectures that do not use IPI to broadcast TLB flush, it may have >>> the below race: >>> >>>    CPU A                                         CPU B >>> THP collapse                                    fast GUP >>>                                               gup_pmd_range() <-- >>> see valid pmd >>>                                                   gup_pte_range() >>> <-- work on pte >>> pmdp_collapse_flush() <-- clear pmd and flush >>> __collapse_huge_page_isolate() >>>     check page pinned <-- before GUP bump refcount >>>                                                       pin the page >>>                                                       check PTE <-- >>> no change >>> __collapse_huge_page_copy() >>>     copy data to huge page >>>     ptep_clear() >>> install huge pmd for the huge page >>>                                                       return the >>> stale page >>> discard the stale page >> >> Hi Yang, >> >> Thanks for taking the trouble to write down these notes. I always >> forget which race we are dealing with, and this is a great help. :) >> >> More... >> >>> >>> The race could be fixed by checking whether PMD is changed or not after >>> taking the page pin in fast GUP, just like what it does for PTE. If the >>> PMD is changed it means there may be parallel THP collapse, so GUP >>> should back off. >>> >>> Also update the stale comment about serializing against fast GUP in >>> khugepaged. >>> >>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>> get_user_pages_fast()") >>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>> --- >>>  mm/gup.c       | 30 ++++++++++++++++++++++++------ >>>  mm/khugepaged.c | 10 ++++++---- >>>  2 files changed, 30 insertions(+), 10 deletions(-) >>> >>> diff --git a/mm/gup.c b/mm/gup.c >>> index f3fc1f08d90c..4365b2811269 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int >>> *nr, int nr_start, >>>  } >>>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>> long end, >>> -            unsigned int flags, struct page **pages, int *nr) >>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>> +            unsigned long end, unsigned int flags, >>> +            struct page **pages, int *nr) >>>  { >>>      struct dev_pagemap *pgmap = NULL; >>>      int nr_start = *nr, ret = 0; >>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>> long addr, unsigned long end, >>>              goto pte_unmap; >>>          } >>> -       if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>> +       /* >>> +        * THP collapse conceptually does: >>> +        *  1. Clear and flush PMD >>> +        *  2. Check the base page refcount >>> +        *  3. Copy data to huge page >>> +        *  4. Clear PTE >>> +        *  5. 
Discard the base page >>> + * >>> + * So fast GUP may race with THP collapse then pin and >>> + * return an old page since TLB flush is no longer sufficient >>> + * to serialize against fast GUP. >>> + * >>> + * Check PMD, if it is changed just back off since it >>> + * means there may be parallel THP collapse. >>> + */ >> >> As I mentioned in the other thread, it would be a nice touch to move >> such discussion into the comment header. >> >>> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>> + unlikely(pte_val(pte) != pte_val(*ptep))) { >> >> >> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >> whole lockless house of cards may fall apart if we try reading the >> page table values without READ_ONCE(). > > I came to the conclusion that the implicit memory barrier when grabbing > a reference on the page is sufficient such that we don't need READ_ONCE > here. IMHO the compiler may optimize the code 'pte_val(*ptep)' so that the value is always taken from a register, and then we can get an old value if another thread did set_pte(). I am not sure how the implicit memory barrier can prevent that compiler optimization. Please correct me if I missed something. > If we still intend to change that code, we should fixup all GUP-fast > functions in a similar way. But again, I don't think we need a change here. > > >>> - * After this gup_fast can't run anymore. This also removes >>> - * any huge TLB entry from the CPU so we won't allow >>> - * huge and small TLB entries for the same virtual address >>> - * to avoid the risk of CPU bugs in that area. >>> + * This removes any huge TLB entry from the CPU so we won't allow >>> + * huge and small TLB entries for the same virtual address to >>> + * avoid the risk of CPU bugs in that area. >>> + * >>> + * Parallel fast GUP is fine since fast GUP will back off when >>> + * it detects PMD is changed. >>> */ >>> _pmd = pmdp_collapse_flush(vma, address, pmd); >> >> To follow up on David Hildenbrand's note about this in the nearby >> thread... >> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on >> all arches. It definitely does do an atomic op with a return value on >> x86, >> but that's just one arch. >> > > I think a ptep/pmdp clear + TLB flush really has to imply a memory > barrier, otherwise TLB flushing code might easily mess up with > surrounding code. But we should better double-check. > > s390x executes an IDTE instruction, which performs serialization (-> > memory barrier). arm64 seems to use DSB instructions to enforce memory > ordering. >
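To make the register-caching concern concrete, here is a minimal, self-contained illustration (deliberately not mm/ code; fake_pte simply stands in for *ptep). Without any compiler barrier the second load may be folded into the first; with a "memory"-clobbering barrier in between, the compiler has to reload from memory:

#define barrier()	__asm__ __volatile__("" ::: "memory")

unsigned long fake_pte;			/* stands in for *ptep */

int reread_without_barrier(void)
{
	unsigned long a = fake_pte;
	unsigned long b = fake_pte;	/* may be compiled as b = a */

	return a == b;			/* can be constant-folded to 1 */
}

int reread_with_barrier(void)
{
	unsigned long a = fake_pte;

	barrier();			/* "memory" clobber: cached values are discarded */

	unsigned long b = fake_pte;	/* must be loaded from memory again */

	return a == b;
}

In the GUP-fast case the thing in the middle is not an explicit barrier() but the fully ordered refcount RMW done while pinning the page, which is the point David makes next.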
On 05.09.22 12:16, Baolin Wang wrote: > > > On 9/5/2022 3:59 PM, David Hildenbrand wrote: >> On 05.09.22 00:29, John Hubbard wrote: >>> On 9/1/22 15:27, Yang Shi wrote: >>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>>> longer >>>> sufficient to handle concurrent GUP-fast in all cases, it only handles >>>> traditional IPI-based GUP-fast correctly. On architectures that send >>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>> architectures that do not use IPI to broadcast TLB flush, it may have >>>> the below race: >>>> >>>>    CPU A                                         CPU B >>>> THP collapse                                    fast GUP >>>>                                               gup_pmd_range() <-- >>>> see valid pmd >>>>                                                   gup_pte_range() >>>> <-- work on pte >>>> pmdp_collapse_flush() <-- clear pmd and flush >>>> __collapse_huge_page_isolate() >>>>     check page pinned <-- before GUP bump refcount >>>>                                                       pin the page >>>>                                                       check PTE <-- >>>> no change >>>> __collapse_huge_page_copy() >>>>     copy data to huge page >>>>     ptep_clear() >>>> install huge pmd for the huge page >>>>                                                       return the >>>> stale page >>>> discard the stale page >>> >>> Hi Yang, >>> >>> Thanks for taking the trouble to write down these notes. I always >>> forget which race we are dealing with, and this is a great help. :) >>> >>> More... >>> >>>> >>>> The race could be fixed by checking whether PMD is changed or not after >>>> taking the page pin in fast GUP, just like what it does for PTE. If the >>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>> should back off. >>>> >>>> Also update the stale comment about serializing against fast GUP in >>>> khugepaged. >>>> >>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>>> get_user_pages_fast()") >>>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>>> --- >>>>  mm/gup.c       | 30 ++++++++++++++++++++++++------ >>>>  mm/khugepaged.c | 10 ++++++---- >>>>  2 files changed, 30 insertions(+), 10 deletions(-) >>>> >>>> diff --git a/mm/gup.c b/mm/gup.c >>>> index f3fc1f08d90c..4365b2811269 100644 >>>> --- a/mm/gup.c >>>> +++ b/mm/gup.c >>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int >>>> *nr, int nr_start, >>>>  } >>>>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>>> long end, >>>> -            unsigned int flags, struct page **pages, int *nr) >>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>>> +            unsigned long end, unsigned int flags, >>>> +            struct page **pages, int *nr) >>>>  { >>>>      struct dev_pagemap *pgmap = NULL; >>>>      int nr_start = *nr, ret = 0; >>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>>> long addr, unsigned long end, >>>>              goto pte_unmap; >>>>          } >>>> -       if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>>> +       /* >>>> +        * THP collapse conceptually does: >>>> +        *  1. Clear and flush PMD >>>> +        *  2. Check the base page refcount >>>> +        *  3. Copy data to huge page >>>> +        *  4. Clear PTE >>>> +        *  5. 
Discard the base page >>>> +        * >>>> +        * So fast GUP may race with THP collapse then pin and >>>> +        * return an old page since TLB flush is no longer sufficient >>>> +        * to serialize against fast GUP. >>>> +        * >>>> +        * Check PMD, if it is changed just back off since it >>>> +        * means there may be parallel THP collapse. >>>> +        */ >>> >>> As I mentioned in the other thread, it would be a nice touch to move >>> such discussion into the comment header. >>> >>>> +       if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>>> +           unlikely(pte_val(pte) != pte_val(*ptep))) { >>> >>> >>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >>> whole lockless house of cards may fall apart if we try reading the >>> page table values without READ_ONCE(). >> >> I came to the conclusion that the implicit memory barrier when grabbing >> a reference on the page is sufficient such that we don't need READ_ONCE >> here. > > IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always > get from a register, then we can get an old value if other thread did > set_pte(). I am not sure how the implicit memory barrier can pervent the > compiler optimization? Please correct me if I missed something. IIUC, an memory barrier always implies a compiler barrier.
On 05.09.22 12:24, David Hildenbrand wrote: > On 05.09.22 12:16, Baolin Wang wrote: >> >> >> On 9/5/2022 3:59 PM, David Hildenbrand wrote: >>> On 05.09.22 00:29, John Hubbard wrote: >>>> On 9/1/22 15:27, Yang Shi wrote: >>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>>>> longer >>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles >>>>> traditional IPI-based GUP-fast correctly. On architectures that send >>>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>>> architectures that do not use IPI to broadcast TLB flush, it may have >>>>> the below race: >>>>> >>>>>    CPU A                                         CPU B >>>>> THP collapse                                    fast GUP >>>>>                                               gup_pmd_range() <-- >>>>> see valid pmd >>>>>                                                   gup_pte_range() >>>>> <-- work on pte >>>>> pmdp_collapse_flush() <-- clear pmd and flush >>>>> __collapse_huge_page_isolate() >>>>>     check page pinned <-- before GUP bump refcount >>>>>                                                       pin the page >>>>>                                                       check PTE <-- >>>>> no change >>>>> __collapse_huge_page_copy() >>>>>     copy data to huge page >>>>>     ptep_clear() >>>>> install huge pmd for the huge page >>>>>                                                       return the >>>>> stale page >>>>> discard the stale page >>>> >>>> Hi Yang, >>>> >>>> Thanks for taking the trouble to write down these notes. I always >>>> forget which race we are dealing with, and this is a great help. :) >>>> >>>> More... >>>> >>>>> >>>>> The race could be fixed by checking whether PMD is changed or not after >>>>> taking the page pin in fast GUP, just like what it does for PTE. If the >>>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>>> should back off. >>>>> >>>>> Also update the stale comment about serializing against fast GUP in >>>>> khugepaged. >>>>> >>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>>>> get_user_pages_fast()") >>>>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>>>> --- >>>>>  mm/gup.c       | 30 ++++++++++++++++++++++++------ >>>>>  mm/khugepaged.c | 10 ++++++---- >>>>>  2 files changed, 30 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>> index f3fc1f08d90c..4365b2811269 100644 >>>>> --- a/mm/gup.c >>>>> +++ b/mm/gup.c >>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int >>>>> *nr, int nr_start, >>>>>  } >>>>>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>>>> long end, >>>>> -            unsigned int flags, struct page **pages, int *nr) >>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>>>> +            unsigned long end, unsigned int flags, >>>>> +            struct page **pages, int *nr) >>>>>  { >>>>>      struct dev_pagemap *pgmap = NULL; >>>>>      int nr_start = *nr, ret = 0; >>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>>>> long addr, unsigned long end, >>>>>              goto pte_unmap; >>>>>          } >>>>> -       if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>> +       /* >>>>> +        * THP collapse conceptually does: >>>>> +        *  1. Clear and flush PMD >>>>> +        *  2. Check the base page refcount >>>>> +        *  3. 
Copy data to huge page >>>>> +        *  4. Clear PTE >>>>> +        *  5. Discard the base page >>>>> +        * >>>>> +        * So fast GUP may race with THP collapse then pin and >>>>> +        * return an old page since TLB flush is no longer sufficient >>>>> +        * to serialize against fast GUP. >>>>> +        * >>>>> +        * Check PMD, if it is changed just back off since it >>>>> +        * means there may be parallel THP collapse. >>>>> +        */ >>>> >>>> As I mentioned in the other thread, it would be a nice touch to move >>>> such discussion into the comment header. >>>> >>>>> +       if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>>>> +           unlikely(pte_val(pte) != pte_val(*ptep))) { >>>> >>>> >>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >>>> whole lockless house of cards may fall apart if we try reading the >>>> page table values without READ_ONCE(). >>> >>> I came to the conclusion that the implicit memory barrier when grabbing >>> a reference on the page is sufficient such that we don't need READ_ONCE >>> here. >> >> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always >> get from a register, then we can get an old value if other thread did >> set_pte(). I am not sure how the implicit memory barrier can pervent the >> compiler optimization? Please correct me if I missed something. > > IIUC, an memory barrier always implies a compiler barrier. > To clarify what I mean, Documentation/atomic_t.txt documents NOTE: when the atomic RmW ops are fully ordered, they should also imply a compiler barrier.
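To see why a fully ordered RmW also acts as a compiler barrier, it helps to look at what such an op expands to on x86. The sketch below is simplified from memory (the real helpers live in arch/x86/include/asm/ with more type plumbing), but it shows the shape: the lock-prefixed instruction provides the memory ordering, and the "memory" clobber on the asm is precisely a compiler barrier:

/* Simplified sketch of a fully ordered fetch-add on x86. */
static inline int atomic_fetch_add_sketch(int i, int *v)
{
	int ret = i;

	asm volatile("lock; xaddl %0, %1"
		     : "+r" (ret), "+m" (*v)
		     : : "memory", "cc");	/* "memory" clobber = compiler barrier */

	return ret;			/* old value of *v */
}

This is also why John's earlier observation ("it definitely does do an atomic op with a return value on x86") and the atomic_t.txt note line up.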
On 9/5/2022 7:11 PM, David Hildenbrand wrote: > On 05.09.22 12:24, David Hildenbrand wrote: >> On 05.09.22 12:16, Baolin Wang wrote: >>> >>> >>> On 9/5/2022 3:59 PM, David Hildenbrand wrote: >>>> On 05.09.22 00:29, John Hubbard wrote: >>>>> On 9/1/22 15:27, Yang Shi wrote: >>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 >>>>>> ("mm: >>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>>>>> longer >>>>>> sufficient to handle concurrent GUP-fast in all cases, it only >>>>>> handles >>>>>> traditional IPI-based GUP-fast correctly. On architectures that send >>>>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>>>> architectures that do not use IPI to broadcast TLB flush, it may have >>>>>> the below race: >>>>>> >>>>>>     CPU A                                         CPU B >>>>>> THP collapse                                    fast GUP >>>>>>                                                gup_pmd_range() <-- >>>>>> see valid pmd >>>>>>                                                    gup_pte_range() >>>>>> <-- work on pte >>>>>> pmdp_collapse_flush() <-- clear pmd and flush >>>>>> __collapse_huge_page_isolate() >>>>>>      check page pinned <-- before GUP bump refcount >>>>>>                                                        pin the page >>>>>>                                                        check PTE >>>>>> <-- >>>>>> no change >>>>>> __collapse_huge_page_copy() >>>>>>      copy data to huge page >>>>>>      ptep_clear() >>>>>> install huge pmd for the huge page >>>>>>                                                        return the >>>>>> stale page >>>>>> discard the stale page >>>>> >>>>> Hi Yang, >>>>> >>>>> Thanks for taking the trouble to write down these notes. I always >>>>> forget which race we are dealing with, and this is a great help. :) >>>>> >>>>> More... >>>>> >>>>>> >>>>>> The race could be fixed by checking whether PMD is changed or not >>>>>> after >>>>>> taking the page pin in fast GUP, just like what it does for PTE. >>>>>> If the >>>>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>>>> should back off. >>>>>> >>>>>> Also update the stale comment about serializing against fast GUP in >>>>>> khugepaged. 
>>>>>> >>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>>>>> get_user_pages_fast()") >>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>>>>> --- >>>>>> mm/gup.c | 30 ++++++++++++++++++++++++------ >>>>>> mm/khugepaged.c | 10 ++++++---- >>>>>> 2 files changed, 30 insertions(+), 10 deletions(-) >>>>>> >>>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>>> index f3fc1f08d90c..4365b2811269 100644 >>>>>> --- a/mm/gup.c >>>>>> +++ b/mm/gup.c >>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int >>>>>> *nr, int nr_start, >>>>>> } >>>>>> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>>>>> long end, >>>>>> - unsigned int flags, struct page **pages, int *nr) >>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>>>>> + unsigned long end, unsigned int flags, >>>>>> + struct page **pages, int *nr) >>>>>> { >>>>>> struct dev_pagemap *pgmap = NULL; >>>>>> int nr_start = *nr, ret = 0; >>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>>>>> long addr, unsigned long end, >>>>>> goto pte_unmap; >>>>>> } >>>>>> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>>> + /* >>>>>> + * THP collapse conceptually does: >>>>>> + * 1. Clear and flush PMD >>>>>> + * 2. Check the base page refcount >>>>>> + * 3. Copy data to huge page >>>>>> + * 4. Clear PTE >>>>>> + * 5. Discard the base page >>>>>> + * >>>>>> + * So fast GUP may race with THP collapse then pin and >>>>>> + * return an old page since TLB flush is no longer >>>>>> sufficient >>>>>> + * to serialize against fast GUP. >>>>>> + * >>>>>> + * Check PMD, if it is changed just back off since it >>>>>> + * means there may be parallel THP collapse. >>>>>> + */ >>>>> >>>>> As I mentioned in the other thread, it would be a nice touch to move >>>>> such discussion into the comment header. >>>>> >>>>>> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>>>>> + unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>> >>>>> >>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >>>>> whole lockless house of cards may fall apart if we try reading the >>>>> page table values without READ_ONCE(). >>>> >>>> I came to the conclusion that the implicit memory barrier when grabbing >>>> a reference on the page is sufficient such that we don't need READ_ONCE >>>> here. >>> >>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always >>> get from a register, then we can get an old value if other thread did >>> set_pte(). I am not sure how the implicit memory barrier can pervent the >>> compiler optimization? Please correct me if I missed something. >> >> IIUC, an memory barrier always implies a compiler barrier. >> > > To clarify what I mean, Documentation/atomic_t.txt documents > > NOTE: when the atomic RmW ops are fully ordered, they should also imply > a compiler barrier. Right, I agree. That means the compiler cannot reorder the 'pte_val(*ptep)' read; however, what I am still confused about is whether the compiler can still keep the value of *ptep in a register or on the stack instead of reloading it from memory. A similar issue was fixed by commit d6c1f098f2a7 ("mm/swap_state: fix a data race in swapin_nr_pages").
--- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long offset) return 1; hits = atomic_xchg(&swapin_readahead_hits, 0); - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, + pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits, + max_pages, atomic_read(&last_readahead_pages)); if (!hits) - prev_offset = offset; + WRITE_ONCE(prev_offset, offset); atomic_set(&last_readahead_pages, pages); return pages;
On 05.09.22 16:35, Baolin Wang wrote: > > > On 9/5/2022 7:11 PM, David Hildenbrand wrote: >> On 05.09.22 12:24, David Hildenbrand wrote: >>> On 05.09.22 12:16, Baolin Wang wrote: >>>> >>>> >>>> On 9/5/2022 3:59 PM, David Hildenbrand wrote: >>>>> On 05.09.22 00:29, John Hubbard wrote: >>>>>> On 9/1/22 15:27, Yang Shi wrote: >>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 >>>>>>> ("mm: >>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>>>>>> longer >>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only >>>>>>> handles >>>>>>> traditional IPI-based GUP-fast correctly. On architectures that send >>>>>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have >>>>>>> the below race: >>>>>>> >>>>>>>     CPU A                                         CPU B >>>>>>> THP collapse                                    fast GUP >>>>>>>                                                gup_pmd_range() <-- >>>>>>> see valid pmd >>>>>>>                                                    gup_pte_range() >>>>>>> <-- work on pte >>>>>>> pmdp_collapse_flush() <-- clear pmd and flush >>>>>>> __collapse_huge_page_isolate() >>>>>>>      check page pinned <-- before GUP bump refcount >>>>>>>                                                        pin the page >>>>>>>                                                        check PTE >>>>>>> <-- >>>>>>> no change >>>>>>> __collapse_huge_page_copy() >>>>>>>      copy data to huge page >>>>>>>      ptep_clear() >>>>>>> install huge pmd for the huge page >>>>>>>                                                        return the >>>>>>> stale page >>>>>>> discard the stale page >>>>>> >>>>>> Hi Yang, >>>>>> >>>>>> Thanks for taking the trouble to write down these notes. I always >>>>>> forget which race we are dealing with, and this is a great help. :) >>>>>> >>>>>> More... >>>>>> >>>>>>> >>>>>>> The race could be fixed by checking whether PMD is changed or not >>>>>>> after >>>>>>> taking the page pin in fast GUP, just like what it does for PTE. >>>>>>> If the >>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>>>>> should back off. >>>>>>> >>>>>>> Also update the stale comment about serializing against fast GUP in >>>>>>> khugepaged. 
>>>>>>> >>>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>>>>>> get_user_pages_fast()") >>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>>>>>> --- >>>>>>>   mm/gup.c       | 30 ++++++++++++++++++++++++------ >>>>>>>   mm/khugepaged.c | 10 ++++++---- >>>>>>>   2 files changed, 30 insertions(+), 10 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>>>> index f3fc1f08d90c..4365b2811269 100644 >>>>>>> --- a/mm/gup.c >>>>>>> +++ b/mm/gup.c >>>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int >>>>>>> *nr, int nr_start, >>>>>>>   } >>>>>>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>>>>>> long end, >>>>>>> -            unsigned int flags, struct page **pages, int *nr) >>>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>>>>>> +            unsigned long end, unsigned int flags, >>>>>>> +            struct page **pages, int *nr) >>>>>>>   { >>>>>>>       struct dev_pagemap *pgmap = NULL; >>>>>>>       int nr_start = *nr, ret = 0; >>>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>>>>>> long addr, unsigned long end, >>>>>>>               goto pte_unmap; >>>>>>>           } >>>>>>> -       if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>>>> +       /* >>>>>>> +        * THP collapse conceptually does: >>>>>>> +        *  1. Clear and flush PMD >>>>>>> +        *  2. Check the base page refcount >>>>>>> +        *  3. Copy data to huge page >>>>>>> +        *  4. Clear PTE >>>>>>> +        *  5. Discard the base page >>>>>>> +        * >>>>>>> +        * So fast GUP may race with THP collapse then pin and >>>>>>> +        * return an old page since TLB flush is no longer >>>>>>> sufficient >>>>>>> +        * to serialize against fast GUP. >>>>>>> +        * >>>>>>> +        * Check PMD, if it is changed just back off since it >>>>>>> +        * means there may be parallel THP collapse. >>>>>>> +        */ >>>>>> >>>>>> As I mentioned in the other thread, it would be a nice touch to move >>>>>> such discussion into the comment header. >>>>>> >>>>>>> +       if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>>>>>> +           unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>>> >>>>>> >>>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >>>>>> whole lockless house of cards may fall apart if we try reading the >>>>>> page table values without READ_ONCE(). >>>>> >>>>> I came to the conclusion that the implicit memory barrier when grabbing >>>>> a reference on the page is sufficient such that we don't need READ_ONCE >>>>> here. >>>> >>>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always >>>> get from a register, then we can get an old value if other thread did >>>> set_pte(). I am not sure how the implicit memory barrier can pervent the >>>> compiler optimization? Please correct me if I missed something. >>> >>> IIUC, an memory barrier always implies a compiler barrier. >>> >> >> To clarify what I mean, Documentation/atomic_t.txt documents >> >> NOTE: when the atomic RmW ops are fully ordered, they should also imply >> a compiler barrier. > > Right, I agree. That means the complier can not optimize the order of > the 'pte_val(*ptep)', however what I am confusing is that the complier > can still save the value of *ptep into a register or stack instead of > reloading from memory? After the memory+compiler barrier, the value has to be reloaded. 
Documentation/memory-barriers.txt documents under "COMPILER BARRIERS": "READ_ONCE() and WRITE_ONCE() can be thought of as weak forms of barrier() that affect only the specific accesses flagged by the READ_ONCE() or WRITE_ONCE()." Consequently, if there already is a compiler barrier, additional READ_ONCE/WRITE_ONCE isn't required. > > A similar issue in commit d6c1f098f2a7 ("mm/swap_state: fix a data race > in swapin_nr_pages"). > > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long > offset) > return 1; > > hits = atomic_xchg(&swapin_readahead_hits, 0); > - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > + pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits, > + max_pages, > atomic_read(&last_readahead_pages)); > if (!hits) > - prev_offset = offset; > + WRITE_ONCE(prev_offset, offset); > atomic_set(&last_readahead_pages, pages); > > return pages; > IIUC the difference here is that there is no other implicit memory+compiler barrier in between.
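The distinction can be made a bit more tangible with the (heavily simplified) userspace renderings below; the real definitions in include/linux/compiler*.h add type checking and handle more cases, so treat these as approximations only:

/* barrier(): the compiler must assume *all* memory may have changed,
 * so every value it had cached in registers is discarded across it. */
#define barrier()		__asm__ __volatile__("" ::: "memory")

/* READ_ONCE()/WRITE_ONCE(): only the flagged access is forced to be
 * performed exactly as written -- the "weak form of barrier()" above. */
#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

So in gup_pte_range() the fully ordered refcount RMW already supplies the barrier() part between the pin and the re-read of *pmdp/*ptep, while swapin_nr_pages() has nothing comparable between its accesses to prev_offset, which is why that commit needed the per-access annotations.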
On 9/5/22 00:59, David Hildenbrand wrote: ... >>> diff --git a/mm/gup.c b/mm/gup.c >>> index f3fc1f08d90c..4365b2811269 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, >>> } >>> >>> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >>> - unsigned int flags, struct page **pages, int *nr) >>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, >>> + unsigned long end, unsigned int flags, >>> + struct page **pages, int *nr) >>> { >>> struct dev_pagemap *pgmap = NULL; >>> int nr_start = *nr, ret = 0; >>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >>> goto pte_unmap; >>> } >>> >>> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>> + /* >>> + * THP collapse conceptually does: >>> + * 1. Clear and flush PMD >>> + * 2. Check the base page refcount >>> + * 3. Copy data to huge page >>> + * 4. Clear PTE >>> + * 5. Discard the base page >>> + * >>> + * So fast GUP may race with THP collapse then pin and >>> + * return an old page since TLB flush is no longer sufficient >>> + * to serialize against fast GUP. >>> + * >>> + * Check PMD, if it is changed just back off since it >>> + * means there may be parallel THP collapse. >>> + */ >> >> As I mentioned in the other thread, it would be a nice touch to move >> such discussion into the comment header. >> >>> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>> + unlikely(pte_val(pte) != pte_val(*ptep))) { >> >> >> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this >> whole lockless house of cards may fall apart if we try reading the >> page table values without READ_ONCE(). > > I came to the conclusion that the implicit memory barrier when grabbing > a reference on the page is sufficient such that we don't need READ_ONCE > here. OK, I believe you're referring to this: folio = try_grab_folio(page, 1, flags); just earlier in gup_pte_range(). Yes that's true...but it's hidden, which is unfortunate. Maybe a comment could help. > > If we still intend to change that code, we should fixup all GUP-fast > functions in a similar way. But again, I don't think we need a change here. > It's really rough, having to play this hide-and-seek game of "who did the memory barrier". And I'm tempted to suggest adding READ_ONCE() to any and all reads of the page table entries, just to help stay out of trouble. It's a visual reminder that page table reads are always a lockless read and are inherently volatile. Of course, I realize that adding extra READ_ONCE() calls is not a good thing. It might be a performance hit, although, again, these are volatile reads by nature, so you probably had a membar anyway. And looking in reverse, there are actually a number of places here where we could probably get away with *removing* READ_ONCE()! Overall, I would be inclined to load up on READ_ONCE() calls, yes. But I sort of expect to be overridden on that, due to potential performance concerns, and that's reasonable. At a minimum we should add a few short comments about what memory barriers are used, and why we don't need a READ_ONCE() or something stronger when reading the pte. > >>> - * After this gup_fast can't run anymore. This also removes >>> - * any huge TLB entry from the CPU so we won't allow >>> - * huge and small TLB entries for the same virtual address >>> - * to avoid the risk of CPU bugs in that area. 
>>> + * This removes any huge TLB entry from the CPU so we won't allow >>> + * huge and small TLB entries for the same virtual address to >>> + * avoid the risk of CPU bugs in that area. >>> + * >>> + * Parallel fast GUP is fine since fast GUP will back off when >>> + * it detects PMD is changed. >>> */ >>> _pmd = pmdp_collapse_flush(vma, address, pmd); >> >> To follow up on David Hildenbrand's note about this in the nearby thread... >> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on >> all arches. It definitely does do an atomic op with a return value on x86, >> but that's just one arch. >> > > I think a ptep/pmdp clear + TLB flush really has to imply a memory > barrier, otherwise TLB flushing code might easily mess up with > surrounding code. But we should better double-check. Let's document the function as such, once it's verified: "This is a guaranteed memory barrier". > > s390x executes an IDTE instruction, which performs serialization (-> > memory barrier). arm64 seems to use DSB instructions to enforce memory > ordering. > thanks,
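Putting John's two requests into concrete form, the comments being asked for might look roughly like the sketches below; the try_grab_folio() call is the one he quotes from gup_pte_range(), and all wording is illustrative only, to be firmed up once the per-arch pmdp_collapse_flush() behaviour has been double-checked:

		/*
		 * The refcount grab is an atomic RMW and hence a full memory
		 * (and compiler) barrier: it orders the pin against the
		 * *pmdp/*ptep re-checks below, which is why those reads can
		 * do without their own READ_ONCE().
		 */
		folio = try_grab_folio(page, 1, flags);
		if (!folio)
			goto pte_unmap;

and, for the collapse side:

/*
 * pmdp_collapse_flush() - clear a pmd and flush the TLB for its range.
 *
 * Expected to imply a full memory barrier on all architectures, so the
 * pmd clear is ordered against the refcount checks that follow in
 * __collapse_huge_page_isolate().
 */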
On 9/5/2022 10:40 PM, David Hildenbrand wrote: > On 05.09.22 16:35, Baolin Wang wrote: >> >> >> On 9/5/2022 7:11 PM, David Hildenbrand wrote: >>> On 05.09.22 12:24, David Hildenbrand wrote: >>>> On 05.09.22 12:16, Baolin Wang wrote: >>>>> >>>>> >>>>> On 9/5/2022 3:59 PM, David Hildenbrand wrote: >>>>>> On 05.09.22 00:29, John Hubbard wrote: >>>>>>> On 9/1/22 15:27, Yang Shi wrote: >>>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 >>>>>>>> ("mm: >>>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no >>>>>>>> longer >>>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only >>>>>>>> handles >>>>>>>> traditional IPI-based GUP-fast correctly. On architectures that >>>>>>>> send >>>>>>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>>>>>> architectures that do not use IPI to broadcast TLB flush, it may >>>>>>>> have >>>>>>>> the below race: >>>>>>>> >>>>>>>>      CPU A                                         CPU B >>>>>>>> THP collapse                                    fast GUP >>>>>>>> >>>>>>>> gup_pmd_range() <-- >>>>>>>> see valid pmd >>>>>>>> >>>>>>>> gup_pte_range() >>>>>>>> <-- work on pte >>>>>>>> pmdp_collapse_flush() <-- clear pmd and flush >>>>>>>> __collapse_huge_page_isolate() >>>>>>>>       check page pinned <-- before GUP bump refcount >>>>>>>>                                                         pin >>>>>>>> the page >>>>>>>>                                                         check PTE >>>>>>>> <-- >>>>>>>> no change >>>>>>>> __collapse_huge_page_copy() >>>>>>>>       copy data to huge page >>>>>>>>       ptep_clear() >>>>>>>> install huge pmd for the huge page >>>>>>>>                                                         return >>>>>>>> the >>>>>>>> stale page >>>>>>>> discard the stale page >>>>>>> >>>>>>> Hi Yang, >>>>>>> >>>>>>> Thanks for taking the trouble to write down these notes. I always >>>>>>> forget which race we are dealing with, and this is a great help. :) >>>>>>> >>>>>>> More... >>>>>>> >>>>>>>> >>>>>>>> The race could be fixed by checking whether PMD is changed or not >>>>>>>> after >>>>>>>> taking the page pin in fast GUP, just like what it does for PTE. >>>>>>>> If the >>>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>>>>>> should back off. >>>>>>>> >>>>>>>> Also update the stale comment about serializing against fast GUP in >>>>>>>> khugepaged. 
>>>>>>>> >>>>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU >>>>>>>> get_user_pages_fast()") >>>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com> >>>>>>>> --- >>>>>>>>    mm/gup.c       | 30 ++++++++++++++++++++++++------ >>>>>>>>    mm/khugepaged.c | 10 ++++++---- >>>>>>>>    2 files changed, 30 insertions(+), 10 deletions(-) >>>>>>>> >>>>>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>>>>> index f3fc1f08d90c..4365b2811269 100644 >>>>>>>> --- a/mm/gup.c >>>>>>>> +++ b/mm/gup.c >>>>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused >>>>>>>> undo_dev_pagemap(int >>>>>>>> *nr, int nr_start, >>>>>>>>    } >>>>>>>>    #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>>>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned >>>>>>>> long end, >>>>>>>> -            unsigned int flags, struct page **pages, int *nr) >>>>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long >>>>>>>> addr, >>>>>>>> +            unsigned long end, unsigned int flags, >>>>>>>> +            struct page **pages, int *nr) >>>>>>>>    { >>>>>>>>        struct dev_pagemap *pgmap = NULL; >>>>>>>>        int nr_start = *nr, ret = 0; >>>>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned >>>>>>>> long addr, unsigned long end, >>>>>>>>                goto pte_unmap; >>>>>>>>            } >>>>>>>> -       if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>>>>> +       /* >>>>>>>> +        * THP collapse conceptually does: >>>>>>>> +        *  1. Clear and flush PMD >>>>>>>> +        *  2. Check the base page refcount >>>>>>>> +        *  3. Copy data to huge page >>>>>>>> +        *  4. Clear PTE >>>>>>>> +        *  5. Discard the base page >>>>>>>> +        * >>>>>>>> +        * So fast GUP may race with THP collapse then pin and >>>>>>>> +        * return an old page since TLB flush is no longer >>>>>>>> sufficient >>>>>>>> +        * to serialize against fast GUP. >>>>>>>> +        * >>>>>>>> +        * Check PMD, if it is changed just back off since it >>>>>>>> +        * means there may be parallel THP collapse. >>>>>>>> +        */ >>>>>>> >>>>>>> As I mentioned in the other thread, it would be a nice touch to move >>>>>>> such discussion into the comment header. >>>>>>> >>>>>>>> +       if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >>>>>>>> +           unlikely(pte_val(pte) != pte_val(*ptep))) { >>>>>>> >>>>>>> >>>>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because >>>>>>> this >>>>>>> whole lockless house of cards may fall apart if we try reading the >>>>>>> page table values without READ_ONCE(). >>>>>> >>>>>> I came to the conclusion that the implicit memory barrier when >>>>>> grabbing >>>>>> a reference on the page is sufficient such that we don't need >>>>>> READ_ONCE >>>>>> here. >>>>> >>>>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always >>>>> get from a register, then we can get an old value if other thread did >>>>> set_pte(). I am not sure how the implicit memory barrier can >>>>> pervent the >>>>> compiler optimization? Please correct me if I missed something. >>>> >>>> IIUC, an memory barrier always implies a compiler barrier. >>>> >>> >>> To clarify what I mean, Documentation/atomic_t.txt documents >>> >>> NOTE: when the atomic RmW ops are fully ordered, they should also imply >>> a compiler barrier. >> >> Right, I agree. 
That means the complier can not optimize the order of >> the 'pte_val(*ptep)', however what I am confusing is that the complier >> can still save the value of *ptep into a register or stack instead of >> reloading from memory? > > After the memory+compiler barrier, the value has to be reloaded. > Documentation/memory-barriers.txt documents under "COMPILER BARRIERS": After some investigation, I realized you are totally right. The GCC Extended Asm manual [1] also says: "To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm. Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler." So as you said, the value will be reloaded after the memory+compiler barrier. Thanks for your explanation. [1] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html > > "READ_ONCE() and WRITE_ONCE() can be thought of as weak forms of > barrier() that affect only the specific accesses flagged by the > READ_ONCE() or WRITE_ONCE()." > > Consequently, if there already is a compile barrier, additional > READ_ONCE/WRITE_ONCE isn't required. > >> >> A similar issue in commit d6c1f098f2a7 ("mm/swap_state: fix a data race >> in swapin_nr_pages"). >> >> --- a/mm/swap_state.c >> +++ b/mm/swap_state.c >> @@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long >> offset) >>                  return 1; >> >>          hits = atomic_xchg(&swapin_readahead_hits, 0); >> -      pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, >> +      pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits, >> +                                max_pages, >>                                    atomic_read(&last_readahead_pages)); >>          if (!hits) >> -              prev_offset = offset; >> +              WRITE_ONCE(prev_offset, offset); >>          atomic_set(&last_readahead_pages, pages); >> >>          return pages; >> > > IIUC the difference here is that there is not other implicit > memory+compile barrier in between. Yes, I see the difference.
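As a toy (userspace) illustration of the conclusion just reached -- an asm
"memory" clobber is a compiler barrier, so a value cached in a register must
be re-read from memory after it; 'flag' here is only a stand-in for the
concurrently modified page table entry:

	/* Builds with any recent GCC/Clang; not kernel code. */
	#define barrier()	__asm__ __volatile__("" ::: "memory")

	int flag;			/* imagine another thread writes this */

	int reads_twice(void)
	{
		int a = flag;		/* first load                            */
		barrier();		/* compiler may not reuse the cached 'a' */
		int b = flag;		/* must be a second load from memory     */

		return a == b;		/* may be false if 'flag' changed        */
	}

Without the barrier() the compiler is allowed to fold the two loads into one
and make the function trivially return 1.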
> OK, I believe you're referring to this: > > folio = try_grab_folio(page, 1, flags); > > just earlier in gup_pte_range(). Yes that's true...but it's hidden, which > is unfortunate. Maybe a comment could help. > Most certainly. >> >> If we still intend to change that code, we should fixup all GUP-fast >> functions in a similar way. But again, I don't think we need a change here. >> > > It's really rough, having to play this hide-and-seek game of "who did > the memory barrier". And I'm tempted to suggest adding READ_ONCE() to > any and all reads of the page table entries, just to help stay out of > trouble. It's a visual reminder that page table reads are always a > lockless read and are inherently volatile. > > Of course, I realize that adding extra READ_ONCE() calls is not a good > thing. It might be a performance hit, although, again, these are > volatile reads by nature, so you probably had a membar anyway. > > And looking in reverse, there are actually a number of places here where > we could probably get away with *removing* READ_ONCE()! > > Overall, I would be inclined to load up on READ_ONCE() calls, yes. But I > sort of expect to be overridden on that, due to potential performance > concerns, and that's reasonable. > > At a minimum we should add a few short comments about what memory > barriers are used, and why we don't need a READ_ONCE() or something > stronger when reading the pte. Adding more unnecessary memory barriers doesn't necessarily improve the situation. Messing with memory barriers is and remains absolutely disgusting. IMHO, only clear documentation and ASCII art can keep it somehow maintainable for human beings. > > >> >>>> - * After this gup_fast can't run anymore. This also removes >>>> - * any huge TLB entry from the CPU so we won't allow >>>> - * huge and small TLB entries for the same virtual address >>>> - * to avoid the risk of CPU bugs in that area. >>>> + * This removes any huge TLB entry from the CPU so we won't allow >>>> + * huge and small TLB entries for the same virtual address to >>>> + * avoid the risk of CPU bugs in that area. >>>> + * >>>> + * Parallel fast GUP is fine since fast GUP will back off when >>>> + * it detects PMD is changed. >>>> */ >>>> _pmd = pmdp_collapse_flush(vma, address, pmd); >>> >>> To follow up on David Hildenbrand's note about this in the nearby thread... >>> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on >>> all arches. It definitely does do an atomic op with a return value on x86, >>> but that's just one arch. >>> >> >> I think a ptep/pmdp clear + TLB flush really has to imply a memory >> barrier, otherwise TLB flushing code might easily mess up with >> surrounding code. But we should better double-check. > > Let's document the function as such, once it's verified: "This is a > guaranteed memory barrier". Yes. Hopefully it indeed is on all architectures :)
On Mon, Sep 05, 2022 at 09:59:47AM +0200, David Hildenbrand wrote:

> > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> > whole lockless house of cards may fall apart if we try reading the
> > page table values without READ_ONCE().
>
> I came to the conclusion that the implicit memory barrier when grabbing a
> reference on the page is sufficient such that we don't need READ_ONCE here.

READ_ONCE is not about barriers or ordering, you still need the
acquire inside the atomic to make the algorithm work.

READ_ONCE primarily is a marker that the data being read is unstable
and that the compiler must avoid all instability when reading it. eg
in this case the compiler could insanely double read the value, even
though the 'if' requires only a single read. This would result in
corrupt calculation.

> If we still intend to change that code, we should fixup all GUP-fast
> functions in a similar way.

This is correct, IMHO we have been slowly modernizing the mm approach
to the memory model to include things like this.

While it would be nice to do everything I think small bits are
welcomed as they are discovered.

Jason
On 06.09.22 15:47, Jason Gunthorpe wrote:
> On Mon, Sep 05, 2022 at 09:59:47AM +0200, David Hildenbrand wrote:
>
>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>> whole lockless house of cards may fall apart if we try reading the
>>> page table values without READ_ONCE().
>>
>> I came to the conclusion that the implicit memory barrier when grabbing a
>> reference on the page is sufficient such that we don't need READ_ONCE here.
>
> READ_ONCE is not about barriers or ordering, you still need the
> acquire inside the atomic to make the algorithm work.

While I don't disagree with what you say, I'll refer to
Documentation/memory-barriers.txt "COMPILER BARRIER".

As discussed somewhere in this thread, if we already have an atomic RMW
that implies a full ordering, it implies a compiler barrier.

>
> READ_ONCE primarily is a marker that the data being read is unstable
> and that the compiler must avoid all instability when reading it. eg
> in this case the compiler could insanely double read the value, even
> though the 'if' requires only a single read. This would result in
> corrupt calculation.

As we have a full memory barrier + compiler barrier, the compiler might
indeed do double reads and all that stuff. BUT, it has to re-read after we
incremented the refcount, and IMHO that's the important part to detect the
change.
On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote:

> > READ_ONCE primarily is a marker that the data being read is unstable
> > and that the compiler must avoid all instability when reading it. eg
> > in this case the compiler could insanely double read the value, even
> > though the 'if' requires only a single read. This would result in
> > corrupt calculation.
>
> As we have a full memory barrier + compiler barrier, the compiler might
> indeed do double reads and all that stuff. BUT, it has to re-read after we
> incremented the refcount, and IMHO that's the important part to detect the
> change.

Yes, it is important, but it is not the only important part.

The compiler still has to execute "if (*a != b)" *correctly*.

This is what READ_ONCE is for. It doesn't set order, it doesn't
implement a barrier, it tells the compiler that '*a' is unstable data
and the compiler cannot make assumptions based on the idea that
reading '*a' multiple times will always return the same value.

If the compiler makes those assumptions then maybe even though 'if (*a
!= b)' is the reality, it could mis-compute '*a == b'. You enter into
undefined behavior here.

Though it is all very unlikely, the general memory model standard is
to annotate with READ_ONCE.

Jason
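For readers who want Jason's point in code form, roughly what READ_ONCE()
boils down to (the in-tree definition adds type checking and special-cases
access sizes, so treat this only as the core idea):

	#define READ_ONCE_sketch(x)	(*(const volatile typeof(x) *)&(x))

	/* Fragment, assuming kernel types: */
	bool pte_unchanged(pte_t *ptep, pte_t orig)
	{
		pte_t cur = READ_ONCE_sketch(*ptep);	/* exactly one load */

		/*
		 * With a plain '*ptep' instead, the compiler may emit the
		 * load more than once (or re-derive it later) and assume
		 * all copies agree -- exactly the instability described
		 * above.
		 */
		return pte_val(cur) == pte_val(orig);
	}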
On 06.09.22 16:30, Jason Gunthorpe wrote:
> On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote:
>
>>> READ_ONCE primarily is a marker that the data being read is unstable
>>> and that the compiler must avoid all instability when reading it. eg
>>> in this case the compiler could insanely double read the value, even
>>> though the 'if' requires only a single read. This would result in
>>> corrupt calculation.
>>
>> As we have a full memory barrier + compiler barrier, the compiler might
>> indeed do double reads and all that stuff. BUT, it has to re-read after we
>> incremented the refcount, and IMHO that's the important part to detect the
>> change.
>
> Yes, it is important, but it is not the only important part.
>
> The compiler still has to execute "if (*a != b)" *correctly*.
>
> This is what READ_ONCE is for. It doesn't set order, it doesn't
> implement a barrier, it tells the compiler that '*a' is unstable data
> and the compiler cannot make assumptions based on the idea that
> reading '*a' multiple times will always return the same value.
>
> If the compiler makes those assumptions then maybe even though 'if (*a
> != b)' is the reality, it could mis-compute '*a == b'. You enter into
> undefined behavior here.
>
> Though it is all very unlikely, the general memory model standard is
> to annotate with READ_ONCE.

The only thing I could see going wrong in the comparison once the stars
align would be something like the following:

if (*a != b)

implemented as

if ((*a).lower != b.lower && (*a).higher != b.higher)


This could only go wrong if we have more than one change such that:

Original:

*a = 0x00000000ffffffffull;


First modification:
*a = 0xffffffffffffffffull;

Second modification:
*a = 0x00000000eeeeeeeeull;


If we race with both modifications, we could see that ffffffff matches,
and could see that 00000000 matches as well.


So I agree that we should change it, but not necessarily as an urgent
fix and not necessarily in this patch. It's best to adjust all gup_*
functions in one patch.

... I do wonder if we want to reuse ptep_get_lockless() instead of the
READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing.
On Tue, Sep 06, 2022 at 04:44:20PM +0200, David Hildenbrand wrote: > ... I do wonder if we want to reuse ptep_get_lockless() instead of the > READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing. Yes, that is a whole other topic, and you may be quite right on that. Jason
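Since ptep_get_lockless() came up: on the CONFIG_GUP_GET_PTE_LOW_HIGH
configurations David mentions, it deals with exactly the split 64-bit read
his example worries about, roughly like this (paraphrased from memory;
per-arch details differ):

	static inline pte_t ptep_get_lockless_sketch(pte_t *ptep)
	{
		pte_t pte;

		do {
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}

If the low half changed while the high half was being read, the loop simply
retries, instead of handing the caller a mix of two different PTE values.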
On Sun, Sep 4, 2022 at 3:29 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 9/1/22 15:27, Yang Shi wrote: > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > > sufficient to handle concurrent GUP-fast in all cases, it only handles > > traditional IPI-based GUP-fast correctly. On architectures that send > > an IPI broadcast on TLB flush, it works as expected. But on the > > architectures that do not use IPI to broadcast TLB flush, it may have > > the below race: > > > > CPU A CPU B > > THP collapse fast GUP > > gup_pmd_range() <-- see valid pmd > > gup_pte_range() <-- work on pte > > pmdp_collapse_flush() <-- clear pmd and flush > > __collapse_huge_page_isolate() > > check page pinned <-- before GUP bump refcount > > pin the page > > check PTE <-- no change > > __collapse_huge_page_copy() > > copy data to huge page > > ptep_clear() > > install huge pmd for the huge page > > return the stale page > > discard the stale page > > Hi Yang, > > Thanks for taking the trouble to write down these notes. I always > forget which race we are dealing with, and this is a great help. :) My pleasure, I'm glad it is helpful. > > More... > > > > > The race could be fixed by checking whether PMD is changed or not after > > taking the page pin in fast GUP, just like what it does for PTE. If the > > PMD is changed it means there may be parallel THP collapse, so GUP > > should back off. > > > > Also update the stale comment about serializing against fast GUP in > > khugepaged. > > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > > Signed-off-by: Yang Shi <shy828301@gmail.com> > > --- > > mm/gup.c | 30 ++++++++++++++++++++++++------ > > mm/khugepaged.c | 10 ++++++---- > > 2 files changed, 30 insertions(+), 10 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index f3fc1f08d90c..4365b2811269 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > > } > > > > #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > - unsigned int flags, struct page **pages, int *nr) > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > + unsigned long end, unsigned int flags, > > + struct page **pages, int *nr) > > { > > struct dev_pagemap *pgmap = NULL; > > int nr_start = *nr, ret = 0; > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > goto pte_unmap; > > } > > > > - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > + /* > > + * THP collapse conceptually does: > > + * 1. Clear and flush PMD > > + * 2. Check the base page refcount > > + * 3. Copy data to huge page > > + * 4. Clear PTE > > + * 5. Discard the base page > > + * > > + * So fast GUP may race with THP collapse then pin and > > + * return an old page since TLB flush is no longer sufficient > > + * to serialize against fast GUP. > > + * > > + * Check PMD, if it is changed just back off since it > > + * means there may be parallel THP collapse. > > + */ > > As I mentioned in the other thread, it would be a nice touch to move > such discussion into the comment header. Sure, you mean the comment before gup_pte_range() so that the real code stays succinct, right? 
> > > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > > + unlikely(pte_val(pte) != pte_val(*ptep))) { > > > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this > whole lockless house of cards may fall apart if we try reading the > page table values without READ_ONCE(). > > That's a rather vague statement, and in fact, the READ_ONCE() should > be paired with a page table write somewhere else, to make that claim > more precise. Thanks for the suggestion. Per the discussion later (mainly from David and Jason), I think we are going to have a separate patch to clean up all the page table access for GUP. > > > > gup_put_folio(folio, 1, flags); > > goto pte_unmap; > > } > > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > * get_user_pages_fast_only implementation that can pin pages. Thus it's still > > * useful to have gup_huge_pmd even if we can't operate on ptes. > > */ > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > > - unsigned int flags, struct page **pages, int *nr) > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > > + unsigned long end, unsigned int flags, > > + struct page **pages, int *nr) > > { > > return 0; > > } > > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > > if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, > > PMD_SHIFT, next, flags, pages, nr)) > > return 0; > > - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) > > + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) > > return 0; > > } while (pmdp++, addr = next, addr != end); > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 2d74cf01f694..518b49095db3 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > > /* > > - * After this gup_fast can't run anymore. This also removes > > - * any huge TLB entry from the CPU so we won't allow > > - * huge and small TLB entries for the same virtual address > > - * to avoid the risk of CPU bugs in that area. > > + * This removes any huge TLB entry from the CPU so we won't allow > > + * huge and small TLB entries for the same virtual address to > > + * avoid the risk of CPU bugs in that area. > > + * > > + * Parallel fast GUP is fine since fast GUP will back off when > > + * it detects PMD is changed. > > */ > > _pmd = pmdp_collapse_flush(vma, address, pmd); > > To follow up on David Hildenbrand's note about this in the nearby thread... > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on > all arches. It definitely does do an atomic op with a return value on x86, > but that's just one arch. Will reply in detail to David's thread. > > > thanks, > > -- > John Hubbard > NVIDIA > > > spin_unlock(pmd_ptl); >
On Mon, Sep 5, 2022 at 12:59 AM David Hildenbrand <david@redhat.com> wrote: > > On 05.09.22 00:29, John Hubbard wrote: > > On 9/1/22 15:27, Yang Shi wrote: > >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > >> sufficient to handle concurrent GUP-fast in all cases, it only handles > >> traditional IPI-based GUP-fast correctly. On architectures that send > >> an IPI broadcast on TLB flush, it works as expected. But on the > >> architectures that do not use IPI to broadcast TLB flush, it may have > >> the below race: > >> > >> CPU A CPU B > >> THP collapse fast GUP > >> gup_pmd_range() <-- see valid pmd > >> gup_pte_range() <-- work on pte > >> pmdp_collapse_flush() <-- clear pmd and flush > >> __collapse_huge_page_isolate() > >> check page pinned <-- before GUP bump refcount > >> pin the page > >> check PTE <-- no change > >> __collapse_huge_page_copy() > >> copy data to huge page > >> ptep_clear() > >> install huge pmd for the huge page > >> return the stale page > >> discard the stale page > > > > Hi Yang, > > > > Thanks for taking the trouble to write down these notes. I always > > forget which race we are dealing with, and this is a great help. :) > > > > More... > > > >> > >> The race could be fixed by checking whether PMD is changed or not after > >> taking the page pin in fast GUP, just like what it does for PTE. If the > >> PMD is changed it means there may be parallel THP collapse, so GUP > >> should back off. > >> > >> Also update the stale comment about serializing against fast GUP in > >> khugepaged. > >> > >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > >> Signed-off-by: Yang Shi <shy828301@gmail.com> > >> --- > >> mm/gup.c | 30 ++++++++++++++++++++++++------ > >> mm/khugepaged.c | 10 ++++++---- > >> 2 files changed, 30 insertions(+), 10 deletions(-) > >> > >> diff --git a/mm/gup.c b/mm/gup.c > >> index f3fc1f08d90c..4365b2811269 100644 > >> --- a/mm/gup.c > >> +++ b/mm/gup.c > >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > >> } > >> > >> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > >> - unsigned int flags, struct page **pages, int *nr) > >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > >> + unsigned long end, unsigned int flags, > >> + struct page **pages, int *nr) > >> { > >> struct dev_pagemap *pgmap = NULL; > >> int nr_start = *nr, ret = 0; > >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > >> goto pte_unmap; > >> } > >> > >> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > >> + /* > >> + * THP collapse conceptually does: > >> + * 1. Clear and flush PMD > >> + * 2. Check the base page refcount > >> + * 3. Copy data to huge page > >> + * 4. Clear PTE > >> + * 5. Discard the base page > >> + * > >> + * So fast GUP may race with THP collapse then pin and > >> + * return an old page since TLB flush is no longer sufficient > >> + * to serialize against fast GUP. > >> + * > >> + * Check PMD, if it is changed just back off since it > >> + * means there may be parallel THP collapse. > >> + */ > > > > As I mentioned in the other thread, it would be a nice touch to move > > such discussion into the comment header. 
> > > >> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > >> + unlikely(pte_val(pte) != pte_val(*ptep))) { > > > > > > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this > > whole lockless house of cards may fall apart if we try reading the > > page table values without READ_ONCE(). > > I came to the conclusion that the implicit memory barrier when grabbing > a reference on the page is sufficient such that we don't need READ_ONCE > here. > > If we still intend to change that code, we should fixup all GUP-fast > functions in a similar way. But again, I don't think we need a change here. > > > >> - * After this gup_fast can't run anymore. This also removes > >> - * any huge TLB entry from the CPU so we won't allow > >> - * huge and small TLB entries for the same virtual address > >> - * to avoid the risk of CPU bugs in that area. > >> + * This removes any huge TLB entry from the CPU so we won't allow > >> + * huge and small TLB entries for the same virtual address to > >> + * avoid the risk of CPU bugs in that area. > >> + * > >> + * Parallel fast GUP is fine since fast GUP will back off when > >> + * it detects PMD is changed. > >> */ > >> _pmd = pmdp_collapse_flush(vma, address, pmd); > > > > To follow up on David Hildenbrand's note about this in the nearby thread... > > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on > > all arches. It definitely does do an atomic op with a return value on x86, > > but that's just one arch. > > > > I think a ptep/pmdp clear + TLB flush really has to imply a memory > barrier, otherwise TLB flushing code might easily mess up with > surrounding code. But we should better double-check. > > s390x executes an IDTE instruction, which performs serialization (-> > memory barrier). arm64 seems to use DSB instructions to enforce memory > ordering. IIUC we just need to care about the architectures which support both THP and fast-GUP, right? If so I got the below list: Loongarch - I didn't see explicit memory barrier, but it does IPI MIPS - as same as Loongarch PowerPC - has memory barrier x86 - atomic operations imply memory barrier Arm - I didn't see explicit memory barrier, not sure if ARM's tlb flush instructions imply memory barrier or not, but it does IPI Arm64 - uses DSB as David mentioned s390 - executes IDTE as David mentioned So I think the questionable architectures are Loongarch, MIPS and ARM. And IPI is not called if the PMD entry is not present on a remote CPU. So they may have race against fast GUP IIUC. But we'd better double check with the maintainers. > > -- > Thanks, > > David / dhildenb >
On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > Yang Shi <shy828301@gmail.com> writes: > > > > > On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: > >> > >> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > >> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > >> > > > >> > > Hi, Yang, > >> > > > >> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > >> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > >> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > >> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles > >> > > > traditional IPI-based GUP-fast correctly. > >> > > > >> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > >> > > I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > >> > > that'll keep working as long as interrupt disabled (which current fast-gup > >> > > will still do)? > >> > > >> > Actually the wording was copied from David's commit log for his > >> > PageAnonExclusive fix. My understanding is the IPI broadcast still > >> > works, but it may not be supported by all architectures and not > >> > preferred anymore. So we should avoid depending on IPI broadcast IIUC. > >> > > >> > > > >> > > IIUC the issue is you suspect not all archs correctly implemented > >> > > pmdp_collapse_flush(), or am I wrong? > >> > > >> > This is a possible fix, please see below for details. > >> > > >> > > > >> > > > On architectures that send > >> > > > an IPI broadcast on TLB flush, it works as expected. But on the > >> > > > architectures that do not use IPI to broadcast TLB flush, it may have > >> > > > the below race: > >> > > > > >> > > > CPU A CPU B > >> > > > THP collapse fast GUP > >> > > > gup_pmd_range() <-- see valid pmd > >> > > > gup_pte_range() <-- work on pte > >> > > > pmdp_collapse_flush() <-- clear pmd and flush > >> > > > __collapse_huge_page_isolate() > >> > > > check page pinned <-- before GUP bump refcount > >> > > > pin the page > >> > > > check PTE <-- no change > >> > > > __collapse_huge_page_copy() > >> > > > copy data to huge page > >> > > > ptep_clear() > >> > > > install huge pmd for the huge page > >> > > > return the stale page > >> > > > discard the stale page > >> > > > > >> > > > The race could be fixed by checking whether PMD is changed or not after > >> > > > taking the page pin in fast GUP, just like what it does for PTE. If the > >> > > > PMD is changed it means there may be parallel THP collapse, so GUP > >> > > > should back off. > >> > > > >> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for > >> > > the archs that are missing? Do you know which arch(s) is broken with it? > >> > > >> > Yes, and this was suggested by me in the first place, but per the > >> > suggestion from John and David, this is not the preferred way. I think > >> > it is because: > >> > > >> > Firstly, using IPI to serialize against fast GUP is not recommended > >> > anymore since fast GUP does check PTE then back off so we should avoid > >> > it. > >> > Secondly, if checking PMD then backing off could solve the problem, > >> > why do we still need broadcast IPI? It doesn't sound performant. > >> > > >> > > > >> > > It's just not clear to me whether this patch is an optimization or a fix, > >> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > >> > > still be needed. 
> >> > > >> > It is a fix and the fix will make IPI broadcast not useful anymore. > >> > >> How about another patch to remove the ppc impl too? Then it can be a two > >> patches series. > > > > BTW, I don't think we could remove the ppc implementation since it is > > different from the generic pmdp_collapse_flush(), particularly for the > > hash part IIUC. > > > > The generic version calls flush_tlb_range() -> hash__flush_tlb_range() > > for hash, but the hash call is actually no-op. The ppc version calls > > hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does > > something useful. > > > > We should actually rename flush_tlb_pmd_range(). It actually flush the > hash page table entries. > > I will do the below patch for ppc64 to clarify this better Thanks, Aneesh. It looks more readable. A follow-up question, I think we could remove serialize_against_pte_lookup(), which just issues IPI broadcast to run a dummy function. This IPI should not be needed anymore with my patch. Of course, we need to keep the memory barrier. > > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > index 8b762f282190..fd30fa20c392 100644 > --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start, > > struct mmu_gather; > extern void hash__tlb_flush(struct mmu_gather *tlb); > -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); > > #ifdef CONFIG_PPC_64S_HASH_MMU > /* Private function for use by PCI IO mapping code */ > extern void __flush_hash_table_range(unsigned long start, unsigned long end); > -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, > - unsigned long addr); > +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, > + unsigned long addr); > #else > static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { } > #endif > diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c > index ae008b9df0e6..f30131933a01 100644 > --- a/arch/powerpc/mm/book3s64/hash_pgtable.c > +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c > @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres > * the __collapse_huge_page_copy can result in copying > * the old content. > */ > - flush_tlb_pmd_range(vma->vm_mm, &pmd, address); > + flush_hash_table_pmd_range(vma->vm_mm, &pmd, address); > return pmd; > } > > diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c > index eb0bccaf221e..a64ea0a7ef96 100644 > --- a/arch/powerpc/mm/book3s64/hash_tlb.c > +++ b/arch/powerpc/mm/book3s64/hash_tlb.c > @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) > local_irq_restore(flags); > } > > -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) > +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) > { > pte_t *pte; > pte_t *start_pte; >
On Tue, Sep 6, 2022 at 7:44 AM David Hildenbrand <david@redhat.com> wrote: > > On 06.09.22 16:30, Jason Gunthorpe wrote: > > On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote: > > > >>> READ_ONCE primarily is a marker that the data being read is unstable > >>> and that the compiler must avoid all instability when reading it. eg > >>> in this case the compiler could insanely double read the value, even > >>> though the 'if' requires only a single read. This would result in > >>> corrupt calculation. > >> > >> As we have a full memory barrier + compile barrier, the compiler might > >> indeed do double reads and all that stuff. BUT, it has to re-read after we > >> incremented the refcount, and IMHO that's the important part to detect the > >> change. > > > > Yes, it is important, but it is not the only important part. > > > > The compiler still has to exectute "if (*a != b)" *correctly*. > > > > This is what READ_ONCE is for. It doesn't set order, it doesn't > > implement a barrier, it tells the compiler that '*a' is unstable data > > and the compiler cannot make assumptions based on the idea that > > reading '*a' multiple times will always return the same value. > > > > If the compiler makes those assumptions then maybe even though 'if (*a > > != b)' is the reality, it could mis-compute '*a == b'. You enter into > > undefined behavior here. > > > > Though it is all very unlikely, the general memory model standard is > > to annotate with READ_ONCE. > > The only thing I could see going wrong in the comparison once the stars > alingn would be something like the following: > > if (*a != b) > > implemented as > > if ((*a).lower != b.lower && (*a).higher != b.higher) > > > This could only go wrong if we have more than one change such that: > > Original: > > *a = 0x00000000ffffffffull; > > > First modification: > *a = 0xffffffffffffffffull; > > Second modification: > *a = 0x00000000eeeeeeeeull; IIUC this is typically a 32-bit thing. > > > If we race with both modifications, we could see that ffffffff matches, > and could see that 00000000 matches as well. > > > So I agree that we should change it, but not necessarily as an urgent > fix and not necessarily in this patch. It's best to adjust all gup_* > functions in one patch. > > ... I do wonder if we want to reuse ptep_get_lockless() instead of the > READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing. > > -- > Thanks, > > David / dhildenb >
On 9/6/22 11:50, Yang Shi wrote: >>> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { >>> + /* >>> + * THP collapse conceptually does: >>> + * 1. Clear and flush PMD >>> + * 2. Check the base page refcount >>> + * 3. Copy data to huge page >>> + * 4. Clear PTE >>> + * 5. Discard the base page >>> + * >>> + * So fast GUP may race with THP collapse then pin and >>> + * return an old page since TLB flush is no longer sufficient >>> + * to serialize against fast GUP. >>> + * >>> + * Check PMD, if it is changed just back off since it >>> + * means there may be parallel THP collapse. >>> + */ >> >> As I mentioned in the other thread, it would be a nice touch to move >> such discussion into the comment header. > > Sure, you mean the comment before gup_pte_range() so that the real > code stays succinct, right? Yes. thanks,
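For concreteness, one possible shape for that relocation -- the wording
below is just the patch's own in-function comment moved above the function
(a sketch only; the final wording is up to the patch author):

	/*
	 * Fast GUP vs. THP collapse:
	 *
	 * THP collapse conceptually does:
	 *  1. Clear and flush PMD
	 *  2. Check the base page refcount
	 *  3. Copy data to huge page
	 *  4. Clear PTE
	 *  5. Discard the base page
	 *
	 * So fast GUP may race with THP collapse then pin and return an old
	 * page, since a TLB flush is no longer sufficient to serialize
	 * against fast GUP. After grabbing the folio reference, re-check
	 * the PMD as well as the PTE and back off if either changed.
	 */
	static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
				 unsigned long end, unsigned int flags,
				 struct page **pages, int *nr)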
On 9/6/22 07:44, David Hildenbrand wrote: >> Though it is all very unlikely, the general memory model standard is >> to annotate with READ_ONCE. > > The only thing I could see going wrong in the comparison once the stars > alingn would be something like the following: > > if (*a != b) > > implemented as > > if ((*a).lower != b.lower && (*a).higher != b.higher) > > > This could only go wrong if we have more than one change such that: > > Original: > > *a = 0x00000000ffffffffull; > > > First modification: > *a = 0xffffffffffffffffull; > > Second modification: > *a = 0x00000000eeeeeeeeull; > > > If we race with both modifications, we could see that ffffffff matches, > and could see that 00000000 matches as well. > > > So I agree that we should change it, but not necessarily as an urgent > fix and not necessarily in this patch. It's best to adjust all gup_* > functions in one patch. > We had a long thread with Paul McKenney back in May [1] about this exact sort of problem. In that thread, I recall that "someone" tried to claim that a bare one-byte read was safe, and even that innocent-sounding claim got basically torn apart! :) Because the kernel memory model simply does not cover you for bare reads and writes to shared mutable memory. Unfortunately, until now, I'd only really remembered the conclusion: "use READ_ONCE() and WRITE_ONCE() for any touching of shared mutable memory", and not the point about other memory barriers not covering this aspect. Thanks to Jason for reminding us of this. This time I think I will remember it well enough to avoid another long thread. Maybe. [1] https://lore.kernel.org/all/20220524163728.GO1790663@paulmck-ThinkPad-P17-Gen-1/ thanks,
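The baseline rule John is recalling, in its minimal form ('shared_word' is a
made-up stand-in for any mutable memory accessed outside a lock):

	int shared_word;

	/* writer side */
	WRITE_ONCE(shared_word, 1);

	/* reader side */
	val = READ_ONCE(shared_word);

The marked accesses tell the compiler the location is shared and unstable,
so each access happens exactly once and is not torn or invented by the
compiler; any stronger ordering still has to come from explicit barriers or
ordered atomics, as discussed above.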
On 9/7/22 12:37 AM, Yang Shi wrote: > On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> Yang Shi <shy828301@gmail.com> writes: >> >>> >>> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: >>>>> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: >>>>>> >>>>>> Hi, Yang, >>>>>> >>>>>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: >>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: >>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer >>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles >>>>>>> traditional IPI-based GUP-fast correctly. >>>>>> >>>>>> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, >>>>>> I'm kind of confused why it's not sufficient even if with RCU gup? Isn't >>>>>> that'll keep working as long as interrupt disabled (which current fast-gup >>>>>> will still do)? >>>>> >>>>> Actually the wording was copied from David's commit log for his >>>>> PageAnonExclusive fix. My understanding is the IPI broadcast still >>>>> works, but it may not be supported by all architectures and not >>>>> preferred anymore. So we should avoid depending on IPI broadcast IIUC. >>>>> >>>>>> >>>>>> IIUC the issue is you suspect not all archs correctly implemented >>>>>> pmdp_collapse_flush(), or am I wrong? >>>>> >>>>> This is a possible fix, please see below for details. >>>>> >>>>>> >>>>>>> On architectures that send >>>>>>> an IPI broadcast on TLB flush, it works as expected. But on the >>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have >>>>>>> the below race: >>>>>>> >>>>>>> CPU A CPU B >>>>>>> THP collapse fast GUP >>>>>>> gup_pmd_range() <-- see valid pmd >>>>>>> gup_pte_range() <-- work on pte >>>>>>> pmdp_collapse_flush() <-- clear pmd and flush >>>>>>> __collapse_huge_page_isolate() >>>>>>> check page pinned <-- before GUP bump refcount >>>>>>> pin the page >>>>>>> check PTE <-- no change >>>>>>> __collapse_huge_page_copy() >>>>>>> copy data to huge page >>>>>>> ptep_clear() >>>>>>> install huge pmd for the huge page >>>>>>> return the stale page >>>>>>> discard the stale page >>>>>>> >>>>>>> The race could be fixed by checking whether PMD is changed or not after >>>>>>> taking the page pin in fast GUP, just like what it does for PTE. If the >>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP >>>>>>> should back off. >>>>>> >>>>>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for >>>>>> the archs that are missing? Do you know which arch(s) is broken with it? >>>>> >>>>> Yes, and this was suggested by me in the first place, but per the >>>>> suggestion from John and David, this is not the preferred way. I think >>>>> it is because: >>>>> >>>>> Firstly, using IPI to serialize against fast GUP is not recommended >>>>> anymore since fast GUP does check PTE then back off so we should avoid >>>>> it. >>>>> Secondly, if checking PMD then backing off could solve the problem, >>>>> why do we still need broadcast IPI? It doesn't sound performant. >>>>> >>>>>> >>>>>> It's just not clear to me whether this patch is an optimization or a fix, >>>>>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would >>>>>> still be needed. >>>>> >>>>> It is a fix and the fix will make IPI broadcast not useful anymore. 
>>>> >>>> How about another patch to remove the ppc impl too? Then it can be a two >>>> patches series. >>> >>> BTW, I don't think we could remove the ppc implementation since it is >>> different from the generic pmdp_collapse_flush(), particularly for the >>> hash part IIUC. >>> >>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range() >>> for hash, but the hash call is actually no-op. The ppc version calls >>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does >>> something useful. >>> >> >> We should actually rename flush_tlb_pmd_range(). It actually flush the >> hash page table entries. >> >> I will do the below patch for ppc64 to clarify this better > > Thanks, Aneesh. It looks more readable. A follow-up question, I think > we could remove serialize_against_pte_lookup(), which just issues IPI > broadcast to run a dummy function. This IPI should not be needed > anymore with my patch. Of course, we need to keep the memory barrier. > For radix translation yes. For hash we still need that. W.r.t memory barrier, radix do use radix__flush_tlb_collapsed_pmd() which does a tlb invalidate. IIUC that will enfocre the required memory barrier there. So you should be able to just remove modified arch/powerpc/mm/book3s64/radix_pgtable.c @@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre pmd = *pmdp; pmd_clear(pmdp); - /* - * pmdp collapse_flush need to ensure that there are no parallel gup - * walk after this call. This is needed so that we can have stable - * page ref count when collapsing a page. We don't allow a collapse page - * if we have gup taken on the page. We can ensure that by sending IPI - * because gup walk happens with IRQ disabled. - */ - serialize_against_pte_lookup(vma->vm_mm); - radix__flush_tlb_collapsed_pmd(vma->vm_mm, address); return pmd; in your patch. This will also consolidate all the related changes together. >> >> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h >> index 8b762f282190..fd30fa20c392 100644 >> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h >> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h >> @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start, >> >> struct mmu_gather; >> extern void hash__tlb_flush(struct mmu_gather *tlb); >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); >> >> #ifdef CONFIG_PPC_64S_HASH_MMU >> /* Private function for use by PCI IO mapping code */ >> extern void __flush_hash_table_range(unsigned long start, unsigned long end); >> -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, >> - unsigned long addr); >> +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, >> + unsigned long addr); >> #else >> static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { } >> #endif >> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c >> index ae008b9df0e6..f30131933a01 100644 >> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c >> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c >> @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres >> * the __collapse_huge_page_copy can result in copying >> * the old content. 
>> */ >> - flush_tlb_pmd_range(vma->vm_mm, &pmd, address); >> + flush_hash_table_pmd_range(vma->vm_mm, &pmd, address); >> return pmd; >> } >> >> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c >> index eb0bccaf221e..a64ea0a7ef96 100644 >> --- a/arch/powerpc/mm/book3s64/hash_tlb.c >> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c >> @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) >> local_irq_restore(flags); >> } >> >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) >> +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) >> { >> pte_t *pte; >> pte_t *start_pte; >>
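Putting Aneesh's hunk in context, the radix helper would end up looking
roughly like this (a sketch based only on the hunk above; the sanity checks
at the top of the real function are elided):

	pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma,
					 unsigned long address, pmd_t *pmdp)
	{
		pmd_t pmd;

		/* ... VM_BUG_ON() sanity checks elided ... */

		pmd = *pmdp;
		pmd_clear(pmdp);

		/*
		 * No serialize_against_pte_lookup() (IPI broadcast) anymore:
		 * fast GUP backs off when it sees the PMD change. The TLB
		 * invalidate below is still expected to provide the memory
		 * barrier the back-off check relies on.
		 */
		radix__flush_tlb_collapsed_pmd(vma->vm_mm, address);

		return pmd;
	}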
On Tue, Sep 6, 2022 at 9:51 PM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 9/7/22 12:37 AM, Yang Shi wrote: > > On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> Yang Shi <shy828301@gmail.com> writes: > >> > >>> > >>> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote: > >>>> > >>>> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote: > >>>>> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote: > >>>>>> > >>>>>> Hi, Yang, > >>>>>> > >>>>>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote: > >>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > >>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > >>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles > >>>>>>> traditional IPI-based GUP-fast correctly. > >>>>>> > >>>>>> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast, > >>>>>> I'm kind of confused why it's not sufficient even if with RCU gup? Isn't > >>>>>> that'll keep working as long as interrupt disabled (which current fast-gup > >>>>>> will still do)? > >>>>> > >>>>> Actually the wording was copied from David's commit log for his > >>>>> PageAnonExclusive fix. My understanding is the IPI broadcast still > >>>>> works, but it may not be supported by all architectures and not > >>>>> preferred anymore. So we should avoid depending on IPI broadcast IIUC. > >>>>> > >>>>>> > >>>>>> IIUC the issue is you suspect not all archs correctly implemented > >>>>>> pmdp_collapse_flush(), or am I wrong? > >>>>> > >>>>> This is a possible fix, please see below for details. > >>>>> > >>>>>> > >>>>>>> On architectures that send > >>>>>>> an IPI broadcast on TLB flush, it works as expected. But on the > >>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have > >>>>>>> the below race: > >>>>>>> > >>>>>>> CPU A CPU B > >>>>>>> THP collapse fast GUP > >>>>>>> gup_pmd_range() <-- see valid pmd > >>>>>>> gup_pte_range() <-- work on pte > >>>>>>> pmdp_collapse_flush() <-- clear pmd and flush > >>>>>>> __collapse_huge_page_isolate() > >>>>>>> check page pinned <-- before GUP bump refcount > >>>>>>> pin the page > >>>>>>> check PTE <-- no change > >>>>>>> __collapse_huge_page_copy() > >>>>>>> copy data to huge page > >>>>>>> ptep_clear() > >>>>>>> install huge pmd for the huge page > >>>>>>> return the stale page > >>>>>>> discard the stale page > >>>>>>> > >>>>>>> The race could be fixed by checking whether PMD is changed or not after > >>>>>>> taking the page pin in fast GUP, just like what it does for PTE. If the > >>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP > >>>>>>> should back off. > >>>>>> > >>>>>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for > >>>>>> the archs that are missing? Do you know which arch(s) is broken with it? > >>>>> > >>>>> Yes, and this was suggested by me in the first place, but per the > >>>>> suggestion from John and David, this is not the preferred way. I think > >>>>> it is because: > >>>>> > >>>>> Firstly, using IPI to serialize against fast GUP is not recommended > >>>>> anymore since fast GUP does check PTE then back off so we should avoid > >>>>> it. > >>>>> Secondly, if checking PMD then backing off could solve the problem, > >>>>> why do we still need broadcast IPI? It doesn't sound performant. 
> >>>>> > >>>>>> > >>>>>> It's just not clear to me whether this patch is an optimization or a fix, > >>>>>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would > >>>>>> still be needed. > >>>>> > >>>>> It is a fix and the fix will make IPI broadcast not useful anymore. > >>>> > >>>> How about another patch to remove the ppc impl too? Then it can be a two > >>>> patches series. > >>> > >>> BTW, I don't think we could remove the ppc implementation since it is > >>> different from the generic pmdp_collapse_flush(), particularly for the > >>> hash part IIUC. > >>> > >>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range() > >>> for hash, but the hash call is actually no-op. The ppc version calls > >>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does > >>> something useful. > >>> > >> > >> We should actually rename flush_tlb_pmd_range(). It actually flush the > >> hash page table entries. > >> > >> I will do the below patch for ppc64 to clarify this better > > > > Thanks, Aneesh. It looks more readable. A follow-up question, I think > > we could remove serialize_against_pte_lookup(), which just issues IPI > > broadcast to run a dummy function. This IPI should not be needed > > anymore with my patch. Of course, we need to keep the memory barrier. > > > > > For radix translation yes. For hash we still need that. W.r.t memory barrier, > radix do use radix__flush_tlb_collapsed_pmd() which does a tlb invalidate. > IIUC that will enfocre the required memory barrier there. So you should be able > to just remove > > modified arch/powerpc/mm/book3s64/radix_pgtable.c > @@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre > pmd = *pmdp; > pmd_clear(pmdp); > > - /* > - * pmdp collapse_flush need to ensure that there are no parallel gup > - * walk after this call. This is needed so that we can have stable > - * page ref count when collapsing a page. We don't allow a collapse page > - * if we have gup taken on the page. We can ensure that by sending IPI > - * because gup walk happens with IRQ disabled. > - */ > - serialize_against_pte_lookup(vma->vm_mm); > - > radix__flush_tlb_collapsed_pmd(vma->vm_mm, address); > > return pmd; > > in your patch. This will also consolidate all the related changes together. Thanks, Aneesh. It may be better to have the ppc cleanup in a separate patch. 
> > >> > >> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > >> index 8b762f282190..fd30fa20c392 100644 > >> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > >> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > >> @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start, > >> > >> struct mmu_gather; > >> extern void hash__tlb_flush(struct mmu_gather *tlb); > >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); > >> > >> #ifdef CONFIG_PPC_64S_HASH_MMU > >> /* Private function for use by PCI IO mapping code */ > >> extern void __flush_hash_table_range(unsigned long start, unsigned long end); > >> -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, > >> - unsigned long addr); > >> +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, > >> + unsigned long addr); > >> #else > >> static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { } > >> #endif > >> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c > >> index ae008b9df0e6..f30131933a01 100644 > >> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c > >> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c > >> @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres > >> * the __collapse_huge_page_copy can result in copying > >> * the old content. > >> */ > >> - flush_tlb_pmd_range(vma->vm_mm, &pmd, address); > >> + flush_hash_table_pmd_range(vma->vm_mm, &pmd, address); > >> return pmd; > >> } > >> > >> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c > >> index eb0bccaf221e..a64ea0a7ef96 100644 > >> --- a/arch/powerpc/mm/book3s64/hash_tlb.c > >> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c > >> @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) > >> local_irq_restore(flags); > >> } > >> > >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) > >> +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) > >> { > >> pte_t *pte; > >> pte_t *start_pte; > >> >
diff --git a/mm/gup.c b/mm/gup.c index f3fc1f08d90c..4365b2811269 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, } #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, - unsigned int flags, struct page **pages, int *nr) +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct dev_pagemap *pgmap = NULL; int nr_start = *nr, ret = 0; @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, goto pte_unmap; } - if (unlikely(pte_val(pte) != pte_val(*ptep))) { + /* + * THP collapse conceptually does: + * 1. Clear and flush PMD + * 2. Check the base page refcount + * 3. Copy data to huge page + * 4. Clear PTE + * 5. Discard the base page + * + * So fast GUP may race with THP collapse then pin and + * return an old page since TLB flush is no longer sufficient + * to serialize against fast GUP. + * + * Check PMD, if it is changed just back off since it + * means there may be parallel THP collapse. + */ + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || + unlikely(pte_val(pte) != pte_val(*ptep))) { gup_put_folio(folio, 1, flags); goto pte_unmap; } @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, * get_user_pages_fast_only implementation that can pin pages. Thus it's still * useful to have gup_huge_pmd even if we can't operate on ptes. */ -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, - unsigned int flags, struct page **pages, int *nr) +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { return 0; } @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, PMD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) return 0; } while (pmdp++, addr = next, addr != end); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2d74cf01f694..518b49095db3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ /* - * After this gup_fast can't run anymore. This also removes - * any huge TLB entry from the CPU so we won't allow - * huge and small TLB entries for the same virtual address - * to avoid the risk of CPU bugs in that area. + * This removes any huge TLB entry from the CPU so we won't allow + * huge and small TLB entries for the same virtual address to + * avoid the risk of CPU bugs in that area. + * + * Parallel fast GUP is fine since fast GUP will back off when + * it detects PMD is changed. */ _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl);
Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases, it only handles traditional IPI-based GUP-fast correctly. On architectures that send an IPI broadcast on TLB flush, it works as expected. But on the architectures that do not use IPI to broadcast TLB flush, it may have the below race: CPU A CPU B THP collapse fast GUP gup_pmd_range() <-- see valid pmd gup_pte_range() <-- work on pte pmdp_collapse_flush() <-- clear pmd and flush __collapse_huge_page_isolate() check page pinned <-- before GUP bump refcount pin the page check PTE <-- no change __collapse_huge_page_copy() copy data to huge page ptep_clear() install huge pmd for the huge page return the stale page discard the stale page The race could be fixed by checking whether PMD is changed or not after taking the page pin in fast GUP, just like what it does for PTE. If the PMD is changed it means there may be parallel THP collapse, so GUP should back off. Also update the stale comment about serializing against fast GUP in khugepaged. Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") Signed-off-by: Yang Shi <shy828301@gmail.com> --- mm/gup.c | 30 ++++++++++++++++++++++++------ mm/khugepaged.c | 10 ++++++---- 2 files changed, 30 insertions(+), 10 deletions(-)
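For completeness, backing off on the fast path is safe for callers because
the fast-GUP entry point already falls back to the regular, lock-taking GUP
for whatever it failed to pin -- roughly as follows (paraphrased from
internal_get_user_pages_fast(); details differ):

	nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
	if (nr_pinned == nr_pages || (gup_flags & FOLL_FAST_ONLY))
		return nr_pinned;

	/*
	 * Slow path for the remainder: takes mmap_lock and the page table
	 * locks, so it serializes against khugepaged the ordinary way.
	 */
	start += (unsigned long)nr_pinned << PAGE_SHIFT;
	pages += nr_pinned;
	ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
				      gup_flags, pages);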