Message ID: 594c1f0-d396-5346-1f36-606872cddb18@google.com (mailing list archive)
State: New, archived
Series: mm: page_vma_mapped_walk() cleanup and THP fixes
On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> have a multi-word pmd entry, for which READ_ONCE() is not good enough.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/page_vma_mapped.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 7c0504641fb8..973c3c4e72cc 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud = pud_offset(p4d, pvmw->address);
>  	if (!pud_present(*pud))
>  		return false;
> +
>  	pvmw->pmd = pmd_offset(pud, pvmw->address);
>  	/*
>  	 * Make sure the pmd value isn't cached in a register by the
>  	 * compiler and used as a stale value after we've observed a
>  	 * subsequent update.
>  	 */
> -	pmde = READ_ONCE(*pvmw->pmd);
> +	pmde = pmd_read_atomic(pvmw->pmd);
> +	barrier();
> +

Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
mm/hmm.c uses the same pattern as you are and I tend to think that the
rest of pmd_read_atomic() users may be broken.

Am I wrong?
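For readers skimming the archive: the caller-side pattern Kirill is asking about - read the pmd once, then fence the compiler so it is never re-read - looks like this in outline (a sketch distilled from the patch above and mm/hmm.c's pmd walk, not verbatim kernel code):

	pmd_t pmde;

	pvmw->pmd = pmd_offset(pud, pvmw->address);
	pmde = pmd_read_atomic(pvmw->pmd);	/* may be two loads on i386 PAE */
	barrier();	/* compiler must not re-read *pvmw->pmd below this point */

	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
		/* the snapshot looked interesting: take pmd_lock() and recheck */
	}

The question in the thread is whether that barrier() belongs in every caller, or inside pmd_read_atomic() itself.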
On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> >  mm/page_vma_mapped.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 7c0504641fb8..973c3c4e72cc 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  	pud = pud_offset(p4d, pvmw->address);
> >  	if (!pud_present(*pud))
> >  		return false;
> > +
> >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> >  	/*
> >  	 * Make sure the pmd value isn't cached in a register by the
> >  	 * compiler and used as a stale value after we've observed a
> >  	 * subsequent update.
> >  	 */
> > -	pmde = READ_ONCE(*pvmw->pmd);
> > +	pmde = pmd_read_atomic(pvmw->pmd);
> > +	barrier();
> > +
>
> Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> mm/hmm.c uses the same pattern as you are and I tend to think that the
> rest of pmd_read_atomic() users may be broken.
>
> Am I wrong?

I agree with you, something called _atomic should not require the
caller to provide barriers.

I think the issue is simply that the two implementations of
pmd_read_atomic() should use READ_ONCE() internally, no?

Jason
On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > >
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > Cc: <stable@vger.kernel.org>
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > --- a/mm/page_vma_mapped.c
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  	pud = pud_offset(p4d, pvmw->address);
> > > >  	if (!pud_present(*pud))
> > > >  		return false;
> > > > +
> > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > >  	/*
> > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > >  	 * compiler and used as a stale value after we've observed a
> > > >  	 * subsequent update.
> > > >  	 */
> > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > +	barrier();
> > > > +
> > >
> > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > >
> > > Am I wrong?
> >
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> >
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
>
> I've had great difficulty coming up with answers for you.
>
> This patch was based on two notions I've had lodged in my mind
> for several years:
>
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, AFAIK
our modern understanding of the memory model is that requires
READ_ONCE().

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
+	u32 *tmp = READ_ONCE((u32 *)pmdp);
 
 	ret = (pmdval_t) (*tmp);
 	if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 		 * or we can end up with a partial pmd.
 		 */
 		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
 	}
 
 	return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 	 * only going to work, if the pmdval_t isn't larger than
 	 * an unsigned long.
 	 */
-	return *pmdp;
+	return READ_ONCE(*pmdp);
 }
 #endif

> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.

According to the gcc bug READ_ONCE() on anything that is a scalar
sized struct triggers GCC to ignore the READ_ONCEyness. To work around
this bug then READ_ONCE can never be used on any of the struct
protected page table elements. While I am not 100% sure, it looks like
this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
required compiler this is all just cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard! (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall, (and I didn't recheck this right now) the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
page requires mmap sem in write mode so once a PTE table is observed
'locklessly' the value is stable.. Or at least so says the documentation

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..

As far as whether page_vma_mapped_walk() is correct.. Everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules around migration entries to know right/wrong. I suspect it
probably is required to manipulate a migration entry while holding the
mmap_sem in write mode??

There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented..

Jason
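To make the PAE side concrete: the diff above wraps READ_ONCE() around the casts rather than around the loads themselves; applied to the memory accesses, the 32-bit version would come out roughly as in this sketch (function name hypothetical; the real code lives in arch/x86/include/asm/pgtable-3level.h):

	static inline pmd_t pae_pmd_read_atomic(pmd_t *pmdp)
	{
		u32 *tmp = (u32 *)pmdp;
		pmdval_t ret = READ_ONCE(*tmp);	/* low word: flags + low pointer bits */

		if (ret) {
			/*
			 * The low word says the entry is populated: order the
			 * high word's load after it, so that a pte-table
			 * pointer is seen as two matching halves.
			 */
			smp_rmb();
			ret |= (pmdval_t)READ_ONCE(*(tmp + 1)) << 32;
		}
		return (pmd_t) { ret };
	}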
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > >
> > > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > > Cc: <stable@vger.kernel.org>
> > > > >  mm/page_vma_mapped.c | 5 ++++-
> > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > --- a/mm/page_vma_mapped.c
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > >  	pud = pud_offset(p4d, pvmw->address);
> > > > >  	if (!pud_present(*pud))
> > > > >  		return false;
> > > > > +
> > > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > >  	/*
> > > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > > >  	 * compiler and used as a stale value after we've observed a
> > > > >  	 * subsequent update.
> > > > >  	 */
> > > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > > +	barrier();
> > > > > +
> > > >
> > > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > > rest of pmd_read_atomic() users may be broken.
> > > >
> > > > Am I wrong?
> > >
> > > I agree with you, something called _atomic should not require the
> > > caller to provide barriers.
> > >
> > > I think the issue is simply that the two implementations of
> > > pmd_read_atomic() should use READ_ONCE() internally, no?
> >
> > I've had great difficulty coming up with answers for you.
> >
> > This patch was based on two notions I've had lodged in my mind
> > for several years:
> >
> > 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> > atomically (even if the pmd_t spans multiple words); but reading again
> > after all this time the comment above it, it seems to be more specialized
> > than I'd thought (biggest selling point being for when you want to check
> > pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> > so really needs that barrier() to be as safe as the previous READ_ONCE().
>
> IMHO it is a simple bug that the 64 bit version of this is not marked
> with READ_ONCE() internally. It is reading data without a lock, AFAIK
> our modern understanding of the memory model is that requires
> READ_ONCE().
>
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index e896ebef8c24cb..0bf1fdec928e71 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
>  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  {
>  	pmdval_t ret;
> -	u32 *tmp = (u32 *)pmdp;
> +	u32 *tmp = READ_ONCE((u32 *)pmdp);
>
>  	ret = (pmdval_t) (*tmp);
>  	if (ret) {
> @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  		 * or we can end up with a partial pmd.
>  		 */
>  		smp_rmb();
> -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
>  	}

Maybe that. Or maybe now (since Will's changes) it can just do
one READ_ONCE() of the whole, then adjust its local copy.

>
>  	return (pmd_t) { ret };
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 46b13780c2c8c9..c8c7a3307d2773 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  	 * only going to work, if the pmdval_t isn't larger than
>  	 * an unsigned long.
>  	 */
> -	return *pmdp;
> +	return READ_ONCE(*pmdp);
>  }
>  #endif

And it should now be possible to delete the #ifdef THP barrier() in
function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().

> > 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> > because READ_ONCE() did not work (did not give the necessary guarantee? or
> > did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.
>
> This is really interesting, the git history e37c69827063 ("mm: replace
> ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
> dropped here "because it doesn't work on non-scalar types" due to a
> (now 8 year old) gcc bug.
>
> According to the gcc bug READ_ONCE() on anything that is a scalar
> sized struct triggers GCC to ignore the READ_ONCEyness. To work around
> this bug then READ_ONCE can never be used on any of the struct
> protected page table elements. While I am not 100% sure, it looks like
> this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
> required compiler this is all just cruft. Use READ_ONCE() here too...

Oh, thanks particularly for looking into the gcc end of it:
I never knew that part of the story.

> > But I've now come across some changes that Will Deacon made last year:
> > the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> > native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> > "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> > Restructure sparc32 MMU page-table layout") put an end to sparc32's
> > typedef struct { unsigned long pmdv[16]; } pmd_t.
>
> Indeed, that is good news
>
> > As to your questions about pmd_read_atomic() usage elsewhere:
> > please don't force me to think so hard! (And you've set me half-
> > wondering, whether there are sneaky THP transitions, perhaps of the
> > "unstable" kind, that page_vma_mapped_walk() should be paying more
> > attention to: but for sanity's sake I won't go there, not now.)
>
> If I recall, (and I didn't recheck this right now) the only thing
> pmd_read_atomic() provides is the special property that if the pmd's
> flags are observed to point to a pte table then pmd_read_atomic() will
> reliably return the pte table pointer.

I expect you're right, I haven't rechecked. But it does also provide
that special guarantee on matching pmd_none() when half matches
pmd_none(): which one or more callers want, but irrelevant where I
added it.

> Otherwise it returns the flags and a garbage pointer because under the
> THP protocol a PMD pointing at a page can be converted to a PTE table
> if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
> page requires mmap sem in write mode so once a PTE table is observed
> 'locklessly' the value is stable.. Or at least so says the documentation
>
> pmd_read_atomic() is badly named, tricky to use, and missing the
> READ_ONCE() because it is so old..

I think yes to each of those three points.

> As far as whether page_vma_mapped_walk() is correct.. Everything except
> is_pmd_migration_entry() is fine to my eye, and I simply don't know
> the rules around migration entries to know right/wrong.

So long as the swap "type" is entirely in the same word as the pte
flags would be, I think is_pmd_migration_entry() should be fine.
There it's just looking for a hint: is it worth taking pmd_lock()
to obtain a reliable pmde?

> I suspect it probably is required to manipulate a migration entry
> while holding the mmap_sem in write mode??

I don't think in terms of mmap_sem/mmap_lock here: this is on the
rmap lookup path, and typically mmap_lock is not held (whereas I was
surprised to find the comment above pmd_read_atomic() saying a lot
about it - another reason it's inappropriate in pvmw I guess).

The important lock here is pmd_lock() a.k.a. pmd_trans_huge_lock():
get it when it's necessary, but (the tricky part) try to avoid the
overhead and contention in getting it when not necessary. Before THP
it was easy, but various THP transitions make it hard.

> There is some list of changes to the page table that require holding
> the mmap sem in write mode that I've never seen documented..

That is a subtler subject than I dare get into at the moment.
But if you're just doing lookups, pmd_lock() is enough.

I'll direct a further mail in this thread to Andrew now, asking him
to drop this patch, when convenient.

Hugh
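Hugh's "one READ_ONCE() of the whole, then adjust its local copy" alternative, sketched on the assumption that Will's rwonce.h changes admit a sizeof(long long) access (hypothetical helper; whether a single 64-bit load keeps a pte-table pointer coherent is exactly what Jason pushes back on below):

	static inline pmd_t pmd_read_whole(pmd_t *pmdp)
	{
		/* one access the compiler can neither split nor repeat */
		pmd_t pmde = READ_ONCE(*pmdp);

		/*
		 * On i386 PAE the hardware may still issue two 32-bit loads,
		 * so the local copy can be torn; the caller would have to
		 * treat a suspect value conservatively.
		 */
		return pmde;
	}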
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > >
> > > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > > Cc: <stable@vger.kernel.org>
> > > > >  mm/page_vma_mapped.c | 5 ++++-
> > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > --- a/mm/page_vma_mapped.c
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > >  	pud = pud_offset(p4d, pvmw->address);
> > > > >  	if (!pud_present(*pud))
> > > > >  		return false;
> > > > > +
> > > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > >  	/*
> > > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > > >  	 * compiler and used as a stale value after we've observed a
> > > > >  	 * subsequent update.
> > > > >  	 */
> > > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > > +	barrier();
> > > > > +

Andrew, please drop this patch from your queue: it's harmless, but
pretending to do some good when it does not. The situation has changed
since I originally wrote it: gcc min version has been raised, Will
Deacon has corrected the applicability of READ_ONCE() to pmd_t in
April 2020 commits, pmd_read_atomic() is not as magic as I thought (it
has good uses but not here), and the commit comment is no longer right.

However... if you're nearing the point of a fresh mmotm, probably best
to leave it in for now and we sort out the mess later. Because although
it's functionally independent from the other patches in the series,
there is of course the "indentation" patch, and this falls in the
middle of what's indented a level there. I don't imagine that will tax
your abilities to their limit, but after that there's the main bugfix
patch, which expects a blank line I added in this one, that's no longer
there when reverted. Enough to make me sigh, and just write to say
"drop, when you have time". Let me know if you'd prefer a resend of
the series (today? not sure).

(Discussion of pmd_read_atomic() and barrier() and mm_find_pmd() and
suchlike may continue in this thread, but this 03/11 should just be
dropped. Whether it's in or out should not affect Alistair's series.)

Thanks,
Hugh
On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > index e896ebef8c24cb..0bf1fdec928e71 100644
> > --- a/arch/x86/include/asm/pgtable-3level.h
> > +++ b/arch/x86/include/asm/pgtable-3level.h
> > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> >  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  {
> >  	pmdval_t ret;
> > -	u32 *tmp = (u32 *)pmdp;
> > +	u32 *tmp = READ_ONCE((u32 *)pmdp);
> >
> >  	ret = (pmdval_t) (*tmp);
> >  	if (ret) {
> > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  		 * or we can end up with a partial pmd.
> >  		 */
> >  		smp_rmb();
> > -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> >  	}
>
> Maybe that. Or maybe now (since Will's changes) it can just do
> one READ_ONCE() of the whole, then adjust its local copy.

I think the smp_rmb() is critical here to ensure a PTE table pointer
is coherent, READ_ONCE is not a substitute, unless I am
misunderstanding what Will's changes are???

> > @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  	 * only going to work, if the pmdval_t isn't larger than
> >  	 * an unsigned long.
> >  	 */
> > -	return *pmdp;
> > +	return READ_ONCE(*pmdp);
> >  }
> >  #endif
>
> And it should now be possible to delete the #ifdef THP barrier() in
> function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().

Yes - I think I know what you mean :)

> > If I recall, (and I didn't recheck this right now) the only thing
> > pmd_read_atomic() provides is the special property that if the pmd's
> > flags are observed to point to a pte table then pmd_read_atomic() will
> > reliably return the pte table pointer.
>
> I expect you're right, I haven't rechecked. But it does also provide
> that special guarantee on matching pmd_none() when half matches
> pmd_none(): which one or more callers want, but irrelevant where I
> added it.

Ah, this I don't know much about.. Hum, I'd have to think about it
way too much to have an opinion on whether the races around pmd_none
are meaningful enough to resolve with _atomic vs READ_ONCE()

> > As far as whether page_vma_mapped_walk() is correct.. Everything except
> > is_pmd_migration_entry() is fine to my eye, and I simply don't know
> > the rules around migration entries to know right/wrong.
>
> So long as the swap "type" is entirely in the same word as the pte
> flags would be, I think is_pmd_migration_entry() should be fine.
> There it's just looking for a hint: is it worth taking pmd_lock()
> to obtain a reliable pmde?

I wonder if anyone knows whether the arches guarantee that?

> > I suspect it probably is required to manipulate a migration entry
> > while holding the mmap_sem in write mode??
>
> I don't think in terms of mmap_sem/mmap_lock here: this is on the
> rmap lookup path, and typically mmap_lock is not held (whereas I was
> surprised to find the comment above pmd_read_atomic() saying a lot
> about it - another reason it's inappropriate in pvmw I guess).

Ah.. Honestly I'm not familiar with all the ways to lock a VMA so
that the page tables it spans are guaranteed not to be freed.

I just saw the _offset() ladder without any page table spinlocks and
knew this was one of the lockless walkers that, somehow, relies on
another lock to ensure the page table level memory is not concurrently
freed.

In that case I take it back, I have no idea if this is correct or not
because it calls map_pte() which does pte_offset_map() based on that
READ_ONCE.

pte_offset_map() without holding the pmd_lock is only OK if you know
that the pmd can't be upgraded to a THP - and the only locking for
that I have memorized is the mmap lock :\

I can't tell what other lock is protecting the page tables here or if
that lock is held during the THP upgrade process.. Sorry

> > There is some list of changes to the page table that require holding
> > the mmap sem in write mode that I've never seen documented..
>
> That is a subtler subject than I dare get into at the moment.
> But if you're just doing lookups, pmd_lock() is enough.

It is enough if you take it, I don't see page_vma_mapped_walk()
taking it in the map_pte() flow :\

Jason
On Fri, Jun 11, 2021 at 04:42:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > > index e896ebef8c24cb..0bf1fdec928e71 100644
> > > --- a/arch/x86/include/asm/pgtable-3level.h
> > > +++ b/arch/x86/include/asm/pgtable-3level.h
> > > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> > >  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > >  {
> > >  	pmdval_t ret;
> > > -	u32 *tmp = (u32 *)pmdp;
> > > +	u32 *tmp = READ_ONCE((u32 *)pmdp);
> > >
> > >  	ret = (pmdval_t) (*tmp);
> > >  	if (ret) {
> > > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > >  		 * or we can end up with a partial pmd.
> > >  		 */
> > >  		smp_rmb();
> > > -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > > +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> > >  	}
> >
> > Maybe that. Or maybe now (since Will's changes) it can just do
> > one READ_ONCE() of the whole, then adjust its local copy.
>
> I think the smp_rmb() is critical here to ensure a PTE table pointer
> is coherent, READ_ONCE is not a substitute, unless I am
> misunderstanding what Will's changes are???

Yes, I agree that the barrier is needed here for x86 PAE.

I would really have liked to enforce native-sized access in
READ_ONCE(), but unfortunately there is plenty of code out there which
is resilient to a 64-bit access being split into two separate 32-bit
accesses and so I wasn't able to go that far.

That being said, pmd_read_atomic() probably _should_ be using
READ_ONCE() because using it inconsistently can give rise to broken
codegen, e.g. if you do:

	pmdval_t x, y, z;

	x = *pmdp;		// Invalid
	y = READ_ONCE(*pmdp);	// Valid
	if (pmd_valid(y))
		z = *pmdp;	// Invalid again!

Then the compiler can allocate the same register for x and z, but will
issue an additional load for y. If a concurrent update takes place to the
pmd which transitions from Invalid -> Valid, then it will look as though
things went back in time, because z will be stale. We actually hit this
on arm64 in practice [1].

Will

[1] https://lore.kernel.org/lkml/20171003114244.430374928@linuxfoundation.org/
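Spelled out as an interleaving (values illustrative, following Will's example above):

	/*
	 *   reader CPU                      writer CPU
	 *   ----------                      ----------
	 *   x = *pmdp;            // r0 = Invalid, cached in a register
	 *                                   *pmdp = <Valid entry>;
	 *   y = READ_ONCE(*pmdp); // forced fresh load: Valid
	 *   if (pmd_valid(y))
	 *           z = *pmdp;    // compiler reuses r0: z = Invalid, stale
	 *
	 * z appears older than y even though it was "read" later; with
	 * READ_ONCE() used consistently, z gets a fresh load too.
	 */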
On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:
> Then the compiler can allocate the same register for x and z, but will
> issue an additional load for y. If a concurrent update takes place to the
> pmd which transitions from Invalid -> Valid, then it will look as though
> things went back in time, because z will be stale. We actually hit this
> on arm64 in practice [1].

The fact you actually hit this in the real world just seems to confirm
my thinking that the mm's lax use of the memory model is something
that deserves addressing.

Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
very robust. I prefer the gup_fast pattern of:

	pmd_t pmd = READ_ONCE(*pmdp);
	pte_offset_phys(&pmd, addr);

To correctly force the READ_ONCE under unlocked access and then
consistently use the single read of the unstable data.

It seems more maintainable 'hey look at me, I have no locks!' and has
fewer possibilities for obscure order related bugs to creep in.

Jason
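The gup_fast shape Jason is advocating, as a sketch (pte_offset_phys() is taken from his mail; walk_one_pmd() and do_something_with() are hypothetical names, not kernel API):

	static void walk_one_pmd(pmd_t *pmdp, unsigned long addr)
	{
		pmd_t pmd = READ_ONCE(*pmdp);	/* the only read of the live entry */
		pte_t *ptep;

		if (!pmd_present(pmd))
			return;
		/* everything below works on the local snapshot, never *pmdp */
		ptep = pte_offset_phys(&pmd, addr);
		do_something_with(ptep);	/* hypothetical consumer */
	}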
On Tue, Jun 15, 2021 at 09:42:07PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:
> > Then the compiler can allocate the same register for x and z, but will
> > issue an additional load for y. If a concurrent update takes place to the
> > pmd which transitions from Invalid -> Valid, then it will look as though
> > things went back in time, because z will be stale. We actually hit this
> > on arm64 in practice [1].
>
> The fact you actually hit this in the real world just seems to confirm
> my thinking that the mm's lax use of the memory model is something
> that deserves addressing.
>
> Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
> very robust. I prefer the gup_fast pattern of:
>
> 	pmd_t pmd = READ_ONCE(*pmdp);
> 	pte_offset_phys(&pmd, addr);
>
> To correctly force the READ_ONCE under unlocked access and then
> consistently use the single read of the unstable data.
>
> It seems more maintainable 'hey look at me, I have no locks!' and has
> fewer possibilities for obscure order related bugs to creep in.

Oh, no objection to cleaning this up. It was an "issuing msync(2)
causes data loss, argh!" issue, so adding READ_ONCE() to all the
macros was the most straightforward way to solve the immediate problem.

Generally speaking, I think all accesses to live page-tables should be
using READ_ONCE(), as there's also hardware updates from the CPU table
walker to contend with. If that's done in the caller and the macros are
changed to operate on the loaded value, all the better (although this
probably doesn't work so well once you get into rmw operations such as
ptep_test_and_clear_young()).

Will
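Why the snapshot style runs out of road at read-modify-write helpers: clearing the Accessed bit must be atomic on the live pte, racing against the hardware walker that sets it. Roughly what the x86 helper does (simplified sketch with a hypothetical name, not the exact kernel code):

	static inline int sketch_test_and_clear_young(pte_t *ptep)
	{
		/*
		 * Must be an atomic rmw on *ptep itself: a READ_ONCE() copy,
		 * modified and written back, could lose a concurrent hardware
		 * Accessed/Dirty update.
		 */
		return test_and_clear_bit(_PAGE_BIT_ACCESSED,
					  (unsigned long *)&ptep->pte);
	}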
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 7c0504641fb8..973c3c4e72cc 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pud = pud_offset(p4d, pvmw->address);
 	if (!pud_present(*pud))
 		return false;
+
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
 	 * compiler and used as a stale value after we've observed a
 	 * subsequent update.
 	 */
-	pmde = READ_ONCE(*pvmw->pmd);
+	pmde = pmd_read_atomic(pvmw->pmd);
+	barrier();
+
 	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {
page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
have a multi-word pmd entry, for which READ_ONCE() is not good enough.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/page_vma_mapped.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)