Message ID | 923480d5-35ab-7cac-79d0-343d16e29318@google.com (mailing list archive)
---|---
State | New, archived
Series | mm: free retracted page table by RCU
On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> +#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
> +	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
> +/*
> + * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
> + * the barriers in pmdp_get_lockless() cannot guarantee that the value in
> + * pmd_high actually belongs with the value in pmd_low; but holding interrupts
> + * off blocks the TLB flush between present updates, which guarantees that a
> + * successful __pte_offset_map() points to a page from matched halves.
> + */
> +#define config_might_irq_save(flags)	local_irq_save(flags)
> +#define config_might_irq_restore(flags)	local_irq_restore(flags)
> +#else
> +#define config_might_irq_save(flags)
> +#define config_might_irq_restore(flags)

I don't like the name.  It should indicate that it's PMD-related, so
pmd_read_start(flags) / pmd_read_end(flags)?
On Wed, 31 May 2023, Jason Gunthorpe wrote:
> On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> > There is a faint risk that __pte_offset_map(), on a 32-bit architecture
> > with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
> > a pmdval assembled from a pmd_low and a pmd_high which never belonged
> > together: their combination not pointing to a page table at all, perhaps
> > not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
> >
> > Guard against that (on such configs) by local_irq_save() blocking TLB
> > flush between present updates, as linux/pgtable.h suggests.  It's only
> > needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
> > __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
> > lock, would just send it back to __pte_offset_map() again.
>
> What about the other places calling pmdp_get_lockless?  It seems like
> this is quietly making it part of the API that the caller must hold
> the IPIs off.

No, I'm making no judgment of other places where pmdp_get_lockless() is
used: examination might show that some need more care, but I'll just
assume that each is taking as much care as it needs.  But here, where I'm
making changes, I do see that we need this extra care.

> And Jann had a note that this approach used by the lockless functions
> doesn't work anyhow:
>
> https://lore.kernel.org/linux-mm/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/

Thanks a lot for the link: I don't know why, but I never saw that mail
thread at all before.  I have not fully digested it yet, to be honest:
MADV_DONTNEED, doesn't flush TLB yet, etc - I'll have to get into the
right frame of mind for that.

> Though we never fixed it, AFAIK..

I'm certainly depending very much on pmdp_get_lockless(): and hoping to
find its case is easier to defend than at the ptep_get_lockless() level.

Thanks,
Hugh
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..d28b63386cef 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,32 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+#define config_might_irq_save(flags)	local_irq_save(flags)
+#define config_might_irq_restore(flags)	local_irq_restore(flags)
+#else
+#define config_might_irq_save(flags)
+#define config_might_irq_restore(flags)
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long __maybe_unused flags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	config_might_irq_save(flags);
 	pmdval = pmdp_get_lockless(pmd);
+	config_might_irq_restore(flags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/pgtable-generic.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)