Message ID | 1408635812-31584-2-git-send-email-steve.capper@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Hi Steve,

A few minor comments (took me a while to understand how this works, so I
thought I'd make some noise :)

On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> get_user_pages_fast attempts to pin user pages by walking the page
> tables directly and avoids taking locks. Thus the walker needs to be
> protected from page table pages being freed from under it, and needs
> to block any THP splits.
>
> One way to achieve this is to have the walker disable interrupts, and
> rely on IPIs from the TLB flushing code blocking before the page table
> pages are freed.
>
> On some platforms we have hardware broadcast of TLB invalidations, thus
> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
> spuriously broadcasting IPIs can hurt system performance if done too
> often.
>
> This problem has been solved on PowerPC and Sparc by batching up page
> table pages belonging to more than one mm_user, then scheduling an
> rcu_sched callback to free the pages. This RCU page table free logic
> has been promoted to core code and is activated when one enables
> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> their own get_user_pages_fast routines.
>
> The RCU page table free logic coupled with an IPI broadcast on THP
> split (which is a rare event), allows one to protect a page table
> walker by merely disabling the interrupts during the walk.

Disabling interrupts isn't completely free (it's a self-synchronising
operation on ARM). It would be interesting to see if your futex workload
performance is improved by my simple irq_save optimisation for ARM:

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches&id=312a70adfa6f22e9d62803dd21400f481253e58b

(I've been struggling to show anything other than tiny improvements from
that patch).

> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
>
> It is based heavily on the PowerPC implementation by Nick Piggin.

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..2f684fa 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -10,6 +10,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <asm/pgtable.h>
> +
>  #include "internal.h"
>
>  static struct page *no_page_table(struct vm_area_struct *vma,
> @@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
>  	return page;
>  }
>  #endif /* CONFIG_ELF_CORE */
> +
> +#ifdef CONFIG_HAVE_RCU_GUP
> +
> +#ifdef __HAVE_ARCH_PTE_SPECIAL

Do we actually require this (pte special) if hugepages are disabled or
not supported?

Will
On Wed, Aug 27, 2014 at 09:54:42AM +0100, Will Deacon wrote:
> Hi Steve,

Hey Will,

> A few minor comments (took me a while to understand how this works, so I
> thought I'd make some noise :)

A big thank you for reading through it :-).

> On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> > get_user_pages_fast attempts to pin user pages by walking the page
> > tables directly and avoids taking locks. Thus the walker needs to be
> > protected from page table pages being freed from under it, and needs
> > to block any THP splits.
> >
> > One way to achieve this is to have the walker disable interrupts, and
> > rely on IPIs from the TLB flushing code blocking before the page table
> > pages are freed.
> >
> > On some platforms we have hardware broadcast of TLB invalidations, thus
> > the TLB flushing code doesn't necessarily need to broadcast IPIs; and
> > spuriously broadcasting IPIs can hurt system performance if done too
> > often.
> >
> > This problem has been solved on PowerPC and Sparc by batching up page
> > table pages belonging to more than one mm_user, then scheduling an
> > rcu_sched callback to free the pages. This RCU page table free logic
> > has been promoted to core code and is activated when one enables
> > HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> > their own get_user_pages_fast routines.
> >
> > The RCU page table free logic coupled with an IPI broadcast on THP
> > split (which is a rare event), allows one to protect a page table
> > walker by merely disabling the interrupts during the walk.
>
> Disabling interrupts isn't completely free (it's a self-synchronising
> operation on ARM). It would be interesting to see if your futex workload
> performance is improved by my simple irq_save optimisation for ARM:
>
> https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches&id=312a70adfa6f22e9d62803dd21400f481253e58b
>
> (I've been struggling to show anything other than tiny improvements from
> that patch).
>

This looks like a useful optimisation; I'll have a think about workloads
that fire many futexes on THP tails. (The test I used only fired off one
futex).

> > This patch provides a general RCU implementation of get_user_pages_fast
> > that can be used by architectures that perform hardware broadcast of
> > TLB invalidations.
> >
> > It is based heavily on the PowerPC implementation by Nick Piggin.
>
> [...]
>
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 91d044b..2f684fa 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -10,6 +10,10 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >
> > +#include <linux/sched.h>
> > +#include <linux/rwsem.h>
> > +#include <asm/pgtable.h>
> > +
> >  #include "internal.h"
> >
> >  static struct page *no_page_table(struct vm_area_struct *vma,
> > @@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
> >  	return page;
> >  }
> >  #endif /* CONFIG_ELF_CORE */
> > +
> > +#ifdef CONFIG_HAVE_RCU_GUP
> > +
> > +#ifdef __HAVE_ARCH_PTE_SPECIAL
>
> Do we actually require this (pte special) if hugepages are disabled or
> not supported?

We need this logic if we want to use fast_gup on normal pages safely.
The special bit indicates that we should not attempt to take a reference
to the underlying page.

Huge pages are guaranteed not to be special.

Cheers,
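For context, the check being discussed is the guard at the top of the pte loop
in gup_pte_range() in the patch below. Shown in isolation (and lightly
reindented), the fast path simply bails out to the slow path rather than taking
a reference when the pte is special, not present, or read-only on a write:

	pte_t pte = ACCESS_ONCE(*ptep);

	/* special, non-present or (for writes) read-only ptes: back off */
	if (!pte_present(pte) || pte_special(pte)
		|| (write && !pte_write(pte)))
		goto pte_unmap;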
On Wed, Aug 27, 2014 at 01:50:28PM +0100, Steve Capper wrote:
> On Wed, Aug 27, 2014 at 09:54:42AM +0100, Will Deacon wrote:
> > On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> > > @@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
> > >  	return page;
> > >  }
> > >  #endif /* CONFIG_ELF_CORE */
> > > +
> > > +#ifdef CONFIG_HAVE_RCU_GUP
> > > +
> > > +#ifdef __HAVE_ARCH_PTE_SPECIAL
> >
> > Do we actually require this (pte special) if hugepages are disabled or
> > not supported?
>
> We need this logic if we want to use fast_gup on normal pages safely.
> The special bit indicates that we should not attempt to take a reference
> to the underlying page.
>
> Huge pages are guaranteed not to be special.

Gah, I somehow mixed up sp-litting and sp-ecial. Step away from the
computer.

In which case, the patch looks fine. You might need to repost with
'[PATCH]' instead of '[PATH]', in case you confused people's filters.

Will
On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 886db21..6a4d764 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	boolean
>
> +config HAVE_RCU_GUP
> +	boolean

Minor detail, maybe HAVE_GENERIC_RCU_GUP to avoid confusion. Otherwise
the patch looks fine to me.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Wed, Aug 27, 2014 at 03:28:01PM +0100, Catalin Marinas wrote:
> On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 886db21..6a4d764 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
> >  config HAVE_MEMBLOCK_PHYS_MAP
> >  	boolean
> >
> > +config HAVE_RCU_GUP
> > +	boolean
>
> Minor detail, maybe HAVE_GENERIC_RCU_GUP to avoid confusion.

Yeah, that does look better, I'll amend the series accordingly.

> Otherwise the patch looks fine to me.
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks Catalin.
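To illustrate how the new option is meant to be consumed: it is not
user-visible, so an architecture with hardware TLB broadcast would simply
select it from its own Kconfig. A minimal sketch, assuming the
HAVE_GENERIC_RCU_GUP name suggested above and a made-up ARCH_FOO entry (not
from this series):

	config ARCH_FOO
		def_bool y
		# Assumes the arch meets the requirements in the commit
		# message: RCU page table freeing and an IPI broadcast on
		# THP split.
		select HAVE_GENERIC_RCU_GUP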
On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int nr, ret;
> +
> +	start &= PAGE_MASK;
> +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	ret = nr;
> +
> +	if (nr < nr_pages) {
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pages += nr;

When I read this, my first reaction was... what if nr is negative? In
that case, if nr_pages is positive, we fall through into this if, and
start to wind things backwards - which isn't what we want.

It looks like that can't happen... right? __get_user_pages_fast() only
returns greater-or-equal to zero right now, but what about the future?

> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {

This kind'a makes it look like nr could be negative.

Other than that, I don't see anything obviously wrong with it.
On Wed, Aug 27, 2014 at 04:01:39PM +0100, Russell King - ARM Linux wrote:

Hi Russell,

> On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
> > +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > +			struct page **pages)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	int nr, ret;
> > +
> > +	start &= PAGE_MASK;
> > +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> > +	ret = nr;
> > +
> > +	if (nr < nr_pages) {
> > +		/* Try to get the remaining pages with get_user_pages */
> > +		start += nr << PAGE_SHIFT;
> > +		pages += nr;
>
> When I read this, my first reaction was... what if nr is negative? In
> that case, if nr_pages is positive, we fall through into this if, and
> start to wind things backwards - which isn't what we want.
>
> It looks like that can't happen... right? __get_user_pages_fast() only
> returns greater-or-equal to zero right now, but what about the future?

__get_user_pages_fast is a strict fast path: it will grab as many page
references as it can, and if something gets in its way it backs off. As
it can't take locks, it can't inspect the VMA, so it really isn't in a
position to know if there's an error. It may be possible for the slow
path to take a write fault for a read-only pte, for instance. (We could
in theory return an error on pte_special and save a fallback to the slow
path, but I don't believe it's worth doing as special ptes should be
encountered very rarely by the fast_gup).

I think it's safe to assume that __get_user_pages_fast has non-negative
return values; also, it is logically contained in the same area as
get_user_pages_fast, so if this does change we can apply changes below
it too.

get_user_pages_fast attempts the fast path but is allowed to fall back
to the slow path, so it is in a position to return an error code and
thus can return negative values.

>
> > +
> > +		down_read(&mm->mmap_sem);
> > +		ret = get_user_pages(current, mm, start,
> > +				     nr_pages - nr, write, 0, pages, NULL);
> > +		up_read(&mm->mmap_sem);
> > +
> > +		/* Have to be a bit careful with return values */
> > +		if (nr > 0) {
>
> This kind'a makes it look like nr could be negative.

I read it as "did the fast path get at least one page?".

>
> Other than that, I don't see anything obviously wrong with it.

Thank you for giving this a going over.

Cheers,
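To make the return-value contract discussed above concrete, here is a hedged
caller-side sketch (pin_user_buffer is a made-up name, not part of the patch):
a negative return means the slow path failed before any page was pinned, while
a short positive count means only the first 'pinned' pages hold references and
must be released by the caller.

	#include <linux/errno.h>
	#include <linux/mm.h>

	/* Hypothetical caller: pin nr_pages of a user buffer for writing. */
	static int pin_user_buffer(unsigned long buf, int nr_pages,
				   struct page **pages)
	{
		int pinned = get_user_pages_fast(buf, nr_pages, 1, pages);

		if (pinned < 0)
			return pinned;	/* error, no pages to release */

		if (pinned < nr_pages) {
			/* Partial pin: drop what we did get, report failure. */
			while (pinned--)
				put_page(pages[pinned]);
			return -EFAULT;
		}

		return 0;		/* all nr_pages pages are pinned */
	}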
diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..6a4d764 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	boolean
 
+config HAVE_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..2f684fa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+#ifdef CONFIG_HAVE_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		if (!pte_present(pte) || pte_special(pte)
+			|| (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ */
+static inline int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pmd_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pud_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pud_page(orig);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pud(pud, pudp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_HAVE_RCU_GUP */
get_user_pages_fast attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs
to block any THP splits.

One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic
has been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
their own get_user_pages_fast routines.

The RCU page table free logic coupled with an IPI broadcast on THP
split (which is a rare event), allows one to protect a page table
walker by merely disabling the interrupts during the walk.

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of
TLB invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 mm/Kconfig |   3 +
 mm/gup.c   | 278 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 281 insertions(+)
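As an illustration of the "IPI broadcast on THP split" requirement mentioned
above (this is not part of the patch, just a sketch of the architecture-side
obligation): an arch with hardware TLB broadcast could implement its THP-split
hook so that it kicks the other CPUs before the split proceeds, guaranteeing
that any walker running with interrupts disabled has finished. The use of
kick_all_cpus_sync() as the IPI primitive is an assumption here.

	/* Hedged sketch of an arch override of pmdp_splitting_flush(). */
	void pmdp_splitting_flush(struct vm_area_struct *vma,
				  unsigned long address, pmd_t *pmdp)
	{
		pmd_t pmd = pmd_mksplitting(*pmdp);

		VM_BUG_ON(address & ~PMD_MASK);
		set_pmd_at(vma->vm_mm, address, pmdp, pmd);

		/* dummy IPI to serialise against the IRQ-disabled fast_gup walker */
		kick_all_cpus_sync();
	}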