
[V4,1/6] mm: Introduce a general RCU get_user_pages_fast.

Message ID 1411740233-28038-2-git-send-email-steve.capper@linaro.org (mailing list archive)
State New, archived

Commit Message

Steve Capper Sept. 26, 2014, 2:03 p.m. UTC
get_user_pages_fast attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs
to block any THP splits.

One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic
has been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
their own get_user_pages_fast routines.

The RCU page table free logic coupled with an IPI broadcast on THP
split (which is a rare event), allows one to protect a page table
walker by merely disabling the interrupts during the walk.

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of
TLB invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changed in V4:
 * Added pte_numa and pmd_numa calls.
 * Added comments to clarify what assumptions are being made by the
   implementation.
 * Cleaned up formatting for checkpatch.

Catalin, I've kept your Reviewed-by, please shout if you dislike the
pte_numa and pmd_numa calls.
---
 mm/Kconfig |   3 +
 mm/gup.c   | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 357 insertions(+)

Comments

Hugh Dickins Sept. 29, 2014, 9:51 p.m. UTC | #1
On Fri, 26 Sep 2014, Steve Capper wrote:

> get_user_pages_fast attempts to pin user pages by walking the page
> tables directly and avoids taking locks. Thus the walker needs to be
> protected from page table pages being freed from under it, and needs
> to block any THP splits.
> 
> One way to achieve this is to have the walker disable interrupts, and
> rely on IPIs from the TLB flushing code blocking before the page table
> pages are freed.
> 
> On some platforms we have hardware broadcast of TLB invalidations, thus
> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
> spuriously broadcasting IPIs can hurt system performance if done too
> often.
> 
> This problem has been solved on PowerPC and Sparc by batching up page
> table pages belonging to more than one mm_user, then scheduling an
> rcu_sched callback to free the pages. This RCU page table free logic
> has been promoted to core code and is activated when one enables
> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> their own get_user_pages_fast routines.
> 
> The RCU page table free logic coupled with an IPI broadcast on THP
> split (which is a rare event), allows one to protect a page table
> walker by merely disabling the interrupts during the walk.
> 
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
> 
> It is based heavily on the PowerPC implementation by Nick Piggin.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> Tested-by: Dann Frazier <dann.frazier@canonical.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks for making all those clarifications, Steve: this looks very
good to me now.  I'm not sure which tree you're hoping will take this
and the arm+arm64 patches 2-6: although this one would normally go
through akpm, I expect it's easier for you to synchronize if it goes
in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
I see no clash with what's currently in mmotm.

> ---
> Changed in V4:
>  * Added pte_numa and pmd_numa calls.
>  * Added comments to clarify what assumptions are being made by the
>    implementation.
>  * Cleaned up formatting for checkpatch.
> 
> Catalin, I've kept your Reviewed-by, please shout if you dislike the
> pte_numa and pmd_numa calls.
> ---
>  mm/Kconfig |   3 +
>  mm/gup.c   | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 357 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 886db21..0ceb8a5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	boolean
>  
> +config HAVE_GENERIC_RCU_GUP
> +	boolean
> +
>  config ARCH_DISCARD_MEMBLOCK
>  	boolean
>  
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..35c0160 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -10,6 +10,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <asm/pgtable.h>
> +
>  #include "internal.h"
>  
>  static struct page *no_page_table(struct vm_area_struct *vma,
> @@ -672,3 +676,353 @@ struct page *get_dump_page(unsigned long addr)
>  	return page;
>  }
>  #endif /* CONFIG_ELF_CORE */
> +
> +/**
> + * Generic RCU Fast GUP
> + *
> + * get_user_pages_fast attempts to pin user pages by walking the page
> + * tables directly and avoids taking locks. Thus the walker needs to be
> + * protected from page table pages being freed from under it, and should
> + * block any THP splits.
> + *
> + * One way to achieve this is to have the walker disable interrupts, and
> + * rely on IPIs from the TLB flushing code blocking before the page table
> + * pages are freed. This is unsuitable for architectures that do not need
> + * to broadcast an IPI when invalidating TLBs.
> + *
> + * Another way to achieve this is to batch up page table containing pages
> + * belonging to more than one mm_user, then rcu_sched a callback to free those
> + * pages. Disabling interrupts will allow the fast_gup walker to both block
> + * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> + * (which is a relatively rare event). The code below adopts this strategy.
> + *
> + * Before activating this code, please be aware that the following assumptions
> + * are currently made:
> + *
> + *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
> + *      pages containing page tables.
> + *
> + *  *) THP splits will broadcast an IPI, this can be achieved by overriding
> + *      pmdp_splitting_flush.
> + *
> + *  *) ptes can be read atomically by the architecture.
> + *
> + *  *) access_ok is sufficient to validate userspace address ranges.
> + *
> + * The last two assumptions can be relaxed by the addition of helper functions.
> + *
> + * This code is based heavily on the PowerPC implementation by Nick Piggin.
> + */
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +
> +#ifdef __HAVE_ARCH_PTE_SPECIAL
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	pte_t *ptep, *ptem;
> +	int ret = 0;
> +
> +	ptem = ptep = pte_offset_map(&pmd, addr);
> +	do {
> +		/*
> +		 * In the line below we are assuming that the pte can be read
> +		 * atomically. If this is not the case for your architecture,
> +		 * please wrap this in a helper function!
> +		 *
> +		 * for an example see gup_get_pte in arch/x86/mm/gup.c
> +		 */
> +		pte_t pte = ACCESS_ONCE(*ptep);
> +		struct page *page;
> +
> +		/*
> +		 * Similar to the PMD case below, NUMA hinting must take slow
> +		 * path
> +		 */
> +		if (!pte_present(pte) || pte_special(pte) ||
> +			pte_numa(pte) || (write && !pte_write(pte)))
> +			goto pte_unmap;
> +
> +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> +		page = pte_page(pte);
> +
> +		if (!page_cache_get_speculative(page))
> +			goto pte_unmap;
> +
> +		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +			put_page(page);
> +			goto pte_unmap;
> +		}
> +
> +		pages[*nr] = page;
> +		(*nr)++;
> +
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +
> +	ret = 1;
> +
> +pte_unmap:
> +	pte_unmap(ptem);
> +	return ret;
> +}
> +#else
> +
> +/*
> + * If we can't determine whether or not a pte is special, then fail immediately
> + * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> + * to be special.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * __get_user_pages_fast implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	return 0;
> +}
> +#endif /* __HAVE_ARCH_PTE_SPECIAL */
> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pud_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pud_page(orig);
> +	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pud_t *pudp;
> +
> +	pudp = pud_offset(pgdp, addr);
> +	do {
> +		pud_t pud = ACCESS_ONCE(*pudp);
> +
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(pud))
> +			return 0;
> +		if (pud_huge(pud)) {
> +			if (!gup_huge_pud(pud, pudp, addr, next, write,
> +					pages, nr))
> +				return 0;
> +		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pudp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +/*
> + * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
> + * back to the regular GUP. It will only return non-negative values.
> + */
> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			  struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, len, end;
> +	unsigned long next, flags;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	start &= PAGE_MASK;
> +	addr = start;
> +	len = (unsigned long) nr_pages << PAGE_SHIFT;
> +	end = start + len;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, len)))
> +		return 0;
> +
> +	/*
> +	 * Disable interrupts, we use the nested form as we can already
> +	 * have interrupts disabled by get_futex_key.
> +	 *
> +	 * With interrupts disabled, we block page table pages from being
> +	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> +	 * for more details.
> +	 *
> +	 * We do not adopt an rcu_read_lock(.) here as we also want to
> +	 * block IPIs that come from THPs splitting.
> +	 */
> +
> +	local_irq_save(flags);
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(*pgdp))
> +			break;
> +		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
> +			break;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_restore(flags);
> +
> +	return nr;
> +}
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @write:	whether pages will be written to
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number
> + * requested. If nr_pages is 0 or negative, returns 0. If no pages
> + * were pinned, returns -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int nr, ret;
> +
> +	start &= PAGE_MASK;
> +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	ret = nr;
> +
> +	if (nr < nr_pages) {
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pages += nr;
> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {
> +			if (ret < 0)
> +				ret = nr;
> +			else
> +				ret += nr;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
> -- 
> 1.9.3
> 
>
Catalin Marinas Oct. 1, 2014, 11:11 a.m. UTC | #2
On Mon, Sep 29, 2014 at 10:51:25PM +0100, Hugh Dickins wrote:
> On Fri, 26 Sep 2014, Steve Capper wrote:
> > This patch provides a general RCU implementation of get_user_pages_fast
> > that can be used by architectures that perform hardware broadcast of
> > TLB invalidations.
> >
> > It is based heavily on the PowerPC implementation by Nick Piggin.
> >
> > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > Tested-by: Dann Frazier <dann.frazier@canonical.com>
> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> Thanks for making all those clarifications, Steve: this looks very
> good to me now.  I'm not sure which tree you're hoping will take this
> and the arm+arm64 patches 2-6: although this one would normally go
> through akpm, I expect it's easier for you to synchronize if it goes
> in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
> I see no clash with what's currently in mmotm.

From an arm64 perspective, I'm more than happy for Andrew to pick up the
entire series. I already reviewed the patches.

Thanks.
Andrea Arcangeli Oct. 2, 2014, 12:19 p.m. UTC | #3
Hi Steve,

On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
> 
> It is based heavily on the PowerPC implementation by Nick Piggin.

It'd be nice if you could also at the same time apply it to sparc and
powerpc in this same patchset to show the effectiveness of having a
generic version. Because if it's not a trivial drop-in replacement,
then this should go in arch/arm* instead of mm/gup.c...

Also I wonder if it wouldn't be better to add it to mm/util.c along
with the __weak gup_fast, but then this is ok too. I'm just saying
because we never had signs of gup_fast code in mm/gup.c so far, but
then this isn't exactly a __weak version of it... so I don't mind
either way.

> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);

This has a collision with a patchset I posted, but it's trivial to
solve, the above three lines need to be replaced with:

+		ret = get_user_pages_unlocked(current, mm, start,
+				     nr_pages - nr, write, 0, pages);

And then arm gup_fast will also page fault with FOLL_FAULT_ALLOW_RETRY
the first time to release the mmap_sem before I/O.

Thanks,
Andrea
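
For reference, with that change the slow-path fallback in
get_user_pages_fast() would read roughly as below. This is only a
sketch assuming the get_user_pages_unlocked() signature quoted above
(from Andrea's series), not code from the posted patch:

	if (nr < nr_pages) {
		/* Try to get the remaining pages with get_user_pages */
		start += nr << PAGE_SHIFT;
		pages += nr;

		/*
		 * get_user_pages_unlocked() takes and releases mmap_sem
		 * internally, so the explicit down_read()/up_read() pair
		 * is no longer needed here.
		 */
		ret = get_user_pages_unlocked(current, mm, start,
					      nr_pages - nr, write, 0, pages);

		/* Have to be a bit careful with return values */
		if (nr > 0) {
			if (ret < 0)
				ret = nr;
			else
				ret += nr;
		}
	}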
Steve Capper Oct. 2, 2014, 4 p.m. UTC | #4
On 30 September 2014 04:51, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 26 Sep 2014, Steve Capper wrote:
>
>> get_user_pages_fast attempts to pin user pages by walking the page
>> tables directly and avoids taking locks. Thus the walker needs to be
>> protected from page table pages being freed from under it, and needs
>> to block any THP splits.
>>
>> One way to achieve this is to have the walker disable interrupts, and
>> rely on IPIs from the TLB flushing code blocking before the page table
>> pages are freed.
>>
>> On some platforms we have hardware broadcast of TLB invalidations, thus
>> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
>> spuriously broadcasting IPIs can hurt system performance if done too
>> often.
>>
>> This problem has been solved on PowerPC and Sparc by batching up page
>> table pages belonging to more than one mm_user, then scheduling an
>> rcu_sched callback to free the pages. This RCU page table free logic
>> has been promoted to core code and is activated when one enables
>> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
>> their own get_user_pages_fast routines.
>>
>> The RCU page table free logic coupled with an IPI broadcast on THP
>> split (which is a rare event), allows one to protect a page table
>> walker by merely disabling the interrupts during the walk.
>>
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>>
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>>
>> Signed-off-by: Steve Capper <steve.capper@linaro.org>
>> Tested-by: Dann Frazier <dann.frazier@canonical.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>
> Acked-by: Hugh Dickins <hughd@google.com>
>

Thanks Hugh!

> Thanks for making all those clarifications, Steve: this looks very
> good to me now.  I'm not sure which tree you're hoping will take this
> and the arm+arm64 patches 2-6: although this one would normally go
> through akpm, I expect it's easier for you to synchronize if it goes
> in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
> I see no clash with what's currently in mmotm.

I see it's gone into mmotm.
Steve Capper Oct. 2, 2014, 4:18 p.m. UTC | #5
On 2 October 2014 19:19, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Steve,
>

Hi Andrea,

> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>>
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>
> It'd be nice if you could also at the same time apply it to sparc and
> powerpc in this same patchset to show the effectiveness of having a
> generic version. Because if it's not a trivial drop-in replacement,
> then this should go in arch/arm* instead of mm/gup.c...

I think it should be adapted (if need be) and adopted for sparc, power
and others, especially as it will result in a reduction in code size
and make future alterations to gup easier.
I would prefer to get this in iteratively, and have people who are
knowledgeable about those architectures and have a means of testing the
code thoroughly help out. (It will be very hard for me to implement
this on my own, but likely trivial for people who know and can test
those architectures.)

>
> Also I wonder if it wouldn't be better to add it to mm/util.c along
> with the __weak gup_fast but then this is ok too. I'm just saying
> because we never had sings of gup_fast code in mm/gup.c so far but
> then this isn't exactly a __weak version of it... so I don't mind
> either ways.

mm/gup.c was recently created?
It may even make sense to move the weak version there in a future patch?

>
>> +             down_read(&mm->mmap_sem);
>> +             ret = get_user_pages(current, mm, start,
>> +                                  nr_pages - nr, write, 0, pages, NULL);
>> +             up_read(&mm->mmap_sem);
>
> This has a collision with a patchset I posted, but it's trivial to
> solve, the above three lines need to be replaced with:
>
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                    nr_pages - nr, write, 0, pages);
>
> And then arm gup_fast will also page fault with FOLL_FAULT_ALLOW_RETRY
> the first time to release the mmap_sem before I/O.
>

Ahh thanks.
I'm currently on holiday and have very limited access to email, I'd
appreciate it if someone can keep an eye out for this during the merge
window if this conflict arises?

> Thanks,
> Andrea

Cheers,
--
Steve
Andrea Arcangeli Oct. 2, 2014, 4:54 p.m. UTC | #6
On Thu, Oct 02, 2014 at 11:18:00PM +0700, Steve Capper wrote:
> mm/gup.c was recently created?
> It may even make sense to move the weak version in a future patch?

I think the __weak stuff tends to go in lib; that's probably why it's
there. I don't mind either way.

> I'm currently on holiday and have very limited access to email, I'd
> appreciate it if someone can keep an eye out for this during the merge
> window if this conflict arises?

No problem, I assume Andrew will merge your patchset first, so I can
resubmit against -mm patching the gup_fast_rcu too.

Thanks,
Andrea
Aneesh Kumar K.V Oct. 13, 2014, 5:15 a.m. UTC | #7
Andrea Arcangeli <aarcange@redhat.com> writes:

> Hi Steve,
>
> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>> 
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>
> It'd be nice if you could also at the same time apply it to sparc and
> powerpc in this same patchset to show the effectiveness of having a
> generic version. Because if it's not a trivial drop-in replacement,
> then this should go in arch/arm* instead of mm/gup.c...

on ppc64 we have one challenge, we do need to support hugepd. At the pmd
level we can have hugepte, normal pmd pointer or a pointer to hugepage
directory which is used in case of some sub-architectures/platforms. ie,
the below part of gup implementation in ppc64

else if (is_hugepd(pmdp)) {
	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
			addr, next, write, pages, nr))
		return 0;


-aneesh
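
In the generic gup_pmd_range() this would amount to an extra branch,
roughly as sketched below. Note this is hypothetical: is_hugepd(),
gup_hugepd() and hugepd_t are ppc64-specific and are not provided by
the posted patch, so a generic version would need them (or equivalent
helpers) made available to mm/gup.c first.

	pmd_t pmd = ACCESS_ONCE(*pmdp);

	next = pmd_addr_end(addr, end);
	if (pmd_none(pmd) || pmd_trans_splitting(pmd))
		return 0;

	if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
		...
	} else if (is_hugepd(pmdp)) {
		/* hugepage directory: walk it with the arch helper */
		if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
				addr, next, write, pages, nr))
			return 0;
	} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
		return 0;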
David Miller Oct. 13, 2014, 5:21 a.m. UTC | #8
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Date: Mon, 13 Oct 2014 10:45:24 +0530

> Andrea Arcangeli <aarcange@redhat.com> writes:
> 
>> Hi Steve,
>>
>> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>>> This patch provides a general RCU implementation of get_user_pages_fast
>>> that can be used by architectures that perform hardware broadcast of
>>> TLB invalidations.
>>> 
>>> It is based heavily on the PowerPC implementation by Nick Piggin.
>>
>> It'd be nice if you could also at the same time apply it to sparc and
>> powerpc in this same patchset to show the effectiveness of having a
>> generic version. Because if it's not a trivial drop-in replacement,
>> then this should go in arch/arm* instead of mm/gup.c...
> 
> on ppc64 we have one challenge, we do need to support hugepd. At the pmd
> level we can have hugepte, normal pmd pointer or a pointer to hugepage
> directory which is used in case of some sub-architectures/platforms. ie,
> the below part of gup implementation in ppc64
> 
> else if (is_hugepd(pmdp)) {
> 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
> 			addr, next, write, pages, nr))
> 		return 0;

Sparc has to deal with the same issue.
Aneesh Kumar K.V Oct. 13, 2014, 6:22 a.m. UTC | #9
Steve Capper <steve.capper@linaro.org> writes:

.....

> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
.....

> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {

We don't check the _PAGE_PRESENT here


> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +

-aneesh
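
If the generic walker were to check for presence explicitly, one
possible shape would be the sketch below. This is only an illustration,
not part of the posted patch, and it assumes pmd_present() gives the
right answer for splitting/migration entries on the architecture in
question:

		pmd_t pmd = ACCESS_ONCE(*pmdp);

		next = pmd_addr_end(addr, end);
		if (pmd_none(pmd) || !pmd_present(pmd) ||
		    pmd_trans_splitting(pmd))
			return 0;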
Steve Capper Oct. 13, 2014, 11:44 a.m. UTC | #10
On Mon, Oct 13, 2014 at 01:21:46AM -0400, David Miller wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Date: Mon, 13 Oct 2014 10:45:24 +0530
> 
> > Andrea Arcangeli <aarcange@redhat.com> writes:
> > 
> >> Hi Steve,
> >>
> >> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
> >>> This patch provides a general RCU implementation of get_user_pages_fast
> >>> that can be used by architectures that perform hardware broadcast of
> >>> TLB invalidations.
> >>> 
> >>> It is based heavily on the PowerPC implementation by Nick Piggin.
> >>
> >> It'd be nice if you could also at the same time apply it to sparc and
> >> powerpc in this same patchset to show the effectiveness of having a
> >> generic version. Because if it's not a trivial drop-in replacement,
> >> then this should go in arch/arm* instead of mm/gup.c...
> > 
> > on ppc64 we have one challenge, we do need to support hugepd. At the pmd
> > level we can have hugepte, normal pmd pointer or a pointer to hugepage
> > directory which is used in case of some sub-architectures/platforms. ie,
> > the below part of gup implementation in ppc64
> > 
> > else if (is_hugepd(pmdp)) {
> > 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
> > 			addr, next, write, pages, nr))
> > 		return 0;
> 
> Sparc has to deal with the same issue.

Hi Aneesh, David,

Could we add some helpers to mm/gup.c to deal with the hugepage
directory cases? If my understanding is correct, this arises for
HugeTLB pages rather than THP? (I should have listed among the
assumptions made that HugeTLB and THP use the same page table entry
format.)

For Sparc, if the huge pte case were separated out from the normal pte
case, we could use page_cache_add_speculative rather than making
repeated calls to page_cache_get_speculative?

Also, as a heads up for Sparc. I don't see any definition of
__get_user_pages_fast. Does this mean that a futex on THP tail page
can cause an infinite loop?

I don't have the means to thoroughly test patches for PowerPC and Sparc
(nor do I have enough knowledge to safely write them). I was going to
ask if you could please have a go at enabling this for PowerPC and
Sparc and I could check the ARM side and help out with mm/gup.c?

Cheers,
David Miller Oct. 13, 2014, 4:06 p.m. UTC | #11
From: Steve Capper <steve.capper@linaro.org>
Date: Mon, 13 Oct 2014 12:44:28 +0100

> Also, as a heads up for Sparc. I don't see any definition of
> __get_user_pages_fast. Does this mean that a futex on THP tail page
> can cause an infinite loop?

I have no idea, I didn't realize this was required to be implemented.
Steve Capper Oct. 14, 2014, 12:38 p.m. UTC | #12
On Mon, Oct 13, 2014 at 12:06:18PM -0400, David Miller wrote:
> From: Steve Capper <steve.capper@linaro.org>
> Date: Mon, 13 Oct 2014 12:44:28 +0100
> 
> > Also, as a heads up for Sparc. I don't see any definition of
> > __get_user_pages_fast. Does this mean that a futex on THP tail page
> > can cause an infinite loop?
> 
> I have no idea, I didn't realize this was required to be implemented.

In get_futex_key, a call is made to __get_user_pages_fast to handle the
case where a THP tail page needs to be pinned for the futex. There is a
stock implementation of __get_user_pages_fast, but this is just an
empty function that returns 0. Unfortunately this will provoke a goto
to "again:" and end up in an infinite loop. The process will appear
to hang with a high system cpu usage.
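
For context, the retry pattern in get_futex_key() being described looks
roughly like the following (paraphrased from kernels of that era, not
quoted verbatim):

again:
	err = get_user_pages_fast(address, 1, 1, &page);
	...
	if (unlikely(PageTail(page))) {
		put_page(page);
		/* serialise against THP splitting */
		local_irq_disable();
		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
			page_head = compound_head(page);
			...
			local_irq_enable();
		} else {
			/*
			 * A stub __get_user_pages_fast() that always returns
			 * 0 lands here on every attempt, hence the infinite
			 * loop Steve describes.
			 */
			local_irq_enable();
			goto again;
		}
	}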
David Miller Oct. 14, 2014, 4:30 p.m. UTC | #13
From: Steve Capper <steve.capper@linaro.org>
Date: Tue, 14 Oct 2014 13:38:34 +0100

> On Mon, Oct 13, 2014 at 12:06:18PM -0400, David Miller wrote:
>> From: Steve Capper <steve.capper@linaro.org>
>> Date: Mon, 13 Oct 2014 12:44:28 +0100
>> 
>> > Also, as a heads up for Sparc. I don't see any definition of
>> > __get_user_pages_fast. Does this mean that a futex on THP tail page
>> > can cause an infinite loop?
>> 
>> I have no idea, I didn't realize this was required to be implemented.
> 
> In get_futex_key, a call is made to __get_user_pages_fast to handle the
> case where a THP tail page needs to be pinned for the futex. There is a
> stock implementation of __get_user_pages_fast, but this is just an
> empty function that returns 0. Unfortunately this will provoke a goto
> to "again:" and end up in an infinite loop. The process will appear
> to hang with a high system cpu usage.

I'd rather the build fail and force me to implement the interface for
my architecture than have a default implementation that causes issues
like that.

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..0ceb8a5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@  config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	boolean
 
+config HAVE_GENERIC_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..35c0160 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@ 
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,353 @@  struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+/**
+ * Generic RCU Fast GUP
+ *
+ * get_user_pages_fast attempts to pin user pages by walking the page
+ * tables directly and avoids taking locks. Thus the walker needs to be
+ * protected from page table pages being freed from under it, and should
+ * block any THP splits.
+ *
+ * One way to achieve this is to have the walker disable interrupts, and
+ * rely on IPIs from the TLB flushing code blocking before the page table
+ * pages are freed. This is unsuitable for architectures that do not need
+ * to broadcast an IPI when invalidating TLBs.
+ *
+ * Another way to achieve this is to batch up page table containing pages
+ * belonging to more than one mm_user, then rcu_sched a callback to free those
+ * pages. Disabling interrupts will allow the fast_gup walker to both block
+ * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
+ * (which is a relatively rare event). The code below adopts this strategy.
+ *
+ * Before activating this code, please be aware that the following assumptions
+ * are currently made:
+ *
+ *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
+ *      pages containing page tables.
+ *
+ *  *) THP splits will broadcast an IPI, this can be achieved by overriding
+ *      pmdp_splitting_flush.
+ *
+ *  *) ptes can be read atomically by the architecture.
+ *
+ *  *) access_ok is sufficient to validate userspace address ranges.
+ *
+ * The last two assumptions can be relaxed by the addition of helper functions.
+ *
+ * This code is based heavily on the PowerPC implementation by Nick Piggin.
+ */
+#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * In the line below we are assuming that the pte can be read
+		 * atomically. If this is not the case for your architecture,
+		 * please wrap this in a helper function!
+		 *
+		 * for an example see gup_get_pte in arch/x86/mm/gup.c
+		 */
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		/*
+		 * Similar to the PMD case below, NUMA hinting must take slow
+		 * path
+		 */
+		if (!pte_present(pte) || pte_special(pte) ||
+			pte_numa(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * __get_user_pages_fast implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pmd_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pud_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pud_page(orig);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
+
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+				pages, nr))
+				return 0;
+
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pud(pud, pudp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP. It will only return non-negative values.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */