[PATCHv3,1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

Message ID 1559725820-26138-1-git-send-email-kernelfans@gmail.com (mailing list archive)
State New, archived

Commit Message

Pingfan Liu June 5, 2019, 9:10 a.m. UTC
FOLL_LONGTERM is checked in the slow path, __gup_longterm_unlocked(), but
not in the fast path. As a result, a CMA page can slip through the fast
path and end up pinned long term.

Place a check in the fast path.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Comments

Andrew Morton June 5, 2019, 9:49 p.m. UTC | #1
On Wed,  5 Jun 2019 17:10:19 +0800 Pingfan Liu <kernelfans@gmail.com> wrote:

> As for FOLL_LONGTERM, it is checked in the slow path
> __gup_longterm_unlocked(). But it is not checked in the fast path, which
> means a possible leak of CMA page to longterm pinned requirement through
> this crack.
> 
> Place a check in the fast path.

I'm not actually seeing (in either the existing code or this changelog
or patch) an explanation of *why* we wish to exclude CMA pages from
longterm pinning.

> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_CMA
> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_pinned; i++)
> +		if (is_migrate_cma_page(pages[i])) {
> +			put_user_pages(pages + i, nr_pinned - i);
> +			return i;
> +		}
> +
> +	return nr_pinned;
> +}

There's no point in inlining this.

The code seems inefficient.  If it encounters a single CMA page it can
end up discarding a possibly significant number of non-CMA pages.  I
guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
rare.  But could we avoid this (and the second pass across pages[]) by
checking for a CMA page within gup_pte_range()?

> +#else
> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> +{
> +	return nr_pinned;
> +}
> +#endif
> +
>  /**
>   * get_user_pages_fast() - pin user pages in memory
>   * @start:	starting user address
> @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  		ret = nr;
>  	}
>  
> +	if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
> +		nr = reject_cma_pages(nr, pages);
> +

This would be a suitable place to add a comment explaining why we're
doing this...
Pingfan Liu June 6, 2019, 2:19 a.m. UTC | #2
On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  5 Jun 2019 17:10:19 +0800 Pingfan Liu <kernelfans@gmail.com> wrote:
>
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(). But it is not checked in the fast path, which
> > means a possible leak of CMA page to longterm pinned requirement through
> > this crack.
> >
> > Place a check in the fast path.
>
> I'm not actually seeing a description (in either the existing code or
> this changelog or patch) an explanation of *why* we wish to exclude CMA
> pages from longterm pinning.
>
What about a short description like this:
FOLL_LONGTERM suggests a pin which is going to be given to hardware
and can't move. It would permanently reduce the usable CMA area, so
CMA pages should be excluded.
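
To illustrate the point (this is a user-space toy, not kernel code, and
every name in it is hypothetical): a contiguous allocation can work around
free or movable pages, but a single pinned page in the region ends the run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page states for a toy CMA region (not kernel types). */
enum toy_state { TOY_FREE, TOY_MOVABLE, TOY_PINNED };

/*
 * Return true if a run of `want` contiguous pages can be made available.
 * Free and movable pages can be reclaimed or migrated away; a pinned page
 * cannot, so it resets the current run.
 */
static bool toy_cma_can_alloc(const enum toy_state *region, size_t len,
			      size_t want)
{
	size_t run = 0;

	assert(region != NULL);
	if (want == 0)
		return true;
	for (size_t i = 0; i < len; i++) {
		run = (region[i] == TOY_PINNED) ? 0 : run + 1;
		if (run >= want)
			return true;
	}
	return false;
}
```

With an 8-page region, pinning the page at index 3 leaves runs of 3 and 4
pages, so a 5-page or larger request can no longer be satisfied for as long
as the pin is held.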

> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >       return ret;
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < nr_pinned; i++)
> > +             if (is_migrate_cma_page(pages[i])) {
> > +                     put_user_pages(pages + i, nr_pinned - i);
> > +                     return i;
> > +             }
> > +
> > +     return nr_pinned;
> > +}
>
> There's no point in inlining this.
OK, will drop it in V4.

>
> The code seems inefficient.  If it encounters a single CMA page it can
> end up discarding a possibly significant number of non-CMA pages.  I
The trick is that the pages are not discarded; in fact, they are still
referenced by the PTEs. We just leave it to the slow path to pick up the
non-CMA pages again.
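
A rough user-space model of that behaviour, assuming a toy page structure
(struct toy_page and its fields are made up for illustration; only the
shape of the loop follows reject_cma_pages() in the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; the field names are illustrative only. */
struct toy_page {
	bool is_cma;   /* models is_migrate_cma_page() */
	int refcount;  /* models the reference taken by the fast path */
};

/*
 * Model of reject_cma_pages(): on the first CMA page, drop the references
 * on that page and everything after it, and report how many leading pages
 * remain pinned.
 */
static int toy_reject_cma_pages(int nr_pinned, struct toy_page **pages)
{
	assert(nr_pinned >= 0);
	for (int i = 0; i < nr_pinned; i++) {
		if (pages[i]->is_cma) {
			for (int j = i; j < nr_pinned; j++)
				pages[j]->refcount--;  /* models put_user_pages() */
			return i;
		}
	}
	return nr_pinned;
}
```

Only the references are dropped; the pages stay mapped, so a later slow-path
call starting at the returned index can take them again.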

> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> rare.  But could we avoid this (and the second pass across pages[]) by
> checking for a CMA page within gup_pte_range()?
It would spread the same logic across the hugetlb and normal PTE paths,
with no performance gain, since we end up in the slow path anyway. So I
think it may not be worth it.

>
> > +#else
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > +     return nr_pinned;
> > +}
> > +#endif
> > +
> >  /**
> >   * get_user_pages_fast() - pin user pages in memory
> >   * @start:   starting user address
> > @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >               ret = nr;
> >       }
> >
> > +     if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
> > +             nr = reject_cma_pages(nr, pages);
> > +
>
> This would be a suitable place to add a comment explaining why we're
> doing this...
I will add a comment: "FOLL_LONGTERM suggests a pin given to hardware
and rarely returned."

Thanks for your kind review.

Regards,
  Pingfan
John Hubbard June 6, 2019, 9:17 p.m. UTC | #3
On 6/5/19 7:19 PM, Pingfan Liu wrote:
> On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
...
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>>>       return ret;
>>>  }
>>>
>>> +#ifdef CONFIG_CMA
>>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < nr_pinned; i++)
>>> +             if (is_migrate_cma_page(pages[i])) {
>>> +                     put_user_pages(pages + i, nr_pinned - i);
>>> +                     return i;
>>> +             }
>>> +
>>> +     return nr_pinned;
>>> +}
>>
>> There's no point in inlining this.
> OK, will drop it in V4.
> 
>>
>> The code seems inefficient.  If it encounters a single CMA page it can
>> end up discarding a possibly significant number of non-CMA pages.  I
> The trick is the page is not be discarded, in fact, they are still be
> referrenced by pte. We just leave the slow path to pick up the non-CMA
> pages again.
> 
>> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
>> rare.  But could we avoid this (and the second pass across pages[]) by
>> checking for a CMA page within gup_pte_range()?
> It will spread the same logic to hugetlb pte and normal pte. And no
> improvement in performance due to slow path. So I think maybe it is
> not worth.
> 
>>

I think the concern is: for the successful gup_fast case with no CMA
pages, this patch is adding another complete loop through all the 
pages. In the fast case.

If the check were instead done as part of the gup_pte_range(), then
it would be a little more efficient for that case.

As for whether it's worth it, *probably* this is too small an effect to measure. 
But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file 
that Jan Kara and Tom Talpey helped me come up with, for related testing:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



thanks,
Pingfan Liu June 7, 2019, 6:10 a.m. UTC | #4
On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> ...
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >>>       return ret;
> >>>  }
> >>>
> >>> +#ifdef CONFIG_CMA
> >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < nr_pinned; i++)
> >>> +             if (is_migrate_cma_page(pages[i])) {
> >>> +                     put_user_pages(pages + i, nr_pinned - i);
> >>> +                     return i;
> >>> +             }
> >>> +
> >>> +     return nr_pinned;
> >>> +}
> >>
> >> There's no point in inlining this.
> > OK, will drop it in V4.
> >
> >>
> >> The code seems inefficient.  If it encounters a single CMA page it can
> >> end up discarding a possibly significant number of non-CMA pages.  I
> > The trick is the page is not be discarded, in fact, they are still be
> > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > pages again.
> >
> >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> >> rare.  But could we avoid this (and the second pass across pages[]) by
> >> checking for a CMA page within gup_pte_range()?
> > It will spread the same logic to hugetlb pte and normal pte. And no
> > improvement in performance due to slow path. So I think maybe it is
> > not worth.
> >
> >>
>
> I think the concern is: for the successful gup_fast case with no CMA
> pages, this patch is adding another complete loop through all the
> pages. In the fast case.
>
> If the check were instead done as part of the gup_pte_range(), then
> it would be a little more efficient for that case.
>
> As for whether it's worth it, *probably* this is too small an effect to measure.
> But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> that Jan Kara and Tom Talpey helped me come up with, for related testing:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
will try to produce some results.

Thanks,
  Pingfan
Pingfan Liu June 11, 2019, 12:29 p.m. UTC | #5
On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > ...
> > >>> --- a/mm/gup.c
> > >>> +++ b/mm/gup.c
> > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > >>>       return ret;
> > >>>  }
> > >>>
> > >>> +#ifdef CONFIG_CMA
> > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > >>> +{
> > >>> +     int i;
> > >>> +
> > >>> +     for (i = 0; i < nr_pinned; i++)
> > >>> +             if (is_migrate_cma_page(pages[i])) {
> > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > >>> +                     return i;
> > >>> +             }
> > >>> +
> > >>> +     return nr_pinned;
> > >>> +}
> > >>
> > >> There's no point in inlining this.
> > > OK, will drop it in V4.
> > >
> > >>
> > >> The code seems inefficient.  If it encounters a single CMA page it can
> > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > The trick is the page is not be discarded, in fact, they are still be
> > > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > > pages again.
> > >
> > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > >> checking for a CMA page within gup_pte_range()?
> > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > improvement in performance due to slow path. So I think maybe it is
> > > not worth.
> > >
> > >>
> >
> > I think the concern is: for the successful gup_fast case with no CMA
> > pages, this patch is adding another complete loop through all the
> > pages. In the fast case.
> >
> > If the check were instead done as part of the gup_pte_range(), then
> > it would be a little more efficient for that case.
> >
> > As for whether it's worth it, *probably* this is too small an effect to measure.
> > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> >
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> >
I was unable to get an NVMe device to test on. And when running fio on a
traditional disk, I got the error "fio: engine libaio not loadable
fio: failed to load engine
fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"

But I found a test case that can be slightly adjusted to meet the aim:
tools/testing/selftests/vm/gup_benchmark.c

Test environment:
  MemTotal:       264079324 kB
  MemFree:        262306788 kB
  CmaTotal:              0 kB
  CmaFree:               0 kB
  on AMD EPYC 7601

Test command:
  gup_benchmark -r 100 -n 64
  gup_benchmark -r 100 -n 64 -l
where -r is the repeat count, -n is the nr_pages parameter for
get_user_pages_fast(), and -l is a new option to test FOLL_LONGTERM in the
fast path; see the patch at the end.

Test result:
w/o     477.800000
w/o-l   481.070000
a       481.800000
a-l     640.410000
b       466.240000  (question a: b outperforms w/o ?)
b-l     529.740000

Where w/o is the baseline (v5.2-rc2 without any patch), a is this series, and b
does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.

I am surprised that b-l is about 17% better than a-l: (640.41 - 529.74) / 640.41.
As for "question a: b outperforms w/o ?", I cannot figure out why; maybe it can be
considered variance.
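
For reference, the arithmetic can be written out as a small helper. This is
only a sketch; the function names are made up, and it assumes that lower
numbers are better.

```c
#include <assert.h>

/* Relative reduction of `after` with respect to `before`, in percent. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Relative increase of `value` over `baseline`, in percent. */
static double percent_overhead(double baseline, double value)
{
	assert(baseline > 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

percent_reduction(640.41, 529.74) gives roughly 17.3, matching the figure
above. Against the w/o-l baseline of 481.07, percent_overhead() gives roughly
33.1 for a-l and roughly 10.1 for b-l.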

Based on the above result, I think it is better to do the check inside
gup_pte_range().

Any comment?

Thanks,


> Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> will try to bring out the result.
> 
> Thanks,
>   Pingfan
>
---
Patch to do check inside gup_pte_range()

diff --git a/mm/gup.c b/mm/gup.c
index 2ce3091..ba213a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (unlikely(flags & FOLL_LONGTERM) &&
+			is_migrate_cma_page(page))
+				goto pte_unmap;
+
 		head = try_get_compound_head(page, 1);
 		if (!head)
 			goto pte_unmap;
@@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pmd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pud_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pgd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
---
Patch for testing

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d..61dec5f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,8 +6,9 @@
 #include <linux/debugfs.h>
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 						 pages + i);
 			break;
+		case GUP_FAST_LONGTERM_BENCHMARK:
+			nr = get_user_pages_fast(addr, nr,
+						 (gup->flags & 1) | FOLL_LONGTERM,
+						 pages + i);
+			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
 					    (gup->flags & 1) | FOLL_LONGTERM,
@@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
+	case GUP_FAST_LONGTERM_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 		break;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c0534e2..ade8acb 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,8 +15,9 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -37,7 +38,7 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
 		switch (opt) {
 		case 'm':
 			size = atoi(optarg) * MB;
@@ -54,6 +55,9 @@ int main(int argc, char **argv)
 		case 'T':
 			thp = 0;
 			break;
+		case 'l':
+			cmd = GUP_FAST_LONGTERM_BENCHMARK;
+			break;
 		case 'L':
 			cmd = GUP_LONGTERM_BENCHMARK;
 			break;
Christoph Hellwig June 11, 2019, 1:52 p.m. UTC | #6
On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> Unable to get a NVME device to have a test. And when testing fio on the

How would an NVMe test help?  FOLL_LONGTERM isn't used by any
performance-critical path to start with, so I don't see how this patch
could be a problem.
Aneesh Kumar K.V June 11, 2019, 4:15 p.m. UTC | #7
Pingfan Liu <kernelfans@gmail.com> writes:

> As for FOLL_LONGTERM, it is checked in the slow path
> __gup_longterm_unlocked(). But it is not checked in the fast path, which
> means a possible leak of CMA page to longterm pinned requirement through
> this crack.

Shouldn't we disallow FOLL_LONGTERM with the get_user_pages fastpath? For the
DAX check we need the vma to determine whether a long term pin is allowed or
not. If FOLL_LONGTERM is specified we should fall back to the slow path.

-aneesh
Ira Weiny June 11, 2019, 4:29 p.m. UTC | #8
> Pingfan Liu <kernelfans@gmail.com> writes:
> 
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > which means a possible leak of CMA page to longterm pinned requirement
> > through this crack.
> 
> Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> dax check we need vma to ensure whether a long term pin is allowed or not.
> If FOLL_LONGTERM is specified we should fallback to slow path.

Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it
does this while walking the page tables.  I missed the CMA case and Pingfan's
patch fixes this.  We could check for CMA pages while walking the page tables
but most agreed that it was not worth it.  For DAX we already had checks for
*_devmap() so it was easier to put the FOLL_LONGTERM checks there.
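
As a rough illustration of what "checking while walking the page tables"
means, here is a user-space toy. struct toy_pte and the flag value are
invented for the example; only the early-exit shape mirrors the check
proposed for gup_pte_range().

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FOLL_LONGTERM 0x1u  /* illustrative value, not the kernel's */

struct toy_pte {
	bool present;
	bool is_cma;  /* models is_migrate_cma_page(pte_page(pte)) */
};

/*
 * Model of a gup_pte_range()-style walk: count pages one at a time and
 * stop at the first entry the fast path must not handle. Returns the
 * number of pages handled before stopping; the caller falls back to the
 * slow path for the rest.
 */
static int toy_walk_ptes(const struct toy_pte *ptes, int n, unsigned int flags)
{
	int nr = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		if (!ptes[i].present)
			break;
		if ((flags & TOY_FOLL_LONGTERM) && ptes[i].is_cma)
			break;  /* checked before any reference is taken */
		nr++;
	}
	return nr;
}
```

Because the check happens before a reference is taken, nothing has to be
released afterwards and no second pass over pages[] is needed.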

Ira
Ira Weiny June 11, 2019, 4:47 p.m. UTC | #9
On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > ...
> > > >>> --- a/mm/gup.c
> > > >>> +++ b/mm/gup.c
> > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > > >>>       return ret;
> > > >>>  }
> > > >>>
> > > >>> +#ifdef CONFIG_CMA
> > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > > >>> +{
> > > >>> +     int i;
> > > >>> +
> > > >>> +     for (i = 0; i < nr_pinned; i++)
> > > >>> +             if (is_migrate_cma_page(pages[i])) {
> > > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > > >>> +                     return i;
> > > >>> +             }
> > > >>> +
> > > >>> +     return nr_pinned;
> > > >>> +}
> > > >>
> > > >> There's no point in inlining this.
> > > > OK, will drop it in V4.
> > > >
> > > >>
> > > >> The code seems inefficient.  If it encounters a single CMA page it can
> > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > The trick is the page is not be discarded, in fact, they are still be
> > > > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > > > pages again.
> > > >
> > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > > >> checking for a CMA page within gup_pte_range()?
> > > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > > improvement in performance due to slow path. So I think maybe it is
> > > > not worth.
> > > >
> > > >>
> > >
> > > I think the concern is: for the successful gup_fast case with no CMA
> > > pages, this patch is adding another complete loop through all the
> > > pages. In the fast case.
> > >
> > > If the check were instead done as part of the gup_pte_range(), then
> > > it would be a little more efficient for that case.
> > >
> > > As for whether it's worth it, *probably* this is too small an effect to measure.
> > > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> > >
> > > [reader]
> > > direct=1
> > > ioengine=libaio
> > > blocksize=4096
> > > size=1g
> > > numjobs=1
> > > rw=read
> > > iodepth=64
> > >
> Unable to get a NVME device to have a test. And when testing fio on the
> tranditional disk, I got the error "fio: engine libaio not loadable
> fio: failed to load engine
> fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"
> 
> But I found a test case which can be slightly adjusted to met the aim.
> It is tools/testing/selftests/vm/gup_benchmark.c
> 
> Test enviroment:
>   MemTotal:       264079324 kB
>   MemFree:        262306788 kB
>   CmaTotal:              0 kB
>   CmaFree:               0 kB
>   on AMD EPYC 7601
> 
> Test command:
>   gup_benchmark -r 100 -n 64
>   gup_benchmark -r 100 -n 64 -l
> where -r stands for repeat times, -n is nr_pages param for
> get_user_pages_fast(), -l is a new option to test FOLL_LONGTERM in fast
> path, see a patch at the tail.

Thanks!  That is a good test to add.  You should add the patch to the series.

> 
> Test result:
> w/o     477.800000
> w/o-l   481.070000
> a       481.800000
> a-l     640.410000
> b       466.240000  (question a: b outperforms w/o ?)
> b-l     529.740000
> 
> Where w/o is baseline without any patch using v5.2-rc2, a is this series, b
> does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.
> 
> I am suprised that b-l has about 17% improvement than a. (640.41 -529.74)/640.41

Wow that is bigger than I would have thought.  I suspect it gets worse as -n
increases?

>
> As for "question a: b outperforms w/o ?", I can not figure out why, maybe it can be
> considered as variance.

:-/

Does this change with larger -r or -n values?

> 
> Based on the above result, I think it is better to do the check inside
> gup_pte_range().
> 
> Any comment?

I agree.

Ira

> 
> Thanks,
> 
> 
> > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> > will try to bring out the result.
> > 
> > Thanks,
> >   Pingfan
> > 
> 

> ---
> Patch to do check inside gup_pte_range()
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 2ce3091..ba213a0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  
> +		if (unlikely(flags & FOLL_LONGTERM) &&
> +			is_migrate_cma_page(page))
> +				goto pte_unmap;
> +
>  		head = try_get_compound_head(page, 1);
>  		if (!head)
>  			goto pte_unmap;
> @@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pmd_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;
> @@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pud_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;
> @@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pgd_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;

> ---
> Patch for testing
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d..61dec5f 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -6,8 +6,9 @@
>  #include <linux/debugfs.h>
>  
>  #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
>  
>  struct gup_benchmark {
>  	__u64 get_delta_usec;
> @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>  			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
>  						 pages + i);
>  			break;
> +		case GUP_FAST_LONGTERM_BENCHMARK:
> +			nr = get_user_pages_fast(addr, nr,
> +						 (gup->flags & 1) | FOLL_LONGTERM,
> +						 pages + i);
> +			break;
>  		case GUP_LONGTERM_BENCHMARK:
>  			nr = get_user_pages(addr, nr,
>  					    (gup->flags & 1) | FOLL_LONGTERM,
> @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
>  
>  	switch (cmd) {
>  	case GUP_FAST_BENCHMARK:
> +	case GUP_FAST_LONGTERM_BENCHMARK:
>  	case GUP_LONGTERM_BENCHMARK:
>  	case GUP_BENCHMARK:
>  		break;
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> index c0534e2..ade8acb 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -15,8 +15,9 @@
>  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
>  
>  #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
>  
>  struct gup_benchmark {
>  	__u64 get_delta_usec;
> @@ -37,7 +38,7 @@ int main(int argc, char **argv)
>  	char *file = "/dev/zero";
>  	char *p;
>  
> -	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> +	while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
>  		switch (opt) {
>  		case 'm':
>  			size = atoi(optarg) * MB;
> @@ -54,6 +55,9 @@ int main(int argc, char **argv)
>  		case 'T':
>  			thp = 0;
>  			break;
> +		case 'l':
> +			cmd = GUP_FAST_LONGTERM_BENCHMARK;
> +			break;
>  		case 'L':
>  			cmd = GUP_LONGTERM_BENCHMARK;
>  			break;
> -- 
> 2.7.5
>
John Hubbard June 11, 2019, 7:49 p.m. UTC | #10
On 6/11/19 6:52 AM, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
>> Unable to get a NVME device to have a test. And when testing fio on the
> 
> How would a nvme test help?  FOLL_LONGTERM isn't used by any performance
> critical path to start with, so I don't see how this patch could be
> a problem.
> 

Yes, you're right of course. We skip the loop entirely unless FOLL_LONGTERM
is set, and I forgot for the moment that the direct IO paths are never going
to set that flag. :)

thanks,
Pingfan Liu June 12, 2019, 1:54 p.m. UTC | #11
On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > Pingfan Liu <kernelfans@gmail.com> writes:
> > 
> > > As for FOLL_LONGTERM, it is checked in the slow path
> > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > which means a possible leak of CMA page to longterm pinned requirement
> > > through this crack.
> > 
> > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > dax check we need vma to ensure whether a long term pin is allowed or not.
> > If FOLL_LONGTERM is specified we should fallback to slow path.
> 
> Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it does this while walking the page tables.  I missed the CMA case and Pingfan's patch fixes this.  We could check for CMA pages while walking the page tables but most agreed that it was not worth it.  For DAX we already had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> 
Then for CMA pages, are you suggesting something like:
diff --git a/mm/gup.c b/mm/gup.c
index 42a47c0..8bf3cc3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
        if (unlikely(!access_ok((void __user *)start, len)))
                return -EFAULT;

+       if (unlikely(gup_flags & FOLL_LONGTERM))
+               goto slow;
        if (gup_fast_permitted(start, nr_pages)) {
                local_irq_disable();
                gup_pgd_range(addr, end, gup_flags, pages, &nr);
@@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
                ret = nr;
        }

+slow:
        if (nr < nr_pages) {
                /* Try to get the remaining pages with get_user_pages */
                start += nr << PAGE_SHIFT;

Thanks,
  Pingfan
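
The control flow of that "goto slow" idea can be modelled in user space as
follows. This is a toy under stated assumptions: struct toy_stats, the flag
value, and the fast_can_pin parameter are all invented for the example, and
the slow path is assumed to always succeed.

```c
#include <assert.h>

#define TOY_FOLL_LONGTERM 0x1u  /* illustrative value, not the kernel's */

/* Counters so a caller can see which path handled how many pages. */
struct toy_stats {
	int fast_pages;
	int slow_pages;
};

/*
 * With FOLL_LONGTERM set, the lockless walk is skipped (nr stays 0), so the
 * "remaining pages" branch hands the whole request to the slow path.
 * `fast_can_pin` models how many pages the lockless walk would have pinned.
 */
static int toy_gup_fast(int nr_pages, unsigned int gup_flags, int fast_can_pin,
			struct toy_stats *st)
{
	int nr = 0;

	assert(nr_pages >= 0 && fast_can_pin >= 0 && fast_can_pin <= nr_pages);

	if (!(gup_flags & TOY_FOLL_LONGTERM)) {
		nr = fast_can_pin;
		st->fast_pages += nr;
	}

	if (nr < nr_pages) {
		/* slow path picks up everything the fast path did not pin */
		st->slow_pages += nr_pages - nr;
		nr = nr_pages;
	}
	return nr;
}
```

The trade-off is that every FOLL_LONGTERM request pays the slow-path cost,
even when no CMA or DAX page is involved.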
Pingfan Liu June 12, 2019, 2:10 p.m. UTC | #12
On Wed, Jun 12, 2019 at 12:46 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >
> > > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > ...
> > > > >>> --- a/mm/gup.c
> > > > >>> +++ b/mm/gup.c
> > > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > > > >>>       return ret;
> > > > >>>  }
> > > > >>>
> > > > >>> +#ifdef CONFIG_CMA
> > > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > > > >>> +{
> > > > >>> +     int i;
> > > > >>> +
> > > > >>> +     for (i = 0; i < nr_pinned; i++)
> > > > >>> +             if (is_migrate_cma_page(pages[i])) {
> > > > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > > > >>> +                     return i;
> > > > >>> +             }
> > > > >>> +
> > > > >>> +     return nr_pinned;
> > > > >>> +}
> > > > >>
> > > > >> There's no point in inlining this.
> > > > > OK, will drop it in V4.
> > > > >
> > > > >>
> > > > >> The code seems inefficient.  If it encounters a single CMA page it can
> > > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > > The trick is that the pages are not discarded; in fact, they are still
> > > > > referenced by the pte. We just leave it to the slow path to pick up the
> > > > > non-CMA pages again.
> > > > >
> > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > > > >> checking for a CMA page within gup_pte_range()?
> > > > > It would spread the same logic across the hugetlb pte and normal pte
> > > > > paths, with no performance gain, since we fall back to the slow path
> > > > > anyway. So I think it is probably not worth it.
> > > > >
> > > > >>
> > > >
> > > > I think the concern is: for the successful gup_fast case with no CMA
> > > > pages, this patch is adding another complete loop through all the
> > > > pages. In the fast case.
> > > >
> > > > If the check were instead done as part of the gup_pte_range(), then
> > > > it would be a little more efficient for that case.
> > > >
> > > > As for whether it's worth it, *probably* this is too small an effect to measure.
> > > > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > > > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> > > >
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > >
> > I was unable to get an NVMe device to test with. And when testing fio on a
> > traditional disk, I got the error "fio: engine libaio not loadable
> > fio: failed to load engine
> > fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"
> >
> > But I found a test case which can be slightly adjusted to meet the aim.
> > It is tools/testing/selftests/vm/gup_benchmark.c
> >
> > Test environment:
> >   MemTotal:       264079324 kB
> >   MemFree:        262306788 kB
> >   CmaTotal:              0 kB
> >   CmaFree:               0 kB
> >   on AMD EPYC 7601
> >
> > Test command:
> >   gup_benchmark -r 100 -n 64
> >   gup_benchmark -r 100 -n 64 -l
> > where -r stands for repeat times, -n is nr_pages param for
> > get_user_pages_fast(), -l is a new option to test FOLL_LONGTERM in fast
> > path, see a patch at the tail.
>
> Thanks!  That is a good test to add.  You should add the patch to the series.
OK.
>
> >
> > Test result:
> > w/o     477.800000
> > w/o-l   481.070000
> > a       481.800000
> > a-l     640.410000
> > b       466.240000  (question a: b outperforms w/o ?)
> > b-l     529.740000
> >
> > Where w/o is baseline without any patch using v5.2-rc2, a is this series, b
> > does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.
> >
> > I am surprised that b-l is about 17% faster than a-l: (640.41 - 529.74)/640.41
>
> Wow that is bigger than I would have thought.  I suspect it gets worse as -n
> increases?
Yes. I tested with -n 64/128/256/512, and the trend holds. See the data below.

>
> >
> > As for "question a: b outperforms w/o ?", I can not figure out why, maybe it can be
> > considered as variance.
>
> :-/
>
> Does this change with larger -r or -n values?
-r should have no effect on this. I also varied -n over 64/128/256/512, and the
data always shows b slightly outperforming w/o.

        64      128     256     512
a-l   633.23  676.83  747.14  683.19
b-l   528.32  529.10  523.95  512.88
w/o   479.73  473.87  477.67  488.70
b     470.13  467.11  463.06  469.62

(The a-l sample at n=256 was probably disturbed by something, but the overall
trend keeps going up.)

Thanks,
  Pingfan
>
> >
> > Based on the above result, I think it is better to do the check inside
> > gup_pte_range().
> >
> > Any comment?
>
> I agree.
>
> Ira
>
> >
> > Thanks,
> >
> >
> > > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> > > will try to bring out the result.
> > >
> > > Thanks,
> > >   Pingfan
> > >
> >
>
> > ---
> > Patch to do check inside gup_pte_range()
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 2ce3091..ba213a0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >               VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> >               page = pte_page(pte);
> >
> > +             if (unlikely(flags & FOLL_LONGTERM) &&
> > +                     is_migrate_cma_page(page))
> > +                             goto pte_unmap;
> > +
> >               head = try_get_compound_head(page, 1);
> >               if (!head)
> >                       goto pte_unmap;
> > @@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pmd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pud_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pgd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
>
> > ---
> > Patch for testing
> >
> > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > index 7dd602d..61dec5f 100644
> > --- a/mm/gup_benchmark.c
> > +++ b/mm/gup_benchmark.c
> > @@ -6,8 +6,9 @@
> >  #include <linux/debugfs.h>
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> >                       nr = get_user_pages_fast(addr, nr, gup->flags & 1,
> >                                                pages + i);
> >                       break;
> > +             case GUP_FAST_LONGTERM_BENCHMARK:
> > +                     nr = get_user_pages_fast(addr, nr,
> > +                                              (gup->flags & 1) | FOLL_LONGTERM,
> > +                                              pages + i);
> > +                     break;
> >               case GUP_LONGTERM_BENCHMARK:
> >                       nr = get_user_pages(addr, nr,
> >                                           (gup->flags & 1) | FOLL_LONGTERM,
> > @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
> >
> >       switch (cmd) {
> >       case GUP_FAST_BENCHMARK:
> > +     case GUP_FAST_LONGTERM_BENCHMARK:
> >       case GUP_LONGTERM_BENCHMARK:
> >       case GUP_BENCHMARK:
> >               break;
> > diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> > index c0534e2..ade8acb 100644
> > --- a/tools/testing/selftests/vm/gup_benchmark.c
> > +++ b/tools/testing/selftests/vm/gup_benchmark.c
> > @@ -15,8 +15,9 @@
> >  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -37,7 +38,7 @@ int main(int argc, char **argv)
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
> >               switch (opt) {
> >               case 'm':
> >                       size = atoi(optarg) * MB;
> > @@ -54,6 +55,9 @@ int main(int argc, char **argv)
> >               case 'T':
> >                       thp = 0;
> >                       break;
> > +             case 'l':
> > +                     cmd = GUP_FAST_LONGTERM_BENCHMARK;
> > +                     break;
> >               case 'L':
> >                       cmd = GUP_LONGTERM_BENCHMARK;
> >                       break;
> > --
> > 2.7.5
> >
>
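The savings quoted in this mail can be recomputed from the posted figures. Below is a minimal user-space sketch in C that uses only the numbers from the table above. The helper names `relative_saving` and `print_savings` are invented for illustration, and since the mail does not state the unit of the figures, they are treated as opaque costs where lower is better.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the table in this mail (lower is better). */
static const int    n_values[4] = { 64, 128, 256, 512 };
static const double a_l[4] = { 633.23, 676.83, 747.14, 683.19 };
static const double b_l[4] = { 528.32, 529.10, 523.95, 512.88 };

/* Hypothetical helper: fraction of `slow` saved by switching to `fast`. */
double relative_saving(double slow, double fast)
{
	assert(slow > 0.0);
	return (slow - fast) / slow;
}

/* Print the saving of b-l over a-l for every -n value in the table. */
void print_savings(void)
{
	for (int i = 0; i < 4; i++)
		printf("-n %3d: b-l saves %.1f%% vs a-l\n", n_values[i],
		       100.0 * relative_saving(a_l[i], b_l[i]));
}
```

Calling `relative_saving(640.41, 529.74)` reproduces the roughly 17% figure from the first run; `print_savings()` applies the same formula to each column of the later table.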
Ira Weiny June 12, 2019, 11:50 p.m. UTC | #13
On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > > Pingfan Liu <kernelfans@gmail.com> writes:
> > > 
> > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > through this crack.
> > > 
> > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > dax check we need vma to ensure whether a long term pin is allowed or not.
> > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > 
> > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it does this while walking the page tables.  I missed the CMA case and Pingfan's patch fixes this.  We could check for CMA pages while walking the page tables but most agreed that it was not worth it.  For DAX we already had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> > 
> Then for CMA pages, are you suggesting something like:

I'm not suggesting this.

Sorry I wrote this prior to seeing the numbers in your other email.  Given
the numbers it looks like performing the check whilst walking the tables is
worth the extra complexity.  I was just trying to summarize the thread.  I
don't think we should disallow FOLL_LONGTERM because it only affects CMA and
DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
we don't have to.  Also in the case of DAX the use of vma will be going
away...[1]  Eventually...  ;-)

Ira

[1] https://lkml.org/lkml/2019/6/5/1049

> diff --git a/mm/gup.c b/mm/gup.c
> index 42a47c0..8bf3cc3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>         if (unlikely(!access_ok((void __user *)start, len)))
>                 return -EFAULT;
> 
> +       if (unlikely(gup_flags & FOLL_LONGTERM))
> +               goto slow;
>         if (gup_fast_permitted(start, nr_pages)) {
>                 local_irq_disable();
>                 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> @@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>                 ret = nr;
>         }
> 
> +slow:
>         if (nr < nr_pages) {
>                 /* Try to get the remaining pages with get_user_pages */
>                 start += nr << PAGE_SHIFT;
> 
> Thanks,
>   Pingfan
Pingfan Liu June 13, 2019, 10:48 a.m. UTC | #14
On Thu, Jun 13, 2019 at 7:49 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> > On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > > > Pingfan Liu <kernelfans@gmail.com> writes:
> > > >
> > > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > > through this crack.
> > > >
> > > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > > dax check we need vma to ensure whether a long term pin is allowed or not.
> > > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > >
> > > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it does this while walking the page tables.  I missed the CMA case and Pingfan's patch fixes this.  We could check for CMA pages while walking the page tables but most agreed that it was not worth it.  For DAX we already had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> > >
> > Then for CMA pages, are you suggesting something like:
>
> I'm not suggesting this.
OK, then I will send out v4.
>
> Sorry I wrote this prior to seeing the numbers in your other email.  Given
> the numbers it looks like performing the check whilst walking the tables is
> worth the extra complexity.  I was just trying to summarize the thread.  I
> don't think we should disallow FOLL_LONGTERM because it only affects CMA and
> DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
> we don't have to.  Also in the case of DAX the use of vma will be going
> away...[1]  Eventually...  ;-)
A good feature. Trying to catch up.

Thanks,
Pingfan
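
To make the "check while walking the tables" approach concrete, here is a user-space model in C. It is not kernel code: `struct fake_page`, `MODEL_FOLL_LONGTERM`, and `walk_fast` are stand-ins invented for this sketch, whereas the real walk is gup_pte_range() and the real predicate is is_migrate_cma_page(). It illustrates the property Ira describes: a caller that does not pass FOLL_LONGTERM pays only a flag test per page, and a long-term caller stops at the first CMA page so that the slow path can handle the remainder.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FOLL_LONGTERM 0x1u  /* stand-in flag, not the kernel value */

struct fake_page {
	bool is_cma;   /* models is_migrate_cma_page() */
	int  refcount; /* models the reference taken by a pin */
};

/*
 * Model of the fast-path walk with the CMA check done per page.
 * Returns how many pages were pinned before the walk stopped.
 */
size_t walk_fast(struct fake_page *pages, size_t nr_pages, unsigned int flags)
{
	size_t nr = 0;

	while (nr < nr_pages) {
		/* Bail out before taking a reference on a CMA page. */
		if ((flags & MODEL_FOLL_LONGTERM) && pages[nr].is_cma)
			break;
		pages[nr].refcount++;
		nr++;
	}
	return nr;
}
```

Because the walk stops before taking a reference on the CMA page, nothing has to be released afterwards, and no second pass over pages[] is needed.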

Patch

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..0e59af9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2196,6 +2196,26 @@  static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
+#ifdef CONFIG_CMA
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pinned; i++)
+		if (is_migrate_cma_page(pages[i])) {
+			put_user_pages(pages + i, nr_pinned - i);
+			return i;
+		}
+
+	return nr_pinned;
+}
+#else
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+	return nr_pinned;
+}
+#endif
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start:	starting user address
@@ -2236,6 +2256,9 @@  int get_user_pages_fast(unsigned long start, int nr_pages,
 		ret = nr;
 	}
 
+	if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
+		nr = reject_cma_pages(nr, pages);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
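
The post-pass added by this patch can be modelled in user space as follows. This is a sketch only: `struct fake_page` and `model_reject_cma_pages` are invented names, the `is_cma` field stands in for is_migrate_cma_page(), and decrementing `refcount` stands in for put_user_pages().

```c
#include <stdbool.h>

struct fake_page {
	bool is_cma;   /* models is_migrate_cma_page() */
	int  refcount; /* models the reference held by a pin */
};

/*
 * Model of reject_cma_pages(): scan the pinned pages and, at the first CMA
 * page, drop the references on it and on everything after it.  The return
 * value is the number of pages that stay pinned.
 */
int model_reject_cma_pages(int nr_pinned, struct fake_page **pages)
{
	for (int i = 0; i < nr_pinned; i++) {
		if (pages[i]->is_cma) {
			for (int j = i; j < nr_pinned; j++)
				pages[j]->refcount--; /* models put_user_pages() */
			return i;
		}
	}
	return nr_pinned;
}
```

The returned count plays the role of the updated `nr` in get_user_pages_fast(): when it is smaller than nr_pages, the existing fallback retries the remaining range through the slow path, which is where the FOLL_LONGTERM handling already lives.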