Message ID | 20241016202242.456953-1-jhubbard@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/gup: stop leaking pinned pages in low memory conditions | expand |
On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote: > If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) > family of functions, and requests "too many" pages, then the call will > erroneously leave pages pinned. This is visible in user space as an > actual memory leak. > > Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls > to exhaust memory. > > The root cause of the problem is this sequence, within > __gup_longterm_locked(): > > __get_user_pages_locked() > rc = check_and_migrate_movable_pages() > > ...which gets retried in a loop. The loop error handling is incomplete, > clearly due to a somewhat unusual and complicated tri-state error API. > But anyway, if -ENOMEM, or in fact, any unexpected error is returned > from check_and_migrate_movable_pages(), then __gup_longterm_locked() > happily returns the error, while leaving the pages pinned. > > In the failed case, which is an app that requests (via a device driver) > 30720000000 bytes to be pinned, and then exits, I see this: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 2048 > > And after applying this patch, it returns to balanced pins: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 7502048 > > Fix this by unpinning the pages that __get_user_pages_locked() has > pinned, in such error cases. Thanks. > Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") I'll add this to the -stable backport pile, although this seems a bit marginal?
On 10/16/24 2:57 PM, Andrew Morton wrote: > On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote: ... >> Fix this by unpinning the pages that __get_user_pages_locked() has >> pinned, in such error cases. > > Thanks. > >> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") > > I'll add this to the -stable backport pile, although this seems a bit > marginal? I'm on the fence about that. It is marginal: you have to exhaust memory. On the other hand, a real user reported this bug to us. I guess I'd lean toward "correctness in -stable", and add it to the pile, in the end. thanks,
John Hubbard <jhubbard@nvidia.com> writes: > If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) > family of functions, and requests "too many" pages, then the call will > erroneously leave pages pinned. This is visible in user space as an > actual memory leak. > > Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls > to exhaust memory. > > The root cause of the problem is this sequence, within > __gup_longterm_locked(): > > __get_user_pages_locked() > rc = check_and_migrate_movable_pages() > > ...which gets retried in a loop. The loop error handling is incomplete, > clearly due to a somewhat unusual and complicated tri-state error API. > But anyway, if -ENOMEM, or in fact, any unexpected error is returned > from check_and_migrate_movable_pages(), then __gup_longterm_locked() > happily returns the error, while leaving the pages pinned. > > In the failed case, which is an app that requests (via a device driver) > 30720000000 bytes to be pinned, and then exits, I see this: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 2048 > > And after applying this patch, it returns to balanced pins: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 7502048 > > Fix this by unpinning the pages that __get_user_pages_locked() has > pinned, in such error cases. > > Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") > Cc: Alistair Popple <apopple@nvidia.com> > Cc: Shigeru Yoshida <syoshida@redhat.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Pasha Tatashin <pasha.tatashin@soleen.com> > Signed-off-by: John Hubbard <jhubbard@nvidia.com> > --- > mm/gup.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/mm/gup.c b/mm/gup.c > index a82890b46a36..24acf53c8294 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, > > /* FOLL_LONGTERM implies FOLL_PIN */ > rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); > + > + /* > + * The __get_user_pages_locked() call happens before we know > + * that whether it's possible to successfully complete the whole > + * operation. To compensate for this, if we get an unexpected > + * error (such as -ENOMEM) then we must unpin everything, before > + * erroring out. > + */ > + if (rc != -EAGAIN && rc != 0) > + unpin_user_pages(pages, nr_pinned_pages); I know this is going to fall out of the loop in the next line but given this is an error handling case it would be nice if this was made explicit here with a break statement. It looks correct to me though so: Reviewed-by: Alistair Popple <apopple@nvidia.com> > + > } while (rc == -EAGAIN); > memalloc_pin_restore(flags); > return rc ? rc : nr_pinned_pages;
On 10/16/24 3:13 PM, Alistair Popple wrote: > John Hubbard <jhubbard@nvidia.com> writes: ... >> diff --git a/mm/gup.c b/mm/gup.c >> index a82890b46a36..24acf53c8294 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, >> >> /* FOLL_LONGTERM implies FOLL_PIN */ >> rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); >> + >> + /* >> + * The __get_user_pages_locked() call happens before we know >> + * that whether it's possible to successfully complete the whole >> + * operation. To compensate for this, if we get an unexpected >> + * error (such as -ENOMEM) then we must unpin everything, before >> + * erroring out. >> + */ >> + if (rc != -EAGAIN && rc != 0) >> + unpin_user_pages(pages, nr_pinned_pages); > > I know this is going to fall out of the loop in the next line but given > this is an error handling case it would be nice if this was made > explicit here with a break statement. It looks correct to me though so: Yes, I'm not sure that adding another line of code for cosmetics really makes this better. At this point, smaller is better IMHO, and then the next step should be something bigger, such as refactoring to avoid the tri-state return codes. > > Reviewed-by: Alistair Popple <apopple@nvidia.com> Thanks for the super fast response and the review! thanks,
On Wed, 16 Oct 2024 15:05:28 -0700 John Hubbard <jhubbard@nvidia.com> wrote: > On 10/16/24 2:57 PM, Andrew Morton wrote: > > On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote: > ... > >> Fix this by unpinning the pages that __get_user_pages_locked() has > >> pinned, in such error cases. > > > > Thanks. > > > >> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") > > > > I'll add this to the -stable backport pile, although this seems a bit > > marginal? > > I'm on the fence about that. It is marginal: you have to > exhaust memory. On the other hand, a real user reported > this bug to us. > > I guess I'd lean toward "correctness in -stable", and > add it to the pile, in the end. Thanks. It's a super-simple patch, which helps the decision.
On 16.10.24 22:22, John Hubbard wrote: > If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) > family of functions, and requests "too many" pages, then the call will > erroneously leave pages pinned. This is visible in user space as an > actual memory leak. > > Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls > to exhaust memory. > > The root cause of the problem is this sequence, within > __gup_longterm_locked(): > > __get_user_pages_locked() > rc = check_and_migrate_movable_pages() > > ...which gets retried in a loop. The loop error handling is incomplete, > clearly due to a somewhat unusual and complicated tri-state error API. > But anyway, if -ENOMEM, or in fact, any unexpected error is returned > from check_and_migrate_movable_pages(), then __gup_longterm_locked() > happily returns the error, while leaving the pages pinned. > > In the failed case, which is an app that requests (via a device driver) > 30720000000 bytes to be pinned, and then exits, I see this: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 2048 > > And after applying this patch, it returns to balanced pins: > > $ grep foll /proc/vmstat > nr_foll_pin_acquired 7502048 > nr_foll_pin_released 7502048 > > Fix this by unpinning the pages that __get_user_pages_locked() has > pinned, in such error cases. > > Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") > Cc: Alistair Popple <apopple@nvidia.com> > Cc: Shigeru Yoshida <syoshida@redhat.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Pasha Tatashin <pasha.tatashin@soleen.com> > Signed-off-by: John Hubbard <jhubbard@nvidia.com> > --- > mm/gup.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/mm/gup.c b/mm/gup.c > index a82890b46a36..24acf53c8294 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, > > /* FOLL_LONGTERM implies FOLL_PIN */ > rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); > + > + /* > + * The __get_user_pages_locked() call happens before we know > + * that whether it's possible to successfully complete the whole > + * operation. To compensate for this, if we get an unexpected > + * error (such as -ENOMEM) then we must unpin everything, before > + * erroring out. > + */ > + if (rc != -EAGAIN && rc != 0) > + unpin_user_pages(pages, nr_pinned_pages); > + > } while (rc == -EAGAIN); Wouldn't it be cleaner to simply have here after the loop (possibly even after the memalloc_pin_restore()) if (rc) unpin_user_pages(pages, nr_pinned_pages); But maybe I am missing something. > memalloc_pin_restore(flags); > return rc ? rc : nr_pinned_pages;
On 17.10.24 10:51, David Hildenbrand wrote: > On 16.10.24 22:22, John Hubbard wrote: >> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) >> family of functions, and requests "too many" pages, then the call will >> erroneously leave pages pinned. This is visible in user space as an >> actual memory leak. >> >> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls >> to exhaust memory. >> >> The root cause of the problem is this sequence, within >> __gup_longterm_locked(): >> >> __get_user_pages_locked() >> rc = check_and_migrate_movable_pages() >> >> ...which gets retried in a loop. The loop error handling is incomplete, >> clearly due to a somewhat unusual and complicated tri-state error API. >> But anyway, if -ENOMEM, or in fact, any unexpected error is returned >> from check_and_migrate_movable_pages(), then __gup_longterm_locked() >> happily returns the error, while leaving the pages pinned. >> >> In the failed case, which is an app that requests (via a device driver) >> 30720000000 bytes to be pinned, and then exits, I see this: >> >> $ grep foll /proc/vmstat >> nr_foll_pin_acquired 7502048 >> nr_foll_pin_released 2048 >> >> And after applying this patch, it returns to balanced pins: >> >> $ grep foll /proc/vmstat >> nr_foll_pin_acquired 7502048 >> nr_foll_pin_released 7502048 >> >> Fix this by unpinning the pages that __get_user_pages_locked() has >> pinned, in such error cases. >> >> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") >> Cc: Alistair Popple <apopple@nvidia.com> >> Cc: Shigeru Yoshida <syoshida@redhat.com> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Jason Gunthorpe <jgg@nvidia.com> >> Cc: Minchan Kim <minchan@kernel.org> >> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> >> Signed-off-by: John Hubbard <jhubbard@nvidia.com> >> --- >> mm/gup.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index a82890b46a36..24acf53c8294 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, >> >> /* FOLL_LONGTERM implies FOLL_PIN */ >> rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); >> + >> + /* >> + * The __get_user_pages_locked() call happens before we know >> + * that whether it's possible to successfully complete the whole >> + * operation. To compensate for this, if we get an unexpected >> + * error (such as -ENOMEM) then we must unpin everything, before >> + * erroring out. >> + */ >> + if (rc != -EAGAIN && rc != 0) >> + unpin_user_pages(pages, nr_pinned_pages); >> + >> } while (rc == -EAGAIN); > > Wouldn't it be cleaner to simply have here after the loop (possibly even > after the memalloc_pin_restore()) > > if (rc) > unpin_user_pages(pages, nr_pinned_pages); > > But maybe I am missing something. And staring at memfd_pin_folios(), don't we have the same issue there if check_and_migrate_movable_folios() fails? diff --git a/mm/gup.c b/mm/gup.c index a82890b46a36..f79974d38608 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, ret = check_and_migrate_movable_folios(nr_folios, folios); } while (ret == -EAGAIN); - memalloc_pin_restore(flags); - return ret ? ret : nr_folios; err: memalloc_pin_restore(flags); - unpin_folios(folios, nr_folios); - - return ret; + if (ret) + unpin_folios(folios, nr_folios); + return ret ? ret : nr_folios; } EXPORT_SYMBOL_GPL(memfd_pin_folios);
On 10/17/24 1:51 AM, David Hildenbrand wrote: > On 16.10.24 22:22, John Hubbard wrote: ... >> diff --git a/mm/gup.c b/mm/gup.c >> index a82890b46a36..24acf53c8294 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct >> mm_struct *mm, >> /* FOLL_LONGTERM implies FOLL_PIN */ >> rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); >> + >> + /* >> + * The __get_user_pages_locked() call happens before we know >> + * that whether it's possible to successfully complete the whole oops, s/that whether/that/ >> + * operation. To compensate for this, if we get an unexpected >> + * error (such as -ENOMEM) then we must unpin everything, before >> + * erroring out. >> + */ >> + if (rc != -EAGAIN && rc != 0) >> + unpin_user_pages(pages, nr_pinned_pages); >> + >> } while (rc == -EAGAIN); > > Wouldn't it be cleaner to simply have here after the loop (possibly even > after the memalloc_pin_restore()) > > if (rc) > unpin_user_pages(pages, nr_pinned_pages); > > But maybe I am missing something. Yes, I think you are correct. That is cleaner. Let me retest real quick just in case, and then send a v2 that also picks up the typo and moves the comment to the new location. thanks,
On 10/17/24 1:53 AM, David Hildenbrand wrote: > On 17.10.24 10:51, David Hildenbrand wrote: >> On 16.10.24 22:22, John Hubbard wrote: ... > And staring at memfd_pin_folios(), don't we have the same issue there if > check_and_migrate_movable_folios() fails? Yes, it looks very clearly like the exact same bug, in a different location. This complicated return code is the gift that keeps on giving. Although likely people are just copying the pattern, which had the problem. > > diff --git a/mm/gup.c b/mm/gup.c > index a82890b46a36..f79974d38608 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t > start, loff_t end, > ret = check_and_migrate_movable_folios(nr_folios, folios); > } while (ret == -EAGAIN); > > - memalloc_pin_restore(flags); > - return ret ? ret : nr_folios; > err: > memalloc_pin_restore(flags); > - unpin_folios(folios, nr_folios); > - > - return ret; > + if (ret) > + unpin_folios(folios, nr_folios); > + return ret ? ret : nr_folios; That looks correct. I can send this out with the other patch as a tiny 2-patch series since they are related. Would you prefer to appear as a Signed-off-by, or a Suggested-by, or "other"? :) > } > EXPORT_SYMBOL_GPL(memfd_pin_folios); > > thanks,
On 17.10.24 19:06, John Hubbard wrote: > On 10/17/24 1:53 AM, David Hildenbrand wrote: >> On 17.10.24 10:51, David Hildenbrand wrote: >>> On 16.10.24 22:22, John Hubbard wrote: > ... >> And staring at memfd_pin_folios(), don't we have the same issue there if >> check_and_migrate_movable_folios() fails? > > Yes, it looks very clearly like the exact same bug, in a different location. > This complicated return code is the gift that keeps on giving. Although > likely people are just copying the pattern, which had the problem. > > >> >> diff --git a/mm/gup.c b/mm/gup.c >> index a82890b46a36..f79974d38608 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t >> start, loff_t end, >> ret = check_and_migrate_movable_folios(nr_folios, folios); >> } while (ret == -EAGAIN); >> >> - memalloc_pin_restore(flags); >> - return ret ? ret : nr_folios; >> err: >> memalloc_pin_restore(flags); >> - unpin_folios(folios, nr_folios); >> - >> - return ret; >> + if (ret) >> + unpin_folios(folios, nr_folios); >> + return ret ? ret : nr_folios; > > That looks correct. I can send this out with the other patch as a tiny > 2-patch series since they are related. Would you prefer to appear > as a Signed-off-by, or a Suggested-by, or "other"? :) Suggested-by: please :)
David Hildenbrand <david@redhat.com> writes: > On 16.10.24 22:22, John Hubbard wrote: >> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) >> family of functions, and requests "too many" pages, then the call will >> erroneously leave pages pinned. This is visible in user space as an >> actual memory leak. >> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) >> calls >> to exhaust memory. >> The root cause of the problem is this sequence, within >> __gup_longterm_locked(): >> __get_user_pages_locked() >> rc = check_and_migrate_movable_pages() >> ...which gets retried in a loop. The loop error handling is >> incomplete, >> clearly due to a somewhat unusual and complicated tri-state error API. >> But anyway, if -ENOMEM, or in fact, any unexpected error is returned >> from check_and_migrate_movable_pages(), then __gup_longterm_locked() >> happily returns the error, while leaving the pages pinned. >> In the failed case, which is an app that requests (via a device >> driver) >> 30720000000 bytes to be pinned, and then exits, I see this: >> $ grep foll /proc/vmstat >> nr_foll_pin_acquired 7502048 >> nr_foll_pin_released 2048 >> And after applying this patch, it returns to balanced pins: >> $ grep foll /proc/vmstat >> nr_foll_pin_acquired 7502048 >> nr_foll_pin_released 7502048 >> Fix this by unpinning the pages that __get_user_pages_locked() has >> pinned, in such error cases. >> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix >> check_and_migrate_movable_pages() return codes") >> Cc: Alistair Popple <apopple@nvidia.com> >> Cc: Shigeru Yoshida <syoshida@redhat.com> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Jason Gunthorpe <jgg@nvidia.com> >> Cc: Minchan Kim <minchan@kernel.org> >> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> >> Signed-off-by: John Hubbard <jhubbard@nvidia.com> >> --- >> mm/gup.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> diff --git a/mm/gup.c b/mm/gup.c >> index a82890b46a36..24acf53c8294 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, >> /* FOLL_LONGTERM implies FOLL_PIN */ >> rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); >> + >> + /* >> + * The __get_user_pages_locked() call happens before we know >> + * that whether it's possible to successfully complete the whole >> + * operation. To compensate for this, if we get an unexpected >> + * error (such as -ENOMEM) then we must unpin everything, before >> + * erroring out. >> + */ >> + if (rc != -EAGAIN && rc != 0) >> + unpin_user_pages(pages, nr_pinned_pages); >> + >> } while (rc == -EAGAIN); > > Wouldn't it be cleaner to simply have here after the loop (possibly > even after the memalloc_pin_restore()) > > if (rc) > unpin_user_pages(pages, nr_pinned_pages); > > But maybe I am missing something. I initially thought the same thing but I'm not sure it is correct. Consider what happens when __get_user_pages_locked() fails earlier in the loop. In this case it will have bailed out of the loop with rc <= 0 but we shouldn't call unpin_user_pages(). >> memalloc_pin_restore(flags); >> return rc ? rc : nr_pinned_pages;
On 17.10.24 23:28, Alistair Popple wrote: > > David Hildenbrand <david@redhat.com> writes: > >> On 16.10.24 22:22, John Hubbard wrote: >>> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) >>> family of functions, and requests "too many" pages, then the call will >>> erroneously leave pages pinned. This is visible in user space as an >>> actual memory leak. >>> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) >>> calls >>> to exhaust memory. >>> The root cause of the problem is this sequence, within >>> __gup_longterm_locked(): >>> __get_user_pages_locked() >>> rc = check_and_migrate_movable_pages() >>> ...which gets retried in a loop. The loop error handling is >>> incomplete, >>> clearly due to a somewhat unusual and complicated tri-state error API. >>> But anyway, if -ENOMEM, or in fact, any unexpected error is returned >>> from check_and_migrate_movable_pages(), then __gup_longterm_locked() >>> happily returns the error, while leaving the pages pinned. >>> In the failed case, which is an app that requests (via a device >>> driver) >>> 30720000000 bytes to be pinned, and then exits, I see this: >>> $ grep foll /proc/vmstat >>> nr_foll_pin_acquired 7502048 >>> nr_foll_pin_released 2048 >>> And after applying this patch, it returns to balanced pins: >>> $ grep foll /proc/vmstat >>> nr_foll_pin_acquired 7502048 >>> nr_foll_pin_released 7502048 >>> Fix this by unpinning the pages that __get_user_pages_locked() has >>> pinned, in such error cases. >>> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix >>> check_and_migrate_movable_pages() return codes") >>> Cc: Alistair Popple <apopple@nvidia.com> >>> Cc: Shigeru Yoshida <syoshida@redhat.com> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Jason Gunthorpe <jgg@nvidia.com> >>> Cc: Minchan Kim <minchan@kernel.org> >>> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> >>> Signed-off-by: John Hubbard <jhubbard@nvidia.com> >>> --- >>> mm/gup.c | 11 +++++++++++ >>> 1 file changed, 11 insertions(+) >>> diff --git a/mm/gup.c b/mm/gup.c >>> index a82890b46a36..24acf53c8294 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, >>> /* FOLL_LONGTERM implies FOLL_PIN */ >>> rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); >>> + >>> + /* >>> + * The __get_user_pages_locked() call happens before we know >>> + * that whether it's possible to successfully complete the whole >>> + * operation. To compensate for this, if we get an unexpected >>> + * error (such as -ENOMEM) then we must unpin everything, before >>> + * erroring out. >>> + */ >>> + if (rc != -EAGAIN && rc != 0) >>> + unpin_user_pages(pages, nr_pinned_pages); >>> + >>> } while (rc == -EAGAIN); >> >> Wouldn't it be cleaner to simply have here after the loop (possibly >> even after the memalloc_pin_restore()) >> >> if (rc) >> unpin_user_pages(pages, nr_pinned_pages); >> >> But maybe I am missing something. > > I initially thought the same thing but I'm not sure it is > correct. Consider what happens when __get_user_pages_locked() fails > earlier in the loop. In this case it will have bailed out of the loop > with rc <= 0 but we shouldn't call unpin_user_pages(). Ah, I see what you mean, I primarily only stared at the diff. We should likely avoid using nr_pinned_pages as a temporary variable that can hold an error value.
On 10/17/24 2:47 PM, David Hildenbrand wrote: > On 17.10.24 23:28, Alistair Popple wrote: >> David Hildenbrand <david@redhat.com> writes: >>> On 16.10.24 22:22, John Hubbard wrote: ... >>>> + if (rc != -EAGAIN && rc != 0) >>>> + unpin_user_pages(pages, nr_pinned_pages); >>>> + >>>> } while (rc == -EAGAIN); >>> >>> Wouldn't it be cleaner to simply have here after the loop (possibly >>> even after the memalloc_pin_restore()) >>> >>> if (rc) >>> unpin_user_pages(pages, nr_pinned_pages); >>> >>> But maybe I am missing something. >> >> I initially thought the same thing but I'm not sure it is >> correct. Consider what happens when __get_user_pages_locked() fails >> earlier in the loop. In this case it will have bailed out of the loop >> with rc <= 0 but we shouldn't call unpin_user_pages(). doh. yes. Thanks for catching that, Alistair! I actually considered it during the first draft, too, but got tunnel vision during the review, sigh. > > Ah, I see what you mean, I primarily only stared at the diff. > > We should likely avoid using nr_pinned_pages as a temporary variable that > can hold an error value. > OK, I still want to defer all the pretty refactoring ideas into some future effort, so for now, let's just leave v1 alone except for fixing the typo in the comment, yes? I'll still send out a 2-patch series with that, plus a suitably modified fix for the memfd case too. thanks,
On 17.10.24 23:57, John Hubbard wrote: > On 10/17/24 2:47 PM, David Hildenbrand wrote: >> On 17.10.24 23:28, Alistair Popple wrote: >>> David Hildenbrand <david@redhat.com> writes: >>>> On 16.10.24 22:22, John Hubbard wrote: > ... >>>>> + if (rc != -EAGAIN && rc != 0) >>>>> + unpin_user_pages(pages, nr_pinned_pages); >>>>> + >>>>> } while (rc == -EAGAIN); >>>> >>>> Wouldn't it be cleaner to simply have here after the loop (possibly >>>> even after the memalloc_pin_restore()) >>>> >>>> if (rc) >>>> unpin_user_pages(pages, nr_pinned_pages); >>>> >>>> But maybe I am missing something. >>> >>> I initially thought the same thing but I'm not sure it is >>> correct. Consider what happens when __get_user_pages_locked() fails >>> earlier in the loop. In this case it will have bailed out of the loop >>> with rc <= 0 but we shouldn't call unpin_user_pages(). > > doh. yes. Thanks for catching that, Alistair! I actually considered > it during the first draft, too, but got tunnel vision during the > review, sigh. > >> >> Ah, I see what you mean, I primarily only stared at the diff. >> >> We should likely avoid using nr_pinned_pages as a temporary variable that >> can hold an error value. >> > > OK, I still want to defer all the pretty refactoring ideas into some > future effort, so for now, let's just leave v1 alone except for fixing > the typo in the comment, yes? Fine with me! Acked-by: David Hildenbrand <david@redhat.com>
diff --git a/mm/gup.c b/mm/gup.c index a82890b46a36..24acf53c8294 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm, /* FOLL_LONGTERM implies FOLL_PIN */ rc = check_and_migrate_movable_pages(nr_pinned_pages, pages); + + /* + * The __get_user_pages_locked() call happens before we know + * that whether it's possible to successfully complete the whole + * operation. To compensate for this, if we get an unexpected + * error (such as -ENOMEM) then we must unpin everything, before + * erroring out. + */ + if (rc != -EAGAIN && rc != 0) + unpin_user_pages(pages, nr_pinned_pages); + } while (rc == -EAGAIN); memalloc_pin_restore(flags); return rc ? rc : nr_pinned_pages;
If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) family of functions, and requests "too many" pages, then the call will erroneously leave pages pinned. This is visible in user space as an actual memory leak. Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls to exhaust memory. The root cause of the problem is this sequence, within __gup_longterm_locked(): __get_user_pages_locked() rc = check_and_migrate_movable_pages() ...which gets retried in a loop. The loop error handling is incomplete, clearly due to a somewhat unusual and complicated tri-state error API. But anyway, if -ENOMEM, or in fact, any unexpected error is returned from check_and_migrate_movable_pages(), then __gup_longterm_locked() happily returns the error, while leaving the pages pinned. In the failed case, which is an app that requests (via a device driver) 30720000000 bytes to be pinned, and then exits, I see this: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 2048 And after applying this patch, it returns to balanced pins: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 7502048 Fix this by unpinning the pages that __get_user_pages_locked() has pinned, in such error cases. Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") Cc: Alistair Popple <apopple@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> --- mm/gup.c | 11 +++++++++++ 1 file changed, 11 insertions(+)