
mm/gup: restore the ability to pin more than 2GB at a time

Message ID 20241030030116.670307-1-jhubbard@nvidia.com (mailing list archive)
State New
Series mm/gup: restore the ability to pin more than 2GB at a time

Commit Message

John Hubbard Oct. 30, 2024, 3:01 a.m. UTC
commit 53ba78de064b ("mm/gup: introduce
check_and_migrate_movable_folios()") created a new constraint on the
pin_user_pages*() API family: a potentially large allocation must now
occur internally.

A user-visible consequence has now appeared: user space can no longer
pin more than 2GB of memory on x86_64. That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointer array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().

Fix this (restore the original behavior) by replacing kmalloc_array()
with kvmalloc_array(), which falls back to vmalloc() for larger
allocations.

Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
Cc: linux-stable@vger.kernel.org

Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---

This applies to mm-hotfixes-unstable (only), because it relies on my
earlier patch to this exact same location: commit 255231c75dcd ("mm/gup:
stop leaking pinned pages in low memory conditions").

thanks,
John Hubbard

 mm/gup.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


base-commit: b70a32bbebeae216a3e846e01965880b309ca173

Comments

Christoph Hellwig Oct. 30, 2024, 4:21 a.m. UTC | #1
On Tue, Oct 29, 2024 at 08:01:16PM -0700, John Hubbard wrote:
> A user-visible consequence has now appeared: user space can no longer
> pin more than 2GB of memory on x86_64. That's because, on a 4KB
> PAGE_SIZE system, when user space tries to (indirectly, via a device
> driver that calls pin_user_pages()) pin 2GB, this requires an allocation
> of a folio pointer array of MAX_PAGE_ORDER size, which is the limit for
> kmalloc().

Do you have a report where someone tries to pin that much memory in a
single call?  What driver is this?  Because it seems like a not very
smart thing to do.
John Hubbard Oct. 30, 2024, 4:30 a.m. UTC | #2
On 10/29/24 9:21 PM, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:01:16PM -0700, John Hubbard wrote:
>> A user-visible consequence has now appeared: user space can no longer
>> pin more than 2GB of memory on x86_64. That's because, on a 4KB
>> PAGE_SIZE system, when user space tries to (indirectly, via a device
>> driver that calls pin_user_pages()) pin 2GB, this requires an allocation
>> of a folio pointer array of MAX_PAGE_ORDER size, which is the limit for
>> kmalloc().
> 
> Do you have a report where someone tries to pin that much memory in a
> single call?  What driver is this?  Because it seems like a not very
> smart thing to do.
> 

I do, yes. And what happens is that when you use GPUs, drivers like
to pin system memory, and then point the GPU page tables to that
memory. For older GPUs that don't support replayable page faults,
that's required.

So this behavior has been around forever.

The customer was qualifying their software and noticed that before
Linux 6.10, they could allocate >2GB, and with 6.11, they could
not.

Whether it is "wise" for user space to allocate that much at once
is a reasonable question, but at least one place is (or was!) doing
it.


thanks,
Christoph Hellwig Oct. 30, 2024, 4:33 a.m. UTC | #3
On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
> I do, yes. And what happens is that when you use GPUs, drivers like
> to pin system memory, and then point the GPU page tables to that
> memory. For older GPUs that don't support replayable page faults,
> that's required.
> 
> So this behavior has been around forever.
> 
> The customer was qualifying their software and noticed that before
> Linux 6.10, they could allocate >2GB, and with 6.11, they could
> not.
> 
> Whether it is "wise" for user space to allocate that much at once
> is a reasonable question, but at least one place is (or was!) doing
> it.

Still missing a callchain, which makes me suspect that it is your weird
out of tree driver, in which case this simply does not matter.
John Hubbard Oct. 30, 2024, 4:39 a.m. UTC | #4
On 10/29/24 9:33 PM, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
>> I do, yes. And what happens is that when you use GPUs, drivers like
>> to pin system memory, and then point the GPU page tables to that
>> memory. For older GPUs that don't support replayable page faults,
>> that's required.
>>
>> So this behavior has been around forever.
>>
>> The customer was qualifying their software and noticed that before
>> Linux 6.10, they could allocate >2GB, and with 6.11, they could
>> not.
>>
>> Whether it is "wise" for user space to allocate that much at once
>> is a reasonable question, but at least one place is (or was!) doing
>> it.
> 
> Still missing a callchain, which makes me suspect that it is your weird
> out of tree driver, in which case this simply does not matter.
> 

I expect I could piece together something with Nouveau, given enough
time and help from Ben Skeggs and Danillo and all...

Yes, this originated with the out of tree driver. But it never occurred
to me that upstream would be uninterested in an obvious fix to an obvious
regression.


thanks,
Christoph Hellwig Oct. 30, 2024, 4:42 a.m. UTC | #5
On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
> I expect I could piece together something with Nouveau, given enough
> time and help from Ben Skeggs and Danillo and all...
> 
> Yes, this originated with the out of tree driver. But it never occurred
> to me that upstream would be uninterested in an obvious fix to an obvious
> regression.

Because pinning down these amounts of memory is completely insane.
I don't mind the switch to kvmalloc, but we need to put in an upper
bound of what can be pinned.
John Hubbard Oct. 30, 2024, 4:44 a.m. UTC | #6
On 10/29/24 9:42 PM, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
>> I expect I could piece together something with Nouveau, given enough
>> time and help from Ben Skeggs and Danillo and all...
>>
>> Yes, this originated with the out of tree driver. But it never occurred
>> to me that upstream be uninterested in an obvious fix to an obvious
>> regression.
> 
> Because pinning down these amounts of memory is completely insane.
> I don't mind the switch to kvmalloc, but we need to put in an upper
> bound of what can be pinned.

I'm wondering though, how it is that we decide how much of the user's
system we prevent them from using? :)  People with hardware accelerators
do not always have page fault capability, and yet these troublesome
users insist on stacking their system full of DRAM and then pointing
the accelerator to it.

How would we choose a value? Memory sizes keep going up...


thanks,
Alistair Popple Oct. 30, 2024, 6:18 a.m. UTC | #7
John Hubbard <jhubbard@nvidia.com> writes:

> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
>>> I expect I could piece together something with Nouveau, given enough
>>> time and help from Ben Skeggs and Danillo and all...
>>>
>>> Yes, this originated with the out of tree driver. But it never occurred
>>> to me that upstream would be uninterested in an obvious fix to an obvious
>>> regression.
>> Because pinning down these amounts of memory is completely insane.
>> I don't mind the switch to kvmalloc, but we need to put in an upper
>> bound of what can be pinned.
>
> I'm wondering though, how it is that we decide how much of the user's
> system we prevent them from using? :)  People with hardware accelerators
> do not always have page fault capability, and yet these troublesome
> users insist on stacking their system full of DRAM and then pointing
> the accelerator to it.
>
> How would we choose a value? Memory sizes keep going up...

The obvious answer is you let users decide. I did have a patch series to
do that via a cgroup[1]. However I dropped that series mostly because I
couldn't find any users of such a limit to provide feedback on how they
would use it or how they wanted it to work.

- Alistair

[1] - https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/

> thanks,
John Hubbard Oct. 30, 2024, 6:50 a.m. UTC | #8
On 10/29/24 11:18 PM, Alistair Popple wrote:
> John Hubbard <jhubbard@nvidia.com> writes:
>> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
...
>>> Because pinning down these amounts of memory is completely insane.
>>> I don't mind the switch to kvmalloc, but we need to put in an upper
>>> bound of what can be pinned.
>>
>> I'm wondering though, how it is that we decide how much of the user's
>> system we prevent them from using? :)  People with hardware accelerators
>> do not always have page fault capability, and yet these troublesome
>> users insist on stacking their system full of DRAM and then pointing
>> the accelerator to it.
>>
>> How would we choose a value? Memory sizes keep going up...
> 
> The obvious answer is you let users decide. I did have a patch series to
> do that via a cgroup[1]. However I dropped that series mostly because I
> couldn't find any users of such a limit to provide feedback on how they
> would use it or how they wanted it to work.
> 

Trawling through the discussion there, I see that Jason Gunthorpe mentioned:

"Things like VFIO & KVM use cases effectively pin 90% of all system memory"

...which means that we'll be able to get that in-tree call trace that
Christoph is asking for, pretty soon. No GPUs required. :)


thanks,
David Hildenbrand Oct. 30, 2024, 8:34 a.m. UTC | #9
On 30.10.24 07:50, John Hubbard wrote:
> On 10/29/24 11:18 PM, Alistair Popple wrote:
>> John Hubbard <jhubbard@nvidia.com> writes:
>>> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>>>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
> ...
>>>> Because pinning down these amounts of memory is completely insane.
>>>> I don't mind the switch to kvmalloc, but we need to put in an upper
>>>> bound of what can be pinned.
>>>
>>> I'm wondering though, how it is that we decide how much of the user's
>>> system we prevent them from using? :)  People with hardware accelerators
>>> do not always have page fault capability, and yet these troublesome
>>> users insist on stacking their system full of DRAM and then pointing
>>> the accelerator to it.
>>>
>>> How would we choose a value? Memory sizes keep going up...
>>
>> The obvious answer is you let users decide. I did have a patch series to
>> do that via a cgroup[1]. However I dropped that series mostly because I
>> couldn't find any users of such a limit to provide feedback on how they
>> would use it or how they wanted it to work.
>>
> 
> Trawling through the discussion there, I see that Jason Gunthorpe mentioned:
> 
> "Things like VFIO & KVM use cases effectively pin 90% of all system memory"

The unusual thing is not the amount of system memory we are pinning but 
*how many* pages we try pinning in the single call.

If you stare at vfio_pin_pages_remote, we seem to be batching it.

long req_pages = min_t(long, npage, batch->capacity);

Which is

#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))


So you can fix this in your driver ;)


We should maybe try a similar limit internally: if you call 
pin_user_pages_remote() with a large number, we'll cap it at some magic 
value (similar to above). The caller will simply realize that not all 
pages were pinned and will retry.

See get_user_pages_remote(): "Returns either number of pages pinned 
(which may be less than the number requested), or an error. Details 
about the return value:"


Alternatively, I recall there was a way to avoid the temporary 
allocation ... let me hack up a prototype real quick.
David Hildenbrand Oct. 30, 2024, 9:01 a.m. UTC | #10
On 30.10.24 09:34, David Hildenbrand wrote:
> On 30.10.24 07:50, John Hubbard wrote:
>> On 10/29/24 11:18 PM, Alistair Popple wrote:
>>> John Hubbard <jhubbard@nvidia.com> writes:
>>>> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>>>>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
>> ...
>>>>> Because pinning down these amounts of memory is completely insane.
>>>>> I don't mind the switch to kvmalloc, but we need to put in an upper
>>>>> bound of what can be pinned.
>>>>
>>>> I'm wondering though, how it is that we decide how much of the user's
>>>> system we prevent them from using? :)  People with hardware accelerators
>>>> do not always have page fault capability, and yet these troublesome
>>>> users insist on stacking their system full of DRAM and then pointing
>>>> the accelerator to it.
>>>>
>>>> How would we choose a value? Memory sizes keep going up...
>>>
>>> The obvious answer is you let users decide. I did have a patch series to
>>> do that via a cgroup[1]. However I dropped that series mostly because I
>>> couldn't find any users of such a limit to provide feedback on how they
>>> would use it or how they wanted it to work.
>>>
>>
>> Trawling through the discussion there, I see that Jason Gunthorpe mentioned:
>>
>> "Things like VFIO & KVM use cases effectively pin 90% of all system memory"
> 
> The unusual thing is not the amount of system memory we are pinning but
> *how many* pages we try pinning in the single call.
> 
> If you stare at vfio_pin_pages_remote, we seem to be batching it.
> 
> long req_pages = min_t(long, npage, batch->capacity);
> 
> Which is
> 
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> 
> 
> So you can fix this in your driver ;)
> 
> 
> We should maybe try a similar limit internally: if you call
> pin_user_pages_remote() with a large number, we'll cap it at some magic
> value (similar to above). The caller will simply realize that not all
> pages were pinned and will retry.
> 
> See get_user_pages_remote(): "Returns either number of pages pinned
> (which may be less than the number requested), or an error. Details
> about the return value:"
> 
> 
> Alternatively, I recall there was a way to avoid the temporary
> allocation ... let me hack up a prototype real quick.

Completely untested (also note the interesting TODO):


 From a23984b4f1a39ec984489fbe16708aedf4f9db95 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 30 Oct 2024 10:00:50 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/gup.c | 120 ++++++++++++++++++++++++++++++++++++-------------------
  1 file changed, 78 insertions(+), 42 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..8807b36c2363 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2273,20 +2273,56 @@ struct page *get_dump_page(unsigned long addr)
  #endif /* CONFIG_ELF_CORE */
  
  #ifdef CONFIG_MIGRATION
+
+/*
+ * An array of either pages or folios. While we could currently interpret a
+ * list of folios like a list of pages, it would only work as long as
+ * "struct folio" overlays "struct page" -- and it would not allow for
+ * avoiding page_folio() calls.
+ */
+struct pages_or_folios {
+	union {
+		struct page **pages;
+		struct folio **folios;
+		void **entries;
+	};
+	bool has_folios;
+	long nr_entries;
+};
+
+static struct folio *pofs_get_folio(struct pages_or_folios *pofs, long i)
+{
+	if (pofs->has_folios)
+		return pofs->folios[i];
+	return page_folio(pofs->pages[i]);
+}
+
+static void pofs_clear_entry(struct pages_or_folios *pofs, long i)
+{
+	pofs->entries[i] = NULL;
+}
+
+static void pofs_unpin(struct pages_or_folios *pofs)
+{
+	if (pofs->has_folios)
+		unpin_folios(pofs->folios, pofs->nr_entries);
+	else
+		unpin_user_pages(pofs->pages, pofs->nr_entries);
+}
+
  /*
   * Returns the number of collected folios. Return value is always >= 0.
   */
  static unsigned long collect_longterm_unpinnable_folios(
-					struct list_head *movable_folio_list,
-					unsigned long nr_folios,
-					struct folio **folios)
+		struct list_head *movable_folio_list,
+		struct pages_or_folios *pofs)
  {
  	unsigned long i, collected = 0;
  	struct folio *prev_folio = NULL;
  	bool drain_allow = true;
  
-	for (i = 0; i < nr_folios; i++) {
-		struct folio *folio = folios[i];
+	for (i = 0; i < pofs->nr_entries; i++) {
+		struct folio *folio = pofs_get_folio(pofs, i);
  
  		if (folio == prev_folio)
  			continue;
@@ -2310,6 +2346,11 @@ static unsigned long collect_longterm_unpinnable_folios(
  			drain_allow = false;
  		}
  
+		/*
+		 * TODO: if isolation fails we might already have it in the
+		 * list, if pages of different folios are interleaved
+		 * (e.g., COW). We might want to check all entries in the list.
+		 */
  		if (!folio_isolate_lru(folio))
  			continue;
  
@@ -2328,15 +2369,14 @@ static unsigned long collect_longterm_unpinnable_folios(
   * failure (or partial success).
   */
  static int migrate_longterm_unpinnable_folios(
-					struct list_head *movable_folio_list,
-					unsigned long nr_folios,
-					struct folio **folios)
+		struct list_head *movable_folio_list,
+		struct pages_or_folios *pofs)
  {
  	int ret;
  	unsigned long i;
  
-	for (i = 0; i < nr_folios; i++) {
-		struct folio *folio = folios[i];
+	for (i = 0; i < pofs->nr_entries; i++) {
+		struct folio *folio = pofs_get_folio(pofs, i);
  
  		if (folio_is_device_coherent(folio)) {
  			/*
@@ -2344,7 +2384,7 @@ static int migrate_longterm_unpinnable_folios(
  			 * convert the pin on the source folio to a normal
  			 * reference.
  			 */
-			folios[i] = NULL;
+			pofs_clear_entry(pofs, i);
  			folio_get(folio);
  			gup_put_folio(folio, 1, FOLL_PIN);
  
@@ -2363,8 +2403,8 @@ static int migrate_longterm_unpinnable_folios(
  		 * calling folio_isolate_lru() which takes a reference so the
  		 * folio won't be freed if it's migrating.
  		 */
-		unpin_folio(folios[i]);
-		folios[i] = NULL;
+		unpin_folio(pofs_get_folio(pofs, i));
+		pofs_clear_entry(pofs, i);
  	}
  
  	if (!list_empty(movable_folio_list)) {
@@ -2387,12 +2427,24 @@ static int migrate_longterm_unpinnable_folios(
  	return -EAGAIN;
  
  err:
-	unpin_folios(folios, nr_folios);
+	pofs_unpin(pofs);
  	putback_movable_pages(movable_folio_list);
  
  	return ret;
  }
  
+static long check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs)
+{
+	LIST_HEAD(movable_folio_list);
+	unsigned long collected;
+
+	collected = collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
+	if (!collected)
+		return 0;
+
+	return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs);
+}
+
  /*
   * Check whether all folios are *allowed* to be pinned indefinitely (longterm).
   * Rather confusingly, all folios in the range are required to be pinned via
@@ -2412,41 +2464,25 @@ static int migrate_longterm_unpinnable_folios(
  static long check_and_migrate_movable_folios(unsigned long nr_folios,
  					     struct folio **folios)
  {
-	unsigned long collected;
-	LIST_HEAD(movable_folio_list);
-
-	collected = collect_longterm_unpinnable_folios(&movable_folio_list,
-						       nr_folios, folios);
-	if (!collected)
-		return 0;
+	struct pages_or_folios pofs = {
+		.folios = folios,
+		.has_folios = true,
+		.nr_entries = nr_folios,
+	};
  
-	return migrate_longterm_unpinnable_folios(&movable_folio_list,
-						  nr_folios, folios);
+	return check_and_migrate_movable_pages_or_folios(&pofs);
  }
  
-/*
- * This routine just converts all the pages in the @pages array to folios and
- * calls check_and_migrate_movable_folios() to do the heavy lifting.
- *
- * Please see the check_and_migrate_movable_folios() documentation for details.
- */
+/* See check_and_migrate_movable_folios(). */
  static long check_and_migrate_movable_pages(unsigned long nr_pages,
  					    struct page **pages)
  {
-	struct folio **folios;
-	long i, ret;
+	struct pages_or_folios pofs = {
+		.pages = pages,
+		.nr_entries = nr_pages,
+	};
  
-	folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
-	if (!folios)
-		return -ENOMEM;
-
-	for (i = 0; i < nr_pages; i++)
-		folios[i] = page_folio(pages[i]);
-
-	ret = check_and_migrate_movable_folios(nr_pages, folios);
-
-	kfree(folios);
-	return ret;
+	return check_and_migrate_movable_pages_or_folios(&pofs);
  }
  #else
  static long check_and_migrate_movable_pages(unsigned long nr_pages,
Vlastimil Babka Oct. 30, 2024, 11:03 a.m. UTC | #11
On 10/30/24 05:39, John Hubbard wrote:
> On 10/29/24 9:33 PM, Christoph Hellwig wrote:
>> On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
>>> I do, yes. And what happens is that when you use GPUs, drivers like
>>> to pin system memory, and then point the GPU page tables to that
>>> memory. For older GPUs that don't support replayable page faults,
>>> that's required.
>>>
>>> So this behavior has been around forever.
>>>
>>> The customer was qualifying their software and noticed that before
>>> Linux 6.10, they could allocate >2GB, and with 6.11, they could
>>> not.
>>>
>>> Whether it is "wise" for user space to allocate that much at once
>>> is a reasonable question, but at least one place is (or was!) doing
>>> it.
>> 
>> Still missing a callchain, which makes me suspect that it is your weird
>> out of tree driver, in which case this simply does not matter.
>> 
> 
> I expect I could piece together something with Nouveau, given enough
> time and help from Ben Skeggs and Danillo and all...
> 
> Yes, this originated with the out of tree driver. But it never occurred
> to me that upstream would be uninterested in an obvious fix to an obvious
> regression.

It might be a regression even if you don't try to pin over 2GB. High-order
(>costly order) allocations can fail and/or cause disruptive
reclaim/compaction cycles even below MAX_PAGE_ORDER, so it's better to use
kvmalloc if physical contiguity is not needed: it will attempt the physical
kmalloc() allocation with __GFP_NORETRY (little disruption) and fall back to
vmalloc() quickly.

Of course, if there's a way to avoid the allocation completely, even better.

> 
> thanks,
Jason Gunthorpe Oct. 30, 2024, 11:59 a.m. UTC | #12
On Tue, Oct 29, 2024 at 09:42:10PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
> > I expect I could piece together something with Nouveau, given enough
> > time and help from Ben Skeggs and Danillo and all...
> > 
> > Yes, this originated with the out of tree driver. But it never occurred
> > to me that upstream would be uninterested in an obvious fix to an obvious
> > regression.
> 
> Because pinning down these amounts of memory is completely insane.
> I don't mind the switch to kvmalloc, but we need to put in an upper
> bound of what can be pinned.

I have RDMA customers that pin basically the entire machine's
memory, which is 100s of GBs of RAM. It is a dedicated server doing some
HPC or database job and it has been carefully set up to do that.

Many VFIO VM deployments will pin almost all the memory too - just a
little left over for the hypervisor. Again the machine was carefully
set up to have huge pages that cover almost all system RAM and are
dedicated to backing memory for VMs.

Pinning almost all memory is a legitimate use case.

Jason
Jason Gunthorpe Oct. 30, 2024, 12:04 p.m. UTC | #13
On Wed, Oct 30, 2024 at 09:34:51AM +0100, David Hildenbrand wrote:

> The unusual thing is not the amount of system memory we are pinning but *how
> many* pages we try pinning in the single call.
> 
> If you stare at vfio_pin_pages_remote, we seem to be batching it.
> 
> long req_pages = min_t(long, npage, batch->capacity);
> 
> Which is
> 
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> 
> So you can fix this in your driver ;)

Yeah, everything batches that I'm aware of. RDMA also uses a 4k batch
size, and iommufd uses 64k.

Jason
John Hubbard Oct. 30, 2024, 5:25 p.m. UTC | #14
On 10/30/24 5:04 AM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2024 at 09:34:51AM +0100, David Hildenbrand wrote:
> 
>> The unusual thing is not the amount of system memory we are pinning but *how
>> many* pages we try pinning in the single call.
>>
>> If you stare at vfio_pin_pages_remote, we seem to be batching it.
>>
>> long req_pages = min_t(long, npage, batch->capacity);
>>
>> Which is
>>
>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>
>> So you can fix this in your driver ;)
> 
> Yeah, everything batches that I'm aware of. RDMA also uses a 4k batch
> size, and iommufd uses 64k.
> 
> Jason

Yes. It's a surprise, but the driver can do that.


thanks,
John Hubbard Oct. 30, 2024, 5:29 p.m. UTC | #15
On 10/30/24 4:03 AM, Vlastimil Babka wrote:
> On 10/30/24 05:39, John Hubbard wrote:
>> On 10/29/24 9:33 PM, Christoph Hellwig wrote:
>>> On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
...
> It might be a regression even if you don't try to pin over 2GB. high-order
> (>costly order) allocations can fail and/or cause disruptive
> reclaim/compaction cycles even below MAX_PAGE_ORDER and it's better to use
> kvmalloc if physical contiguity is not needed, it will attempt the physical
> kmalloc() allocation with __GFP_NORETRY (little disruption) and fallback to
> vmalloc() quickly.
> 
> Of course if there's a way to avoid the allocation completely, even better.

Why not both? I'm going to ask our driver team to batch the pinning calls,
as recommended nearby, just to be sure that we are following best
practices.

But it also seems good to use kvmalloc() here, and avoid any other
regressions. That's also a best practice.

thanks,
Vlastimil Babka Oct. 30, 2024, 5:42 p.m. UTC | #16
On 10/30/24 18:29, John Hubbard wrote:
> On 10/30/24 4:03 AM, Vlastimil Babka wrote:
>> On 10/30/24 05:39, John Hubbard wrote:
>>> On 10/29/24 9:33 PM, Christoph Hellwig wrote:
>>>> On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
> ...
>> It might be a regression even if you don't try to pin over 2GB. high-order
>> (>costly order) allocations can fail and/or cause disruptive
>> reclaim/compaction cycles even below MAX_PAGE_ORDER and it's better to use
>> kvmalloc if physical contiguity is not needed, it will attempt the physical
>> kmalloc() allocation with __GFP_NORETRY (little disruption) and fallback to
>> vmalloc() quickly.
>> 
>> Of course if there's a way to avoid the allocation completely, even better.
> 
> Why not both? I'm going to ask our driver team to batch the pinning calls,
> as recommended nearby, just to be sure that we are following best
> practices.
> 
> But it also seems good to use kvmalloc() here, and avoid any other
> regressions. That's also a best practice.

By "avoid the allocation completely" I meant David's proof of concept
elsewhere in this thread, which seems to replace that kmalloc_array() with no
allocation :)

> thanks,
John Hubbard Oct. 30, 2024, 5:49 p.m. UTC | #17
On 10/30/24 10:42 AM, Vlastimil Babka wrote:
> On 10/30/24 18:29, John Hubbard wrote:
>> On 10/30/24 4:03 AM, Vlastimil Babka wrote:
>>> On 10/30/24 05:39, John Hubbard wrote:
>>>> On 10/29/24 9:33 PM, Christoph Hellwig wrote:
>>>>> On Tue, Oct 29, 2024 at 09:30:41PM -0700, John Hubbard wrote:
>> ...
>>> It might be a regression even if you don't try to pin over 2GB. high-order
>>> (>costly order) allocations can fail and/or cause disruptive
>>> reclaim/compaction cycles even below MAX_PAGE_ORDER and it's better to use
>>> kvmalloc if physical contiguity is not needed, it will attempt the physical
>>> kmalloc() allocation with __GFP_NORETRY (little disruption) and fallback to
>>> vmalloc() quickly.
>>>
>>> Of course if there's a way to avoid the allocation completely, even better.
>>
>> Why not both? I'm going to ask our driver team to batch the pinning calls,
>> as recommended nearby, just to be sure that we are following best
>> practices.
>>
>> But it also seems good to use kvmalloc() here, and avoid any other
>> regressions. That's also a best practice.
> 
> By "avoid the allocation completely" I meant David's proof of concept
> elsewhere in this thread, that seems to replace that kmalloc_array() with no
> allocation :)
> 

aha, OK let me look into that then.

thanks,
John Hubbard Oct. 30, 2024, 6:34 p.m. UTC | #18
On 10/30/24 2:01 AM, David Hildenbrand wrote:
> On 30.10.24 09:34, David Hildenbrand wrote:
>> On 30.10.24 07:50, John Hubbard wrote:
>>> On 10/29/24 11:18 PM, Alistair Popple wrote:
>>>> John Hubbard <jhubbard@nvidia.com> writes:
>>>>> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>>>>>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
>>> ...
>> We should maybe try a similar limit internally: if you call
>> pin_user_pages_remote() with a large number, we'll cap it at some magic
>> value (similar to above). The caller will simply realize that not all
>> pages were pinned and will retry.
>>
>> See get_user_pages_remote(): "Returns either number of pages pinned
>> (which may be less than the number requested), or an error. Details
>> about the return value:"
>>

Yes. The API has a retry built in, and so far it has been vague about
any upper limit. So if we want such a limit, it might be helpful to
actually put in a PIN_USER_PAGES_MAX_LENGTH, in order to make it
clear to all the callers.

 From a very high level design perspective, it's not yet clear to me
that there is either a "preferred" or "not recommended" aspect to
pinning in batches vs. all at once here, as long as one stays
below the type (int, long, unsigned...) limits of the API. Batching
seems like what you do if the internal implementation is crippled
and unable to meet its API requirements. So the fact that many
callers do batching is sort of "tail wags dog".

But anyway:

>>
>> Alternatively, I recall there was a way to avoid the temporary
>> allocation ... let me hack up a prototype real quick.
> 
> Completely untested (also note the interesting TODO):
> 

Thanks for doing such a quick proof of concept, much appreciated.
This approach seems reasonable, and I do like the idea of avoiding
a new allocation deep within pin_user_pages(). Callers already had
to allocate **pages and it seemed unfortunate to do another one.

I'll try this out.


thanks,
Jason Gunthorpe Oct. 31, 2024, 12:02 a.m. UTC | #19
On Wed, Oct 30, 2024 at 11:34:49AM -0700, John Hubbard wrote:

> From a very high level design perspective, it's not yet clear to me
> that there is either a "preferred" or "not recommended" aspect to
> pinning in batches vs. all at once here, as long as one stays
> below the type (int, long, unsigned...) limits of the API. Batching
> seems like what you do if the internal implementation is crippled
> and unable to meet its API requirements. So the fact that many
> callers do batching is sort of "tail wags dog".

No.. all things need to do batching because nothing should be storing
a linear struct page array that is so enormous. That is going to
create vmemmap pressure that is not desirable.

For instance rdma pins in batches and copies the pins into a scatter
list and never has an allocation over PAGE_SIZE.

iommufd transfers them into a radix tree.

It is not so much that there is a limit, but that good kernel code
just *shouldn't* be allocating gigantic contiguous memory arrays at
all.

Jason
John Hubbard Oct. 31, 2024, 12:17 a.m. UTC | #20
On 10/30/24 5:02 PM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2024 at 11:34:49AM -0700, John Hubbard wrote:
> 
>>  From a very high level design perspective, it's not yet clear to me
>> that there is either a "preferred" or "not recommended" aspect to
>> pinning in batches vs. all at once here, as long as one stays
>> below the type (int, long, unsigned...) limits of the API. Batching
>> seems like what you do if the internal implementation is crippled
>> and unable to meet its API requirements. So the fact that many
>> callers do batching is sort of "tail wags dog".
> 
> No.. all things need to do batching because nothing should be storing
> a linear struct page array that is so enormous. That is going to
> create vmemmap pressure that is not desirable.

Are we talking about the same allocation size here? It's not 2GB. It
is enough folio pointers to cover 2GB of memory, so 4MB.

That's not really much pressure.

> 
> For instance rdma pins in batches and copies the pins into a scatter
> list and never has an allocation over PAGE_SIZE.
> 
> iommufd transfers them into a radix tree.
> 
> It is not so much that there is a limit, but that good kernel code
> just *shouldn't* be allocating gigantic contiguous memory arrays at
> all.

That high level guidance makes sense, but here we are attempting only
a 4MB physically contiguous allocation, and if larger than that, then
it goes to vmalloc() which is merely virtually contiguous.

I'm writing this because your adjectives make me suspect that you
are referring to a 2GB allocation. But this is orders of magnitude
smaller.

thanks,
Jason Gunthorpe Oct. 31, 2024, 12:25 a.m. UTC | #21
On Wed, Oct 30, 2024 at 05:17:25PM -0700, John Hubbard wrote:
> On 10/30/24 5:02 PM, Jason Gunthorpe wrote:
> > On Wed, Oct 30, 2024 at 11:34:49AM -0700, John Hubbard wrote:
> > 
> > >  From a very high level design perspective, it's not yet clear to me
> > > that there is either a "preferred" or "not recommended" aspect to
> > > pinning in batches vs. all at once here, as long as one stays
> > > below the type (int, long, unsigned...) limits of the API. Batching
> > > seems like what you do if the internal implementation is crippled
> > > and unable to meet its API requirements. So the fact that many
> > > callers do batching is sort of "tail wags dog".
> > 
> > No.. all things need to do batching because nothing should be storing
> > a linear struct page array that is so enormous. That is going to
> > create vmemmap pressure that is not desirable.
> 
> Are we talking about the same allocation size here? It's not 2GB. It
> is enough folio pointers to cover 2GB of memory, so 4MB.

Is 2GB a hard limit? I was expecting this was a range that had upper
bounds of 100GB's like for rdma.. Then it is 400MB, and yeah, that is
not great.

> That high level guidance makes sense, but here we are attempting only
> a 4MB physically contiguous allocation, and if larger than that, then
> it goes to vmalloc() which is merely virtually contiguous.

AFAIK any contiguous allocation beyond 4K basically doesn't work
reliably in a server environment due to fragmentation.

So you are always using the vmemmap..

> I'm writing this because your adjectives make me suspect that you
> are referring to a 2GB allocation. But this is orders of magnitude
> smaller.

Even 4MB I would wonder about getting it split to PAGE_SIZE chunks
instead of vmemmap, but I don't know what it is being used for.

Jason
John Hubbard Oct. 31, 2024, 12:47 a.m. UTC | #22
On 10/30/24 5:25 PM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2024 at 05:17:25PM -0700, John Hubbard wrote:
>> On 10/30/24 5:02 PM, Jason Gunthorpe wrote:
>>> On Wed, Oct 30, 2024 at 11:34:49AM -0700, John Hubbard wrote:
>>>
>>>>   From a very high level design perspective, it's not yet clear to me
>>>> that there is either a "preferred" or "not recommended" aspect to
>>>> pinning in batches vs. all at once here, as long as one stays
>>>> below the type (int, long, unsigned...) limits of the API. Batching
>>>> seems like what you do if the internal implementation is crippled
>>>> and unable to meet its API requirements. So the fact that many
>>>> callers do batching is sort of "tail wags dog".
>>>
>>> No.. all things need to do batching because nothing should be storing
>>> a linear struct page array that is so enormous. That is going to
>>> create vmemmap pressure that is not desirable.
>>
>> Are we talking about the same allocation size here? It's not 2GB. It
>> is enough folio pointers to cover 2GB of memory, so 4MB.
> 
> Is 2GB a hard limit? I was expecting this was a range that had upper
> bounds of 100GB's like for rdma.. Then it is 400MB, and yeah, that is
> not great.
>

No, 2GB (original allocation, thus 4MB real allocation) is just the point at
which the page alloc code typically switches over from kmalloc to vmalloc
(internal to kvmalloc)--for a freshly booted machine, that is.

For some reason, I've had "a few GB" in mind as kind of a "likely as much as
people will request" limit, rather than 100's of GB, just from what I've seen.
However, I don't have much additional data about how user space (which does
the allocation requests, in the end) behaves, either. Maybe it is actually
quite rare to do such large allocation requests. Or maybe not.

But yes, if this went 10x+ higher, it would definitely be "too much".


>> That high level guidance makes sense, but here we are attempting only
>> a 4MB physically contiguous allocation, and if larger than that, then
>> it goes to vmalloc() which is merely virtually contiguous.
> 
> AFAIK any contiguous allocation beyond 4K basically doesn't work
> reliably in a server environment due to fragmentation.
> 
> So you are always using the vmemap..
> 
>> I'm writing this because your adjectives make me suspect that you
>> are referring to a 2GB allocation. But this is orders of magnitude
>> smaller.
> 
> Even 4MB I would wonder about getting it split to PAGE_SIZE chunks
> instead of vmemmap, but I don't know what it is being used for.
> 

For a 64-bit system, I think we have quite a healthy chunk of vmalloc() space
(ignoring, with some effort, the multiple KASLR bugs that have recently been
messing that up), right? I mean, your points about keeping kernel allocations
small, or at least reasonable, are resonating with me, but it's also true
that the numbers are much bigger on 64-bit systems.


thanks,

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 4637dab7b54f..346186788a49 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -21,6 +21,7 @@ 
 #include <linux/pagevec.h>
 #include <linux/sched/mm.h>
 #include <linux/shmem_fs.h>
+#include <linux/vmalloc.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -2439,7 +2440,7 @@  static long check_and_migrate_movable_pages(unsigned long nr_pages,
 	struct folio **folios;
 	long i, ret;
 
-	folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
+	folios = kvmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
 	if (!folios) {
 		unpin_user_pages(pages, nr_pages);
 		return -ENOMEM;
@@ -2450,7 +2451,7 @@  static long check_and_migrate_movable_pages(unsigned long nr_pages,
 
 	ret = check_and_migrate_movable_folios(nr_pages, folios);
 
-	kfree(folios);
+	kvfree(folios);
 	return ret;
 }
 #else