diff mbox series

[RFC,v3,3/3] ktest: sys_move_phys_pages ktest

Message ID 20240319172609.332900-4-gregory.price@memverge.com (mailing list archive)
State New
Headers show
Series move_phys_pages syscall - migrate page contents given | expand

Commit Message

Gregory Price March 19, 2024, 5:26 p.m. UTC
Implement simple ktest that looks up the physical address via
/proc/self/pagemap and migrates the page based on that information.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 tools/testing/selftests/mm/migration.c | 99 ++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

Comments

Matthew Wilcox (Oracle) March 19, 2024, 5:52 p.m. UTC | #1
On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> Implement simple ktest that looks up the physical address via
> /proc/self/pagemap and migrates the page based on that information.

What?  LOL.  No.
Matthew Wilcox (Oracle) March 19, 2024, 6:08 p.m. UTC | #2
On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > Implement simple ktest that looks up the physical address via
> > /proc/self/pagemap and migrates the page based on that information.
> 
> What?  LOL.  No.

Also, how is this v3 and the first one to land on linux-mm?

https://lore.kernel.org/linux-mm/?q=move_phys_pages

Also, where is the syscall itself?  The only thing here is the ktest.
Gregory Price March 19, 2024, 6:14 p.m. UTC | #3
On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > Implement simple ktest that looks up the physical address via
> > /proc/self/pagemap and migrates the page based on that information.
> 
> What?  LOL.  No.
> 

Certainly the test is stupid and requires admin, but I could not
come up an easier test to demonstrate the concept - and the docs
say to include a test with all syscall proposals.

Am I missing something else important?
(stupid question: of course I am, but alas I must ask it)

~Gregory
Gregory Price March 19, 2024, 6:16 p.m. UTC | #4
On Tue, Mar 19, 2024 at 06:08:39PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > Implement simple ktest that looks up the physical address via
> > > /proc/self/pagemap and migrates the page based on that information.
> > 
> > What?  LOL.  No.
> 
> Also, how is this v3 and the first one to land on linux-mm?
> 
> https://lore.kernel.org/linux-mm/?q=move_phys_pages
> 
> Also, where is the syscall itself?  The only thing here is the ktest.
>

OH, I see the confusion now.

There were two other versions, and I have experienced this delivery
failure before, i'm not sure why the other commits have not been delivered.

Let me look into this.

~Gregory
Gregory Price March 19, 2024, 6:18 p.m. UTC | #5
On Tue, Mar 19, 2024 at 02:16:01PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 06:08:39PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > Implement simple ktest that looks up the physical address via
> > > > /proc/self/pagemap and migrates the page based on that information.
> > > 
> > > What?  LOL.  No.
> > 
> > Also, how is this v3 and the first one to land on linux-mm?
> > 
> > https://lore.kernel.org/linux-mm/?q=move_phys_pages
> > 
> > Also, where is the syscall itself?  The only thing here is the ktest.
> >
> 
> OH, I see the confusion now.
> 
> There were two other versions, and I have experienced this delivery
> failure before, i'm not sure why the other commits have not been delivered.
> 
> Let me look into this.
> 
> ~Gregory
> 

Full set of patches:
https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/

I've experienced silent linux-mm delivery failures like this before, I
still do not understand this issue.

v1:
https://lore.kernel.org/all/20230907075453.350554-1-gregory.price@memverge.com/
v2:
https://lore.kernel.org/all/20230919230909.530174-1-gregory.price@memverge.com/

~Gregory
Matthew Wilcox (Oracle) March 19, 2024, 6:20 p.m. UTC | #6
On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > Implement simple ktest that looks up the physical address via
> > > /proc/self/pagemap and migrates the page based on that information.
> > 
> > What?  LOL.  No.
> > 
> 
> Certainly the test is stupid and requires admin, but I could not
> come up an easier test to demonstrate the concept - and the docs
> say to include a test with all syscall proposals.
> 
> Am I missing something else important?
> (stupid question: of course I am, but alas I must ask it)

It's not that the test is stupid.  It's the concept that's stupid.
Gregory Price March 19, 2024, 6:32 p.m. UTC | #7
On Tue, Mar 19, 2024 at 06:20:33PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > Implement simple ktest that looks up the physical address via
> > > > /proc/self/pagemap and migrates the page based on that information.
> > > 
> > > What?  LOL.  No.
> > > 
> > 
> > Certainly the test is stupid and requires admin, but I could not
> > come up an easier test to demonstrate the concept - and the docs
> > say to include a test with all syscall proposals.
> > 
> > Am I missing something else important?
> > (stupid question: of course I am, but alas I must ask it)
> 
> It's not that the test is stupid.  It's the concept that's stupid.

Ok i'll bite.

The 2 major ways page-hotness is detected right now is page-faults
(induced or otherwise) and things like IBS/PEBS.

page-faults cause overhead, and IBS/PEBS actually miss upwards of ~66%
of all traffic (if you want the details i can dig up the presentation,
but TL;DR: prefetcher traffic is missed entirely).

so OCP folks have been proposing hotness-tracking offloaded to the
memory devices themselves:

https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

(it's come along further than this white paper, but i need to dig up
the new information).

These devices are incapable of providing virtual addressing information,
and doing reverse lookups of addresses is inordinately expensive from
user space.  This leaves: Do it all in a kernel task, or give user space
an an interface to operate on data provided by the device.

The syscall design is mostly being posted right now to collaborate via
public channels, but if the idea is so fundamentally offensive then i'll
drop it and relay the opinion accordingly.

~Gregory
Matthew Wilcox (Oracle) March 19, 2024, 6:38 p.m. UTC | #8
On Tue, Mar 19, 2024 at 02:32:17PM -0400, Gregory Price wrote:
> On Tue, Mar 19, 2024 at 06:20:33PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 19, 2024 at 02:14:33PM -0400, Gregory Price wrote:
> > > On Tue, Mar 19, 2024 at 05:52:46PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Mar 19, 2024 at 01:26:09PM -0400, Gregory Price wrote:
> > > > > Implement simple ktest that looks up the physical address via
> > > > > /proc/self/pagemap and migrates the page based on that information.
> > > > 
> > > > What?  LOL.  No.
> > > > 
> > > 
> > > Certainly the test is stupid and requires admin, but I could not
> > > come up an easier test to demonstrate the concept - and the docs
> > > say to include a test with all syscall proposals.
> > > 
> > > Am I missing something else important?
> > > (stupid question: of course I am, but alas I must ask it)
> > 
> > It's not that the test is stupid.  It's the concept that's stupid.
> 
> Ok i'll bite.
> 
> The 2 major ways page-hotness is detected right now is page-faults
> (induced or otherwise) and things like IBS/PEBS.
> 
> page-faults cause overhead, and IBS/PEBS actually miss upwards of ~66%
> of all traffic (if you want the details i can dig up the presentation,
> but TL;DR: prefetcher traffic is missed entirely).
> 
> so OCP folks have been proposing hotness-tracking offloaded to the
> memory devices themselves:
> 
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> 
> (it's come along further than this white paper, but i need to dig up
> the new information).
> 
> These devices are incapable of providing virtual addressing information,
> and doing reverse lookups of addresses is inordinately expensive from
> user space.  This leaves: Do it all in a kernel task, or give user space
> an an interface to operate on data provided by the device.
> 
> The syscall design is mostly being posted right now to collaborate via
> public channels, but if the idea is so fundamentally offensive then i'll
> drop it and relay the opinion accordingly.

The syscall design is wrong.  Exposing physical addresses to userspace
is never the right answer.  Think rowhammer.

I'm vehemently opposed to all of the bullshit around CXL.  However, if you
are going to propose something, it should be based around an abstraction.
Say "We have 8 pools of memory.  This VMA is backed by memory from pools
3 & 6.  The relative hotness of the 8 pools are <vector>.  The quantities
of memory in the 8 ppols are <vector>".  And then you can say "migrate
this range of memory to pool 2".

That's just an initial response to the idea.  I refuse to invest a
serious amount of time in a dead-end idea like CXL memory pooling.
Gregory Price March 19, 2024, 6:50 p.m. UTC | #9
On Tue, Mar 19, 2024 at 06:38:23PM +0000, Matthew Wilcox wrote:
> > The syscall design is mostly being posted right now to collaborate via
> > public channels, but if the idea is so fundamentally offensive then i'll
> > drop it and relay the opinion accordingly.
> 
> The syscall design is wrong.  Exposing physical addresses to userspace
> is never the right answer.  Think rowhammer.
> 

1) The syscall does not expose physical addresses information, it
   consumes it.

2) The syscall does not allow the user to select target physical address
   only the target node. Now, that said, if source-pages are zeroed on
   migration, that's definitely a concern.  I did not see this to be the
   case, however, and the frequency of write required to make use of
   that for rowhammer seems to be a mitigating factor.

3) there exist 4 interfaces which do expose physical address information
   - /proc/pid/pagemap
   - perf / IBS and PEBs
   - zoneinfo
   - /sys/kerne/mm/page_idle (PFNs)

4) The syscall requires CAP_SYS_ADMIN because these other sources
   require the same, though as v1/v2 discussed there could be an
   argument for CAP_SYS_NIDE.

> I'm vehemently opposed to all of the bullshit around CXL.  However, if you
> are going to propose something, it should be based around an abstraction.
> Say "We have 8 pools of memory.  This VMA is backed by memory from pools
> 3 & 6.  The relative hotness of the 8 pools are <vector>.  The quantities
> of memory in the 8 ppols are <vector>".  And then you can say "migrate
> this range of memory to pool 2".
> 
> That's just an initial response to the idea.  I refuse to invest a
> serious amount of time in a dead-end idea like CXL memory pooling.

Who said anything about pools? Local memory expanders are capable of
hosting hotness tracking offload.

~Gregory
diff mbox series

Patch

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 6908569ef406..c005c98dbdc1 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -5,6 +5,8 @@ 
  */
 
 #include "../kselftest_harness.h"
+#include <stdint.h>
+#include <stdio.h>
 #include <strings.h>
 #include <pthread.h>
 #include <numa.h>
@@ -14,11 +16,17 @@ 
 #include <sys/types.h>
 #include <signal.h>
 #include <time.h>
+#include <unistd.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (20)
 
+#define GET_BIT(X, Y) ((X & ((uint64_t)1<<Y)) >> Y)
+#define GET_PFN(X) (X & 0x7FFFFFFFFFFFFFull)
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+#define PAGEMAP_ENTRY 8
+const int __endian_bit = 1;
+#define is_bigendian() ((*(char *)&__endian_bit) == 0)
 
 FIXTURE(migration)
 {
@@ -94,6 +102,45 @@  int migrate(uint64_t *ptr, int n1, int n2)
 	return 0;
 }
 
+int migrate_phys(uint64_t paddr, int n1, int n2)
+{
+	int ret, tmp;
+	int status = 0;
+	struct timespec ts1, ts2;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
+		return -1;
+
+	while (1) {
+		if (clock_gettime(CLOCK_MONOTONIC, &ts2))
+			return -1;
+
+		if (ts2.tv_sec - ts1.tv_sec >= RUNTIME)
+			return 0;
+
+		/*
+		 * FIXME: move_phys_pages was syscall 462 during RFC.
+		 * Update this when an official syscall number is adopted
+		 * and the libnuma interface is implemented.
+		 */
+		ret = syscall(462, 1, (void **) &paddr, &n2, &status,
+			      MPOL_MF_MOVE_ALL);
+		if (ret) {
+			if (ret > 0)
+				printf("Didn't migrate %d pages\n", ret);
+			else
+				perror("Couldn't migrate pages");
+			return -2;
+		}
+
+		tmp = n2;
+		n2 = n1;
+		n1 = tmp;
+	}
+
+	return 0;
+}
+
 void *access_mem(void *ptr)
 {
 	volatile uint64_t y = 0;
@@ -199,4 +246,56 @@  TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
 
+/*
+ * Same as the basic migration, but test move_phys_pages.
+ */
+TEST_F_TIMEOUT(migration, phys_addr, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	uint64_t pagemap_val, paddr, file_offset;
+	unsigned char c_buf[PAGEMAP_ENTRY];
+	int i, c, status;
+	FILE *f;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	ptr = mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE,
+		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	memset(ptr, 0xde, TWOMEG);
+
+	/* PFN of ptr from /proc/self/pagemap */
+	f = fopen("/proc/self/pagemap", "rb");
+	file_offset = ((uint64_t)ptr) / getpagesize() * PAGEMAP_ENTRY;
+	status = fseek(f, file_offset, SEEK_SET);
+	ASSERT_EQ(status, 0);
+	for (i = 0; i < PAGEMAP_ENTRY; i++) {
+		c = getc(f);
+		ASSERT_NE(c, EOF);
+		/* handle endiand differences */
+		if (is_bigendian())
+			c_buf[i] = c;
+		else
+			c_buf[PAGEMAP_ENTRY - i - 1] = c;
+	}
+	fclose(f);
+
+	for (i = 0; i < PAGEMAP_ENTRY; i++)
+		pagemap_val = (pagemap_val << 8) + c_buf[i];
+
+	ASSERT_TRUE(GET_BIT(pagemap_val, 63));
+	/* This reports a pfn, we need to shift this by page size */
+	paddr = GET_PFN(pagemap_val) << __builtin_ctz(getpagesize());
+
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate_phys(paddr, self->n1, self->n2), 0);
+	for (i = 0; i < self->nthreads - 1; i++)
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+}
+
 TEST_HARNESS_MAIN