mbox series

[RFC,0/6] Add support for shared PTEs across processes

Message ID cover.1642526745.git.khalid.aziz@oracle.com (mailing list archive)
Headers show
Series Add support for shared PTEs across processes | expand

Message

Khalid Aziz Jan. 18, 2022, 9:19 p.m. UTC
Page tables in kernel consume some of the memory and as long as
number of mappings being maintained is small enough, this space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number
of times pages are shared goes up, amount of memory consumed by page
tables starts to become significant.

Some of the field deployments commonly see memory pages shared
across 1000s of processes. On x86_64, each page requires a PTE that
is only 8 bytes long which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
each one of them requires 8 bytes for its PTE and together that adds
up to 8K of memory just to hold the PTEs for one 4K page. On a
database server with 300GB SGA, a system carsh was seen with
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, in the
worst case scenario of all 1500 processes mapping every page from
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, amount of memory saved is very significant.

This is a proposal to implement a mechanism in kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it mshare'd region) which can be used
by other processes to map same pages using shared PTEs. Other
process(es), assuming they have the right permissions, can then make
the mashare() system call to map the shared pages into their address
space using the shared PTEs.  When a process is done using this
mshare'd region, it makes a mshare_unlink() system call to end its
access. When the last process accessing mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and memory used by
it is freed.


API Proposal
============

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing mshare'd
region that will be shared at PTE level. "name" refers to shared object
name that exists under /sys/fs/mshare. "addr" is the starting address
of this shared memory area and length is the size of this area.
oflags can be one of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
  exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at pgdir level and hence it imposes following
requirements on the address and size given to the mshare():

- Starting address must be aligned to pgdir size (512GB on x86_64)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
  shared automatically
- Shared address range can have unmapped addresses in it. Any access
  to unmapped address will result in SIGBUS

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the  shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============

Snippet of the code that a donor process would run looks like below:

-----------------
        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, 0, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
			GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------

Snippet of code that a consumer process would execute looks like:

-----------------
        fd = open("testregion", O_RDONLY);
        if (fd < 0) {
                perror("open failed");
                exit(1);
        }

        if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
                printf("INFO: %ld bytes shared at addr %lx \n",
				mshare_info[1], mshare_info[0]);
        else
                perror("read failed");

        close(fd);

        addr = (char *)mshare_info[0];
        err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
			mshare_info[1], O_RDWR, 600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        printf("Guest mmap at %px:\n", addr);
        printf("%s\n", addr);
	printf("\nDone\n");

        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
        if (err < 0) {
                perror("mshare_unlink() failed");
                exit(1);
        }
-----------------


RFC Proposal
============

This series of RFC patches is a prototype implementation of these
two system calls. This code is just a prototype with many bugs and
limitations and is incomplete. It works well enough to show how
this concept will work.

Prototype for the two syscalls is:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
		unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)

In oreder to facilitate page table sharing, this implemntation adds
a new in-memory filesystem - msharefs which will be mounted at
/sys/fs/mshare. When a new mshare'd region is
created, a file with the name given by initial mshare() call is
created under this filesystem. Permissions for this file are given
by the "mode" parameter to mshare(). The only operation supported on
this file is read. A read from this file returns two long values -
(1) starting virtual address for the region, and (2) size of
mshare'd region.

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with O_CREAT oflag.
mshare() syscall then creates a new mm_struct which will host the
page tables for the mshare'd region.  vma->vm_mm for the vmas
covering address range for this region are updated to point to this
new mm_struct instead of current->mm.  Existing page tables are
copied over to new mm struct.

A consumer process that wants to map mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
mshare'd region. It then calls mshare() with these values to map the
entire region in its address space. Consumer process calls
mshare_unlink() to terminate its access.

I would very much appreciate feedback on - (1) the API and how it
defines the PTE sharing concept, and (2) core of the initial
implementation. The file mm/mshare.c contains a list of current
issues/bugs/holes in the code at this time at the top of the file.
If the general direction of this work looks reasonable, I will
continue working on adding the missing code including much of error
handling and fixing bugs.


Khalid Aziz (6):
  mm: Add new system calls mshare, mshare_unlink
  mm: Add msharefs filesystem
  mm: Add read for msharefs
  mm: implement mshare_unlink syscall
  mm: Add locking to msharefs syscalls
  mm: Add basic page table sharing using mshare

 Documentation/filesystems/msharefs.rst |  19 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/mm.h                     |   8 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/magic.h             |   1 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            |  35 +-
 mm/mshare.c                            | 604 +++++++++++++++++++++++++
 10 files changed, 682 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

Comments

Dave Hansen Jan. 18, 2022, 9:41 p.m. UTC | #1
On 1/18/22 1:19 PM, Khalid Aziz wrote:
> - Starting address must be aligned to pgdir size (512GB on x86_64)

How does this work on systems with 5-level paging where a top-level page
table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
 How does userspace figure out which alignment requirement it is subject to?
Matthew Wilcox (Oracle) Jan. 18, 2022, 9:46 p.m. UTC | #2
On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
> > - Starting address must be aligned to pgdir size (512GB on x86_64)
> 
> How does this work on systems with 5-level paging where a top-level page
> table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
>  How does userspace figure out which alignment requirement it is subject to?

The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
but I suspect Khalid has simplified it for this first implementation.
Dave Hansen Jan. 18, 2022, 10:06 p.m. UTC | #3
On 1/18/22 1:19 PM, Khalid Aziz wrote:
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.

One thing that went over my head here was that this allowing sharing of
relatively arbitrary *EXISTING* regions.  The mshared'd region might be
anonymous or an plain mmap()'d file.  It can even be a filesystem or
device DAX mmap().

In other words, donors can (ideally) share anything.  Consumers have
must use msharefs to access the donated areas.

Right?

( btw... thanks to willy for the correction on IRC.)
Khalid Aziz Jan. 18, 2022, 10:47 p.m. UTC | #4
On 1/18/22 14:46, Matthew Wilcox wrote:
> On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
>> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>
>> How does this work on systems with 5-level paging where a top-level page
>> table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
>>   How does userspace figure out which alignment requirement it is subject to?
> 
> The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
> but I suspect Khalid has simplified it for this first implementation.
> 

Hi Dave,

Yes, this is mostly to keep code somewhat simpler. Large regions make it easier to manage the separate set of shared 
VMAs. Part of the exploration here is to see what size regions work for other people. This initial prototype is x86 only 
and for now I am using PGDIR_SIZE. I see your point about how would userspace figure out alignment since this should 
across all architectures and PGDIR_SIZE/PMD_SIZE/PUD_SIZE are not the same across architectures. We can choose a fixed 
size and alignment. I would like to keep the region size at 2^20 pages or larger to minimize having to manage large 
number of small shared regions if those regions are not contiguous. Do you have any suggestion?

Thanks for your feedback.

--
Khalid
Khalid Aziz Jan. 18, 2022, 10:52 p.m. UTC | #5
On 1/18/22 15:06, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.
> 
> One thing that went over my head here was that this allowing sharing of
> relatively arbitrary *EXISTING* regions.  The mshared'd region might be
> anonymous or an plain mmap()'d file.  It can even be a filesystem or
> device DAX mmap().
> 
> In other words, donors can (ideally) share anything.  Consumers have
> must use msharefs to access the donated areas.
> 
> Right?
> 
> ( btw... thanks to willy for the correction on IRC.)
> 

Hi Dave,

Consumers use msharefs only to get information on address and size of shared region. Access to the donated are does not 
go through msharefs. So the consumer opens the file in msharefs to read starting address and size:

         fd = open("testregion", O_RDONLY);

         if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
                 printf("INFO: %ld bytes shared at addr %lx \n",
                                 mshare_info[1], mshare_info[0]);
         else
                 perror("read failed");

         close(fd);

It then uses that information to map in the donated region:

         addr = (char *)mshare_info[0];
         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
                         mshare_info[1], O_RDWR, 600);

Makes sense?

Thanks,
Khalid
Mark Hemment Jan. 19, 2022, 11:38 a.m. UTC | #6
On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system carsh was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
>
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>
>
> API Proposal
> ============
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr %lx \n",
>                                 mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>                         mshare_info[1], O_RDWR, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------
...
Hi Khalid,

The proposed mshare() appears to be similar to POSIX shared memory,
but with two extra (related) attributes;
a) Internally, uses shared page tables.
b) Shared memory is mapped at same address for all users.

Rather than introduce two new system calls, along with /sys/ file to
communicate global addresses, could mshare() be built on top of shmem
API?  Thinking of something like the below;
1) For shm_open(3), add a new oflag to indicate the properties needed
for mshare() (say, O_SHARED_PTE - better name?)
2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
in the sizes which can be set.
3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
objects.  On first mmap()ing an appropiate address is assigned,
otherwise the current 'global' address is used.
4) shm_unlink(3) destroys the object when last reference is dropped.

For 3), might be able to weaken the NULL requirement and validate a
given address on first mapping to ensure it is correctly aligned.
shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
the default behaviour you require.

Internally, the handling of mshare()/O_SHARED_PTE memory might be
sufficiently different to shmem that there is not much code sharing
between the two (I haven't thought this through, but the object
naming/refcounting should be similiar), but using shmem would be a
familiar API.

Any thoughts?

Cheers,
Mark
Khalid Aziz Jan. 19, 2022, 5:02 p.m. UTC | #7
On 1/19/22 04:38, Mark Hemment wrote:
> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system carsh was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API Proposal
>> ============
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>          if (addr == MAP_FAILED)
>>                  perror("ERROR: mmap failed");
>>
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          strncpy(addr, "Some random shared text",
>>                          sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>>          fd = open("testregion", O_RDONLY);
>>          if (fd < 0) {
>>                  perror("open failed");
>>                  exit(1);
>>          }
>>
>>          if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>                  printf("INFO: %ld bytes shared at addr %lx \n",
>>                                  mshare_info[1], mshare_info[0]);
>>          else
>>                  perror("read failed");
>>
>>          close(fd);
>>
>>          addr = (char *)mshare_info[0];
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>                          mshare_info[1], O_RDWR, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          printf("Guest mmap at %px:\n", addr);
>>          printf("%s\n", addr);
>>          printf("\nDone\n");
>>
>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>          if (err < 0) {
>>                  perror("mshare_unlink() failed");
>>                  exit(1);
>>          }
>> -----------------
> ...
> Hi Khalid,
> 
> The proposed mshare() appears to be similar to POSIX shared memory,
> but with two extra (related) attributes;
> a) Internally, uses shared page tables.
> b) Shared memory is mapped at same address for all users.

Hi Mark,

You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region 
access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions. 
mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object. 
Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different 
from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to 
leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a 
separate implementation. I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share 
a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.

Does that rationale sound reasonable?

Thanks,
Khalid

> 
> Rather than introduce two new system calls, along with /sys/ file to
> communicate global addresses, could mshare() be built on top of shmem
> API?  Thinking of something like the below;
> 1) For shm_open(3), add a new oflag to indicate the properties needed
> for mshare() (say, O_SHARED_PTE - better name?)
> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> in the sizes which can be set.
> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> objects.  On first mmap()ing an appropiate address is assigned,
> otherwise the current 'global' address is used.
> 4) shm_unlink(3) destroys the object when last reference is dropped.
> 
> For 3), might be able to weaken the NULL requirement and validate a
> given address on first mapping to ensure it is correctly aligned.
> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> the default behaviour you require.
> 
> Internally, the handling of mshare()/O_SHARED_PTE memory might be
> sufficiently different to shmem that there is not much code sharing
> between the two (I haven't thought this through, but the object
> naming/refcounting should be similiar), but using shmem would be a
> familiar API.
> 
> Any thoughts?
> 
> Cheers,
> Mark
>
Mark Hemment Jan. 20, 2022, 12:49 p.m. UTC | #8
On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/19/22 04:38, Mark Hemment wrote:
> > On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> Page tables in kernel consume some of the memory and as long as
> >> number of mappings being maintained is small enough, this space
> >> consumed by page tables is not objectionable. When very few memory
> >> pages are shared between processes, the number of page table entries
> >> (PTEs) to maintain is mostly constrained by the number of pages of
> >> memory on the system. As the number of shared pages and the number
> >> of times pages are shared goes up, amount of memory consumed by page
> >> tables starts to become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared
> >> across 1000s of processes. On x86_64, each page requires a PTE that
> >> is only 8 bytes long which is very small compared to the 4K page
> >> size. When 2000 processes map the same page in their address space,
> >> each one of them requires 8 bytes for its PTE and together that adds
> >> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >> database server with 300GB SGA, a system carsh was seen with
> >> out-of-memory condition when 1500+ clients tried to share this SGA
> >> even though the system had 512GB of memory. On this server, in the
> >> worst case scenario of all 1500 processes mapping every page from
> >> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >> could be shared, amount of memory saved is very significant.
> >>
> >> This is a proposal to implement a mechanism in kernel to allow
> >> userspace processes to opt into sharing PTEs. The proposal is to add
> >> a new system call - mshare(), which can be used by a process to
> >> create a region (we will call it mshare'd region) which can be used
> >> by other processes to map same pages using shared PTEs. Other
> >> process(es), assuming they have the right permissions, can then make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API Proposal
> >> ============
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>    exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >>
> >> PTEs are shared at pgdir level and hence it imposes following
> >> requirements on the address and size given to the mshare():
> >>
> >> - Starting address must be aligned to pgdir size (512GB on x86_64)
> >> - Size must be a multiple of pgdir size
> >> - Any mappings created in this address range at any time become
> >>    shared automatically
> >> - Shared address range can have unmapped addresses in it. Any access
> >>    to unmapped address will result in SIGBUS
> >>
> >> Mappings within this address range behave as if they were shared
> >> between threads, so a write to a MAP_PRIVATE mapping will create a
> >> page which is shared between all the sharers. The first process that
> >> declares an address range mshare'd can continue to map objects in
> >> the shared area. All other processes that want mshare'd access to
> >> this memory area can do so by calling mshare(). After this call, the
> >> address range given by mshare becomes a shared range in its address
> >> space. Anonymous mappings will be shared and not COWed.
> >>
> >> A file under /sys/fs/mshare can be opened and read from. A read from
> >> this file returns two long values - (1) starting address, and (2)
> >> size of the mshare'd region.
> >>
> >> --
> >> int mshare_unlink(char *name)
> >>
> >> A shared address range created by mshare() can be destroyed using
> >> mshare_unlink() which removes the  shared named object. Once all
> >> processes have unmapped the shared object, the shared address range
> >> references are de-allocated and destroyed.
> >>
> >> mshare_unlink() returns 0 on success or -1 on error.
> >>
> >>
> >> Example Code
> >> ============
> >>
> >> Snippet of the code that a donor process would run looks like below:
> >>
> >> -----------------
> >>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>          if (addr == MAP_FAILED)
> >>                  perror("ERROR: mmap failed");
> >>
> >>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >>          if (err < 0) {
> >>                  perror("mshare() syscall failed");
> >>                  exit(1);
> >>          }
> >>
> >>          strncpy(addr, "Some random shared text",
> >>                          sizeof("Some random shared text"));
> >> -----------------
> >>
> >> Snippet of code that a consumer process would execute looks like:
> >>
> >> -----------------
> >>          fd = open("testregion", O_RDONLY);
> >>          if (fd < 0) {
> >>                  perror("open failed");
> >>                  exit(1);
> >>          }
> >>
> >>          if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
> >>                  printf("INFO: %ld bytes shared at addr %lx \n",
> >>                                  mshare_info[1], mshare_info[0]);
> >>          else
> >>                  perror("read failed");
> >>
> >>          close(fd);
> >>
> >>          addr = (char *)mshare_info[0];
> >>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> >>                          mshare_info[1], O_RDWR, 600);
> >>          if (err < 0) {
> >>                  perror("mshare() syscall failed");
> >>                  exit(1);
> >>          }
> >>
> >>          printf("Guest mmap at %px:\n", addr);
> >>          printf("%s\n", addr);
> >>          printf("\nDone\n");
> >>
> >>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>          if (err < 0) {
> >>                  perror("mshare_unlink() failed");
> >>                  exit(1);
> >>          }
> >> -----------------
> > ...
> > Hi Khalid,
> >
> > The proposed mshare() appears to be similar to POSIX shared memory,
> > but with two extra (related) attributes;
> > a) Internally, uses shared page tables.
> > b) Shared memory is mapped at same address for all users.
>
> Hi Mark,
>
> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
> separate implementation.

Ah, I jumped in assuming this was only for anon memory.

> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.

So possible you have already considered the below...which does make
the API a little more POSIX shm like.

The mshare() syscall does two operations;
1) create/open mshare object
2) export/import the given memory region

Would it be better if these were seperate operations?  That is,
mshare_open() (say) creates/opens the object returning a file
descriptor.  The fd used as the identifier for the export/import after
mmap(2); eg.
addr = mshare_op(EXPORT, fd, addr, size);
addr = mshare_op(IMPORT, fd, NULL, 0);
(Not sure about export/import terms..)

The benefit of the the separate ops is the file descriptor.  This
could be used for fstat(2) (and fchown(2)?), although not sure how
much value this would add.

The 'importer' would use the address/size of the memory region as
exported (and stored in msharefs), so no need for /sys file (except
for human readable info).

If the set-up operations are split in two, then would it make sense to
also split the teardown as well?  Say, mshare_op(DROP, fd) and
mshare_unlink(fd)?

>
> Does that rationale sound reasonable?

It doesn't sound unreasonable.  As msharefs is providing a namespace
and perms, it doesn't need much flexibility.  Being able to modifying
the perms post namespace creation (fchown(2)), before exporting the
memory region, might be useful in some cases - but as I don't have any
usecases I'm not claiming it is essential.

>
> Thanks,
> Khalid

Cheers,
Mark
>
> >
> > Rather than introduce two new system calls, along with /sys/ file to
> > communicate global addresses, could mshare() be built on top of shmem
> > API?  Thinking of something like the below;
> > 1) For shm_open(3), add a new oflag to indicate the properties needed
> > for mshare() (say, O_SHARED_PTE - better name?)
> > 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> > in the sizes which can be set.
> > 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> > objects.  On first mmap()ing an appropiate address is assigned,
> > otherwise the current 'global' address is used.
> > 4) shm_unlink(3) destroys the object when last reference is dropped.
> >
> > For 3), might be able to weaken the NULL requirement and validate a
> > given address on first mapping to ensure it is correctly aligned.
> > shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> > the default behaviour you require.
> >
> > Internally, the handling of mshare()/O_SHARED_PTE memory might be
> > sufficiently different to shmem that there is not much code sharing
> > between the two (I haven't thought this through, but the object
> > naming/refcounting should be similiar), but using shmem would be a
> > familiar API.
> >
> > Any thoughts?
> >
> > Cheers,
> > Mark
> >
>
Khalid Aziz Jan. 20, 2022, 7:15 p.m. UTC | #9
On 1/20/22 05:49, Mark Hemment wrote:
> On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/19/22 04:38, Mark Hemment wrote:
>>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> Page tables in kernel consume some of the memory and as long as
>>>> number of mappings being maintained is small enough, this space
>>>> consumed by page tables is not objectionable. When very few memory
>>>> pages are shared between processes, the number of page table entries
>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>> memory on the system. As the number of shared pages and the number
>>>> of times pages are shared goes up, amount of memory consumed by page
>>>> tables starts to become significant.
>>>>
>>>> Some of the field deployments commonly see memory pages shared
>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>> is only 8 bytes long which is very small compared to the 4K page
>>>> size. When 2000 processes map the same page in their address space,
>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>> database server with 300GB SGA, a system carsh was seen with
>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>> even though the system had 512GB of memory. On this server, in the
>>>> worst case scenario of all 1500 processes mapping every page from
>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>> could be shared, amount of memory saved is very significant.
>>>>
>>>> This is a proposal to implement a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>> a new system call - mshare(), which can be used by a process to
>>>> create a region (we will call it mshare'd region) which can be used
>>>> by other processes to map same pages using shared PTEs. Other
>>>> process(es), assuming they have the right permissions, can then make
>>>> the mashare() system call to map the shared pages into their address
>>>> space using the shared PTEs.  When a process is done using this
>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>> access. When the last process accessing mshare'd region calls
>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>> it is freed.
>>>>
>>>>
>>>> API Proposal
>>>> ============
>>>>
>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>
>>>> --
>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>
>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>> region that will be shared at PTE level. "name" refers to shared object
>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>> of this shared memory area and length is the size of this area.
>>>> oflags can be one of:
>>>>
>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>> - O_RDWR opens shared memory area for read and write access
>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>     exists with that name, return an error.
>>>>
>>>> mode represents the creation mode for the shared object under
>>>> /sys/fs/mshare.
>>>>
>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>
>>>> PTEs are shared at pgdir level and hence it imposes following
>>>> requirements on the address and size given to the mshare():
>>>>
>>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>>> - Size must be a multiple of pgdir size
>>>> - Any mappings created in this address range at any time become
>>>>     shared automatically
>>>> - Shared address range can have unmapped addresses in it. Any access
>>>>     to unmapped address will result in SIGBUS
>>>>
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in
>>>> the shared area. All other processes that want mshare'd access to
>>>> this memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>>
>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>> this file returns two long values - (1) starting address, and (2)
>>>> size of the mshare'd region.
>>>>
>>>> --
>>>> int mshare_unlink(char *name)
>>>>
>>>> A shared address range created by mshare() can be destroyed using
>>>> mshare_unlink() which removes the  shared named object. Once all
>>>> processes have unmapped the shared object, the shared address range
>>>> references are de-allocated and destroyed.
>>>>
>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>
>>>>
>>>> Example Code
>>>> ============
>>>>
>>>> Snippet of the code that a donor process would run looks like below:
>>>>
>>>> -----------------
>>>>           addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>>                           MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>>           if (addr == MAP_FAILED)
>>>>                   perror("ERROR: mmap failed");
>>>>
>>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>>                           GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>>>           if (err < 0) {
>>>>                   perror("mshare() syscall failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           strncpy(addr, "Some random shared text",
>>>>                           sizeof("Some random shared text"));
>>>> -----------------
>>>>
>>>> Snippet of code that a consumer process would execute looks like:
>>>>
>>>> -----------------
>>>>           fd = open("testregion", O_RDONLY);
>>>>           if (fd < 0) {
>>>>                   perror("open failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>>>                   printf("INFO: %ld bytes shared at addr %lx \n",
>>>>                                   mshare_info[1], mshare_info[0]);
>>>>           else
>>>>                   perror("read failed");
>>>>
>>>>           close(fd);
>>>>
>>>>           addr = (char *)mshare_info[0];
>>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>>>                           mshare_info[1], O_RDWR, 600);
>>>>           if (err < 0) {
>>>>                   perror("mshare() syscall failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           printf("Guest mmap at %px:\n", addr);
>>>>           printf("%s\n", addr);
>>>>           printf("\nDone\n");
>>>>
>>>>           err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>>           if (err < 0) {
>>>>                   perror("mshare_unlink() failed");
>>>>                   exit(1);
>>>>           }
>>>> -----------------
>>> ...
>>> Hi Khalid,
>>>
>>> The proposed mshare() appears to be similar to POSIX shared memory,
>>> but with two extra (related) attributes;
>>> a) Internally, uses shared page tables.
>>> b) Shared memory is mapped at same address for all users.
>>
>> Hi Mark,
>>
>> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
>> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
>> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
>> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
>> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
>> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
>> separate implementation.
> 
> Ah, I jumped in assuming this was only for anon memory.
> 
>> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
>> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
> 
> So possible you have already considered the below...which does make
> the API a little more POSIX shm like.
> 
> The mshare() syscall does two operations;
> 1) create/open mshare object
> 2) export/import the given memory region
> 
> Would it be better if these were seperate operations?  That is,
> mshare_open() (say) creates/opens the object returning a file
> descriptor.  The fd used as the identifier for the export/import after
> mmap(2); eg.
> addr = mshare_op(EXPORT, fd, addr, size);
> addr = mshare_op(IMPORT, fd, NULL, 0);
> (Not sure about export/import terms..)
> 
> The benefit of the the separate ops is the file descriptor.  This
> could be used for fstat(2) (and fchown(2)?), although not sure how
> much value this would add.

Hi Mark,

That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a 
regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region.

> 
> The 'importer' would use the address/size of the memory region as
> exported (and stored in msharefs), so no need for /sys file (except
> for human readable info).

I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size, 
I assume you are not proposing eliminating msharefs.

> 
> If the set-up operations are split in two, then would it make sense to
> also split the teardown as well?  Say, mshare_op(DROP, fd) and
> mshare_unlink(fd)?

A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care 
of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that 
created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I 
would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the 
mshare'd region disappearing on its own when all references to it are dropped.

Thanks,
Khalid

> 
>>
>> Does that rationale sound reasonable?
> 
> It doesn't sound unreasonable.  As msharefs is providing a namespace
> and perms, it doesn't need much flexibility.  Being able to modifying
> the perms post namespace creation (fchown(2)), before exporting the
> memory region, might be useful in some cases - but as I don't have any
> usecases I'm not claiming it is essential.
> 
>>
>> Thanks,
>> Khalid
> 
> Cheers,
> Mark
>>
>>>
>>> Rather than introduce two new system calls, along with /sys/ file to
>>> communicate global addresses, could mshare() be built on top of shmem
>>> API?  Thinking of something like the below;
>>> 1) For shm_open(3), add a new oflag to indicate the properties needed
>>> for mshare() (say, O_SHARED_PTE - better name?)
>>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
>>> in the sizes which can be set.
>>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
>>> objects.  On first mmap()ing an appropiate address is assigned,
>>> otherwise the current 'global' address is used.
>>> 4) shm_unlink(3) destroys the object when last reference is dropped.
>>>
>>> For 3), might be able to weaken the NULL requirement and validate a
>>> given address on first mapping to ensure it is correctly aligned.
>>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
>>> the default behaviour you require.
>>>
>>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
>>> sufficiently different to shmem that there is not much code sharing
>>> between the two (I haven't thought this through, but the object
>>> naming/refcounting should be similiar), but using shmem would be a
>>> familiar API.
>>>
>>> Any thoughts?
>>>
>>> Cheers,
>>> Mark
>>>
>>
Barry Song Jan. 21, 2022, 1:08 a.m. UTC | #10
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
> 
> --
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.

> mshare_unlink() returns 0 on success or -1 on error.

I am still struggling with the user scenarios of these new APIs. This patch
supposes multiple processes will have same virtual address for the shared
area? How can this be guaranteed while different processes can map different
stack, heap, libraries, files?

BTW, it seems you have different intention with the below?
Shared page tables during fork[1]
[1] https://lwn.net/Articles/861547/

Thanks
Barry
Matthew Wilcox (Oracle) Jan. 21, 2022, 2:13 a.m. UTC | #11
On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > A file under /sys/fs/mshare can be opened and read from. A read from
> > this file returns two long values - (1) starting address, and (2)
> > size of the mshare'd region.
> > 
> > --
> > int mshare_unlink(char *name)
> > 
> > A shared address range created by mshare() can be destroyed using
> > mshare_unlink() which removes the  shared named object. Once all
> > processes have unmapped the shared object, the shared address range
> > references are de-allocated and destroyed.
> 
> > mshare_unlink() returns 0 on success or -1 on error.
> 
> I am still struggling with the user scenarios of these new APIs. This patch
> supposes multiple processes will have same virtual address for the shared
> area? How can this be guaranteed while different processes can map different
> stack, heap, libraries, files?

The two processes choose to share a chunk of their address space.
They can map anything they like in that shared area, and then also
anything they like in the areas that aren't shared.  They can choose
for that shared area to have the same address in both processes
or different locations in each process.

If two processes want to put a shared library in that shared address
space, that should work.  They probably would need to agree to use
the same virtual address for the shared page tables for that to work.

Processes should probably not put their stacks in the shared region.
I mean, it could work, I suppose ... threads manage it in a single
address space.  But I don't see why you'd want to do that.  For
heaps, if you want the other process to be able to access the memory,
I suppose you could put it in the shared region, but heaps aren't
going to be put in the shared region by default.

Think of this like hugetlbfs, only instead of sharing hugetlbfs
memory, you can share _anything_ that's mmapable.

> BTW, it seems you have different intention with the below?
> Shared page tables during fork[1]
> [1] https://lwn.net/Articles/861547/

Yes, that's completely different.
Barry Song Jan. 21, 2022, 7:35 a.m. UTC | #12
On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > > A file under /sys/fs/mshare can be opened and read from. A read from
> > > this file returns two long values - (1) starting address, and (2)
> > > size of the mshare'd region.
> > >
> > > --
> > > int mshare_unlink(char *name)
> > >
> > > A shared address range created by mshare() can be destroyed using
> > > mshare_unlink() which removes the  shared named object. Once all
> > > processes have unmapped the shared object, the shared address range
> > > references are de-allocated and destroyed.
> >
> > > mshare_unlink() returns 0 on success or -1 on error.
> >
> > I am still struggling with the user scenarios of these new APIs. This patch
> > supposes multiple processes will have same virtual address for the shared
> > area? How can this be guaranteed while different processes can map different
> > stack, heap, libraries, files?
>
> The two processes choose to share a chunk of their address space.
> They can map anything they like in that shared area, and then also
> anything they like in the areas that aren't shared.  They can choose
> for that shared area to have the same address in both processes
> or different locations in each process.
>
> If two processes want to put a shared library in that shared address
> space, that should work.  They probably would need to agree to use
> the same virtual address for the shared page tables for that to work.

we are depending on an elf loader and ld to map the library
dynamically , so hardly
can we find a chance in users' code to call mshare() to map libraries
in application
level?

so we are supposed to modify some very low level code to use this feature?

>
> Processes should probably not put their stacks in the shared region.
> I mean, it could work, I suppose ... threads manage it in a single
> address space.  But I don't see why you'd want to do that.  For
> heaps, if you want the other process to be able to access the memory,
> I suppose you could put it in the shared region, but heaps aren't
> going to be put in the shared region by default.
>
> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> memory, you can share _anything_ that's mmapable.

yep, we can call mshare() on any kind of memory. for example, if multiple
processes use SYSV shmem, posix shmem or mmap the same file. but
it seems it is more sensible to let kernel do it automatically rather than
depending on calling mshare() from users? It is difficult for users to
decide which areas should be applied mshare(). users might want to call
mshare() for all shared areas to save memory coming from duplicated PTEs?
unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
communications,  mshare() looks not like a feature for applications,
but like a feature
for the whole system level? why would applications have to call something which
doesn't directly help them? without mshare(), those applications
will still work without any problem, right? is there anything in
mshare() which is
a must-have for applications? or mshare() is only a suggestion from applications
like madvise()?

>
> > BTW, it seems you have different intention with the below?
> > Shared page tables during fork[1]
> > [1] https://lwn.net/Articles/861547/
>
> Yes, that's completely different.

Thanks for clarification.

Best Regards.
Barry
Matthew Wilcox (Oracle) Jan. 21, 2022, 2:47 p.m. UTC | #13
On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > > > A file under /sys/fs/mshare can be opened and read from. A read from
> > > > this file returns two long values - (1) starting address, and (2)
> > > > size of the mshare'd region.
> > > >
> > > > --
> > > > int mshare_unlink(char *name)
> > > >
> > > > A shared address range created by mshare() can be destroyed using
> > > > mshare_unlink() which removes the  shared named object. Once all
> > > > processes have unmapped the shared object, the shared address range
> > > > references are de-allocated and destroyed.
> > >
> > > > mshare_unlink() returns 0 on success or -1 on error.
> > >
> > > I am still struggling with the user scenarios of these new APIs. This patch
> > > supposes multiple processes will have same virtual address for the shared
> > > area? How can this be guaranteed while different processes can map different
> > > stack, heap, libraries, files?
> >
> > The two processes choose to share a chunk of their address space.
> > They can map anything they like in that shared area, and then also
> > anything they like in the areas that aren't shared.  They can choose
> > for that shared area to have the same address in both processes
> > or different locations in each process.
> >
> > If two processes want to put a shared library in that shared address
> > space, that should work.  They probably would need to agree to use
> > the same virtual address for the shared page tables for that to work.
> 
> we are depending on an elf loader and ld to map the library
> dynamically , so hardly
> can we find a chance in users' code to call mshare() to map libraries
> in application
> level?

If somebody wants to modify ld.so to take advantage of mshare(), they
could.  That wasn't our primary motivation here, so if it turns out to
not work for that usecase, well, that's a shame.

> > Think of this like hugetlbfs, only instead of sharing hugetlbfs
> > memory, you can share _anything_ that's mmapable.
> 
> yep, we can call mshare() on any kind of memory. for example, if multiple
> processes use SYSV shmem, posix shmem or mmap the same file. but
> it seems it is more sensible to let kernel do it automatically rather than
> depending on calling mshare() from users? It is difficult for users to
> decide which areas should be applied mshare(). users might want to call
> mshare() for all shared areas to save memory coming from duplicated PTEs?
> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> communications,  mshare() looks not like a feature for applications,
> but like a feature
> for the whole system level? why would applications have to call something which
> doesn't directly help them? without mshare(), those applications
> will still work without any problem, right? is there anything in
> mshare() which is
> a must-have for applications? or mshare() is only a suggestion from applications
> like madvise()?

Our use case is that we have some very large files stored on persistent
memory which we want to mmap in thousands of processes.  So the first
one shares a chunk of its address space and mmaps all the files into
that chunk of address space.  Subsequent processes find that a suitable
address space already exists and use it, sharing the page tables and
avoiding the calls to mmap.

Sharing page tables is akin to running multiple threads in a single
address space; except that only part of the address space is the same.
There does need to be a certain amount of trust between the processes
sharing the address space.  You don't want to do it to an unsuspecting
process.
Khalid Aziz Jan. 21, 2022, 4:41 p.m. UTC | #14
On 1/21/22 07:47, Matthew Wilcox wrote:
> On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
>> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
>>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>>> this file returns two long values - (1) starting address, and (2)
>>>>> size of the mshare'd region.
>>>>>
>>>>> --
>>>>> int mshare_unlink(char *name)
>>>>>
>>>>> A shared address range created by mshare() can be destroyed using
>>>>> mshare_unlink() which removes the  shared named object. Once all
>>>>> processes have unmapped the shared object, the shared address range
>>>>> references are de-allocated and destroyed.
>>>>
>>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>
>>>> I am still struggling with the user scenarios of these new APIs. This patch
>>>> supposes multiple processes will have same virtual address for the shared
>>>> area? How can this be guaranteed while different processes can map different
>>>> stack, heap, libraries, files?
>>>
>>> The two processes choose to share a chunk of their address space.
>>> They can map anything they like in that shared area, and then also
>>> anything they like in the areas that aren't shared.  They can choose
>>> for that shared area to have the same address in both processes
>>> or different locations in each process.
>>>
>>> If two processes want to put a shared library in that shared address
>>> space, that should work.  They probably would need to agree to use
>>> the same virtual address for the shared page tables for that to work.
>>
>> we are depending on an elf loader and ld to map the library
>> dynamically , so hardly
>> can we find a chance in users' code to call mshare() to map libraries
>> in application
>> level?
> 
> If somebody wants to modify ld.so to take advantage of mshare(), they
> could.  That wasn't our primary motivation here, so if it turns out to
> not work for that usecase, well, that's a shame.
> 
>>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
>>> memory, you can share _anything_ that's mmapable.
>>
>> yep, we can call mshare() on any kind of memory. for example, if multiple
>> processes use SYSV shmem, posix shmem or mmap the same file. but
>> it seems it is more sensible to let kernel do it automatically rather than
>> depending on calling mshare() from users? It is difficult for users to
>> decide which areas should be applied mshare(). users might want to call
>> mshare() for all shared areas to save memory coming from duplicated PTEs?
>> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
>> communications,  mshare() looks not like a feature for applications,
>> but like a feature
>> for the whole system level? why would applications have to call something which
>> doesn't directly help them? without mshare(), those applications
>> will still work without any problem, right? is there anything in
>> mshare() which is
>> a must-have for applications? or mshare() is only a suggestion from applications
>> like madvise()?
> 
> Our use case is that we have some very large files stored on persistent
> memory which we want to mmap in thousands of processes.  So the first
> one shares a chunk of its address space and mmaps all the files into
> that chunk of address space.  Subsequent processes find that a suitable
> address space already exists and use it, sharing the page tables and
> avoiding the calls to mmap.
> 
> Sharing page tables is akin to running multiple threads in a single
> address space; except that only part of the address space is the same.
> There does need to be a certain amount of trust between the processes
> sharing the address space.  You don't want to do it to an unsuspecting
> process.
> 

Hello Barry,

mshare() is really meant for sharing data across unrelated processes by sharing address space explicitly and hence 
opt-in is required. As Matthew said, the processes sharing this virtual address space need to have a level of trust.
Permissions on the msharefs files control who can access this shared address space. It is possible to adapt this
mechanism to share stack, libraries etc but that is not the intent. This feature will be used by applications that share
data with multiple processes using shared mapping normally and it helps them avoid the overhead of large number of
duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces amount of memory available for
applications and can result in out-of-memory condition. An example from the patch 0/6:

"On a database server with 300GB SGA, a system crash was seen with
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, in the
worst case scenario of all 1500 processes mapping every page from
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, amount of memory saved is very significant."

--
Khalid
Longpeng(Mike) Jan. 22, 2022, 1:39 a.m. UTC | #15
> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, January 22, 2022 12:42 AM
> To: Matthew Wilcox <willy@infradead.org>; Barry Song <21cnbao@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Arnd Bergmann <arnd@arndb.de>;
> Dave Hansen <dave.hansen@linux.intel.com>; David Hildenbrand
> <david@redhat.com>; LKML <linux-kernel@vger.kernel.org>; Linux-MM
> <linux-mm@kvack.org>; Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) <longpeng2@huawei.com>; Mike Rapoport <rppt@kernel.org>; Suren
> Baghdasaryan <surenb@google.com>
> Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
> 
> On 1/21/22 07:47, Matthew Wilcox wrote:
> > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> >>>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>>> this file returns two long values - (1) starting address, and (2)
> >>>>> size of the mshare'd region.
> >>>>>
> >>>>> --
> >>>>> int mshare_unlink(char *name)
> >>>>>
> >>>>> A shared address range created by mshare() can be destroyed using
> >>>>> mshare_unlink() which removes the  shared named object. Once all
> >>>>> processes have unmapped the shared object, the shared address range
> >>>>> references are de-allocated and destroyed.
> >>>>
> >>>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>> I am still struggling with the user scenarios of these new APIs. This patch
> >>>> supposes multiple processes will have same virtual address for the shared
> >>>> area? How can this be guaranteed while different processes can map different
> >>>> stack, heap, libraries, files?
> >>>
> >>> The two processes choose to share a chunk of their address space.
> >>> They can map anything they like in that shared area, and then also
> >>> anything they like in the areas that aren't shared.  They can choose
> >>> for that shared area to have the same address in both processes
> >>> or different locations in each process.
> >>>
> >>> If two processes want to put a shared library in that shared address
> >>> space, that should work.  They probably would need to agree to use
> >>> the same virtual address for the shared page tables for that to work.
> >>
> >> we are depending on an elf loader and ld to map the library
> >> dynamically , so hardly
> >> can we find a chance in users' code to call mshare() to map libraries
> >> in application
> >> level?
> >
> > If somebody wants to modify ld.so to take advantage of mshare(), they
> > could.  That wasn't our primary motivation here, so if it turns out to
> > not work for that usecase, well, that's a shame.
> >
> >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> >>> memory, you can share _anything_ that's mmapable.
> >>
> >> yep, we can call mshare() on any kind of memory. for example, if multiple
> >> processes use SYSV shmem, posix shmem or mmap the same file. but
> >> it seems it is more sensible to let kernel do it automatically rather than
> >> depending on calling mshare() from users? It is difficult for users to
> >> decide which areas should be applied mshare(). users might want to call
> >> mshare() for all shared areas to save memory coming from duplicated PTEs?
> >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> >> communications,  mshare() looks not like a feature for applications,
> >> but like a feature
> >> for the whole system level? why would applications have to call something
> which
> >> doesn't directly help them? without mshare(), those applications
> >> will still work without any problem, right? is there anything in
> >> mshare() which is
> >> a must-have for applications? or mshare() is only a suggestion from
> applications
> >> like madvise()?
> >
> > Our use case is that we have some very large files stored on persistent
> > memory which we want to mmap in thousands of processes.  So the first
> > one shares a chunk of its address space and mmaps all the files into
> > that chunk of address space.  Subsequent processes find that a suitable
> > address space already exists and use it, sharing the page tables and
> > avoiding the calls to mmap.
> >
> > Sharing page tables is akin to running multiple threads in a single
> > address space; except that only part of the address space is the same.
> > There does need to be a certain amount of trust between the processes
> > sharing the address space.  You don't want to do it to an unsuspecting
> > process.
> >
> 
> Hello Barry,
> 
> mshare() is really meant for sharing data across unrelated processes by sharing
> address space explicitly and hence
> opt-in is required. As Matthew said, the processes sharing this virtual address
> space need to have a level of trust.
> Permissions on the msharefs files control who can access this shared address
> space. It is possible to adapt this
> mechanism to share stack, libraries etc but that is not the intent. This feature
> will be used by applications that share
> data with multiple processes using shared mapping normally and it helps them
> avoid the overhead of large number of
> duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces
> amount of memory available for
> applications and can result in out-of-memory condition. An example from the patch
> 0/6:
> 
> "On a database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant."
> 

The memory overhead of PTEs would be significantly saved if we use
hugetlbfs in this case, but why not?

> --
> Khalid
Matthew Wilcox (Oracle) Jan. 22, 2022, 1:41 a.m. UTC | #16
On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > Our use case is that we have some very large files stored on persistent
> > > memory which we want to mmap in thousands of processes.  So the first
> 
> The memory overhead of PTEs would be significantly saved if we use
> hugetlbfs in this case, but why not?

Because we want the files to be persistent across reboots.
Thomas Schoebel-Theuer Jan. 22, 2022, 10:18 a.m. UTC | #17
On 1/22/22 2:41 AM, Matthew Wilcox wrote:
> On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Our use case is that we have some very large files stored on persistent
>>>> memory which we want to mmap in thousands of processes.  So the first
>> The memory overhead of PTEs would be significantly saved if we use
>> hugetlbfs in this case, but why not?
> Because we want the files to be persistent across reboots.

100% agree. There is another use case: geo-redundancy.

My view is publicly documented at 
https://github.com/schoebel/mars/tree/master/docu and click at 
architecture-guide-geo-redundancy.pdf

In some scenarios, migration or (temporary) co-existence of block 
devices from/between hardware architecture A to/between hardware 
architecture B might become a future requirement for me.

The currrent implementation does not yet use hugetlbfs and/or its 
proposed / low-overhead / more fine-grained and/or less 
hardware-architecture specific (future) alternatives.

For me, all of these are future options. In particular, when (1) 
abstractable for reduction of architectural dependencies, and hopefully 
(2) usable from both kernelspace and userspace.

It would be great if msharefs is not only low-footprint, but also would 
be usable from kernelspace.

Reduction (or getting rid) of preallocation strategies would be also a 
valuable feature for me.

Of course, I cannot decide what I will prefer in future for any future 
requirements. But some kind of mutual awareness and future collaboration 
would be great.
Mike Rapoport Jan. 22, 2022, 11:31 a.m. UTC | #18
(added linux-api)

On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
> 
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system carsh was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
> 
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
> 
> 
> API Proposal
> ============
> 
> The mshare API consists of two system calls - mshare() and mshare_unlink()
> 
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> 
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
> 
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
> 
> mshare() returns an error code if it fails, otherwise it returns 0.

Did you consider returning a file descriptor from mshare() system call?
Then there would be no need in mshare_unlink() as close(fd) would work.
 
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
> 
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.

Maybe read should return a structure containing some data identifier and
the data itself, so that it could be extended in the future.
 
> --
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
> 
> mshare_unlink() returns 0 on success or -1 on error.
Matthew Wilcox (Oracle) Jan. 22, 2022, 4:09 p.m. UTC | #19
On Sat, Jan 22, 2022 at 11:18:14AM +0100, Thomas Schoebel-Theuer wrote:
> On 1/22/22 2:41 AM, Matthew Wilcox wrote:
> > On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > > > Our use case is that we have some very large files stored on persistent
> > > > > memory which we want to mmap in thousands of processes.  So the first
> > > The memory overhead of PTEs would be significantly saved if we use
> > > hugetlbfs in this case, but why not?
> > Because we want the files to be persistent across reboots.
> 
> 100% agree. There is another use case: geo-redundancy.
> 
> My view is publicly documented at
> https://github.com/schoebel/mars/tree/master/docu and click at
> architecture-guide-geo-redundancy.pdf

That's a 160+ page PDF.  No offence, Thomas, I'm not reading that to
try to understand how you want to use page table sharing.

> In some scenarios, migration or (temporary) co-existence of block devices
> from/between hardware architecture A to/between hardware architecture B
> might become a future requirement for me.

I'm not sure how sharing block devices between systems matches up with
sharing page tables between processes.

> It would be great if msharefs is not only low-footprint, but also would be
> usable from kernelspace.

I don't understand what you want here either.  Kernel threads already
share their page tables.
Andy Lutomirski Jan. 22, 2022, 6:29 p.m. UTC | #20
> On Jan 22, 2022, at 3:31 AM, Mike Rapoport <rppt@kernel.org> wrote:
>
> (added linux-api)
>
>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.

Sharing PTEs is nice, but merely sharing a chunk of address space
regardless of optimizations is nontrivial.  It’s also quite useful,
potentially.  So I think a good way to start would be to make a nice
design for just sharing address space and then, on top of it, figure
out how to share page tables.

See here for an earlier proposal:

https://lore.kernel.org/all/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com/

Alternatively, one could try to optimize memfd so that large similarly
aligned mappings in different processes could share page tables.

Any of the above will require some interesting thought as to whether
TLB shootdowns are managed by the core rmap code or by mmu notifiers.
Mark Hemment Jan. 24, 2022, 3:15 p.m. UTC | #21
On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/20/22 05:49, Mark Hemment wrote:
> > On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> On 1/19/22 04:38, Mark Hemment wrote:
> >>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>>>
> >>>> Page tables in kernel consume some of the memory and as long as
> >>>> number of mappings being maintained is small enough, this space
> >>>> consumed by page tables is not objectionable. When very few memory
> >>>> pages are shared between processes, the number of page table entries
> >>>> (PTEs) to maintain is mostly constrained by the number of pages of
> >>>> memory on the system. As the number of shared pages and the number
> >>>> of times pages are shared goes up, amount of memory consumed by page
> >>>> tables starts to become significant.
> >>>>
> >>>> Some of the field deployments commonly see memory pages shared
> >>>> across 1000s of processes. On x86_64, each page requires a PTE that
> >>>> is only 8 bytes long which is very small compared to the 4K page
> >>>> size. When 2000 processes map the same page in their address space,
> >>>> each one of them requires 8 bytes for its PTE and together that adds
> >>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >>>> database server with 300GB SGA, a system carsh was seen with
> >>>> out-of-memory condition when 1500+ clients tried to share this SGA
> >>>> even though the system had 512GB of memory. On this server, in the
> >>>> worst case scenario of all 1500 processes mapping every page from
> >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >>>> could be shared, amount of memory saved is very significant.
> >>>>
> >>>> This is a proposal to implement a mechanism in kernel to allow
> >>>> userspace processes to opt into sharing PTEs. The proposal is to add
> >>>> a new system call - mshare(), which can be used by a process to
> >>>> create a region (we will call it mshare'd region) which can be used
> >>>> by other processes to map same pages using shared PTEs. Other
> >>>> process(es), assuming they have the right permissions, can then make
> >>>> the mashare() system call to map the shared pages into their address
> >>>> space using the shared PTEs.  When a process is done using this
> >>>> mshare'd region, it makes a mshare_unlink() system call to end its
> >>>> access. When the last process accessing mshare'd region calls
> >>>> mshare_unlink(), the mshare'd region is torn down and memory used by
> >>>> it is freed.
> >>>>
> >>>>
> >>>> API Proposal
> >>>> ============
> >>>>
> >>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>>>
> >>>> --
> >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>>>
> >>>> mshare() creates and opens a new, or opens an existing mshare'd
> >>>> region that will be shared at PTE level. "name" refers to shared object
> >>>> name that exists under /sys/fs/mshare. "addr" is the starting address
> >>>> of this shared memory area and length is the size of this area.
> >>>> oflags can be one of:
> >>>>
> >>>> - O_RDONLY opens shared memory area for read only access by everyone
> >>>> - O_RDWR opens shared memory area for read and write access
> >>>> - O_CREAT creates the named shared memory area if it does not exist
> >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>>>     exists with that name, return an error.
> >>>>
> >>>> mode represents the creation mode for the shared object under
> >>>> /sys/fs/mshare.
> >>>>
> >>>> mshare() returns an error code if it fails, otherwise it returns 0.
> >>>>
> >>>> PTEs are shared at pgdir level and hence it imposes following
> >>>> requirements on the address and size given to the mshare():
> >>>>
> >>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
> >>>> - Size must be a multiple of pgdir size
> >>>> - Any mappings created in this address range at any time become
> >>>>     shared automatically
> >>>> - Shared address range can have unmapped addresses in it. Any access
> >>>>     to unmapped address will result in SIGBUS
> >>>>
> >>>> Mappings within this address range behave as if they were shared
> >>>> between threads, so a write to a MAP_PRIVATE mapping will create a
> >>>> page which is shared between all the sharers. The first process that
> >>>> declares an address range mshare'd can continue to map objects in
> >>>> the shared area. All other processes that want mshare'd access to
> >>>> this memory area can do so by calling mshare(). After this call, the
> >>>> address range given by mshare becomes a shared range in its address
> >>>> space. Anonymous mappings will be shared and not COWed.
> >>>>
> >>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>> this file returns two long values - (1) starting address, and (2)
> >>>> size of the mshare'd region.
> >>>>
> >>>> --
> >>>> int mshare_unlink(char *name)
> >>>>
> >>>> A shared address range created by mshare() can be destroyed using
> >>>> mshare_unlink() which removes the  shared named object. Once all
> >>>> processes have unmapped the shared object, the shared address range
> >>>> references are de-allocated and destroyed.
> >>>>
> >>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>>
> >>>> Example Code
> >>>> ============
> >>>>
> >>>> Snippet of the code that a donor process would run looks like below:
> >>>>
> >>>> -----------------
> >>>>           addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>>>                           MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>>>           if (addr == MAP_FAILED)
> >>>>                   perror("ERROR: mmap failed");
> >>>>
> >>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>>>                           GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >>>>           if (err < 0) {
> >>>>                   perror("mshare() syscall failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           strncpy(addr, "Some random shared text",
> >>>>                           sizeof("Some random shared text"));
> >>>> -----------------
> >>>>
> >>>> Snippet of code that a consumer process would execute looks like:
> >>>>
> >>>> -----------------
> >>>>           fd = open("testregion", O_RDONLY);
> >>>>           if (fd < 0) {
> >>>>                   perror("open failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
> >>>>                   printf("INFO: %ld bytes shared at addr %lx \n",
> >>>>                                   mshare_info[1], mshare_info[0]);
> >>>>           else
> >>>>                   perror("read failed");
> >>>>
> >>>>           close(fd);
> >>>>
> >>>>           addr = (char *)mshare_info[0];
> >>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> >>>>                           mshare_info[1], O_RDWR, 600);
> >>>>           if (err < 0) {
> >>>>                   perror("mshare() syscall failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           printf("Guest mmap at %px:\n", addr);
> >>>>           printf("%s\n", addr);
> >>>>           printf("\nDone\n");
> >>>>
> >>>>           err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>>>           if (err < 0) {
> >>>>                   perror("mshare_unlink() failed");
> >>>>                   exit(1);
> >>>>           }
> >>>> -----------------
> >>> ...
> >>> Hi Khalid,
> >>>
> >>> The proposed mshare() appears to be similar to POSIX shared memory,
> >>> but with two extra (related) attributes;
> >>> a) Internally, uses shared page tables.
> >>> b) Shared memory is mapped at same address for all users.
> >>
> >> Hi Mark,
> >>
> >> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
> >> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
> >> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
> >> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
> >> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
> >> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
> >> separate implementation.
> >
> > Ah, I jumped in assuming this was only for anon memory.
> >
> >> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
> >> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
> >
> > So possible you have already considered the below...which does make
> > the API a little more POSIX shm like.
> >
> > The mshare() syscall does two operations;
> > 1) create/open mshare object
> > 2) export/import the given memory region
> >
> > Would it be better if these were seperate operations?  That is,
> > mshare_open() (say) creates/opens the object returning a file
> > descriptor.  The fd used as the identifier for the export/import after
> > mmap(2); eg.
> > addr = mshare_op(EXPORT, fd, addr, size);
> > addr = mshare_op(IMPORT, fd, NULL, 0);
> > (Not sure about export/import terms..)
> >
> > The benefit of the the separate ops is the file descriptor.  This
> > could be used for fstat(2) (and fchown(2)?), although not sure how
> > much value this would add.
>
> Hi Mark,
>
> That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a
> regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region.

Hi Khalid,

In your proposed API, the 'importer' of the mshared region does not
open the mshared backing object (when a file being mapped) instead it
does an open on the msharefs file.
From the code sample in your initial email (simplified), where a
process attaches to the mshared region;
    fd = open("testregion", O_RDONLY);
    read(fd, &mshare_info, sizeof (mshare_info));
    mshare("testregion", addr, len, RDWR, 0600);

Open permission checks are done by the mshare() system call against
the msharefs file ("testregion").

From the code sample in your initial email (simplified), where a
process creates a msharefs file with the anonymous mmap()ed region to
be shared;
    addr = mmap(RDWR, ANON);
    mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);

Now, consider the case where the mmap() is named (that is, against a
file).  I believe this is the usecase for Oracle's SGA.
My (simplified) code for msharing a named file ("SGA") using your
proposed API (does not matter if the mapping is PRIVATE or SHARED);
    fd = open("SGA", RDWR);
    addr = mmap(RDWR, ..., fd);
    mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);

If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
"SGA-region" msharefs are different, then it is very likely a serious
security issue.
That is, a user who could not open(2) the "SGA" file might be able to
open the "SGA-region" msharefs file, and so gain at least read
permission on the file.

This is why I was proposing a file descriptor, so the msharefs file
could be set to have the same permissions as the backing file it is
exporting (but I got this wrong).
This would still leave a window between the msharefs file being
creating and the permissions being set, where a rogue process could
attach to a region when they should not have the permission (this
could be closed by failing a non-creating mshare() if the region is of
zero len - nothing yet shared - until permission are set and the
region shared).
But relying on userspace to always set the correct permissions on the
msharefs file is dangerous - likely to get it wrong on occasion - and
isn't sufficient.  The msharefs API needs to be bullet proof.

Looking at the patches, I cannot see where extra validation is being
done for a named mapping to ensure any 'importer' has the necessary
permission against the backing file.
The 'struct file' (->vm_file, and associated inode) in the VMA is
sufficient to perform required access checks against the file's perms
- the posted patches do not check this (but they are for an RFC, so
don't expect all cases to be handled).  But what about a full path
permission check?  That is, the 'importer' has necessary permissions
on the backing file, but would not be able to find this file due to
directory permissions?  msharefs would bypass the directory checks.


> >
> > The 'importer' would use the address/size of the memory region as
> > exported (and stored in msharefs), so no need for /sys file (except
> > for human readable info).
>
> I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size,
> I assume you are not proposing eliminating msharefs.

The 'exporter' of the mshared region specifies the address and length,
and is therefore is known by the mshare code.
An 'import' needs to only pass NULL/0 for addr/len and is told by
mshare where the region has been attached in its address-space.  With
this, the /sys file is no longer part of the API.


> >
> > If the set-up operations are split in two, then would it make sense to
> > also split the teardown as well?  Say, mshare_op(DROP, fd) and
> > mshare_unlink(fd)?
>
> A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care
> of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that
> created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I
> would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the
> mshare'd region disappearing on its own when all references to it are dropped.
>
> Thanks,
> Khalid

Cheers,
Mark

>
> >
> >>
> >> Does that rationale sound reasonable?
> >
> > It doesn't sound unreasonable.  As msharefs is providing a namespace
> > and perms, it doesn't need much flexibility.  Being able to modifying
> > the perms post namespace creation (fchown(2)), before exporting the
> > memory region, might be useful in some cases - but as I don't have any
> > usecases I'm not claiming it is essential.
> >
> >>
> >> Thanks,
> >> Khalid
> >
> > Cheers,
> > Mark
> >>
> >>>
> >>> Rather than introduce two new system calls, along with /sys/ file to
> >>> communicate global addresses, could mshare() be built on top of shmem
> >>> API?  Thinking of something like the below;
> >>> 1) For shm_open(3), add a new oflag to indicate the properties needed
> >>> for mshare() (say, O_SHARED_PTE - better name?)
> >>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> >>> in the sizes which can be set.
> >>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> >>> objects.  On first mmap()ing an appropiate address is assigned,
> >>> otherwise the current 'global' address is used.
> >>> 4) shm_unlink(3) destroys the object when last reference is dropped.
> >>>
> >>> For 3), might be able to weaken the NULL requirement and validate a
> >>> given address on first mapping to ensure it is correctly aligned.
> >>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> >>> the default behaviour you require.
> >>>
> >>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
> >>> sufficiently different to shmem that there is not much code sharing
> >>> between the two (I haven't thought this through, but the object
> >>> naming/refcounting should be similiar), but using shmem would be a
> >>> familiar API.
> >>>
> >>> Any thoughts?
> >>>
> >>> Cheers,
> >>> Mark
> >>>
> >>
>
Matthew Wilcox (Oracle) Jan. 24, 2022, 3:27 p.m. UTC | #22
On Mon, Jan 24, 2022 at 03:15:36PM +0000, Mark Hemment wrote:
> From the code sample in your initial email (simplified), where a
> process creates a msharefs file with the anonymous mmap()ed region to
> be shared;
>     addr = mmap(RDWR, ANON);
>     mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> Now, consider the case where the mmap() is named (that is, against a
> file).  I believe this is the usecase for Oracle's SGA.
> My (simplified) code for msharing a named file ("SGA") using your
> proposed API (does not matter if the mapping is PRIVATE or SHARED);
>     fd = open("SGA", RDWR);
>     addr = mmap(RDWR, ..., fd);
>     mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);

Don't think of an mshared region as containing only one file.
It might easily contain dozens.  Or none at the start.  They're
dynamic; the mshare fd represents a chunk of address space, not
whatever is currently mapped there.

> If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
> "SGA-region" msharefs are different, then it is very likely a serious
> security issue.

Only in the same sense that an application might open() a file that it
has permission to access and then open a pipe/socket to a process that
does not have permission and send the data to it.
Khalid Aziz Jan. 24, 2022, 6:48 p.m. UTC | #23
On 1/22/22 04:31, Mike Rapoport wrote:
> (added linux-api)
> 
> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system carsh was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API Proposal
>> ============
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
> 
> Did you consider returning a file descriptor from mshare() system call?
> Then there would be no need in mshare_unlink() as close(fd) would work.

That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though 
for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases 
for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling 
mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to 
bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic 
with a close() semantics though unless there was another way to force refcount to 0. Right?


>   
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
> 
> Maybe read should return a structure containing some data identifier and
> the data itself, so that it could be extended in the future.

I like that idea. I will work on it.

Thanks!

--
Khalid

>   
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>
Andy Lutomirski Jan. 24, 2022, 7:45 p.m. UTC | #24
On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/22/22 04:31, Mike Rapoport wrote:
> > (added linux-api)
> >
> > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >> Page tables in kernel consume some of the memory and as long as
> >> number of mappings being maintained is small enough, this space
> >> consumed by page tables is not objectionable. When very few memory
> >> pages are shared between processes, the number of page table entries
> >> (PTEs) to maintain is mostly constrained by the number of pages of
> >> memory on the system. As the number of shared pages and the number
> >> of times pages are shared goes up, amount of memory consumed by page
> >> tables starts to become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared
> >> across 1000s of processes. On x86_64, each page requires a PTE that
> >> is only 8 bytes long which is very small compared to the 4K page
> >> size. When 2000 processes map the same page in their address space,
> >> each one of them requires 8 bytes for its PTE and together that adds
> >> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >> database server with 300GB SGA, a system carsh was seen with
> >> out-of-memory condition when 1500+ clients tried to share this SGA
> >> even though the system had 512GB of memory. On this server, in the
> >> worst case scenario of all 1500 processes mapping every page from
> >> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >> could be shared, amount of memory saved is very significant.
> >>
> >> This is a proposal to implement a mechanism in kernel to allow
> >> userspace processes to opt into sharing PTEs. The proposal is to add
> >> a new system call - mshare(), which can be used by a process to
> >> create a region (we will call it mshare'd region) which can be used
> >> by other processes to map same pages using shared PTEs. Other
> >> process(es), assuming they have the right permissions, can then make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API Proposal
> >> ============
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>    exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >
> > Did you consider returning a file descriptor from mshare() system call?
> > Then there would be no need in mshare_unlink() as close(fd) would work.
>
> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> with a close() semantics though unless there was another way to force refcount to 0. Right?
>

I'm not sure I understand the problem.  If you're sharing a portion of
an mm and the mm goes away, then all that should be left are some
struct files that are no longer useful.  They'll go away when their
refcount goes to zero.

--Andy
Khalid Aziz Jan. 24, 2022, 10:20 p.m. UTC | #25
On 1/24/22 08:15, Mark Hemment wrote:
> On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/20/22 05:49, Mark Hemment wrote:
>>> On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> On 1/19/22 04:38, Mark Hemment wrote:
>>>>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>>>
>>>>>> Page tables in kernel consume some of the memory and as long as
>>>>>> number of mappings being maintained is small enough, this space
>>>>>> consumed by page tables is not objectionable. When very few memory
>>>>>> pages are shared between processes, the number of page table entries
>>>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>>>> memory on the system. As the number of shared pages and the number
>>>>>> of times pages are shared goes up, amount of memory consumed by page
>>>>>> tables starts to become significant.
>>>>>>
>>>>>> Some of the field deployments commonly see memory pages shared
>>>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>>>> is only 8 bytes long which is very small compared to the 4K page
>>>>>> size. When 2000 processes map the same page in their address space,
>>>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>>>> database server with 300GB SGA, a system carsh was seen with
>>>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>>>> even though the system had 512GB of memory. On this server, in the
>>>>>> worst case scenario of all 1500 processes mapping every page from
>>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>>>> could be shared, amount of memory saved is very significant.
>>>>>>
>>>>>> This is a proposal to implement a mechanism in kernel to allow
>>>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>>>> a new system call - mshare(), which can be used by a process to
>>>>>> create a region (we will call it mshare'd region) which can be used
>>>>>> by other processes to map same pages using shared PTEs. Other
>>>>>> process(es), assuming they have the right permissions, can then make
>>>>>> the mashare() system call to map the shared pages into their address
>>>>>> space using the shared PTEs.  When a process is done using this
>>>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>>>> access. When the last process accessing mshare'd region calls
>>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>>>> it is freed.
>>>>>>
>>>>>>
>>>>>> API Proposal
>>>>>> ============
>>>>>>
>>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>>>
>>>>>> --
>>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>>>
>>>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>>>> region that will be shared at PTE level. "name" refers to shared object
>>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>>>> of this shared memory area and length is the size of this area.
>>>>>> oflags can be one of:
>>>>>>
>>>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>>>> - O_RDWR opens shared memory area for read and write access
>>>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>>>      exists with that name, return an error.
>>>>>>
>>>>>> mode represents the creation mode for the shared object under
>>>>>> /sys/fs/mshare.
>>>>>>
>>>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>>>
>>>>>> PTEs are shared at pgdir level and hence it imposes following
>>>>>> requirements on the address and size given to the mshare():
>>>>>>
>>>>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>>>>> - Size must be a multiple of pgdir size
>>>>>> - Any mappings created in this address range at any time become
>>>>>>      shared automatically
>>>>>> - Shared address range can have unmapped addresses in it. Any access
>>>>>>      to unmapped address will result in SIGBUS
>>>>>>
>>>>>> Mappings within this address range behave as if they were shared
>>>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>>>> page which is shared between all the sharers. The first process that
>>>>>> declares an address range mshare'd can continue to map objects in
>>>>>> the shared area. All other processes that want mshare'd access to
>>>>>> this memory area can do so by calling mshare(). After this call, the
>>>>>> address range given by mshare becomes a shared range in its address
>>>>>> space. Anonymous mappings will be shared and not COWed.
>>>>>>
>>>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>>>> this file returns two long values - (1) starting address, and (2)
>>>>>> size of the mshare'd region.
>>>>>>
>>>>>> --
>>>>>> int mshare_unlink(char *name)
>>>>>>
>>>>>> A shared address range created by mshare() can be destroyed using
>>>>>> mshare_unlink() which removes the  shared named object. Once all
>>>>>> processes have unmapped the shared object, the shared address range
>>>>>> references are de-allocated and destroyed.
>>>>>>
>>>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>>>
>>>>>>
>>>>>> Example Code
>>>>>> ============
>>>>>>
>>>>>> Snippet of the code that a donor process would run looks like below:
>>>>>>
>>>>>> -----------------
>>>>>>            addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>>>>                            MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>>>>            if (addr == MAP_FAILED)
>>>>>>                    perror("ERROR: mmap failed");
>>>>>>
>>>>>>            err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>>>>                            GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare() syscall failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            strncpy(addr, "Some random shared text",
>>>>>>                            sizeof("Some random shared text"));
>>>>>> -----------------
>>>>>>
>>>>>> Snippet of code that a consumer process would execute looks like:
>>>>>>
>>>>>> -----------------
>>>>>>            fd = open("testregion", O_RDONLY);
>>>>>>            if (fd < 0) {
>>>>>>                    perror("open failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>>>>>                    printf("INFO: %ld bytes shared at addr %lx \n",
>>>>>>                                    mshare_info[1], mshare_info[0]);
>>>>>>            else
>>>>>>                    perror("read failed");
>>>>>>
>>>>>>            close(fd);
>>>>>>
>>>>>>            addr = (char *)mshare_info[0];
>>>>>>            err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>>>>>                            mshare_info[1], O_RDWR, 600);
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare() syscall failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            printf("Guest mmap at %px:\n", addr);
>>>>>>            printf("%s\n", addr);
>>>>>>            printf("\nDone\n");
>>>>>>
>>>>>>            err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare_unlink() failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>> -----------------
>>>>> ...
>>>>> Hi Khalid,
>>>>>
>>>>> The proposed mshare() appears to be similar to POSIX shared memory,
>>>>> but with two extra (related) attributes;
>>>>> a) Internally, uses shared page tables.
>>>>> b) Shared memory is mapped at same address for all users.
>>>>
>>>> Hi Mark,
>>>>
>>>> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
>>>> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
>>>> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
>>>> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
>>>> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
>>>> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
>>>> separate implementation.
>>>
>>> Ah, I jumped in assuming this was only for anon memory.
>>>
>>>> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
>>>> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
>>>
>>> So possible you have already considered the below...which does make
>>> the API a little more POSIX shm like.
>>>
>>> The mshare() syscall does two operations;
>>> 1) create/open mshare object
>>> 2) export/import the given memory region
>>>
>>> Would it be better if these were seperate operations?  That is,
>>> mshare_open() (say) creates/opens the object returning a file
>>> descriptor.  The fd used as the identifier for the export/import after
>>> mmap(2); eg.
>>> addr = mshare_op(EXPORT, fd, addr, size);
>>> addr = mshare_op(IMPORT, fd, NULL, 0);
>>> (Not sure about export/import terms..)
>>>
>>> The benefit of the the separate ops is the file descriptor.  This
>>> could be used for fstat(2) (and fchown(2)?), although not sure how
>>> much value this would add.
>>
>> Hi Mark,
>>
>> That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a
>> regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region.
> 
> Hi Khalid,
> 
> In your proposed API, the 'importer' of the mshared region does not
> open the mshared backing object (when a file being mapped) instead it
> does an open on the msharefs file.
>  From the code sample in your initial email (simplified), where a
> process attaches to the mshared region;
>      fd = open("testregion", O_RDONLY);
>      read(fd, &mshare_info, sizeof (mshare_info));
>      mshare("testregion", addr, len, RDWR, 0600);
> 
> Open permission checks are done by the mshare() system call against
> the msharefs file ("testregion").
> 
>  From the code sample in your initial email (simplified), where a
> process creates a msharefs file with the anonymous mmap()ed region to
> be shared;
>      addr = mmap(RDWR, ANON);
>      mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> Now, consider the case where the mmap() is named (that is, against a
> file).  I believe this is the usecase for Oracle's SGA.
> My (simplified) code for msharing a named file ("SGA") using your
> proposed API (does not matter if the mapping is PRIVATE or SHARED);
>      fd = open("SGA", RDWR);
>      addr = mmap(RDWR, ..., fd);
>      mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
> "SGA-region" msharefs are different, then it is very likely a serious
> security issue.
> That is, a user who could not open(2) the "SGA" file might be able to
> open the "SGA-region" msharefs file, and so gain at least read
> permission on the file.

If an app creates an mshare'd region and gives wider access permissions than the on the objects it has mapped in or maps 
in in future, I would read it as app intends to do that. mshare'd region can cover any mapped object besides files. It 
could be anonymous memory holding data supplied by an application user, or could be data captured over network. How 
would one validate intended permissions in those cases? A uniform check would be to ensure access given to mshare'd 
region is compliant with the permissions on the region iteself as given by the file under msharefs. Those permission 
checks are not implemented yet in this initial prototype but definitely planned as continuing work.

> 
> This is why I was proposing a file descriptor, so the msharefs file
> could be set to have the same permissions as the backing file it is
> exporting (but I got this wrong).
> This would still leave a window between the msharefs file being
> creating and the permissions being set, where a rogue process could
> attach to a region when they should not have the permission (this
> could be closed by failing a non-creating mshare() if the region is of
> zero len - nothing yet shared - until permission are set and the
> region shared).
> But relying on userspace to always set the correct permissions on the
> msharefs file is dangerous - likely to get it wrong on occasion - and
> isn't sufficient.  The msharefs API needs to be bullet proof.
> 
> Looking at the patches, I cannot see where extra validation is being
> done for a named mapping to ensure any 'importer' has the necessary
> permission against the backing file.
> The 'struct file' (->vm_file, and associated inode) in the VMA is
> sufficient to perform required access checks against the file's perms
> - the posted patches do not check this (but they are for an RFC, so
> don't expect all cases to be handled).  But what about a full path
> permission check?  That is, the 'importer' has necessary permissions
> on the backing file, but would not be able to find this file due to
> directory permissions?  msharefs would bypass the directory checks.
>  >
>>>
>>> The 'importer' would use the address/size of the memory region as
>>> exported (and stored in msharefs), so no need for /sys file (except
>>> for human readable info).
>>
>> I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size,
>> I assume you are not proposing eliminating msharefs.
> 
> The 'exporter' of the mshared region specifies the address and length,
> and is therefore is known by the mshare code.
> An 'import' needs to only pass NULL/0 for addr/len and is told by
> mshare where the region has been attached in its address-space.  With
> this, the /sys file is no longer part of the API.

This would require every importer map the entire mshare'd range. Should we allow mapping a subset of the region, in 
which case importer needs to supply starting address and length? It is possibly simpler to only allow mapping the entire 
region but if there is a use case for partial mapping, designing that flexibility in is useful.

Thanks,
Khalid

> 
> 
>>>
>>> If the set-up operations are split in two, then would it make sense to
>>> also split the teardown as well?  Say, mshare_op(DROP, fd) and
>>> mshare_unlink(fd)?
>>
>> A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care
>> of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that
>> created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I
>> would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the
>> mshare'd region disappearing on its own when all references to it are dropped.
>>
>> Thanks,
>> Khalid
> 
> Cheers,
> Mark
> 
>>
>>>
>>>>
>>>> Does that rationale sound reasonable?
>>>
>>> It doesn't sound unreasonable.  As msharefs is providing a namespace
>>> and perms, it doesn't need much flexibility.  Being able to modifying
>>> the perms post namespace creation (fchown(2)), before exporting the
>>> memory region, might be useful in some cases - but as I don't have any
>>> usecases I'm not claiming it is essential.
>>>
>>>>
>>>> Thanks,
>>>> Khalid
>>>
>>> Cheers,
>>> Mark
>>>>
>>>>>
>>>>> Rather than introduce two new system calls, along with /sys/ file to
>>>>> communicate global addresses, could mshare() be built on top of shmem
>>>>> API?  Thinking of something like the below;
>>>>> 1) For shm_open(3), add a new oflag to indicate the properties needed
>>>>> for mshare() (say, O_SHARED_PTE - better name?)
>>>>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
>>>>> in the sizes which can be set.
>>>>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
>>>>> objects.  On first mmap()ing an appropiate address is assigned,
>>>>> otherwise the current 'global' address is used.
>>>>> 4) shm_unlink(3) destroys the object when last reference is dropped.
>>>>>
>>>>> For 3), might be able to weaken the NULL requirement and validate a
>>>>> given address on first mapping to ensure it is correctly aligned.
>>>>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
>>>>> the default behaviour you require.
>>>>>
>>>>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
>>>>> sufficiently different to shmem that there is not much code sharing
>>>>> between the two (I haven't thought this through, but the object
>>>>> naming/refcounting should be similiar), but using shmem would be a
>>>>> familiar API.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>
>>
Khalid Aziz Jan. 24, 2022, 10:30 p.m. UTC | #26
On 1/24/22 12:45, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/22/22 04:31, Mike Rapoport wrote:
>>> (added linux-api)
>>>
>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>> Page tables in kernel consume some of the memory and as long as
>>>> number of mappings being maintained is small enough, this space
>>>> consumed by page tables is not objectionable. When very few memory
>>>> pages are shared between processes, the number of page table entries
>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>> memory on the system. As the number of shared pages and the number
>>>> of times pages are shared goes up, amount of memory consumed by page
>>>> tables starts to become significant.
>>>>
>>>> Some of the field deployments commonly see memory pages shared
>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>> is only 8 bytes long which is very small compared to the 4K page
>>>> size. When 2000 processes map the same page in their address space,
>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>> database server with 300GB SGA, a system carsh was seen with
>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>> even though the system had 512GB of memory. On this server, in the
>>>> worst case scenario of all 1500 processes mapping every page from
>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>> could be shared, amount of memory saved is very significant.
>>>>
>>>> This is a proposal to implement a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>> a new system call - mshare(), which can be used by a process to
>>>> create a region (we will call it mshare'd region) which can be used
>>>> by other processes to map same pages using shared PTEs. Other
>>>> process(es), assuming they have the right permissions, can then make
>>>> the mashare() system call to map the shared pages into their address
>>>> space using the shared PTEs.  When a process is done using this
>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>> access. When the last process accessing mshare'd region calls
>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>> it is freed.
>>>>
>>>>
>>>> API Proposal
>>>> ============
>>>>
>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>
>>>> --
>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>
>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>> region that will be shared at PTE level. "name" refers to shared object
>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>> of this shared memory area and length is the size of this area.
>>>> oflags can be one of:
>>>>
>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>> - O_RDWR opens shared memory area for read and write access
>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>     exists with that name, return an error.
>>>>
>>>> mode represents the creation mode for the shared object under
>>>> /sys/fs/mshare.
>>>>
>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> Did you consider returning a file descriptor from mshare() system call?
>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>
>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
>> with a close() semantics though unless there was another way to force refcount to 0. Right?
>>
> 
> I'm not sure I understand the problem.  If you're sharing a portion of
> an mm and the mm goes away, then all that should be left are some
> struct files that are no longer useful.  They'll go away when their
> refcount goes to zero.
> 
> --Andy
> 

The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process 
initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd 
portion go away.

One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but 
the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to 
hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare() 
increments the refcount and each call to mshare_unlink() decrements the refcount.

--
Khalid
Andy Lutomirski Jan. 24, 2022, 11:16 p.m. UTC | #27
On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/24/22 12:45, Andy Lutomirski wrote:
> > On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> On 1/22/22 04:31, Mike Rapoport wrote:
> >>> (added linux-api)
> >>>
> >>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >>>> Page tables in kernel consume some of the memory and as long as
> >>>> number of mappings being maintained is small enough, this space
> >>>> consumed by page tables is not objectionable. When very few memory
> >>>> pages are shared between processes, the number of page table entries
> >>>> (PTEs) to maintain is mostly constrained by the number of pages of
> >>>> memory on the system. As the number of shared pages and the number
> >>>> of times pages are shared goes up, amount of memory consumed by page
> >>>> tables starts to become significant.
> >>>>
> >>>> Some of the field deployments commonly see memory pages shared
> >>>> across 1000s of processes. On x86_64, each page requires a PTE that
> >>>> is only 8 bytes long which is very small compared to the 4K page
> >>>> size. When 2000 processes map the same page in their address space,
> >>>> each one of them requires 8 bytes for its PTE and together that adds
> >>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >>>> database server with 300GB SGA, a system carsh was seen with
> >>>> out-of-memory condition when 1500+ clients tried to share this SGA
> >>>> even though the system had 512GB of memory. On this server, in the
> >>>> worst case scenario of all 1500 processes mapping every page from
> >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >>>> could be shared, amount of memory saved is very significant.
> >>>>
> >>>> This is a proposal to implement a mechanism in kernel to allow
> >>>> userspace processes to opt into sharing PTEs. The proposal is to add
> >>>> a new system call - mshare(), which can be used by a process to
> >>>> create a region (we will call it mshare'd region) which can be used
> >>>> by other processes to map same pages using shared PTEs. Other
> >>>> process(es), assuming they have the right permissions, can then make
> >>>> the mashare() system call to map the shared pages into their address
> >>>> space using the shared PTEs.  When a process is done using this
> >>>> mshare'd region, it makes a mshare_unlink() system call to end its
> >>>> access. When the last process accessing mshare'd region calls
> >>>> mshare_unlink(), the mshare'd region is torn down and memory used by
> >>>> it is freed.
> >>>>
> >>>>
> >>>> API Proposal
> >>>> ============
> >>>>
> >>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>>>
> >>>> --
> >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>>>
> >>>> mshare() creates and opens a new, or opens an existing mshare'd
> >>>> region that will be shared at PTE level. "name" refers to shared object
> >>>> name that exists under /sys/fs/mshare. "addr" is the starting address
> >>>> of this shared memory area and length is the size of this area.
> >>>> oflags can be one of:
> >>>>
> >>>> - O_RDONLY opens shared memory area for read only access by everyone
> >>>> - O_RDWR opens shared memory area for read and write access
> >>>> - O_CREAT creates the named shared memory area if it does not exist
> >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>>>     exists with that name, return an error.
> >>>>
> >>>> mode represents the creation mode for the shared object under
> >>>> /sys/fs/mshare.
> >>>>
> >>>> mshare() returns an error code if it fails, otherwise it returns 0.
> >>>
> >>> Did you consider returning a file descriptor from mshare() system call?
> >>> Then there would be no need in mshare_unlink() as close(fd) would work.
> >>
> >> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> >> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> >> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> >> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> >> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> >> with a close() semantics though unless there was another way to force refcount to 0. Right?
> >>
> >
> > I'm not sure I understand the problem.  If you're sharing a portion of
> > an mm and the mm goes away, then all that should be left are some
> > struct files that are no longer useful.  They'll go away when their
> > refcount goes to zero.
> >
> > --Andy
> >
>
> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
> portion go away.
>
> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
> increments the refcount and each call to mshare_unlink() decrements the refcount.

In general, objects which are kept alive by name tend to be quite
awkward. Things like network namespaces essentially have to work that
way and end up with awkward APIs.  Things like shared memory don't
actually have to be kept alive by name, and the cases that do keep
them alive by name (tmpfs, shmget) can end up being so awkward that
people invent nameless variants like memfd.

So I would strongly suggest you see how the design works out with no
names and no external keep-alive mechanism.  Either have the continued
existence of *any* fd keep the whole thing alive or make it be a pair
of struct files, one that controls the region (and can map things into
it, etc) and one that can map it.  SCM_RIGHTS is pretty good for
passing objects like this around.

--Andy
Khalid Aziz Jan. 24, 2022, 11:44 p.m. UTC | #28
On 1/24/22 16:16, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/24/22 12:45, Andy Lutomirski wrote:
>>> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> On 1/22/22 04:31, Mike Rapoport wrote:
>>>>> (added linux-api)
>>>>>
>>>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>>>> Page tables in kernel consume some of the memory and as long as
>>>>>> number of mappings being maintained is small enough, this space
>>>>>> consumed by page tables is not objectionable. When very few memory
>>>>>> pages are shared between processes, the number of page table entries
>>>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>>>> memory on the system. As the number of shared pages and the number
>>>>>> of times pages are shared goes up, amount of memory consumed by page
>>>>>> tables starts to become significant.
>>>>>>
>>>>>> Some of the field deployments commonly see memory pages shared
>>>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>>>> is only 8 bytes long which is very small compared to the 4K page
>>>>>> size. When 2000 processes map the same page in their address space,
>>>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>>>> database server with 300GB SGA, a system carsh was seen with
>>>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>>>> even though the system had 512GB of memory. On this server, in the
>>>>>> worst case scenario of all 1500 processes mapping every page from
>>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>>>> could be shared, amount of memory saved is very significant.
>>>>>>
>>>>>> This is a proposal to implement a mechanism in kernel to allow
>>>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>>>> a new system call - mshare(), which can be used by a process to
>>>>>> create a region (we will call it mshare'd region) which can be used
>>>>>> by other processes to map same pages using shared PTEs. Other
>>>>>> process(es), assuming they have the right permissions, can then make
>>>>>> the mashare() system call to map the shared pages into their address
>>>>>> space using the shared PTEs.  When a process is done using this
>>>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>>>> access. When the last process accessing mshare'd region calls
>>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>>>> it is freed.
>>>>>>
>>>>>>
>>>>>> API Proposal
>>>>>> ============
>>>>>>
>>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>>>
>>>>>> --
>>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>>>
>>>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>>>> region that will be shared at PTE level. "name" refers to shared object
>>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>>>> of this shared memory area and length is the size of this area.
>>>>>> oflags can be one of:
>>>>>>
>>>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>>>> - O_RDWR opens shared memory area for read and write access
>>>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>>>      exists with that name, return an error.
>>>>>>
>>>>>> mode represents the creation mode for the shared object under
>>>>>> /sys/fs/mshare.
>>>>>>
>>>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>>
>>>>> Did you consider returning a file descriptor from mshare() system call?
>>>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>>>
>>>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
>>>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
>>>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
>>>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
>>>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
>>>> with a close() semantics though unless there was another way to force refcount to 0. Right?
>>>>
>>>
>>> I'm not sure I understand the problem.  If you're sharing a portion of
>>> an mm and the mm goes away, then all that should be left are some
>>> struct files that are no longer useful.  They'll go away when their
>>> refcount goes to zero.
>>>
>>> --Andy
>>>
>>
>> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
>> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
>> portion go away.
>>
>> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
>> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
>> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
>> increments the refcount and each call to mshare_unlink() decrements the refcount.
> 
> In general, objects which are kept alive by name tend to be quite
> awkward. Things like network namespaces essentially have to work that
> way and end up with awkward APIs.  Things like shared memory don't
> actually have to be kept alive by name, and the cases that do keep
> them alive by name (tmpfs, shmget) can end up being so awkward that
> people invent nameless variants like memfd.
> 
> So I would strongly suggest you see how the design works out with no
> names and no external keep-alive mechanism.  Either have the continued
> existence of *any* fd keep the whole thing alive or make it be a pair
> of struct files, one that controls the region (and can map things into
> it, etc) and one that can map it.  SCM_RIGHTS is pretty good for
> passing objects like this around.
> 
> --Andy
> 

These are certainly good ideas to simplify this feature. My very first implementation of mshare did not have msharefs, 
was based on fd where fd could be passed to any other process using SCM_RIGHTS and required the process creating mshare 
region to stay alive for the region to exist. That certainly made life simpler in terms of implementation. Feedback from 
my customers of this feature (DB developers and people deploying DB systems) was this imposes a hard dependency on a 
server process that creates the mshare'd region and passes fd to other processes needing access to this region. This 
dependency creates a weak link in system reliability that is too risky. If the server process dies for any reason, the 
entire system becomes unavailable. They requested a more robust implementation that they can depend upon. I then went 
through the process of implementing this using shmfs since POSIX shm has those attributes. That turned out to be more 
kludgy than a clean implementation using a separate in-memory msharefs. That brought me to the RFC implementation I sent 
out.

I do agree with you that name based persistent object makes the implementation more complex (maintaining a separate mm 
not tied to a process requires quite a bit of work to keep things consistent and clean mm up properly as users of this 
shared mm terminate) but I see the reliability point of view. Does that logic resonate with you?

Thanks,
Khalid
Kirill A. Shutemov Jan. 25, 2022, 11:42 a.m. UTC | #29
On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Example Code
> ============
> 
> Snippet of the code that a donor process would run looks like below:
> 
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
> 
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> 			GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
> 
>         strncpy(addr, "Some random shared text",
> 			sizeof("Some random shared text"));
> -----------------
> 
> Snippet of code that a consumer process would execute looks like:
> 
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
> 
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr %lx \n",
> 				mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
> 
>         close(fd);
> 
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> 			mshare_info[1], O_RDWR, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
> 
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
> 	printf("\nDone\n");
> 
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------

I wounder if we can get away with zero-API here: we can transparently
create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
size and alignment is sutiable. Page tables will be linked to the inode
and will be freed when the last of such mapping will go away. I don't see
a need in new syscalls of flags to existing one.
William Kucharski Jan. 25, 2022, 12:09 p.m. UTC | #30
I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.

If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.

> On Jan 25, 2022, at 04:41, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Example Code
>> ============
>> 
>> Snippet of the code that a donor process would run looks like below:
>> 
>> -----------------
>>        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                        MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>        if (addr == MAP_FAILED)
>>                perror("ERROR: mmap failed");
>> 
>>        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>            GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>        if (err < 0) {
>>                perror("mshare() syscall failed");
>>                exit(1);
>>        }
>> 
>>        strncpy(addr, "Some random shared text",
>>            sizeof("Some random shared text"));
>> -----------------
>> 
>> Snippet of code that a consumer process would execute looks like:
>> 
>> -----------------
>>        fd = open("testregion", O_RDONLY);
>>        if (fd < 0) {
>>                perror("open failed");
>>                exit(1);
>>        }
>> 
>>        if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>                printf("INFO: %ld bytes shared at addr %lx \n",
>>                mshare_info[1], mshare_info[0]);
>>        else
>>                perror("read failed");
>> 
>>        close(fd);
>> 
>>        addr = (char *)mshare_info[0];
>>        err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>            mshare_info[1], O_RDWR, 600);
>>        if (err < 0) {
>>                perror("mshare() syscall failed");
>>                exit(1);
>>        }
>> 
>>        printf("Guest mmap at %px:\n", addr);
>>        printf("%s\n", addr);
>>    printf("\nDone\n");
>> 
>>        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>        if (err < 0) {
>>                perror("mshare_unlink() failed");
>>                exit(1);
>>        }
>> -----------------
> 
> I wounder if we can get away with zero-API here: we can transparently
> create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> size and alignment is sutiable. Page tables will be linked to the inode
> and will be freed when the last of such mapping will go away. I don't see
> a need in new syscalls of flags to existing one.
> 
> -- 
> Kirill A. Shutemov
>
David Hildenbrand Jan. 25, 2022, 1:18 p.m. UTC | #31
On 25.01.22 13:09, William Kucharski wrote:
> I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.
> 
> If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.
> 

I don't have time to look into details right now, but I see various
possible hard-to-handle issues with sharing anon pages via this
mechanism between processes. If we could restrict it to MAP_SHARED and
have some magic toggle to opt in, that would be great. Features like
uffd that we might soon see on some MAP_SHARED pages will require to not
share page tables automatically I assume.
Matthew Wilcox (Oracle) Jan. 25, 2022, 1:23 p.m. UTC | #32
On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> I wounder if we can get away with zero-API here: we can transparently
> create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> size and alignment is sutiable. Page tables will be linked to the inode
> and will be freed when the last of such mapping will go away. I don't see
> a need in new syscalls of flags to existing one.

That's how HugeTLBfs works today, right?  Would you want that mechanism
hoisted into the real MM?  Because my plan was the opposite -- remove it
from the shadow MM once mshare() is established.
Kirill A. Shutemov Jan. 25, 2022, 1:59 p.m. UTC | #33
On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> > I wounder if we can get away with zero-API here: we can transparently
> > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> > size and alignment is sutiable. Page tables will be linked to the inode
> > and will be freed when the last of such mapping will go away. I don't see
> > a need in new syscalls of flags to existing one.
> 
> That's how HugeTLBfs works today, right?  Would you want that mechanism
> hoisted into the real MM?  Because my plan was the opposite -- remove it
> from the shadow MM once mshare() is established.

I hate HugeTLBfs because it is a special place with own rules. mshare() as
it proposed creates a new special place. I don't like this.

It's better to find a way to integrate the feature natively into core-mm
and make as much users as possible to benefit from it.

I think zero-API approach (plus madvise() hints to tweak it) is worth
considering.
Kirill A. Shutemov Jan. 25, 2022, 2:01 p.m. UTC | #34
On Tue, Jan 25, 2022 at 02:18:57PM +0100, David Hildenbrand wrote:
> On 25.01.22 13:09, William Kucharski wrote:
> > I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.
> > 
> > If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.
> > 
> 
> I don't have time to look into details right now, but I see various
> possible hard-to-handle issues with sharing anon pages via this
> mechanism between processes.

Right. We should not break invariant that an anonymous page can only be
mapped into a process once. Otherwise may need to deal with new class of
security issues.
Matthew Wilcox (Oracle) Jan. 25, 2022, 2:09 p.m. UTC | #35
On Tue, Jan 25, 2022 at 04:59:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> > > I wounder if we can get away with zero-API here: we can transparently
> > > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> > > size and alignment is sutiable. Page tables will be linked to the inode
> > > and will be freed when the last of such mapping will go away. I don't see
> > > a need in new syscalls of flags to existing one.
> > 
> > That's how HugeTLBfs works today, right?  Would you want that mechanism
> > hoisted into the real MM?  Because my plan was the opposite -- remove it
> > from the shadow MM once mshare() is established.
> 
> I hate HugeTLBfs because it is a special place with own rules. mshare() as
> it proposed creates a new special place. I don't like this.

No new special place.  I suppose the only thing it creates that's "new"
is an MM without any threads of its own.  And from the MM point of view,
that's not a new thing at all because the MM simply doesn't care how
many threads share an MM.

> It's better to find a way to integrate the feature natively into core-mm
> and make as much users as possible to benefit from it.

That's what mshare is trying to do!

> I think zero-API approach (plus madvise() hints to tweak it) is worth
> considering.

I think the zero-API approach actually misses out on a lot of
possibilities that the mshare() approach offers.  For example, mshare()
allows you to mmap() many small files in the shared region -- you
can't do that with zeroAPI.
Kirill A. Shutemov Jan. 25, 2022, 6:57 p.m. UTC | #36
On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > considering.
> 
> I think the zero-API approach actually misses out on a lot of
> possibilities that the mshare() approach offers.  For example, mshare()
> allows you to mmap() many small files in the shared region -- you
> can't do that with zeroAPI.

Do you consider a use-case for many small files to be common? I would
think that the main consumer of the feature to be mmap of huge files.
And in this case zero enabling burden on userspace side sounds like a
sweet deal.
Matthew Wilcox (Oracle) Jan. 25, 2022, 6:59 p.m. UTC | #37
On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > considering.
> > 
> > I think the zero-API approach actually misses out on a lot of
> > possibilities that the mshare() approach offers.  For example, mshare()
> > allows you to mmap() many small files in the shared region -- you
> > can't do that with zeroAPI.
> 
> Do you consider a use-case for many small files to be common? I would
> think that the main consumer of the feature to be mmap of huge files.
> And in this case zero enabling burden on userspace side sounds like a
> sweet deal.

mmap() of huge files is certainly the Oracle use-case.  With occasional
funny business like mprotect() of a single page in the middle of a 1GB
hugepage.

The approach of designating ranges of a process's address space as
sharable with other processes felt like the cleaner & frankly more
interesting approach that opens up use-cases other than "hurr, durr, we
are Oracle, we like big files, kernel get out of way now, transactions
to perform".
Matthew Wilcox (Oracle) Jan. 26, 2022, 4:04 a.m. UTC | #38
On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > > considering.
> > > 
> > > I think the zero-API approach actually misses out on a lot of
> > > possibilities that the mshare() approach offers.  For example, mshare()
> > > allows you to mmap() many small files in the shared region -- you
> > > can't do that with zeroAPI.
> > 
> > Do you consider a use-case for many small files to be common? I would
> > think that the main consumer of the feature to be mmap of huge files.
> > And in this case zero enabling burden on userspace side sounds like a
> > sweet deal.
> 
> mmap() of huge files is certainly the Oracle use-case.  With occasional
> funny business like mprotect() of a single page in the middle of a 1GB
> hugepage.

Bill and I were talking about this earlier and realised that this is
the key point.  There's a requirement that when one process mprotects
a page that it gets protected in all processes.  You can't do that
without *some* API because that's different behaviour than any existing
API would produce.

So how about something like this ...

int mcreate(const char *name, int flags, mode_t mode);

creates a new mm_struct with a refcount of 2.  returns an fd (one
of the two refcounts) and creates a name for it (inside msharefs,
holds the other refcount).

You can then mmap() that fd to attach it to a chunk of your address
space.  Once attached, you can start to populate it by calling
mmap() and specifying an address inside the attached mm as the first
argument to mmap().

Maybe mcreate() is just a library call, and it's really a thin wrapper
around open() that happens to know where msharefs is mounted.
David Hildenbrand Jan. 26, 2022, 10:16 a.m. UTC | #39
On 26.01.22 05:04, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
>> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>>> On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
>>>>> I think zero-API approach (plus madvise() hints to tweak it) is worth
>>>>> considering.
>>>>
>>>> I think the zero-API approach actually misses out on a lot of
>>>> possibilities that the mshare() approach offers.  For example, mshare()
>>>> allows you to mmap() many small files in the shared region -- you
>>>> can't do that with zeroAPI.
>>>
>>> Do you consider a use-case for many small files to be common? I would
>>> think that the main consumer of the feature to be mmap of huge files.
>>> And in this case zero enabling burden on userspace side sounds like a
>>> sweet deal.
>>
>> mmap() of huge files is certainly the Oracle use-case.  With occasional
>> funny business like mprotect() of a single page in the middle of a 1GB
>> hugepage.
> 
> Bill and I were talking about this earlier and realised that this is
> the key point.  There's a requirement that when one process mprotects
> a page that it gets protected in all processes.  You can't do that
> without *some* API because that's different behaviour than any existing
> API would produce.

A while ago I talked with Peter about an extended uffd (here: WP)
mechanism that would work on fds instead of the process address space.

The rough idea would be to register the uffd (or however that would be
called) handler on an fd instead of a virtual address space of a single
process and write-protect pages in that fd. Once anybody would try
writing to such a protected range (write, mmap, ...), the uffd handler
would fire and user space could handle the event (-> unprotect). The
page cache would have to remember the uffd information ("wp using
uffd"). When (un)protecting pages using this mechanism, all page tables
mapping the page would have to be updated accordingly using the rmap. At
that point, we wouldn't care if it's a single page table (e.g., shared
similar to hugetlb) or simply multiple page tables.

It's a completely rough idea, I just wanted to mention it.
Matthew Wilcox (Oracle) Jan. 26, 2022, 1:38 p.m. UTC | #40
On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> A while ago I talked with Peter about an extended uffd (here: WP)
> mechanism that would work on fds instead of the process address space.

As far as I can tell, uffd is a grotesque hack that exists to work around
the poor choice to use anonymous memory instead of file-backed memory
in kvm.  Every time I see somebody mention it, I feel pain.
Kirill A. Shutemov Jan. 26, 2022, 1:42 p.m. UTC | #41
On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > > > considering.
> > > > 
> > > > I think the zero-API approach actually misses out on a lot of
> > > > possibilities that the mshare() approach offers.  For example, mshare()
> > > > allows you to mmap() many small files in the shared region -- you
> > > > can't do that with zeroAPI.
> > > 
> > > Do you consider a use-case for many small files to be common? I would
> > > think that the main consumer of the feature to be mmap of huge files.
> > > And in this case zero enabling burden on userspace side sounds like a
> > > sweet deal.
> > 
> > mmap() of huge files is certainly the Oracle use-case.  With occasional
> > funny business like mprotect() of a single page in the middle of a 1GB
> > hugepage.
> 
> Bill and I were talking about this earlier and realised that this is
> the key point.  There's a requirement that when one process mprotects
> a page that it gets protected in all processes.  You can't do that
> without *some* API because that's different behaviour than any existing
> API would produce.

"hurr, durr, we are Oracle" :P

Sounds like a very niche requirement. I doubt there will more than single
digit user count for the feature. Maybe only the DB.

> So how about something like this ...
> 
> int mcreate(const char *name, int flags, mode_t mode);
> 
> creates a new mm_struct with a refcount of 2.  returns an fd (one
> of the two refcounts) and creates a name for it (inside msharefs,
> holds the other refcount).
> 
> You can then mmap() that fd to attach it to a chunk of your address
> space.  Once attached, you can start to populate it by calling
> mmap() and specifying an address inside the attached mm as the first
> argument to mmap().

That is not what mmap() would normally do to an existing mapping. So it
requires special treatment.

In general mmap() of a mm_struct scares me. I can't wrap my head around
implications.

Like how does it work on fork()?

How accounting works? What happens on OOM?

What prevents creating loops, like mapping a mm_struct inside itself?

What mremap()/munmap() do to such mapping? Will it affect mapping of
mm_struct or will it target mapping inside the mm_sturct?

Maybe it just didn't clicked for me, I donno.

> Maybe mcreate() is just a library call, and it's really a thin wrapper
> around open() that happens to know where msharefs is mounted.
David Hildenbrand Jan. 26, 2022, 1:55 p.m. UTC | #42
On 26.01.22 14:38, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
>> A while ago I talked with Peter about an extended uffd (here: WP)
>> mechanism that would work on fds instead of the process address space.
> 
> As far as I can tell, uffd is a grotesque hack that exists to work around
> the poor choice to use anonymous memory instead of file-backed memory
> in kvm.  Every time I see somebody mention it, I feel pain.
> 

I might be missing something important, because KVM can deal with
file-back memory just fine and uffd is used heavily outside of hypervisors.

I'd love to learn how to handle what ordinary uffd (handle
missing/unpopulated pages) and uffd-wp (handle write access to pages)
can do with files instead. Because if something like that already
exists, it would be precisely what I am talking about.

Maybe mentioning uffd was a bad choice ;)
Mike Rapoport Jan. 26, 2022, 2:12 p.m. UTC | #43
On Wed, Jan 26, 2022 at 01:38:49PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> > A while ago I talked with Peter about an extended uffd (here: WP)
> > mechanism that would work on fds instead of the process address space.
> 
> As far as I can tell, uffd is a grotesque hack that exists to work around
> the poor choice to use anonymous memory instead of file-backed memory
> in kvm.  Every time I see somebody mention it, I feel pain.

How file-backed memory would have helped for the major use-case of uffd
which is post-copy migration?
Matthew Wilcox (Oracle) Jan. 26, 2022, 2:12 p.m. UTC | #44
On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote:
> On 26.01.22 14:38, Matthew Wilcox wrote:
> > On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> >> A while ago I talked with Peter about an extended uffd (here: WP)
> >> mechanism that would work on fds instead of the process address space.
> > 
> > As far as I can tell, uffd is a grotesque hack that exists to work around
> > the poor choice to use anonymous memory instead of file-backed memory
> > in kvm.  Every time I see somebody mention it, I feel pain.
> > 
> 
> I might be missing something important, because KVM can deal with
> file-back memory just fine and uffd is used heavily outside of hypervisors.
> 
> I'd love to learn how to handle what ordinary uffd (handle
> missing/unpopulated pages) and uffd-wp (handle write access to pages)
> can do with files instead. Because if something like that already
> exists, it would be precisely what I am talking about.

Every notification that uffd wants already exists as a notification to
the underlying filesystem.  Something like a uffdfs [1] would be able
to do everything that uffd does without adding extra crap all over the MM.

[1] acronyms are bad, mmmkay?
Mike Rapoport Jan. 26, 2022, 2:18 p.m. UTC | #45
On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> > > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>
> > So how about something like this ...
> > 
> > int mcreate(const char *name, int flags, mode_t mode);
> > 
> > creates a new mm_struct with a refcount of 2.  returns an fd (one
> > of the two refcounts) and creates a name for it (inside msharefs,
> > holds the other refcount).
> > 
> > You can then mmap() that fd to attach it to a chunk of your address
> > space.  Once attached, you can start to populate it by calling
> > mmap() and specifying an address inside the attached mm as the first
> > argument to mmap().
> 
> That is not what mmap() would normally do to an existing mapping. So it
> requires special treatment.
> 
> In general mmap() of a mm_struct scares me. I can't wrap my head around
> implications.
> 
> Like how does it work on fork()?
> 
> How accounting works? What happens on OOM?
> 
> What prevents creating loops, like mapping a mm_struct inside itself?
> 
> What mremap()/munmap() do to such mapping? Will it affect mapping of
> mm_struct or will it target mapping inside the mm_sturct?
> 
> Maybe it just didn't clicked for me, I donno.

My understanding was that the new mm_struct would be rather stripped and
will be used more as an abstraction for the shared page table, maybe I'm
totally wrong :)
 
> > Maybe mcreate() is just a library call, and it's really a thin wrapper
> > around open() that happens to know where msharefs is mounted.
> 
> -- 
>  Kirill A. Shutemov
David Hildenbrand Jan. 26, 2022, 2:30 p.m. UTC | #46
On 26.01.22 15:12, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote:
>> On 26.01.22 14:38, Matthew Wilcox wrote:
>>> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
>>>> A while ago I talked with Peter about an extended uffd (here: WP)
>>>> mechanism that would work on fds instead of the process address space.
>>>
>>> As far as I can tell, uffd is a grotesque hack that exists to work around
>>> the poor choice to use anonymous memory instead of file-backed memory
>>> in kvm.  Every time I see somebody mention it, I feel pain.
>>>
>>
>> I might be missing something important, because KVM can deal with
>> file-back memory just fine and uffd is used heavily outside of hypervisors.
>>
>> I'd love to learn how to handle what ordinary uffd (handle
>> missing/unpopulated pages) and uffd-wp (handle write access to pages)
>> can do with files instead. Because if something like that already
>> exists, it would be precisely what I am talking about.
> 
> Every notification that uffd wants already exists as a notification to
> the underlying filesystem.  Something like a uffdfs [1] would be able
> to do everything that uffd does without adding extra crap all over the MM.

I don't speak "filesystem" fluently, but I assume that could be an
overlay over other fs?

Peter is currently upstreaming uffd-wp for shmem. How could that look
like when doing it the fs-way?
Khalid Aziz Jan. 26, 2022, 5:33 p.m. UTC | #47
On 1/26/22 07:18, Mike Rapoport wrote:
> On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote:
>> On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
>>> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
>>>> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>>
>>> So how about something like this ...
>>>
>>> int mcreate(const char *name, int flags, mode_t mode);
>>>
>>> creates a new mm_struct with a refcount of 2.  returns an fd (one
>>> of the two refcounts) and creates a name for it (inside msharefs,
>>> holds the other refcount).
>>>
>>> You can then mmap() that fd to attach it to a chunk of your address
>>> space.  Once attached, you can start to populate it by calling
>>> mmap() and specifying an address inside the attached mm as the first
>>> argument to mmap().
>>
>> That is not what mmap() would normally do to an existing mapping. So it
>> requires special treatment.
>>
>> In general mmap() of a mm_struct scares me. I can't wrap my head around
>> implications.
>>
>> Like how does it work on fork()?
>>
>> How accounting works? What happens on OOM?
>>
>> What prevents creating loops, like mapping a mm_struct inside itself?
>>
>> What mremap()/munmap() do to such mapping? Will it affect mapping of
>> mm_struct or will it target mapping inside the mm_sturct?
>>
>> Maybe it just didn't clicked for me, I donno.
> 
> My understanding was that the new mm_struct would be rather stripped and
> will be used more as an abstraction for the shared page table, maybe I'm
> totally wrong :)

Your understanding is correct for the RFC implementation of mshare(). mcreate() is a different beast that I do not fully 
understand yet. From Matthew's explanation, it sounds like what he has in mind is that mcreate() is a frontend to 
mshare/msharefs, uses mshare to created the shared region and thus allows a user to mprotect a single page inside the 
mmap it creates using the fd returned by mcreate. mshare underneath automagically extends the new page protection to 
every one sharing that page owing to shared PTE.

--
Khalid

>   
>>> Maybe mcreate() is just a library call, and it's really a thin wrapper
>>> around open() that happens to know where msharefs is mounted.
>>
>> -- 
>>   Kirill A. Shutemov
>