Message ID: cover.1642526745.git.khalid.aziz@oracle.com (mailing list archive)
Series: Add support for shared PTEs across processes
On 1/18/22 1:19 PM, Khalid Aziz wrote:
> - Starting address must be aligned to pgdir size (512GB on x86_64)
How does this work on systems with 5-level paging where a top-level page
table entry covers 256TB? Is the alignment requirement 512GB or 256TB?
How does userspace figure out which alignment requirement it is subject to?
On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
> > - Starting address must be aligned to pgdir size (512GB on x86_64)
>
> How does this work on systems with 5-level paging where a top-level page
> table entry covers 256TB? Is the alignment requirement 512GB or 256TB?
> How does userspace figure out which alignment requirement it is subject to?

The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
but I suspect Khalid has simplified it for this first implementation.
On 1/18/22 1:19 PM, Khalid Aziz wrote:
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mshare() system call to map the shared pages into their address
> space using the shared PTEs.

One thing that went over my head here was that this allows sharing of
relatively arbitrary *EXISTING* regions. The mshare'd region might be
anonymous or a plain mmap()'d file. It can even be a filesystem or
device DAX mmap().

In other words, donors can (ideally) share anything. Consumers must
use msharefs to access the donated areas.

Right?

(btw... thanks to willy for the correction on IRC.)
On 1/18/22 14:46, Matthew Wilcox wrote:
> On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
>> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>
>> How does this work on systems with 5-level paging where a top-level page
>> table entry covers 256TB? Is the alignment requirement 512GB or 256TB?
>> How does userspace figure out which alignment requirement it is subject to?
>
> The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
> but I suspect Khalid has simplified it for this first implementation.

Hi Dave,

Yes, this is mostly to keep the code somewhat simpler. Large regions make
it easier to manage the separate set of shared VMAs. Part of the
exploration here is to see what size regions work for other people. This
initial prototype is x86 only and for now I am using PGDIR_SIZE. I see
your point about how userspace would figure out the alignment, since this
should work across all architectures and PGDIR_SIZE/PMD_SIZE/PUD_SIZE are
not the same across architectures. We can choose a fixed size and
alignment. I would like to keep the region size at 2^20 pages or larger
to minimize having to manage a large number of small shared regions if
those regions are not contiguous. Do you have any suggestion?

Thanks for your feedback.

--
Khalid
On 1/18/22 15:06, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mshare() system call to map the shared pages into their address
>> space using the shared PTEs.
>
> One thing that went over my head here was that this allows sharing of
> relatively arbitrary *EXISTING* regions. The mshare'd region might be
> anonymous or a plain mmap()'d file. It can even be a filesystem or
> device DAX mmap().
>
> In other words, donors can (ideally) share anything. Consumers must
> use msharefs to access the donated areas.
>
> Right?
>
> (btw... thanks to willy for the correction on IRC.)

Hi Dave,

Consumers use msharefs only to get information on the address and size of
the shared region. Access to the donated area does not go through
msharefs. So the consumer opens the file in msharefs to read the starting
address and size:

	fd = open("testregion", O_RDONLY);
	if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
		printf("INFO: %ld bytes shared at addr %lx \n",
			mshare_info[1], mshare_info[0]);
	else
		perror("read failed");
	close(fd);

It then uses that information to map in the donated region:

	addr = (char *)mshare_info[0];
	err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
			mshare_info[1], O_RDWR, 0600);

Makes sense?

Thanks,
Khalid
On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Page tables in the kernel consume some of the memory and as long as
> the number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, the amount of memory consumed by
> page tables starts to become significant.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long, which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 16K of memory just to hold the PTEs for one 4K page. On a
> database server with a 300GB SGA, a system crash was seen with an
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, the
> worst case scenario of all 1500 processes mapping every page from
> the SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, the amount of memory saved is very significant.
>
> This is a proposal to implement a mechanism in the kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it an mshare'd region) which can be
> used by other processes to map the same pages using shared PTEs.
> Other process(es), assuming they have the right permissions, can then
> make the mshare() system call to map the shared pages into their
> address space using the shared PTEs. When a process is done using
> this mshare'd region, it makes a mshare_unlink() system call to end
> its access. When the last process accessing the mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and the memory used
> by it is freed.
>
>
> API Proposal
> ============
>
> The mshare API consists of two system calls - mshare() and
> mshare_unlink().
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing, mshare'd
> region that will be shared at PTE level. "name" refers to the shared
> object name that exists under /sys/fs/mshare. "addr" is the starting
> address of this shared memory area and "length" is the size of this
> area. oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at the pgdir level and hence it imposes the following
> requirements on the address and size given to mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to an unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given to mshare() becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink(), which removes the shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                 sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
>                 printf("INFO: %ld bytes shared at addr %lx \n",
>                                 mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>                         mshare_info[1], O_RDWR, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------

...
Hi Khalid,

The proposed mshare() appears to be similar to POSIX shared memory,
but with two extra (related) attributes;
a) Internally, uses shared page tables.
b) Shared memory is mapped at same address for all users.

Rather than introduce two new system calls, along with a /sys/ file to
communicate global addresses, could mshare() be built on top of the
shmem API? Thinking of something like the below;
1) For shm_open(3), add a new oflag to indicate the properties needed
   for mshare() (say, O_SHARED_PTE - better name?)
2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
   in the sizes which can be set.
3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
   objects. On first mmap()ing an appropriate address is assigned,
   otherwise the current 'global' address is used.
4) shm_unlink(3) destroys the object when the last reference is dropped.

For 3), might be able to weaken the NULL requirement and validate a
given address on first mapping to ensure it is correctly aligned.
shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
the default behaviour you require.

Internally, the handling of mshare()/O_SHARED_PTE memory might be
sufficiently different to shmem that there is not much code sharing
between the two (I haven't thought this through, but the object
naming/refcounting should be similar), but using shmem would be a
familiar API.

Any thoughts?

Cheers,
Mark
On 1/19/22 04:38, Mark Hemment wrote: > On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >> Page tables in kernel consume some of the memory and as long as >> number of mappings being maintained is small enough, this space >> consumed by page tables is not objectionable. When very few memory >> pages are shared between processes, the number of page table entries >> (PTEs) to maintain is mostly constrained by the number of pages of >> memory on the system. As the number of shared pages and the number >> of times pages are shared goes up, amount of memory consumed by page >> tables starts to become significant. >> >> Some of the field deployments commonly see memory pages shared >> across 1000s of processes. On x86_64, each page requires a PTE that >> is only 8 bytes long which is very small compared to the 4K page >> size. When 2000 processes map the same page in their address space, >> each one of them requires 8 bytes for its PTE and together that adds >> up to 8K of memory just to hold the PTEs for one 4K page. On a >> database server with 300GB SGA, a system carsh was seen with >> out-of-memory condition when 1500+ clients tried to share this SGA >> even though the system had 512GB of memory. On this server, in the >> worst case scenario of all 1500 processes mapping every page from >> SGA would have required 878GB+ for just the PTEs. If these PTEs >> could be shared, amount of memory saved is very significant. >> >> This is a proposal to implement a mechanism in kernel to allow >> userspace processes to opt into sharing PTEs. The proposal is to add >> a new system call - mshare(), which can be used by a process to >> create a region (we will call it mshare'd region) which can be used >> by other processes to map same pages using shared PTEs. Other >> process(es), assuming they have the right permissions, can then make >> the mashare() system call to map the shared pages into their address >> space using the shared PTEs. 
When a process is done using this >> mshare'd region, it makes a mshare_unlink() system call to end its >> access. When the last process accessing mshare'd region calls >> mshare_unlink(), the mshare'd region is torn down and memory used by >> it is freed. >> >> >> API Proposal >> ============ >> >> The mshare API consists of two system calls - mshare() and mshare_unlink() >> >> -- >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) >> >> mshare() creates and opens a new, or opens an existing mshare'd >> region that will be shared at PTE level. "name" refers to shared object >> name that exists under /sys/fs/mshare. "addr" is the starting address >> of this shared memory area and length is the size of this area. >> oflags can be one of: >> >> - O_RDONLY opens shared memory area for read only access by everyone >> - O_RDWR opens shared memory area for read and write access >> - O_CREAT creates the named shared memory area if it does not exist >> - O_EXCL If O_CREAT was also specified, and a shared memory area >> exists with that name, return an error. >> >> mode represents the creation mode for the shared object under >> /sys/fs/mshare. >> >> mshare() returns an error code if it fails, otherwise it returns 0. >> >> PTEs are shared at pgdir level and hence it imposes following >> requirements on the address and size given to the mshare(): >> >> - Starting address must be aligned to pgdir size (512GB on x86_64) >> - Size must be a multiple of pgdir size >> - Any mappings created in this address range at any time become >> shared automatically >> - Shared address range can have unmapped addresses in it. Any access >> to unmapped address will result in SIGBUS >> >> Mappings within this address range behave as if they were shared >> between threads, so a write to a MAP_PRIVATE mapping will create a >> page which is shared between all the sharers. 
The first process that >> declares an address range mshare'd can continue to map objects in >> the shared area. All other processes that want mshare'd access to >> this memory area can do so by calling mshare(). After this call, the >> address range given by mshare becomes a shared range in its address >> space. Anonymous mappings will be shared and not COWed. >> >> A file under /sys/fs/mshare can be opened and read from. A read from >> this file returns two long values - (1) starting address, and (2) >> size of the mshare'd region. >> >> -- >> int mshare_unlink(char *name) >> >> A shared address range created by mshare() can be destroyed using >> mshare_unlink() which removes the shared named object. Once all >> processes have unmapped the shared object, the shared address range >> references are de-allocated and destroyed. >> >> mshare_unlink() returns 0 on success or -1 on error. >> >> >> Example Code >> ============ >> >> Snippet of the code that a donor process would run looks like below: >> >> ----------------- >> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >> if (addr == MAP_FAILED) >> perror("ERROR: mmap failed"); >> >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); >> if (err < 0) { >> perror("mshare() syscall failed"); >> exit(1); >> } >> >> strncpy(addr, "Some random shared text", >> sizeof("Some random shared text")); >> ----------------- >> >> Snippet of code that a consumer process would execute looks like: >> >> ----------------- >> fd = open("testregion", O_RDONLY); >> if (fd < 0) { >> perror("open failed"); >> exit(1); >> } >> >> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) >> printf("INFO: %ld bytes shared at addr %lx \n", >> mshare_info[1], mshare_info[0]); >> else >> perror("read failed"); >> >> close(fd); >> >> addr = (char *)mshare_info[0]; >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], >> 
>> mshare_info[1], O_RDWR, 600);
>> if (err < 0) {
>> perror("mshare() syscall failed");
>> exit(1);
>> }
>>
>> printf("Guest mmap at %px:\n", addr);
>> printf("%s\n", addr);
>> printf("\nDone\n");
>>
>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>> if (err < 0) {
>> perror("mshare_unlink() failed");
>> exit(1);
>> }
>> -----------------
> ...
> Hi Khalid,
>
> The proposed mshare() appears to be similar to POSIX shared memory,
> but with two extra (related) attributes;
> a) Internally, uses shared page tables.
> b) Shared memory is mapped at same address for all users.

Hi Mark,

You are right, there are a few similarities with POSIX shm, but there is
one key difference - unlike shm, shared region access does not go through
a filesystem. msharefs exists to query mshare'd regions and enforce
access restrictions. mshare is meant to allow sharing any existing
regions that might map a file, may be anonymous, or map any other object.
Any consumer process can use the same PTEs to access whatever might be
mapped in that region, which is quite different from what shm does.
Because of the similarities between the two, I had started a prototype
using the POSIX shm API to leverage that code, but I found myself
special-casing mshare often enough in the shm code that it made sense to
go with a separate implementation. I considered an API very much like
POSIX shm, but a simple mshare() syscall that can share, at any time, a
range of addresses that may be fully or partially mapped in is a simpler
and more versatile API.

Does that rationale sound reasonable?

Thanks,
Khalid

> Rather than introduce two new system calls, along with /sys/ file to
> communicate global addresses, could mshare() be built on top of shmem
> API? Thinking of something like the below;
> 1) For shm_open(3), add a new oflag to indicate the properties needed
> for mshare() (say, O_SHARED_PTE - better name?)
> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> in the sizes which can be set.
> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> objects. On first mmap()ing an appropriate address is assigned,
> otherwise the current 'global' address is used.
> 4) shm_unlink(3) destroys the object when last reference is dropped.
>
> For 3), might be able to weaken the NULL requirement and validate a
> given address on first mapping to ensure it is correctly aligned.
> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> the default behaviour you require.
>
> Internally, the handling of mshare()/O_SHARED_PTE memory might be
> sufficiently different to shmem that there is not much code sharing
> between the two (I haven't thought this through, but the object
> naming/refcounting should be similar), but using shmem would be a
> familiar API.
>
> Any thoughts?
>
> Cheers,
> Mark
On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 1/19/22 04:38, Mark Hemment wrote: > > On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote: > >> > >> Page tables in kernel consume some of the memory and as long as > >> number of mappings being maintained is small enough, this space > >> consumed by page tables is not objectionable. When very few memory > >> pages are shared between processes, the number of page table entries > >> (PTEs) to maintain is mostly constrained by the number of pages of > >> memory on the system. As the number of shared pages and the number > >> of times pages are shared goes up, amount of memory consumed by page > >> tables starts to become significant. > >> > >> Some of the field deployments commonly see memory pages shared > >> across 1000s of processes. On x86_64, each page requires a PTE that > >> is only 8 bytes long which is very small compared to the 4K page > >> size. When 2000 processes map the same page in their address space, > >> each one of them requires 8 bytes for its PTE and together that adds > >> up to 8K of memory just to hold the PTEs for one 4K page. On a > >> database server with 300GB SGA, a system carsh was seen with > >> out-of-memory condition when 1500+ clients tried to share this SGA > >> even though the system had 512GB of memory. On this server, in the > >> worst case scenario of all 1500 processes mapping every page from > >> SGA would have required 878GB+ for just the PTEs. If these PTEs > >> could be shared, amount of memory saved is very significant. > >> > >> This is a proposal to implement a mechanism in kernel to allow > >> userspace processes to opt into sharing PTEs. The proposal is to add > >> a new system call - mshare(), which can be used by a process to > >> create a region (we will call it mshare'd region) which can be used > >> by other processes to map same pages using shared PTEs. 
Other > >> process(es), assuming they have the right permissions, can then make > >> the mashare() system call to map the shared pages into their address > >> space using the shared PTEs. When a process is done using this > >> mshare'd region, it makes a mshare_unlink() system call to end its > >> access. When the last process accessing mshare'd region calls > >> mshare_unlink(), the mshare'd region is torn down and memory used by > >> it is freed. > >> > >> > >> API Proposal > >> ============ > >> > >> The mshare API consists of two system calls - mshare() and mshare_unlink() > >> > >> -- > >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) > >> > >> mshare() creates and opens a new, or opens an existing mshare'd > >> region that will be shared at PTE level. "name" refers to shared object > >> name that exists under /sys/fs/mshare. "addr" is the starting address > >> of this shared memory area and length is the size of this area. > >> oflags can be one of: > >> > >> - O_RDONLY opens shared memory area for read only access by everyone > >> - O_RDWR opens shared memory area for read and write access > >> - O_CREAT creates the named shared memory area if it does not exist > >> - O_EXCL If O_CREAT was also specified, and a shared memory area > >> exists with that name, return an error. > >> > >> mode represents the creation mode for the shared object under > >> /sys/fs/mshare. > >> > >> mshare() returns an error code if it fails, otherwise it returns 0. > >> > >> PTEs are shared at pgdir level and hence it imposes following > >> requirements on the address and size given to the mshare(): > >> > >> - Starting address must be aligned to pgdir size (512GB on x86_64) > >> - Size must be a multiple of pgdir size > >> - Any mappings created in this address range at any time become > >> shared automatically > >> - Shared address range can have unmapped addresses in it. 
Any access > >> to unmapped address will result in SIGBUS > >> > >> Mappings within this address range behave as if they were shared > >> between threads, so a write to a MAP_PRIVATE mapping will create a > >> page which is shared between all the sharers. The first process that > >> declares an address range mshare'd can continue to map objects in > >> the shared area. All other processes that want mshare'd access to > >> this memory area can do so by calling mshare(). After this call, the > >> address range given by mshare becomes a shared range in its address > >> space. Anonymous mappings will be shared and not COWed. > >> > >> A file under /sys/fs/mshare can be opened and read from. A read from > >> this file returns two long values - (1) starting address, and (2) > >> size of the mshare'd region. > >> > >> -- > >> int mshare_unlink(char *name) > >> > >> A shared address range created by mshare() can be destroyed using > >> mshare_unlink() which removes the shared named object. Once all > >> processes have unmapped the shared object, the shared address range > >> references are de-allocated and destroyed. > >> > >> mshare_unlink() returns 0 on success or -1 on error. 
> >> > >> > >> Example Code > >> ============ > >> > >> Snippet of the code that a donor process would run looks like below: > >> > >> ----------------- > >> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, > >> MAP_SHARED | MAP_ANONYMOUS, 0, 0); > >> if (addr == MAP_FAILED) > >> perror("ERROR: mmap failed"); > >> > >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), > >> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); > >> if (err < 0) { > >> perror("mshare() syscall failed"); > >> exit(1); > >> } > >> > >> strncpy(addr, "Some random shared text", > >> sizeof("Some random shared text")); > >> ----------------- > >> > >> Snippet of code that a consumer process would execute looks like: > >> > >> ----------------- > >> fd = open("testregion", O_RDONLY); > >> if (fd < 0) { > >> perror("open failed"); > >> exit(1); > >> } > >> > >> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) > >> printf("INFO: %ld bytes shared at addr %lx \n", > >> mshare_info[1], mshare_info[0]); > >> else > >> perror("read failed"); > >> > >> close(fd); > >> > >> addr = (char *)mshare_info[0]; > >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], > >> mshare_info[1], O_RDWR, 600); > >> if (err < 0) { > >> perror("mshare() syscall failed"); > >> exit(1); > >> } > >> > >> printf("Guest mmap at %px:\n", addr); > >> printf("%s\n", addr); > >> printf("\nDone\n"); > >> > >> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); > >> if (err < 0) { > >> perror("mshare_unlink() failed"); > >> exit(1); > >> } > >> ----------------- > > ... > > Hi Khalid, > > > > The proposed mshare() appears to be similar to POSIX shared memory, > > but with two extra (related) attributes; > > a) Internally, uses shared page tables. > > b) Shared memory is mapped at same address for all users. > > Hi Mark, > > You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region > access does not go through a filesystem. 
> msharefs exists to query mshare'd regions and enforce access
> restrictions. mshare is meant to allow sharing any existing regions
> that might map a file, may be anonymous or map any other object. Any
> consumer process can use the same PTEs to access whatever might be
> mapped in that region which is quite different from what shm does.
> Because of the similarities between the two, I had started a prototype
> using POSIX shm API to leverage that code but I found myself special
> casing mshare often enough in shm code that it made sense to go with a
> separate implementation.

Ah, I jumped in assuming this was only for anon memory.

> I considered an API very much like POSIX shm but a simple mshare()
> syscall at any time to share a range of addresses that may be fully or
> partially mapped in is a simpler and more versatile API.

So possibly you have already considered the below... which does make
the API a little more POSIX shm like.

The mshare() syscall does two operations;
1) create/open mshare object
2) export/import the given memory region

Would it be better if these were separate operations? That is,
mshare_open() (say) creates/opens the object, returning a file
descriptor. The fd is used as the identifier for the export/import
after mmap(2); eg.
    addr = mshare_op(EXPORT, fd, addr, size);
    addr = mshare_op(IMPORT, fd, NULL, 0);
(Not sure about export/import terms..)

The benefit of the separate ops is the file descriptor. This could be
used for fstat(2) (and fchown(2)?), although not sure how much value
this would add.

The 'importer' would use the address/size of the memory region as
exported (and stored in msharefs), so no need for a /sys file (except
for human readable info).

If the set-up operations are split in two, then would it make sense to
also split the teardown as well? Say, mshare_op(DROP, fd) and
mshare_unlink(fd)?

> Does that rationale sound reasonable?

It doesn't sound unreasonable. As msharefs is providing a namespace and
perms, it doesn't need much flexibility. Being able to modify the perms
post namespace creation (fchown(2)), before exporting the memory
region, might be useful in some cases - but as I don't have any
usecases I'm not claiming it is essential.

> Thanks,
> Khalid

Cheers,
Mark

>> Rather than introduce two new system calls, along with /sys/ file to
>> communicate global addresses, could mshare() be built on top of shmem
>> API? Thinking of something like the below;
>> 1) For shm_open(3), add a new oflag to indicate the properties needed
>> for mshare() (say, O_SHARED_PTE - better name?)
>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
>> in the sizes which can be set.
>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
>> objects. On first mmap()ing an appropriate address is assigned,
>> otherwise the current 'global' address is used.
>> 4) shm_unlink(3) destroys the object when last reference is dropped.
>>
>> For 3), might be able to weaken the NULL requirement and validate a
>> given address on first mapping to ensure it is correctly aligned.
>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
>> the default behaviour you require.
>>
>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
>> sufficiently different to shmem that there is not much code sharing
>> between the two (I haven't thought this through, but the object
>> naming/refcounting should be similar), but using shmem would be a
>> familiar API.
>>
>> Any thoughts?
>>
>> Cheers,
>> Mark
On 1/20/22 05:49, Mark Hemment wrote: > On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >> On 1/19/22 04:38, Mark Hemment wrote: >>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote: >>>> >>>> Page tables in kernel consume some of the memory and as long as >>>> number of mappings being maintained is small enough, this space >>>> consumed by page tables is not objectionable. When very few memory >>>> pages are shared between processes, the number of page table entries >>>> (PTEs) to maintain is mostly constrained by the number of pages of >>>> memory on the system. As the number of shared pages and the number >>>> of times pages are shared goes up, amount of memory consumed by page >>>> tables starts to become significant. >>>> >>>> Some of the field deployments commonly see memory pages shared >>>> across 1000s of processes. On x86_64, each page requires a PTE that >>>> is only 8 bytes long which is very small compared to the 4K page >>>> size. When 2000 processes map the same page in their address space, >>>> each one of them requires 8 bytes for its PTE and together that adds >>>> up to 8K of memory just to hold the PTEs for one 4K page. On a >>>> database server with 300GB SGA, a system carsh was seen with >>>> out-of-memory condition when 1500+ clients tried to share this SGA >>>> even though the system had 512GB of memory. On this server, in the >>>> worst case scenario of all 1500 processes mapping every page from >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs >>>> could be shared, amount of memory saved is very significant. >>>> >>>> This is a proposal to implement a mechanism in kernel to allow >>>> userspace processes to opt into sharing PTEs. The proposal is to add >>>> a new system call - mshare(), which can be used by a process to >>>> create a region (we will call it mshare'd region) which can be used >>>> by other processes to map same pages using shared PTEs. 
Other >>>> process(es), assuming they have the right permissions, can then make >>>> the mshare() system call to map the shared pages into their address >>>> space using the shared PTEs. When a process is done using this >>>> mshare'd region, it makes a mshare_unlink() system call to end its >>>> access. When the last process accessing mshare'd region calls >>>> mshare_unlink(), the mshare'd region is torn down and memory used by >>>> it is freed. >>>> >>>> >>>> API Proposal >>>> ============ >>>> >>>> The mshare API consists of two system calls - mshare() and mshare_unlink() >>>> >>>> -- >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) >>>> >>>> mshare() creates and opens a new, or opens an existing mshare'd >>>> region that will be shared at PTE level. "name" refers to shared object >>>> name that exists under /sys/fs/mshare. "addr" is the starting address >>>> of this shared memory area and length is the size of this area. >>>> oflags can be one of: >>>> >>>> - O_RDONLY opens shared memory area for read only access by everyone >>>> - O_RDWR opens shared memory area for read and write access >>>> - O_CREAT creates the named shared memory area if it does not exist >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area >>>> exists with that name, return an error. >>>> >>>> mode represents the creation mode for the shared object under >>>> /sys/fs/mshare. >>>> >>>> mshare() returns an error code if it fails, otherwise it returns 0. >>>> >>>> PTEs are shared at pgdir level and hence it imposes following >>>> requirements on the address and size given to the mshare(): >>>> >>>> - Starting address must be aligned to pgdir size (512GB on x86_64) >>>> - Size must be a multiple of pgdir size >>>> - Any mappings created in this address range at any time become >>>> shared automatically >>>> - Shared address range can have unmapped addresses in it. 
Any access >>>> to unmapped address will result in SIGBUS >>>> >>>> Mappings within this address range behave as if they were shared >>>> between threads, so a write to a MAP_PRIVATE mapping will create a >>>> page which is shared between all the sharers. The first process that >>>> declares an address range mshare'd can continue to map objects in >>>> the shared area. All other processes that want mshare'd access to >>>> this memory area can do so by calling mshare(). After this call, the >>>> address range given by mshare becomes a shared range in its address >>>> space. Anonymous mappings will be shared and not COWed. >>>> >>>> A file under /sys/fs/mshare can be opened and read from. A read from >>>> this file returns two long values - (1) starting address, and (2) >>>> size of the mshare'd region. >>>> >>>> -- >>>> int mshare_unlink(char *name) >>>> >>>> A shared address range created by mshare() can be destroyed using >>>> mshare_unlink() which removes the shared named object. Once all >>>> processes have unmapped the shared object, the shared address range >>>> references are de-allocated and destroyed. >>>> >>>> mshare_unlink() returns 0 on success or -1 on error. 
>>>> >>>> >>>> Example Code >>>> ============ >>>> >>>> Snippet of the code that a donor process would run looks like below: >>>> >>>> ----------------- >>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >>>> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >>>> if (addr == MAP_FAILED) >>>> perror("ERROR: mmap failed"); >>>> >>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >>>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); >>>> if (err < 0) { >>>> perror("mshare() syscall failed"); >>>> exit(1); >>>> } >>>> >>>> strncpy(addr, "Some random shared text", >>>> sizeof("Some random shared text")); >>>> ----------------- >>>> >>>> Snippet of code that a consumer process would execute looks like: >>>> >>>> ----------------- >>>> fd = open("testregion", O_RDONLY); >>>> if (fd < 0) { >>>> perror("open failed"); >>>> exit(1); >>>> } >>>> >>>> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) >>>> printf("INFO: %ld bytes shared at addr %lx \n", >>>> mshare_info[1], mshare_info[0]); >>>> else >>>> perror("read failed"); >>>> >>>> close(fd); >>>> >>>> addr = (char *)mshare_info[0]; >>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], >>>> mshare_info[1], O_RDWR, 600); >>>> if (err < 0) { >>>> perror("mshare() syscall failed"); >>>> exit(1); >>>> } >>>> >>>> printf("Guest mmap at %px:\n", addr); >>>> printf("%s\n", addr); >>>> printf("\nDone\n"); >>>> >>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); >>>> if (err < 0) { >>>> perror("mshare_unlink() failed"); >>>> exit(1); >>>> } >>>> ----------------- >>> ... >>> Hi Khalid, >>> >>> The proposed mshare() appears to be similar to POSIX shared memory, >>> but with two extra (related) attributes; >>> a) Internally, uses shared page tables. >>> b) Shared memory is mapped at same address for all users. >> >> Hi Mark, >> >> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region >> access does not go through a filesystem. 
msharefs exists to query mshare'd regions and enforce access restrictions. >> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object. >> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different >> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to >> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a >> separate implementation. > > Ah, I jumped in assuming this was only for anon memory. > >> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share >> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API. > > So possibly you have already considered the below...which does make > the API a little more POSIX shm like. > > The mshare() syscall does two operations; > 1) create/open mshare object > 2) export/import the given memory region > > Would it be better if these were separate operations? That is, > mshare_open() (say) creates/opens the object returning a file > descriptor. The fd is used as the identifier for the export/import after > mmap(2); eg. > addr = mshare_op(EXPORT, fd, addr, size); > addr = mshare_op(IMPORT, fd, NULL, 0); > (Not sure about export/import terms..) > > The benefit of the separate ops is the file descriptor. This > could be used for fstat(2) (and fchown(2)?), although not sure how > much value this would add. Hi Mark, That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region. > > The 'importer' would use the address/size of the memory region as > exported (and stored in msharefs), so no need for /sys file (except > for human readable info). 
I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size, I assume you are not proposing eliminating msharefs. > > If the set-up operations are split in two, then would it make sense to > also split the teardown as well? Say, mshare_op(DROP, fd) and > mshare_unlink(fd)? A single op is simpler. Every process can call mshare_unlink() and if the last reference is dropped, kernel should take care of cleaning up the mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that created the mshare region dies. In a database context, such mshare'd regions can live for a very long time. As a result I would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the mshare'd region disappearing on its own when all references to it are dropped. Thanks, Khalid > >> >> Does that rationale sound reasonable? > > It doesn't sound unreasonable. As msharefs is providing a namespace > and perms, it doesn't need much flexibility. Being able to modify > the perms post namespace creation (fchown(2)), before exporting the > memory region, might be useful in some cases - but as I don't have any > usecases I'm not claiming it is essential. > >> >> Thanks, >> Khalid > > Cheers, > Mark >> >>> >>> Rather than introduce two new system calls, along with /sys/ file to >>> communicate global addresses, could mshare() be built on top of shmem >>> API? Thinking of something like the below; >>> 1) For shm_open(3), add a new oflag to indicate the properties needed >>> for mshare() (say, O_SHARED_PTE - better name?) >>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained >>> in the sizes which can be set. >>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE >>> objects. On first mmap()ing an appropriate address is assigned, >>> otherwise the current 'global' address is used. 
>>> 4) shm_unlink(3) destroys the object when last reference is dropped. >>> >>> For 3), might be able to weaken the NULL requirement and validate a >>> given address on first mapping to ensure it is correctly aligned. >>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be >>> the default behaviour you require. >>> >>> Internally, the handling of mshare()/O_SHARED_PTE memory might be >>> sufficiently different to shmem that there is not much code sharing >>> between the two (I haven't thought this through, but the object >>> naming/refcounting should be similar), but using shmem would be a >>> familiar API. >>> >>> Any thoughts? >>> >>> Cheers, >>> Mark >>> >>
> A file under /sys/fs/mshare can be opened and read from. A read from > this file returns two long values - (1) starting address, and (2) > size of the mshare'd region. > > -- > int mshare_unlink(char *name) > > A shared address range created by mshare() can be destroyed using > mshare_unlink() which removes the shared named object. Once all > processes have unmapped the shared object, the shared address range > references are de-allocated and destroyed. > mshare_unlink() returns 0 on success or -1 on error. I am still struggling with the user scenarios of these new APIs. This patch supposes multiple processes will have same virtual address for the shared area? How can this be guaranteed while different processes can map different stack, heap, libraries, files? BTW, it seems you have different intention with the below? Shared page tables during fork[1] [1] https://lwn.net/Articles/861547/ Thanks Barry
On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote: > > A file under /sys/fs/mshare can be opened and read from. A read from > > this file returns two long values - (1) starting address, and (2) > > size of the mshare'd region. > > > > -- > > int mshare_unlink(char *name) > > > > A shared address range created by mshare() can be destroyed using > > mshare_unlink() which removes the shared named object. Once all > > processes have unmapped the shared object, the shared address range > > references are de-allocated and destroyed. > > > mshare_unlink() returns 0 on success or -1 on error. > > I am still struggling with the user scenarios of these new APIs. This patch > supposes multiple processes will have same virtual address for the shared > area? How can this be guaranteed while different processes can map different > stack, heap, libraries, files? The two processes choose to share a chunk of their address space. They can map anything they like in that shared area, and then also anything they like in the areas that aren't shared. They can choose for that shared area to have the same address in both processes or different locations in each process. If two processes want to put a shared library in that shared address space, that should work. They probably would need to agree to use the same virtual address for the shared page tables for that to work. Processes should probably not put their stacks in the shared region. I mean, it could work, I suppose ... threads manage it in a single address space. But I don't see why you'd want to do that. For heaps, if you want the other process to be able to access the memory, I suppose you could put it in the shared region, but heaps aren't going to be put in the shared region by default. Think of this like hugetlbfs, only instead of sharing hugetlbfs memory, you can share _anything_ that's mmapable. > BTW, it seems you have different intention with the below? 
> Shared page tables during fork[1] > [1] https://lwn.net/Articles/861547/ Yes, that's completely different.
On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote: > > > A file under /sys/fs/mshare can be opened and read from. A read from > > > this file returns two long values - (1) starting address, and (2) > > > size of the mshare'd region. > > > > > > -- > > > int mshare_unlink(char *name) > > > > > > A shared address range created by mshare() can be destroyed using > > > mshare_unlink() which removes the shared named object. Once all > > > processes have unmapped the shared object, the shared address range > > > references are de-allocated and destroyed. > > > > > mshare_unlink() returns 0 on success or -1 on error. > > > > I am still struggling with the user scenarios of these new APIs. This patch > > supposes multiple processes will have same virtual address for the shared > > area? How can this be guaranteed while different processes can map different > > stack, heap, libraries, files? > > The two processes choose to share a chunk of their address space. > They can map anything they like in that shared area, and then also > anything they like in the areas that aren't shared. They can choose > for that shared area to have the same address in both processes > or different locations in each process. > > If two processes want to put a shared library in that shared address > space, that should work. They probably would need to agree to use > the same virtual address for the shared page tables for that to work. we are depending on an elf loader and ld to map the library dynamically , so hardly can we find a chance in users' code to call mshare() to map libraries in application level? so we are supposed to modify some very low level code to use this feature? > > Processes should probably not put their stacks in the shared region. > I mean, it could work, I suppose ... threads manage it in a single > address space. But I don't see why you'd want to do that. 
For > heaps, if you want the other process to be able to access the memory, > I suppose you could put it in the shared region, but heaps aren't > going to be put in the shared region by default. > > Think of this like hugetlbfs, only instead of sharing hugetlbfs > memory, you can share _anything_ that's mmapable. yep, we can call mshare() on any kind of memory. for example, if multiple processes use SYSV shmem, posix shmem or mmap the same file. but it seems it is more sensible to let kernel do it automatically rather than depending on calling mshare() from users? It is difficult for users to decide which areas should be applied mshare(). users might want to call mshare() for all shared areas to save memory coming from duplicated PTEs? unlike SYSV shmem and POSIX shmem which are a feature for inter-processes communications, mshare() looks not like a feature for applications, but like a feature for the whole system level? why would applications have to call something which doesn't directly help them? without mshare(), those applications will still work without any problem, right? is there anything in mshare() which is a must-have for applications? or mshare() is only a suggestion from applications like madvise()? > > > BTW, it seems you have different intention with the below? > > Shared page tables during fork[1] > > [1] https://lwn.net/Articles/861547/ > > Yes, that's completely different. Thanks for clarification. Best Regards. Barry
On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote: > On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote: > > > > A file under /sys/fs/mshare can be opened and read from. A read from > > > > this file returns two long values - (1) starting address, and (2) > > > > size of the mshare'd region. > > > > > > > > -- > > > > int mshare_unlink(char *name) > > > > > > > > A shared address range created by mshare() can be destroyed using > > > > mshare_unlink() which removes the shared named object. Once all > > > > processes have unmapped the shared object, the shared address range > > > > references are de-allocated and destroyed. > > > > > > > mshare_unlink() returns 0 on success or -1 on error. > > > > > > I am still struggling with the user scenarios of these new APIs. This patch > > > supposes multiple processes will have same virtual address for the shared > > > area? How can this be guaranteed while different processes can map different > > > stack, heap, libraries, files? > > > > The two processes choose to share a chunk of their address space. > > They can map anything they like in that shared area, and then also > > anything they like in the areas that aren't shared. They can choose > > for that shared area to have the same address in both processes > > or different locations in each process. > > > > If two processes want to put a shared library in that shared address > > space, that should work. They probably would need to agree to use > > the same virtual address for the shared page tables for that to work. > > we are depending on an elf loader and ld to map the library > dynamically , so hardly > can we find a chance in users' code to call mshare() to map libraries > in application > level? If somebody wants to modify ld.so to take advantage of mshare(), they could. 
That wasn't our primary motivation here, so if it turns out to not work for that usecase, well, that's a shame. > > Think of this like hugetlbfs, only instead of sharing hugetlbfs > > memory, you can share _anything_ that's mmapable. > > yep, we can call mshare() on any kind of memory. for example, if multiple > processes use SYSV shmem, posix shmem or mmap the same file. but > it seems it is more sensible to let kernel do it automatically rather than > depending on calling mshare() from users? It is difficult for users to > decide which areas should be applied mshare(). users might want to call > mshare() for all shared areas to save memory coming from duplicated PTEs? > unlike SYSV shmem and POSIX shmem which are a feature for inter-processes > communications, mshare() looks not like a feature for applications, > but like a feature > for the whole system level? why would applications have to call something which > doesn't directly help them? without mshare(), those applications > will still work without any problem, right? is there anything in > mshare() which is > a must-have for applications? or mshare() is only a suggestion from applications > like madvise()? Our use case is that we have some very large files stored on persistent memory which we want to mmap in thousands of processes. So the first one shares a chunk of its address space and mmaps all the files into that chunk of address space. Subsequent processes find that a suitable address space already exists and use it, sharing the page tables and avoiding the calls to mmap. Sharing page tables is akin to running multiple threads in a single address space; except that only part of the address space is the same. There does need to be a certain amount of trust between the processes sharing the address space. You don't want to do it to an unsuspecting process.
On 1/21/22 07:47, Matthew Wilcox wrote: > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote: >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote: >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote: >>>>> A file under /sys/fs/mshare can be opened and read from. A read from >>>>> this file returns two long values - (1) starting address, and (2) >>>>> size of the mshare'd region. >>>>> >>>>> -- >>>>> int mshare_unlink(char *name) >>>>> >>>>> A shared address range created by mshare() can be destroyed using >>>>> mshare_unlink() which removes the shared named object. Once all >>>>> processes have unmapped the shared object, the shared address range >>>>> references are de-allocated and destroyed. >>>> >>>>> mshare_unlink() returns 0 on success or -1 on error. >>>> >>>> I am still struggling with the user scenarios of these new APIs. This patch >>>> supposes multiple processes will have same virtual address for the shared >>>> area? How can this be guaranteed while different processes can map different >>>> stack, heap, libraries, files? >>> >>> The two processes choose to share a chunk of their address space. >>> They can map anything they like in that shared area, and then also >>> anything they like in the areas that aren't shared. They can choose >>> for that shared area to have the same address in both processes >>> or different locations in each process. >>> >>> If two processes want to put a shared library in that shared address >>> space, that should work. They probably would need to agree to use >>> the same virtual address for the shared page tables for that to work. >> >> we are depending on an elf loader and ld to map the library >> dynamically , so hardly >> can we find a chance in users' code to call mshare() to map libraries >> in application >> level? > > If somebody wants to modify ld.so to take advantage of mshare(), they > could. 
That wasn't our primary motivation here, so if it turns out to > not work for that usecase, well, that's a shame. > >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs >>> memory, you can share _anything_ that's mmapable. >> >> yep, we can call mshare() on any kind of memory. for example, if multiple >> processes use SYSV shmem, posix shmem or mmap the same file. but >> it seems it is more sensible to let kernel do it automatically rather than >> depending on calling mshare() from users? It is difficult for users to >> decide which areas should be applied mshare(). users might want to call >> mshare() for all shared areas to save memory coming from duplicated PTEs? >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes >> communications, mshare() looks not like a feature for applications, >> but like a feature >> for the whole system level? why would applications have to call something which >> doesn't directly help them? without mshare(), those applications >> will still work without any problem, right? is there anything in >> mshare() which is >> a must-have for applications? or mshare() is only a suggestion from applications >> like madvise()? > > Our use case is that we have some very large files stored on persistent > memory which we want to mmap in thousands of processes. So the first > one shares a chunk of its address space and mmaps all the files into > that chunk of address space. Subsequent processes find that a suitable > address space already exists and use it, sharing the page tables and > avoiding the calls to mmap. > > Sharing page tables is akin to running multiple threads in a single > address space; except that only part of the address space is the same. > There does need to be a certain amount of trust between the processes > sharing the address space. You don't want to do it to an unsuspecting > process. 
> Hello Barry, mshare() is really meant for sharing data across unrelated processes by sharing address space explicitly and hence opt-in is required. As Matthew said, the processes sharing this virtual address space need to have a level of trust. Permissions on the msharefs files control who can access this shared address space. It is possible to adapt this mechanism to share stack, libraries etc but that is not the intent. This feature will be used by applications that share data with multiple processes using shared mapping normally and it helps them avoid the overhead of a large number of duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces the amount of memory available for applications and can result in an out-of-memory condition. An example from the patch 0/6: "On a database server with 300GB SGA, a system crash was seen with out-of-memory condition when 1500+ clients tried to share this SGA even though the system had 512GB of memory. On this server, in the worst case scenario, all 1500 processes mapping every page from SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, amount of memory saved is very significant." -- Khalid
> -----Original Message----- > From: Khalid Aziz [mailto:khalid.aziz@oracle.com] > Sent: Saturday, January 22, 2022 12:42 AM > To: Matthew Wilcox <willy@infradead.org>; Barry Song <21cnbao@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org>; Arnd Bergmann <arnd@arndb.de>; > Dave Hansen <dave.hansen@linux.intel.com>; David Hildenbrand > <david@redhat.com>; LKML <linux-kernel@vger.kernel.org>; Linux-MM > <linux-mm@kvack.org>; Longpeng (Mike, Cloud Infrastructure Service Product > Dept.) <longpeng2@huawei.com>; Mike Rapoport <rppt@kernel.org>; Suren > Baghdasaryan <surenb@google.com> > Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes > > On 1/21/22 07:47, Matthew Wilcox wrote: > > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote: > >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote: > >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote: > >>>>> A file under /sys/fs/mshare can be opened and read from. A read from > >>>>> this file returns two long values - (1) starting address, and (2) > >>>>> size of the mshare'd region. > >>>>> > >>>>> -- > >>>>> int mshare_unlink(char *name) > >>>>> > >>>>> A shared address range created by mshare() can be destroyed using > >>>>> mshare_unlink() which removes the shared named object. Once all > >>>>> processes have unmapped the shared object, the shared address range > >>>>> references are de-allocated and destroyed. > >>>> > >>>>> mshare_unlink() returns 0 on success or -1 on error. > >>>> > >>>> I am still struggling with the user scenarios of these new APIs. This patch > >>>> supposes multiple processes will have same virtual address for the shared > >>>> area? How can this be guaranteed while different processes can map different > >>>> stack, heap, libraries, files? > >>> > >>> The two processes choose to share a chunk of their address space. 
> >>> They can map anything they like in that shared area, and then also > >>> anything they like in the areas that aren't shared. They can choose > >>> for that shared area to have the same address in both processes > >>> or different locations in each process. > >>> > >>> If two processes want to put a shared library in that shared address > >>> space, that should work. They probably would need to agree to use > >>> the same virtual address for the shared page tables for that to work. > >> > >> we are depending on an elf loader and ld to map the library > >> dynamically , so hardly > >> can we find a chance in users' code to call mshare() to map libraries > >> in application > >> level? > > > > If somebody wants to modify ld.so to take advantage of mshare(), they > > could. That wasn't our primary motivation here, so if it turns out to > > not work for that usecase, well, that's a shame. > > > >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs > >>> memory, you can share _anything_ that's mmapable. > >> > >> yep, we can call mshare() on any kind of memory. for example, if multiple > >> processes use SYSV shmem, posix shmem or mmap the same file. but > >> it seems it is more sensible to let kernel do it automatically rather than > >> depending on calling mshare() from users? It is difficult for users to > >> decide which areas should be applied mshare(). users might want to call > >> mshare() for all shared areas to save memory coming from duplicated PTEs? > >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes > >> communications, mshare() looks not like a feature for applications, > >> but like a feature > >> for the whole system level? why would applications have to call something > which > >> doesn't directly help them? without mshare(), those applications > >> will still work without any problem, right? is there anything in > >> mshare() which is > >> a must-have for applications? 
or mshare() is only a suggestion from > applications > >> like madvise()? > > > > Our use case is that we have some very large files stored on persistent > > memory which we want to mmap in thousands of processes. So the first > > one shares a chunk of its address space and mmaps all the files into > > that chunk of address space. Subsequent processes find that a suitable > > address space already exists and use it, sharing the page tables and > > avoiding the calls to mmap. > > > > Sharing page tables is akin to running multiple threads in a single > > address space; except that only part of the address space is the same. > > There does need to be a certain amount of trust between the processes > > sharing the address space. You don't want to do it to an unsuspecting > > process. > > > > Hello Barry, > > mshare() is really meant for sharing data across unrelated processes by sharing > address space explicitly and hence > opt-in is required. As Matthew said, the processes sharing this virtual address > space need to have a level of trust. > Permissions on the msharefs files control who can access this shared address > space. It is possible to adapt this > mechanism to share stack, libraries etc but that is not the intent. This feature > will be used by applications that share > data with multiple processes using shared mapping normally and it helps them > avoid the overhead of large number of > duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces > amount of memory available for > applications and can result in out-of-memory condition. An example from the patch > 0/6: > > "On a database server with 300GB SGA, a system crash was seen with > out-of-memory condition when 1500+ clients tried to share this SGA > even though the system had 512GB of memory. On this server, in the > worst case scenario of all 1500 processes mapping every page from > SGA would have required 878GB+ for just the PTEs. 
If these PTEs > could be shared, amount of memory saved is very significant." > The memory overhead of PTEs would be significantly saved if we use hugetlbfs in this case, but why not? > -- > Khalid
On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > Our use case is that we have some very large files stored on persistent > > > memory which we want to mmap in thousands of processes. So the first > > The memory overhead of PTEs would be significantly saved if we use > hugetlbfs in this case, but why not? Because we want the files to be persistent across reboots.
On 1/22/22 2:41 AM, Matthew Wilcox wrote: > On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >>>> Our use case is that we have some very large files stored on persistent >>>> memory which we want to mmap in thousands of processes. So the first >> The memory overhead of PTEs would be significantly saved if we use >> hugetlbfs in this case, but why not? > Because we want the files to be persistent across reboots. 100% agree. There is another use case: geo-redundancy. My view is publicly documented at https://github.com/schoebel/mars/tree/master/docu and click at architecture-guide-geo-redundancy.pdf In some scenarios, migration or (temporary) co-existence of block devices from/between hardware architecture A to/between hardware architecture B might become a future requirement for me. The currrent implementation does not yet use hugetlbfs and/or its proposed / low-overhead / more fine-grained and/or less hardware-architecture specific (future) alternatives. For me, all of these are future options. In particular, when (1) abstractable for reduction of architectural dependencies, and hopefully (2) usable from both kernelspace and userspace. It would be great if msharefs is not only low-footprint, but also would be usable from kernelspace. Reduction (or getting rid) of preallocation strategies would be also a valuable feature for me. Of course, I cannot decide what I will prefer in future for any future requirements. But some kind of mutual awareness and future collaboration would be great.
(added linux-api) On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: > Page tables in kernel consume some of the memory and as long as > number of mappings being maintained is small enough, this space > consumed by page tables is not objectionable. When very few memory > pages are shared between processes, the number of page table entries > (PTEs) to maintain is mostly constrained by the number of pages of > memory on the system. As the number of shared pages and the number > of times pages are shared goes up, amount of memory consumed by page > tables starts to become significant. > > Some of the field deployments commonly see memory pages shared > across 1000s of processes. On x86_64, each page requires a PTE that > is only 8 bytes long which is very small compared to the 4K page > size. When 2000 processes map the same page in their address space, > each one of them requires 8 bytes for its PTE and together that adds > up to 16K of memory just to hold the PTEs for one 4K page. On a > database server with 300GB SGA, a system crash was seen with > out-of-memory condition when 1500+ clients tried to share this SGA > even though the system had 512GB of memory. On this server, in the > worst case scenario of all 1500 processes mapping every page from > SGA would have required 878GB+ for just the PTEs. If these PTEs > could be shared, amount of memory saved is very significant. > > This is a proposal to implement a mechanism in kernel to allow > userspace processes to opt into sharing PTEs. The proposal is to add > a new system call - mshare(), which can be used by a process to > create a region (we will call it mshare'd region) which can be used > by other processes to map same pages using shared PTEs. Other > process(es), assuming they have the right permissions, can then make > the mshare() system call to map the shared pages into their address > space using the shared PTEs. 
When a process is done using this > mshare'd region, it makes a mshare_unlink() system call to end its > access. When the last process accessing mshare'd region calls > mshare_unlink(), the mshare'd region is torn down and memory used by > it is freed. > > > API Proposal > ============ > > The mshare API consists of two system calls - mshare() and mshare_unlink() > > -- > int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) > > mshare() creates and opens a new, or opens an existing mshare'd > region that will be shared at PTE level. "name" refers to shared object > name that exists under /sys/fs/mshare. "addr" is the starting address > of this shared memory area and length is the size of this area. > oflags can be one of: > > - O_RDONLY opens shared memory area for read only access by everyone > - O_RDWR opens shared memory area for read and write access > - O_CREAT creates the named shared memory area if it does not exist > - O_EXCL If O_CREAT was also specified, and a shared memory area > exists with that name, return an error. > > mode represents the creation mode for the shared object under > /sys/fs/mshare. > > mshare() returns an error code if it fails, otherwise it returns 0. Did you consider returning a file descriptor from mshare() system call? Then there would be no need in mshare_unlink() as close(fd) would work. > PTEs are shared at pgdir level and hence it imposes following > requirements on the address and size given to the mshare(): > > - Starting address must be aligned to pgdir size (512GB on x86_64) > - Size must be a multiple of pgdir size > - Any mappings created in this address range at any time become > shared automatically > - Shared address range can have unmapped addresses in it. Any access > to unmapped address will result in SIGBUS > > Mappings within this address range behave as if they were shared > between threads, so a write to a MAP_PRIVATE mapping will create a > page which is shared between all the sharers. 
The first process that > declares an address range mshare'd can continue to map objects in > the shared area. All other processes that want mshare'd access to > this memory area can do so by calling mshare(). After this call, the > address range given by mshare becomes a shared range in its address > space. Anonymous mappings will be shared and not COWed. > > A file under /sys/fs/mshare can be opened and read from. A read from > this file returns two long values - (1) starting address, and (2) > size of the mshare'd region. Maybe read should return a structure containing some data identifier and the data itself, so that it could be extended in the future. > -- > int mshare_unlink(char *name) > > A shared address range created by mshare() can be destroyed using > mshare_unlink() which removes the shared named object. Once all > processes have unmapped the shared object, the shared address range > references are de-allocated and destroyed. > > mshare_unlink() returns 0 on success or -1 on error.
On Sat, Jan 22, 2022 at 11:18:14AM +0100, Thomas Schoebel-Theuer wrote: > On 1/22/22 2:41 AM, Matthew Wilcox wrote: > > On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > > > Our use case is that we have some very large files stored on persistent > > > > > memory which we want to mmap in thousands of processes. So the first > > > The memory overhead of PTEs would be significantly saved if we use > > > hugetlbfs in this case, but why not? > > Because we want the files to be persistent across reboots. > > 100% agree. There is another use case: geo-redundancy. > > My view is publicly documented at > https://github.com/schoebel/mars/tree/master/docu and click at > architecture-guide-geo-redundancy.pdf That's a 160+ page PDF. No offence, Thomas, I'm not reading that to try to understand how you want to use page table sharing. > In some scenarios, migration or (temporary) co-existence of block devices > from/between hardware architecture A to/between hardware architecture B > might become a future requirement for me. I'm not sure how sharing block devices between systems matches up with sharing page tables between processes. > It would be great if msharefs is not only low-footprint, but also would be > usable from kernelspace. I don't understand what you want here either. Kernel threads already share their page tables.
> On Jan 22, 2022, at 3:31 AM, Mike Rapoport <rppt@kernel.org> wrote: > > (added linux-api) > >> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: >> Page tables in kernel consume some of the memory and as long as >> number of mappings being maintained is small enough, this space >> consumed by page tables is not objectionable. When very few memory >> pages are shared between processes, the number of page table entries >> (PTEs) to maintain is mostly constrained by the number of pages of >> memory on the system. As the number of shared pages and the number >> of times pages are shared goes up, amount of memory consumed by page >> tables starts to become significant. Sharing PTEs is nice, but merely sharing a chunk of address space regardless of optimizations is nontrivial. It’s also quite useful, potentially. So I think a good way to start would be to make a nice design for just sharing address space and then, on top of it, figure out how to share page tables. See here for an earlier proposal: https://lore.kernel.org/all/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com/ Alternatively, one could try to optimize memfd so that large similarly aligned mappings in different processes could share page tables. Any of the above will require some interesting thought as to whether TLB shootdowns are managed by the core rmap code or by mmu notifiers.
On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 1/20/22 05:49, Mark Hemment wrote: > > On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote: > >> > >> On 1/19/22 04:38, Mark Hemment wrote: > >>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote: > >>>> > >>>> Page tables in kernel consume some of the memory and as long as > >>>> number of mappings being maintained is small enough, this space > >>>> consumed by page tables is not objectionable. When very few memory > >>>> pages are shared between processes, the number of page table entries > >>>> (PTEs) to maintain is mostly constrained by the number of pages of > >>>> memory on the system. As the number of shared pages and the number > >>>> of times pages are shared goes up, amount of memory consumed by page > >>>> tables starts to become significant. > >>>> > >>>> Some of the field deployments commonly see memory pages shared > >>>> across 1000s of processes. On x86_64, each page requires a PTE that > >>>> is only 8 bytes long which is very small compared to the 4K page > >>>> size. When 2000 processes map the same page in their address space, > >>>> each one of them requires 8 bytes for its PTE and together that adds > >>>> up to 16K of memory just to hold the PTEs for one 4K page. On a > >>>> database server with 300GB SGA, a system crash was seen with > >>>> out-of-memory condition when 1500+ clients tried to share this SGA > >>>> even though the system had 512GB of memory. On this server, in the > >>>> worst case scenario of all 1500 processes mapping every page from > >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs > >>>> could be shared, amount of memory saved is very significant. > >>>> > >>>> This is a proposal to implement a mechanism in kernel to allow > >>>> userspace processes to opt into sharing PTEs. 
The proposal is to add > >>>> a new system call - mshare(), which can be used by a process to > >>>> create a region (we will call it mshare'd region) which can be used > >>>> by other processes to map same pages using shared PTEs. Other > >>>> process(es), assuming they have the right permissions, can then make > >>>> the mshare() system call to map the shared pages into their address > >>>> space using the shared PTEs. When a process is done using this > >>>> mshare'd region, it makes a mshare_unlink() system call to end its > >>>> access. When the last process accessing mshare'd region calls > >>>> mshare_unlink(), the mshare'd region is torn down and memory used by > >>>> it is freed. > >>>> > >>>> > >>>> API Proposal > >>>> ============ > >>>> > >>>> The mshare API consists of two system calls - mshare() and mshare_unlink() > >>>> > >>>> -- > >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) > >>>> > >>>> mshare() creates and opens a new, or opens an existing mshare'd > >>>> region that will be shared at PTE level. "name" refers to shared object > >>>> name that exists under /sys/fs/mshare. "addr" is the starting address > >>>> of this shared memory area and length is the size of this area. > >>>> oflags can be one of: > >>>> > >>>> - O_RDONLY opens shared memory area for read only access by everyone > >>>> - O_RDWR opens shared memory area for read and write access > >>>> - O_CREAT creates the named shared memory area if it does not exist > >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area > >>>> exists with that name, return an error. > >>>> > >>>> mode represents the creation mode for the shared object under > >>>> /sys/fs/mshare. > >>>> > >>>> mshare() returns an error code if it fails, otherwise it returns 0. 
> >>>> > >>>> PTEs are shared at pgdir level and hence it imposes following > >>>> requirements on the address and size given to the mshare(): > >>>> > >>>> - Starting address must be aligned to pgdir size (512GB on x86_64) > >>>> - Size must be a multiple of pgdir size > >>>> - Any mappings created in this address range at any time become > >>>> shared automatically > >>>> - Shared address range can have unmapped addresses in it. Any access > >>>> to unmapped address will result in SIGBUS > >>>> > >>>> Mappings within this address range behave as if they were shared > >>>> between threads, so a write to a MAP_PRIVATE mapping will create a > >>>> page which is shared between all the sharers. The first process that > >>>> declares an address range mshare'd can continue to map objects in > >>>> the shared area. All other processes that want mshare'd access to > >>>> this memory area can do so by calling mshare(). After this call, the > >>>> address range given by mshare becomes a shared range in its address > >>>> space. Anonymous mappings will be shared and not COWed. > >>>> > >>>> A file under /sys/fs/mshare can be opened and read from. A read from > >>>> this file returns two long values - (1) starting address, and (2) > >>>> size of the mshare'd region. > >>>> > >>>> -- > >>>> int mshare_unlink(char *name) > >>>> > >>>> A shared address range created by mshare() can be destroyed using > >>>> mshare_unlink() which removes the shared named object. Once all > >>>> processes have unmapped the shared object, the shared address range > >>>> references are de-allocated and destroyed. > >>>> > >>>> mshare_unlink() returns 0 on success or -1 on error. 
> >>>> > >>>> > >>>> Example Code > >>>> ============ > >>>> > >>>> Snippet of the code that a donor process would run looks like below: > >>>> > >>>> ----------------- > >>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, > >>>> MAP_SHARED | MAP_ANONYMOUS, 0, 0); > >>>> if (addr == MAP_FAILED) > >>>> perror("ERROR: mmap failed"); > >>>> > >>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), > >>>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); > >>>> if (err < 0) { > >>>> perror("mshare() syscall failed"); > >>>> exit(1); > >>>> } > >>>> > >>>> strncpy(addr, "Some random shared text", > >>>> sizeof("Some random shared text")); > >>>> ----------------- > >>>> > >>>> Snippet of code that a consumer process would execute looks like: > >>>> > >>>> ----------------- > >>>> fd = open("testregion", O_RDONLY); > >>>> if (fd < 0) { > >>>> perror("open failed"); > >>>> exit(1); > >>>> } > >>>> > >>>> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) > >>>> printf("INFO: %ld bytes shared at addr %lx \n", > >>>> mshare_info[1], mshare_info[0]); > >>>> else > >>>> perror("read failed"); > >>>> > >>>> close(fd); > >>>> > >>>> addr = (char *)mshare_info[0]; > >>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], > >>>> mshare_info[1], O_RDWR, 600); > >>>> if (err < 0) { > >>>> perror("mshare() syscall failed"); > >>>> exit(1); > >>>> } > >>>> > >>>> printf("Guest mmap at %px:\n", addr); > >>>> printf("%s\n", addr); > >>>> printf("\nDone\n"); > >>>> > >>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); > >>>> if (err < 0) { > >>>> perror("mshare_unlink() failed"); > >>>> exit(1); > >>>> } > >>>> ----------------- > >>> ... > >>> Hi Khalid, > >>> > >>> The proposed mshare() appears to be similar to POSIX shared memory, > >>> but with two extra (related) attributes; > >>> a) Internally, uses shared page tables. > >>> b) Shared memory is mapped at same address for all users. 
> >> > >> Hi Mark, > >> > >> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region > >> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions. > >> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object. > >> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different > >> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to > >> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a > >> separate implementation. > > > > Ah, I jumped in assuming this was only for anon memory. > > > >> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share > >> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API. > > > > So possible you have already considered the below...which does make > > the API a little more POSIX shm like. > > > > The mshare() syscall does two operations; > > 1) create/open mshare object > > 2) export/import the given memory region > > > > Would it be better if these were seperate operations? That is, > > mshare_open() (say) creates/opens the object returning a file > > descriptor. The fd used as the identifier for the export/import after > > mmap(2); eg. > > addr = mshare_op(EXPORT, fd, addr, size); > > addr = mshare_op(IMPORT, fd, NULL, 0); > > (Not sure about export/import terms..) > > > > The benefit of the the separate ops is the file descriptor. This > > could be used for fstat(2) (and fchown(2)?), although not sure how > > much value this would add. > > Hi Mark, > > That is the question here - what would be the value of fd to mshare_op? 
The file in msharefs can be opened like a > regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region. Hi Khalid, In your proposed API, the 'importer' of the mshared region does not open the mshared backing object (when a file is being mapped); instead it does an open on the msharefs file. From the code sample in your initial email (simplified), where a process attaches to the mshared region; fd = open("testregion", O_RDONLY); read(fd, &mshare_info, sizeof (mshare_info)); mshare("testregion", addr, len, RDWR, 0600); Open permission checks are done by the mshare() system call against the msharefs file ("testregion"). From the code sample in your initial email (simplified), where a process creates a msharefs file with the anonymous mmap()ed region to be shared; addr = mmap(RDWR, ANON); mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600); Now, consider the case where the mmap() is named (that is, against a file). I believe this is the use case for Oracle's SGA. My (simplified) code for msharing a named file ("SGA") using your proposed API (does not matter if the mapping is PRIVATE or SHARED); fd = open("SGA", RDWR); addr = mmap(RDWR, ..., fd); mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600); If the permissions (usr/grp+perms+ACL) between the "SGA" file and the "SGA-region" msharefs are different, then it is very likely a serious security issue. That is, a user who could not open(2) the "SGA" file might be able to open the "SGA-region" msharefs file, and so gain at least read permission on the file. This is why I was proposing a file descriptor, so the msharefs file could be set to have the same permissions as the backing file it is exporting (but I got this wrong). 
This would still leave a window between the msharefs file being created and the permissions being set, where a rogue process could attach to a region when they should not have the permission (this could be closed by failing a non-creating mshare() if the region is of zero len - nothing yet shared - until permissions are set and the region shared). But relying on userspace to always set the correct permissions on the msharefs file is dangerous - likely to get it wrong on occasion - and isn't sufficient. The msharefs API needs to be bullet proof. Looking at the patches, I cannot see where extra validation is being done for a named mapping to ensure any 'importer' has the necessary permission against the backing file. The 'struct file' (->vm_file, and associated inode) in the VMA is sufficient to perform required access checks against the file's perms - the posted patches do not check this (but they are for an RFC, so don't expect all cases to be handled). But what about a full path permission check? That is, the 'importer' has necessary permissions on the backing file, but would not be able to find this file due to directory permissions? msharefs would bypass the directory checks. > > > > The 'importer' would use the address/size of the memory region as > > exported (and stored in msharefs), so no need for /sys file (except > > for human readable info). > I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size, > I assume you are not proposing eliminating msharefs. The 'exporter' of the mshared region specifies the address and length, and is therefore known by the mshare code. An 'importer' needs to only pass NULL/0 for addr/len and is told by mshare where the region has been attached in its address-space. With this, the /sys file is no longer part of the API. > > > > If the set-up operations are split in two, then would it make sense to > > also split the teardown as well? 
Say, mshare_op(DROP, fd) and > > mshare_unlink(fd)? > > A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care > of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that > created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I > would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the > mshare'd region disappearing on its own when all references to it are dropped. > > Thanks, > Khalid Cheers, Mark > > > > >> > >> Does that rationale sound reasonable? > > > > It doesn't sound unreasonable. As msharefs is providing a namespace > > and perms, it doesn't need much flexibility. Being able to modifying > > the perms post namespace creation (fchown(2)), before exporting the > > memory region, might be useful in some cases - but as I don't have any > > usecases I'm not claiming it is essential. > > > >> > >> Thanks, > >> Khalid > > > > Cheers, > > Mark > >> > >>> > >>> Rather than introduce two new system calls, along with /sys/ file to > >>> communicate global addresses, could mshare() be built on top of shmem > >>> API? Thinking of something like the below; > >>> 1) For shm_open(3), add a new oflag to indicate the properties needed > >>> for mshare() (say, O_SHARED_PTE - better name?) > >>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained > >>> in the sizes which can be set. > >>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE > >>> objects. On first mmap()ing an appropiate address is assigned, > >>> otherwise the current 'global' address is used. > >>> 4) shm_unlink(3) destroys the object when last reference is dropped. > >>> > >>> For 3), might be able to weaken the NULL requirement and validate a > >>> given address on first mapping to ensure it is correctly aligned. 
> >>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be > >>> the default behaviour you require. > >>> > >>> Internally, the handling of mshare()/O_SHARED_PTE memory might be > >>> sufficiently different to shmem that there is not much code sharing > >>> between the two (I haven't thought this through, but the object > >>> naming/refcounting should be similiar), but using shmem would be a > >>> familiar API. > >>> > >>> Any thoughts? > >>> > >>> Cheers, > >>> Mark > >>> > >> >
On Mon, Jan 24, 2022 at 03:15:36PM +0000, Mark Hemment wrote: > From the code sample in your initial email (simplified), where a > process creates a msharefs file with the anonymous mmap()ed region to > be shared; > addr = mmap(RDWR, ANON); > mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600); > > Now, consider the case where the mmap() is named (that is, against a > file). I believe this is the usecase for Oracle's SGA. > My (simplified) code for msharing a named file ("SGA") using your > proposed API (does not matter if the mapping is PRIVATE or SHARED); > fd = open("SGA", RDWR); > addr = mmap(RDWR, ..., fd); > mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600); Don't think of an mshared region as containing only one file. It might easily contain dozens. Or none at the start. They're dynamic; the mshare fd represents a chunk of address space, not whatever is currently mapped there. > If the permissions (usr/grp+perms+ACL) between the "SGA" file and the > "SGA-region" msharefs are different, then it is very likely a serious > security issue. Only in the same sense that an application might open() a file that it has permission to access and then open a pipe/socket to a process that does not have permission and send the data to it.
On 1/22/22 04:31, Mike Rapoport wrote: > (added linux-api) > > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: >> Page tables in kernel consume some of the memory and as long as >> number of mappings being maintained is small enough, this space >> consumed by page tables is not objectionable. When very few memory >> pages are shared between processes, the number of page table entries >> (PTEs) to maintain is mostly constrained by the number of pages of >> memory on the system. As the number of shared pages and the number >> of times pages are shared goes up, amount of memory consumed by page >> tables starts to become significant. >> >> Some of the field deployments commonly see memory pages shared >> across 1000s of processes. On x86_64, each page requires a PTE that >> is only 8 bytes long which is very small compared to the 4K page >> size. When 2000 processes map the same page in their address space, >> each one of them requires 8 bytes for its PTE and together that adds >> up to 16K of memory just to hold the PTEs for one 4K page. On a >> database server with 300GB SGA, a system crash was seen with >> out-of-memory condition when 1500+ clients tried to share this SGA >> even though the system had 512GB of memory. On this server, in the >> worst case scenario of all 1500 processes mapping every page from >> SGA would have required 878GB+ for just the PTEs. If these PTEs >> could be shared, amount of memory saved is very significant. >> >> This is a proposal to implement a mechanism in kernel to allow >> userspace processes to opt into sharing PTEs. The proposal is to add >> a new system call - mshare(), which can be used by a process to >> create a region (we will call it mshare'd region) which can be used >> by other processes to map same pages using shared PTEs. Other >> process(es), assuming they have the right permissions, can then make >> the mshare() system call to map the shared pages into their address >> space using the shared PTEs. 
When a process is done using this >> mshare'd region, it makes a mshare_unlink() system call to end its >> access. When the last process accessing mshare'd region calls >> mshare_unlink(), the mshare'd region is torn down and memory used by >> it is freed. >> >> >> API Proposal >> ============ >> >> The mshare API consists of two system calls - mshare() and mshare_unlink() >> >> -- >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) >> >> mshare() creates and opens a new, or opens an existing mshare'd >> region that will be shared at PTE level. "name" refers to shared object >> name that exists under /sys/fs/mshare. "addr" is the starting address >> of this shared memory area and length is the size of this area. >> oflags can be one of: >> >> - O_RDONLY opens shared memory area for read only access by everyone >> - O_RDWR opens shared memory area for read and write access >> - O_CREAT creates the named shared memory area if it does not exist >> - O_EXCL If O_CREAT was also specified, and a shared memory area >> exists with that name, return an error. >> >> mode represents the creation mode for the shared object under >> /sys/fs/mshare. >> >> mshare() returns an error code if it fails, otherwise it returns 0. > > Did you consider returning a file descriptor from mshare() system call? > Then there would be no need in mshare_unlink() as close(fd) would work. That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. 
This would be problematic with a close() semantics though unless there was another way to force refcount to 0. Right? > >> PTEs are shared at pgdir level and hence it imposes following >> requirements on the address and size given to the mshare(): >> >> - Starting address must be aligned to pgdir size (512GB on x86_64) >> - Size must be a multiple of pgdir size >> - Any mappings created in this address range at any time become >> shared automatically >> - Shared address range can have unmapped addresses in it. Any access >> to unmapped address will result in SIGBUS >> >> Mappings within this address range behave as if they were shared >> between threads, so a write to a MAP_PRIVATE mapping will create a >> page which is shared between all the sharers. The first process that >> declares an address range mshare'd can continue to map objects in >> the shared area. All other processes that want mshare'd access to >> this memory area can do so by calling mshare(). After this call, the >> address range given by mshare becomes a shared range in its address >> space. Anonymous mappings will be shared and not COWed. >> >> A file under /sys/fs/mshare can be opened and read from. A read from >> this file returns two long values - (1) starting address, and (2) >> size of the mshare'd region. > > Maybe read should return a structure containing some data identifier and > the data itself, so that it could be extended in the future. I like that idea. I will work on it. Thanks! -- Khalid > >> -- >> int mshare_unlink(char *name) >> >> A shared address range created by mshare() can be destroyed using >> mshare_unlink() which removes the shared named object. Once all >> processes have unmapped the shared object, the shared address range >> references are de-allocated and destroyed. >> >> mshare_unlink() returns 0 on success or -1 on error. >
On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 1/22/22 04:31, Mike Rapoport wrote: > > (added linux-api) > > > > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: > >> Page tables in kernel consume some of the memory and as long as > >> number of mappings being maintained is small enough, this space > >> consumed by page tables is not objectionable. When very few memory > >> pages are shared between processes, the number of page table entries > >> (PTEs) to maintain is mostly constrained by the number of pages of > >> memory on the system. As the number of shared pages and the number > >> of times pages are shared goes up, amount of memory consumed by page > >> tables starts to become significant. > >> > >> Some of the field deployments commonly see memory pages shared > >> across 1000s of processes. On x86_64, each page requires a PTE that > >> is only 8 bytes long which is very small compared to the 4K page > >> size. When 2000 processes map the same page in their address space, > >> each one of them requires 8 bytes for its PTE and together that adds > >> up to 16K of memory just to hold the PTEs for one 4K page. On a > >> database server with 300GB SGA, a system crash was seen with > >> out-of-memory condition when 1500+ clients tried to share this SGA > >> even though the system had 512GB of memory. On this server, in the > >> worst case scenario of all 1500 processes mapping every page from > >> SGA would have required 878GB+ for just the PTEs. If these PTEs > >> could be shared, amount of memory saved is very significant. > >> > >> This is a proposal to implement a mechanism in kernel to allow > >> userspace processes to opt into sharing PTEs. The proposal is to add > >> a new system call - mshare(), which can be used by a process to > >> create a region (we will call it mshare'd region) which can be used > >> by other processes to map same pages using shared PTEs. 
Other > >> process(es), assuming they have the right permissions, can then make > >> the mshare() system call to map the shared pages into their address > >> space using the shared PTEs. When a process is done using this > >> mshare'd region, it makes a mshare_unlink() system call to end its > >> access. When the last process accessing mshare'd region calls > >> mshare_unlink(), the mshare'd region is torn down and memory used by > >> it is freed. > >> > >> > >> API Proposal > >> ============ > >> > >> The mshare API consists of two system calls - mshare() and mshare_unlink() > >> > >> -- > >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) > >> > >> mshare() creates and opens a new, or opens an existing mshare'd > >> region that will be shared at PTE level. "name" refers to shared object > >> name that exists under /sys/fs/mshare. "addr" is the starting address > >> of this shared memory area and length is the size of this area. > >> oflags can be one of: > >> > >> - O_RDONLY opens shared memory area for read only access by everyone > >> - O_RDWR opens shared memory area for read and write access > >> - O_CREAT creates the named shared memory area if it does not exist > >> - O_EXCL If O_CREAT was also specified, and a shared memory area > >> exists with that name, return an error. > >> > >> mode represents the creation mode for the shared object under > >> /sys/fs/mshare. > >> > >> mshare() returns an error code if it fails, otherwise it returns 0. > > > > Did you consider returning a file descriptor from mshare() system call? > > Then there would be no need in mshare_unlink() as close(fd) would work. > > That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though > for application writers. A close() call with a side-effect of deleting shared mapping would be odd. 
One of the use cases > for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling > mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to > bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic > with a close() semantics though unless there was another way to force refcount to 0. Right? > I'm not sure I understand the problem. If you're sharing a portion of an mm and the mm goes away, then all that should be left are some struct files that are no longer useful. They'll go away when their refcount goes to zero. --Andy
On 1/24/22 08:15, Mark Hemment wrote: > On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote: >> On 1/20/22 05:49, Mark Hemment wrote: >>> On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote: >>>> On 1/19/22 04:38, Mark Hemment wrote: >>>>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote: [... original mshare() proposal and API description quoted in full; snipped ...] 
>>>>>> >>>>>> >>>>>> Example Code >>>>>> ============ >>>>>> >>>>>> Snippet of the code that a donor process would run looks like below: >>>>>> >>>>>> ----------------- >>>>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >>>>>> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >>>>>> if (addr == MAP_FAILED) >>>>>> perror("ERROR: mmap failed"); >>>>>> >>>>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >>>>>> GB(512), O_CREAT|O_RDWR|O_EXCL, 0600); >>>>>> if (err < 0) { >>>>>> perror("mshare() syscall failed"); >>>>>> exit(1); >>>>>> } >>>>>> >>>>>> strncpy(addr, "Some random shared text", >>>>>> sizeof("Some random shared text")); >>>>>> ----------------- >>>>>> >>>>>> Snippet of code that a consumer process would execute looks like: >>>>>> >>>>>> ----------------- >>>>>> fd = open("testregion", O_RDONLY); >>>>>> if (fd < 0) { >>>>>> perror("open failed"); >>>>>> exit(1); >>>>>> } >>>>>> >>>>>> if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0) >>>>>> printf("INFO: %ld bytes shared at addr %lx \n", >>>>>> mshare_info[1], mshare_info[0]); >>>>>> else >>>>>> perror("read failed"); >>>>>> >>>>>> close(fd); >>>>>> >>>>>> addr = (char *)mshare_info[0]; >>>>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], >>>>>> mshare_info[1], O_RDWR, 0600); >>>>>> if (err < 0) { >>>>>> perror("mshare() syscall failed"); >>>>>> exit(1); >>>>>> } >>>>>> >>>>>> printf("Guest mmap at %px:\n", addr); >>>>>> printf("%s\n", addr); >>>>>> printf("\nDone\n"); >>>>>> >>>>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); >>>>>> if (err < 0) { >>>>>> perror("mshare_unlink() failed"); >>>>>> exit(1); >>>>>> } >>>>>> ----------------- >>>>> ... >>>>> Hi Khalid, >>>>> >>>>> The proposed mshare() appears to be similar to POSIX shared memory, >>>>> but with two extra (related) attributes; >>>>> a) Internally, uses shared page tables. >>>>> b) Shared memory is mapped at same address for all users. 
>>>> >>>> Hi Mark, >>>> >>>> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region >>>> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions. >>>> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object. >>>> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different >>>> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to >>>> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a >>>> separate implementation. >>> >>> Ah, I jumped in assuming this was only for anon memory. >>> >>>> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share >>>> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API. >>> >>> So possibly you have already considered the below...which does make >>> the API a little more POSIX shm like. >>> >>> The mshare() syscall does two operations; >>> 1) create/open mshare object >>> 2) export/import the given memory region >>> >>> Would it be better if these were separate operations? That is, >>> mshare_open() (say) creates/opens the object returning a file >>> descriptor. The fd is used as the identifier for the export/import after >>> mmap(2); eg. >>> addr = mshare_op(EXPORT, fd, addr, size); >>> addr = mshare_op(IMPORT, fd, NULL, 0); >>> (Not sure about export/import terms..) >>> >>> The benefit of the separate ops is the file descriptor. This >>> could be used for fstat(2) (and fchown(2)?), although not sure how >>> much value this would add. >> >> Hi Mark, >> >> That is the question here - what would be the value of fd to mshare_op? 
The file in msharefs can be opened like a >> regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region. > > Hi Khalid, > > In your proposed API, the 'importer' of the mshared region does not > open the mshared backing object (when a file is being mapped); instead it > does an open on the msharefs file. > From the code sample in your initial email (simplified), where a > process attaches to the mshared region; > fd = open("testregion", O_RDONLY); > read(fd, &mshare_info, sizeof (mshare_info)); > mshare("testregion", addr, len, RDWR, 0600); > > Open permission checks are done by the mshare() system call against > the msharefs file ("testregion"). > > From the code sample in your initial email (simplified), where a > process creates a msharefs file with the anonymous mmap()ed region to > be shared; > addr = mmap(RDWR, ANON); > mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600); > > Now, consider the case where the mmap() is named (that is, against a > file). I believe this is the use case for Oracle's SGA. > My (simplified) code for msharing a named file ("SGA") using your > proposed API (does not matter if the mapping is PRIVATE or SHARED); > fd = open("SGA", RDWR); > addr = mmap(RDWR, ..., fd); > mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600); > > If the permissions (usr/grp+perms+ACL) between the "SGA" file and the > "SGA-region" msharefs are different, then it is very likely a serious > security issue. > That is, a user who could not open(2) the "SGA" file might be able to > open the "SGA-region" msharefs file, and so gain at least read > permission on the file. If an app creates an mshare'd region and gives wider access permissions than on the objects it has mapped in or maps in later, I would read that as the app intending to do so. An mshare'd region can cover any mapped object besides files. 
It could be anonymous memory holding data supplied by an application user, or could be data captured over the network. How would one validate intended permissions in those cases? A uniform check would be to ensure access given to the mshare'd region is compliant with the permissions on the region itself as given by the file under msharefs. Those permission checks are not implemented yet in this initial prototype but are definitely planned as continuing work. > > This is why I was proposing a file descriptor, so the msharefs file > could be set to have the same permissions as the backing file it is > exporting (but I got this wrong). > This would still leave a window between the msharefs file being > created and the permissions being set, where a rogue process could > attach to a region when they should not have the permission (this > could be closed by failing a non-creating mshare() if the region is of > zero len - nothing yet shared - until permissions are set and the > region shared). > But relying on userspace to always set the correct permissions on the > msharefs file is dangerous - likely to get it wrong on occasion - and > isn't sufficient. The msharefs API needs to be bullet proof. > > Looking at the patches, I cannot see where extra validation is being > done for a named mapping to ensure any 'importer' has the necessary > permission against the backing file. > The 'struct file' (->vm_file, and associated inode) in the VMA is > sufficient to perform required access checks against the file's perms > - the posted patches do not check this (but they are for an RFC, so > don't expect all cases to be handled). But what about a full path > permission check? That is, the 'importer' has necessary permissions > on the backing file, but would not be able to find this file due to > directory permissions? msharefs would bypass the directory checks. 
> > >>> >>> The 'importer' would use the address/size of the memory region as >>> exported (and stored in msharefs), so no need for /sys file (except >>> for human readable info). >> >> I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size, >> I assume you are not proposing eliminating msharefs. > > The 'exporter' of the mshared region specifies the address and length, > and is therefore known by the mshare code. > An 'importer' needs to only pass NULL/0 for addr/len and is told by > mshare where the region has been attached in its address-space. With > this, the /sys file is no longer part of the API. This would require every importer to map the entire mshare'd range. Should we allow mapping a subset of the region, in which case the importer needs to supply a starting address and length? It is possibly simpler to only allow mapping the entire region, but if there is a use case for partial mapping, designing that flexibility in is useful. Thanks, Khalid
Being able to modify >>> the perms post namespace creation (fchown(2)), before exporting the >>> memory region, might be useful in some cases - but as I don't have any >>> use cases I'm not claiming it is essential. >>> >>>> >>>> Thanks, >>>> Khalid >>> >>> Cheers, >>> Mark >>>> >>>>> >>>>> Rather than introduce two new system calls, along with /sys/ file to >>>>> communicate global addresses, could mshare() be built on top of shmem >>>>> API? Thinking of something like the below; >>>>> 1) For shm_open(3), add a new oflag to indicate the properties needed >>>>> for mshare() (say, O_SHARED_PTE - better name?) >>>>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained >>>>> in the sizes which can be set. >>>>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE >>>>> objects. On first mmap()ing an appropriate address is assigned, >>>>> otherwise the current 'global' address is used. >>>>> 4) shm_unlink(3) destroys the object when last reference is dropped. >>>>> >>>>> For 3), might be able to weaken the NULL requirement and validate a >>>>> given address on first mapping to ensure it is correctly aligned. >>>>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be >>>>> the default behaviour you require. >>>>> >>>>> Internally, the handling of mshare()/O_SHARED_PTE memory might be >>>>> sufficiently different to shmem that there is not much code sharing >>>>> between the two (I haven't thought this through, but the object >>>>> naming/refcounting should be similar), but using shmem would be a >>>>> familiar API. >>>>> >>>>> Any thoughts? >>>>> >>>>> Cheers, >>>>> Mark >>>>> >>>> >>
On 1/24/22 12:45, Andy Lutomirski wrote: > On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: >> On 1/22/22 04:31, Mike Rapoport wrote: >>> (added linux-api) >>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: [... original mshare() proposal quoted in full; snipped ...] >>> Did you consider returning a file descriptor from mshare() system call? >>> Then there would be no need in mshare_unlink() as close(fd) would work. >> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though >> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. 
One of the use cases >> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling >> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to >> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic >> with a close() semantics though unless there was another way to force refcount to 0. Right? >> > > I'm not sure I understand the problem. If you're sharing a portion of > an mm and the mm goes away, then all that should be left are some > struct files that are no longer useful. They'll go away when their > refcount goes to zero. > > --Andy > The mm that holds shared PTEs is a separate mm, not tied to a task. I started out by sharing a portion of the donor process initially, but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd portion go away. One of the requirements I have is that the process that creates an mshare'd region can terminate, possibly involuntarily, while the mshare'd region persists and the rest of the consumer processes continue without hiccup. So I create a separate mm to hold the shared PTEs, and that mm is cleaned up when all references to the mshare'd region go away. Each call to mshare() increments the refcount and each call to mshare_unlink() decrements it. -- Khalid
On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 1/24/22 12:45, Andy Lutomirski wrote: [... original proposal and earlier discussion quoted in full; snipped ...] > > The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process > initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd > portion go away. > > One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but > the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to > hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare() > increments the refcount and each call to mshare_unlink() decrements the refcount. In general, objects which are kept alive by name tend to be quite awkward. Things like network namespaces essentially have to work that way and end up with awkward APIs. 
Things like shared memory don't actually have to be kept alive by name, and the cases that do keep them alive by name (tmpfs, shmget) can end up being so awkward that people invent nameless variants like memfd. So I would strongly suggest you see how the design works out with no names and no external keep-alive mechanism. Either have the continued existence of *any* fd keep the whole thing alive or make it be a pair of struct files, one that controls the region (and can map things into it, etc) and one that can map it. SCM_RIGHTS is pretty good for passing objects like this around. --Andy
On 1/24/22 16:16, Andy Lutomirski wrote: > On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >> On 1/24/22 12:45, Andy Lutomirski wrote: >>> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: >>>> >>>> On 1/22/22 04:31, Mike Rapoport wrote: >>>>> (added linux-api) >>>>> >>>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: >>>>>> Page tables in kernel consume some of the memory and as long as >>>>>> number of mappings being maintained is small enough, this space >>>>>> consumed by page tables is not objectionable. When very few memory >>>>>> pages are shared between processes, the number of page table entries >>>>>> (PTEs) to maintain is mostly constrained by the number of pages of >>>>>> memory on the system. As the number of shared pages and the number >>>>>> of times pages are shared goes up, amount of memory consumed by page >>>>>> tables starts to become significant. >>>>>> >>>>>> Some of the field deployments commonly see memory pages shared >>>>>> across 1000s of processes. On x86_64, each page requires a PTE that >>>>>> is only 8 bytes long which is very small compared to the 4K page >>>>>> size. When 2000 processes map the same page in their address space, >>>>>> each one of them requires 8 bytes for its PTE and together that adds >>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a >>>>>> database server with 300GB SGA, a system carsh was seen with >>>>>> out-of-memory condition when 1500+ clients tried to share this SGA >>>>>> even though the system had 512GB of memory. On this server, in the >>>>>> worst case scenario of all 1500 processes mapping every page from >>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs >>>>>> could be shared, amount of memory saved is very significant. >>>>>> >>>>>> This is a proposal to implement a mechanism in kernel to allow >>>>>> userspace processes to opt into sharing PTEs. 
The proposal is to add >>>>>> a new system call - mshare(), which can be used by a process to >>>>>> create a region (we will call it mshare'd region) which can be used >>>>>> by other processes to map same pages using shared PTEs. Other >>>>>> process(es), assuming they have the right permissions, can then make >>>>>> the mshare() system call to map the shared pages into their address >>>>>> space using the shared PTEs. When a process is done using this >>>>>> mshare'd region, it makes a mshare_unlink() system call to end its >>>>>> access. When the last process accessing mshare'd region calls >>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by >>>>>> it is freed. >>>>>> >>>>>> >>>>>> API Proposal >>>>>> ============ >>>>>> >>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink() >>>>>> >>>>>> -- >>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode) >>>>>> >>>>>> mshare() creates and opens a new, or opens an existing mshare'd >>>>>> region that will be shared at PTE level. "name" refers to shared object >>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address >>>>>> of this shared memory area and length is the size of this area. >>>>>> oflags can be one of: >>>>>> >>>>>> - O_RDONLY opens shared memory area for read only access by everyone >>>>>> - O_RDWR opens shared memory area for read and write access >>>>>> - O_CREAT creates the named shared memory area if it does not exist >>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area >>>>>> exists with that name, return an error. >>>>>> >>>>>> mode represents the creation mode for the shared object under >>>>>> /sys/fs/mshare. >>>>>> >>>>>> mshare() returns an error code if it fails, otherwise it returns 0. >>>>> >>>>> Did you consider returning a file descriptor from mshare() system call? >>>>> Then there would be no need in mshare_unlink() as close(fd) would work. 
>>>> >>>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though >>>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases >>>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling >>>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to >>>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic >>>> with a close() semantics though unless there was another way to force refcount to 0. Right? >>>> >>> >>> I'm not sure I understand the problem. If you're sharing a portion of >>> an mm and the mm goes away, then all that should be left are some >>> struct files that are no longer useful. They'll go away when their >>> refcount goes to zero. >>> >>> --Andy >>> >> >> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process >> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd >> portion go away. >> >> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but >> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to >> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare() >> increments the refcount and each call to mshare_unlink() decrements the refcount. > > In general, objects which are kept alive by name tend to be quite > awkward. Things like network namespaces essentially have to work that > way and end up with awkward APIs. 
Things like shared memory don't > actually have to be kept alive by name, and the cases that do keep > them alive by name (tmpfs, shmget) can end up being so awkward that > people invent nameless variants like memfd. > > So I would strongly suggest you see how the design works out with no > names and no external keep-alive mechanism. Either have the continued > existence of *any* fd keep the whole thing alive or make it be a pair > of struct files, one that controls the region (and can map things into > it, etc) and one that can map it. SCM_RIGHTS is pretty good for > passing objects like this around. > > --Andy > These are certainly good ideas to simplify this feature. My very first implementation of mshare did not have msharefs; it was fd-based, where the fd could be passed to any other process using SCM_RIGHTS, and it required the process creating the mshare region to stay alive for the region to exist. That certainly made life simpler in terms of implementation. Feedback from my customers of this feature (DB developers and people deploying DB systems) was that this imposes a hard dependency on a server process that creates the mshare'd region and passes the fd to other processes needing access to this region. This dependency creates a weak link in system reliability that is too risky. If the server process dies for any reason, the entire system becomes unavailable. They requested a more robust implementation that they can depend upon. I then went through the process of implementing this using shmfs since POSIX shm has those attributes. That turned out to be more kludgy than a clean implementation using a separate in-memory msharefs. That brought me to the RFC implementation I sent out. I do agree with you that a name-based persistent object makes the implementation more complex (maintaining a separate mm not tied to a process requires quite a bit of work to keep things consistent and to clean the mm up properly as users of this shared mm terminate), but I see the reliability point of view. 
Does that logic resonate with you? Thanks, Khalid
On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: > Example Code > ============ > > Snippet of the code that a donor process would run looks like below: > > ----------------- > addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, > MAP_SHARED | MAP_ANONYMOUS, 0, 0); > if (addr == MAP_FAILED) > perror("ERROR: mmap failed"); > > err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), > GB(512), O_CREAT|O_RDWR|O_EXCL, 600); > if (err < 0) { > perror("mshare() syscall failed"); > exit(1); > } > > strncpy(addr, "Some random shared text", > sizeof("Some random shared text")); > ----------------- > > Snippet of code that a consumer process would execute looks like: > > ----------------- > fd = open("testregion", O_RDONLY); > if (fd < 0) { > perror("open failed"); > exit(1); > } > > if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) > printf("INFO: %ld bytes shared at addr %lx \n", > mshare_info[1], mshare_info[0]); > else > perror("read failed"); > > close(fd); > > addr = (char *)mshare_info[0]; > err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], > mshare_info[1], O_RDWR, 600); > if (err < 0) { > perror("mshare() syscall failed"); > exit(1); > } > > printf("Guest mmap at %px:\n", addr); > printf("%s\n", addr); > printf("\nDone\n"); > > err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); > if (err < 0) { > perror("mshare_unlink() failed"); > exit(1); > } > ----------------- I wonder if we can get away with zero-API here: we can transparently create/use shared page tables for any inode on mmap(MAP_SHARED) as long as size and alignment are suitable. Page tables will be linked to the inode and will be freed when the last such mapping goes away. I don't see a need for new syscalls or flags to existing ones.
I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so. If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action. > On Jan 25, 2022, at 04:41, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote: >> Example Code >> ============ >> >> Snippet of the code that a donor process would run looks like below: >> >> ----------------- >> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >> if (addr == MAP_FAILED) >> perror("ERROR: mmap failed"); >> >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); >> if (err < 0) { >> perror("mshare() syscall failed"); >> exit(1); >> } >> >> strncpy(addr, "Some random shared text", >> sizeof("Some random shared text")); >> ----------------- >> >> Snippet of code that a consumer process would execute looks like: >> >> ----------------- >> fd = open("testregion", O_RDONLY); >> if (fd < 0) { >> perror("open failed"); >> exit(1); >> } >> >> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) >> printf("INFO: %ld bytes shared at addr %lx \n", >> mshare_info[1], mshare_info[0]); >> else >> perror("read failed"); >> >> close(fd); >> >> addr = (char *)mshare_info[0]; >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], >> mshare_info[1], O_RDWR, 600); >> if (err < 0) { >> perror("mshare() syscall failed"); >> exit(1); >> } >> >> printf("Guest mmap at %px:\n", addr); >> printf("%s\n", addr); >> printf("\nDone\n"); >> >> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); >> if (err < 0) { >> perror("mshare_unlink() failed"); >> exit(1); >> } >> ----------------- > > I wounder if we can get away with zero-API 
here: we can transparently > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as > size and alignment is sutiable. Page tables will be linked to the inode > and will be freed when the last of such mapping will go away. I don't see > a need in new syscalls of flags to existing one. > > -- > Kirill A. Shutemov >
On 25.01.22 13:09, William Kucharski wrote: > I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so. > > If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action. > I don't have time to look into details right now, but I see various possible hard-to-handle issues with sharing anon pages via this mechanism between processes. If we could restrict it to MAP_SHARED and have some magic toggle to opt in, that would be great. Features like uffd, which we might soon see on some MAP_SHARED pages, will require not sharing page tables automatically, I assume.
On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote: > I wounder if we can get away with zero-API here: we can transparently > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as > size and alignment is sutiable. Page tables will be linked to the inode > and will be freed when the last of such mapping will go away. I don't see > a need in new syscalls of flags to existing one. That's how HugeTLBfs works today, right? Would you want that mechanism hoisted into the real MM? Because my plan was the opposite -- remove it from the shadow MM once mshare() is established.
On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote: > On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote: > > I wounder if we can get away with zero-API here: we can transparently > > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as > > size and alignment is sutiable. Page tables will be linked to the inode > > and will be freed when the last of such mapping will go away. I don't see > > a need in new syscalls of flags to existing one. > > That's how HugeTLBfs works today, right? Would you want that mechanism > hoisted into the real MM? Because my plan was the opposite -- remove it > from the shadow MM once mshare() is established. I hate HugeTLBfs because it is a special place with its own rules. mshare() as proposed creates a new special place. I don't like this. It's better to find a way to integrate the feature natively into core-mm and let as many users as possible benefit from it. I think the zero-API approach (plus madvise() hints to tweak it) is worth considering.
On Tue, Jan 25, 2022 at 02:18:57PM +0100, David Hildenbrand wrote: > On 25.01.22 13:09, William Kucharski wrote: > > I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so. > > > > If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action. > > > > I don't have time to look into details right now, but I see various > possible hard-to-handle issues with sharing anon pages via this > mechanism between processes. Right. We should not break the invariant that an anonymous page can only be mapped into a process once. Otherwise we may need to deal with a new class of security issues.
On Tue, Jan 25, 2022 at 04:59:17PM +0300, Kirill A. Shutemov wrote: > On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote: > > On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote: > > > I wounder if we can get away with zero-API here: we can transparently > > > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as > > > size and alignment is sutiable. Page tables will be linked to the inode > > > and will be freed when the last of such mapping will go away. I don't see > > > a need in new syscalls of flags to existing one. > > > > That's how HugeTLBfs works today, right? Would you want that mechanism > > hoisted into the real MM? Because my plan was the opposite -- remove it > > from the shadow MM once mshare() is established. > > I hate HugeTLBfs because it is a special place with own rules. mshare() as > it proposed creates a new special place. I don't like this. No new special place. I suppose the only thing it creates that's "new" is an MM without any threads of its own. And from the MM point of view, that's not a new thing at all because the MM simply doesn't care how many threads share an MM. > It's better to find a way to integrate the feature natively into core-mm > and make as much users as possible to benefit from it. That's what mshare is trying to do! > I think zero-API approach (plus madvise() hints to tweak it) is worth > considering. I think the zero-API approach actually misses out on a lot of possibilities that the mshare() approach offers. For example, mshare() allows you to mmap() many small files in the shared region -- you can't do that with zeroAPI.
On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote: > > I think zero-API approach (plus madvise() hints to tweak it) is worth > > considering. > > I think the zero-API approach actually misses out on a lot of > possibilities that the mshare() approach offers. For example, mshare() > allows you to mmap() many small files in the shared region -- you > can't do that with zeroAPI. Do you consider a use-case for many small files to be common? I would think that the main consumer of the feature would be mmap of huge files. And in this case, zero enabling burden on the userspace side sounds like a sweet deal.
On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote: > > > I think zero-API approach (plus madvise() hints to tweak it) is worth > > > considering. > > > > I think the zero-API approach actually misses out on a lot of > > possibilities that the mshare() approach offers. For example, mshare() > > allows you to mmap() many small files in the shared region -- you > > can't do that with zeroAPI. > > Do you consider a use-case for many small files to be common? I would > think that the main consumer of the feature to be mmap of huge files. > And in this case zero enabling burden on userspace side sounds like a > sweet deal. mmap() of huge files is certainly the Oracle use-case. With occasional funny business like mprotect() of a single page in the middle of a 1GB hugepage. The approach of designating ranges of a process's address space as sharable with other processes felt like the cleaner & frankly more interesting approach that opens up use-cases other than "hurr, durr, we are Oracle, we like big files, kernel get out of way now, transactions to perform".
On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote: > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: > > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote: > > > > I think zero-API approach (plus madvise() hints to tweak it) is worth > > > > considering. > > > > > > I think the zero-API approach actually misses out on a lot of > > > possibilities that the mshare() approach offers. For example, mshare() > > > allows you to mmap() many small files in the shared region -- you > > > can't do that with zeroAPI. > > > > Do you consider a use-case for many small files to be common? I would > > think that the main consumer of the feature to be mmap of huge files. > > And in this case zero enabling burden on userspace side sounds like a > > sweet deal. > > mmap() of huge files is certainly the Oracle use-case. With occasional > funny business like mprotect() of a single page in the middle of a 1GB > hugepage. Bill and I were talking about this earlier and realised that this is the key point. There's a requirement that when one process mprotects a page that it gets protected in all processes. You can't do that without *some* API because that's different behaviour than any existing API would produce. So how about something like this ... int mcreate(const char *name, int flags, mode_t mode); creates a new mm_struct with a refcount of 2. returns an fd (one of the two refcounts) and creates a name for it (inside msharefs, holds the other refcount). You can then mmap() that fd to attach it to a chunk of your address space. Once attached, you can start to populate it by calling mmap() and specifying an address inside the attached mm as the first argument to mmap(). Maybe mcreate() is just a library call, and it's really a thin wrapper around open() that happens to know where msharefs is mounted.
On 26.01.22 05:04, Matthew Wilcox wrote: > On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote: >> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: >>> On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote: >>>>> I think zero-API approach (plus madvise() hints to tweak it) is worth >>>>> considering. >>>> >>>> I think the zero-API approach actually misses out on a lot of >>>> possibilities that the mshare() approach offers. For example, mshare() >>>> allows you to mmap() many small files in the shared region -- you >>>> can't do that with zeroAPI. >>> >>> Do you consider a use-case for many small files to be common? I would >>> think that the main consumer of the feature to be mmap of huge files. >>> And in this case zero enabling burden on userspace side sounds like a >>> sweet deal. >> >> mmap() of huge files is certainly the Oracle use-case. With occasional >> funny business like mprotect() of a single page in the middle of a 1GB >> hugepage. > > Bill and I were talking about this earlier and realised that this is > the key point. There's a requirement that when one process mprotects > a page that it gets protected in all processes. You can't do that > without *some* API because that's different behaviour than any existing > API would produce. A while ago I talked with Peter about an extended uffd (here: WP) mechanism that would work on fds instead of the process address space. The rough idea would be to register the uffd (or however that would be called) handler on an fd instead of a virtual address space of a single process and write-protect pages in that fd. Once anybody would try writing to such a protected range (write, mmap, ...), the uffd handler would fire and user space could handle the event (-> unprotect). The page cache would have to remember the uffd information ("wp using uffd"). 
When (un)protecting pages using this mechanism, all page tables mapping the page would have to be updated accordingly using the rmap. At that point, we wouldn't care if it's a single page table (e.g., shared similar to hugetlb) or simply multiple page tables. It's a completely rough idea, I just wanted to mention it.
On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote: > A while ago I talked with Peter about an extended uffd (here: WP) > mechanism that would work on fds instead of the process address space. As far as I can tell, uffd is a grotesque hack that exists to work around the poor choice to use anonymous memory instead of file-backed memory in kvm. Every time I see somebody mention it, I feel pain.
On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote: > On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote: > > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: > > > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote: > > > > > I think zero-API approach (plus madvise() hints to tweak it) is worth > > > > > considering. > > > > > > > > I think the zero-API approach actually misses out on a lot of > > > > possibilities that the mshare() approach offers. For example, mshare() > > > > allows you to mmap() many small files in the shared region -- you > > > > can't do that with zeroAPI. > > > > > > Do you consider a use-case for many small files to be common? I would > > > think that the main consumer of the feature to be mmap of huge files. > > > And in this case zero enabling burden on userspace side sounds like a > > > sweet deal. > > > > mmap() of huge files is certainly the Oracle use-case. With occasional > > funny business like mprotect() of a single page in the middle of a 1GB > > hugepage. > > Bill and I were talking about this earlier and realised that this is > the key point. There's a requirement that when one process mprotects > a page that it gets protected in all processes. You can't do that > without *some* API because that's different behaviour than any existing > API would produce. "hurr, durr, we are Oracle" :P Sounds like a very niche requirement. I doubt there will more than single digit user count for the feature. Maybe only the DB. > So how about something like this ... > > int mcreate(const char *name, int flags, mode_t mode); > > creates a new mm_struct with a refcount of 2. returns an fd (one > of the two refcounts) and creates a name for it (inside msharefs, > holds the other refcount). > > You can then mmap() that fd to attach it to a chunk of your address > space. 
Once attached, you can start to populate it by calling > mmap() and specifying an address inside the attached mm as the first > argument to mmap(). That is not what mmap() would normally do to an existing mapping. So it requires special treatment. In general mmap() of a mm_struct scares me. I can't wrap my head around the implications. Like how does it work on fork()? How does accounting work? What happens on OOM? What prevents creating loops, like mapping a mm_struct inside itself? What do mremap()/munmap() do to such a mapping? Will it affect the mapping of the mm_struct or will it target a mapping inside the mm_struct? Maybe it just hasn't clicked for me, I dunno. > Maybe mcreate() is just a library call, and it's really a thin wrapper > around open() that happens to know where msharefs is mounted.
On 26.01.22 14:38, Matthew Wilcox wrote: > On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote: >> A while ago I talked with Peter about an extended uffd (here: WP) >> mechanism that would work on fds instead of the process address space. > > As far as I can tell, uffd is a grotesque hack that exists to work around > the poor choice to use anonymous memory instead of file-backed memory > in kvm. Every time I see somebody mention it, I feel pain. > I might be missing something important, because KVM can deal with file-backed memory just fine and uffd is used heavily outside of hypervisors. I'd love to learn how to do with files what ordinary uffd (handling missing/unpopulated pages) and uffd-wp (handling write access to pages) can do. Because if something like that already exists, it would be precisely what I am talking about. Maybe mentioning uffd was a bad choice ;)
On Wed, Jan 26, 2022 at 01:38:49PM +0000, Matthew Wilcox wrote: > On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote: > > A while ago I talked with Peter about an extended uffd (here: WP) > > mechanism that would work on fds instead of the process address space. > > As far as I can tell, uffd is a grotesque hack that exists to work around > the poor choice to use anonymous memory instead of file-backed memory > in kvm. Every time I see somebody mention it, I feel pain. How would file-backed memory have helped the major use-case of uffd, which is post-copy migration?
On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote: > On 26.01.22 14:38, Matthew Wilcox wrote: > > On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote: > >> A while ago I talked with Peter about an extended uffd (here: WP) > >> mechanism that would work on fds instead of the process address space. > > > > As far as I can tell, uffd is a grotesque hack that exists to work around > > the poor choice to use anonymous memory instead of file-backed memory > > in kvm. Every time I see somebody mention it, I feel pain. > > > > I might be missing something important, because KVM can deal with > file-back memory just fine and uffd is used heavily outside of hypervisors. > > I'd love to learn how to handle what ordinary uffd (handle > missing/unpopulated pages) and uffd-wp (handle write access to pages) > can do with files instead. Because if something like that already > exists, it would be precisely what I am talking about. Every notification that uffd wants already exists as a notification to the underlying filesystem. Something like a uffdfs [1] would be able to do everything that uffd does without adding extra crap all over the MM. [1] acronyms are bad, mmmkay?
On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote: > On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote: > > On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote: > > > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: > > > So how about something like this ... > > > > int mcreate(const char *name, int flags, mode_t mode); > > > > creates a new mm_struct with a refcount of 2. returns an fd (one > > of the two refcounts) and creates a name for it (inside msharefs, > > holds the other refcount). > > > > You can then mmap() that fd to attach it to a chunk of your address > > space. Once attached, you can start to populate it by calling > > mmap() and specifying an address inside the attached mm as the first > > argument to mmap(). > > That is not what mmap() would normally do to an existing mapping. So it > requires special treatment. > > In general mmap() of a mm_struct scares me. I can't wrap my head around > implications. > > Like how does it work on fork()? > > How accounting works? What happens on OOM? > > What prevents creating loops, like mapping a mm_struct inside itself? > > What mremap()/munmap() do to such mapping? Will it affect mapping of > mm_struct or will it target mapping inside the mm_sturct? > > Maybe it just didn't clicked for me, I donno. My understanding was that the new mm_struct would be rather stripped and will be used more as an abstraction for the shared page table, maybe I'm totally wrong :) > > Maybe mcreate() is just a library call, and it's really a thin wrapper > > around open() that happens to know where msharefs is mounted. > > -- > Kirill A. Shutemov
On 26.01.22 15:12, Matthew Wilcox wrote: > On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote: >> On 26.01.22 14:38, Matthew Wilcox wrote: >>> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote: >>>> A while ago I talked with Peter about an extended uffd (here: WP) >>>> mechanism that would work on fds instead of the process address space. >>> >>> As far as I can tell, uffd is a grotesque hack that exists to work around >>> the poor choice to use anonymous memory instead of file-backed memory >>> in kvm. Every time I see somebody mention it, I feel pain. >>> >> >> I might be missing something important, because KVM can deal with >> file-back memory just fine and uffd is used heavily outside of hypervisors. >> >> I'd love to learn how to handle what ordinary uffd (handle >> missing/unpopulated pages) and uffd-wp (handle write access to pages) >> can do with files instead. Because if something like that already >> exists, it would be precisely what I am talking about. > > Every notification that uffd wants already exists as a notification to > the underlying filesystem. Something like a uffdfs [1] would be able > to do everything that uffd does without adding extra crap all over the MM. I don't speak "filesystem" fluently, but I assume that could be an overlay over other fs? Peter is currently upstreaming uffd-wp for shmem. How could that look like when doing it the fs-way?
On 1/26/22 07:18, Mike Rapoport wrote: > On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote: >> On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote: >>> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote: >>>> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote: >> >>> So how about something like this ... >>> >>> int mcreate(const char *name, int flags, mode_t mode); >>> >>> creates a new mm_struct with a refcount of 2. returns an fd (one >>> of the two refcounts) and creates a name for it (inside msharefs, >>> holds the other refcount). >>> >>> You can then mmap() that fd to attach it to a chunk of your address >>> space. Once attached, you can start to populate it by calling >>> mmap() and specifying an address inside the attached mm as the first >>> argument to mmap(). >> >> That is not what mmap() would normally do to an existing mapping. So it >> requires special treatment. >> >> In general mmap() of a mm_struct scares me. I can't wrap my head around >> implications. >> >> Like how does it work on fork()? >> >> How accounting works? What happens on OOM? >> >> What prevents creating loops, like mapping a mm_struct inside itself? >> >> What mremap()/munmap() do to such mapping? Will it affect mapping of >> mm_struct or will it target mapping inside the mm_sturct? >> >> Maybe it just didn't clicked for me, I donno. > > My understanding was that the new mm_struct would be rather stripped and > will be used more as an abstraction for the shared page table, maybe I'm > totally wrong :) Your understanding is correct for the RFC implementation of mshare(). mcreate() is a different beast that I do not fully understand yet. From Matthew's explanation, it sounds like what he has in mind is that mcreate() is a frontend to mshare/msharefs; it uses mshare to create the shared region and thus allows a user to mprotect a single page inside the mmap it creates using the fd returned by mcreate(). 
mshare underneath automagically extends the new page protection to everyone sharing that page, owing to the shared PTEs. -- Khalid > >>> Maybe mcreate() is just a library call, and it's really a thin wrapper >>> around open() that happens to know where msharefs is mounted. >> >> -- >> Kirill A. Shutemov >