Message ID: cover.1656531090.git.khalid.aziz@oracle.com (mailing list archive)
Series: Add support for shared PTEs across processes
Hi Khalid, On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote: > > > Memory pages shared between processes require a page table entry > (PTE) for each process. Each of these PTE consumes consume some of > the memory and as long as number of mappings being maintained is > small enough, this space consumed by page tables is not > objectionable. When very few memory pages are shared between > processes, the number of page table entries (PTEs) to maintain is > mostly constrained by the number of pages of memory on the system. > As the number of shared pages and the number of times pages are > shared goes up, amount of memory consumed by page tables starts to > become significant. This issue does not apply to threads. Any number > of threads can share the same pages inside a process while sharing > the same PTEs. Extending this same model to sharing pages across > processes can eliminate this issue for sharing across processes as > well. > > Some of the field deployments commonly see memory pages shared > across 1000s of processes. On x86_64, each page requires a PTE that > is only 8 bytes long which is very small compared to the 4K page > size. When 2000 processes map the same page in their address space, > each one of them requires 8 bytes for its PTE and together that adds > up to 8K of memory just to hold the PTEs for one 4K page. On a > database server with 300GB SGA, a system crash was seen with > out-of-memory condition when 1500+ clients tried to share this SGA > even though the system had 512GB of memory. On this server, in the > worst case scenario of all 1500 processes mapping every page from > SGA would have required 878GB+ for just the PTEs. If these PTEs > could be shared, amount of memory saved is very significant. > > This patch series implements a mechanism in kernel to allow > userspace processes to opt into sharing PTEs. It adds a new > in-memory filesystem - msharefs. A file created on msharefs creates > a new shared region where all processes sharing that region will > share the PTEs as well. A process can create a new file on msharefs > and then mmap it which assigns a starting address and size to this > mshare'd region. Another process that has the right permission to > open the file on msharefs can then mmap this file in its address > space at same virtual address and size and share this region through > shared PTEs. An unlink() on the file marks the mshare'd region for > deletion once there are no more users of the region. When the mshare > region is deleted, all the pages used by the region are freed. Noting the flexibility of 'mshare' has been reduced from v1. The earlier version allowed msharing of named mappings, while this patch is only for anonymous mappings. Any plans to support named mappings? If not, I guess *someone* will want it (eventually). Minor, as the patch does not introduce new syscalls, but having an API which is flexible for both named and anon mappings would be good (this is a nit, not a strong suggestion). The cover letter details the problem being solved and the API, but gives no details of the implementation. A paragraph on the use of a mm_struct per-msharefs file would be helpful. I've only quickly scanned the patchset; not in enough detail to comment on each patch, but a few observations. o I was expecting to see mprotect() against a mshared vma to either be disallowed or code to support the splitting of a mshared vma. I didn't see either. 
o For the case where the mshare file has been closed/unmmap but not unlinked, a 'mshare_data' structure will leaked when the inode is evicted. o The alignment requirement is PGDIR_SIZE, which is very large. Should/could this be PMD_SIZE? o mshare should be a conditional feature (CONFIG_MSHARE ?). I might get a chance do a finer grain review later/tomorrow. > API > === > > mshare does not introduce a new API. It instead uses existing APIs > to implement page table sharing. The steps to use this feature are: > > 1. Mount msharefs on /sys/fs/mshare - > mount -t msharefs msharefs /sys/fs/mshare > > 2. mshare regions have alignment and size requirements. Start > address for the region must be aligned to an address boundary and > be a multiple of fixed size. This alignment and size requirement > can be obtained by reading the file /sys/fs/mshare/mshare_info > which returns a number in text format. mshare regions must be > aligned to this boundary and be a multiple of this size. > > 3. For the process creating mshare region: > a. Create a file on /sys/fs/mshare, for example - > fd = open("/sys/fs/mshare/shareme", > O_RDWR|O_CREAT|O_EXCL, 0600); > > b. mmap this file to establish starting address and size - > mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > > c. Write and read to mshared region normally. > > 4. For processes attaching to mshare'd region: > a. Open the file on msharefs, for example - > fd = open("/sys/fs/mshare/shareme", O_RDWR); > > b. Get information about mshare'd region from the file: > struct mshare_info { > unsigned long start; > unsigned long size; > } m_info; > > read(fd, &m_info, sizeof(m_info)); > > c. mmap the mshare'd region - > mmap(m_info.start, m_info.size, > PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > > 5. 
To delete the mshare region - > unlink("/sys/fs/mshare/shareme"); > > > > Example Code > ============ > > Snippet of the code that a donor process would run looks like below: > > ----------------- > fd = open("/sys/fs/mshare/mshare_info", O_RDONLY); > read(fd, req, 128); > alignsize = atoi(req); > close(fd); > fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600); > start = alignsize * 4; > size = alignsize * 2; > addr = mmap((void *)start, size, PROT_READ | PROT_WRITE, > MAP_SHARED | MAP_ANONYMOUS, 0, 0); Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0) > if (addr == MAP_FAILED) > perror("ERROR: mmap failed"); > > strncpy(addr, "Some random shared text", > sizeof("Some random shared text")); > ----------------- > > > Snippet of code that a consumer process would execute looks like: > > ----------------- > struct mshare_info { > unsigned long start; > unsigned long size; > } minfo; > > > fd = open("/sys/fs/mshare/shareme", O_RDONLY); > > if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0)) > printf("INFO: %ld bytes shared at addr 0x%lx \n", > minfo.size, minfo.start); > > addr = mmap(minfo.start, minfo.size, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > > printf("Guest mmap at %px:\n", addr); > printf("%s\n", addr); > printf("\nDone\n"); > > ----------------- > > > > v1 -> v2: > - Eliminated mshare and mshare_unlink system calls and > replaced API with standard mmap and unlink (Based upon > v1 patch discussions and LSF/MM discussions) > - All fd based API (based upon feedback and suggestions from > Andy Lutomirski, Eric Biederman, Kirill and others) > - Added a file /sys/fs/mshare/mshare_info to provide > alignment and size requirement info (based upon feedback > from Dave Hansen, Mark Hemment and discussions at LSF/MM) > - Addressed TODOs in v1 > - Added support for directories in msharefs > - Added locks around any time vma is touched (Dave Hansen) > - Eliminated the need to point vm_mm in original vmas to the > newly synthesized mshare mm > - Ensured mmap_read_unlock is called for correct mm in > handle_mm_fault (Dave Hansen) > > Khalid Aziz (9): > mm: Add msharefs filesystem > mm/mshare: pre-populate msharefs with information file > mm/mshare: make msharefs writable and support directories > mm/mshare: Add a read operation for msharefs files > mm/mshare: Add vm flag for shared PTE > mm/mshare: Add mmap operation > mm/mshare: Add unlink and munmap support > mm/mshare: Add basic page table sharing support > mm/mshare: Enable mshare region mapping across processes > > Documentation/filesystems/msharefs.rst | 19 + > include/linux/mm.h | 10 + > include/trace/events/mmflags.h | 3 +- > include/uapi/linux/magic.h | 1 + > include/uapi/linux/mman.h | 5 + > mm/Makefile | 2 +- > mm/internal.h | 7 + > mm/memory.c | 101 ++++- > mm/mshare.c | 575 +++++++++++++++++++++++++ > 9 files changed, 719 insertions(+), 4 deletions(-) > create mode 100644 Documentation/filesystems/msharefs.rst > create mode 100644 mm/mshare.c > > -- > 2.32.0 Cheers, Mark
On 6/30/22 05:57, Mark Hemment wrote: > Hi Khalid, > > On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >> >> Memory pages shared between processes require a page table entry >> (PTE) for each process. Each of these PTE consumes consume some of >> the memory and as long as number of mappings being maintained is >> small enough, this space consumed by page tables is not >> objectionable. When very few memory pages are shared between >> processes, the number of page table entries (PTEs) to maintain is >> mostly constrained by the number of pages of memory on the system. >> As the number of shared pages and the number of times pages are >> shared goes up, amount of memory consumed by page tables starts to >> become significant. This issue does not apply to threads. Any number >> of threads can share the same pages inside a process while sharing >> the same PTEs. Extending this same model to sharing pages across >> processes can eliminate this issue for sharing across processes as >> well. >> >> Some of the field deployments commonly see memory pages shared >> across 1000s of processes. On x86_64, each page requires a PTE that >> is only 8 bytes long which is very small compared to the 4K page >> size. When 2000 processes map the same page in their address space, >> each one of them requires 8 bytes for its PTE and together that adds >> up to 8K of memory just to hold the PTEs for one 4K page. On a >> database server with 300GB SGA, a system crash was seen with >> out-of-memory condition when 1500+ clients tried to share this SGA >> even though the system had 512GB of memory. On this server, in the >> worst case scenario of all 1500 processes mapping every page from >> SGA would have required 878GB+ for just the PTEs. If these PTEs >> could be shared, amount of memory saved is very significant. >> >> This patch series implements a mechanism in kernel to allow >> userspace processes to opt into sharing PTEs. It adds a new >> in-memory filesystem - msharefs. A file created on msharefs creates >> a new shared region where all processes sharing that region will >> share the PTEs as well. A process can create a new file on msharefs >> and then mmap it which assigns a starting address and size to this >> mshare'd region. Another process that has the right permission to >> open the file on msharefs can then mmap this file in its address >> space at same virtual address and size and share this region through >> shared PTEs. An unlink() on the file marks the mshare'd region for >> deletion once there are no more users of the region. When the mshare >> region is deleted, all the pages used by the region are freed. > > Noting the flexibility of 'mshare' has been reduced from v1. The > earlier version allowed msharing of named mappings, while this patch > is only for anonymous mappings. > Any plans to support named mappings? If not, I guess *someone* will > want it (eventually). Minor, as the patch does not introduce new > syscalls, but having an API which is flexible for both named and anon > mappings would be good (this is a nit, not a strong suggestion). I apologize for not clarifying this. The initial mmap() call looks like an anonymous mapping but one could easily call mremap later and map any other objects in the same address space which remains shared until the mshare region is torn down. It is my intent to support mapping any objects in mshare region. > > The cover letter details the problem being solved and the API, but > gives no details of the implementation. 
A paragraph on the use of a
> mm_struct per-msharefs file would be helpful.

Good point. I will do that next time.

>
> I've only quickly scanned the patchset; not in enough detail to
> comment on each patch, but a few observations.
>
> o I was expecting to see mprotect() against a mshared vma to either
> be disallowed or code to support the splitting of a mshared vma. I
> didn't see either.

Since mshare region is intended to support multiple objects being mapped in the region and different protections on different parts of region, mprotect should be supported and should handle splitting the mshare'd vmas. Until basic code is solid, it would make sense to prevent splitting vmas and add that on later. I will add this code.

>
> o For the case where the mshare file has been closed/unmmap but not
> unlinked, a 'mshare_data' structure will leaked when the inode is
> evicted.

You are right. mshare_evict_inode() needs to call msharefs_delmm() to clean up.

>
> o The alignment requirement is PGDIR_SIZE, which is very large.
> Should/could this be PMD_SIZE?

Yes, PGDIR_SIZE is large. It works for the database folks who requested this feature but PMD might be more versatile. I have been thinking about switching to PMD since that will make it easier to move hugetlbfs page table sharing code over to this code.

>
> o mshare should be a conditional feature (CONFIG_MSHARE ?).

I can do that. I was reluctant to add yet another CONFIG option. Since this feature is activated explicitly by userspace code, is it necessary to make it a config option?

>
>
> I might get a chance do a finer grain review later/tomorrow.
>
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>         mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>    address for the region must be aligned to an address boundary and
>>    be a multiple of fixed size. This alignment and size requirement
>>    can be obtained by reading the file /sys/fs/mshare/mshare_info
>>    which returns a number in text format. mshare regions must be
>>    aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating mshare region:
>>         a. Create a file on /sys/fs/mshare, for example -
>>                 fd = open("/sys/fs/mshare/shareme",
>>                                 O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>         b. mmap this file to establish starting address and size -
>>                 mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>                         MAP_SHARED, fd, 0);
>>
>>         c. Write and read to mshared region normally.
>>
>> 4. For processes attaching to mshare'd region:
>>         a. Open the file on msharefs, for example -
>>                 fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>>         b. Get information about mshare'd region from the file:
>>                 struct mshare_info {
>>                         unsigned long start;
>>                         unsigned long size;
>>                 } m_info;
>>
>>                 read(fd, &m_info, sizeof(m_info));
>>
>>         c. mmap the mshare'd region -
>>                 mmap(m_info.start, m_info.size,
>>                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5.
To delete the mshare region - >> unlink("/sys/fs/mshare/shareme"); >> >> >> >> Example Code >> ============ >> >> Snippet of the code that a donor process would run looks like below: >> >> ----------------- >> fd = open("/sys/fs/mshare/mshare_info", O_RDONLY); >> read(fd, req, 128); >> alignsize = atoi(req); >> close(fd); >> fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600); >> start = alignsize * 4; >> size = alignsize * 2; >> addr = mmap((void *)start, size, PROT_READ | PROT_WRITE, >> MAP_SHARED | MAP_ANONYMOUS, 0, 0); > > Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0) Yes, you are right. I will fix that. Thanks, Mark! I really appreciate your taking time to review this code. -- Khalid
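A self-contained version of the donor snippet, with the fd argument Mark points out above, might look like the following. This is only an illustrative sketch against the v2 API described in the cover letter, not code from the patch set; the choice of start address and size is arbitrary and error handling is minimal.

-----------------
/* Illustrative donor sketch (not from the patch set): create an mshare
 * region sized and aligned per /sys/fs/mshare/mshare_info, passing the
 * fd that was missing in the cover-letter example. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	char req[128] = {0};
	long alignsize;
	void *addr;
	int fd;

	/* Step 1: read the alignment/size requirement (text format). */
	fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
	if (fd < 0 || read(fd, req, sizeof(req) - 1) <= 0) {
		perror("mshare_info");
		return 1;
	}
	alignsize = atol(req);
	close(fd);

	/* Step 2: create the msharefs file that names the shared region. */
	fd = open("/sys/fs/mshare/shareme", O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0) {
		perror("shareme");
		return 1;
	}

	/* Step 3: mmap it to fix the region's start address and size.
	 * Both must be multiples of alignsize; alignsize * 4 as the start
	 * is arbitrary, as in the cover letter. */
	addr = mmap((void *)(alignsize * 4), alignsize * 2,
		    PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, fd, 0);	/* fd, not 0 */
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	strncpy(addr, "Some random shared text",
		sizeof("Some random shared text"));
	pause();	/* keep the region alive for consumers */
	return 0;
}
-----------------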
On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: > This patch series implements a mechanism in kernel to allow > userspace processes to opt into sharing PTEs. It adds a new > in-memory filesystem - msharefs. Dumb question: why do we need a new filesystem for this? Is it not feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
On 7/1/22 22:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs.
>
> Dumb question: why do we need a new filesystem for this? Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?

Hi Andrew,

The new filesystem is meant to provide only the control files for PTE sharing. It contains a file that provides the alignment/size requirements. Other files are created as named objects to represent shared regions, and these files provide information about the size and virtual address of each shared region when the file is read. Actual shared data is not hosted on msharefs. Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc files.

Thanks,
Khalid
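A matching consumer-side sketch of what Khalid describes here -- open the named region file, read its start/size, then mmap the region at that fixed address -- could look like the following. This is illustrative only, filling out the cover-letter snippet with includes and error checking; it is not code from the patch set.

-----------------
/* Illustrative consumer sketch (not from the patch set): attach to the
 * region created by a donor by reading its start/size from the named
 * msharefs file and mapping it at that fixed address. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

struct mshare_info {
	unsigned long start;
	unsigned long size;
};

int main(void)
{
	struct mshare_info minfo;
	void *addr;
	int fd;

	fd = open("/sys/fs/mshare/shareme", O_RDWR);
	if (fd < 0 || read(fd, &minfo, sizeof(minfo)) != sizeof(minfo)) {
		perror("shareme");
		return 1;
	}
	printf("INFO: %lu bytes shared at addr 0x%lx\n",
	       minfo.size, minfo.start);

	/* The shared PTEs only line up if the region is mapped at the same
	 * virtual address in every process, hence the fixed start address. */
	addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("%s\n", (char *)addr);
	return 0;
}
-----------------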
On 02.07.22 06:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs.
>
> Dumb question: why do we need a new filesystem for this? Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>

IIRC, the general opinion at LSF/MM was that this approach at hand makes people nervous, and I at least am not convinced that we really want to have this upstream.

What's *completely* missing from the cover letter are the dirty details: "Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc files.". Gah. As raised already, "anonymous pages" makes me shiver.

(From what I read, this looks like an RFC to me, yet I see "v2". But I am a little confused why most of the feedback at LSF/MM seems to be ignored and people are moving forward with this approach. But maybe my memory is wrong.)

Please, let's look into more generic page table sharing just like hugetlb already provides to some degree. And if we need additional alignment requirements to share multiple page table levels, then let's look into that as well for special users.
On 7/8/22 05:47, David Hildenbrand wrote: > On 02.07.22 06:24, Andrew Morton wrote: >> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >>> This patch series implements a mechanism in kernel to allow >>> userspace processes to opt into sharing PTEs. It adds a new >>> in-memory filesystem - msharefs. >> >> Dumb question: why do we need a new filesystem for this? Is it not >> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files? >> > > IIRC, the general opinion at LSF/MM was that this approach at hand is > makes people nervous and I at least am not convinced that we really want > to have this upstream. Hi David, You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and this just extends that concept to processes. A number of people have commented on potential usefulness of this concept and implementation. There were concerns raised about being able to make this safe and reliable. I had agreed to send a second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that wrong. > > What's *completely* missing from the cover letter are the dirty details: > "Actual data is mmap'ed using anonymous pages, ext4/xfs/btfrfs/etc > files.". Gah. Yes, cover letter in v2 patch was a little lacking. I will add more details next time. > > As raised already, "anonymous pages" makes me shiver. > > > (To me, what I read, this looks like an RFC to me, yet I see "v2". But I > am a little confused why most of the feedback at LSF/MM seems to be > ignored and people are moving forward with this approach. But maybe my > memory is wrong.) > > Please, let's look into more generic page table sharing just like > hugetlb already provides to some degree. And if we need additional > alignment requirements to share multiple page table levels, then let's > look into that as well for special users. > Thanks, Khalid
On 08.07.22 21:36, Khalid Aziz wrote: > On 7/8/22 05:47, David Hildenbrand wrote: >> On 02.07.22 06:24, Andrew Morton wrote: >>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: >>> >>>> This patch series implements a mechanism in kernel to allow >>>> userspace processes to opt into sharing PTEs. It adds a new >>>> in-memory filesystem - msharefs. >>> >>> Dumb question: why do we need a new filesystem for this? Is it not >>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files? >>> >> >> IIRC, the general opinion at LSF/MM was that this approach at hand is >> makes people nervous and I at least am not convinced that we really want >> to have this upstream. > > Hi David, Hi Khalid, > > You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and > this just extends that concept to processes. They share a *mm* including a consistent virtual memory layout (VMA list). Page table sharing is just a side product of that. You could even call page tables just an implementation detail to produce that consistent virtual memory layout -- described for that MM via a different data structure. > A number of people have commented on potential usefulness of this concept > and implementation. ... and a lot of people raised concerns. Yes, page table sharing to reduce memory consumption/tlb misses/... is something reasonable to have. But that doesn't require mshare, as hugetlb has proven. The design might be useful for a handful of corner (!) cases, but as the cover letter only talks about memory consumption of page tables, I'll not care about those. Once these corner cases are explained and deemed important, we might want to think of possible alternatives to explore the solution space. > There were concerns raised about being able to make this safe and reliable. > I had agreed to send a > second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The Okay, most of the changes I saw are related to the user interface, not to any of the actual dirty implementation-detail concerns. And the cover letter is not really clear what's actually happening under the hood and what the (IMHO) weird semantics of the design imply (as can be seen from Andrews reply). > suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion > on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is > better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that > wrong. Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff, and asked the honest question if we can just remove it and replace it by something generic in the future. And as I learned, we most probably cannot rip that out without affecting existing user space. Even replacing it by mshare() would degrade existing user space. So the natural thing to reduce page table consumption (again, what this cover letter talks about) for user space (semi- ?)automatically for MAP_SHARED files is to factor out what hugetlb has, and teach generic MM code to cache and reuse page tables (PTE and PMD tables should be sufficient) where suitable. For reasonably aligned mappings and mapping sizes, it shouldn't be too hard (I know, locking ...), to cache and reuse page tables attached to files -- similar to what hugetlb does, just in a generic way. 
We might want a mechanism to enable/disable this for specific processes and/or VMAs, but these are minor details. And that could come for free for existing user space, because page tables, and how they are handled, would just be an implementation detail. I'd be really interested into what the major roadblocks/downsides file-based page table sharing has. Because I am not convinced that a mechanism like mshare() -- that has to be explicitly implemented+used by user space -- is required for that.
On 07/13/22 16:00, David Hildenbrand wrote: > On 08.07.22 21:36, Khalid Aziz wrote: > > On 7/8/22 05:47, David Hildenbrand wrote: > >> On 02.07.22 06:24, Andrew Morton wrote: > >>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: > > > suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion > > on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is > > better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that > > wrong. > > Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff, > and asked the honest question if we can just remove it and replace it by > something generic in the future. And as I learned, we most probably > cannot rip that out without affecting existing user space. Even > replacing it by mshare() would degrade existing user space. > > So the natural thing to reduce page table consumption (again, what this > cover letter talks about) for user space (semi- ?)automatically for > MAP_SHARED files is to factor out what hugetlb has, and teach generic MM > code to cache and reuse page tables (PTE and PMD tables should be > sufficient) where suitable. > > For reasonably aligned mappings and mapping sizes, it shouldn't be too > hard (I know, locking ...), to cache and reuse page tables attached to > files -- similar to what hugetlb does, just in a generic way. We might > want a mechanism to enable/disable this for specific processes and/or > VMAs, but these are minor details. > > And that could come for free for existing user space, because page > tables, and how they are handled, would just be an implementation detail. > > > I'd be really interested into what the major roadblocks/downsides > file-based page table sharing has. Because I am not convinced that a > mechanism like mshare() -- that has to be explicitly implemented+used by > user space -- is required for that. Perhaps this is an 'opportunity' for me to write up in detail how hugetlb pmd sharing works. As you know, I have been struggling with keeping that working AND safe AND performant. Who knows, this may lead to changes in the existing implementation.
On 13.07.22 19:58, Mike Kravetz wrote: > On 07/13/22 16:00, David Hildenbrand wrote: >> On 08.07.22 21:36, Khalid Aziz wrote: >>> On 7/8/22 05:47, David Hildenbrand wrote: >>>> On 02.07.22 06:24, Andrew Morton wrote: >>>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion >>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is >>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that >>> wrong. >> >> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff, >> and asked the honest question if we can just remove it and replace it by >> something generic in the future. And as I learned, we most probably >> cannot rip that out without affecting existing user space. Even >> replacing it by mshare() would degrade existing user space. >> >> So the natural thing to reduce page table consumption (again, what this >> cover letter talks about) for user space (semi- ?)automatically for >> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM >> code to cache and reuse page tables (PTE and PMD tables should be >> sufficient) where suitable. >> >> For reasonably aligned mappings and mapping sizes, it shouldn't be too >> hard (I know, locking ...), to cache and reuse page tables attached to >> files -- similar to what hugetlb does, just in a generic way. We might >> want a mechanism to enable/disable this for specific processes and/or >> VMAs, but these are minor details. >> >> And that could come for free for existing user space, because page >> tables, and how they are handled, would just be an implementation detail. >> >> >> I'd be really interested into what the major roadblocks/downsides >> file-based page table sharing has. Because I am not convinced that a >> mechanism like mshare() -- that has to be explicitly implemented+used by >> user space -- is required for that. > > Perhaps this is an 'opportunity' for me to write up in detail how > hugetlb pmd sharing works. As you know, I have been struggling with > keeping that working AND safe AND performant. Yes, and I have your locking-related changes in my inbox marked as "to be reviewed" :D Sheding some light on that would be highly appreciated, especially, how hugetlb-specific it currently is and for which reason.
On 7/13/22 08:00, David Hildenbrand wrote: > On 08.07.22 21:36, Khalid Aziz wrote: >> On 7/8/22 05:47, David Hildenbrand wrote: >>> On 02.07.22 06:24, Andrew Morton wrote: >>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote: >>>> >>>>> This patch series implements a mechanism in kernel to allow >>>>> userspace processes to opt into sharing PTEs. It adds a new >>>>> in-memory filesystem - msharefs. >>>> >>>> Dumb question: why do we need a new filesystem for this? Is it not >>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files? >>>> >>> >>> IIRC, the general opinion at LSF/MM was that this approach at hand is >>> makes people nervous and I at least am not convinced that we really want >>> to have this upstream. >> >> Hi David, > > Hi Khalid, > >> >> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and >> this just extends that concept to processes. > > They share a *mm* including a consistent virtual memory layout (VMA > list). Page table sharing is just a side product of that. You could even > call page tables just an implementation detail to produce that > consistent virtual memory layout -- described for that MM via a > different data structure. Yes, sharing an mm and vma chain does make it different from implementation point of view. > >> A number of people have commented on potential usefulness of this concept >> and implementation. > > ... and a lot of people raised concerns. Yes, page table sharing to > reduce memory consumption/tlb misses/... is something reasonable to > have. But that doesn't require mshare, as hugetlb has proven. > > The design might be useful for a handful of corner (!) cases, but as the > cover letter only talks about memory consumption of page tables, I'll > not care about those. Once these corner cases are explained and deemed > important, we might want to think of possible alternatives to explore > the solution space. Memory consumption by page tables is turning out to be significant issue. I mentioned one real-world example from a customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for every one. > >> There were concerns raised about being able to make this safe and reliable. >> I had agreed to send a >> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The > > Okay, most of the changes I saw are related to the user interface, not > to any of the actual dirty implementation-detail concerns. And the cover > letter is not really clear what's actually happening under the hood and > what the (IMHO) weird semantics of the design imply (as can be seen from > Andrews reply). Sure, I will add more details to the cover letter next time. msharefs needs more explanation. I will highlight the creation of a new mm struct for mshare regions that is not owned by any process. There was another under-the-hood change that is listed in changelog but could have been highlighted - "Eliminated the need to point vm_mm in original vmas to the newly synthesized mshare mm". How the fields in new mm struct are used helped make this change and could use more details in cover letter. > >> suggestion to extend hugetlb PMD sharing was discussed briefly. 
Conclusion from that discussion and earlier discussion >> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is >> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that >> wrong. > > Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff, > and asked the honest question if we can just remove it and replace it by > something generic in the future. And as I learned, we most probably > cannot rip that out without affecting existing user space. Even > replacing it by mshare() would degrade existing user space. > > So the natural thing to reduce page table consumption (again, what this > cover letter talks about) for user space (semi- ?)automatically for > MAP_SHARED files is to factor out what hugetlb has, and teach generic MM > code to cache and reuse page tables (PTE and PMD tables should be > sufficient) where suitable. > > For reasonably aligned mappings and mapping sizes, it shouldn't be too > hard (I know, locking ...), to cache and reuse page tables attached to > files -- similar to what hugetlb does, just in a generic way. We might > want a mechanism to enable/disable this for specific processes and/or > VMAs, but these are minor details. > > And that could come for free for existing user space, because page > tables, and how they are handled, would just be an implementation detail. > > > I'd be really interested into what the major roadblocks/downsides > file-based page table sharing has. Because I am not convinced that a > mechanism like mshare() -- that has to be explicitly implemented+used by > user space -- is required for that. > I see two parts to what you are suggesting (please correct me if I get this wrong): 1. Implement a generic page table sharing mechanism 2. Implement a way to use this mechanism from userspace For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing code and recasting it into this framework of special mm struct. I will look some more into it. As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction. Thanks, Khalid
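To picture the "special mm struct" approach Khalid describes above, a rough kernel-style sketch of the fault-path redirection might look like the following. This is illustrative pseudo-code only, not code from the patch set; the flag name VM_SHARED_PT and the helper mshare_host_mm() are assumptions made for this sketch.

-----------------
/*
 * Illustrative sketch only, not from the patch set: a fault on a VMA
 * belonging to an mshare region is resolved against the region's own
 * mm_struct, which hosts the shared page tables, instead of the
 * faulting process's mm.  VM_SHARED_PT and mshare_host_mm() are
 * made-up names used here for illustration.
 */
static vm_fault_t mshare_fault_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct mm_struct *host_mm;

	if (!(vma->vm_flags & VM_SHARED_PT))
		return 0;	/* not an mshare VMA; normal fault path */

	/* The mshare region keeps one set of page tables in its own mm. */
	host_mm = mshare_host_mm(vma);

	mmap_read_lock(host_mm);
	/*
	 * Walk/populate the page tables of host_mm here; every process
	 * that mapped the region shares these PTEs, so the fault is
	 * serviced once for all of them.
	 */
	mmap_read_unlock(host_mm);

	return 0;
}
-----------------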
[sorry for not being as responsive as I usually am] >> >> They share a *mm* including a consistent virtual memory layout (VMA >> list). Page table sharing is just a side product of that. You could even >> call page tables just an implementation detail to produce that >> consistent virtual memory layout -- described for that MM via a >> different data structure. > > Yes, sharing an mm and vma chain does make it different from implementation point of view. > >> >>> A number of people have commented on potential usefulness of this concept >>> and implementation. >> >> ... and a lot of people raised concerns. Yes, page table sharing to >> reduce memory consumption/tlb misses/... is something reasonable to >> have. But that doesn't require mshare, as hugetlb has proven. >> >> The design might be useful for a handful of corner (!) cases, but as the >> cover letter only talks about memory consumption of page tables, I'll >> not care about those. Once these corner cases are explained and deemed >> important, we might want to think of possible alternatives to explore >> the solution space. > > Memory consumption by page tables is turning out to be significant issue. I mentioned one real-world example from a > customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into > their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for > every one. Yes. Another use case I am aware of are KVM-based virtual machines, when VM memory (shmem, file-backed) is not only mapped into the emulator process, but also into other processes used to carry out I/O (e.g., vhost-user). In that case, it's tempting to simply share the page tables between all processes for the shared mapping -- automatically, just like shmem/hugetlb already does. [...] >> >>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion >>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is >>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that >>> wrong. >> >> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff, >> and asked the honest question if we can just remove it and replace it by >> something generic in the future. And as I learned, we most probably >> cannot rip that out without affecting existing user space. Even >> replacing it by mshare() would degrade existing user space. >> >> So the natural thing to reduce page table consumption (again, what this >> cover letter talks about) for user space (semi- ?)automatically for >> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM >> code to cache and reuse page tables (PTE and PMD tables should be >> sufficient) where suitable. >> >> For reasonably aligned mappings and mapping sizes, it shouldn't be too >> hard (I know, locking ...), to cache and reuse page tables attached to >> files -- similar to what hugetlb does, just in a generic way. We might >> want a mechanism to enable/disable this for specific processes and/or >> VMAs, but these are minor details. >> >> And that could come for free for existing user space, because page >> tables, and how they are handled, would just be an implementation detail. >> >> >> I'd be really interested into what the major roadblocks/downsides >> file-based page table sharing has. 
Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
>>
>
> I see two parts to what you are suggesting (please correct me if I get this wrong):
>
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace

Yes. Whereby 2) would usually just be some heuristic (e.g., file > X MiB -> start sharing), with an additional way to just disable it or just enable it. But yes, most of that stuff should just be automatic.

>
> For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is
> to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from
> this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing
> code and recasting it into this framework of special mm struct. I will look some more into it.

The basic idea would be that whenever a MAP_SHARED VMA has a reasonable size, is aligned in a suitable way (including MAP offset), and protections match, you can just share PTE tables and even PMD tables. As page tables of shared mappings usually don't really store per-process information (exceptions I am aware of are userfaultfd and softdirty tracking), we can simply share/unshare page tables of shared mappings fairly easily.

Then, you'd have e.g., 2 sets of page tables cached by the fd that can be mapped into processes:

1) PROT_READ|PROT_WRITE
2) PROT_READ

On VMA protection changes, one would have to unshare (unmap the page table) and either map another shared one, or map a private one. I don't think there would be a need to optimize e.g., for PROT_NONE, but of course, other combinations could make sense to cache. PROT_NONE and other corner cases (softdirty tracking) would simply not use shared page tables.

Shared page tables would have to be refcounted, and one could e.g., implement a shrinker that frees unused page tables in the fd cache when memory reclaim kicks in.

With something like that in place, page table consumption could be reduced and vmscan/rmap walks could turn out more efficient.

>
> As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file
> based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal
> for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you
> can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction.

We can glue it to the file or anything else that's shared, I think -- I don't think we particularly care, as long as it's something shared between processes to be mapped. And to be quite honest, whenever I read about anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice screaming: just use *shared* memory when you want to *share* memory between processes, and optimize that if anything is missing.

Having that said, I understood from previous discussions that there is a use case of efficient read-only protection across many processes/VMAs. I was wondering if that could be handled on the fs-level (pte_mkwrite). I remember I raised the idea before: if one could have a userfaultfd-wp-style (overlay?) file (system?), user-space could protect/unprotect file pages via a different mechanism (ioctl) and get notified about write access via something similar to userfaultfd user-space handlers, not via signals. Instead of adjusting VMAs, one would only adjust file page mappings to map the relevant pages R/O when protecting -- if page tables are shared, that would be efficient.

Now, that is just a very vague brain dump to get it out of my (overloaded) system. What I think the overall message is: let's try not designing new features around page table sharing; let's use page table sharing as an rmap performance optimization and as a mechanism to reduce page table overhead.

I hope what I said makes sense, I might just be wrong.
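For what it's worth, the file-based caching David sketches above could be pictured roughly as follows. None of these structure or field names exist in the kernel; this is a conceptual illustration of page tables cached per file, keyed by protection, and refcounted so a shrinker could drop unused ones.

-----------------
/*
 * Conceptual illustration only (not existing kernel code): page tables
 * cached per file, keyed by (offset range, protection), refcounted so
 * a shrinker can free unused entries when memory reclaim kicks in.
 */
struct shared_pt_entry {
	struct list_head	list;		/* linked off the file's cache  */
	pgoff_t			first_index;	/* file range covered ...       */
	pgoff_t			last_index;	/* ... by this page table       */
	pgprot_t		prot;		/* e.g. an RW and an RO set     */
	pmd_t			*pmd_table;	/* the shared PMD/PTE tables    */
	atomic_t		refcount;	/* mappings currently using it  */
};

struct shared_pt_cache {
	struct list_head	entries;	/* shared_pt_entry list         */
	spinlock_t		lock;
	/* A shrinker would walk 'entries' and free refcount==0 tables. */
};
-----------------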