Message ID: 20200508144043.13893-1-joro@8bytes.org (mailing list archive)
Series: mm: Get rid of vmalloc_sync_(un)mappings()
On Fri, May 08, 2020 at 04:40:36PM +0200, Joerg Roedel wrote:
> Joerg Roedel (7):
>   mm: Add functions to track page directory modifications
>   mm/vmalloc: Track which page-table levels were modified
>   mm/ioremap: Track which page-table levels were modified
>   x86/mm/64: Implement arch_sync_kernel_mappings()
>   x86/mm/32: Implement arch_sync_kernel_mappings()
>   mm: Remove vmalloc_sync_(un)mappings()
>   x86/mm: Remove vmalloc faulting
>
>  arch/x86/include/asm/pgtable-2level_types.h |   2 +
>  arch/x86/include/asm/pgtable-3level_types.h |   2 +
>  arch/x86/include/asm/pgtable_64_types.h     |   2 +
>  arch/x86/include/asm/switch_to.h            |  23 ---
>  arch/x86/kernel/setup_percpu.c              |   6 +-
>  arch/x86/mm/fault.c                         | 176 +-------------------
>  arch/x86/mm/init_64.c                       |   5 +
>  arch/x86/mm/pti.c                           |   8 +-
>  drivers/acpi/apei/ghes.c                    |   6 -
>  include/asm-generic/5level-fixup.h          |   5 +-
>  include/asm-generic/pgtable.h               |  23 +++
>  include/linux/mm.h                          |  46 +++++
>  include/linux/vmalloc.h                     |  13 +-
>  kernel/notifier.c                           |   1 -
>  lib/ioremap.c                               |  46 +++--
>  mm/nommu.c                                  |  12 --
>  mm/vmalloc.c                                | 109 +++++++-----
>  17 files changed, 199 insertions(+), 286 deletions(-)

The only concern I have is the pgd_lock lock hold times. By not doing
on-demand faults anymore, and consistently calling sync_global_*(), we
iterate that pgd_list thing much more often than before, if I'm not
mistaken.
On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi,
>
> after the recent issue with vmalloc and tracing code[1] on x86 and a
> long history of previous issues related to the vmalloc_sync_mappings()
> interface, I thought the time had come to remove it. Please see [2],
> [3], and [4] for some other issues in the past.
>
> The patches are based on v5.7-rc4 and add tracking of page-table
> directory changes to the vmalloc and ioremap code. Depending on which
> page-table levels were changed, a new per-arch function is called:
> arch_sync_kernel_mappings().
>
> On x86-64 with 4-level paging, this function will not be called more
> than 64 times in a system's runtime (because the vmalloc space takes 64
> PGD entries, which are only populated but never cleared).

What's the maximum on other system types? It might make more sense to
take the memory hit and pre-populate all the tables at boot so we never
have to sync them.
Hi Peter,

thanks for reviewing this!

On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> The only concern I have is the pgd_lock lock hold times.
>
> By not doing on-demand faults anymore, and consistently calling
> sync_global_*(), we iterate that pgd_list thing much more often than
> before if I'm not mistaken.

This should not be a problem; from what I have seen, this function is
not called often on x86-64. The vmalloc area needs to be synchronized at
the top level there, which is by now the P4D level (with 4-level
paging). The vmalloc area takes 64 entries, and once all of them are
populated the function will not be called again.

On 32-bit it might be called more often, because synchronization happens
at the PMD level, which is also used for large-page-mapped ioremap
regions. But these don't happen very often, and there are also no VMAP
stacks on x86-32 which could cause this function to be called
frequently.

	Joerg
On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> What's the maximum on other system types? It might make more sense to
> take the memory hit and pre-populate all the tables at boot so we
> never have to sync them.

I need to look it up for 5-level paging; with 4-level paging it's 64
pages to pre-populate the vmalloc area.

But that would not solve the problem on x86-32, which needs to
synchronize unmappings at the PMD level.

	Joerg
On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > What's the maximum on other system types? It might make more sense to
> > take the memory hit and pre-populate all the tables at boot so we
> > never have to sync them.
>
> Need to look it up for 5-level paging, with 4-level paging its 64 pages
> to pre-populate the vmalloc area.
>
> But that would not solve the problem on x86-32, which needs to
> synchronize unmappings on the PMD level.

What changes in this series with x86-32? We already do that
synchronization, right? IOW, in the cases where the vmalloc *fault* code
does anything at all, we should have a small bound for how much memory
to preallocate and, if we preallocate it, then there is nothing to sync
and nothing to fault. And we have the benefit that we never need to sync
anything on 64-bit, which is kind of nice.

Do we actually need PMD-level things for 32-bit? What if we just
outlawed huge pages in the vmalloc space on 32-bit non-PAE? Or maybe the
net result isn't much of a cleanup after all, given the need to support
32-bit.
On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> > The only concern I have is the pgd_lock lock hold times.
> >
> > By not doing on-demand faults anymore, and consistently calling
> > sync_global_*(), we iterate that pgd_list thing much more often than
> > before if I'm not mistaken.
>
> Should not be a problem, from what I have seen this function is not
> called often on x86-64. The vmalloc area needs to be synchronized at
> the top-level there, which is by now P4D (with 4-level paging). And the
> vmalloc area takes 64 entries, when all of them are populated the
> function will not be called again.

Right; it's just that the moment you do trigger it, it'll iterate that
pgd_list, and that is potentially huge. Then again, that's not a new
problem. I suppose we can deal with it if/when it becomes an actual
problem.

I had a quick look and I think it might be possible to make it an
RCU-managed list. We'd have to remove the pgd_list entry in
put_task_struct_rcu_user(). Then we can play games in sync_global_*()
and use RCU iteration. New tasks (which can be missed in the RCU
iteration) will already have a full clone of the PGD anyway.

But like I said, something for later I suppose.
On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> What changes in this series with x86-32?

This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
that the synchronization happens every time PMD(s) in the vmalloc areas
are changed. Before this series, this synchronization only happened at
arbitrary places calling vmalloc_sync_(un)mappings().

> We already do that synchronization, right? IOW, in the cases where
> the vmalloc *fault* code does anything at all, we should have a small
> bound for how much memory to preallocate and, if we preallocate it,
> then there is nothing to sync and nothing to fault. And we have the
> benefit that we never need to sync anything on 64-bit, which is kind
> of nice.

I don't really get you here: what is pre-allocated, and why is there no
need to sync and fault then?

> Do we actually need PMD-level things for 32-bit? What if we just
> outlawed huge pages in the vmalloc space on 32-bit non-PAE?

Disallowing huge pages would at least remove the need to sync
unmappings, but we still need to sync new PMD entries. Remember that the
size of the vmalloc area on 32 bit is dynamic and depends on the VM
split and the actual amount of RAM on the system.

A machine with 512MB of RAM and a 1G/3G split will have around 2.5G of
vmalloc address space. And if we want to avoid vmalloc faults there, we
need to pre-allocate all PTE pages for that area (and the number of PTE
pages needed increases when RAM decreases).

On a machine with 512MB of RAM we would need ca. 1270+ PTE pages, which
is around 5MB (or 1% of total system memory).

Regards,

	Joerg
On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
>
> I suppose we can deal with it if/when it becomes an actual problem.
>
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

Right, it should not be that difficult to rcu-protect the pgd_list, but
we can try that when it actually becomes a problem.

	Joerg
On Sat, May 9, 2020 at 10:52 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> Disallowing huge-pages would at least remove the need to sync
> unmappings, but we still need to sync new PMD entries. Remember that
> the size of the vmalloc area on 32 bit is dynamic and depends on the
> VM-split and the actual amount of RAM on the system.
>
> A machine with 512MB of RAM and a 1G/3G split will have around 2.5G of
> VMALLOC address space. And if we want to avoid vmalloc-faults there, we
> need to pre-allocate all PTE pages for that area (and the amount of PTE
> pages needed increases when RAM decreases).
>
> On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
> is around 5M (or 1% of total system memory).

I can never remember which P?D name goes with which level and which
machine type, but I don't think I agree with your math regardless.

On x86, there are two fundamental situations that can occur:

1. Non-PAE. There is a single 4k top-level page table per mm, and this
table contains either 512 or 1024 entries total. Of those entries, some
fraction (half or less) control the kernel address space, and some
fraction of *that* is for vmalloc space. Those entries are the *only*
thing that needs syncing -- all mms will either have null (not present)
in those slots or will have pointers to the *same* next-level-down
directories.

2. PAE. Depending on your perspective, there could be a grand total of
four top-level paging pointers, of which one (IIRC) is for the kernel.
That points to the same place for all mms. Or, if you look at it the
other way, PAE is just like #1 except that the top-level table has only
four entries and only one points to vmalloc space.

So, unless I'm missing something here, there is an absolute maximum of
512 top-level entries that ever need to be synchronized.

Now, there's an additional complication. On x86_64, we have a rule:
those entries that need to be synced start out null and may, during the
lifetime of the system, change *once*. They are never unmapped or
modified after being allocated. This means that those entries can only
ever point to a page *table* and not to a ginormous page. So, even if
the hardware were to support ginormous pages (which, IIRC, it doesn't),
we would be limited to merely immense and not ginormous pages in the
vmalloc range. On x86_32, I don't think we have this rule right now,
and this means that it's possible for one of these pages to be unmapped
or modified.

So my suggestion is to just apply the x86_64 rule to x86_32 as well.
The practical effect will be that 2-level-paging systems will not be
able to use huge pages in the vmalloc range, since the rule will be
that the vmalloc-relevant entries in the top-level table must point to
page *tables* instead of huge pages.

On top of this, if we preallocate these entries, then the maximum
amount of memory we can possibly waste is 4k * (entries pointing to
vmalloc space - entries actually used for vmalloc space). I don't know
what this number typically is, but I don't think it's very large.

Preallocating means that vmalloc faults *and* synchronization go away
entirely. All of the page tables used for vmalloc will be entirely
shared by all mms, so all that's needed to modify vmalloc mappings is
to update init_mm and, if needed, flush TLBs. No other page tables will
need modification at all.

On x86_64, the only real advantage is that the handful of corner cases
that make vmalloc faults unpleasant (mostly relating to vmap stacks) go
away. On x86_32, a bunch of mind-bending stuff (everything your series
deletes but also almost everything your series *adds*) goes away.
There may be a genuine tiny performance hit on 2-level systems due to
the loss of huge pages in vmalloc space, but I'm not sure I care or
that we use them anyway on these systems. And PeterZ can stop even
thinking about RCU.

Am I making sense?

(Aside: I *hate* the PMD, etc. terminology. Even the kernel's C types
can't keep track of whether pmd_t* points to an entire paging directory
or to a single entry. Similarly, everyone knows that a pte_t is a "page
table entry", except that pte_t* might instead be a pointer to an array
of 512 or 1024 page table entries.)
Hi Andy,

On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> 1. Non-PAE. There is a single 4k top-level page table per mm, and
> this table contains either 512 or 1024 entries total. Of those
> entries, some fraction (half or less) control the kernel address
> space, and some fraction of *that* is for vmalloc space. Those
> entries are the *only* thing that needs syncing -- all mms will either
> have null (not present) in those slots or will have pointers to the
> *same* next-level-down directories.

Not entirely correct: on 32-bit non-PAE there are two levels with 1024
entries each. In Linux these map to the PMD and PTE levels. With 1024
entries, each PMD maps 4MB of address space. How many of these 1024
top-level entries are used for the kernel is configuration dependent in
Linux; it can be 768, 512, or 256.

> 2. PAE. Depending on your perspective, there could be a grand total
> of four top-level paging pointers, of which one (IIRC) is for the
> kernel.

That is again configuration dependent; up to 3 of the top-level pointers
can be used for kernel space.

> That points to the same place for all mms. Or, if you look at it the
> other way, PAE is just like #1 except that the top-level table has
> only four entries and only one points to VMALLOC space.

There are more differences. In PAE an entry is 64 bits wide, which means
each PMD entry only maps 2MB of address space, so you need two PTE pages
to map the same amount of address space you could map with one page
without PAE paging. And as noted above, all 3 of the top-level pointers
can point to vmalloc space (not exclusively).

> So, unless I'm missing something here, there is an absolute maximum of
> 512 top-level entries that ever need to be synchronized.

And here is where your assumption is wrong. On 32-bit PAE systems it is
not the top-level entries that need to be synchronized for vmalloc, but
the second-level entries. And dependent on the kernel configuration,
there are (in total, not only vmalloc) 1536, 1024, or 512 of these
second-level entries. How many of them are actually used for vmalloc
depends on the size of the system RAM (but it is at least 64), because
the vmalloc area begins after the kernel direct mapping (with an 8MB
unmapped hole).

With 512MB of RAM you need 256 entries for the kernel direct mapping (I
ignore the ISA hole here). After that there are 4 unmapped entries and
the rest is vmalloc, minus some fixmap and LDT mappings at the end of
the address space. I haven't looked up the exact number, but 4 entries
(8MB) for this should be close to the real value:

	1536 total PMD entries
	-256 for the direct mapping
	  -4 for the ISA hole
	  -4 for stuff at the end of the address space
	-----
	1272 PMD entries for vmalloc space which need synchronization

With less RAM it is even more, and to get rid of faulting and
synchronization you need to pre-allocate a 4k PTE page for each of
these PMD entries.

> Now, there's an additional complication. On x86_64, we have a rule:
> those entries that need to be synced start out null and may, during
> the lifetime of the system, change *once*. They are never unmapped or
> modified after being allocated. [...] On x86_32, I don't think we have
> this rule right now. And this means that it's possible for one of
> these pages to be unmapped or modified.

The reason for x86-32 being different is that the address space is
orders of magnitude smaller than on x86-64. We just have 4 top-level
entries with PAE paging and can't afford to partition kernel address
space on that level like we do on x86-64. That is the reason the
address space is partitioned on the second (PMD) level, which is also
the reason vmalloc synchronization needs to happen on that level. And
because that's not enough yet, it's also the page-table level used to
map huge pages.

> So my suggestion is that just apply the x86_64 rule to x86_32 as well.
> The practical effect will be that 2-level-paging systems will not be
> able to use huge pages in the vmalloc range, since the rule will be
> that the vmalloc-relevant entries in the top-level table must point to
> page *tables* instead of huge pages.

I could very well live with prohibiting huge-page ioremap mappings for
x86-32. But as I wrote before, this doesn't solve the problems I am
trying to address with this patch-set, or would only address them if a
significant amount of total system memory were used.

The pre-allocation solution would work for x86-64; it would only need
256KB of pre-allocated memory for the vmalloc range to never
synchronize or fault again. I thought about that and did the math
before writing this patch-set, but doing the math for 32 bit drove me
away from it for the reasons written above.

And since a lot of the vmalloc_sync_(un)mappings problems I debugged
were actually related to 32-bit, I want a solution that works for 32-
and 64-bit x86 (at least until support for x86-32 is removed). And I
think this patch-set provides a solution that works well for both.

	Joerg
On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> > Should not be a problem, from what I have seen this function is not
> > called often on x86-64. The vmalloc area needs to be synchronized at
> > the top-level there, which is by now P4D (with 4-level paging). And
> > the vmalloc area takes 64 entries, when all of them are populated
> > the function will not be called again.
>
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
>
> I suppose we can deal with it if/when it becomes an actual problem.
>
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

One of the things on my long-term todo list is to replace
mm_struct.mmlist with an XArray containing all mm_structs. Then we can
use one mark to indicate maybe-swapped and another mark to indicate ...
whatever it is pgd_list indicates. Iterating an XArray (whether the
entire thing or with marks) is RCU-safe and faster than iterating a
linked list, so this should solve the problem?
On Sat, May 9, 2020 at 2:57 PM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi Andy,
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> > So, unless I'm missing something here, there is an absolute maximum
> > of 512 top-level entries that ever need to be synchronized.
>
> And here is where your assumption is wrong. On 32-bit PAE systems it
> is not the top-level entries that need to be synchronized for vmalloc,
> but the second-level entries. And dependent on the kernel
> configuration, there are (in total, not only vmalloc) 1536, 1024, or
> 512 of these second-level entries. How much of them are actually used
> for vmalloc depends on the size of the system RAM (but is at least
> 64), because the vmalloc area begins after the kernel direct-mapping
> (with an 8MB unmapped hole).

I spent some time looking at the code, and I'm guessing you're talking
about the 3-level !SHARED_KERNEL_PMD case. I can't quite figure out
what's going on.

Can you explain what is actually going on that causes different
mms/pgds to have top-level entries in the kernel range that point to
different tables? Because I'm not seeing why this makes any sense.

> The reason for x86-32 being different is that the address space is
> orders of magnitude smaller than on x86-64. We just have 4 top-level
> entries with PAE paging and can't afford to partition kernel-address
> space on that level like we do on x86-64. That is the reason the
> address space is partitioned on the second (PMD) level, which is also
> the reason vmalloc synchronization needs to happen on that level. And
> because that's not enough yet, it's also the page-table level to map
> huge-pages.

Why does it need to be partitioned at all? The only thing that comes to
mind is that the LDT range is per-mm. So I can imagine that the PAE
case with a 3G user / 1G kernel split has to have the vmalloc range and
the LDT range in the same top-level entry. Yuck.

> I could very well live with prohibiting huge-page ioremap mappings for
> x86-32. But as I wrote before, this doesn't solve the problems I am
> trying to address with this patch-set, or would only address them if
> significant amount of total system memory is used.
>
> The pre-allocation solution would work for x86-64, it would only need
> 256kb of preallocated memory for the vmalloc range to never
> synchronize or fault again. I have thought about that and did the math
> before writing this patch-set, but doing the math for 32 bit drove me
> away from it for reasons written above.

If it's *just* the LDT that's a problem, we could plausibly shrink the
user address range a little bit and put the LDT in the user portion. I
suppose this could end up creating its own set of problems involving
tracking which code owns which page tables.

> And since a lot of the vmalloc_sync_(un)mappings problems I debugged
> were actually related to 32-bit, I want a solution that works for 32
> and 64-bit x86 (at least until support for x86-32 is removed). And I
> think this patch-set provides a solution that works well for both.

I'm not fundamentally objecting to your patch set, but I do want to
understand what's going on that needs this stuff.
On Sat, May 09, 2020 at 10:05:43PM -0700, Andy Lutomirski wrote:
> I spent some time looking at the code, and I'm guessing you're talking
> about the 3-level !SHARED_KERNEL_PMD case. I can't quite figure out
> what's going on.
>
> Can you explain what is actually going on that causes different
> mms/pgds to have top-level entries in the kernel range that point to
> different tables? Because I'm not seeing why this makes any sense.

There are three cases where the PMDs are not shared on x86-32:

	1) With non-PAE, the top level is already the PMD level, because
	   the page table only has two levels. Since the top level can't
	   be shared, the PMDs are also not shared.

	2) For some reason Xen-PV also does not share kernel PMDs on PAE
	   systems, but I haven't looked into why.

	3) On 32-bit PAE with PTI enabled, the kernel address space
	   contains the LDT mapping, which is also different per PGD.
	   There is one PMD entry reserved for the LDT, giving it 2MB of
	   address space. I implemented it this way to keep the 32-bit
	   implementation of PTI mostly similar to the 64-bit one.

> Why does it need to be partitioned at all? The only thing that comes
> to mind is that the LDT range is per-mm. So I can imagine that the
> PAE case with a 3G user / 1G kernel split has to have the vmalloc
> range and the LDT range in the same top-level entry. Yuck.

PAE with 3G user / 1G kernel has _all_ of the kernel mappings in one
top-level entry (direct mapping, vmalloc, LDT, fixmap).

> If it's *just* the LDT that's a problem, we could plausibly shrink the
> user address range a little bit and put the LDT in the user portion.
> I suppose this could end up creating its own set of problems involving
> tracking which code owns which page tables.

Yeah, for the PTI case it is only the LDT that causes the unshared
kernel PMDs, but even if we move the LDT somewhere else we still have
two-level paging and the Xen-PV case.

	Joerg
On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> Iterating an XArray (whether the entire thing or with marks) is
> RCU-safe and faster than iterating a linked list, so this should solve
> the problem?

It can hardly be faster if you want all elements -- which is I think the
case here. We only call into this if we change an entry, and then we
need to propagate that change to all.
On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> On x86_64, the only real advantage is that the handful of corner cases
> that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> go away. On x86_32, a bunch of mind-bending stuff (everything your
> series deletes but also almost everything your series *adds*) goes
> away. There may be a genuine tiny performance hit on 2-level systems
> due to the loss of huge pages in vmalloc space, but I'm not sure I
> care or that we use them anyway on these systems. And PeterZ can stop
> even thinking about RCU.
>
> Am I making sense?

I think it'll work for x86_64 and that is really all I care about :-)
On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I think it'll work for x86_64 and that is really all I care about :-)

Sadly, I think that Joerg has convinced me that this doesn't really
work for 32-bit unless we rework the LDT code or drop support for
something that we might not want to drop support for.

So, last try -- maybe we can start defeaturing 32-bit: what if we make
32-bit PTI depend on PAE? And drop 32-bit Xen PV support? And make
32-bit huge pages depend on PAE?

Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI (and
optionally PAE non-PTI) can use the evil virtually mapped LDT, and
32-bit non-PAE (the 2-level case) will only have pointers to page
tables at the top level. And then we can preallocate.

Or maybe we don't want to defeature this much, or maybe the memory hit
from this preallocation will hurt little 2-level 32-bit systems too
much. (Xen 32-bit PV support seems to be on its way out upstream.)
On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > Iterating an XArray (whether the entire thing or with marks) is
> > RCU-safe and faster than iterating a linked list, so this should
> > solve the problem?
>
> It can hardly be faster if you want all elements -- which is I think
> the case here. We only call into this if we change an entry, and then
> we need to propagate that change to all.

Of course it can be faster. Iterating an array is faster than iterating
a linked list because caches. While an XArray is a segmented array (so
slower than a plain array), it's plainly going to be faster than
iterating a linked list.
On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> >
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
>
> Of course it can be faster. Iterating an array is faster than iterating
> a linked list because caches. While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Quantifying this:

$ ./array-vs-list
walked sequential array in 0.002039s
walked sequential list in 0.002807s
walked sequential array in 0.002017s
walked shuffled list in 0.102367s
walked shuffled array in 0.012114s

Attached is the source code; above results on a Kaby Lake with
CFLAGS="-O2 -W -Wall -g".

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

unsigned long count = 1000 * 1000;
unsigned int verbose;

struct object {
	struct object *next;
	struct object *prev;
	unsigned int value;
};

#define printv(level, fmt, ...) \
	if (level <= verbose) { printf(fmt, ##__VA_ARGS__); }

void check_total(unsigned long total)
{
	if (total * 2 != count * (count + 1))
		printf("Check your staging! (%lu %lu)\n", total, count);
}

void alloc_objs(struct object **array)
{
	unsigned long i;

	for (i = 0; i < count; i++) {
		struct object *obj = malloc(sizeof(*obj));
		obj->value = i + 1;
		/* Add to the array */
		array[i] = obj;
	}
}

void shuffle(struct object **array, unsigned long seed)
{
	unsigned long i;

	printv(1, "random seed %lu\n", seed);
	srand48(seed);

	/* Shuffle the array */
	for (i = 1; i < count; i++) {
		struct object *obj;
		unsigned long j = (unsigned int)mrand48() % (i + 1);
		if (i == j)
			continue;

		obj = array[j];
		array[j] = array[i];
		array[i] = obj;
	}
}

void create_list(struct object **array, struct object *list)
{
	unsigned long i;

	list->next = list;
	list->prev = list;

	for (i = 0; i < count; i++) {
		struct object *obj = array[i];
		/* Add to the tail of the list */
		obj->next = list;
		obj->prev = list->prev;
		list->prev->next = obj;
		list->prev = obj;
	}
}

void walk_list(struct object *list)
{
	unsigned long total = 0;
	struct object *obj;

	for (obj = list->next; obj != list; obj = obj->next)
		total += obj->value;

	check_total(total);
}

void walk_array(struct object **array)
{
	unsigned long total = 0;
	unsigned long i;

	for (i = 0; i < count; i++)
		total += array[i]->value;

	check_total(total);
}

/* time2 - time1 */
double diff_time(struct timespec *time1, struct timespec *time2)
{
	double result = time2->tv_nsec - time1->tv_nsec;

	return time2->tv_sec - time1->tv_sec + result / 1000 / 1000 / 1000;
}

int main(int argc, char **argv)
{
	int opt;
	unsigned long seed = time(NULL);
	struct object **array;
	struct object list;
	struct timespec time1, time2;

	while ((opt = getopt(argc, argv, "c:s:v")) != -1) {
		if (opt == 'c')
			count *= strtoul(optarg, NULL, 0);
		else if (opt == 's')
			seed = strtoul(optarg, NULL, 0);
		else if (opt == 'v')
			verbose++;
	}

	clock_gettime(CLOCK_MONOTONIC, &time1);
	array = calloc(count, sizeof(void *));
	alloc_objs(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "allocated %lu items in %fs\n", count,
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential array in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	create_list(array, &list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "created list in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_list(&list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential list in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential array in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	shuffle(array, seed);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "shuffled array in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	create_list(array, &list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "created list in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_list(&list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked shuffled list in %fs\n", diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked shuffled array in %fs\n", diff_time(&time1, &time2));

	return 0;
}
On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> >
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
>
> Of course it can be faster. Iterating an array is faster than iterating
> a linked list because caches. While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Fair enough, mostly also because the actual work (setting a single PTE)
doesn't dominate in this case.
On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> >
> > > On x86_64, the only real advantage is that the handful of corner cases
> > > that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> > > go away. On x86_32, a bunch of mind-bending stuff (everything your
> > > series deletes but also almost everything your series *adds*) goes
> > > away. There may be a genuine tiny performance hit on 2-level systems
> > > due to the loss of huge pages in vmalloc space, but I'm not sure I
> > > care or that we use them anyway on these systems. And PeterZ can stop
> > > even thinking about RCU.
> > >
> > > Am I making sense?
> >
> > I think it'll work for x86_64 and that is really all I care about :-)
>
> Sadly, I think that Joerg has convinced me that this doesn't really
> work for 32-bit unless we rework the LDT code or drop support for
> something that we might not want to drop support for.

I was thinking we keep these patches for 32-bit and fix 64-bit 'proper'.
But sure, if we can get rid of it all by stripping 32-bit features I'm
not going to object either.
On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> What if we make 32-bit PTI depend on PAE?

It already does; PTI support for legacy paging had to be removed because
there were memory corruption problems with THP. The reason was that huge
PTEs in the user-space area were mapped in two page-tables (kernel and
user), but A/D bits were only fetched from the kernel part. To not make
things more complicated we agreed on just not supporting PTI without
PAE.

> And drop 32-bit Xen PV support? And make 32-bit huge pages depend on
> PAE? Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
> And 32-bit non-PAE (the 2-level case) will only have pointers to page
> tables at the top level. And then we can preallocate.

Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
from the direct mapping? I am guessing you want to get rid of the
SHARED_KERNEL_PMD==0 case for PAE kernels. This would indeed make
syncing unnecessary on PAE, but pre-allocation would still be needed
for 2-level paging. Just the amount of memory needed for the
pre-allocated PTE pages is half as big as it would be with PAE.

> Or maybe we don't want to defeature this much, or maybe the memory hit
> from this preallocation will hurt little 2-level 32-bit systems too
> much.

It will certainly make Linux less likely to boot on low-memory x86-32
systems, for whoever is still affected by this.

	Joerg
> On May 11, 2020, at 12:14 PM, Joerg Roedel <jroedel@suse.de> wrote:
>
> On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
>> What if we make 32-bit PTI depend on PAE?
>
> It already does, PTI support for legacy paging had to be removed because
> there were memory corruption problems with THP. The reason was that huge
> PTEs in the user-space area were mapped in two page-tables (kernel and
> user), but A/D bits were only fetched from the kernel part. To not make
> things more complicated we agreed on just not supporting PTI without
> PAE.
>
>> And drop 32-bit Xen PV support? And make 32-bit huge pages depend on
>> PAE? Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
>> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
>> And 32-bit non-PAE (the 2-level case) will only have pointers to page
>> tables at the top level. And then we can preallocate.
>
> Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
> from the direct mapping? I am guessing you want to get rid of the
> SHARED_KERNEL_PMD==0 case for PAE kernels.

I wrote nonsense. I meant bite off a piece of the *user* portion of the
address space and stick the LDT there. I contemplated doing this when I
wrote the 64-bit code, but I decided we had so much address space to
throw around that I liked my solution better.

> This would indeed make
> syncing unnecessary on PAE, but pre-allocation would still be needed
> for 2-level paging. Just the amount of memory needed for the
> pre-allocated PTE pages is half as big as it would be with PAE.
>
>> Or maybe we don't want to defeature this much, or maybe the memory hit
>> from this preallocation will hurt little 2-level 32-bit systems too
>> much.
>
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.

I’m guessing the right solution is either your series or your series
plus preallocation on 64-bit. I’m just grumpy about it...
On Mon, 11 May 2020 21:14:14 +0200
Joerg Roedel <jroedel@suse.de> wrote:

> > Or maybe we don't want to defeature this much, or maybe the memory hit
> > from this preallocation will hurt little 2-level 32-bit systems too
> > much.
>
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.

That may be an issue, as I'm guessing the biggest users of x86-32 are
probably small systems that use Intel's attempts at the embedded space.

-- Steve
On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> I’m guessing the right solution is either your series or your series
> plus preallocation on 64-bit. I’m just grumpy about it...

Okay, so we can do the pre-allocation if it turns out the pgd_lock hold
times become a problem on x86-64. The tracking code in vmalloc.c is
needed anyway for 32-bit, and there is no reason why 64-bit shouldn't
use it as well for now.

I don't think that taking the lock _will_ be a problem, as it is only
taken when a new PGD/P4D entry is populated. And it is pretty unlikely
that a system will populate all 64 of them, since with 4-level paging
each of these entries maps 512GB of address space. But if I am wrong
here, pre-allocating is still an option.

	Joerg
On Tue, 12 May 2020 17:02:50 +0200
Joerg Roedel <jroedel@suse.de> wrote:

> On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> > I’m guessing the right solution is either your series or your series
> > plus preallocation on 64-bit. I’m just grumpy about it...
>
> Okay, so we can do the pre-allocation when it turns out the pgd_list
> lock-times become a problem on x86-64. The tracking code in vmalloc.c
> is needed anyway for 32-bit and there is no reason why 64-bit
> shouldn't use it as well for now.
>
> I don't think that taking the lock _will_ be a problem, as it is only
> taken when a new PGD/P4D entry is populated. And it is pretty unlikely
> that a system will populate all 64 of them, with 4-level paging each of
> these entries will map 512GB of address space. But if I am wrong here
> pre-allocating is still an option.

256TB of RAM isn't too far in the future ;-)

-- Steve