Message ID | 20240113094436.2506396-1-sunnanyong@huawei.com (mailing list archive) |
---|---|
Headers | show |
Series | A Solution to Re-enable hugetlb vmemmap optimize | expand |
On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > HVO was previously disabled on arm64 [1] due to the lack of necessary > BBM(break-before-make) logic when changing page tables. > This set of patches fix this by adding necessary BBM sequence when > changing page table, and supporting vmemmap page fault handling to > fixup kernel address translation fault if vmemmap is concurrently accessed. I'm not keen on this approach. I'm not even sure it's safe. In the second patch, you take the init_mm.page_table_lock on the fault path but are we sure this is unlocked when the fault was taken? Basically you can get a fault anywhere something accesses a struct page. How often is this code path called? I wonder whether a stop_machine() approach would be simpler. Andrew, I'd suggest we drop these patches from the mm tree for the time being. They haven't received much review from the arm64 folk. Thanks.
On 2024/1/26 2:06, Catalin Marinas wrote: > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: >> HVO was previously disabled on arm64 [1] due to the lack of necessary >> BBM(break-before-make) logic when changing page tables. >> This set of patches fix this by adding necessary BBM sequence when >> changing page table, and supporting vmemmap page fault handling to >> fixup kernel address translation fault if vmemmap is concurrently accessed. > I'm not keen on this approach. I'm not even sure it's safe. In the > second patch, you take the init_mm.page_table_lock on the fault path but > are we sure this is unlocked when the fault was taken? I think this situation is impossible. In the implementation of the second patch, when the page table is being corrupted (the time window when a page fault may occur), vmemmap_update_pte() already holds the init_mm.page_table_lock, and unlock it until page table update is done.Another thread could not hold the init_mm.page_table_lock and also trigger a page fault at the same time. If I have missed any points in my thinking, please correct me. Thank you. > Basically you can > get a fault anywhere something accesses a struct page. > > How often is this code path called? I wonder whether a stop_machine() > approach would be simpler. As long as allocating or releasing hugetlb is called. We cannot limit users to only allocate or release hugetlb when booting or not running any workload on all other cpus, so if use stop_machine(), it will be triggered 8 times every 2M and 4096 times every 1G, which is probably too expensive. I saw that on the X86, in order to improve performance, optimizations such as batch tlb flushing have been done, means that some users are concerned about the performance of hugetlb allocation: https://lwn.net/ml/linux-kernel/20230905214412.89152-1-mike.kravetz@oracle.com/ > Andrew, I'd suggest we drop these patches from the mm tree for the time > being. They haven't received much review from the arm64 folk. Thanks. >
On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > On 2024/1/26 2:06, Catalin Marinas wrote: > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > BBM(break-before-make) logic when changing page tables. > > > This set of patches fix this by adding necessary BBM sequence when > > > changing page table, and supporting vmemmap page fault handling to > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > I'm not keen on this approach. I'm not even sure it's safe. In the > > second patch, you take the init_mm.page_table_lock on the fault path but > > are we sure this is unlocked when the fault was taken? > I think this situation is impossible. In the implementation of the second > patch, when the page table is being corrupted > (the time window when a page fault may occur), vmemmap_update_pte() already > holds the init_mm.page_table_lock, > and unlock it until page table update is done.Another thread could not hold > the init_mm.page_table_lock and > also trigger a page fault at the same time. > If I have missed any points in my thinking, please correct me. Thank you. It still strikes me as incredibly fragile to handle the fault and trying to reason about all the users of 'struct page' is impossible. For example, can the fault happen from irq context? If we want to optimise the vmemmap mapping for arm64, I think we need to consider approaches which avoid the possibility of the fault altogether. It's more complicated to implement, but I think it would be a lot more robust. Andrew -- please can you drop these from -next? Thanks, Will
On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > BBM(break-before-make) logic when changing page tables. > > > > This set of patches fix this by adding necessary BBM sequence when > > > > changing page table, and supporting vmemmap page fault handling to > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > are we sure this is unlocked when the fault was taken? > > I think this situation is impossible. In the implementation of the second > > patch, when the page table is being corrupted > > (the time window when a page fault may occur), vmemmap_update_pte() already > > holds the init_mm.page_table_lock, > > and unlock it until page table update is done.Another thread could not hold > > the init_mm.page_table_lock and > > also trigger a page fault at the same time. > > If I have missed any points in my thinking, please correct me. Thank you. > > It still strikes me as incredibly fragile to handle the fault and trying > to reason about all the users of 'struct page' is impossible. For example, > can the fault happen from irq context? The pte lock cannot be taken in irq context (which I think is what you're asking?) While it is not possible to reason about all users of struct page, we are somewhat relieved of that work by noting that this is only for hugetlbfs, so we don't need to reason about slab, page tables, netmem or zsmalloc. > If we want to optimise the vmemmap mapping for arm64, I think we need to > consider approaches which avoid the possibility of the fault altogether. > It's more complicated to implement, but I think it would be a lot more > robust. > > Andrew -- please can you drop these from -next? > > Thanks, > > Will
On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > BBM(break-before-make) logic when changing page tables. > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > > are we sure this is unlocked when the fault was taken? > > > I think this situation is impossible. In the implementation of the second > > > patch, when the page table is being corrupted > > > (the time window when a page fault may occur), vmemmap_update_pte() already > > > holds the init_mm.page_table_lock, > > > and unlock it until page table update is done.Another thread could not hold > > > the init_mm.page_table_lock and > > > also trigger a page fault at the same time. > > > If I have missed any points in my thinking, please correct me. Thank you. > > > > It still strikes me as incredibly fragile to handle the fault and trying > > to reason about all the users of 'struct page' is impossible. For example, > > can the fault happen from irq context? > > The pte lock cannot be taken in irq context (which I think is what > you're asking?) While it is not possible to reason about all users of > struct page, we are somewhat relieved of that work by noting that this is > only for hugetlbfs, so we don't need to reason about slab, page tables, > netmem or zsmalloc. My concern is that an interrupt handler tries to access a 'struct page' which faults due to another core splitting a pmd mapping for the vmemmap. In this case, I think we'll end up trying to resolve the fault from irq context, which will try to take the spinlock. Avoiding the fault would make this considerably more robust and the architecture has introduced features to avoid break-before-make in some circumstances (see FEAT_BBM and its levels), so having this optimisation conditional on that would seem to be a better approach in my opinion. Will
On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > BBM(break-before-make) logic when changing page tables. > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > > are we sure this is unlocked when the fault was taken? > > > I think this situation is impossible. In the implementation of the second > > > patch, when the page table is being corrupted > > > (the time window when a page fault may occur), vmemmap_update_pte() already > > > holds the init_mm.page_table_lock, > > > and unlock it until page table update is done.Another thread could not hold > > > the init_mm.page_table_lock and > > > also trigger a page fault at the same time. > > > If I have missed any points in my thinking, please correct me. Thank you. > > > > It still strikes me as incredibly fragile to handle the fault and trying > > to reason about all the users of 'struct page' is impossible. For example, > > can the fault happen from irq context? > > The pte lock cannot be taken in irq context (which I think is what > you're asking?) With this patchset, I think it can: IRQ -> interrupt handler accesses vmemmap -> faults -> fault handler in patch 2 takes the init_mm.page_table_lock to wait for the vmemmap rewriting to complete. Maybe it works if the hugetlb code disabled the IRQs but, as Will said, such fault in any kernel context looks fragile.
On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote: > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > > BBM(break-before-make) logic when changing page tables. > > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > > > are we sure this is unlocked when the fault was taken? > > > > I think this situation is impossible. In the implementation of the second > > > > patch, when the page table is being corrupted > > > > (the time window when a page fault may occur), vmemmap_update_pte() already > > > > holds the init_mm.page_table_lock, > > > > and unlock it until page table update is done.Another thread could not hold > > > > the init_mm.page_table_lock and > > > > also trigger a page fault at the same time. > > > > If I have missed any points in my thinking, please correct me. Thank you. > > > > > > It still strikes me as incredibly fragile to handle the fault and trying > > > to reason about all the users of 'struct page' is impossible. For example, > > > can the fault happen from irq context? > > > > The pte lock cannot be taken in irq context (which I think is what > > you're asking?) While it is not possible to reason about all users of > > struct page, we are somewhat relieved of that work by noting that this is > > only for hugetlbfs, so we don't need to reason about slab, page tables, > > netmem or zsmalloc. > > My concern is that an interrupt handler tries to access a 'struct page' > which faults due to another core splitting a pmd mapping for the vmemmap. > In this case, I think we'll end up trying to resolve the fault from irq > context, which will try to take the spinlock. I think that (as per my comments on patch 2), a similar deadlock can happen on RT even if the vmemmap is only accessed in regular process context, and at minimum this needs better comentary and/or lockdep assertions. I'd also prefer that we dropped this for now. > Avoiding the fault would make this considerably more robust and the > architecture has introduced features to avoid break-before-make in some > circumstances (see FEAT_BBM and its levels), so having this optimisation > conditional on that would seem to be a better approach in my opinion. FWIW, that's my position too. Mark.
On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > On 2024/1/26 2:06, Catalin Marinas wrote: > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > BBM(break-before-make) logic when changing page tables. > > > This set of patches fix this by adding necessary BBM sequence when > > > changing page table, and supporting vmemmap page fault handling to > > > fixup kernel address translation fault if vmemmap is concurrently accessed. [...] > > How often is this code path called? I wonder whether a stop_machine() > > approach would be simpler. > As long as allocating or releasing hugetlb is called. We cannot limit users > to only allocate or release hugetlb > when booting or not running any workload on all other cpus, so if use > stop_machine(), it will be triggered > 8 times every 2M and 4096 times every 1G, which is probably too expensive. I'm hoping this can be batched somehow and not do a stop_machine() (or 8) for every 2MB huge page. Just to make sure I understand - is the goal to be able to free struct pages corresponding to hugetlbfs pages? Can we not leave the vmemmap in place and just release that memory to the page allocator? The physical RAM for those struct pages isn't going anywhere, we just have a vmemmap alias to it (cacheable).
On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote: > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > > The pte lock cannot be taken in irq context (which I think is what > > you're asking?) While it is not possible to reason about all users of > > struct page, we are somewhat relieved of that work by noting that this is > > only for hugetlbfs, so we don't need to reason about slab, page tables, > > netmem or zsmalloc. > > My concern is that an interrupt handler tries to access a 'struct page' > which faults due to another core splitting a pmd mapping for the vmemmap. > In this case, I think we'll end up trying to resolve the fault from irq > context, which will try to take the spinlock. Yes, this absolutely can happen (with this patch), and this patch should be dropped for now. While this array of ~512 pages have been allocated to hugetlbfs, and one would think that there would be no way that there could still be references to them, another CPU can have a pointer to this struct page (eg attempting a speculative page cache reference or get_user_pages_fast()). That means it will try to call atomic_add_unless(&page->_refcount, 1, 0); Actually, I wonder if this isn't a problem on x86 too? Do we need to explicitly go through an RCU grace period before freeing the pages for use by somebody else?
On 2/7/2024 6:17 AM, Matthew Wilcox wrote: > On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote: >> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: >>> The pte lock cannot be taken in irq context (which I think is what >>> you're asking?) While it is not possible to reason about all users of >>> struct page, we are somewhat relieved of that work by noting that this is >>> only for hugetlbfs, so we don't need to reason about slab, page tables, >>> netmem or zsmalloc. >> My concern is that an interrupt handler tries to access a 'struct page' >> which faults due to another core splitting a pmd mapping for the vmemmap. >> In this case, I think we'll end up trying to resolve the fault from irq >> context, which will try to take the spinlock. > Yes, this absolutely can happen (with this patch), and this patch should > be dropped for now. > > While this array of ~512 pages have been allocated to hugetlbfs, and one > would think that there would be no way that there could still be > references to them, another CPU can have a pointer to this struct page > (eg attempting a speculative page cache reference or > get_user_pages_fast()). That means it will try to call > atomic_add_unless(&page->_refcount, 1, 0); > > Actually, I wonder if this isn't a problem on x86 too? Do we need to > explicitly go through an RCU grace period before freeing the pages > for use by somebody else? > Sorry, not sure what I'm missing, please help. From hugetlb allocation perspective, one of the scenarios is run time hugetlb page allocation (say 2M pages), starting from the buddy allocator returns compound pages, then the head page is set to frozen, then the folio(compound pages) is put thru the HVO process, one of which is vmemmap_split_pmd() in case a vmemmap page is a PMD page. Until the HVO process completes, none of the vmemmap represented pages are available to any threads, so what are the causes for IRQ threads to access their vmemmap pages? thanks! -jane
在 2024/2/7 20:20, Catalin Marinas 写道: > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: >> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: >>> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: >>>> On 2024/1/26 2:06, Catalin Marinas wrote: >>>>> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: >>>>>> HVO was previously disabled on arm64 [1] due to the lack of necessary >>>>>> BBM(break-before-make) logic when changing page tables. >>>>>> This set of patches fix this by adding necessary BBM sequence when >>>>>> changing page table, and supporting vmemmap page fault handling to >>>>>> fixup kernel address translation fault if vmemmap is concurrently accessed. >>>>> I'm not keen on this approach. I'm not even sure it's safe. In the >>>>> second patch, you take the init_mm.page_table_lock on the fault path but >>>>> are we sure this is unlocked when the fault was taken? >>>> I think this situation is impossible. In the implementation of the second >>>> patch, when the page table is being corrupted >>>> (the time window when a page fault may occur), vmemmap_update_pte() already >>>> holds the init_mm.page_table_lock, >>>> and unlock it until page table update is done.Another thread could not hold >>>> the init_mm.page_table_lock and >>>> also trigger a page fault at the same time. >>>> If I have missed any points in my thinking, please correct me. Thank you. >>> It still strikes me as incredibly fragile to handle the fault and trying >>> to reason about all the users of 'struct page' is impossible. For example, >>> can the fault happen from irq context? >> The pte lock cannot be taken in irq context (which I think is what >> you're asking?) > With this patchset, I think it can: IRQ -> interrupt handler accesses > vmemmap -> faults -> fault handler in patch 2 takes the > init_mm.page_table_lock to wait for the vmemmap rewriting to complete. > Maybe it works if the hugetlb code disabled the IRQs but, as Will said, > such fault in any kernel context looks fragile. How about take a new lock with irq disabled during BBM, like: +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) +{ + spin_lock_irq(NEW_LOCK); + pte_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + set_pte_at(&init_mm, addr, ptep, pte); + spin_unlock_irq(NEW_LOCK); +}
On Thu, Feb 08, 2024 at 05:44:48PM +0800, Nanyong Sun wrote: > > 在 2024/2/7 20:20, Catalin Marinas 写道: > > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > > > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > > > BBM(break-before-make) logic when changing page tables. > > > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > > > > are we sure this is unlocked when the fault was taken? > > > > > I think this situation is impossible. In the implementation of the second > > > > > patch, when the page table is being corrupted > > > > > (the time window when a page fault may occur), vmemmap_update_pte() already > > > > > holds the init_mm.page_table_lock, > > > > > and unlock it until page table update is done.Another thread could not hold > > > > > the init_mm.page_table_lock and > > > > > also trigger a page fault at the same time. > > > > > If I have missed any points in my thinking, please correct me. Thank you. > > > > It still strikes me as incredibly fragile to handle the fault and trying > > > > to reason about all the users of 'struct page' is impossible. For example, > > > > can the fault happen from irq context? > > > The pte lock cannot be taken in irq context (which I think is what > > > you're asking?) > > With this patchset, I think it can: IRQ -> interrupt handler accesses > > vmemmap -> faults -> fault handler in patch 2 takes the > > init_mm.page_table_lock to wait for the vmemmap rewriting to complete. > > Maybe it works if the hugetlb code disabled the IRQs but, as Will said, > > such fault in any kernel context looks fragile. > How about take a new lock with irq disabled during BBM, like: > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > +{ > + spin_lock_irq(NEW_LOCK); > + pte_clear(&init_mm, addr, ptep); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + set_pte_at(&init_mm, addr, ptep, pte); > + spin_unlock_irq(NEW_LOCK); > +} I really think the only maintainable way to achieve this is to avoid the possibility of a fault altogether. Will
On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote: > On 2/7/2024 6:17 AM, Matthew Wilcox wrote: > > While this array of ~512 pages have been allocated to hugetlbfs, and one > > would think that there would be no way that there could still be > > references to them, another CPU can have a pointer to this struct page > > (eg attempting a speculative page cache reference or > > get_user_pages_fast()). That means it will try to call > > atomic_add_unless(&page->_refcount, 1, 0); > > > > Actually, I wonder if this isn't a problem on x86 too? Do we need to > > explicitly go through an RCU grace period before freeing the pages > > for use by somebody else? > > > Sorry, not sure what I'm missing, please help. Having written out the analysis, I now think it can't happen on x86, but let's walk through it because it's non-obvious (and I think it illustrates what people are afraid of on Arm). CPU A calls either get_user_pages_fast() or __filemap_get_folio(). Let's do the latter this time. folio = filemap_get_entry(mapping, index); filemap_get_entry: rcu_read_lock(); folio = xas_load(&xas); if (!folio_try_get_rcu(folio)) goto repeat; if (unlikely(folio != xas_reload(&xas))) { folio_put(folio); goto repeat; } folio_try_get_rcu: folio_ref_try_add_rcu(folio, 1); folio_ref_try_add_rcu: if (unlikely(!folio_ref_add_unless(folio, count, 0))) { /* Either the folio has been freed, or will be freed. */ return false; folio_ref_add_unless: return page_ref_add_unless(&folio->page, nr, u); page_ref_add_unless: atomic_add_unless(&page->_refcount, nr, u); A rather deep callchain there, but for our purposes the important part is: we take the RCU read lock, we look up a folio, we increment its refcount if it's not zero, then check that looking up this index gets the same folio; if it doesn't, we decrement the refcount again and retry the lookup. For this analysis, we can be preempted at any point after we've got the folio pointer from xa_load(). > From hugetlb allocation perspective, one of the scenarios is run time > hugetlb page allocation (say 2M pages), starting from the buddy allocator > returns compound pages, then the head page is set to frozen, then the > folio(compound pages) is put thru the HVO process, one of which is > vmemmap_split_pmd() in case a vmemmap page is a PMD page. > > Until the HVO process completes, none of the vmemmap represented pages are > available to any threads, so what are the causes for IRQ threads to access > their vmemmap pages? Yup, this sounds like enough, but it's not. The problem is the person who's looking up the folio in the pagecache under RCU. They've got the folio pointer and have been preempted. So now what happens to our victim folio? Something happens to remove it from the page cache. Maybe the file is truncated, perhaps vmscan comes along and kicks it out. Either way, it's removed from the xarray and gets its refcount set to 0. If the lookup were to continue at this time, everything would be fine because it would see a refcount of 0 and not increment it (in page_ref_add_unless()). And this is where my analysis of RCU tends to go wrong, because I only think of interleaving event A and B. I don't think about B and then C happening before A resumes. But it can! Let's follow the journey of this struct page. Now that it's been removed from the page cache, it's allocated by hugetlb, as you describe. And it's one of the tail pages towards the end of the 512 contiguous struct pages. That means that we alter vmemmap so that the pointer to struct page now points to a different struct page (one of the earlier ones). Then the original page of vmemmap containing our lucky struct page is returned to the page allocator. At this point, it no longer contains struct pages; it can contain literally anything. Where my analysis went wrong was that CPU A _no longer has a pointer to it_. CPU A has a pointer into vmemmap. So it will access the replacement struct page (which definitely has a refcount 0) instead of the one which has been freed. I had thought that CPU A would access the original memory which has now been allocated to someone else. But no, it can't because its pointer is virtual, not physical. --- Now I'm thinking more about this and there's another scenario which I thought might go wrong, and doesn't. For 7 of the 512 pages which are freed, the struct page pointer gathered by CPU A will not point to a page with a refcount of 0. Instead it will point to an alias of the head page with a positive refcount. For those pages, CPU A will see folio_try_get_rcu() succeed. Then it will call xas_reload() and see the folio isn't there any more, so it will call folio_put() on something which used to be a folio, and isn't any more. But folio_put() calls folio_put_testzero() which calls put_page_testzero() without asserting that the pointer is actually to a folio. So everything's fine, but really only by coincidence; I don't think anybody's thought about this scenario before (maybe Muchun has, but I don't remember it being discussed).
On 2/8/2024 7:49 AM, Matthew Wilcox wrote: > On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote: >> On 2/7/2024 6:17 AM, Matthew Wilcox wrote: >>> While this array of ~512 pages have been allocated to hugetlbfs, and one >>> would think that there would be no way that there could still be >>> references to them, another CPU can have a pointer to this struct page >>> (eg attempting a speculative page cache reference or >>> get_user_pages_fast()). That means it will try to call >>> atomic_add_unless(&page->_refcount, 1, 0); >>> >>> Actually, I wonder if this isn't a problem on x86 too? Do we need to >>> explicitly go through an RCU grace period before freeing the pages >>> for use by somebody else? >>> >> Sorry, not sure what I'm missing, please help. > Having written out the analysis, I now think it can't happen on x86, > but let's walk through it because it's non-obvious (and I think it > illustrates what people are afraid of on Arm). > > CPU A calls either get_user_pages_fast() or __filemap_get_folio(). > Let's do the latter this time. > > folio = filemap_get_entry(mapping, index); > filemap_get_entry: > rcu_read_lock(); > folio = xas_load(&xas); > if (!folio_try_get_rcu(folio)) > goto repeat; > if (unlikely(folio != xas_reload(&xas))) { > folio_put(folio); > goto repeat; > } > folio_try_get_rcu: > folio_ref_try_add_rcu(folio, 1); > folio_ref_try_add_rcu: > if (unlikely(!folio_ref_add_unless(folio, count, 0))) { > /* Either the folio has been freed, or will be freed. */ > return false; > folio_ref_add_unless: > return page_ref_add_unless(&folio->page, nr, u); > page_ref_add_unless: > atomic_add_unless(&page->_refcount, nr, u); > > A rather deep callchain there, but for our purposes the important part > is: we take the RCU read lock, we look up a folio, we increment its > refcount if it's not zero, then check that looking up this index gets > the same folio; if it doesn't, we decrement the refcount again and retry > the lookup. > > For this analysis, we can be preempted at any point after we've got the > folio pointer from xa_load(). > >> From hugetlb allocation perspective, one of the scenarios is run time >> hugetlb page allocation (say 2M pages), starting from the buddy allocator >> returns compound pages, then the head page is set to frozen, then the >> folio(compound pages) is put thru the HVO process, one of which is >> vmemmap_split_pmd() in case a vmemmap page is a PMD page. >> >> Until the HVO process completes, none of the vmemmap represented pages are >> available to any threads, so what are the causes for IRQ threads to access >> their vmemmap pages? > Yup, this sounds like enough, but it's not. The problem is the person > who's looking up the folio in the pagecache under RCU. They've got > the folio pointer and have been preempted. So now what happens to our > victim folio? > > Something happens to remove it from the page cache. Maybe the file is > truncated, perhaps vmscan comes along and kicks it out. Either way, it's > removed from the xarray and gets its refcount set to 0. If the lookup > were to continue at this time, everything would be fine because it would > see a refcount of 0 and not increment it (in page_ref_add_unless()). > And this is where my analysis of RCU tends to go wrong, because I only > think of interleaving event A and B. I don't think about B and then C > happening before A resumes. But it can! Let's follow the journey of > this struct page. > > Now that it's been removed from the page cache, it's allocated by hugetlb, > as you describe. And it's one of the tail pages towards the end of > the 512 contiguous struct pages. That means that we alter vmemmap so > that the pointer to struct page now points to a different struct page > (one of the earlier ones). Then the original page of vmemmap containing > our lucky struct page is returned to the page allocator. At this point, > it no longer contains struct pages; it can contain literally anything. > > Where my analysis went wrong was that CPU A _no longer has a pointer > to it_. CPU A has a pointer into vmemmap. So it will access the > replacement struct page (which definitely has a refcount 0) instead of > the one which has been freed. I had thought that CPU A would access the > original memory which has now been allocated to someone else. But no, > it can't because its pointer is virtual, not physical. > > > --- > > Now I'm thinking more about this and there's another scenario which I > thought might go wrong, and doesn't. For 7 of the 512 pages which are > freed, the struct page pointer gathered by CPU A will not point to a > page with a refcount of 0. Instead it will point to an alias of the > head page with a positive refcount. For those pages, CPU A will see > folio_try_get_rcu() succeed. Then it will call xas_reload() and see > the folio isn't there any more, so it will call folio_put() on something > which used to be a folio, and isn't any more. > > But folio_put() calls folio_put_testzero() which calls put_page_testzero() > without asserting that the pointer is actually to a folio. > So everything's fine, but really only by coincidence; I don't think > anybody's thought about this scenario before (maybe Muchun has, but I > don't remember it being discussed). Wow! Marvelous analysis, thank you! So is the solution simple as making folio_put_testzero() to check whether the folio pointer actually points to a folio? or there is more to consider? Thanks a lot! -jane
> On Feb 8, 2024, at 23:49, Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote: >> On 2/7/2024 6:17 AM, Matthew Wilcox wrote: >>> While this array of ~512 pages have been allocated to hugetlbfs, and one >>> would think that there would be no way that there could still be >>> references to them, another CPU can have a pointer to this struct page >>> (eg attempting a speculative page cache reference or >>> get_user_pages_fast()). That means it will try to call >>> atomic_add_unless(&page->_refcount, 1, 0); >>> >>> Actually, I wonder if this isn't a problem on x86 too? Do we need to >>> explicitly go through an RCU grace period before freeing the pages >>> for use by somebody else? >>> >> Sorry, not sure what I'm missing, please help. > > Having written out the analysis, I now think it can't happen on x86, > but let's walk through it because it's non-obvious (and I think it > illustrates what people are afraid of on Arm). > > CPU A calls either get_user_pages_fast() or __filemap_get_folio(). > Let's do the latter this time. > > folio = filemap_get_entry(mapping, index); > filemap_get_entry: > rcu_read_lock(); > folio = xas_load(&xas); > if (!folio_try_get_rcu(folio)) > goto repeat; > if (unlikely(folio != xas_reload(&xas))) { > folio_put(folio); > goto repeat; > } > folio_try_get_rcu: > folio_ref_try_add_rcu(folio, 1); > folio_ref_try_add_rcu: > if (unlikely(!folio_ref_add_unless(folio, count, 0))) { > /* Either the folio has been freed, or will be freed. */ > return false; > folio_ref_add_unless: > return page_ref_add_unless(&folio->page, nr, u); > page_ref_add_unless: > atomic_add_unless(&page->_refcount, nr, u); > > A rather deep callchain there, but for our purposes the important part > is: we take the RCU read lock, we look up a folio, we increment its > refcount if it's not zero, then check that looking up this index gets > the same folio; if it doesn't, we decrement the refcount again and retry > the lookup. > > For this analysis, we can be preempted at any point after we've got the > folio pointer from xa_load(). > >> From hugetlb allocation perspective, one of the scenarios is run time >> hugetlb page allocation (say 2M pages), starting from the buddy allocator >> returns compound pages, then the head page is set to frozen, then the >> folio(compound pages) is put thru the HVO process, one of which is >> vmemmap_split_pmd() in case a vmemmap page is a PMD page. >> >> Until the HVO process completes, none of the vmemmap represented pages are >> available to any threads, so what are the causes for IRQ threads to access >> their vmemmap pages? > > Yup, this sounds like enough, but it's not. The problem is the person > who's looking up the folio in the pagecache under RCU. They've got > the folio pointer and have been preempted. So now what happens to our > victim folio? > > Something happens to remove it from the page cache. Maybe the file is > truncated, perhaps vmscan comes along and kicks it out. Either way, it's > removed from the xarray and gets its refcount set to 0. If the lookup > were to continue at this time, everything would be fine because it would > see a refcount of 0 and not increment it (in page_ref_add_unless()). > And this is where my analysis of RCU tends to go wrong, because I only > think of interleaving event A and B. I don't think about B and then C > happening before A resumes. But it can! Let's follow the journey of > this struct page. > > Now that it's been removed from the page cache, it's allocated by hugetlb, > as you describe. And it's one of the tail pages towards the end of > the 512 contiguous struct pages. That means that we alter vmemmap so > that the pointer to struct page now points to a different struct page > (one of the earlier ones). Then the original page of vmemmap containing > our lucky struct page is returned to the page allocator. At this point, > it no longer contains struct pages; it can contain literally anything. > > Where my analysis went wrong was that CPU A _no longer has a pointer > to it_. CPU A has a pointer into vmemmap. So it will access the > replacement struct page (which definitely has a refcount 0) instead of > the one which has been freed. I had thought that CPU A would access the > original memory which has now been allocated to someone else. But no, > it can't because its pointer is virtual, not physical. > > > --- > > Now I'm thinking more about this and there's another scenario which I > thought might go wrong, and doesn't. For 7 of the 512 pages which are > freed, the struct page pointer gathered by CPU A will not point to a > page with a refcount of 0. Instead it will point to an alias of the > head page with a positive refcount. For those pages, CPU A will see > folio_try_get_rcu() succeed. Then it will call xas_reload() and see > the folio isn't there any more, so it will call folio_put() on something > which used to be a folio, and isn't any more. > > But folio_put() calls folio_put_testzero() which calls put_page_testzero() > without asserting that the pointer is actually to a folio. > So everything's fine, but really only by coincidence; I don't think > anybody's thought about this scenario before (maybe Muchun has, but I > don't remember it being discussed). I have to say it is a really great analysis, I haven't thought about the case of get_page_unless_zero() so deeply. To avoid increasing a refcount to a tail page struct, I have made all the 7 tail pages read-only when I first write those code. But it is a really problem, because it will panic (due to RO permission) when encountering the above scenario to increase its refcount. In order to fix the race with __filemap_get_folio(), my first thought of fixing this issue is to add a rcu_synchronize() after the processing of HVO optimization and before being allocated to users. Note that HugePage pages are frozen before going through the precessing of HVO optimization meaning all the refcount of all the struct pages are 0. Therefore, folio_try_get_rcu() in __filemap_get_folio() will fail unless the HugeTLB page has been allocated to the user. But I realized there are some users who may pass a arbitrary page struct (which may be those 7 special tail page structs, alias of the head page struct, of a HugeTLB page) to the following helpers, which also could get a refcount of a tail page struct. Those helpers also need to be fixed. 1) get_page_unless_zero 2) folio_try_get 3) folio_try_get_rcu I have checked all the users of 1), If I am not wrong, all the users already handle the HugeTLB pages before calling to get_page_unless_zero(). Although there is no problem with 1) now, it will be fragile to let users guarantee that it will not pass any tail pages of a HugeTLB page to 1). So I want to change 1) to the following to fix this. static inline bool get_page_unless_zero(struct page *page) { if (page_ref_add_unless(page, 1, 0)) { /* @page must be a genuine head or alias head page here. */ struct page *head = page_fixed_fake_head(page); if (likely(head == page)) return true; put_page(head); } return false; } 2) and 3) should adopt the similar approach to make sure we cannot increase tail pages' refcount. 2) and 3) will be like the following (only demonstrate the key logic): static inline bool folio_try_get(struct folio *folio)/folio_ref_try_add_rcu { if (folio_ref_add_unless(folio, 1, 0)) { struct folio *genuine = page_folio(&folio->page); if (likely(genuine == folio)) return true; folio_put(genuine); } return false; } Additionally, we also should alter RO permission of those 7 tail pages to RW to avoid panic(). There is no problem in the following helpers since all of them already handle HVO case through _compound_head(), they will get the __genuine__ head page struct and increase its refcount. 1) try_get_page 2) folio_get 3) get_page Just some thoughts from mine, maybe you guys have more simple and graceful approaches. Comments are welcome. Muchun, Thanks.
On Thu, 8 Feb 2024, Will Deacon wrote: > > How about take a new lock with irq disabled during BBM, like: > > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > > +{ > > + spin_lock_irq(NEW_LOCK); > > + pte_clear(&init_mm, addr, ptep); > > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > > + set_pte_at(&init_mm, addr, ptep, pte); > > + spin_unlock_irq(NEW_LOCK); > > +} > > I really think the only maintainable way to achieve this is to avoid the > possibility of a fault altogether. > > Will > > Nanyong, are you still actively working on making HVO possible on arm64? This would yield a substantial memory savings on hosts that are largely configured with hugetlbfs. In our case, the size of this hugetlbfs pool is actually never changed after boot, but it sounds from the thread that there was an idea to make HVO conditional on FEAT_BBM. Is this being pursued? If so, any testing help needed?
On 2024/3/14 7:32, David Rientjes wrote: > On Thu, 8 Feb 2024, Will Deacon wrote: > >>> How about take a new lock with irq disabled during BBM, like: >>> >>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) >>> +{ >>> + (NEW_LOCK); >>> + pte_clear(&init_mm, addr, ptep); >>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>> + set_pte_at(&init_mm, addr, ptep, pte); >>> + spin_unlock_irq(NEW_LOCK); >>> +} >> I really think the only maintainable way to achieve this is to avoid the >> possibility of a fault altogether. >> >> Will >> >> > Nanyong, are you still actively working on making HVO possible on arm64? > > This would yield a substantial memory savings on hosts that are largely > configured with hugetlbfs. In our case, the size of this hugetlbfs pool > is actually never changed after boot, but it sounds from the thread that > there was an idea to make HVO conditional on FEAT_BBM. Is this being > pursued? > > If so, any testing help needed? I'm afraid that FEAT_BBM may not solve the problem here, because from Arm ARM, I see that FEAT_BBM is only used for changing block size. Therefore, in this HVO feature, it can work in the split PMD stage, that is, BBM can be avoided in vmemmap_split_pmd, but in the subsequent vmemmap_remap_pte, the Output address of PTE still needs to be changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my understanding of ARM FEAT_BBM is wrong, and I hope someone can correct me. Actually, the solution I first considered was to use the stop_machine method, but we have products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically use hugepages, so I have to consider performance issues. If your product does not change the amount of huge pages after booting, using stop_machine() may be a feasible way. So far, I still haven't come up with a good solution.
On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: > On 2024/3/14 7:32, David Rientjes wrote: > > On Thu, 8 Feb 2024, Will Deacon wrote: > > > > How about take a new lock with irq disabled during BBM, like: > > > > > > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > > > > +{ > > > > + (NEW_LOCK); > > > > + pte_clear(&init_mm, addr, ptep); > > > > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > > > > + set_pte_at(&init_mm, addr, ptep, pte); > > > > + spin_unlock_irq(NEW_LOCK); > > > > +} > > > I really think the only maintainable way to achieve this is to avoid the > > > possibility of a fault altogether. > > > > > Nanyong, are you still actively working on making HVO possible on arm64? > > > > This would yield a substantial memory savings on hosts that are largely > > configured with hugetlbfs. In our case, the size of this hugetlbfs pool > > is actually never changed after boot, but it sounds from the thread that > > there was an idea to make HVO conditional on FEAT_BBM. Is this being > > pursued? > > > > If so, any testing help needed? > I'm afraid that FEAT_BBM may not solve the problem here, because from Arm > ARM, > I see that FEAT_BBM is only used for changing block size. Therefore, in this > HVO feature, > it can work in the split PMD stage, that is, BBM can be avoided in > vmemmap_split_pmd, > but in the subsequent vmemmap_remap_pte, the Output address of PTE still > needs to be > changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my > understanding > of ARM FEAT_BBM is wrong, and I hope someone can correct me. > Actually, the solution I first considered was to use the stop_machine > method, but we have > products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically > use hugepages, > so I have to consider performance issues. If your product does not change > the amount of huge > pages after booting, using stop_machine() may be a feasible way. > So far, I still haven't come up with a good solution. Oh, I hadn't appreciated that you needed to remap the memmap live. How do you synchronise the two copies in that case? I think we (i.e. the arch folks) probably need some more explanation on exactly who can race with what here, otherwise I don't grok how this can work. Thanks, Will
On Sun, Feb 11, 2024 at 5:00 AM Muchun Song <muchun.song@linux.dev> wrote: > > > On Feb 8, 2024, at 23:49, Matthew Wilcox <willy@infradead.org> wrote: > > > > On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote: > >> On 2/7/2024 6:17 AM, Matthew Wilcox wrote: > >>> While this array of ~512 pages have been allocated to hugetlbfs, and one > >>> would think that there would be no way that there could still be > >>> references to them, another CPU can have a pointer to this struct page > >>> (eg attempting a speculative page cache reference or > >>> get_user_pages_fast()). That means it will try to call > >>> atomic_add_unless(&page->_refcount, 1, 0); > >>> > >>> Actually, I wonder if this isn't a problem on x86 too? Do we need to > >>> explicitly go through an RCU grace period before freeing the pages > >>> for use by somebody else? > >>> > >> Sorry, not sure what I'm missing, please help. > > > > Having written out the analysis, I now think it can't happen on x86, > > but let's walk through it because it's non-obvious (and I think it > > illustrates what people are afraid of on Arm). > > > > CPU A calls either get_user_pages_fast() or __filemap_get_folio(). > > Let's do the latter this time. > > > > folio = filemap_get_entry(mapping, index); > > filemap_get_entry: > > rcu_read_lock(); > > folio = xas_load(&xas); > > if (!folio_try_get_rcu(folio)) > > goto repeat; > > if (unlikely(folio != xas_reload(&xas))) { > > folio_put(folio); > > goto repeat; > > } > > folio_try_get_rcu: > > folio_ref_try_add_rcu(folio, 1); > > folio_ref_try_add_rcu: > > if (unlikely(!folio_ref_add_unless(folio, count, 0))) { > > /* Either the folio has been freed, or will be freed. */ > > return false; > > folio_ref_add_unless: > > return page_ref_add_unless(&folio->page, nr, u); > > page_ref_add_unless: > > atomic_add_unless(&page->_refcount, nr, u); > > > > A rather deep callchain there, but for our purposes the important part > > is: we take the RCU read lock, we look up a folio, we increment its > > refcount if it's not zero, then check that looking up this index gets > > the same folio; if it doesn't, we decrement the refcount again and retry > > the lookup. > > > > For this analysis, we can be preempted at any point after we've got the > > folio pointer from xa_load(). > > > >> From hugetlb allocation perspective, one of the scenarios is run time > >> hugetlb page allocation (say 2M pages), starting from the buddy allocator > >> returns compound pages, then the head page is set to frozen, then the > >> folio(compound pages) is put thru the HVO process, one of which is > >> vmemmap_split_pmd() in case a vmemmap page is a PMD page. > >> > >> Until the HVO process completes, none of the vmemmap represented pages are > >> available to any threads, so what are the causes for IRQ threads to access > >> their vmemmap pages? > > > > Yup, this sounds like enough, but it's not. The problem is the person > > who's looking up the folio in the pagecache under RCU. They've got > > the folio pointer and have been preempted. So now what happens to our > > victim folio? > > > > Something happens to remove it from the page cache. Maybe the file is > > truncated, perhaps vmscan comes along and kicks it out. Either way, it's > > removed from the xarray and gets its refcount set to 0. If the lookup > > were to continue at this time, everything would be fine because it would > > see a refcount of 0 and not increment it (in page_ref_add_unless()). > > And this is where my analysis of RCU tends to go wrong, because I only > > think of interleaving event A and B. I don't think about B and then C > > happening before A resumes. But it can! Let's follow the journey of > > this struct page. > > > > Now that it's been removed from the page cache, it's allocated by hugetlb, > > as you describe. And it's one of the tail pages towards the end of > > the 512 contiguous struct pages. That means that we alter vmemmap so > > that the pointer to struct page now points to a different struct page > > (one of the earlier ones). Then the original page of vmemmap containing > > our lucky struct page is returned to the page allocator. At this point, > > it no longer contains struct pages; it can contain literally anything. > > > > Where my analysis went wrong was that CPU A _no longer has a pointer > > to it_. CPU A has a pointer into vmemmap. So it will access the > > replacement struct page (which definitely has a refcount 0) instead of > > the one which has been freed. I had thought that CPU A would access the > > original memory which has now been allocated to someone else. But no, > > it can't because its pointer is virtual, not physical. > > > > > > --- > > > > Now I'm thinking more about this and there's another scenario which I > > thought might go wrong, and doesn't. For 7 of the 512 pages which are > > freed, the struct page pointer gathered by CPU A will not point to a > > page with a refcount of 0. Instead it will point to an alias of the > > head page with a positive refcount. For those pages, CPU A will see > > folio_try_get_rcu() succeed. Then it will call xas_reload() and see > > the folio isn't there any more, so it will call folio_put() on something > > which used to be a folio, and isn't any more. > > > > But folio_put() calls folio_put_testzero() which calls put_page_testzero() > > without asserting that the pointer is actually to a folio. > > So everything's fine, but really only by coincidence; I don't think > > anybody's thought about this scenario before (maybe Muchun has, but I > > don't remember it being discussed). > > I have to say it is a really great analysis, I haven't thought about the > case of get_page_unless_zero() so deeply. > > To avoid increasing a refcount to a tail page struct, I have made > all the 7 tail pages read-only when I first write those code. I think making tail page metadata RO is a good design choice. Details below. > But it > is a really problem, because it will panic (due to RO permission) > when encountering the above scenario to increase its refcount. > > In order to fix the race with __filemap_get_folio(), my first > thought of fixing this issue is to add a rcu_synchronize() after > the processing of HVO optimization and before being allocated to > users. Note that HugePage pages are frozen before going through > the precessing of HVO optimization meaning all the refcount of all > the struct pages are 0. Therefore, folio_try_get_rcu() in > __filemap_get_folio() will fail unless the HugeTLB page has been > allocated to the user. > > But I realized there are some users who may pass a arbitrary > page struct (which may be those 7 special tail page structs, > alias of the head page struct, of a HugeTLB page) to the following > helpers, which also could get a refcount of a tail page struct. > Those helpers also need to be fixed. > > 1) get_page_unless_zero > 2) folio_try_get > 3) folio_try_get_rcu > > I have checked all the users of 1), If I am not wrong, all the users > already handle the HugeTLB pages before calling to get_page_unless_zero(). > Although there is no problem with 1) now, it will be fragile to let users > guarantee that it will not pass any tail pages of a HugeTLB page to > 1). So I want to change 1) to the following to fix this. > > static inline bool get_page_unless_zero(struct page *page) > { > if (page_ref_add_unless(page, 1, 0)) { > /* @page must be a genuine head or alias head page here. */ > struct page *head = page_fixed_fake_head(page); > > if (likely(head == page)) > return true; > put_page(head); > } > > return false; > } > > 2) and 3) should adopt the similar approach to make sure we cannot increase > tail pages' refcount. 2) and 3) will be like the following (only demonstrate > the key logic): > > static inline bool folio_try_get(struct folio *folio)/folio_ref_try_add_rcu > { > if (folio_ref_add_unless(folio, 1, 0)) { > struct folio *genuine = page_folio(&folio->page); > > if (likely(genuine == folio)) > return true; > folio_put(genuine); > } > > return false; > } > > Additionally, we also should alter RO permission of those 7 tail pages > to RW to avoid panic(). We can use RCU, which IMO is a better choice, as the following: get_page_unless_zero() { int rc = false; rcu_read_lock(); if (page_is_fake_head(page) || !page_ref_count(page)) { smp_mb(); // implied by atomic_add_unless() goto unlock; } rc = page_ref_add_unless(); unlock: rcu_read_unlock(); return rc; } And on the HVO/de-HOV sides: folio_ref_unfreeze(); synchronize_rcu(); HVO/de-HVO; I think this is a lot better than making tail page metadata RW because: 1. it helps debug, IMO, a lot; 2. I don't think HVO is the only one that needs this. David (we missed you in today's THP meeting), Please correct me if I'm wrong -- I think virtio-mem also suffers from the same problem when freeing offlined struct page, since I wasn't able to find anything that would prevent a **speculative** struct page walker from trying to access struct pages belonging to pages being concurrently offlined. If this is true, we might want to map a "zero struct page" rather than leave a hole in vmemmap when offlining pages. And the logic on the hot removal side would be similar to that of HVO. > There is no problem in the following helpers since all of them already > handle HVO case through _compound_head(), they will get the __genuine__ > head page struct and increase its refcount. > > 1) try_get_page > 2) folio_get > 3) get_page > > Just some thoughts from mine, maybe you guys have more simple and graceful > approaches. Comments are welcome. > > Muchun, > Thanks. > >
>> Additionally, we also should alter RO permission of those 7 tail pages >> to RW to avoid panic(). > > We can use RCU, which IMO is a better choice, as the following: > > get_page_unless_zero() > { > int rc = false; > > rcu_read_lock(); > > if (page_is_fake_head(page) || !page_ref_count(page)) { > smp_mb(); // implied by atomic_add_unless() > goto unlock; > } > > rc = page_ref_add_unless(); > > unlock: > rcu_read_unlock(); > > return rc; > } > > And on the HVO/de-HOV sides: > > folio_ref_unfreeze(); > synchronize_rcu(); > HVO/de-HVO; > > I think this is a lot better than making tail page metadata RW because: > 1. it helps debug, IMO, a lot; > 2. I don't think HVO is the only one that needs this. > > David (we missed you in today's THP meeting), Sorry, I had a private meeting conflict :) > > Please correct me if I'm wrong -- I think virtio-mem also suffers from > the same problem when freeing offlined struct page, since I wasn't > able to find anything that would prevent a **speculative** struct page > walker from trying to access struct pages belonging to pages being > concurrently offlined. virtio-mem does currently not yet optimize fake-offlined memory like HVO would. So the only way we really remove "struct page" metadata is by actually offlining+removing a complete Linux memory block, like ordinary memory hotunplug would. It might be an interesting project to optimize "struct page" metadata consumption for fake-offlined memory chunks within an online Linux memory block. The biggest challenge might be interaction with memory hotplug, which requires all "struct page" metadata to be allocated. So that would make cases where virtio-mem hot-plugs a Linux memory block but keeps parts of it fake-offline a bit more problematic to handle . In a world with memdesc this might all be nicer to handle I think :) There is one possible interaction between virtio-mem and speculative page references: all fake-offline chunks in a Linux memory block do have on each page a refcount of 1 and PageOffline() set. When actually offlining the Linux memory block to remove it, virtio-mem will drop that reference during MEM_GOING_OFFLINE, such that memory offlining can proceed (seeing refcount==0 and PageOffline()). In virtio_mem_fake_offline_going_offline() we have: if (WARN_ON(!page_ref_dec_and_test(page))) dump_page(page, "fake-offline page referenced"); which would trigger on a speculative reference. We never saw that trigger so far because quite a long time must have passed ever since a page might have been part of the page cache / page tables, before virtio-mem fake-offlined it (using alloc_contig_range()) and the Linux memory block actually gets offlined. But yes, RCU (e.g., on the memory offlining path) would likely be the right approach to make sure GUP-fast and the pagecache will no longer grab this page by accident. > > If this is true, we might want to map a "zero struct page" rather than > leave a hole in vmemmap when offlining pages. And the logic on the hot > removal side would be similar to that of HVO. Once virtio-mem would do something like HVO, yes. Right now virtio-mem only removes struct-page metadata by removing/unplugging its owned Linux memory blocks once they are fully "logically offline".
I had an offline discussion with Yu on this, and he pointed out something I hadn't realized: the x86 cmpxchg instruction always produces a write cycle, even if it doesn't modify the data - it just writes back the original data in that case. So, get_page_unless_zero will always produce a fault on RO mapped page structures on x86. Maybe this was obvious to other people, but I didn't see it explicitly mentioned, so I figured I'd add the datapoint. - Frank On Thu, Jun 6, 2024 at 1:30 AM David Hildenbrand <david@redhat.com> wrote: > > >> Additionally, we also should alter RO permission of those 7 tail pages > >> to RW to avoid panic(). > > > > We can use RCU, which IMO is a better choice, as the following: > > > > get_page_unless_zero() > > { > > int rc = false; > > > > rcu_read_lock(); > > > > if (page_is_fake_head(page) || !page_ref_count(page)) { > > smp_mb(); // implied by atomic_add_unless() > > goto unlock; > > } > > > > rc = page_ref_add_unless(); > > > > unlock: > > rcu_read_unlock(); > > > > return rc; > > } > > > > And on the HVO/de-HOV sides: > > > > folio_ref_unfreeze(); > > synchronize_rcu(); > > HVO/de-HVO; > > > > I think this is a lot better than making tail page metadata RW because: > > 1. it helps debug, IMO, a lot; > > 2. I don't think HVO is the only one that needs this. > > > > David (we missed you in today's THP meeting), > > Sorry, I had a private meeting conflict :) > > > > > Please correct me if I'm wrong -- I think virtio-mem also suffers from > > the same problem when freeing offlined struct page, since I wasn't > > able to find anything that would prevent a **speculative** struct page > > walker from trying to access struct pages belonging to pages being > > concurrently offlined. > > virtio-mem does currently not yet optimize fake-offlined memory like HVO > would. So the only way we really remove "struct page" metadata is by > actually offlining+removing a complete Linux memory block, like ordinary > memory hotunplug would. > > It might be an interesting project to optimize "struct page" metadata > consumption for fake-offlined memory chunks within an online Linux > memory block. > > The biggest challenge might be interaction with memory hotplug, which > requires all "struct page" metadata to be allocated. So that would make > cases where virtio-mem hot-plugs a Linux memory block but keeps parts of > it fake-offline a bit more problematic to handle . > > In a world with memdesc this might all be nicer to handle I think :) > > > There is one possible interaction between virtio-mem and speculative > page references: all fake-offline chunks in a Linux memory block do have > on each page a refcount of 1 and PageOffline() set. When actually > offlining the Linux memory block to remove it, virtio-mem will drop that > reference during MEM_GOING_OFFLINE, such that memory offlining can > proceed (seeing refcount==0 and PageOffline()). > > In virtio_mem_fake_offline_going_offline() we have: > > if (WARN_ON(!page_ref_dec_and_test(page))) > dump_page(page, "fake-offline page referenced"); > > which would trigger on a speculative reference. > > We never saw that trigger so far because quite a long time must have > passed ever since a page might have been part of the page cache / page > tables, before virtio-mem fake-offlined it (using alloc_contig_range()) > and the Linux memory block actually gets offlined. > > But yes, RCU (e.g., on the memory offlining path) would likely be the > right approach to make sure GUP-fast and the pagecache will no longer > grab this page by accident. > > > > > If this is true, we might want to map a "zero struct page" rather than > > leave a hole in vmemmap when offlining pages. And the logic on the hot > > removal side would be similar to that of HVO. > > Once virtio-mem would do something like HVO, yes. Right now virtio-mem > only removes struct-page metadata by removing/unplugging its owned Linux > memory blocks once they are fully "logically offline". > > -- > Cheers, > > David / dhildenb >
On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: > On 2024/3/14 7:32, David Rientjes wrote: > > > On Thu, 8 Feb 2024, Will Deacon wrote: > > > > > > How about take a new lock with irq disabled during BBM, like: > > > > > > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > > > > +{ > > > > + (NEW_LOCK); > > > > + pte_clear(&init_mm, addr, ptep); > > > > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > > > > + set_pte_at(&init_mm, addr, ptep, pte); > > > > + spin_unlock_irq(NEW_LOCK); > > > > +} > > > I really think the only maintainable way to achieve this is to avoid the > > > possibility of a fault altogether. > > > > > > Will > > > > > > > > Nanyong, are you still actively working on making HVO possible on arm64? > > > > This would yield a substantial memory savings on hosts that are largely > > configured with hugetlbfs. In our case, the size of this hugetlbfs pool > > is actually never changed after boot, but it sounds from the thread that > > there was an idea to make HVO conditional on FEAT_BBM. Is this being > > pursued? > > > > If so, any testing help needed? > I'm afraid that FEAT_BBM may not solve the problem here I think so too -- I came cross this while working on TAO [1]. [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ > because from Arm > ARM, > I see that FEAT_BBM is only used for changing block size. Therefore, in this > HVO feature, > it can work in the split PMD stage, that is, BBM can be avoided in > vmemmap_split_pmd, > but in the subsequent vmemmap_remap_pte, the Output address of PTE still > needs to be > changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my > understanding > of ARM FEAT_BBM is wrong, and I hope someone can correct me. > Actually, the solution I first considered was to use the stop_machine > method, but we have > products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically > use hugepages, > so I have to consider performance issues. If your product does not change > the amount of huge > pages after booting, using stop_machine() may be a feasible way. > So far, I still haven't come up with a good solution. I do have a patch that's similar to stop_machine() -- it uses NMI IPIs to pause/resume remote CPUs while the local one is doing BBM. Note that the problem of updating vmemmap for struct page[], as I see it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory hot removal in general [2]. On arm64, we would need to support BBM on vmemmap so that we can fix the problem with offlining memory (or to be precise, unmapping offlined struct page[]), by mapping offlined struct page[] to a read-only page of dummy struct page[], similar to ZERO_PAGE(). (Or we would have to make extremely invasive changes to the reader side, i.e., all speculative PFN walkers.) In case you are interested in testing my approach, you can swap your patch 2 with the following: [2] https://lore.kernel.org/20240621213717.1099079-1-yuzhao@google.com/ diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h index 8ff5f2a2579e..1af1aa34a351 100644 --- a/arch/arm64/include/asm/pgalloc.h +++ b/arch/arm64/include/asm/pgalloc.h @@ -12,6 +12,7 @@ #include <asm/processor.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h> +#include <asm/cpu.h> #define __HAVE_ARCH_PGD_FREE #define __HAVE_ARCH_PUD_FREE @@ -137,4 +138,58 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep) __pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN); } +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP + +#define vmemmap_update_lock vmemmap_update_lock +static inline void vmemmap_update_lock(void) +{ + cpus_read_lock(); +} + +#define vmemmap_update_unlock vmemmap_update_unlock +static inline void vmemmap_update_unlock(void) +{ + cpus_read_unlock(); +} + +#define vmemmap_update_pte vmemmap_update_pte +static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) +{ + preempt_disable(); + pause_remote_cpus(); + + pte_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + set_pte_at(&init_mm, addr, ptep, pte); + + resume_remote_cpus(); + preempt_enable(); +} + +#define vmemmap_update_pmd vmemmap_update_pmd +static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep) +{ + preempt_disable(); + pause_remote_cpus(); + + pmd_clear(pmdp); + flush_tlb_kernel_range(addr, addr + PMD_SIZE); + pmd_populate_kernel(&init_mm, pmdp, ptep); + + resume_remote_cpus(); + preempt_enable(); +} + +#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all +static inline void vmemmap_flush_tlb_all(void) +{ +} + +#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range +static inline void vmemmap_flush_tlb_range(unsigned long start, unsigned long end) +{ +} + +#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */ + #endif diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h index efb13112b408..544b15948b64 100644 --- a/arch/arm64/include/asm/smp.h +++ b/arch/arm64/include/asm/smp.h @@ -144,6 +144,9 @@ bool cpus_are_stuck_in_kernel(void); extern void crash_smp_send_stop(void); extern bool smp_crash_stop_failed(void); +void pause_remote_cpus(void); +void resume_remote_cpus(void); + #endif /* ifndef __ASSEMBLY__ */ #endif /* ifndef __ASM_SMP_H */ diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 31c8b3094dd7..ae0a178db066 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -71,16 +71,25 @@ enum ipi_msg_type { IPI_RESCHEDULE, IPI_CALL_FUNC, IPI_CPU_STOP, + IPI_CPU_PAUSE, +#ifdef CONFIG_KEXEC_CORE IPI_CPU_CRASH_STOP, +#endif +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST IPI_TIMER, +#endif +#ifdef CONFIG_IRQ_WORK IPI_IRQ_WORK, +#endif NR_IPI, /* * Any enum >= NR_IPI and < MAX_IPI is special and not tracable * with trace_ipi_* */ IPI_CPU_BACKTRACE = NR_IPI, +#ifdef CONFIG_KGDB IPI_KGDB_ROUNDUP, +#endif MAX_IPI }; @@ -771,9 +780,16 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = { [IPI_RESCHEDULE] = "Rescheduling interrupts", [IPI_CALL_FUNC] = "Function call interrupts", [IPI_CPU_STOP] = "CPU stop interrupts", + [IPI_CPU_PAUSE] = "CPU pause interrupts", +#ifdef CONFIG_KEXEC_CORE [IPI_CPU_CRASH_STOP] = "CPU stop (for crash dump) interrupts", +#endif +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST [IPI_TIMER] = "Timer broadcast interrupts", +#endif +#ifdef CONFIG_IRQ_WORK [IPI_IRQ_WORK] = "IRQ work interrupts", +#endif }; static void smp_cross_call(const struct cpumask *target, unsigned int ipinr); @@ -832,6 +848,85 @@ void __noreturn panic_smp_self_stop(void) local_cpu_stop(); } +static DEFINE_SPINLOCK(cpu_pause_lock); +static cpumask_t paused_cpus; +static cpumask_t resumed_cpus; + +static void pause_local_cpu(void) +{ + int cpu = smp_processor_id(); + + cpumask_clear_cpu(cpu, &resumed_cpus); + /* + * Paired with pause_remote_cpus() to confirm that this CPU not only + * will be paused but also can be reliably resumed. + */ + smp_wmb(); + cpumask_set_cpu(cpu, &paused_cpus); + /* A typical example for sleep and wake-up functions. */ + smp_mb(); + while (!cpumask_test_cpu(cpu, &resumed_cpus)) { + wfe(); + barrier(); + } + barrier(); + cpumask_clear_cpu(cpu, &paused_cpus); +} + +void pause_remote_cpus(void) +{ + cpumask_t cpus_to_pause; + + lockdep_assert_cpus_held(); + lockdep_assert_preemption_disabled(); + + cpumask_copy(&cpus_to_pause, cpu_online_mask); + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause); + + spin_lock(&cpu_pause_lock); + + WARN_ON_ONCE(!cpumask_empty(&paused_cpus)); + + smp_cross_call(&cpus_to_pause, IPI_CPU_PAUSE); + + while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) { + cpu_relax(); + barrier(); + } + /* + * Paired pause_local_cpu() to confirm that all CPUs not only will be + * paused but also can be reliably resumed. + */ + smp_rmb(); + WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus)); + + spin_unlock(&cpu_pause_lock); +} + +void resume_remote_cpus(void) +{ + cpumask_t cpus_to_resume; + + lockdep_assert_cpus_held(); + lockdep_assert_preemption_disabled(); + + cpumask_copy(&cpus_to_resume, cpu_online_mask); + cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume); + + spin_lock(&cpu_pause_lock); + + cpumask_setall(&resumed_cpus); + /* A typical example for sleep and wake-up functions. */ + smp_mb(); + while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) { + sev(); + cpu_relax(); + barrier(); + } + + spin_unlock(&cpu_pause_lock); +} + #ifdef CONFIG_KEXEC_CORE static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0); #endif @@ -911,6 +1006,11 @@ static void do_handle_IPI(int ipinr) local_cpu_stop(); break; + case IPI_CPU_PAUSE: + pause_local_cpu(); + break; + +#ifdef CONFIG_KEXEC_CORE case IPI_CPU_CRASH_STOP: if (IS_ENABLED(CONFIG_KEXEC_CORE)) { ipi_cpu_crash_stop(cpu, get_irq_regs()); @@ -918,6 +1018,7 @@ static void do_handle_IPI(int ipinr) unreachable(); } break; +#endif #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST case IPI_TIMER: @@ -939,9 +1040,11 @@ static void do_handle_IPI(int ipinr) nmi_cpu_backtrace(get_irq_regs()); break; +#ifdef CONFIG_KGDB case IPI_KGDB_ROUNDUP: kgdb_nmicallback(cpu, get_irq_regs()); break; +#endif default: pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr); @@ -971,9 +1074,14 @@ static bool ipi_should_be_nmi(enum ipi_msg_type ipi) switch (ipi) { case IPI_CPU_STOP: + case IPI_CPU_PAUSE: +#ifdef CONFIG_KEXEC_CORE case IPI_CPU_CRASH_STOP: +#endif case IPI_CPU_BACKTRACE: +#ifdef CONFIG_KGDB case IPI_KGDB_ROUNDUP: +#endif return true; default: return false; diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 5113753f3ac9..da6f2a7d665e 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -46,6 +46,18 @@ struct vmemmap_remap_walk { unsigned long flags; }; +#ifndef vmemmap_update_lock +static void vmemmap_update_lock(void) +{ +} +#endif + +#ifndef vmemmap_update_unlock +static void vmemmap_update_unlock(void) +{ +} +#endif + #ifndef vmemmap_update_pmd static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep) @@ -194,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end, VM_BUG_ON(!PAGE_ALIGNED(start | end)); + vmemmap_update_lock(); mmap_read_lock(&init_mm); ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops, NULL, walk); mmap_read_unlock(&init_mm); + vmemmap_update_unlock(); if (ret) return ret;
在 2024/6/24 13:39, Yu Zhao 写道: > On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: >> On 2024/3/14 7:32, David Rientjes wrote: >> >>> On Thu, 8 Feb 2024, Will Deacon wrote: >>> >>>>> How about take a new lock with irq disabled during BBM, like: >>>>> >>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) >>>>> +{ >>>>> + (NEW_LOCK); >>>>> + pte_clear(&init_mm, addr, ptep); >>>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>>>> + set_pte_at(&init_mm, addr, ptep, pte); >>>>> + spin_unlock_irq(NEW_LOCK); >>>>> +} >>>> I really think the only maintainable way to achieve this is to avoid the >>>> possibility of a fault altogether. >>>> >>>> Will >>>> >>>> >>> Nanyong, are you still actively working on making HVO possible on arm64? >>> >>> This would yield a substantial memory savings on hosts that are largely >>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool >>> is actually never changed after boot, but it sounds from the thread that >>> there was an idea to make HVO conditional on FEAT_BBM. Is this being >>> pursued? >>> >>> If so, any testing help needed? >> I'm afraid that FEAT_BBM may not solve the problem here > I think so too -- I came cross this while working on TAO [1]. > > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ > >> because from Arm >> ARM, >> I see that FEAT_BBM is only used for changing block size. Therefore, in this >> HVO feature, >> it can work in the split PMD stage, that is, BBM can be avoided in >> vmemmap_split_pmd, >> but in the subsequent vmemmap_remap_pte, the Output address of PTE still >> needs to be >> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my >> understanding >> of ARM FEAT_BBM is wrong, and I hope someone can correct me. >> Actually, the solution I first considered was to use the stop_machine >> method, but we have >> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically >> use hugepages, >> so I have to consider performance issues. If your product does not change >> the amount of huge >> pages after booting, using stop_machine() may be a feasible way. >> So far, I still haven't come up with a good solution. > I do have a patch that's similar to stop_machine() -- it uses NMI IPIs > to pause/resume remote CPUs while the local one is doing BBM. > > Note that the problem of updating vmemmap for struct page[], as I see > it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory > hot removal in general [2]. On arm64, we would need to support BBM on > vmemmap so that we can fix the problem with offlining memory (or to be > precise, unmapping offlined struct page[]), by mapping offlined struct > page[] to a read-only page of dummy struct page[], similar to > ZERO_PAGE(). (Or we would have to make extremely invasive changes to > the reader side, i.e., all speculative PFN walkers.) > > In case you are interested in testing my approach, you can swap your > patch 2 with the following: I don't have an NMI IPI capable ARM machine on hand, so I think this feature depends on a higher version of the ARM cpu. What I worried about was that other cores would occasionally be interrupted frequently(8 times every 2M and 4096 times every 1G) and then wait for the update of page table to complete before resuming. If there are workloads running on other cores, performance may be affected. This implementation speeds up stopping and resuming other cores, but they still have to wait for the update to finish. > > [2] https://lore.kernel.org/20240621213717.1099079-1-yuzhao@google.com/ > > diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h > index 8ff5f2a2579e..1af1aa34a351 100644 > --- a/arch/arm64/include/asm/pgalloc.h > +++ b/arch/arm64/include/asm/pgalloc.h > @@ -12,6 +12,7 @@ > #include <asm/processor.h> > #include <asm/cacheflush.h> > #include <asm/tlbflush.h> > +#include <asm/cpu.h> > > #define __HAVE_ARCH_PGD_FREE > #define __HAVE_ARCH_PUD_FREE > @@ -137,4 +138,58 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep) > __pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN); > } > > +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > + > +#define vmemmap_update_lock vmemmap_update_lock > +static inline void vmemmap_update_lock(void) > +{ > + cpus_read_lock(); > +} > + > +#define vmemmap_update_unlock vmemmap_update_unlock > +static inline void vmemmap_update_unlock(void) > +{ > + cpus_read_unlock(); > +} > + > +#define vmemmap_update_pte vmemmap_update_pte > +static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > +{ > + preempt_disable(); > + pause_remote_cpus(); > + > + pte_clear(&init_mm, addr, ptep); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + set_pte_at(&init_mm, addr, ptep, pte); > + > + resume_remote_cpus(); > + preempt_enable(); > +} > + > +#define vmemmap_update_pmd vmemmap_update_pmd > +static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep) > +{ > + preempt_disable(); > + pause_remote_cpus(); > + > + pmd_clear(pmdp); > + flush_tlb_kernel_range(addr, addr + PMD_SIZE); > + pmd_populate_kernel(&init_mm, pmdp, ptep); > + > + resume_remote_cpus(); > + preempt_enable(); > +} > + > +#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all > +static inline void vmemmap_flush_tlb_all(void) > +{ > +} > + > +#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range > +static inline void vmemmap_flush_tlb_range(unsigned long start, unsigned long end) > +{ > +} > + > +#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */ > + > #endif > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h > index efb13112b408..544b15948b64 100644 > --- a/arch/arm64/include/asm/smp.h > +++ b/arch/arm64/include/asm/smp.h > @@ -144,6 +144,9 @@ bool cpus_are_stuck_in_kernel(void); > extern void crash_smp_send_stop(void); > extern bool smp_crash_stop_failed(void); > > +void pause_remote_cpus(void); > +void resume_remote_cpus(void); > + > #endif /* ifndef __ASSEMBLY__ */ > > #endif /* ifndef __ASM_SMP_H */ > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 31c8b3094dd7..ae0a178db066 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -71,16 +71,25 @@ enum ipi_msg_type { > IPI_RESCHEDULE, > IPI_CALL_FUNC, > IPI_CPU_STOP, > + IPI_CPU_PAUSE, > +#ifdef CONFIG_KEXEC_CORE > IPI_CPU_CRASH_STOP, > +#endif > +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST > IPI_TIMER, > +#endif > +#ifdef CONFIG_IRQ_WORK > IPI_IRQ_WORK, > +#endif > NR_IPI, > /* > * Any enum >= NR_IPI and < MAX_IPI is special and not tracable > * with trace_ipi_* > */ > IPI_CPU_BACKTRACE = NR_IPI, > +#ifdef CONFIG_KGDB > IPI_KGDB_ROUNDUP, > +#endif > MAX_IPI > }; > > @@ -771,9 +780,16 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = { > [IPI_RESCHEDULE] = "Rescheduling interrupts", > [IPI_CALL_FUNC] = "Function call interrupts", > [IPI_CPU_STOP] = "CPU stop interrupts", > + [IPI_CPU_PAUSE] = "CPU pause interrupts", > +#ifdef CONFIG_KEXEC_CORE > [IPI_CPU_CRASH_STOP] = "CPU stop (for crash dump) interrupts", > +#endif > +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST > [IPI_TIMER] = "Timer broadcast interrupts", > +#endif > +#ifdef CONFIG_IRQ_WORK > [IPI_IRQ_WORK] = "IRQ work interrupts", > +#endif > }; > > static void smp_cross_call(const struct cpumask *target, unsigned int ipinr); > @@ -832,6 +848,85 @@ void __noreturn panic_smp_self_stop(void) > local_cpu_stop(); > } > > +static DEFINE_SPINLOCK(cpu_pause_lock); > +static cpumask_t paused_cpus; > +static cpumask_t resumed_cpus; > + > +static void pause_local_cpu(void) > +{ > + int cpu = smp_processor_id(); > + > + cpumask_clear_cpu(cpu, &resumed_cpus); > + /* > + * Paired with pause_remote_cpus() to confirm that this CPU not only > + * will be paused but also can be reliably resumed. > + */ > + smp_wmb(); > + cpumask_set_cpu(cpu, &paused_cpus); > + /* A typical example for sleep and wake-up functions. */ > + smp_mb(); > + while (!cpumask_test_cpu(cpu, &resumed_cpus)) { > + wfe(); > + barrier(); > + } > + barrier(); > + cpumask_clear_cpu(cpu, &paused_cpus); > +} > + > +void pause_remote_cpus(void) > +{ > + cpumask_t cpus_to_pause; > + > + lockdep_assert_cpus_held(); > + lockdep_assert_preemption_disabled(); > + > + cpumask_copy(&cpus_to_pause, cpu_online_mask); > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause); > + > + spin_lock(&cpu_pause_lock); > + > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus)); > + > + smp_cross_call(&cpus_to_pause, IPI_CPU_PAUSE); > + > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) { > + cpu_relax(); > + barrier(); > + } > + /* > + * Paired pause_local_cpu() to confirm that all CPUs not only will be > + * paused but also can be reliably resumed. > + */ > + smp_rmb(); > + WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus)); > + > + spin_unlock(&cpu_pause_lock); > +} > + > +void resume_remote_cpus(void) > +{ > + cpumask_t cpus_to_resume; > + > + lockdep_assert_cpus_held(); > + lockdep_assert_preemption_disabled(); > + > + cpumask_copy(&cpus_to_resume, cpu_online_mask); > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume); > + > + spin_lock(&cpu_pause_lock); > + > + cpumask_setall(&resumed_cpus); > + /* A typical example for sleep and wake-up functions. */ > + smp_mb(); > + while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) { > + sev(); > + cpu_relax(); > + barrier(); > + } > + > + spin_unlock(&cpu_pause_lock); > +} > + > #ifdef CONFIG_KEXEC_CORE > static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0); > #endif > @@ -911,6 +1006,11 @@ static void do_handle_IPI(int ipinr) > local_cpu_stop(); > break; > > + case IPI_CPU_PAUSE: > + pause_local_cpu(); > + break; > + > +#ifdef CONFIG_KEXEC_CORE > case IPI_CPU_CRASH_STOP: > if (IS_ENABLED(CONFIG_KEXEC_CORE)) { > ipi_cpu_crash_stop(cpu, get_irq_regs()); > @@ -918,6 +1018,7 @@ static void do_handle_IPI(int ipinr) > unreachable(); > } > break; > +#endif > > #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST > case IPI_TIMER: > @@ -939,9 +1040,11 @@ static void do_handle_IPI(int ipinr) > nmi_cpu_backtrace(get_irq_regs()); > break; > > +#ifdef CONFIG_KGDB > case IPI_KGDB_ROUNDUP: > kgdb_nmicallback(cpu, get_irq_regs()); > break; > +#endif > > default: > pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr); > @@ -971,9 +1074,14 @@ static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > switch (ipi) { > case IPI_CPU_STOP: > + case IPI_CPU_PAUSE: > +#ifdef CONFIG_KEXEC_CORE > case IPI_CPU_CRASH_STOP: > +#endif > case IPI_CPU_BACKTRACE: > +#ifdef CONFIG_KGDB > case IPI_KGDB_ROUNDUP: > +#endif > return true; > default: > return false; > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index 5113753f3ac9..da6f2a7d665e 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -46,6 +46,18 @@ struct vmemmap_remap_walk { > unsigned long flags; > }; > > +#ifndef vmemmap_update_lock > +static void vmemmap_update_lock(void) > +{ > +} > +#endif > + > +#ifndef vmemmap_update_unlock > +static void vmemmap_update_unlock(void) > +{ > +} > +#endif > + > #ifndef vmemmap_update_pmd > static inline void vmemmap_update_pmd(unsigned long addr, > pmd_t *pmdp, pte_t *ptep) > @@ -194,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end, > > VM_BUG_ON(!PAGE_ALIGNED(start | end)); > > + vmemmap_update_lock(); > mmap_read_lock(&init_mm); > ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops, > NULL, walk); > mmap_read_unlock(&init_mm); > + vmemmap_update_unlock(); > if (ret) > return ret; > > > .
On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote: > > > 在 2024/6/24 13:39, Yu Zhao 写道: > > On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: > >> On 2024/3/14 7:32, David Rientjes wrote: > >> > >>> On Thu, 8 Feb 2024, Will Deacon wrote: > >>> > >>>>> How about take a new lock with irq disabled during BBM, like: > >>>>> > >>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > >>>>> +{ > >>>>> + (NEW_LOCK); > >>>>> + pte_clear(&init_mm, addr, ptep); > >>>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > >>>>> + set_pte_at(&init_mm, addr, ptep, pte); > >>>>> + spin_unlock_irq(NEW_LOCK); > >>>>> +} > >>>> I really think the only maintainable way to achieve this is to avoid the > >>>> possibility of a fault altogether. > >>>> > >>>> Will > >>>> > >>>> > >>> Nanyong, are you still actively working on making HVO possible on arm64? > >>> > >>> This would yield a substantial memory savings on hosts that are largely > >>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool > >>> is actually never changed after boot, but it sounds from the thread that > >>> there was an idea to make HVO conditional on FEAT_BBM. Is this being > >>> pursued? > >>> > >>> If so, any testing help needed? > >> I'm afraid that FEAT_BBM may not solve the problem here > > I think so too -- I came cross this while working on TAO [1]. > > > > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ > > > >> because from Arm > >> ARM, > >> I see that FEAT_BBM is only used for changing block size. Therefore, in this > >> HVO feature, > >> it can work in the split PMD stage, that is, BBM can be avoided in > >> vmemmap_split_pmd, > >> but in the subsequent vmemmap_remap_pte, the Output address of PTE still > >> needs to be > >> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my > >> understanding > >> of ARM FEAT_BBM is wrong, and I hope someone can correct me. > >> Actually, the solution I first considered was to use the stop_machine > >> method, but we have > >> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically > >> use hugepages, > >> so I have to consider performance issues. If your product does not change > >> the amount of huge > >> pages after booting, using stop_machine() may be a feasible way. > >> So far, I still haven't come up with a good solution. > > I do have a patch that's similar to stop_machine() -- it uses NMI IPIs > > to pause/resume remote CPUs while the local one is doing BBM. > > > > Note that the problem of updating vmemmap for struct page[], as I see > > it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory > > hot removal in general [2]. On arm64, we would need to support BBM on > > vmemmap so that we can fix the problem with offlining memory (or to be > > precise, unmapping offlined struct page[]), by mapping offlined struct > > page[] to a read-only page of dummy struct page[], similar to > > ZERO_PAGE(). (Or we would have to make extremely invasive changes to > > the reader side, i.e., all speculative PFN walkers.) > > > > In case you are interested in testing my approach, you can swap your > > patch 2 with the following: > I don't have an NMI IPI capable ARM machine on hand, so I think this feature > depends on a higher version of the ARM cpu. (Pseudo) NMI does require GICv3 (released in 2015). But that's independent from CPU versions. Just to double check: you don't have GICv3 (rather than not have CONFIG_ARM64_PSEUDO_NMI=y or irqchip.gicv3_pseudo_nmi=1), is that correct? Even without GICv3, IPIs can be masked but still works, with a less bounded latency. > What I worried about was that other cores would occasionally be interrupted > frequently(8 times every 2M and 4096 times every 1G) and then wait for the > update of page table to complete before resuming. Catalin has suggested batching, and to echo what he said [1]: it's possible to make all vmemmap changes from a single HVO/de-HVO operation into *one batch*. [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/ > If there are workloads > running on other cores, performance may be affected. This implementation > speeds up stopping and resuming other cores, but they still have to wait > for the update to finish. How often does your use case trigger HVO/de-HVO operations? For our VM use case, it's generally correlated to VM lifetimes, i.e., how often VM bin-packing happens. For our THP use case, it can be more often, but I still don't think we would trigger HVO/de-HVO every minute. So with NMI IPIs, IMO, the performance impact would be acceptable to our use cases.
On Wed, Feb 7, 2024 at 5:44 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > BBM(break-before-make) logic when changing page tables. > > > > This set of patches fix this by adding necessary BBM sequence when > > > > changing page table, and supporting vmemmap page fault handling to > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > [...] > > > How often is this code path called? I wonder whether a stop_machine() > > > approach would be simpler. > > As long as allocating or releasing hugetlb is called. We cannot limit users > > to only allocate or release hugetlb > > when booting or not running any workload on all other cpus, so if use > > stop_machine(), it will be triggered > > 8 times every 2M and 4096 times every 1G, which is probably too expensive. > > I'm hoping this can be batched somehow and not do a stop_machine() (or > 8) for every 2MB huge page. Theoretically, all hugeTLB vmemmap operations from a single user request can be done in one batch. This would require the preallocation of the new copy of vmemmap so that the old copy can be replaced with one BBM. > Just to make sure I understand - is the goal to be able to free struct > pages corresponding to hugetlbfs pages? Correct, if you are referring to the pages holding struct page[]. > Can we not leave the vmemmap in > place and just release that memory to the page allocator? We cannot, since the goal is to reuse those pages for something else, i.e., reduce the metadata overhead for hugeTLB. > The physical > RAM for those struct pages isn't going anywhere This is not the case.
On 2024/6/28 5:03, Yu Zhao wrote: > On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote: >> >> 在 2024/6/24 13:39, Yu Zhao 写道: >>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: >>>> On 2024/3/14 7:32, David Rientjes wrote: >>>> >>>>> On Thu, 8 Feb 2024, Will Deacon wrote: >>>>> >>>>>>> How about take a new lock with irq disabled during BBM, like: >>>>>>> >>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) >>>>>>> +{ >>>>>>> + (NEW_LOCK); >>>>>>> + pte_clear(&init_mm, addr, ptep); >>>>>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>>>>>> + set_pte_at(&init_mm, addr, ptep, pte); >>>>>>> + spin_unlock_irq(NEW_LOCK); >>>>>>> +} >>>>>> I really think the only maintainable way to achieve this is to avoid the >>>>>> possibility of a fault altogether. >>>>>> >>>>>> Will >>>>>> >>>>>> >>>>> Nanyong, are you still actively working on making HVO possible on arm64? >>>>> >>>>> This would yield a substantial memory savings on hosts that are largely >>>>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool >>>>> is actually never changed after boot, but it sounds from the thread that >>>>> there was an idea to make HVO conditional on FEAT_BBM. Is this being >>>>> pursued? >>>>> >>>>> If so, any testing help needed? >>>> I'm afraid that FEAT_BBM may not solve the problem here >>> I think so too -- I came cross this while working on TAO [1]. >>> >>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ >>> >>>> because from Arm >>>> ARM, >>>> I see that FEAT_BBM is only used for changing block size. Therefore, in this >>>> HVO feature, >>>> it can work in the split PMD stage, that is, BBM can be avoided in >>>> vmemmap_split_pmd, >>>> but in the subsequent vmemmap_remap_pte, the Output address of PTE still >>>> needs to be >>>> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my >>>> understanding >>>> of ARM FEAT_BBM is wrong, and I hope someone can correct me. >>>> Actually, the solution I first considered was to use the stop_machine >>>> method, but we have >>>> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically >>>> use hugepages, >>>> so I have to consider performance issues. If your product does not change >>>> the amount of huge >>>> pages after booting, using stop_machine() may be a feasible way. >>>> So far, I still haven't come up with a good solution. >>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs >>> to pause/resume remote CPUs while the local one is doing BBM. >>> >>> Note that the problem of updating vmemmap for struct page[], as I see >>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory >>> hot removal in general [2]. On arm64, we would need to support BBM on >>> vmemmap so that we can fix the problem with offlining memory (or to be >>> precise, unmapping offlined struct page[]), by mapping offlined struct >>> page[] to a read-only page of dummy struct page[], similar to >>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to >>> the reader side, i.e., all speculative PFN walkers.) >>> >>> In case you are interested in testing my approach, you can swap your >>> patch 2 with the following: >> I don't have an NMI IPI capable ARM machine on hand, so I think this feature >> depends on a higher version of the ARM cpu. > (Pseudo) NMI does require GICv3 (released in 2015). But that's > independent from CPU versions. Just to double check: you don't have > GICv3 (rather than not have CONFIG_ARM64_PSEUDO_NMI=y or > irqchip.gicv3_pseudo_nmi=1), is that correct? > > Even without GICv3, IPIs can be masked but still works, with a less > bounded latency. Oh,I misunderstood. Pseudo NMI is available. We have CONFIG_ARM64_PSEUDO_NMI=y but did not set irqchip.gicv3_pseudo_nmi=1 by default. So I can test this solution after opening this in cmdline. >> What I worried about was that other cores would occasionally be interrupted >> frequently(8 times every 2M and 4096 times every 1G) and then wait for the >> update of page table to complete before resuming. > Catalin has suggested batching, and to echo what he said [1]: it's > possible to make all vmemmap changes from a single HVO/de-HVO > operation into *one batch*. > > [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/ > >> If there are workloads >> running on other cores, performance may be affected. This implementation >> speeds up stopping and resuming other cores, but they still have to wait >> for the update to finish. > How often does your use case trigger HVO/de-HVO operations? > > For our VM use case, it's generally correlated to VM lifetimes, i.e., > how often VM bin-packing happens. For our THP use case, it can be more > often, but I still don't think we would trigger HVO/de-HVO every > minute. So with NMI IPIs, IMO, the performance impact would be > acceptable to our use cases. > > . We have many use cases so that I'm not thinking about a specific use case, but rather a generic one. I will test the performance impact of different HVO trigger frequencies, such as triggering HVO while running redis.
On Thu, Jul 4, 2024 at 5:47 AM Nanyong Sun <sunnanyong@huawei.com> wrote: > > On 2024/6/28 5:03, Yu Zhao wrote: > > On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@huawei.com> wrote: > >> > >> 在 2024/6/24 13:39, Yu Zhao 写道: > >>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote: > >>>> On 2024/3/14 7:32, David Rientjes wrote: > >>>> > >>>>> On Thu, 8 Feb 2024, Will Deacon wrote: > >>>>> > >>>>>>> How about take a new lock with irq disabled during BBM, like: > >>>>>>> > >>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte) > >>>>>>> +{ > >>>>>>> + (NEW_LOCK); > >>>>>>> + pte_clear(&init_mm, addr, ptep); > >>>>>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > >>>>>>> + set_pte_at(&init_mm, addr, ptep, pte); > >>>>>>> + spin_unlock_irq(NEW_LOCK); > >>>>>>> +} > >>>>>> I really think the only maintainable way to achieve this is to avoid the > >>>>>> possibility of a fault altogether. > >>>>>> > >>>>>> Will > >>>>>> > >>>>>> > >>>>> Nanyong, are you still actively working on making HVO possible on arm64? > >>>>> > >>>>> This would yield a substantial memory savings on hosts that are largely > >>>>> configured with hugetlbfs. In our case, the size of this hugetlbfs pool > >>>>> is actually never changed after boot, but it sounds from the thread that > >>>>> there was an idea to make HVO conditional on FEAT_BBM. Is this being > >>>>> pursued? > >>>>> > >>>>> If so, any testing help needed? > >>>> I'm afraid that FEAT_BBM may not solve the problem here > >>> I think so too -- I came cross this while working on TAO [1]. > >>> > >>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ > >>> > >>>> because from Arm > >>>> ARM, > >>>> I see that FEAT_BBM is only used for changing block size. Therefore, in this > >>>> HVO feature, > >>>> it can work in the split PMD stage, that is, BBM can be avoided in > >>>> vmemmap_split_pmd, > >>>> but in the subsequent vmemmap_remap_pte, the Output address of PTE still > >>>> needs to be > >>>> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my > >>>> understanding > >>>> of ARM FEAT_BBM is wrong, and I hope someone can correct me. > >>>> Actually, the solution I first considered was to use the stop_machine > >>>> method, but we have > >>>> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically > >>>> use hugepages, > >>>> so I have to consider performance issues. If your product does not change > >>>> the amount of huge > >>>> pages after booting, using stop_machine() may be a feasible way. > >>>> So far, I still haven't come up with a good solution. > >>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs > >>> to pause/resume remote CPUs while the local one is doing BBM. > >>> > >>> Note that the problem of updating vmemmap for struct page[], as I see > >>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory > >>> hot removal in general [2]. On arm64, we would need to support BBM on > >>> vmemmap so that we can fix the problem with offlining memory (or to be > >>> precise, unmapping offlined struct page[]), by mapping offlined struct > >>> page[] to a read-only page of dummy struct page[], similar to > >>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to > >>> the reader side, i.e., all speculative PFN walkers.) > >>> > >>> In case you are interested in testing my approach, you can swap your > >>> patch 2 with the following: > >> I don't have an NMI IPI capable ARM machine on hand, so I think this feature > >> depends on a higher version of the ARM cpu. > > (Pseudo) NMI does require GICv3 (released in 2015). But that's > > independent from CPU versions. Just to double check: you don't have > > GICv3 (rather than not have CONFIG_ARM64_PSEUDO_NMI=y or > > irqchip.gicv3_pseudo_nmi=1), is that correct? > > > > Even without GICv3, IPIs can be masked but still works, with a less > > bounded latency. > Oh,I misunderstood. Pseudo NMI is available. We have > CONFIG_ARM64_PSEUDO_NMI=y > but did not set irqchip.gicv3_pseudo_nmi=1 by default. So I can test > this solution after > opening this in cmdline. > > >> What I worried about was that other cores would occasionally be interrupted > >> frequently(8 times every 2M and 4096 times every 1G) and then wait for the > >> update of page table to complete before resuming. > > Catalin has suggested batching, and to echo what he said [1]: it's > > possible to make all vmemmap changes from a single HVO/de-HVO > > operation into *one batch*. > > > > [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/ > > > >> If there are workloads > >> running on other cores, performance may be affected. This implementation > >> speeds up stopping and resuming other cores, but they still have to wait > >> for the update to finish. > > How often does your use case trigger HVO/de-HVO operations? > > > > For our VM use case, it's generally correlated to VM lifetimes, i.e., > > how often VM bin-packing happens. For our THP use case, it can be more > > often, but I still don't think we would trigger HVO/de-HVO every > > minute. So with NMI IPIs, IMO, the performance impact would be > > acceptable to our use cases. > > > > . > We have many use cases so that I'm not thinking about a specific use case, > but rather a generic one. I will test the performance impact of different > HVO trigger frequencies, such as triggering HVO while running redis. Thanks, and if it's not good enough for whatever you are going to test, we can batch the updates at least at the PTE level, or even at the PMD level.
On Thu, Jun 27, 2024 at 03:19:55PM -0600, Yu Zhao wrote: > On Wed, Feb 7, 2024 at 5:44 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > BBM(break-before-make) logic when changing page tables. > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > [...] > > > > How often is this code path called? I wonder whether a stop_machine() > > > > approach would be simpler. > > > > > > As long as allocating or releasing hugetlb is called. We cannot > > > limit users to only allocate or release hugetlb when booting or > > > not running any workload on all other cpus, so if use > > > stop_machine(), it will be triggered 8 times every 2M and 4096 > > > times every 1G, which is probably too expensive. > > > > I'm hoping this can be batched somehow and not do a stop_machine() (or > > 8) for every 2MB huge page. > > Theoretically, all hugeTLB vmemmap operations from a single user > request can be done in one batch. This would require the preallocation > of the new copy of vmemmap so that the old copy can be replaced with > one BBM. Do we ever re-create pmd block entries back for the vmmemap range that was split or do they remain pmd table + pte entries? If the latter, I guess we could do a stop_machine() only for a pmd, it should be self limiting after a while. I don't want user-space to DoS the system by triggering stop_machine() when mapping/unmapping hugetlbfs pages. If I did the maths right, for a 2MB hugetlb page, we have about 8 vmemmap pages (32K). Once we split a 2MB vmemap range, whatever else needs to be touched in this range won't require a stop_machine(). > > Just to make sure I understand - is the goal to be able to free struct > > pages corresponding to hugetlbfs pages? > > Correct, if you are referring to the pages holding struct page[]. > > > Can we not leave the vmemmap in > > place and just release that memory to the page allocator? > > We cannot, since the goal is to reuse those pages for something else, > i.e., reduce the metadata overhead for hugeTLB. What I meant is that we can leave the vmemmap alias in place and just reuse those pages via the linear map etc. The kernel should touch those struct pages to corrupt the data. The only problem would be if we physically unplug those pages but I don't think that's the case here.
On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Thu, Jun 27, 2024 at 03:19:55PM -0600, Yu Zhao wrote: > > On Wed, Feb 7, 2024 at 5:44 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > > BBM(break-before-make) logic when changing page tables. > > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > [...] > > > > > How often is this code path called? I wonder whether a stop_machine() > > > > > approach would be simpler. > > > > > > > > As long as allocating or releasing hugetlb is called. We cannot > > > > limit users to only allocate or release hugetlb when booting or > > > > not running any workload on all other cpus, so if use > > > > stop_machine(), it will be triggered 8 times every 2M and 4096 > > > > times every 1G, which is probably too expensive. > > > > > > I'm hoping this can be batched somehow and not do a stop_machine() (or > > > 8) for every 2MB huge page. > > > > Theoretically, all hugeTLB vmemmap operations from a single user > > request can be done in one batch. This would require the preallocation > > of the new copy of vmemmap so that the old copy can be replaced with > > one BBM. > > Do we ever re-create pmd block entries back for the vmmemap range that > was split or do they remain pmd table + pte entries? If the latter, I > guess we could do a stop_machine() only for a pmd, it should be self > limiting after a while. It's the latter for now, but it can change in the future: we do want to restore the original mapping at the PMD level; instead, we do it at the PTE level because high-order pages backing PMD entries are not as easy to allocate, compared with order-0 pages backing PTEs. > I don't want user-space to DoS the system by > triggering stop_machine() when mapping/unmapping hugetlbfs pages. The operations are privileged, and each HVO or de-HVO request would require at least one stop_machine(). So in theory, a privileged user still can cause DoS. > If I did the maths right, for a 2MB hugetlb page, we have about 8 > vmemmap pages (32K). Once we split a 2MB vmemap range, Correct. > whatever else > needs to be touched in this range won't require a stop_machine(). There might be some misunderstandings here. To do HVO: 1. we split a PMD into 512 PTEs; 2. for every 8 PTEs: 2a. we allocate an order-0 page for PTE #0; 2b. we remap PTE #0 *RW* to this page; 2c. we remap PTEs #1-7 *RO* to this page; 2d. we free the original order-3 page. To do de-HVO: 1. for every 8 PTEs: 1a. we allocate 7 order-0 pages. 1b. we remap PTEs #1-7 *RW* to those pages, respectively. We can in theory restore the original PTE or even PMD mappings at an acceptable success rate by making changes on the MM side, e.g., only allow movable allocations in the area backing the original PMD. Again, we don't do it for now because high-order pages are not as easy to allocate. > > > Just to make sure I understand - is the goal to be able to free struct > > > pages corresponding to hugetlbfs pages? > > > > Correct, if you are referring to the pages holding struct page[]. > > > > > Can we not leave the vmemmap in > > > place and just release that memory to the page allocator? > > > > We cannot, since the goal is to reuse those pages for something else, > > i.e., reduce the metadata overhead for hugeTLB. > > What I meant is that we can leave the vmemmap alias in place and just > reuse those pages via the linear map etc. The kernel should touch those > struct pages to corrupt the data. The only problem would be if we > physically unplug those pages but I don't think that's the case here. Set the repercussions of memory corruption aside, we still can't do this because PTEs #1-7 need to map meaningful data, hence step 2c above.
On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote: > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > If I did the maths right, for a 2MB hugetlb page, we have about 8 > > vmemmap pages (32K). Once we split a 2MB vmemap range, > > Correct. > > > whatever else > > needs to be touched in this range won't require a stop_machine(). > > There might be some misunderstandings here. > > To do HVO: > 1. we split a PMD into 512 PTEs; > 2. for every 8 PTEs: > 2a. we allocate an order-0 page for PTE #0; > 2b. we remap PTE #0 *RW* to this page; > 2c. we remap PTEs #1-7 *RO* to this page; > 2d. we free the original order-3 page. Thanks. I now remember why we reverted such support in 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main problem is that point 2c also changes the output address of the PTE (and the content of the page slightly). The architecture requires a break-before-make in such scenario, though it would have been nice if it was more specific on what could go wrong. We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I assume these 8 vmemmap pages may be accessed and that's why we can't do a break-before-make safely. I was wondering whether we could make the PTEs RO first and then change the output address but we have another rule that the content of the page should be the same. I don't think entries 1-7 are identical to entry 0 (though we could ask the architects for clarification here). Also, can we guarantee that nothing writes to entry 0 while we would do such remapping? We know entries 1-7 won't be written as we mapped them as RO but entry 0 contains the head page. Maybe it's ok to map it RO temporarily until the newly allocated hugetlb page is returned. If we could get the above work, it would be a lot simpler than thinking of stop_machine() or other locks to wait for such remapping. > To do de-HVO: > 1. for every 8 PTEs: > 1a. we allocate 7 order-0 pages. > 1b. we remap PTEs #1-7 *RW* to those pages, respectively. Similar problem in 1.b, changing the output address. Here we could force the content to be the same and remap PTEs 1-7 RO first to the new page, turn them RW afterwards and it's all compliant with the architecture (even without FEAT_BBM). > > What I meant is that we can leave the vmemmap alias in place and just > > reuse those pages via the linear map etc. The kernel should touch those > > struct pages to corrupt the data. The only problem would be if we > > physically unplug those pages but I don't think that's the case here. > > Set the repercussions of memory corruption aside, we still can't do > this because PTEs #1-7 need to map meaningful data, hence step 2c > above. Yeah, I missed this one.
On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote: > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > If I did the maths right, for a 2MB hugetlb page, we have about 8 > > > vmemmap pages (32K). Once we split a 2MB vmemap range, > > > > Correct. > > > > > whatever else > > > needs to be touched in this range won't require a stop_machine(). > > > > There might be some misunderstandings here. > > > > To do HVO: > > 1. we split a PMD into 512 PTEs; > > 2. for every 8 PTEs: > > 2a. we allocate an order-0 page for PTE #0; > > 2b. we remap PTE #0 *RW* to this page; > > 2c. we remap PTEs #1-7 *RO* to this page; > > 2d. we free the original order-3 page. > > Thanks. I now remember why we reverted such support in 060a2c92d1b6 > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main > problem is that point 2c also changes the output address of the PTE > (and the content of the page slightly). The architecture requires a > break-before-make in such scenario, though it would have been nice if it > was more specific on what could go wrong. > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I > assume these 8 vmemmap pages may be accessed and that's why we can't do > a break-before-make safely. Correct > I was wondering whether we could make the > PTEs RO first and then change the output address but we have another > rule that the content of the page should be the same. I don't think > entries 1-7 are identical to entry 0 (though we could ask the architects > for clarification here). Also, can we guarantee that nothing writes to > entry 0 while we would do such remapping? Yes, it's already guaranteed. > We know entries 1-7 won't be > written as we mapped them as RO but entry 0 contains the head page. > Maybe it's ok to map it RO temporarily until the newly allocated hugetlb > page is returned. We can do that. I don't understand how this could elide BBM. After the above, we would still need to do: 3. remap entry 0 from RO to RW, mapping the `struct page` page that will be shared with entry 1-7. 4. remap entry 1-7 from their respective `struct page` pages to that of entry 0, while they remain RO. > If we could get the above work, it would be a lot simpler than thinking > of stop_machine() or other locks to wait for such remapping. Steps 3/4 would not require BBM somehow? > > To do de-HVO: > > 1. for every 8 PTEs: > > 1a. we allocate 7 order-0 pages. > > 1b. we remap PTEs #1-7 *RW* to those pages, respectively. > > Similar problem in 1.b, changing the output address. Here we could force > the content to be the same I don't follow the "the content to be the same" part. After HVO, we have: Entry 0 -> `struct page` page A, RW Entry 1 -> `struct page` page A, RO ... Entry 7 -> `struct page` page A, RO To de-HVO, we need to make them: Entry 0 -> `struct page` page A, RW Entry 1 -> `struct page` page B, RW ... Entry 7 -> `struct page` page H, RW I assume the same content means PTE_0 == PTE_1/.../7? > and remap PTEs 1-7 RO first to the new page, > turn them RW afterwards and it's all compliant with the architecture > (even without FEAT_BBM). It'd be great if we could do that. I don't fully understand it though, at the moment.
On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote: > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote: > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8 > > > > vmemmap pages (32K). Once we split a 2MB vmemap range, > > > > > > Correct. > > > > > > > whatever else > > > > needs to be touched in this range won't require a stop_machine(). > > > > > > There might be some misunderstandings here. > > > > > > To do HVO: > > > 1. we split a PMD into 512 PTEs; > > > 2. for every 8 PTEs: > > > 2a. we allocate an order-0 page for PTE #0; > > > 2b. we remap PTE #0 *RW* to this page; > > > 2c. we remap PTEs #1-7 *RO* to this page; > > > 2d. we free the original order-3 page. > > > > Thanks. I now remember why we reverted such support in 060a2c92d1b6 > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main > > problem is that point 2c also changes the output address of the PTE > > (and the content of the page slightly). The architecture requires a > > break-before-make in such scenario, though it would have been nice if it > > was more specific on what could go wrong. > > > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I > > assume these 8 vmemmap pages may be accessed and that's why we can't do > > a break-before-make safely. > > Correct > > > I was wondering whether we could make the > > PTEs RO first and then change the output address but we have another > > rule that the content of the page should be the same. I don't think > > entries 1-7 are identical to entry 0 (though we could ask the architects > > for clarification here). Also, can we guarantee that nothing writes to > > entry 0 while we would do such remapping? > > Yes, it's already guaranteed. > > > We know entries 1-7 won't be > > written as we mapped them as RO but entry 0 contains the head page. > > Maybe it's ok to map it RO temporarily until the newly allocated hugetlb > > page is returned. > > We can do that. I don't understand how this could elide BBM. After the > above, we would still need to do: > 3. remap entry 0 from RO to RW, mapping the `struct page` page that > will be shared with entry 1-7. > 4. remap entry 1-7 from their respective `struct page` pages to that > of entry 0, while they remain RO. The Arm ARM states that we need a BBM if we change the output address and: the old or new mappings are RW *or* the content of the page changes. Ignoring the latter (page content), we can turn the PTEs RO first without changing the pfn followed by changing the pfn while they are RO. Once that's done, we make entry 0 RW and, of course, with additional TLBIs between all these steps. Can we leave entry 0 RO? This would save an additional TLBI. Now, I wonder if all this is worth it. What are the scenarios where the 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB hugetlb page for example is pretty well defined - 8 x 4K pages, aligned. > > If we could get the above work, it would be a lot simpler than thinking > > of stop_machine() or other locks to wait for such remapping. > > Steps 3/4 would not require BBM somehow? If we ignore the 'content' requirement, I think we could skip the BBM but we need to make sure we don't change the permission and pfn at the same time. > > > To do de-HVO: > > > 1. for every 8 PTEs: > > > 1a. we allocate 7 order-0 pages. > > > 1b. we remap PTEs #1-7 *RW* to those pages, respectively. > > > > Similar problem in 1.b, changing the output address. Here we could force > > the content to be the same > > I don't follow the "the content to be the same" part. After HVO, we have: > > Entry 0 -> `struct page` page A, RW > Entry 1 -> `struct page` page A, RO > ... > Entry 7 -> `struct page` page A, RO > > To de-HVO, we need to make them: > > Entry 0 -> `struct page` page A, RW > Entry 1 -> `struct page` page B, RW > ... > Entry 7 -> `struct page` page H, RW > > I assume the same content means PTE_0 == PTE_1/.../7? That's the content of the page at the corresponding pfn before and after the pte change. I'm pretty sure the Arm ARM states this in case the hardware starts a load (e.g. unaligned) from one page and completes it from another, the software should not see any difference. But for the fields we care about in struct page, I assume they'd be the same (or that we just don't care about inconsistencies during this transient period).
On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote: > > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas > > <catalin.marinas@arm.com> wrote: > > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote: > > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8 > > > > > vmemmap pages (32K). Once we split a 2MB vmemap range, > > > > > > > > Correct. > > > > > > > > > whatever else > > > > > needs to be touched in this range won't require a stop_machine(). > > > > > > > > There might be some misunderstandings here. > > > > > > > > To do HVO: > > > > 1. we split a PMD into 512 PTEs; > > > > 2. for every 8 PTEs: > > > > 2a. we allocate an order-0 page for PTE #0; > > > > 2b. we remap PTE #0 *RW* to this page; > > > > 2c. we remap PTEs #1-7 *RO* to this page; > > > > 2d. we free the original order-3 page. > > > > > > Thanks. I now remember why we reverted such support in 060a2c92d1b6 > > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main > > > problem is that point 2c also changes the output address of the PTE > > > (and the content of the page slightly). The architecture requires a > > > break-before-make in such scenario, though it would have been nice if it > > > was more specific on what could go wrong. > > > > > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I > > > assume these 8 vmemmap pages may be accessed and that's why we can't do > > > a break-before-make safely. > > > > Correct > > > > > I was wondering whether we could make the > > > PTEs RO first and then change the output address but we have another > > > rule that the content of the page should be the same. I don't think > > > entries 1-7 are identical to entry 0 (though we could ask the architects > > > for clarification here). Also, can we guarantee that nothing writes to > > > entry 0 while we would do such remapping? > > > > Yes, it's already guaranteed. > > > > > We know entries 1-7 won't be > > > written as we mapped them as RO but entry 0 contains the head page. > > > Maybe it's ok to map it RO temporarily until the newly allocated hugetlb > > > page is returned. > > > > We can do that. I don't understand how this could elide BBM. After the > > above, we would still need to do: > > 3. remap entry 0 from RO to RW, mapping the `struct page` page that > > will be shared with entry 1-7. > > 4. remap entry 1-7 from their respective `struct page` pages to that > > of entry 0, while they remain RO. > > The Arm ARM states that we need a BBM if we change the output address > and: the old or new mappings are RW *or* the content of the page > changes. Ignoring the latter (page content), we can turn the PTEs RO > first without changing the pfn followed by changing the pfn while they > are RO. Once that's done, we make entry 0 RW and, of course, with > additional TLBIs between all these steps. Aha! This is easy to do -- I just made the RO guaranteed, as I mentioned earlier. Just to make sure I fully understand the workflow: 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area. 2. TLBI once, after pmd_populate_kernel() 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8 PTEs, while they remain RO. 4. TLBI once, after set_pte_at() on PTE 1-7. 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area. 6. TLBI once, after set_pte_at() on PTE 0. No BBM required, regardless of FEAT_BBM level 2. Is this correct? > Can we leave entry 0 RO? This > would save an additional TLBI. Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a refcnt on any hugeTLB pages. > Now, I wonder if all this is worth it. What are the scenarios where the > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned. > > > > If we could get the above work, it would be a lot simpler than thinking > > > of stop_machine() or other locks to wait for such remapping. > > > > Steps 3/4 would not require BBM somehow? > > If we ignore the 'content' requirement, I think we could skip the BBM > but we need to make sure we don't change the permission and pfn at the > same time. Gotcha. > > > > To do de-HVO: > > > > 1. for every 8 PTEs: > > > > 1a. we allocate 7 order-0 pages. > > > > 1b. we remap PTEs #1-7 *RW* to those pages, respectively. > > > > > > Similar problem in 1.b, changing the output address. Here we could force > > > the content to be the same > > > > I don't follow the "the content to be the same" part. After HVO, we have: > > > > Entry 0 -> `struct page` page A, RW > > Entry 1 -> `struct page` page A, RO > > ... > > Entry 7 -> `struct page` page A, RO > > > > To de-HVO, we need to make them: > > > > Entry 0 -> `struct page` page A, RW > > Entry 1 -> `struct page` page B, RW > > ... > > Entry 7 -> `struct page` page H, RW > > > > I assume the same content means PTE_0 == PTE_1/.../7? > > That's the content of the page at the corresponding pfn before and after > the pte change. I'm pretty sure the Arm ARM states this in case the > hardware starts a load (e.g. unaligned) from one page and completes it > from another, the software should not see any difference. But for the > fields we care about in struct page, I assume they'd be the same (or > that we just don't care about inconsistencies during this transient > period). Thanks for the explanation. I'll cook up something if my understanding above is correct.
On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote: > > > On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas > > > <catalin.marinas@arm.com> wrote: > > > > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote: > > > > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > If I did the maths right, for a 2MB hugetlb page, we have about 8 > > > > > > vmemmap pages (32K). Once we split a 2MB vmemap range, > > > > > > > > > > Correct. > > > > > > > > > > > whatever else > > > > > > needs to be touched in this range won't require a stop_machine(). > > > > > > > > > > There might be some misunderstandings here. > > > > > > > > > > To do HVO: > > > > > 1. we split a PMD into 512 PTEs; > > > > > 2. for every 8 PTEs: > > > > > 2a. we allocate an order-0 page for PTE #0; > > > > > 2b. we remap PTE #0 *RW* to this page; > > > > > 2c. we remap PTEs #1-7 *RO* to this page; > > > > > 2d. we free the original order-3 page. > > > > > > > > Thanks. I now remember why we reverted such support in 060a2c92d1b6 > > > > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main > > > > problem is that point 2c also changes the output address of the PTE > > > > (and the content of the page slightly). The architecture requires a > > > > break-before-make in such scenario, though it would have been nice if it > > > > was more specific on what could go wrong. > > > > > > > > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I > > > > assume these 8 vmemmap pages may be accessed and that's why we can't do > > > > a break-before-make safely. > > > > > > Correct > > > > > > > I was wondering whether we could make the > > > > PTEs RO first and then change the output address but we have another > > > > rule that the content of the page should be the same. I don't think > > > > entries 1-7 are identical to entry 0 (though we could ask the architects > > > > for clarification here). Also, can we guarantee that nothing writes to > > > > entry 0 while we would do such remapping? > > > > > > Yes, it's already guaranteed. > > > > > > > We know entries 1-7 won't be > > > > written as we mapped them as RO but entry 0 contains the head page. > > > > Maybe it's ok to map it RO temporarily until the newly allocated hugetlb > > > > page is returned. > > > > > > We can do that. I don't understand how this could elide BBM. After the > > > above, we would still need to do: > > > 3. remap entry 0 from RO to RW, mapping the `struct page` page that > > > will be shared with entry 1-7. > > > 4. remap entry 1-7 from their respective `struct page` pages to that > > > of entry 0, while they remain RO. > > > > The Arm ARM states that we need a BBM if we change the output address > > and: the old or new mappings are RW *or* the content of the page > > changes. Ignoring the latter (page content), we can turn the PTEs RO > > first without changing the pfn followed by changing the pfn while they > > are RO. Once that's done, we make entry 0 RW and, of course, with > > additional TLBIs between all these steps. > > Aha! This is easy to do -- I just made the RO guaranteed, as I > mentioned earlier. > > Just to make sure I fully understand the workflow: > > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area. > 2. TLBI once, after pmd_populate_kernel() > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8 > PTEs, while they remain RO. > 4. TLBI once, after set_pte_at() on PTE 1-7. > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area. > 6. TLBI once, after set_pte_at() on PTE 0. > > No BBM required, regardless of FEAT_BBM level 2. I just studied D8.16.1 from the reference manual, and it seems to me: 1. We still need either FEAT_BBM or BBM to split PMD. 2. We still need BBM when we change PTE 1-7, because even if they remain RO, the content of the `struct page` page at the new location does not match that at the old location. > Is this correct? > > > Can we leave entry 0 RO? This > > would save an additional TLBI. > > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a > refcnt on any hugeTLB pages. > > > Now, I wonder if all this is worth it. What are the scenarios where the > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned. One of the fundamental assumptions in core MM is that anyone can read or try to grab (write) a refcnt from any `struct page`. Those speculative PFN walkers include memory compaction, etc. > > > > If we could get the above work, it would be a lot simpler than thinking > > > > of stop_machine() or other locks to wait for such remapping. > > > > > > Steps 3/4 would not require BBM somehow? > > > > If we ignore the 'content' requirement, I think we could skip the BBM > > but we need to make sure we don't change the permission and pfn at the > > same time. > > Gotcha. > > > > > > To do de-HVO: > > > > > 1. for every 8 PTEs: > > > > > 1a. we allocate 7 order-0 pages. > > > > > 1b. we remap PTEs #1-7 *RW* to those pages, respectively. > > > > > > > > Similar problem in 1.b, changing the output address. Here we could force > > > > the content to be the same > > > > > > I don't follow the "the content to be the same" part. After HVO, we have: > > > > > > Entry 0 -> `struct page` page A, RW > > > Entry 1 -> `struct page` page A, RO > > > ... > > > Entry 7 -> `struct page` page A, RO > > > > > > To de-HVO, we need to make them: > > > > > > Entry 0 -> `struct page` page A, RW > > > Entry 1 -> `struct page` page B, RW > > > ... > > > Entry 7 -> `struct page` page H, RW > > > > > > I assume the same content means PTE_0 == PTE_1/.../7? > > > > That's the content of the page at the corresponding pfn before and after > > the pte change. I'm pretty sure the Arm ARM states this in case the > > hardware starts a load (e.g. unaligned) from one page and completes it > > from another, the software should not see any difference. But for the > > fields we care about in struct page, I assume they'd be the same (or > > that we just don't care about inconsistencies during this transient > > period). > > Thanks for the explanation. I'll cook up something if my understanding > above is correct.
On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote: > On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > The Arm ARM states that we need a BBM if we change the output address > > > and: the old or new mappings are RW *or* the content of the page > > > changes. Ignoring the latter (page content), we can turn the PTEs RO > > > first without changing the pfn followed by changing the pfn while they > > > are RO. Once that's done, we make entry 0 RW and, of course, with > > > additional TLBIs between all these steps. > > > > Aha! This is easy to do -- I just made the RO guaranteed, as I > > mentioned earlier. > > > > Just to make sure I fully understand the workflow: > > > > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area. I don't think we can turn all of them RO here since some of those 512 PTEs are not related to the hugetlb page. So you'd need to keep them RW but preserving the pfn so that there's no actual translation change. I think that's covered by FEAT_BBM level 2. Basically this step should be only about breaking up a PMD block entry into a table entry. > > 2. TLBI once, after pmd_populate_kernel() > > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8 > > PTEs, while they remain RO. You may need some intermediate step to turn these PTEs read-only since step 1 should leave them RW. Also, if we want to free and order-3 page here, it might be better to allocate an order 0 even for PTE entry 0 (I had the impression that's what the core code does, I haven't checked). > > 4. TLBI once, after set_pte_at() on PTE 1-7. > > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area. > > 6. TLBI once, after set_pte_at() on PTE 0. > > > > No BBM required, regardless of FEAT_BBM level 2. > > I just studied D8.16.1 from the reference manual, and it seems to me: > 1. We still need either FEAT_BBM or BBM to split PMD. Yes. > 2. We still need BBM when we change PTE 1-7, because even if they > remain RO, the content of the `struct page` page at the new location > does not match that at the old location. Yes, in theory, the data at the new pfn should be the same. We could try to get clarification from the architects on what could go wrong but I suspect it's some atomicity is not guarantee if you read the data (the CPU getting confused whether to read from the old or the new page). Otherwise, since after all these steps PTEs 1-7 point to the same data as PTE 0, before step 3 we could copy the data in page 0 over to the other 7 pages while entries 1-7 are still RO. The remapping afterwards would be fully compliant. > > > Can we leave entry 0 RO? This would save an additional TLBI. > > > > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a > > refcnt on any hugeTLB pages. OK, fair enough. > > > Now, I wonder if all this is worth it. What are the scenarios where the > > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB > > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned. > > One of the fundamental assumptions in core MM is that anyone can > read or try to grab (write) a refcnt from any `struct page`. Those > speculative PFN walkers include memory compaction, etc. But how does this work if PTEs 1-7 are RO? Do those walkers detect it's a tail page and skip it. Actually, if they all point to the same vmemmap page, how can one distinguish a tail page via PTE 1 from the head page via PTE 0? BTW, I'll be on holiday from tomorrow for two weeks and won't be able to follow up on this thread (and likely to forget all the discussion by the time I get back ;)).
On Thu, Jul 11, 2024 at 5:39 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote: > > On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao@google.com> wrote: > > > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > The Arm ARM states that we need a BBM if we change the output address > > > > and: the old or new mappings are RW *or* the content of the page > > > > changes. Ignoring the latter (page content), we can turn the PTEs RO > > > > first without changing the pfn followed by changing the pfn while they > > > > are RO. Once that's done, we make entry 0 RW and, of course, with > > > > additional TLBIs between all these steps. > > > > > > Aha! This is easy to do -- I just made the RO guaranteed, as I > > > mentioned earlier. > > > > > > Just to make sure I fully understand the workflow: > > > > > > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area. > > I don't think we can turn all of them RO here since some of those 512 > PTEs are not related to the hugetlb page. So you'd need to keep them RW > but preserving the pfn so that there's no actual translation change. I > think that's covered by FEAT_BBM level 2. Basically this step should be > only about breaking up a PMD block entry into a table entry. Ack. > > > 2. TLBI once, after pmd_populate_kernel() > > > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8 > > > PTEs, while they remain RO. > > You may need some intermediate step to turn these PTEs read-only since > step 1 should leave them RW. Also, if we want to free and order-3 page > here, it might be better to allocate an order 0 even for PTE entry 0 (I > had the impression that's what the core code does, I haven't checked). Ack. > > > 4. TLBI once, after set_pte_at() on PTE 1-7. > > > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area. > > > 6. TLBI once, after set_pte_at() on PTE 0. > > > > > > No BBM required, regardless of FEAT_BBM level 2. > > > > I just studied D8.16.1 from the reference manual, and it seems to me: > > 1. We still need either FEAT_BBM or BBM to split PMD. > > Yes. Also, I want to confirm my understanding of "changing table size" from the reference manual: in our case, it means splitting a PMD into 512 PTEs with the same permission and OA. If we change the permission *or* OA, we still need to do BBM even with FEAT_BBM level 2. Is this correct? > > 2. We still need BBM when we change PTE 1-7, because even if they > > remain RO, the content of the `struct page` page at the new location > > does not match that at the old location. > > Yes, in theory, the data at the new pfn should be the same. We could try > to get clarification from the architects on what could go wrong but I > suspect it's some atomicity is not guarantee if you read the data (the > CPU getting confused whether to read from the old or the new page). > > Otherwise, since after all these steps PTEs 1-7 point to the same data > as PTE 0, before step 3 we could copy the data in page 0 over to the > other 7 pages while entries 1-7 are still RO. The remapping afterwards > would be fully compliant. Correct -- we do need to copy to make it fully compliant because the core MM doesn't guarantee that. The core MM only guarantees fields (of struct page) required for speculative PFN walkers to function correctly have the same value for all tail pages within a compound page. Non-correctness related fields in theory can have different values for those tail pages. > > > > Can we leave entry 0 RO? This would save an additional TLBI. > > > > > > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a > > > refcnt on any hugeTLB pages. > > OK, fair enough. > > > > > Now, I wonder if all this is worth it. What are the scenarios where the > > > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB > > > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned. > > > > One of the fundamental assumptions in core MM is that anyone can > > read or try to grab (write) a refcnt from any `struct page`. Those > > speculative PFN walkers include memory compaction, etc. > > But how does this work if PTEs 1-7 are RO? Do those walkers detect it's > a tail page and skip it. Correct. > Actually, if they all point to the same vmemmap > page, how can one distinguish a tail page via PTE 1 from the head page > via PTE 0? Two of the correctness related fields are page->_refcount and page->compound_head: 1. _refcount is the only one that can be speculatively updated. Speculative walkers are not allowed to update other fields unless they can grab a refcount. All tail pages must have zero refcount. 2. compound_head speculatively indicates whether a page is head or tail, and if it's tail, its head can be extracted by compound_head(). Since a head can have non-zero refcount, after PTEs 1-7 are remapped to PTE 0, we need a way to prevent speculative walkers from mistaking the first tail for each PTE 1-7 for the head and trying to grab their refcount. This is done by page_is_fake_head() returning true, which relies on the following sequence on. On the writer side: 2a. init compound_head 2b. reset _refcount to 0 2c. synchronize_rcu() 2d. remap PTEs 1-7 to PTE 0 2e. inc _refcount Speculative readers of the first tails respectively at PTEs 1-7 either see refcount being 0 or page_is_fake_head() being true. > BTW, I'll be on holiday from tomorrow for two weeks and won't be able to > follow up on this thread (and likely to forget all the discussion by the > time I get back ;)). Thanks for the heads-up!