Message ID: 20211110105428.32458-1-zhengqi.arch@bytedance.com (mailing list archive)
Series: Free user PTE page table pages
On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: > In this patch series, we add a pte_refcount field to the struct page of page > table to track how many users of PTE page table. Similar to the mechanism of > page refcount, the user of PTE page table should hold a refcount to it before > accessing. The PTE page table page will be freed when the last refcount is > dropped. So, this approach basically adds two atomics on every PTE map. If I have it right, the reason that zap cannot clean the PTEs today is because zap cannot obtain the mmap lock due to a lock ordering issue with the inode lock vs mmap lock. If it could obtain the mmap lock then it could do the zap using the write side as unmapping a vma does. Rather than adding a new "lock" to every PTE I wonder if it would be more efficient to break up the mmap lock and introduce a specific rwsem for the page table itself, in addition to the PTL. Currently the mmap lock is protecting both the vma list and the page table. I think that would allow the lock ordering issue to be resolved and zap could obtain a page table rwsem. Compared to two atomics per PTE this would just be two atomics per page table walk operation, it is conceptually a lot simpler, and would allow freeing all the page table levels, not just PTEs. ? Jason
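For reference, the scheme the cover letter describes amounts to a get/put pair around every PTE access, which is where the two atomics per map come from. A minimal sketch, assuming the new pte_refcount field is an atomic_t in struct page; pte_try_get(), pte_put() and free_pte_table() are hypothetical names, not the series' actual API:

static inline bool pte_try_get(struct page *pte_page)
{
	/* First atomic: fails once the count already hit zero,
	 * i.e. the PTE page table is on its way to being freed. */
	return atomic_inc_not_zero(&pte_page->pte_refcount);
}

static inline void pte_put(struct page *pte_page)
{
	/* Second atomic: dropping the last reference frees the page. */
	if (atomic_dec_and_test(&pte_page->pte_refcount))
		free_pte_table(pte_page);	/* hypothetical helper */
}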
On 10.11.21 13:56, Jason Gunthorpe wrote: > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: > >> In this patch series, we add a pte_refcount field to the struct page of page >> table to track how many users of PTE page table. Similar to the mechanism of >> page refcount, the user of PTE page table should hold a refcount to it before >> accessing. The PTE page table page will be freed when the last refcount is >> dropped. > > So, this approach basically adds two atomics on every PTE map > > If I have it right the reason that zap cannot clean the PTEs today is > because zap cannot obtain the mmap lock due to a lock ordering issue > with the inode lock vs mmap lock. There are different ways to zap: madvise(DONTNEED) vs fallocate(PUNCH_HOLE). It depends on "from where" we're actually coming: a process page table walker or the rmap. The way locking currently works doesn't allow removing a page table just by holding the mmap lock, not even in write mode. You'll also need to hold the respective rmap locks -- which implies that reclaiming page tables crossing VMAs is "problematic". Take a look at khugepaged which has to play quite some tricks to remove a page table. And there are other ways we can create empty page tables via the rmap, like reclaim/writeback, although they are rather a secondary concern mostly. > > If it could obtain the mmap lock then it could do the zap using the > write side as unmapping a vma does. > > Rather than adding a new "lock" to ever PTE I wonder if it would be > more efficient to break up the mmap lock and introduce a specific > rwsem for the page table itself, in addition to the PTL. Currently the > mmap lock is protecting both the vma list and the page table. There is the rmap side of things as well. At least the rmap won't reclaim or alloc/free page tables, but it will walk page tables while holding the respective rmap lock. > > I think that would allow the lock ordering issue to be resolved and > zap could obtain a page table rwsem. > > Compared to two atomics per PTE this would just be two atomic per > page table walk operation, it is conceptually a lot simpler, and would > allow freeing all the page table levels, not just PTEs. Another alternative is to not do it in the kernel automatically, but instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get called by user space explicitly once it's reasonable. While this will work for the obvious madvise(DONTNEED) users -- like memory allocators -- that zap memory, it's a bit more complicated once shared memory is involved and we fallocate(PUNCH_HOLE) memory. But it would at least work for many use cases that want to optimize memory consumption for sparse memory mappings. Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs are most certainly the most important piece.
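In user space, the proposal would be used roughly like this; MADV_CLEANUP_PGTABLE is David's hypothetical advice value and the constant below is made up, nothing like it exists in the UAPI today:

#include <sys/mman.h>

#define MADV_CLEANUP_PGTABLE 0x100	/* hypothetical value, not in any UAPI */

static void release_range(void *buf, size_t len)
{
	/* Zap the pages; empty PTE page tables may be left behind. */
	madvise(buf, len, MADV_DONTNEED);
	/* Explicitly ask the kernel to reclaim the now-empty tables. */
	madvise(buf, len, MADV_CLEANUP_PGTABLE);
}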
On 11/10/21 8:56 PM, Jason Gunthorpe wrote: > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: > >> In this patch series, we add a pte_refcount field to the struct page of page >> table to track how many users of PTE page table. Similar to the mechanism of >> page refcount, the user of PTE page table should hold a refcount to it before >> accessing. The PTE page table page will be freed when the last refcount is >> dropped. > > So, this approach basically adds two atomics on every PTE map > > If I have it right the reason that zap cannot clean the PTEs today is > because zap cannot obtain the mmap lock due to a lock ordering issue > with the inode lock vs mmap lock. Currently, both MADV_DONTNEED and MADV_FREE obtain the read side of mmap_lock instead of write side, which is the reason that jemalloc/tcmalloc prefer to use madvise() to release physical memory. > > If it could obtain the mmap lock then it could do the zap using the > write side as unmapping a vma does. Even if it obtains the write side of mmap_lock, how to make sure that all the page table entries are empty? Traverse 512 entries every time? > > Rather than adding a new "lock" to ever PTE I wonder if it would be > more efficient to break up the mmap lock and introduce a specific > rwsem for the page table itself, in addition to the PTL. Currently the > mmap lock is protecting both the vma list and the page table. Now each level of page table has its own spin lock. Can you explain the working mechanism of this special rwsem more clearly? If we can reduce the protection range of mmap_lock, it is indeed a great thing, but I think it is very difficult, and it will not solve the problem of how to check that all entries in the page table page are empty. > > I think that would allow the lock ordering issue to be resolved and > zap could obtain a page table rwsem. > > Compared to two atomics per PTE this would just be two atomic per > page table walk operation, it is conceptually a lot simpler, and would > allow freeing all the page table levels, not just PTEs. The reason why only the PTE page is released now is that it is the largest. This reference count can actually be used for other levels of page tables. > > ? > > Jason >
On 11/10/21 9:25 PM, David Hildenbrand wrote: > On 10.11.21 13:56, Jason Gunthorpe wrote: >> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: >> >>> In this patch series, we add a pte_refcount field to the struct page of page >>> table to track how many users of PTE page table. Similar to the mechanism of >>> page refcount, the user of PTE page table should hold a refcount to it before >>> accessing. The PTE page table page will be freed when the last refcount is >>> dropped. >> >> So, this approach basically adds two atomics on every PTE map >> >> If I have it right the reason that zap cannot clean the PTEs today is >> because zap cannot obtain the mmap lock due to a lock ordering issue >> with the inode lock vs mmap lock. > > There are different ways to zap: madvise(DONTNEED) vs > fallocate(PUNCH_HOLE). It depends on "from where" we're actually > comming: a process page table walker or the rmap. > > The way locking currently works doesn't allow to remove a page table > just by holding the mmap lock, not even in write mode. You'll also need > to hold the respective rmap locks -- which implies that reclaiming apge > tables crossing VMAs is "problematic". Take a look at khugepaged which > has to play quite some tricks to remove a page table. > > And there are other ways we can create empty page tables via the rmap, > like reclaim/writeback, although they are rather a secondary concern mostly. > >> >> If it could obtain the mmap lock then it could do the zap using the >> write side as unmapping a vma does. >> >> Rather than adding a new "lock" to ever PTE I wonder if it would be >> more efficient to break up the mmap lock and introduce a specific >> rwsem for the page table itself, in addition to the PTL. Currently the >> mmap lock is protecting both the vma list and the page table. > > There is the rmap side of things as well. At least the rmap won't > reclaim alloc/free page tables, but it will walk page tables while > holding the respective rmap lock. > >> >> I think that would allow the lock ordering issue to be resolved and >> zap could obtain a page table rwsem. >> >> Compared to two atomics per PTE this would just be two atomic per >> page table walk operation, it is conceptually a lot simpler, and would >> allow freeing all the page table levels, not just PTEs. > > Another alternative is to not do it in the kernel automatically, but > instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get > called by user space explicitly once it's reasonable. While this will > work for the obvious madvise(DONTNEED) users -- like memory allocators > -- that zap memory, it's a bit more complicated once shared memory is > involved and we're fallocate(PUNCH_HOLE) memory. But it would at least > work for many use cases that want to optimize memory consumption for > sparse memory mappings. > > Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area > will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs > are most certainly the most important part piece. > Totally agree! Thanks, Qi
On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote: > On 10.11.21 13:56, Jason Gunthorpe wrote: > > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: > > > >> In this patch series, we add a pte_refcount field to the struct page of page > >> table to track how many users of PTE page table. Similar to the mechanism of > >> page refcount, the user of PTE page table should hold a refcount to it before > >> accessing. The PTE page table page will be freed when the last refcount is > >> dropped. > > > > So, this approach basically adds two atomics on every PTE map > > > > If I have it right the reason that zap cannot clean the PTEs today is > > because zap cannot obtain the mmap lock due to a lock ordering issue > > with the inode lock vs mmap lock. > > There are different ways to zap: madvise(DONTNEED) vs > fallocate(PUNCH_HOLE). It depends on "from where" we're actually > comming: a process page table walker or the rmap. AFAIK rmap is the same issue, it can't lock the mmap_sem > The way locking currently works doesn't allow to remove a page table > just by holding the mmap lock, not even in write mode. I'm not sure I understand this. If the goal is to free the PTE tables then the main concern is use-after-free on page table walkers (which this series is addressing). Ignoring bugs, we have only three ways to read the page table:
- Fully locked. Under the PTLs (gup slow is an example)
- Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
- hw-locked. Barriered with TLB flush (gup fast is an example)
#1 should be completely safe as the PTLs will protect everything. #2 is safe so long as the write side is held during any layout changes. #3 interacts with the TLB flush, and is also safe with zap. rmap itself is a #1 page table walker, ie it gets the PTLs under page_vma_mapped_walk(). The sin we have committed here is that both the mmap lock and the PTLs are being used to protect the page table itself with a very complicated dual semantic. Splitting the sleeping mmap lock into 'covers vma' and 'covers page tables' lets us solve the lock ordering and semi-locked can become more fully locked by the new lock, instead of by abusing mmap sem. I'd suggest to make this new lock a special rwsem which allows either concurrent read access OR concurrent PTL access, but not both. This way we don't degrade performance of the split PTLs, *and* when something needs to change the page table structure it has a way to properly exclude all the #2 lockless readers. So every touch to the page table starts by obtaining this new lock, depending on the access mode to be used (PTL vs lockless read). We can keep the existing THP logic where a leaf PMD can be transformed to a non-leaf PMD in the semi-locked case, but the case where a non-leaf PMD is transformed to a leaf PMD has to take the lock. Jason
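A sketch of the split being proposed here; the pgtable_rwsem field and the free_empty_page_tables() helper are hypothetical, nothing like them exists in mm_struct today:

/*
 * Hypothetical split: mm->mmap_lock keeps covering the VMA list, while a
 * new mm->pgtable_rwsem covers the page table structure itself. zap can
 * then take the page table lock for write, excluding the #2 semi-locked
 * readers, without the inode lock vs mmap lock ordering problem.
 */
static void zap_and_free_page_tables(struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	down_write(&mm->pgtable_rwsem);		/* hypothetical field */
	free_empty_page_tables(mm, start, end);	/* hypothetical helper */
	up_write(&mm->pgtable_rwsem);
}

Semi-locked walkers would take the same rwsem for read instead of abusing the mmap lock.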
On 10.11.21 15:38, Jason Gunthorpe wrote: > On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote: >> On 10.11.21 13:56, Jason Gunthorpe wrote: >>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote: >>> >>>> In this patch series, we add a pte_refcount field to the struct page of page >>>> table to track how many users of PTE page table. Similar to the mechanism of >>>> page refcount, the user of PTE page table should hold a refcount to it before >>>> accessing. The PTE page table page will be freed when the last refcount is >>>> dropped. >>> >>> So, this approach basically adds two atomics on every PTE map >>> >>> If I have it right the reason that zap cannot clean the PTEs today is >>> because zap cannot obtain the mmap lock due to a lock ordering issue >>> with the inode lock vs mmap lock. >> >> There are different ways to zap: madvise(DONTNEED) vs >> fallocate(PUNCH_HOLE). It depends on "from where" we're actually >> comming: a process page table walker or the rmap. > > AFAIK rmap is the same issue, it can't lock the mmap_sem > >> The way locking currently works doesn't allow to remove a page table >> just by holding the mmap lock, not even in write mode. > > I'm not sure I understand this. If the goal is to free the PTE tables > then the main concern is use-after free on page table walkers (which > this series is addressing). Ignoring bugs, we have only three ways to > read the page table: Yes, use-after-free and reuse-while-freeing are the two challenges AFAIK. > > - Fully locked. Under the PTLs (gup slow is an example) > - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example) > - hw-locked. Barriered with TLB flush (gup fast is an example) Three additions as far as I can tell:
1. Fully locked currently needs the read mmap lock OR the rmap lock in read. PTLs on their own are not sufficient AFAICT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the write mmap lock, because page tables get removed with the mmap lock in read mode and heavy-weight operations (VMA layout, khugepaged) are performed under the write mmap lock.
The rmap locks protect from modifications where we want to exclude rmap walkers similarly to how we grab the mmap lock in write, where the PTLs are not sufficient. See mm/mremap.c:move_ptes() as an example which performs VMA layout + page table modifications. See khugepaged which doesn't perform VMA layout modifications but page table modifications. > > #1 should be completely safe as the PTLs will protect eveything > #2 is safe so long as the write side is held during any layout changes > #3 interacts with the TLB flush, and is also safe with zap > > rmap itself is a #1 page table walker, ie it gets the PTLs under > page_vma_mapped_walk(). When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs? Because the PMD-PTLs are usually not taken in case we know there is a page table (nothing would currently change it without heavy locking). And if they are taken, they are only held while allocating/checking a PMDE, not while actually *using* the page table that's mapped in that entry. For example, walk_page_range() requires the mmap lock in read and grabs the PTE-PTLs. > > The sin we have comitted here is that both the mmap lock and the PTLs > are being used to protect the page table itself with a very > complicated dual semantic.
> > Splitting the sleeping mmap lock into 'covers vma' and 'covers page > tables' lets us solve the lock ordering and semi-locked can become > more fully locked by the new lock, instead of by abusing mmap sem. It would still be a fairly coarse-grained locking, I am not sure if that is a step into the right direction. If you want to modify *some* page table in your process you have to exclude each and every page table walker. Or did I mis-interpret what you were saying? > > I'd suggest to make this new lock a special rwsem which allows either > concurrent read access OR concurrent PTL access, but not both. This I looked into such a lock recently in similar context and something like that does not exist yet (and fairness will be challenging). You either have multiple readers or a single writer. I'd be interested if someone knows of something like that.
On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote: > > - Fully locked. Under the PTLs (gup slow is an example) > > - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example) > > - hw-locked. Barriered with TLB flush (gup fast is an example) > > Three additions as far as I can tell: > > 1. Fully locked currently needs the read mmap lock OR the rmap lock in > read. PTLs on their own are not sufficient AFAIKT. I think the reality is we don't have any fully locked walkers.. Even gup slow is semi-lockless until it reaches lower levels, then it takes the PTLs. (I forgot about that detail!) The mmap lock is being used to protect the higher levels of the page table. It is part of its complicated dual purpose. > 2. #1 and #2 can currently only walk within VMA ranges. AFAICT this is an artifact of re-using the mmap lock to protect the page table and not being able to obtain the mmap lock in all the places the page table structure is manipulated. > 3. We can theoretically walk page tables outside VMA ranges with the > write mmap lock, because page tables get removed with the mmap lock in > read mode and heavy-weight operations (VMA layout, khugepaged) are > performed under the write mmap lock. Yes, again, an artifact of the current locking. > The rmap locks protect from modifications where we want to exclude rmap > walkers similarly to how we grab the mmap lock in write, where the PTLs > are not sufficient. It is a similar dual purpose locking as the mmap sem :( > > #1 should be completely safe as the PTLs will protect eveything > > #2 is safe so long as the write side is held during any layout changes > > #3 interacts with the TLB flush, and is also safe with zap > > > > rmap itself is a #1 page table walker, ie it gets the PTLs under > > page_vma_mapped_walk(). > > When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs? Ah, here I was thinking about a lock that can protect all the levels. Today we are abusing the mmap lock to act as the pud_lock, for instance. > Because the PMD-PTLs re usually not taken in case we know there is a > page table (nothing would currently change it without heavy > locking). This only works with the lockless walkers, and relies on the read mmap sem/etc to also mean 'a PTE table cannot become a leaf PMD' > For example, walk_page_range() requires the mmap lock in read and grabs > the PTE-PTLs. Yes, that is a semi-locked reader. > It would still be a fairly coarse-grained locking, I am not sure if that > is a step into the right direction. If you want to modify *some* page > table in your process you have exclude each and every page table walker. > Or did I mis-interpret what you were saying? That is one possible design, it favours fast walking and penalizes mutation. We could also stick a lock in the PMD (instead of a refcount) and still logically be using a lock instead of a refcount scheme. Remember modify here is "want to change a table pointer into a leaf pointer" so it isn't an every day activity.. There is some advantage with this thinking because it harmonizes well with the other stuff that wants to convert tables into leafs, but has to deal with complicated locking. On the other hand, refcounts are a degenerate kind of rwsem and only help with freeing pages. It also puts more atomics in normal fast paths since we are refcounting each PTE, not read locking the PMD. Perhaps the ideal thing would be to stick a rwsem in the PMD. read means a table cannot become a leaf.
I don't know if there is space for another atomic in the PMD level, and we'd have to use a hitching post/hashed waitq scheme too since there surely isn't room for a waitq too.. I wouldn't be so quick to say one is better than the other, but at least let's have thought about a locking solution before merging refcounts :) Jason
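In outline, the per-PMD idea could look like the sketch below; every pmd_table_*() helper is hypothetical, and as Jason notes a real implementation could not embed a full rwsem in the PMD, so it would need the hitching post/hashed waitq scheme instead:

/*
 * Hypothetical per-PMD rwsem: read mode pins the entry as a PTE table
 * (it cannot become a leaf or be freed), write mode is required to
 * collapse it into a leaf PMD or to free the table.
 */
static void walk_one_pte_table(pmd_t *pmd, unsigned long addr)
{
	pte_t *pte;

	pmd_table_lock_read(pmd);		/* hypothetical */
	pte = pte_offset_map(pmd, addr);	/* safe: the table is pinned */
	/* ... read/modify entries, taking the PTE-PTL where needed ... */
	pte_unmap(pte);
	pmd_table_unlock_read(pmd);		/* hypothetical */
}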
On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote: > > I'd suggest to make this new lock a special rwsem which allows either > > concurrent read access OR concurrent PTL access, but not both. This > > I looked into such a lock recently in similar context and something like > that does not exist yet (and fairness will be challenging). You either > have a single reader or multiple writer. I'd be interested if someone > knows of something like that. We've talked about having such a lock before for filesystems which want to permit either many direct-IO accesses or many buffered-IO accesses, but want to exclude a mixture of direct-IO and buffered-IO. The name we came up with for such a lock was the red-blue lock. Either Team Red has the lock, or Team Blue has the lock (or it's free). Indicate free with value zero, Team Red with positive numbers and Team Blue with negative numbers. If we need to indicate an opposing team is waiting for the semaphore, we can use a high bit (1 << 30) to indicate no new team members can acquire the lock. Not sure whether anybody ever coded it up.
On 10.11.21 17:49, Matthew Wilcox wrote: > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote: >>> I'd suggest to make this new lock a special rwsem which allows either >>> concurrent read access OR concurrent PTL access, but not both. This >> >> I looked into such a lock recently in similar context and something like >> that does not exist yet (and fairness will be challenging). You either >> have a single reader or multiple writer. I'd be interested if someone >> knows of something like that. > > We've talked about having such a lock before for filesystems which want > to permit either many direct-IO accesses or many buffered-IO accesses, but > want to exclude a mixture of direct-IO and buffered-IO. The name we came > up with for such a lock was the red-blue lock. Either Team Red has the > lock, or Team Blue has the lock (or it's free). Indicate free with velue > zero, Team Red with positive numbers and Team Blue with negative numbers. > If we need to indicate an opposing team is waiting for the semaphore, > we can use a high bit (1 << 30) to indicate no new team members can > acquire the lock. Not sure whether anybody ever coded it up. Interesting, thanks for sharing! My excessive google search didn't reveal anything back then (~3 months ago) -- only questions on popular coding websites asking essentially for the same thing without any helpful replies. But maybe I used the wrong keywords (e.g., "multiple reader, multiple writer", I certainly didn't search for "red-blue lock", but I do like the name :) ). Fairness might still be the biggest issue, but I am certainly no locking expert.
On Wed, Nov 10, 2021 at 05:53:57PM +0100, David Hildenbrand wrote: > On 10.11.21 17:49, Matthew Wilcox wrote: > > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote: > >>> I'd suggest to make this new lock a special rwsem which allows either > >>> concurrent read access OR concurrent PTL access, but not both. This > >> > >> I looked into such a lock recently in similar context and something like > >> that does not exist yet (and fairness will be challenging). You either > >> have a single reader or multiple writer. I'd be interested if someone > >> knows of something like that. > > > > We've talked about having such a lock before for filesystems which want > > to permit either many direct-IO accesses or many buffered-IO accesses, but > > want to exclude a mixture of direct-IO and buffered-IO. The name we came > > up with for such a lock was the red-blue lock. Either Team Red has the > > lock, or Team Blue has the lock (or it's free). Indicate free with velue > > zero, Team Red with positive numbers and Team Blue with negative numbers. > > If we need to indicate an opposing team is waiting for the semaphore, > > we can use a high bit (1 << 30) to indicate no new team members can > > acquire the lock. Not sure whether anybody ever coded it up. > > Interesting, thanks for sharing! > > My excessive google search didn't reveal anything back then (~3 months > ago) -- only questions on popular coding websites asking essentially for > the same thing without any helpful replies. But maybe I used the wrong > keywords (e.g., "multiple reader, multiple writer", I certainly didn't > search for "red-blue lock", but I do like the name :) ). > > Fairness might still be the biggest issue, but I am certainly no locking > expert. Fairness could use the same basic logic as write being preferred to read in the rwsem. The atomic implementation would be with atomic_dec_unless_positive() and atomic_inc_unless_negative(); without fairness it looks straightforward. Jason
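Following Matthew's description and these primitives, an unfair trylock-only sketch might look like this (waiting and the (1 << 30) waiter bit are left out):

#include <linux/atomic.h>

struct rb_lock {
	atomic_t cnt;	/* 0: free, > 0: Team Red holds it, < 0: Team Blue */
};

static inline bool rb_trylock_red(struct rb_lock *l)
{
	/* Succeeds only while the count is not negative (not Blue-owned). */
	return atomic_inc_unless_negative(&l->cnt);
}

static inline void rb_unlock_red(struct rb_lock *l)
{
	atomic_dec(&l->cnt);
}

static inline bool rb_trylock_blue(struct rb_lock *l)
{
	/* Succeeds only while the count is not positive (not Red-owned). */
	return atomic_dec_unless_positive(&l->cnt);
}

static inline void rb_unlock_blue(struct rb_lock *l)
{
	atomic_inc(&l->cnt);
}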
>> It would still be a fairly coarse-grained locking, I am not sure if that >> is a step into the right direction. If you want to modify *some* page >> table in your process you have exclude each and every page table walker. >> Or did I mis-interpret what you were saying? > > That is one possible design, it favours fast walking and penalizes > mutation. We could also stick a lock in the PMD (instead of a > refcount) and still logically be using a lock instead of a refcount > scheme. Remember modify here is "want to change a table pointer into a > leaf pointer" so it isn't an every day activity.. It will be if we somewhat frequently reclaim an empty PTE page table as soon as it turns empty. This not only happens when zapping, but also during writeback/swapping. So while writing back / swapping you might be left with empty page tables to reclaim. Of course, this is the current approach. Another approach that doesn't require additional refcounts is scanning page tables for empty ones and reclaiming them. This scanning can either be triggered manually from user space or automatically from the kernel. > > There is some advantage with this thinking because it harmonizes well > with the other stuff that wants to convert tables into leafs, but has > to deal with complicated locking. > > On the other hand, refcounts are a degenerate kind of rwsem and only > help with freeing pages. It also puts more atomics in normal fast > paths since we are refcounting each PTE, not read locking the PMD. > > Perhaps the ideal thing would be to stick a rwsem in the PMD. read > means a table cannot be come a leaf. I don't know if there is space > for another atomic in the PMD level, and we'd have to use a hitching > post/hashed waitq scheme too since there surely isn't room for a waitq > too.. > > I wouldn't be so quick to say one is better than the other, but at > least let's have thought about a locking solution before merging > refcounts :) Yes, absolutely. I can see the beauty in the current approach, because it just reclaims "automatically" once possible -- page table empty and nobody is walking it. The downside is that it doesn't always make sense to reclaim an empty page table immediately once it turns empty. Also, it adds complexity for something that is only a problem in some corner cases -- sparse memory mappings, especially relevant for some memory allocators after freeing a lot of memory or running VMs with memory ballooning after inflating the balloon. Some of these use cases might be good with just triggering page table reclaim manually from user space.
On Wed, Nov 10, 2021 at 06:37:46PM +0100, David Hildenbrand wrote: > It will be if we somewhat frequent when reclaim an empty PTE page table > as soon as it turns empty. What do you think is the frequency of unlocked page table walks that this would have to block on? > Also, it adds complexity for something that is only a problem in some > corner cases -- sparse memory mappings, especially relevant for some > memory allocators after freeing a lot of memory or running VMs with > memory ballooning after inflating the balloon. Some of these use cases > might be good with just triggering page table reclaim manually from user > space. Right, this is why it would be nice if the complexity could address more than one problem, like the existing complex locking around the thp stuff.. Jason
On 11/11/21 1:37 AM, David Hildenbrand wrote: >>> It would still be a fairly coarse-grained locking, I am not sure if that >>> is a step into the right direction. If you want to modify *some* page >>> table in your process you have exclude each and every page table walker. >>> Or did I mis-interpret what you were saying? >> >> That is one possible design, it favours fast walking and penalizes >> mutation. We could also stick a lock in the PMD (instead of a >> refcount) and still logically be using a lock instead of a refcount >> scheme. Remember modify here is "want to change a table pointer into a >> leaf pointer" so it isn't an every day activity.. > > It will be if we somewhat frequent when reclaim an empty PTE page table > as soon as it turns empty. This not only happens when zapping, but also > during writeback/swapping. So while writing back / swapping you might be > left with empty page tables to reclaim. > > Of course, this is the current approach. Another approach that doesn't > require additional refcounts is scanning page tables for empty ones and > reclaiming them. This scanning can either be triggered manually from > user space or automatically from the kernel. Whether it is introducing a special rwsem or scanning an empty page table, there are two problems as follows:
#1. When to trigger the scanning or releasing?
#2. Every time we release a 4K page table page, 512 page table entries need to be scanned.
For #1, if the scanning is triggered manually from user space, the kernel is relatively passive, and the user does not fully know the best timing to scan. If the scanning is triggered automatically from the kernel, that is great. But the timing is not easy to confirm: is it scanned and reclaimed every time we zap or try_to_unmap? For #2, refcount has advantages. > >> >> There is some advantage with this thinking because it harmonizes well >> with the other stuff that wants to convert tables into leafs, but has >> to deal with complicated locking. >> >> On the other hand, refcounts are a degenerate kind of rwsem and only >> help with freeing pages. It also puts more atomics in normal fast >> paths since we are refcounting each PTE, not read locking the PMD. >> >> Perhaps the ideal thing would be to stick a rwsem in the PMD. read >> means a table cannot be come a leaf. I don't know if there is space >> for another atomic in the PMD level, and we'd have to use a hitching >> post/hashed waitq scheme too since there surely isn't room for a waitq >> too.. >> >> I wouldn't be so quick to say one is better than the other, but at >> least let's have thought about a locking solution before merging >> refcounts :) > > Yes, absolutely. I can see the beauty in the current approach, because > it just reclaims "automatically" once possible -- page table empty and > nobody is walking it. The downside is that it doesn't always make sense > to reclaim an empty page table immediately once it turns empty. > > Also, it adds complexity for something that is only a problem in some > corner cases -- sparse memory mappings, especially relevant for some > memory allocators after freeing a lot of memory or running VMs with > memory ballooning after inflating the balloon. Some of these use cases > might be good with just triggering page table reclaim manually from user > space. > Yes, this is indeed a problem. Perhaps some flags can be introduced so that the release of page table pages can be delayed in some cases, similar to the lazyfree mechanism in MADV_FREE? Thanks, Qi
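For scale, the 512-entry check behind Qi's #2 is a straight linear scan; a sketch using the standard pte_none()/PTRS_PER_PTE helpers (the function itself is made up):

/* Returns true if all PTRS_PER_PTE (512 on x86-64) entries are empty. */
static bool pte_table_is_empty(pte_t *table)
{
	int i;

	for (i = 0; i < PTRS_PER_PTE; i++)
		if (!pte_none(table[i]))
			return false;
	return true;
}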
On 11.11.21 04:58, Qi Zheng wrote: > > > On 11/11/21 1:37 AM, David Hildenbrand wrote: >>>> It would still be a fairly coarse-grained locking, I am not sure if that >>>> is a step into the right direction. If you want to modify *some* page >>>> table in your process you have exclude each and every page table walker. >>>> Or did I mis-interpret what you were saying? >>> >>> That is one possible design, it favours fast walking and penalizes >>> mutation. We could also stick a lock in the PMD (instead of a >>> refcount) and still logically be using a lock instead of a refcount >>> scheme. Remember modify here is "want to change a table pointer into a >>> leaf pointer" so it isn't an every day activity.. >> >> It will be if we somewhat frequent when reclaim an empty PTE page table >> as soon as it turns empty. This not only happens when zapping, but also >> during writeback/swapping. So while writing back / swapping you might be >> left with empty page tables to reclaim. >> >> Of course, this is the current approach. Another approach that doesn't >> require additional refcounts is scanning page tables for empty ones and >> reclaiming them. This scanning can either be triggered manually from >> user space or automatically from the kernel. > > Whether it is introducing a special rwsem or scanning an empty page > table, there are two problems as follows: > > #1. When to trigger the scanning or releasing? For example when reclaiming memory, when scanning page tables in khugepaged, or triggered by user space (note that this is the approach I originally looked into). But it certainly requires more locking thought to avoid stopping essentially any page table walker. > #2. Every time to release a 4K page table page, 512 page table > entries need to be scanned. It would happen only when we actually trigger reclaim of page tables (again, someone has to trigger it), so it's barely an issue. For example, khugepaged already scans the page tables either way. > > For #1, if the scanning is triggered manually from user space, the > kernel is relatively passive, and the user does not fully know the best > timing to scan. If the scanning is triggered automatically from the > kernel, that is great. But the timing is not easy to confirm, is it > scanned and reclaimed every time zap or try_to_unmap? > > For #2, refcount has advantages. > >> >>> >>> There is some advantage with this thinking because it harmonizes well >>> with the other stuff that wants to convert tables into leafs, but has >>> to deal with complicated locking. >>> >>> On the other hand, refcounts are a degenerate kind of rwsem and only >>> help with freeing pages. It also puts more atomics in normal fast >>> paths since we are refcounting each PTE, not read locking the PMD. >>> >>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read >>> means a table cannot be come a leaf. I don't know if there is space >>> for another atomic in the PMD level, and we'd have to use a hitching >>> post/hashed waitq scheme too since there surely isn't room for a waitq >>> too.. >>> >>> I wouldn't be so quick to say one is better than the other, but at >>> least let's have thought about a locking solution before merging >>> refcounts :) >> >> Yes, absolutely. I can see the beauty in the current approach, because >> it just reclaims "automatically" once possible -- page table empty and >> nobody is walking it. The downside is that it doesn't always make sense >> to reclaim an empty page table immediately once it turns empty.
>> >> Also, it adds complexity for something that is only a problem in some >> corner cases -- sparse memory mappings, especially relevant for some >> memory allocators after freeing a lot of memory or running VMs with >> memory ballooning after inflating the balloon. Some of these use cases >> might be good with just triggering page table reclaim manually from user >> space. >> > > Yes, this is indeed a problem. Perhaps some flags can be introduced so > that the release of page table pages can be delayed in some cases. > Similar to the lazyfree mechanism in MADV_FREE? The issue AFAIU is that once your refcount hits 0 (no more references, no more entries), the longer you wait with reclaim, the longer others have to wait for populating a fresh page table because the "page table to be reclaimed" is still stuck around. You'd have to keep the refcount increased for a while, and only drop it after a while. But when? And how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
On 11/11/21 5:22 PM, David Hildenbrand wrote: > On 11.11.21 04:58, Qi Zheng wrote: >> >> >> On 11/11/21 1:37 AM, David Hildenbrand wrote: >>>>> It would still be a fairly coarse-grained locking, I am not sure if that >>>>> is a step into the right direction. If you want to modify *some* page >>>>> table in your process you have exclude each and every page table walker. >>>>> Or did I mis-interpret what you were saying? >>>> >>>> That is one possible design, it favours fast walking and penalizes >>>> mutation. We could also stick a lock in the PMD (instead of a >>>> refcount) and still logically be using a lock instead of a refcount >>>> scheme. Remember modify here is "want to change a table pointer into a >>>> leaf pointer" so it isn't an every day activity.. >>> >>> It will be if we somewhat frequent when reclaim an empty PTE page table >>> as soon as it turns empty. This not only happens when zapping, but also >>> during writeback/swapping. So while writing back / swapping you might be >>> left with empty page tables to reclaim. >>> >>> Of course, this is the current approach. Another approach that doesn't >>> require additional refcounts is scanning page tables for empty ones and >>> reclaiming them. This scanning can either be triggered manually from >>> user space or automatically from the kernel. >> >> Whether it is introducing a special rwsem or scanning an empty page >> table, there are two problems as follows: >> >> #1. When to trigger the scanning or releasing? > > For example when reclaiming memory, when scanning page tables in > khugepaged, or triggered by user space (note that this is the approach I > originally looked into). But it certainly requires more locking thought > to avoid stopping essentially any page table walker. > >> #2. Every time to release a 4K page table page, 512 page table >> entries need to be scanned. > > It would happen only when actually trigger reclaim of page tables > (again, someone has to trigger it), so it's barely an issue. > > For example, khugepaged already scans the page tables either way. > >> >> For #1, if the scanning is triggered manually from user space, the >> kernel is relatively passive, and the user does not fully know the best >> timing to scan. If the scanning is triggered automatically from the >> kernel, that is great. But the timing is not easy to confirm, is it >> scanned and reclaimed every time zap or try_to_unmap? >> >> For #2, refcount has advantages. >> >>> >>>> >>>> There is some advantage with this thinking because it harmonizes well >>>> with the other stuff that wants to convert tables into leafs, but has >>>> to deal with complicated locking. >>>> >>>> On the other hand, refcounts are a degenerate kind of rwsem and only >>>> help with freeing pages. It also puts more atomics in normal fast >>>> paths since we are refcounting each PTE, not read locking the PMD. >>>> >>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read >>>> means a table cannot be come a leaf. I don't know if there is space >>>> for another atomic in the PMD level, and we'd have to use a hitching >>>> post/hashed waitq scheme too since there surely isn't room for a waitq >>>> too.. >>>> >>>> I wouldn't be so quick to say one is better than the other, but at >>>> least let's have thought about a locking solution before merging >>>> refcounts :) >>> >>> Yes, absolutely. I can see the beauty in the current approach, because >>> it just reclaims "automatically" once possible -- page table empty and >>> nobody is walking it. 
The downside is that it doesn't always make sense >>> to reclaim an empty page table immediately once it turns empty. >>> >>> Also, it adds complexity for something that is only a problem in some >>> corner cases -- sparse memory mappings, especially relevant for some >>> memory allocators after freeing a lot of memory or running VMs with >>> memory ballooning after inflating the balloon. Some of these use cases >>> might be good with just triggering page table reclaim manually from user >>> space. >>> >> >> Yes, this is indeed a problem. Perhaps some flags can be introduced so >> that the release of page table pages can be delayed in some cases. >> Similar to the lazyfree mechanism in MADV_FREE? > > The issue AFAIU is that once your refcount hits 0 (no more references, > no more entries), the longer you wait with reclaim, the longer others > have to wait for populating a fresh page table because the "page table > to be reclaimed" is still stuck around. You'd have to keep the refcount > increased for a while, and only drop it after a while. But when? And > how? IMHO it's not trivial, but maybe there is an easy way to achieve it. > For running VMs with memory ballooning after inflating the balloon, is this a hot behavior? Even if it is, it already faces the release and reallocation of physical pages. The overhead after introducing pte_refcount is that we need to release and re-allocate the page table page. But 2MB of physical pages corresponds to only 4KiB of PTE page table, so maybe the overhead is not big. In fact, the performance test shown on the cover letter is this case: test program: https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c Thanks, Qi >
On 11.11.21 12:08, Qi Zheng wrote: > > > On 11/11/21 5:22 PM, David Hildenbrand wrote: >> On 11.11.21 04:58, Qi Zheng wrote: >>> >>> >>> On 11/11/21 1:37 AM, David Hildenbrand wrote: >>>>>> It would still be a fairly coarse-grained locking, I am not sure if that >>>>>> is a step into the right direction. If you want to modify *some* page >>>>>> table in your process you have exclude each and every page table walker. >>>>>> Or did I mis-interpret what you were saying? >>>>> >>>>> That is one possible design, it favours fast walking and penalizes >>>>> mutation. We could also stick a lock in the PMD (instead of a >>>>> refcount) and still logically be using a lock instead of a refcount >>>>> scheme. Remember modify here is "want to change a table pointer into a >>>>> leaf pointer" so it isn't an every day activity.. >>>> >>>> It will be if we somewhat frequent when reclaim an empty PTE page table >>>> as soon as it turns empty. This not only happens when zapping, but also >>>> during writeback/swapping. So while writing back / swapping you might be >>>> left with empty page tables to reclaim. >>>> >>>> Of course, this is the current approach. Another approach that doesn't >>>> require additional refcounts is scanning page tables for empty ones and >>>> reclaiming them. This scanning can either be triggered manually from >>>> user space or automatically from the kernel. >>> >>> Whether it is introducing a special rwsem or scanning an empty page >>> table, there are two problems as follows: >>> >>> #1. When to trigger the scanning or releasing? >> >> For example when reclaiming memory, when scanning page tables in >> khugepaged, or triggered by user space (note that this is the approach I >> originally looked into). But it certainly requires more locking thought >> to avoid stopping essentially any page table walker. >> >>> #2. Every time to release a 4K page table page, 512 page table >>> entries need to be scanned. >> >> It would happen only when actually trigger reclaim of page tables >> (again, someone has to trigger it), so it's barely an issue. >> >> For example, khugepaged already scans the page tables either way. >> >>> >>> For #1, if the scanning is triggered manually from user space, the >>> kernel is relatively passive, and the user does not fully know the best >>> timing to scan. If the scanning is triggered automatically from the >>> kernel, that is great. But the timing is not easy to confirm, is it >>> scanned and reclaimed every time zap or try_to_unmap? >>> >>> For #2, refcount has advantages. >>> >>>> >>>>> >>>>> There is some advantage with this thinking because it harmonizes well >>>>> with the other stuff that wants to convert tables into leafs, but has >>>>> to deal with complicated locking. >>>>> >>>>> On the other hand, refcounts are a degenerate kind of rwsem and only >>>>> help with freeing pages. It also puts more atomics in normal fast >>>>> paths since we are refcounting each PTE, not read locking the PMD. >>>>> >>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read >>>>> means a table cannot be come a leaf. I don't know if there is space >>>>> for another atomic in the PMD level, and we'd have to use a hitching >>>>> post/hashed waitq scheme too since there surely isn't room for a waitq >>>>> too.. >>>>> >>>>> I wouldn't be so quick to say one is better than the other, but at >>>>> least let's have thought about a locking solution before merging >>>>> refcounts :) >>>> >>>> Yes, absolutely. 
I can see the beauty in the current approach, because >>>> it just reclaims "automatically" once possible -- page table empty and >>>> nobody is walking it. The downside is that it doesn't always make sense >>>> to reclaim an empty page table immediately once it turns empty. >>>> >>>> Also, it adds complexity for something that is only a problem in some >>>> corner cases -- sparse memory mappings, especially relevant for some >>>> memory allocators after freeing a lot of memory or running VMs with >>>> memory ballooning after inflating the balloon. Some of these use cases >>>> might be good with just triggering page table reclaim manually from user >>>> space. >>>> >>> >>> Yes, this is indeed a problem. Perhaps some flags can be introduced so >>> that the release of page table pages can be delayed in some cases. >>> Similar to the lazyfree mechanism in MADV_FREE? >> >> The issue AFAIU is that once your refcount hits 0 (no more references, >> no more entries), the longer you wait with reclaim, the longer others >> have to wait for populating a fresh page table because the "page table >> to be reclaimed" is still stuck around. You'd have to keep the refcount >> increased for a while, and only drop it after a while. But when? And >> how? IMHO it's not trivial, but maybe there is an easy way to achieve it. >> > > For running VMs with memory ballooning after inflating the balloon, is > this a hot behavior? Even if it is, it is already facing the release and > reallocation of physical pages. The overhead after introducing > pte_refcount is that we need to release and re-allocate page table page. > But 2MB physical pages only corresponds to 4KiB of PTE page table page. > So maybe the overhead is not big. The cases that come to my mind are a) Swapping on shared memory with concurrent access b) Reclaim on file-backed memory with concurrent access c) Free page reporting as implemented by virtio-balloon In all of these cases, you can have someone immediately re-access the page table and re-populate it. For something mostly static (balloon inflation, memory allocator), it's not that big of a deal I guess.
On 11/11/21 7:19 PM, David Hildenbrand wrote: > On 11.11.21 12:08, Qi Zheng wrote: >> >> >> On 11/11/21 5:22 PM, David Hildenbrand wrote: >>> On 11.11.21 04:58, Qi Zheng wrote: >>>> >>>> >>>> On 11/11/21 1:37 AM, David Hildenbrand wrote: >>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that >>>>>>> is a step into the right direction. If you want to modify *some* page >>>>>>> table in your process you have exclude each and every page table walker. >>>>>>> Or did I mis-interpret what you were saying? >>>>>> >>>>>> That is one possible design, it favours fast walking and penalizes >>>>>> mutation. We could also stick a lock in the PMD (instead of a >>>>>> refcount) and still logically be using a lock instead of a refcount >>>>>> scheme. Remember modify here is "want to change a table pointer into a >>>>>> leaf pointer" so it isn't an every day activity.. >>>>> >>>>> It will be if we somewhat frequent when reclaim an empty PTE page table >>>>> as soon as it turns empty. This not only happens when zapping, but also >>>>> during writeback/swapping. So while writing back / swapping you might be >>>>> left with empty page tables to reclaim. >>>>> >>>>> Of course, this is the current approach. Another approach that doesn't >>>>> require additional refcounts is scanning page tables for empty ones and >>>>> reclaiming them. This scanning can either be triggered manually from >>>>> user space or automatically from the kernel. >>>> >>>> Whether it is introducing a special rwsem or scanning an empty page >>>> table, there are two problems as follows: >>>> >>>> #1. When to trigger the scanning or releasing? >>> >>> For example when reclaiming memory, when scanning page tables in >>> khugepaged, or triggered by user space (note that this is the approach I >>> originally looked into). But it certainly requires more locking thought >>> to avoid stopping essentially any page table walker. >>> >>>> #2. Every time to release a 4K page table page, 512 page table >>>> entries need to be scanned. >>> >>> It would happen only when actually trigger reclaim of page tables >>> (again, someone has to trigger it), so it's barely an issue. >>> >>> For example, khugepaged already scans the page tables either way. >>> >>>> >>>> For #1, if the scanning is triggered manually from user space, the >>>> kernel is relatively passive, and the user does not fully know the best >>>> timing to scan. If the scanning is triggered automatically from the >>>> kernel, that is great. But the timing is not easy to confirm, is it >>>> scanned and reclaimed every time zap or try_to_unmap? >>>> >>>> For #2, refcount has advantages. >>>> >>>>> >>>>>> >>>>>> There is some advantage with this thinking because it harmonizes well >>>>>> with the other stuff that wants to convert tables into leafs, but has >>>>>> to deal with complicated locking. >>>>>> >>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only >>>>>> help with freeing pages. It also puts more atomics in normal fast >>>>>> paths since we are refcounting each PTE, not read locking the PMD. >>>>>> >>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read >>>>>> means a table cannot be come a leaf. I don't know if there is space >>>>>> for another atomic in the PMD level, and we'd have to use a hitching >>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq >>>>>> too.. 
>>>>>> >>>>>> I wouldn't be so quick to say one is better than the other, but at >>>>>> least let's have thought about a locking solution before merging >>>>>> refcounts :) >>>>> >>>>> Yes, absolutely. I can see the beauty in the current approach, because >>>>> it just reclaims "automatically" once possible -- page table empty and >>>>> nobody is walking it. The downside is that it doesn't always make sense >>>>> to reclaim an empty page table immediately once it turns empty. >>>>> >>>>> Also, it adds complexity for something that is only a problem in some >>>>> corner cases -- sparse memory mappings, especially relevant for some >>>>> memory allocators after freeing a lot of memory or running VMs with >>>>> memory ballooning after inflating the balloon. Some of these use cases >>>>> might be good with just triggering page table reclaim manually from user >>>>> space. >>>>> >>>> >>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so >>>> that the release of page table pages can be delayed in some cases. >>>> Similar to the lazyfree mechanism in MADV_FREE? >>> >>> The issue AFAIU is that once your refcount hits 0 (no more references, >>> no more entries), the longer you wait with reclaim, the longer others >>> have to wait for populating a fresh page table because the "page table >>> to be reclaimed" is still stuck around. You'd have to keep the refcount >>> increased for a while, and only drop it after a while. But when? And >>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it. >>> >> >> For running VMs with memory ballooning after inflating the balloon, is >> this a hot behavior? Even if it is, it is already facing the release and >> reallocation of physical pages. The overhead after introducing >> pte_refcount is that we need to release and re-allocate page table page. >> But 2MB physical pages only corresponds to 4KiB of PTE page table page. >> So maybe the overhead is not big. > > The cases that come to my mind are > > a) Swapping on shared memory with concurrent access > b) Reclaim on file-backed memory with concurrent access > c) Free page reporting as implemented by virtio-balloon > > In all of these cases, you can have someone immediately re-access the > page table and re-populate it. In the performance test shown on the cover, we repeatedly performed touch and madvise(MADV_DONTNEED) actions, which simulated the case you said above. We did find a small amount of performance regression, but I think it is acceptable, and no new perf hotspots have been added. > > For something mostly static (balloon inflation, memory allocator), it's > not that big of a deal I guess. >
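For reference, the touch + madvise(MADV_DONTNEED) pattern described here boils down to a loop like the following sketch (sizes and iteration count are illustrative; this is not the linked test program):

#include <string.h>
#include <sys/mman.h>

#define RANGE (2UL * 1024 * 1024)

static void churn(char *buf, long iters)
{
	long i;

	for (i = 0; i < iters; i++) {
		/* Touch: faults in pages and re-populates the PTE table. */
		memset(buf, 1, RANGE);
		/* Zap: with this series, the now-empty table is freed. */
		madvise(buf, RANGE, MADV_DONTNEED);
	}
}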
On 11.11.21 13:00, Qi Zheng wrote:
>
> On 11/11/21 7:19 PM, David Hildenbrand wrote:

[...]

>> The cases that come to my mind are
>>
>> a) Swapping on shared memory with concurrent access
>> b) Reclaim on file-backed memory with concurrent access
>> c) Free page reporting as implemented by virtio-balloon
>>
>> In all of these cases, you can have someone immediately re-access the
>> page table and re-populate it.
>
> In the performance test shown in the cover letter, we repeatedly
> performed touch and madvise(MADV_DONTNEED) actions, which simulates the
> cases you describe above.
>
> We did find a small amount of performance regression, but I think it is
> acceptable, and no new perf hotspots have been added.

That test always accesses 2MiB and does it from a single thread. Things
might (IMHO will) look different when only accessing individual pages
and doing the access from one/multiple separate threads (that's what
a), b) and c) essentially do; they don't do it in the pattern you
measured -- what you measured rather matches a typical memory
allocator).
On 11/11/21 8:20 PM, David Hildenbrand wrote:
> On 11.11.21 13:00, Qi Zheng wrote:

[...]

>> In the performance test shown in the cover letter, we repeatedly
>> performed touch and madvise(MADV_DONTNEED) actions, which simulates the
>> cases you describe above.
>>
>> We did find a small amount of performance regression, but I think it is
>> acceptable, and no new perf hotspots have been added.
>
> That test always accesses 2MiB and does it from a single thread.

No, it includes multi-threading:

	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;

		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n", start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;

		pthread_barrier_wait(&barrier);
		for (i = 0; cpu == 0 && i < num; i++)
			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
		pthread_barrier_wait(&barrier);
	}

The thread on cpu0 uses madvise(MADV_DONTNEED) to release the physical
memory touched by the threads on the other cpus.

> Things might (IMHO will) look different when only accessing individual
> pages and doing the access from one/multiple separate threads (that's
> what a), b) and c) essentially do; they don't do it in the pattern you
> measured -- what you measured rather matches a typical memory
> allocator).
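[For reference, a self-contained version of that loop could look as
follows -- a sketch only, not the exact program from the cover letter;
the thread count, iteration bound, FAULT_LENGTH value and mmap flags
are assumptions. Compile with -pthread; per-cpu pinning is omitted for
brevity:

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NUM_THREADS	4
#define ITERATIONS	100
#define FAULT_LENGTH	(2UL << 20)	/* 2MiB: one PTE table's coverage */

static char *mmap_area[NUM_THREADS];
static pthread_barrier_t barrier;
static long page_size;

static void *worker(void *arg)
{
	long cpu = (long)arg;
	int iter, i;
	char *c;

	for (iter = 0; iter < ITERATIONS; iter++) {
		pthread_barrier_wait(&barrier);
		/* each thread faults in its own 2MiB range */
		for (c = mmap_area[cpu]; c < mmap_area[cpu] + FAULT_LENGTH;
		     c += page_size)
			*c = 0;
		pthread_barrier_wait(&barrier);
		/* the thread on cpu 0 zaps everybody's range */
		for (i = 0; cpu == 0 && i < NUM_THREADS; i++)
			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
		pthread_barrier_wait(&barrier);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NUM_THREADS];
	long i;

	page_size = sysconf(_SC_PAGESIZE);
	pthread_barrier_init(&barrier, NULL, NUM_THREADS);

	for (i = 0; i < NUM_THREADS; i++) {
		mmap_area[i] = mmap(NULL, FAULT_LENGTH,
				    PROT_READ | PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mmap_area[i] == MAP_FAILED)
			exit(1);
	}
	for (i = 0; i < NUM_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (i = 0; i < NUM_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

Note that, as in the quoted fragment, each thread works on its own
2MiB area; only the zap is done centrally by thread 0.]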
On 11.11.21 13:32, Qi Zheng wrote:
>
> On 11/11/21 8:20 PM, David Hildenbrand wrote:

[...]

>> That test always accesses 2MiB and does it from a single thread. Things
>> might (IMHO will) look different when only accessing individual pages
>> and doing the access from one/multiple separate threads (that's what
>
> No, it includes multi-threading:

Oh sorry, I totally skipped [2].

> 	while (1) {
> 		char *c;
> 		char *start = mmap_area[cpu];
> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
>
> 		pthread_barrier_wait(&barrier);
> 		//printf("fault into %p-%p\n", start, end);
>
> 		for (c = start; c < end; c += PAGE_SIZE)
> 			*c = 0;
>
> 		pthread_barrier_wait(&barrier);
> 		for (i = 0; cpu == 0 && i < num; i++)
> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
> 		pthread_barrier_wait(&barrier);
> 	}
>
> The thread on cpu0 uses madvise(MADV_DONTNEED) to release the physical
> memory touched by the threads on the other cpus.

I'll have a more detailed look at the benchmark. On a quick glimpse, it
looks like the threads are also accessing a full 2MiB range, one page
at a time, and one thread is zapping the whole 2MiB range. A single CPU
only accesses memory within one 2MiB range IIRC.

Having multiple threads access just individual pages within a single
2MiB region, and having one thread zap that memory (e.g., to simulate
swapout), could be another benchmark.

We have to make sure to run with THP disabled (e.g., by using
madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
eventually), because otherwise you might just be populating+zapping
THPs if they would otherwise be allowed in the environment.
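[A variant along those lines -- with the suggested MADV_NOHUGEPAGE call
included -- might look roughly like this. A sketch only: the toucher
count, the zap interval, and which pages each toucher writes are all
made-up parameters. Compile with -pthread; it runs until interrupted:

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE	(2UL << 20)	/* one shared 2MiB region */
#define NUM_TOUCHERS	3

static char *region;
static long page_size;

/* Each toucher repeatedly writes one individual page of the region. */
static void *toucher(void *arg)
{
	char *page = region + (long)arg * page_size;

	for (;;)
		*page = 0;
	return NULL;
}

/* One thread zaps the whole region, simulating swapout of its pages. */
static void *zapper(void *arg)
{
	(void)arg;
	for (;;) {
		madvise(region, REGION_SIZE, MADV_DONTNEED);
		usleep(100);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NUM_TOUCHERS + 1];
	long i;

	page_size = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;
	/* opt this mapping out of THP regardless of the global policy */
	madvise(region, REGION_SIZE, MADV_NOHUGEPAGE);

	for (i = 0; i < NUM_TOUCHERS; i++)
		pthread_create(&threads[i], NULL, toucher, (void *)i);
	pthread_create(&threads[NUM_TOUCHERS], NULL, zapper, NULL);
	pthread_join(threads[0], NULL);	/* never returns; ^C to stop */
	return 0;
}

Unlike the cover-letter loop, all touchers here share a single 2MiB
region (and thus a single PTE table), which is the contention pattern
being discussed.]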
On 11/11/21 8:51 PM, David Hildenbrand wrote:

[...]

> I'll have a more detailed look at the benchmark.

Thank you for your time :)

> On a quick glimpse, it looks like the threads are also accessing a
> full 2MiB range, one page at a time, and one thread is zapping the
> whole 2MiB range. A single CPU only accesses memory within one 2MiB
> range IIRC.
>
> Having multiple threads access just individual pages within a single
> 2MiB region, and having one thread zap that memory (e.g., to simulate
> swapout), could be another benchmark.

LGTM, I will simulate more scenarios for testing.

> We have to make sure to run with THP disabled (e.g., by using
> madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
> eventually), because otherwise you might just be populating+zapping
> THPs if they would otherwise be allowed in the environment.

Yes, I turned off THP during testing:

root@~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
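[A benchmark harness could also verify that setting programmatically
before measuring, by checking which value is bracketed in the sysfs
file -- a sketch, not from the thread:

#include <stdio.h>
#include <string.h>

/* Returns 1 if the global THP policy is "never" (the bracketed value). */
static int thp_globally_disabled(void)
{
	char buf[128] = "";
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (!f)
		return 0;
	fgets(buf, sizeof(buf), f);
	fclose(f);
	return strstr(buf, "[never]") != NULL;
}

int main(void)
{
	printf("THP globally disabled: %s\n",
	       thp_globally_disabled() ? "yes" : "no");
	return 0;
}

This only checks the global policy; mappings opted out individually via
madvise(MADV_NOHUGEPAGE), as suggested above, are not visible here.]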