Message ID | c4b8485b-1f26-1a5f-bdf-c6c22611f610@google.com (mailing list archive)
---|---
Series | mm,thp,rmap: rework the use of subpages_mapcount
On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote: > > Linus was underwhelmed by the earlier compound mapcounts series: > this series builds on top of it (as in next-20221117) to follow > up on his suggestions - except rmap.c still using lock_page_memcg(), > since I hesitate to steal the pleasure of deletion from Johannes. This looks good to me. Particularly 2/3 made me go "Aww, yes" but the overall line removal stats look good too. That said, I only looked at the patches, and not the end result itself. But not having the bit spin lock is, I think, a huge improvement. I do wonder if this should be now just merged with your previous series - it looks a bit odd how your previous series adds that bitlock, only for it to be immediately removed. But if you think the logic ends up being easier to follow this way as two separate patch series, I guess I don't care. And the memcg locking is entirely a separate issue, and I hope Johannes will deal with that. Thanks, Linus
On Fri, Nov 18, 2022 at 12:18:42PM -0800, Linus Torvalds wrote: > On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote: > > > > Linus was underwhelmed by the earlier compound mapcounts series: > > this series builds on top of it (as in next-20221117) to follow > > up on his suggestions - except rmap.c still using lock_page_memcg(), > > since I hesitate to steal the pleasure of deletion from Johannes. > > This looks good to me. Particularly 2/3 made me go "Aww, yes" but the > overall line removal stats look good too. > > That said, I only looked at the patches, and not the end result > itself. But not having the bit spin lock is, I think, a huge > improvement. > > I do wonder if this should be now just merged with your previous > series - it looks a bit odd how your previous series adds that > bitlock, only for it to be immediately removed. > > But if you think the logic ends up being easier to follow this way as > two separate patch series, I guess I don't care. > > And the memcg locking is entirely a separate issue, and I hope > Johannes will deal with that. Yeah, I'll redo the removal on top of this series and resend it. Thanks
On Fri, 18 Nov 2022, Linus Torvalds wrote: > On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote: > > > > Linus was underwhelmed by the earlier compound mapcounts series: > > this series builds on top of it (as in next-20221117) to follow > > up on his suggestions - except rmap.c still using lock_page_memcg(), > > since I hesitate to steal the pleasure of deletion from Johannes. > > This looks good to me. Particularly 2/3 made me go "Aww, yes" but the > overall line removal stats look good too. > > That said, I only looked at the patches, and not the end result > itself. But not having the bit spin lock is, I think, a huge > improvement. Great, thanks a lot for looking through. > > I do wonder if this should be now just merged with your previous > series - it looks a bit odd how your previous series adds that > bitlock, only for it to be immediately removed. > > But if you think the logic ends up being easier to follow this way as > two separate patch series, I guess I don't care. I rather like having its evolution on record there, but that might just be my sentimentality + laziness. Kirill did a grand job of reviewing the first series: I think that, at least for now, it would be easier for people to review the changes if the two series are not recombined. But the first series has not yet graduated from mm-unstable, so if Andrew and/or Kirill also prefer to have them combined into one bit_spin_lock-less series, that I can do. (And the end result should be identical, so would not complicate Johannes's lock_page_memcg() excision.) Hugh > > And the memcg locking is entirely a separate issue, and I hope > Johannes will deal with that. > > Thanks, > Linus
On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote: > But the first series has not yet graduated from mm-unstable, > so if Andrew and/or Kirill also prefer to have them combined into one > bit_spin_lock-less series, that I can do. (And the end result should be > identical, so would not complicate Johannes's lock_page_memcg() excision.) I'd prefer that approach. It's -rc5 and the earlier "mm,huge,rmap: unify and speed up compound mapcounts" series has had some testing. I'd prefer not to toss it all out and start again.
On Fri, Nov 18, 2022 at 2:03 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > I'd prefer that approach. The "that approach" is a bit ambiguous here, particularly considering how you quoted things. But I think from the context you meant "keep them as two separate series, even if the second undoes part of the first and does it differently". And that's fine. Even if it's maybe a bit odd to introduce that locking that then goes away, I can't argue with "the first series was already reviewed and has gone through a fair amount of testing". Linus
On Fri, 18 Nov 2022, Andrew Morton wrote: > On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote: > > > But the first series has not yet graduated from mm-unstable, > > so if Andrew and/or Kirill also prefer to have them combined into one > > bit_spin_lock-less series, that I can do. (And the end result should be > > identical, so would not complicate Johannes's lock_page_memcg() excision.) > > I'd prefer that approach. I think you're saying that you prefer the other approach, to keep the two series separate (second immediately after the first, or not, doesn't matter), rather than combined into one bit_spin_lock-less series. Please clarify! Thanks, Hugh > It's -rc5 and the earlier "mm,huge,rmap: > unify and speed up compound mapcounts" series has had some testing. > I'd prefer not to toss it all out and start again.
On Fri, 18 Nov 2022 14:10:32 -0800 (PST) Hugh Dickins <hughd@google.com> wrote: > On Fri, 18 Nov 2022, Andrew Morton wrote: > > On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote: > > > > > But the first series has not yet graduated from mm-unstable, > > > so if Andrew and/or Kirill also prefer to have them combined into one > > > bit_spin_lock-less series, that I can do. (And the end result should be > > > identical, so would not complicate Johannes's lock_page_memcg() excision.) > > > > I'd prefer that approach. > > I think you're saying that you prefer the other approach, to keep the > two series separate (second immediately after the first, or not, doesn't > matter), rather than combined into one bit_spin_lock-less series. > Please clarify! Thanks, Yes, two separate series. Apologies for the confuddling.
On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote: > Linus was underwhelmed by the earlier compound mapcounts series: > this series builds on top of it (as in next-20221117) to follow > up on his suggestions - except rmap.c still using lock_page_memcg(), > since I hesitate to steal the pleasure of deletion from Johannes. > Is there a plan to remove lock_page_memcg() altogether which I missed? I am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows up in the perf profile on exit path) but if we are removing it then I should just wait.
On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote: > > Is there a plan to remove lock_page_memcg() altogether which I missed? I > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows > up in the perf profile on exit path) Yay. It seems I'm not the only one hating it. > but if we are removing it then I should just wait. Well, I think Johannes was saying that at least the case I disliked (the rmap removal from the page table tear-down - I strongly suspect it's the one you're seeing on your perf profile too) can be removed entirely as long as it's done under the page table lock (which my final version of the rmap delaying still was). See https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/ for his preliminary patch. That said, if you have some patch to make it a no-op for _other_ reasons, and could be done away with _entirely_ (not just for rmap), then that would be even better. I am not a fan of that lock in general, but in the teardown rmap path it's actively horrifying because it is taken one page at a time. So it's taken a *lot* (although you might not see it if all you run is long-running benchmarks - it's mainly the "run lots of small scripts" case that really hits it). The reason it seems to be so horrifyingly noticeable on the exit path is that the fork() side already does the rmap stuff (mainly __page_dup_rmap()) _without_ having to do the lock_page_memcg() dance. So I really hate that lock. It's completely inconsistent, and it all feels very wrong. It seemed entirely pointless when I was looking at the rmap removal path for a single page. The fact that both you and Johannes seem to be more than ready to just remove it makes me much happier, because I've never actually known the memcg code enough to do anything about my simmering hatred. Linus
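A minimal sketch of the pattern being described here, for readers less familiar with this path. It is illustrative only, not the actual mm/rmap.c code; the function name is invented and most of the real logic is omitted.

	#include <linux/mm.h>
	#include <linux/memcontrol.h>
	#include <linux/rmap.h>

	/*
	 * Illustrative only: roughly the shape of the anon path of
	 * page_remove_rmap() as discussed above.  It is called once per page
	 * from zap_pte_range() during exit/munmap, so lock_page_memcg() and
	 * unlock_page_memcg() are taken for every single page torn down.
	 */
	static void remove_one_rmap_sketch(struct page *page)
	{
		lock_page_memcg(page);		/* the per-page cost being complained about */

		if (atomic_add_negative(-1, &page->_mapcount))	/* last mapping gone? */
			__mod_lruvec_page_state(page, NR_ANON_MAPPED, -1);

		unlock_page_memcg(page);
	}

The fork() side duplicates the mapping without any such bracketing, which is why the cost only shows up on the teardown half of the fork/exit pair.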
On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote: > On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote: > > Linus was underwhelmed by the earlier compound mapcounts series: > > this series builds on top of it (as in next-20221117) to follow > > up on his suggestions - except rmap.c still using lock_page_memcg(), > > since I hesitate to steal the pleasure of deletion from Johannes. > > Is there a plan to remove lock_page_memcg() altogether which I missed? I > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows > up in the perf profile on exit path) but if we are removing it then I > should just wait. We can remove it for rmap at least, but we might be able to do more. Besides rmap, we're left with the dirty and writeback page transitions that wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK. Looking through the various callsites, I think we can delete it from setting and clearing dirty state, as we always hold the page lock (or the pte lock in some instances of folio_mark_dirty). Both of these are taken from the cgroup side, so we're good there. I think we can also remove it when setting writeback, because those sites have the page locked as well. That leaves clearing writeback. This can't hold the page lock due to the atomic context, so currently we need to take lock_page_memcg() as the lock of last resort. I wonder if we can have cgroup take the xalock instead: writeback ending on file pages always acquires the xarray lock. Swap writeback currently doesn't, but we could make it so (swap_address_space). The only thing that gives me pause is the !mapping check in __folio_end_writeback. File and swapcache pages usually have mappings, and truncation waits for writeback to finish before axing page->mapping. So AFAICS this can only happen if we call end_writeback on something that isn't under writeback - in which case the test_clear will fail and we don't update the stats anyway. But I want to be sure. Does anybody know from the top of their heads if a page under writeback could be without a mapping in some weird cornercase? If we could ensure that the NR_WRITEBACK decs are always protected by the xalock, we could grab it from mem_cgroup_move_account(), and then kill lock_page_memcg() altogether.
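A rough, hypothetical sketch of the direction proposed above: clearing writeback under the i_pages xarray lock rather than lock_page_memcg(). The function name is invented, the swapcache and !mapping cases raised above are deliberately not handled, and the bdi/inode writeback accounting is omitted.

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/memcontrol.h>
	#include <linux/xarray.h>

	/*
	 * Hypothetical sketch, not a real patch: end-of-writeback accounting
	 * done entirely under the xarray lock that file pages already take,
	 * so mem_cgroup_move_account() could serialize against it by taking
	 * the same lock instead of relying on lock_page_memcg().
	 */
	static bool end_writeback_sketch(struct folio *folio)
	{
		struct address_space *mapping = folio_mapping(folio);
		unsigned long flags;
		bool ret;

		/* Assumes mapping != NULL - exactly the corner case asked about above. */
		xa_lock_irqsave(&mapping->i_pages, flags);	/* would replace lock_page_memcg() */

		ret = folio_test_clear_writeback(folio);
		if (ret) {
			__xa_clear_mark(&mapping->i_pages, folio_index(folio),
					PAGECACHE_TAG_WRITEBACK);
			/* the NR_WRITEBACK dec now always happens under the xalock */
			lruvec_stat_mod_folio(folio, NR_WRITEBACK,
					      -folio_nr_pages(folio));
		}

		xa_unlock_irqrestore(&mapping->i_pages, flags);
		return ret;
	}

Swap writeback would have to start taking the same lock (via swap_address_space) for this to cover all NR_WRITEBACK decrements, which is the regression Hugh notes below.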
On Mon, 21 Nov 2022, Johannes Weiner wrote: > On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote: > > On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote: > > > Linus was underwhelmed by the earlier compound mapcounts series: > > > this series builds on top of it (as in next-20221117) to follow > > > up on his suggestions - except rmap.c still using lock_page_memcg(), > > > since I hesitate to steal the pleasure of deletion from Johannes. > > > > Is there a plan to remove lock_page_memcg() altogether which I missed? I > > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows > > up in the perf profile on exit path) but if we are removing it then I > > should just wait. > > We can remove it for rmap at least, but we might be able to do more. I hope the calls from mm/rmap.c can be deleted before deciding the bigger picture for lock_page_memcg() itself; getting rid of it would be very nice, but it has always had a difficult job to do (and you've devoted lots of good effort to minimizing it). > > Besides rmap, we're left with the dirty and writeback page transitions > that wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK. > > Looking through the various callsites, I think we can delete it from > setting and clearing dirty state, as we always hold the page lock (or > the pte lock in some instances of folio_mark_dirty). Both of these are > taken from the cgroup side, so we're good there. > > I think we can also remove it when setting writeback, because those > sites have the page locked as well. > > That leaves clearing writeback. This can't hold the page lock due to > the atomic context, so currently we need to take lock_page_memcg() as > the lock of last resort. > > I wonder if we can have cgroup take the xalock instead: writeback > ending on file pages always acquires the xarray lock. Swap writeback > currently doesn't, but we could make it so (swap_address_space). It's a little bit of a regression to have to take that lock when ending writeback on swap (compared with the rcu_read_lock() of almost every lock_page_memcg()); but I suppose if swap had been doing that all along, like the normal page cache case, I would not be complaining. > > The only thing that gives me pause is the !mapping check in > __folio_end_writeback. File and swapcache pages usually have mappings, > and truncation waits for writeback to finish before axing > page->mapping. So AFAICS this can only happen if we call end_writeback > on something that isn't under writeback - in which case the test_clear > will fail and we don't update the stats anyway. But I want to be sure. > > Does anybody know from the top of their heads if a page under > writeback could be without a mapping in some weird cornercase? End of writeback has been a persistent troublemaker, in several ways; I forget whether we are content with it now or not. I would not trust whatever I think OTOH of that !mapping case, but I was deeper into it two years ago, and find myself saying "Can mapping be NULL? I don't see how, but allow for that with a WARN_ON_ONCE()" in a patch I posted then (but it didn't go in, we went in another direction). I'm pretty sure it never warned once for me, but I probably wasn't doing enough to test it. And IIRC I did also think that the !mapping check had perhaps been copied from a related function, one where it made more sense. 
It's also worth noting that the two stats which get decremented there, NR_WRITEBACK and NR_ZONE_WRITE_PENDING, are two of the three which we have commented "Skip checking stats known to go negative occasionally" in mm/vmstat.c: I never did come up with a convincing explanation for that (Roman had his explanation, but I wasn't quite convinced). Maybe it would just be wrong to touch them if mapping were NULL. > > If we could ensure that the NR_WRITEBACK decs are always protected by > the xalock, we could grab it from mem_cgroup_move_account(), and then > kill lock_page_memcg() altogether. I suppose so (but I still feel grudging about the xalock for swap). Hugh
On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote: > That leaves clearing writeback. This can't hold the page lock due to > the atomic context, so currently we need to take lock_page_memcg() as > the lock of last resort. > > I wonder if we can have cgroup take the xalock instead: writeback > ending on file pages always acquires the xarray lock. Swap writeback > currently doesn't, but we could make it so (swap_address_space). > > The only thing that gives me pause is the !mapping check in > __folio_end_writeback. File and swapcache pages usually have mappings, > and truncation waits for writeback to finish before axing > page->mapping. So AFAICS this can only happen if we call end_writeback > on something that isn't under writeback - in which case the test_clear > will fail and we don't update the stats anyway. But I want to be sure. > > Does anybody know from the top of their heads if a page under > writeback could be without a mapping in some weird cornercase? I can't think of such a corner case. We should always wait for writeback to finish before removing the page from the page cache; the writeback bit used to be (and kind of still is) an implicit reference to the page, which means that we can't remove the page cache's reference to the page without waiting for writeback. > If we could ensure that the NR_WRITEBACK decs are always protected by > the xalock, we could grab it from mem_cgroup_move_account(), and then > kill lock_page_memcg() altogether. I'm not thrilled by this idea, but I'm not going to veto it.
On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote: > On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote: > > That leaves clearing writeback. This can't hold the page lock due to > > the atomic context, so currently we need to take lock_page_memcg() as > > the lock of last resort. > > > > I wonder if we can have cgroup take the xalock instead: writeback > > ending on file pages always acquires the xarray lock. Swap writeback > > currently doesn't, but we could make it so (swap_address_space). > > > > The only thing that gives me pause is the !mapping check in > > __folio_end_writeback. File and swapcache pages usually have mappings, > > and truncation waits for writeback to finish before axing > > page->mapping. So AFAICS this can only happen if we call end_writeback > > on something that isn't under writeback - in which case the test_clear > > will fail and we don't update the stats anyway. But I want to be sure. > > > > Does anybody know from the top of their heads if a page under > > writeback could be without a mapping in some weird cornercase? > > I can't think of such a corner case. We should always wait for > writeback to finish before removing the page from the page cache; > the writeback bit used to be (and kind of still is) an implicit > reference to the page, which means that we can't remove the page > cache's reference to the page without waiting for writeback. Great, thanks! > > If we could ensure that the NR_WRITEBACK decs are always protected by > > the xalock, we could grab it from mem_cgroup_move_account(), and then > > kill lock_page_memcg() altogether. > > I'm not thrilled by this idea, but I'm not going to veto it. Ok, I'm also happy to drop this one. Certainly, the rmap one is the lowest-hanging fruit. I have the patch rebased against Hugh's series in mm-unstable; I'll wait for that to settle down, and then send an updated version to Andrew.
On Mon, Nov 21, 2022 at 09:16:58AM -0800, Linus Torvalds wrote: > On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > Is there a plan to remove lock_page_memcg() altogether which I missed? I > > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows > > up in the perf profile on exit path) > > Yay. It seems I'm not the only one hating it. > > > but if we are removing it then I should just wait. > > Well, I think Johannes was saying that at least the case I disliked > (the rmap removal from the page table tear-down - I strongly suspect > it's the one you're seeing on your perf profile too) Yes indeed that is the one.

	-   99.89%     0.00%  fork-large-mmap  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
	     entry_SYSCALL_64_after_hwframe
	   - do_syscall_64
	      - 48.94% __x64_sys_exit_group
	           do_group_exit
	         - do_exit
	            - 48.94% exit_mm
	                 mmput
	               - __mmput
	                  - exit_mmap
	                     - 48.61% unmap_vmas
	                        - 48.61% unmap_single_vma
	                           - unmap_page_range
	                              - 48.60% zap_p4d_range
	                                 - 44.66% zap_pte_range
	                                    + 12.61% tlb_flush_mmu
	                                    - 9.38% page_remove_rmap
	                                         2.50% lock_page_memcg
	                                         2.37% unlock_page_memcg
	                                         0.61% PageHuge
	                                      4.80% vm_normal_page
	                                      2.56% __tlb_remove_page_size
	                                      0.85% lock_page_memcg
	                                      0.53% PageHuge
	                                   2.22% __tlb_remove_page_size
	                                   0.93% vm_normal_page
	                                   0.72% page_remove_rmap

> can be removed > entirely as long as it's done under the page table lock (which my > final version of the rmap delaying still was). > > See > > https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/ > > for his preliminary patch. > > That said, if you have some patch to make it a no-op for _other_ > reasons, and could be done away with _entirely_ (not just for rmap), > then that would be even better. I am actually looking at deprecating the whole "move charge" functionality of cgroup-v1, i.e. the underlying reason lock_page_memcg exists. That already does not work for a couple of cases like partially mapped THP and madv_free'd pages. Though that deprecation process would take some time. In the meantime I was looking at whether we can make these functions a nop for cgroup-v2. thanks, Shakeel
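A hypothetical sketch of the cgroup-v2 short-circuit being considered here. This is not a patch from the thread and the function name is invented; it only illustrates the idea that the lock exists solely to serialize against v1 charge moving, so the v2-only case could bail out before touching any per-page state.

	#include <linux/cgroup.h>
	#include <linux/memcontrol.h>

	/*
	 * Hypothetical sketch of making lock_page_memcg() a nop on cgroup v2.
	 * The lock guards against mem_cgroup_move_account(), which is only
	 * reachable through the v1 "move charge" interface, so on the unified
	 * hierarchy there is nothing to serialize against.
	 */
	static inline void lock_page_memcg_sketch(struct page *page)
	{
		if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
			return;			/* v2: no charge moving, skip the lock */

		lock_page_memcg(page);		/* v1: existing behaviour */
	}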
On Tue, Nov 22, 2022 at 01:55:39AM -0500, Johannes Weiner wrote: > On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote: > > On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote: > > > That leaves clearing writeback. This can't hold the page lock due to > > > the atomic context, so currently we need to take lock_page_memcg() as > > > the lock of last resort. > > > > > > I wonder if we can have cgroup take the xalock instead: writeback > > > ending on file pages always acquires the xarray lock. Swap writeback > > > currently doesn't, but we could make it so (swap_address_space). > > > > > > The only thing that gives me pause is the !mapping check in > > > __folio_end_writeback. File and swapcache pages usually have mappings, > > > and truncation waits for writeback to finish before axing > > > page->mapping. So AFAICS this can only happen if we call end_writeback > > > on something that isn't under writeback - in which case the test_clear > > > will fail and we don't update the stats anyway. But I want to be sure. > > > > > > Does anybody know from the top of their heads if a page under > > > writeback could be without a mapping in some weird cornercase? > > > > I can't think of such a corner case. We should always wait for > > writeback to finish before removing the page from the page cache; > > the writeback bit used to be (and kind of still is) an implicit > > reference to the page, which means that we can't remove the page > > cache's reference to the page without waiting for writeback. > > Great, thanks! > > > > If we could ensure that the NR_WRITEBACK decs are always protected by > > > the xalock, we could grab it from mem_cgroup_move_account(), and then > > > kill lock_page_memcg() altogether. > > > > I'm not thrilled by this idea, but I'm not going to veto it. > > Ok, I'm also happy to drop this one. > > Certainly, the rmap one is the lowest-hanging fruit. I have the patch > rebased against Hugh's series in mm-unstable; I'll wait for that to > settle down, and then send an updated version to Andrew. I am planning to initiate the deprecation of the move charge functionality of v1. So I would say let's go with the low-hanging fruit for now and let the slow process of deprecation remove the remaining cases.