| Message ID | 20221123181838.1373440-1-hannes@cmpxchg.org (mailing list archive) |
|---|---|
| State | New |
| Series | mm: remove lock_page_memcg() from rmap |
On Wed, Nov 23, 2022 at 10:18 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> rmap changes (mapping and unmapping) of a page currently take
> lock_page_memcg() to serialize 1) update of the mapcount and the
> cgroup mapped counter with 2) cgroup moving the page and updating the
> old cgroup and the new cgroup counters based on page_mapped().
>
> Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), we used to reassign all pages that could be found
> on a cgroup's LRU list on deletion - something that rmap didn't
> naturally serialize against. Since that commit, however, the only
> pages that get moved are those mapped into page tables of a task
> that's being migrated. In that case, the pte lock is always held (and
> we know the page is mapped), which keeps rmap changes at bay already.
>
> The additional lock_page_memcg() by rmap is redundant. Remove it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

On Wed, 23 Nov 2022, Johannes Weiner wrote:

> rmap changes (mapping and unmapping) of a page currently take
> lock_page_memcg() to serialize 1) update of the mapcount and the
> cgroup mapped counter with 2) cgroup moving the page and updating the
> old cgroup and the new cgroup counters based on page_mapped().
>
> Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), we used to reassign all pages that could be found
> on a cgroup's LRU list on deletion - something that rmap didn't
> naturally serialize against. Since that commit, however, the only
> pages that get moved are those mapped into page tables of a task
> that's being migrated. In that case, the pte lock is always held (and
> we know the page is mapped), which keeps rmap changes at bay already.
>
> The additional lock_page_memcg() by rmap is redundant. Remove it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you, I love it: but with sorrow and shame, NAK to this version.

I was gearing up to rush in the crash fix at the bottom, when testing
showed that the new VM_WARN_ON_ONCE(!folio_mapped(folio)) actually hits.

So I've asked Stephen to drop this mm-unstable commit from -next for
tonight, while we think about what more is needed.

I was disbelieving when I saw the warning, couldn't understand at all.
But a look at get_mctgt_type() shatters my illusion: it doesn't just
return a page for pte_present(ptent), it goes off looking up swap
cache and page cache; plus I've no idea whether an MC_TARGET_DEVICE
page would appear as folio_mapped() or not.

Does that mean that we just have to reinstate the folio_mapped() checks
in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
commit? Or does it invalidate the whole project to remove
lock_page_memcg() from mm/rmap.c?

Too disappointed to think about it more tonight :-(

Hugh

> ---
>  mm/memcontrol.c | 35 ++++++++++++++++++++---------------
>  mm/rmap.c       | 12 ------------
>  2 files changed, 20 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23750cec0036..52b86ca7a78e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5676,7 +5676,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>   * @from: mem_cgroup which the page is moved from.
>   * @to: mem_cgroup which the page is moved to. @from != @to.
>   *
> - * The caller must make sure the page is not on LRU (isolate_page() is useful.)
> + * This function acquires folio_lock() and folio_lock_memcg(). The
> + * caller must exclude all other possible ways of accessing
> + * page->memcg, such as LRU isolation (to lock out isolation) and
> + * having the page mapped and pte-locked (to lock out rmap).
>   *
>   * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
>   * from old cgroup.
> @@ -5696,6 +5699,13 @@ static int mem_cgroup_move_account(struct page *page,
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  	VM_BUG_ON(compound && !folio_test_large(folio));
>
> +	/*
> +	 * We're only moving pages mapped into the moving process's
> +	 * page tables. The caller's pte lock prevents rmap from
> +	 * removing the NR_x_MAPPED state while we transfer it.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_mapped(folio));
> +
>  	/*
>  	 * Prevent mem_cgroup_migrate() from looking at
>  	 * page's memory cgroup of its source page while we change it.
> @@ -5715,30 +5725,25 @@ static int mem_cgroup_move_account(struct page *page,
>  	folio_memcg_lock(folio);
>
>  	if (folio_test_anon(folio)) {
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> -			if (folio_test_transhuge(folio)) {
> -				__mod_lruvec_state(from_vec, NR_ANON_THPS,
> -						   -nr_pages);
> -				__mod_lruvec_state(to_vec, NR_ANON_THPS,
> -						   nr_pages);
> -			}
> +		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> +
> +		if (folio_test_transhuge(folio)) {
> +			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
> +			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
>  		}
>  	} else {
>  		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
>  		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
>
> +		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> +
>  		if (folio_test_swapbacked(folio)) {
>  			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
>  			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
>  		}
>
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> -		}
> -
>  		if (folio_test_dirty(folio)) {
>  			struct address_space *mapping = folio_mapping(folio);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 459dc1c44d8a..11a4894158db 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
>  	bool compound = flags & RMAP_COMPOUND;
>  	bool first = true;
>
> -	if (unlikely(PageKsm(page)))
> -		lock_page_memcg(page);
> -
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
>  		first = atomic_inc_and_test(&page->_mapcount);
> @@ -1254,9 +1251,6 @@ void page_add_anon_rmap(struct page *page,
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
>
> -	if (unlikely(PageKsm(page)))
> -		unlock_page_memcg(page);
> -
>  	/* address might be in next vma when migration races vma_adjust */
>  	else if (first)
>  		__page_set_anon_rmap(page, vma, address,
> @@ -1321,7 +1315,6 @@ void page_add_file_rmap(struct page *page,
>  	bool first;
>
>  	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> -	lock_page_memcg(page);
>
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
> @@ -1349,7 +1342,6 @@ void page_add_file_rmap(struct page *page,
>  				NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
> -	unlock_page_memcg(page);
>
>  	mlock_vma_page(page, vma, compound);
>  }
> @@ -1378,8 +1370,6 @@ void page_remove_rmap(struct page *page,
>  		return;
>  	}
>
> -	lock_page_memcg(page);
> -
>  	/* Is page being unmapped by PTE? Is this its last map to be removed? */
>  	if (likely(!compound)) {
>  		last = atomic_add_negative(-1, &page->_mapcount);
> @@ -1427,8 +1417,6 @@ void page_remove_rmap(struct page *page,
>  	 * and remember that it's only reliable while mapped.
>  	 */
>
> -	unlock_page_memcg(page);
> -
>  	munlock_vma_page(page, vma, compound);
>  }
>
> --
> 2.38.1

[PATCH] mm: remove lock_page_memcg() from rmap - fix

Blame me for the hidden "else", which now does the wrong thing, leaving
page's anon_vma unset, then VM_BUG_ON before do_swap_page's set_pte_at.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/rmap.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 11a4894158db..5a8d27fdc644 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1251,13 +1251,14 @@ void page_add_anon_rmap(struct page *page,
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 
-	/* address might be in next vma when migration races vma_adjust */
-	else if (first)
-		__page_set_anon_rmap(page, vma, address,
-				     !!(flags & RMAP_EXCLUSIVE));
-	else
-		__page_check_anon_rmap(page, vma, address);
-
+	if (!PageKsm(page)) {
+		/* address may be in next vma if migration races vma_adjust */
+		if (first)
+			__page_set_anon_rmap(page, vma, address,
+					     !!(flags & RMAP_EXCLUSIVE));
+		else
+			__page_check_anon_rmap(page, vma, address);
+	}
 	mlock_vma_page(page, vma, compound);
 }

On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> On Wed, 23 Nov 2022, Johannes Weiner wrote:
> > rmap changes (mapping and unmapping) of a page currently take
> > lock_page_memcg() to serialize 1) update of the mapcount and the
> > cgroup mapped counter with 2) cgroup moving the page and updating the
> > old cgroup and the new cgroup counters based on page_mapped().
> >
> > Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
> > offlined groups"), we used to reassign all pages that could be found
> > on a cgroup's LRU list on deletion - something that rmap didn't
> > naturally serialize against. Since that commit, however, the only
> > pages that get moved are those mapped into page tables of a task
> > that's being migrated. In that case, the pte lock is always held (and
> > we know the page is mapped), which keeps rmap changes at bay already.
> >
> > The additional lock_page_memcg() by rmap is redundant. Remove it.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thank you, I love it: but with sorrow and shame, NAK to this version.
>
> I was gearing up to rush in the crash fix at the bottom, when testing
> showed that the new VM_WARN_ON_ONCE(!folio_mapped(folio)) actually hits.
>
> So I've asked Stephen to drop this mm-unstable commit from -next for
> tonight, while we think about what more is needed.
>
> I was disbelieving when I saw the warning, couldn't understand at all.
> But a look at get_mctgt_type() shatters my illusion: it doesn't just
> return a page for pte_present(ptent), it goes off looking up swap
> cache and page cache; plus I've no idea whether an MC_TARGET_DEVICE
> page would appear as folio_mapped() or not.

Thanks for catching this.

A device_private pte always has a matching mapcount in the fake page
it points to, so we should be good here. Those pages migrate back and
forth between system and device memory, relying on the migration
code's unmap and restore bits. Hence they participate in regular rmap.

The swapcache/pagecache bit was a brainfart. We acquire the folio lock
in move_account(), which would lock out concurrent faults. If it's not
mapped, I don't see how it could become mapped behind our backs. But
we do need to be prepared for it to be unmapped.

> Does that mean that we just have to reinstate the folio_mapped() checks
> in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> commit? Or does it invalidate the whole project to remove
> lock_page_memcg() from mm/rmap.c?

I think we have to bring back the folio_mapped() conditional and
update the comments. But it shouldn't invalidate the whole project.
I'll triple check this, however.

Thanks

On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> in move_account(), which would lock out concurrent faults. If it's not
> mapped, I don't see how it could become mapped behind our backs. But
> we do need to be prepared for it to be unmapped.

Welp, that doesn't protect us from the inverse, where the page is
mapped elsewhere and the other ptes are going away. So this won't be
enough, unfortunately.

> > Does that mean that we just have to reinstate the folio_mapped() checks
> > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > commit? Or does it invalidate the whole project to remove
> > lock_page_memcg() from mm/rmap.c?

Short of further restricting the pages that can be moved, I don't see
how we can get rid of the cgroup locks in rmap after all. :(

We can try limiting move candidates to present ptes. But maybe it's
indeed time to deprecate the legacy charge moving altogether, and get
rid of the entire complication.

Hugh, Shakeel, Michal, what do you think?

On Tue, Nov 29, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.

Please. If that's what it takes..

Linus

On Tue, Nov 29, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > in move_account(), which would lock out concurrent faults. If it's not
> > mapped, I don't see how it could become mapped behind our backs. But
> > we do need to be prepared for it to be unmapped.
>
> Welp, that doesn't protect us from the inverse, where the page is
> mapped elsewhere and the other ptes are going away. So this won't be
> enough, unfortunately.
>
> > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > commit? Or does it invalidate the whole project to remove
> > > lock_page_memcg() from mm/rmap.c?
>
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I am on-board.

On Tue, 29 Nov 2022, Johannes Weiner wrote:
> On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > in move_account(), which would lock out concurrent faults. If it's not
> > mapped, I don't see how it could become mapped behind our backs. But
> > we do need to be prepared for it to be unmapped.
>
> Welp, that doesn't protect us from the inverse, where the page is
> mapped elsewhere and the other ptes are going away. So this won't be
> enough, unfortunately.
>
> > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > commit? Or does it invalidate the whole project to remove
> > > lock_page_memcg() from mm/rmap.c?
>
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(
>
> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I'm certainly not against deprecating it - it's a largish body of odd
code, which poses significant problems, yet is very seldom used; but I
feel that we'd all like to see it gone from rmap quicker than it can
be fully deprecated out of existence.

I do wonder if any user would notice, if we quietly removed its
operation on non-present ptes; certainly there *might* be users
relying on that behaviour, but I doubt that many would.

Alternatively (although I think Linus's objection to it in rmap is on
both aesthetic and performance grounds, and retaining any trace of it
in rmap.c still fails the aesthetic), can there be some static-keying
done, to eliminate (un)lock_page_memcg() overhead for all but those few
who actually indulge in moving memcg charge at immigrate? (But I think
you would have already done that if it were possible.)

Hugh

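The static-keying idea would look roughly like the sketch below. This is an
illustration only, not code from the thread, and it glosses over the hard
part of flipping the key safely while rmap callers may already be between a
lock and its matching unlock; the key and wrapper names are made up.

/* Hypothetical sketch of the static-key idea; not code from this thread. */
#include <linux/jump_label.h>
#include <linux/memcontrol.h>

/* Enabled the first time any cgroup turns on move_charge_at_immigrate. */
DEFINE_STATIC_KEY_FALSE(memcg_charge_moving_key);

static inline void rmap_lock_page_memcg(struct page *page)
{
	/* Stays a no-op branch for everyone who never moves charges. */
	if (static_branch_unlikely(&memcg_charge_moving_key))
		lock_page_memcg(page);
}

static inline void rmap_unlock_page_memcg(struct page *page)
{
	if (static_branch_unlikely(&memcg_charge_moving_key))
		unlock_page_memcg(page);
}

/*
 * The move_charge_at_immigrate write handler would call
 * static_branch_enable(&memcg_charge_moving_key) for a non-zero val.
 * Enabling the key while some rmap caller sits between its lock and
 * unlock would unbalance the pair, which is one reason this is not as
 * simple as it looks.
 */
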
On Tue 29-11-22 14:08:34, Johannes Weiner wrote:
[...]
> Short of further restricting the pages that can be moved, I don't see
> how we can get rid of the cgroup locks in rmap after all. :(

:(

> We can try limiting move candidates to present ptes. But maybe it's
> indeed time to deprecate the legacy charge moving altogether, and get
> rid of the entire complication.
>
> Hugh, Shakeel, Michal, what do you think?

I do agree with Hugh that going part way will likely not solve the
general maintenance burden. I am OK with dropping the functionality,
but we should at least add a big fat warning if anybody wants to
enable memory.move_charge_at_immigrate. If there are any workloads
which absolutely depend on the feature (which would make them
impossible to migrate to v2, btw) then we want to hear about that and
address it in some way.

With an incremental deprecation that only limits moving to present
ptes, the semantic change might get overlooked for an extended time.

On Tue, Nov 29, 2022 at 11:33 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 29 Nov 2022, Johannes Weiner wrote:
> > On Mon, Nov 28, 2022 at 11:59:53AM -0500, Johannes Weiner wrote:
> > > On Wed, Nov 23, 2022 at 10:03:00PM -0800, Hugh Dickins wrote:
> > > The swapcache/pagecache bit was a brainfart. We acquire the folio lock
> > > in move_account(), which would lock out concurrent faults. If it's not
> > > mapped, I don't see how it could become mapped behind our backs. But
> > > we do need to be prepared for it to be unmapped.
> >
> > Welp, that doesn't protect us from the inverse, where the page is
> > mapped elsewhere and the other ptes are going away. So this won't be
> > enough, unfortunately.
> >
> > > > Does that mean that we just have to reinstate the folio_mapped() checks
> > > > in mm/memcontrol.c i.e. revert all mm/memcontrol.c changes from the
> > > > commit? Or does it invalidate the whole project to remove
> > > > lock_page_memcg() from mm/rmap.c?
> >
> > Short of further restricting the pages that can be moved, I don't see
> > how we can get rid of the cgroup locks in rmap after all. :(
> >
> > We can try limiting move candidates to present ptes. But maybe it's
> > indeed time to deprecate the legacy charge moving altogether, and get
> > rid of the entire complication.
> >
> > Hugh, Shakeel, Michal, what do you think?
>
> I'm certainly not against deprecating it - it's a largish body of odd
> code, which poses significant problems, yet is very seldom used; but I
> feel that we'd all like to see it gone from rmap quicker than it can
> be fully deprecated out of existence.
>
> I do wonder if any user would notice, if we quietly removed its
> operation on non-present ptes; certainly there *might* be users
> relying on that behaviour, but I doubt that many would.
>
> Alternatively (although I think Linus's objection to it in rmap is on
> both aesthetic and performance grounds, and retaining any trace of it
> in rmap.c still fails the aesthetic), can there be some static-keying
> done, to eliminate (un)lock_page_memcg() overhead for all but those few
> who actually indulge in moving memcg charge at immigrate? (But I think
> you would have already done that if it were possible.)
>

My preference would be going with the removal of non-present ptes over
static-key in [un]lock_page_memcg(). How about the following steps:

1. Add warning in memory.move_charge_at_immigrate now (6.1/6.2) that
this is going away and also backport it to the stable kernels.

2. For 6.2 (or 6.3), remove the non-present pte migration with some
additional text in the warning and do the rmap cleanup.

3. After 3 or 4 releases (and hopefully finding no real users), we
deprecate this completely.

Step 3 can be delayed if there are some users depending on it. However
we need to be firm that this is going away irrespective.

Shakeel

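A rough sketch of what step 1 could look like, assuming the warning lands in
the cgroup v1 write handler for move_charge_at_immigrate; the handler body
is shown in simplified form and the message text is invented for
illustration.

/*
 * Rough sketch of step 1, not a patch from this thread: warn once when
 * userspace enables charge moving.
 */
static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
					struct cftype *cft, u64 val)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

	if (val & ~MOVE_MASK)
		return -EINVAL;

	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
		     "Please report to linux-mm@kvack.org if you depend on it.\n");

	memcg->move_charge_at_immigrate = val;
	return 0;
}
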
On Wed, 30 Nov 2022, Shakeel Butt wrote:
>
> 2. For 6.2 (or 6.3), remove the non-present pte migration with some
> additional text in the warning and do the rmap cleanup.

I just had an idea for softening the impact of that change: a moment's
more thought may prove it's a terrible idea, but right now I like it.

What if we keep the non-present pte migration throughout the deprecation
period, but with a change to where the folio_trylock() is done, and
a refusal to move the charge on the page of a non-present pte, if that
page/folio is currently mapped anywhere else - the folio lock preventing
it from then becoming mapped while in mem_cgroup_move_account().

There's an argument that that's a better implementation anyway: that
we should not interfere with others' pages; but perhaps it would turn
out to be unimplementable, or would make for less predictable behaviour.

Hugh

On Wed, Nov 30, 2022 at 09:36:15AM -0800, Hugh Dickins wrote:
> On Wed, 30 Nov 2022, Shakeel Butt wrote:
>
> > 2. For 6.2 (or 6.3), remove the non-present pte migration with some
> > additional text in the warning and do the rmap cleanup.
>
> I just had an idea for softening the impact of that change: a moment's
> more thought may prove it's a terrible idea, but right now I like it.
>
> What if we keep the non-present pte migration throughout the deprecation
> period, but with a change to where the folio_trylock() is done, and
> a refusal to move the charge on the page of a non-present pte, if that
> page/folio is currently mapped anywhere else - the folio lock preventing
> it from then becoming mapped while in mem_cgroup_move_account().

I would like that better too. Charge moving has always been lossy
(because of trylocking the page, and having to isolate it), but
categorically leaving private swap pages behind seems like a bit much
to sneak in quietly.

> There's an argument that that's a better implementation anyway: that
> we should not interfere with others' pages; but perhaps it would turn
> out to be unimplementable, or would make for less predictable behaviour.

Hm, I think the below should work for swap pages. Do you see anything
obviously wrong with it, or scenarios I haven't considered?

@@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
	 * we call find_get_page() with swapper_space directly.
	 */
	page = find_get_page(swap_address_space(ent), swp_offset(ent));
+
+	/*
+	 * Don't move shared charges. This isn't just for saner move
+	 * semantics, it also ensures that page_mapped() is stable for
+	 * the accounting in mem_cgroup_mapcount().
+	 *
+	 * We have to serialize against the following paths: fork
+	 * (which may copy a page map or a swap pte), fault (which may
+	 * change a swap pte into a page map), unmap (which may cause
+	 * a page map or a swap pte to disappear), and reclaim (which
+	 * may change a page map into a swap pte).
+	 *
+	 * - Without swapcache, we only want to move the charge if
+	 *   there are no other swap ptes. With the pte lock, the
+	 *   swapcount is stable against all of the above scenarios
+	 *   when it's 1 (our pte), which is the case we care about.
+	 *
+	 * - When there is a page in swapcache, we only want to move
+	 *   charges when neither the page nor the swap entry are
+	 *   mapped elsewhere. The pte lock prevents our pte from
+	 *   being forked or unmapped. The page lock will stop faults
+	 *   against, and reclaim of, the swapcache page. So if the
+	 *   page isn't mapped, and the swap count is 1 (our pte), the
+	 *   test results are stable and the charge is exclusive.
+	 */
+	if (!page && __swap_count(ent) != 1)
+		return NULL;
+
+	if (page) {
+		if (!trylock_page(page)) {
+			put_page(page);
+			return NULL;
+		}
+		if (page_mapped(page) || __swap_count(ent) != 1) {
+			unlock_page(page);
+			put_page(page);
+			return NULL;
+		}
+	}
+
	entry->val = ent.val;

	return page;

On Wed, 30 Nov 2022, Johannes Weiner wrote:
>
> Hm, I think the below should work for swap pages. Do you see anything
> obviously wrong with it, or scenarios I haven't considered?
>

I think you're overcomplicating it, with the __swap_count(ent) business,
and consequent unnecessarily detailed comments on the serialization.

Page/folio lock prevents a !page_mapped(page) becoming a page_mapped(page),
whether it's in swap cache or in file cache; it does not stop the sharing
count going further up, or down even to 0, but we just don't need to worry
about that sharing count - the MC_TARGET_PAGE case does not reject pages
with mapcount > 1, so why complicate the swap or file case in that way?

(Yes, it can be argued that all such sharing should be rejected; but we
didn't come here to argue improvements to memcg charge moving semantics:
just to minimize its effect on rmap, before it is fully deprecated.)

Or am I missing the point of why you add that complication?

> @@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,

Don't forget to trylock the page in the device_private case before this.

> 	 * we call find_get_page() with swapper_space directly.
> 	 */
> 	page = find_get_page(swap_address_space(ent), swp_offset(ent));
> +
> +	/*
> +	 * Don't move shared charges. This isn't just for saner move
> +	 * semantics, it also ensures that page_mapped() is stable for
> +	 * the accounting in mem_cgroup_mapcount().

mem_cgroup_mapcount()??

> +	 *
> +	 * We have to serialize against the following paths: fork
> +	 * (which may copy a page map or a swap pte), fault (which may
> +	 * change a swap pte into a page map), unmap (which may cause
> +	 * a page map or a swap pte to disappear), and reclaim (which
> +	 * may change a page map into a swap pte).
> +	 *
> +	 * - Without swapcache, we only want to move the charge if
> +	 *   there are no other swap ptes. With the pte lock, the
> +	 *   swapcount is stable against all of the above scenarios
> +	 *   when it's 1 (our pte), which is the case we care about.
> +	 *
> +	 * - When there is a page in swapcache, we only want to move
> +	 *   charges when neither the page nor the swap entry are
> +	 *   mapped elsewhere. The pte lock prevents our pte from
> +	 *   being forked or unmapped. The page lock will stop faults
> +	 *   against, and reclaim of, the swapcache page. So if the
> +	 *   page isn't mapped, and the swap count is 1 (our pte), the
> +	 *   test results are stable and the charge is exclusive.
> +	 */
> +	if (!page && __swap_count(ent) != 1)
> +		return NULL;
> +
> +	if (page) {
> +		if (!trylock_page(page)) {
> +			put_page(page);
> +			return NULL;
> +		}
> +		if (page_mapped(page) || __swap_count(ent) != 1) {
> +			unlock_page(page);
> +			put_page(page);
> +			return NULL;
> +		}
> +	}
> +
> 	entry->val = ent.val;
>
> 	return page;

Looks right, without the __swap_count() additions and swap count comments.

And similar code in mc_handle_file_pte() - or are you saying that only
swap should be handled this way? I would disagree.

And matching trylock in mc_handle_present_pte() (and get_mctgt_type_thp()),
instead of in mem_cgroup_move_account().

I haven't checked to see where the page then needs to be unlocked,
probably some new places.

And I don't know what will be best for the preliminary precharge pass:
doesn't really want the page lock at all, but it may be unnecessary
complication to avoid taking it then unlocking it in that pass.

Hugh

On Wed, Nov 30, 2022 at 04:13:23PM -0800, Hugh Dickins wrote:
> On Wed, 30 Nov 2022, Johannes Weiner wrote:
> >
> > Hm, I think the below should work for swap pages. Do you see anything
> > obviously wrong with it, or scenarios I haven't considered?
>
> I think you're overcomplicating it, with the __swap_count(ent) business,
> and consequent unnecessarily detailed comments on the serialization.
>
> Page/folio lock prevents a !page_mapped(page) becoming a page_mapped(page),
> whether it's in swap cache or in file cache; it does not stop the sharing
> count going further up, or down even to 0, but we just don't need to worry
> about that sharing count - the MC_TARGET_PAGE case does not reject pages
> with mapcount > 1, so why complicate the swap or file case in that way?
>
> (Yes, it can be argued that all such sharing should be rejected; but we
> didn't come here to argue improvements to memcg charge moving semantics:
> just to minimize its effect on rmap, before it is fully deprecated.)
>
> Or am I missing the point of why you add that complication?

No, it just seemed odd to move shared swap *unless* it's partially
faulted. But you're right, it's probably not worth the hassle. I'll
cut this down to the page_mapped() check.

The struggle of writing code for Schroedinger's User...

> > @@ -5637,6 +5645,46 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>
> Don't forget to trylock the page in the device_private case before this.

Yep, thanks!

> > 	 * we call find_get_page() with swapper_space directly.
> > 	 */
> > 	page = find_get_page(swap_address_space(ent), swp_offset(ent));
> > +
> > +	/*
> > +	 * Don't move shared charges. This isn't just for saner move
> > +	 * semantics, it also ensures that page_mapped() is stable for
> > +	 * the accounting in mem_cgroup_mapcount().
>
> mem_cgroup_mapcount()??

mem_cgroup_move_account() of course! Will fix.

> > +	 * We have to serialize against the following paths: fork
> > +	 * (which may copy a page map or a swap pte), fault (which may
> > +	 * change a swap pte into a page map), unmap (which may cause
> > +	 * a page map or a swap pte to disappear), and reclaim (which
> > +	 * may change a page map into a swap pte).
> > +	 *
> > +	 * - Without swapcache, we only want to move the charge if
> > +	 *   there are no other swap ptes. With the pte lock, the
> > +	 *   swapcount is stable against all of the above scenarios
> > +	 *   when it's 1 (our pte), which is the case we care about.
> > +	 *
> > +	 * - When there is a page in swapcache, we only want to move
> > +	 *   charges when neither the page nor the swap entry are
> > +	 *   mapped elsewhere. The pte lock prevents our pte from
> > +	 *   being forked or unmapped. The page lock will stop faults
> > +	 *   against, and reclaim of, the swapcache page. So if the
> > +	 *   page isn't mapped, and the swap count is 1 (our pte), the
> > +	 *   test results are stable and the charge is exclusive.

... and edit this down accordingly.

> > +	 */
> > +	if (!page && __swap_count(ent) != 1)
> > +		return NULL;
> > +
> > +	if (page) {
> > +		if (!trylock_page(page)) {
> > +			put_page(page);
> > +			return NULL;
> > +		}
> > +		if (page_mapped(page) || __swap_count(ent) != 1) {
> > +			unlock_page(page);
> > +			put_page(page);
> > +			return NULL;
> > +		}
> > +	}
> > +
> > 	entry->val = ent.val;
> >
> > 	return page;
>
> Looks right, without the __swap_count() additions and swap count comments.
>
> And similar code in mc_handle_file_pte() - or are you saying that only
> swap should be handled this way? I would disagree.

Right, same rules apply there. I only pasted the swap one to make sure
we get aligned on the basic strategy.

> And matching trylock in mc_handle_present_pte() (and get_mctgt_type_thp()),
> instead of in mem_cgroup_move_account().

Yes.

> I haven't checked to see where the page then needs to be unlocked,
> probably some new places.

Yes, the callers of get_mctgt_type*() need to unlock (if target is
passed and the page is returned). It looks straight-forward, they
already have to do put_page().

> And I don't know what will be best for the preliminary precharge pass:
> doesn't really want the page lock at all, but it may be unnecessary
> complication to avoid taking it then unlocking it in that pass.

We could make it conditional on target, which precharge doesn't pass,
but I agree it's likely not worth optimizing that code at this point.

Thanks for taking a look, Hugh, that's excellent input. I'll finish
this patch, rebase the rmap patch on it, and add a new one to issue a
deprecation warning in mem_cgroup_move_charge_write().

Johannes

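For reference, the cut-down tail of mc_handle_swap_pte() that this exchange
converges on might look roughly like the sketch below; this is a reading of
the plan, not the reposted patch: keep the trylock and the page_mapped()
test, drop the __swap_count() comparisons.

	/*
	 * Sketch only - not the reposted patch. The page lock keeps
	 * page_mapped() stable here, and pages that are mapped elsewhere
	 * are simply not moved.
	 */
	page = find_get_page(swap_address_space(ent), swp_offset(ent));
	if (page) {
		if (!trylock_page(page)) {
			put_page(page);
			return NULL;
		}
		if (page_mapped(page)) {
			unlock_page(page);
			put_page(page);
			return NULL;
		}
		/* page is returned locked; the caller unlocks it */
	}
	entry->val = ent.val;
	return page;
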
On Thu, 1 Dec 2022, Johannes Weiner wrote:
> On Wed, Nov 30, 2022 at 04:13:23PM -0800, Hugh Dickins wrote:
>
> > And I don't know what will be best for the preliminary precharge pass:
> > doesn't really want the page lock at all, but it may be unnecessary
> > complication to avoid taking it then unlocking it in that pass.
>
> We could make it conditional on target, which precharge doesn't pass,
> but I agree it's likely not worth optimizing that code at this point.

I hadn't noticed that NULL target so easily distinguishes that case:
unless it goes on to uglify things more (which I think it won't, seeing
now how there are already plenty of conditionals on target), I would
prefer to avoid the trylock and unlock in the precharge case; but
decide for yourself.

Hugh

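One possible shape for the target-conditional locking discussed above, as a
sketch only: a shared helper (hypothetical name) that the non-present-pte
handlers could call, skipping the page lock during the precharge pass, which
passes a NULL target.

/*
 * Hypothetical sketch, not code from this thread: a helper that
 * mc_handle_swap_pte() and mc_handle_file_pte() could share. target is
 * NULL during the precharge pass, so no page lock is taken there; during
 * the move pass the page comes back locked, or not at all.
 */
static struct page *mc_take_locked_page(struct page *page,
					union mc_target *target)
{
	if (!page || !target)
		return page;

	if (!trylock_page(page)) {
		put_page(page);
		return NULL;
	}
	if (page_mapped(page)) {	/* mapped elsewhere: leave it alone */
		unlock_page(page);
		put_page(page);
		return NULL;
	}
	return page;	/* caller unlocks after mem_cgroup_move_account() */
}

mc_handle_present_pte() would presumably want just the trylock part, since
the page there is by definition mapped by the pte being moved.
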
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..52b86ca7a78e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5676,7 +5676,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to: mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * This function acquires folio_lock() and folio_lock_memcg(). The
+ * caller must exclude all other possible ways of accessing
+ * page->memcg, such as LRU isolation (to lock out isolation) and
+ * having the page mapped and pte-locked (to lock out rmap).
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5696,6 +5699,13 @@ static int mem_cgroup_move_account(struct page *page,
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON(compound && !folio_test_large(folio));
 
+	/*
+	 * We're only moving pages mapped into the moving process's
+	 * page tables. The caller's pte lock prevents rmap from
+	 * removing the NR_x_MAPPED state while we transfer it.
+	 */
+	VM_WARN_ON_ONCE(!folio_mapped(folio));
+
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at
 	 * page's memory cgroup of its source page while we change it.
@@ -5715,30 +5725,25 @@ static int mem_cgroup_move_account(struct page *page,
 	folio_memcg_lock(folio);
 
 	if (folio_test_anon(folio)) {
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_transhuge(folio)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
-			}
+		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+
+		if (folio_test_transhuge(folio)) {
+			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
 		}
 	} else {
 		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
 
+		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
+
 		if (folio_test_swapbacked(folio)) {
 			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
 		}
 
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
-		}
-
 		if (folio_test_dirty(folio)) {
 			struct address_space *mapping = folio_mapping(folio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 459dc1c44d8a..11a4894158db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
 
-	if (unlikely(PageKsm(page)))
-		lock_page_memcg(page);
-
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1254,9 +1251,6 @@ void page_add_anon_rmap(struct page *page,
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 
-	if (unlikely(PageKsm(page)))
-		unlock_page_memcg(page);
-
 	/* address might be in next vma when migration races vma_adjust */
 	else if (first)
 		__page_set_anon_rmap(page, vma, address,
 				     !!(flags & RMAP_EXCLUSIVE));
 	else
 		__page_check_anon_rmap(page, vma, address);
@@ -1321,7 +1315,6 @@ void page_add_file_rmap(struct page *page,
 	bool first;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
-	lock_page_memcg(page);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
@@ -1349,7 +1342,6 @@ void page_add_file_rmap(struct page *page,
 				NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
-	unlock_page_memcg(page);
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1378,8 +1370,6 @@ void page_remove_rmap(struct page *page,
 		return;
 	}
 
-	lock_page_memcg(page);
-
 	/* Is page being unmapped by PTE? Is this its last map to be removed? */
 	if (likely(!compound)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
@@ -1427,8 +1417,6 @@ void page_remove_rmap(struct page *page,
 	 * and remember that it's only reliable while mapped.
 	 */
 
-	unlock_page_memcg(page);
-
 	munlock_vma_page(page, vma, compound);
 }

rmap changes (mapping and unmapping) of a page currently take
lock_page_memcg() to serialize 1) update of the mapcount and the
cgroup mapped counter with 2) cgroup moving the page and updating the
old cgroup and the new cgroup counters based on page_mapped().

Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), we used to reassign all pages that could be found
on a cgroup's LRU list on deletion - something that rmap didn't
naturally serialize against. Since that commit, however, the only
pages that get moved are those mapped into page tables of a task
that's being migrated. In that case, the pte lock is always held (and
we know the page is mapped), which keeps rmap changes at bay already.

The additional lock_page_memcg() by rmap is redundant. Remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 35 ++++++++++++++++++++---------------
 mm/rmap.c       | 12 ------------
 2 files changed, 20 insertions(+), 27 deletions(-)