Message ID | 29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com (mailing list archive)
---|---
State | New
Series | [hotfix] mm: fix crashes from deferred split racing folio migration
On 2024/7/2 15:40, Hugh Dickins wrote:
> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> flags when freeing, yet the flags shown are not bad: PG_locked had been
> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> symptoms implying double free by deferred split and large folio migration.
>
> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> folio migration") was right to fix the memcg-dependent locking broken in
> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> but missed a subtlety of deferred_split_scan(): it moves folios to its own
> local list to work on them without split_queue_lock, during which time
> folio->_deferred_list is not empty, but even the "right" lock does nothing
> to secure the folio and the list it is on.
>
> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> while the old folio's reference count is temporarily frozen to 0 - adding
> such a freeze in the !mapping case too (originally, folio lock and
> unmapping and no swap cache left an anon folio unreachable, so no freezing
> was needed there: but the deferred split queue offers a way to reach it).

Thanks Hugh.

But after reading your analysis, I am concerned that the
folio_undo_large_rmappable() and deferred_split_scan() may still encounter
a race condition with the local list, even with your patch.

Suppose folio A has already been queued into the local list in
deferred_split_scan() by thread A, but fails to 'folio_trylock' and then
releases the reference count. At the same time, folio A can be frozen by
another thread B in folio_migrate_mapping(). In such a case,
folio_undo_large_rmappable() would remove folio A from the local list
without *any* lock protection, creating a race condition with the local
list iteration in deferred_split_scan().

Anyway, I think this patch can still fix some possible races. Feel free to
add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>
> Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org
> ---
> This patch against 6.10-rc6: Kefeng has commits in the mm-tree which
> will need adjustment to go over this, but we can both check the
> result. I have wondered whether just reverting 85ce2c517ade and its
> subsequent fixups would be better: but that would be a bigger job,
> and probably not the right choice.
>
>  mm/memcontrol.c | 11 -----------
>  mm/migrate.c    | 13 +++++++++++++
>  2 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..8f2f1bb18c9c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>  	/* Transfer the charge and the css ref */
>  	commit_charge(new, memcg);
> -	/*
> -	 * If the old folio is a large folio and is in the split queue, it needs
> -	 * to be removed from the split queue now, in case getting an incorrect
> -	 * split queue in destroy_large_folio() after the memcg of the old folio
> -	 * is cleared.
> -	 *
> -	 * In addition, the old folio is about to be freed after migration, so
> -	 * removing from the split queue a bit earlier seems reasonable.
> -	 */
> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
> -		folio_undo_large_rmappable(old);
>  	old->memcg_data = 0;
>  }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a8c6f466e33a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	if (folio_ref_count(folio) != expected_count)
>  		return -EAGAIN;
>
> +	/* Take off deferred split queue while frozen and memcg set */
> +	if (folio_test_large(folio) &&
> +	    folio_test_large_rmappable(folio)) {
> +		if (!folio_ref_freeze(folio, expected_count))
> +			return -EAGAIN;
> +		folio_undo_large_rmappable(folio);
> +		folio_ref_unfreeze(folio, expected_count);
> +	}
> +
>  	/* No turning back from here */
>  	newfolio->index = folio->index;
>  	newfolio->mapping = folio->mapping;
> @@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		return -EAGAIN;
>  	}
>
> +	/* Take off deferred split queue while frozen and memcg set */
> +	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> +		folio_undo_large_rmappable(folio);
> +
>  	/*
>  	 * Now we know that no one else is looking at the folio:
>  	 * no turning back from here.
On Tue, 2 Jul 2024, Baolin Wang wrote:
> On 2024/7/2 15:40, Hugh Dickins wrote:
> > Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> > flags when freeing, yet the flags shown are not bad: PG_locked had been
> > set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> > deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> > symptoms implying double free by deferred split and large folio migration.
> >
> > 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> > folio migration") was right to fix the memcg-dependent locking broken in
> > 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> > but missed a subtlety of deferred_split_scan(): it moves folios to its own
> > local list to work on them without split_queue_lock, during which time
> > folio->_deferred_list is not empty, but even the "right" lock does nothing
> > to secure the folio and the list it is on.
> >
> > Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> > folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> > while the old folio's reference count is temporarily frozen to 0 - adding
> > such a freeze in the !mapping case too (originally, folio lock and

(I should have added "isolation and" into that list of conditions.)

> > unmapping and no swap cache left an anon folio unreachable, so no freezing
> > was needed there: but the deferred split queue offers a way to reach it).
>
> Thanks Hugh.
>
> But after reading your analysis, I am concerned that the
> folio_undo_large_rmappable() and deferred_split_scan() may still encounter a
> race condition with the local list, even with your patch.
>
> Suppose folio A has already been queued into the local list in
> deferred_split_scan() by thread A, but fails to 'folio_trylock' and then
> releases the reference count. At the same time, folio A can be frozen by
> another thread B in folio_migrate_mapping(). In such a case,
> folio_undo_large_rmappable() would remove folio A from the local list without
> *any* lock protection, creating a race condition with the local list iteration
> in deferred_split_scan().

It's a good doubt to raise, but I think we're okay: because Kirill
designed the deferred split list (and its temporary local list) to
be safe in that way.

You're right that if thread B's freeze to 0 wins the race, thread B
will be messing with a list on thread A's stack while thread A is
quite possibly running; but thread A will not leave that stack frame
without taking again the split_queue_lock which thread B holds while
removing from the list.

We would certainly not want to introduce such a subtlety right now!
But never mind page migration, this is how it has always worked when
racing with the folio being freed at the same time - maybe
deferred_split_scan() wins the freeing race and is the one to remove
folio from deferred split list, or maybe the other freer does that.

I forget whether there was an initial flurry of races to be fixed when
it came in, but it has been stable against concurrent freeing for years.

Please think this over again: do not trust my honeyed words!

>
> Anyway, I think this patch can still fix some possible races. Feel free to
> add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Thanks, but I certainly don't want this to go into the tree if it's
still flawed as you suggest.

Hugh
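To make the lock dance under discussion concrete, here is a much-abridged
sketch of the deferred_split_scan() pattern, simplified from 6.10's
mm/huge_memory.c (shrinker accounting, memcg lookup and limits are omitted,
and the _sketch name is ours, not the kernel's):

static unsigned long deferred_split_scan_sketch(struct deferred_split *ds_queue)
{
	struct folio *folio, *next;
	LIST_HEAD(list);		/* thread A's on-stack local list */
	unsigned long flags;
	int split = 0;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
							_deferred_list) {
		if (folio_try_get(folio)) {
			/* reference held: folio_ref_freeze() to 0 now fails */
			list_move(&folio->_deferred_list, &list);
		} else {
			/* refcount 0: whoever froze or freed it unqueues it */
			list_del_init(&folio->_deferred_list);
			ds_queue->split_queue_len--;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* Work on the local list without split_queue_lock held */
	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (folio_trylock(folio)) {
			/* split_folio() removes folio from list on success */
			if (!split_folio(folio))
				split++;
			folio_unlock(folio);
		}
		folio_put(folio);	/* may drop the final reference */
	}

	/*
	 * Thread A cannot leave this frame without retaking the lock:
	 * this is what lets a racing thread B safely list_del entries
	 * from "list" under the same split_queue_lock.
	 */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&list, &ds_queue->split_queue);
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	return split;
}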
On 2024/7/3 00:15, Hugh Dickins wrote: > On Tue, 2 Jul 2024, Baolin Wang wrote: >> On 2024/7/2 15:40, Hugh Dickins wrote: >>> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on >>> flags when freeing, yet the flags shown are not bad: PG_locked had been >>> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from >>> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN >>> symptoms implying double free by deferred split and large folio migration. >>> >>> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large >>> folio migration") was right to fix the memcg-dependent locking broken in >>> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), >>> but missed a subtlety of deferred_split_scan(): it moves folios to its own >>> local list to work on them without split_queue_lock, during which time >>> folio->_deferred_list is not empty, but even the "right" lock does nothing >>> to secure the folio and the list it is on. >>> >>> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so >>> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() >>> while the old folio's reference count is temporarily frozen to 0 - adding >>> such a freeze in the !mapping case too (originally, folio lock and > > (I should have added "isolation and" into that list of conditions.) > >>> unmapping and no swap cache left an anon folio unreachable, so no freezing >>> was needed there: but the deferred split queue offers a way to reach it). >> >> Thanks Hugh. >> >> But after reading your analysis, I am concerned that the >> folio_undo_large_rmappable() and deferred_split_scan() may still encounter a >> race condition with the local list, even with your patch. >> >> Suppose folio A has already been queued into the local list in >> deferred_split_scan() by thread A, but fails to 'folio_trylock' and then >> releases the reference count. At the same time, folio A can be frozen by >> another thread B in folio_migrate_mapping(). In such a case, >> folio_undo_large_rmappable() would remove folio A from the local list without >> *any* lock protection, creating a race condition with the local list iteration >> in deferred_split_scan(). > > It's a good doubt to raise, but I think we're okay: because Kirill > designed the deferred split list (and its temporary local list) to > be safe in that way. > > You're right that if thread B's freeze to 0 wins the race, thread B > will be messing with a list on thread A's stack while thread A is > quite possibly running; but thread A will not leave that stack frame > without taking again the split_queue_lock which thread B holds while > removing from the list. > > We would certainly not want to introduce such a subtlety right now! > But never mind page migration, this is how it has always worked when > racing with the folio being freed at the same time - maybe > deferred_split_scan() wins the freeing race and is the one to remove > folio from deferred split list, or maybe the other freer does that. Yes, thanks for explanation. And after thinking more, the 'list_for_each_entry_safe' in deferred_split_scan() can maintain the list iteration safety, so I think this is safe. > I forget whether there was an initial flurry of races to be fixed when > it came in, but it has been stable against concurrent freeing for years. > > Please think this over again: do not trust my honeyed words! > >> >> Anyway, I think this patch can still fix some possible races. 
Feel free to
>> add:
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Thanks, but I certainly don't want this to go into the tree if it's
> still flawed as you suggest.

Now I have no doubts about this fix, and please continue to keep my
Reviewed-by tag, thanks.
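For reference, the '_safe' iterator Baolin points to only caches the next
node before visiting the current one; as of recent kernels it reads
(include/linux/list.h):

#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_first_entry(head, typeof(*pos), member),	\
		n = list_next_entry(pos, member);			\
	     !list_entry_is_head(pos, head, member);			\
	     pos = n, n = list_next_entry(n, member))

So it tolerates removal of the entry just visited (n was cached before the
folio_put()), while entries still ahead in the walk cannot be unqueued
underneath the scan: deferred_split_scan() still holds a reference on each
of them, so a folio_ref_freeze() to 0 must fail there until the scan's
folio_put() has released that reference.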
On Tue, 2 Jul 2024 09:15:54 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> >
> > Anyway, I think this patch can still fix some possible races. Feel free to
> > add:
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Thanks, but I certainly don't want this to go into the tree if it's
> still flawed as you suggest.

I queued it for testing but I shall not send it upstream until we have
sorted these things out.
On 2 Jul 2024, at 3:40, Hugh Dickins wrote: > Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on > flags when freeing, yet the flags shown are not bad: PG_locked had been > set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from > deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN > symptoms implying double free by deferred split and large folio migration. > > 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large > folio migration") was right to fix the memcg-dependent locking broken in > 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), > but missed a subtlety of deferred_split_scan(): it moves folios to its own > local list to work on them without split_queue_lock, during which time > folio->_deferred_list is not empty, but even the "right" lock does nothing > to secure the folio and the list it is on. > > Fortunately, deferred_split_scan() is careful to use folio_try_get(): so > folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() > while the old folio's reference count is temporarily frozen to 0 - adding > such a freeze in the !mapping case too (originally, folio lock and > unmapping and no swap cache left an anon folio unreachable, so no freezing > was needed there: but the deferred split queue offers a way to reach it). > > Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") > Signed-off-by: Hugh Dickins <hughd@google.com> > Cc: stable@vger.kernel.org > --- > This patch against 6.10-rc6: Kefeng has commits in the mm-tree which > which will need adjustment to go over this, but we can both check the > result. I have wondered whether just reverting 85ce2c517ade and its > subsequent fixups would be better: but that would be a bigger job, > and probably not the right choice. > > mm/memcontrol.c | 11 ----------- > mm/migrate.c | 13 +++++++++++++ > 2 files changed, 13 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 71fe2a95b8bd..8f2f1bb18c9c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) > > /* Transfer the charge and the css ref */ > commit_charge(new, memcg); > - /* > - * If the old folio is a large folio and is in the split queue, it needs > - * to be removed from the split queue now, in case getting an incorrect > - * split queue in destroy_large_folio() after the memcg of the old folio > - * is cleared. > - * > - * In addition, the old folio is about to be freed after migration, so > - * removing from the split queue a bit earlier seems reasonable. > - */ > - if (folio_test_large(old) && folio_test_large_rmappable(old)) > - folio_undo_large_rmappable(old); > old->memcg_data = 0; > } > > diff --git a/mm/migrate.c b/mm/migrate.c > index 20cb9f5f7446..a8c6f466e33a 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping, > if (folio_ref_count(folio) != expected_count) > return -EAGAIN; > > + /* Take off deferred split queue while frozen and memcg set */ > + if (folio_test_large(folio) && > + folio_test_large_rmappable(folio)) { > + if (!folio_ref_freeze(folio, expected_count)) > + return -EAGAIN; > + folio_undo_large_rmappable(folio); > + folio_ref_unfreeze(folio, expected_count); > + } > + I wonder if the patch below would make the code look better by using the same freeze/unfreeze pattern like file-backed path. 
After reading the emails between you and Baolin and checking the code,
I think the patch looks good to me. Feel free to add
Reviewed-by: Zi Yan <ziy@nvidia.com>

BTW, this subtlety is very error prone, as Matthew, Ryan, and I all
encountered errors because of this[1][2]. Matthew has a good summary
of the subtlety:

"the (undocumented) logic in deferred_split_scan() that a folio
with a positive refcount will not be removed from the list."

[1] https://lore.kernel.org/linux-mm/Ze9EFdFLXQEUVtKl@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/

diff --git a/mm/migrate.c b/mm/migrate.c
index a8c6f466e33a..afcc0653dcb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping,

 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (folio_ref_count(folio) != expected_count)
+		if (!folio_ref_freeze(folio, expected_count))
 			return -EAGAIN;

 		/* Take off deferred split queue while frozen and memcg set */
 		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio)) {
-			if (!folio_ref_freeze(folio, expected_count))
-				return -EAGAIN;
+		    folio_test_large_rmappable(folio))
 			folio_undo_large_rmappable(folio);
-			folio_ref_unfreeze(folio, expected_count);
-		}
+
+		folio_ref_unfreeze(folio, expected_count);

 		/* No turning back from here */
 		newfolio->index = folio->index;

>  	/* No turning back from here */
>  	newfolio->index = folio->index;
>  	newfolio->mapping = folio->mapping;
> @@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		return -EAGAIN;
>  	}
>
> +	/* Take off deferred split queue while frozen and memcg set */
> +	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> +		folio_undo_large_rmappable(folio);
> +
>  	/*
>  	 * Now we know that no one else is looking at the folio:
>  	 * no turning back from here.
> --
> 2.35.3

--
Best Regards,
Yan, Zi
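For readers without a tree at hand, this is the 6.10-era unqueue helper
that the paths above funnel into, reconstructed from mm/huge_memory.c and
the hunks quoted later in this thread (abridged, so treat it as a sketch
rather than verbatim source):

void folio_undo_large_rmappable(struct folio *folio)
{
	struct deferred_split *ds_queue;
	unsigned long flags;

	if (folio_order(folio) <= 1)
		return;

	/*
	 * At this point, there is no one trying to add the folio to
	 * deferred_list. If folio is not in deferred_list, it's safe
	 * to check without acquiring the split_queue_lock.
	 */
	if (data_race(list_empty(&folio->_deferred_list)))
		return;

	ds_queue = get_deferred_split_queue(folio);
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

The list_del_init() happens only under split_queue_lock, and callers rely
on the invariant Matthew describes: they invoke this only once the folio
can no longer gain new references, either because it is being freed or,
with Hugh's patch, because its refcount is frozen to 0.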
On 03.07.24 16:30, Zi Yan wrote: > On 2 Jul 2024, at 3:40, Hugh Dickins wrote: > >> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on >> flags when freeing, yet the flags shown are not bad: PG_locked had been >> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from >> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN >> symptoms implying double free by deferred split and large folio migration. >> >> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large >> folio migration") was right to fix the memcg-dependent locking broken in >> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), >> but missed a subtlety of deferred_split_scan(): it moves folios to its own >> local list to work on them without split_queue_lock, during which time >> folio->_deferred_list is not empty, but even the "right" lock does nothing >> to secure the folio and the list it is on. >> >> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so >> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() >> while the old folio's reference count is temporarily frozen to 0 - adding >> such a freeze in the !mapping case too (originally, folio lock and >> unmapping and no swap cache left an anon folio unreachable, so no freezing >> was needed there: but the deferred split queue offers a way to reach it). >> >> Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") >> Signed-off-by: Hugh Dickins <hughd@google.com> >> Cc: stable@vger.kernel.org >> --- >> This patch against 6.10-rc6: Kefeng has commits in the mm-tree which >> which will need adjustment to go over this, but we can both check the >> result. I have wondered whether just reverting 85ce2c517ade and its >> subsequent fixups would be better: but that would be a bigger job, >> and probably not the right choice. >> >> mm/memcontrol.c | 11 ----------- >> mm/migrate.c | 13 +++++++++++++ >> 2 files changed, 13 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 71fe2a95b8bd..8f2f1bb18c9c 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) >> >> /* Transfer the charge and the css ref */ >> commit_charge(new, memcg); >> - /* >> - * If the old folio is a large folio and is in the split queue, it needs >> - * to be removed from the split queue now, in case getting an incorrect >> - * split queue in destroy_large_folio() after the memcg of the old folio >> - * is cleared. >> - * >> - * In addition, the old folio is about to be freed after migration, so >> - * removing from the split queue a bit earlier seems reasonable. 
>> - */ >> - if (folio_test_large(old) && folio_test_large_rmappable(old)) >> - folio_undo_large_rmappable(old); >> old->memcg_data = 0; >> } >> >> diff --git a/mm/migrate.c b/mm/migrate.c >> index 20cb9f5f7446..a8c6f466e33a 100644 >> --- a/mm/migrate.c >> +++ b/mm/migrate.c >> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping, >> if (folio_ref_count(folio) != expected_count) >> return -EAGAIN; >> >> + /* Take off deferred split queue while frozen and memcg set */ >> + if (folio_test_large(folio) && >> + folio_test_large_rmappable(folio)) { >> + if (!folio_ref_freeze(folio, expected_count)) >> + return -EAGAIN; >> + folio_undo_large_rmappable(folio); >> + folio_ref_unfreeze(folio, expected_count); >> + } >> + > > I wonder if the patch below would make the code look better by using > the same freeze/unfreeze pattern like file-backed path. After > reading the emails between you and Baolin and checking the code, > I think the patch looks good to me. Feel free to add > Reviewed-by: Zi Yan <ziy@nvidia.com> > > BTW, this subtlety is very error prone, as Matthew, Ryan, and I all > encountered errors because of this[1][2]. Matthew has a good summary > of the subtlety: > > "the (undocumented) logic in deferred_split_scan() that a folio > with a positive refcount will not be removed from the list." > > [1] https://lore.kernel.org/linux-mm/Ze9EFdFLXQEUVtKl@casper.infradead.org/ > [2] https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/ > > diff --git a/mm/migrate.c b/mm/migrate.c > index a8c6f466e33a..afcc0653dcb7 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping, > > if (!mapping) { > /* Anonymous page without mapping */ > - if (folio_ref_count(folio) != expected_count) > + if (!folio_ref_freeze(folio, expected_count)) > return -EAGAIN; > > /* Take off deferred split queue while frozen and memcg set */ > if (folio_test_large(folio) && > - folio_test_large_rmappable(folio)) { > - if (!folio_ref_freeze(folio, expected_count)) > - return -EAGAIN; > + folio_test_large_rmappable(folio)) > folio_undo_large_rmappable(folio); > - folio_ref_unfreeze(folio, expected_count); > - } > + > + folio_ref_unfreeze(folio, expected_count); > The downside is freezing order-0, where we don't need to freeze, right?
On 3 Jul 2024, at 12:21, David Hildenbrand wrote: > On 03.07.24 16:30, Zi Yan wrote: >> On 2 Jul 2024, at 3:40, Hugh Dickins wrote: >> >>> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on >>> flags when freeing, yet the flags shown are not bad: PG_locked had been >>> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from >>> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN >>> symptoms implying double free by deferred split and large folio migration. >>> >>> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large >>> folio migration") was right to fix the memcg-dependent locking broken in >>> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), >>> but missed a subtlety of deferred_split_scan(): it moves folios to its own >>> local list to work on them without split_queue_lock, during which time >>> folio->_deferred_list is not empty, but even the "right" lock does nothing >>> to secure the folio and the list it is on. >>> >>> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so >>> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() >>> while the old folio's reference count is temporarily frozen to 0 - adding >>> such a freeze in the !mapping case too (originally, folio lock and >>> unmapping and no swap cache left an anon folio unreachable, so no freezing >>> was needed there: but the deferred split queue offers a way to reach it). >>> >>> Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") >>> Signed-off-by: Hugh Dickins <hughd@google.com> >>> Cc: stable@vger.kernel.org >>> --- >>> This patch against 6.10-rc6: Kefeng has commits in the mm-tree which >>> which will need adjustment to go over this, but we can both check the >>> result. I have wondered whether just reverting 85ce2c517ade and its >>> subsequent fixups would be better: but that would be a bigger job, >>> and probably not the right choice. >>> >>> mm/memcontrol.c | 11 ----------- >>> mm/migrate.c | 13 +++++++++++++ >>> 2 files changed, 13 insertions(+), 11 deletions(-) >>> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 71fe2a95b8bd..8f2f1bb18c9c 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) >>> >>> /* Transfer the charge and the css ref */ >>> commit_charge(new, memcg); >>> - /* >>> - * If the old folio is a large folio and is in the split queue, it needs >>> - * to be removed from the split queue now, in case getting an incorrect >>> - * split queue in destroy_large_folio() after the memcg of the old folio >>> - * is cleared. >>> - * >>> - * In addition, the old folio is about to be freed after migration, so >>> - * removing from the split queue a bit earlier seems reasonable. 
>>> - */ >>> - if (folio_test_large(old) && folio_test_large_rmappable(old)) >>> - folio_undo_large_rmappable(old); >>> old->memcg_data = 0; >>> } >>> >>> diff --git a/mm/migrate.c b/mm/migrate.c >>> index 20cb9f5f7446..a8c6f466e33a 100644 >>> --- a/mm/migrate.c >>> +++ b/mm/migrate.c >>> @@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping, >>> if (folio_ref_count(folio) != expected_count) >>> return -EAGAIN; >>> >>> + /* Take off deferred split queue while frozen and memcg set */ >>> + if (folio_test_large(folio) && >>> + folio_test_large_rmappable(folio)) { >>> + if (!folio_ref_freeze(folio, expected_count)) >>> + return -EAGAIN; >>> + folio_undo_large_rmappable(folio); >>> + folio_ref_unfreeze(folio, expected_count); >>> + } >>> + >> >> I wonder if the patch below would make the code look better by using >> the same freeze/unfreeze pattern like file-backed path. After >> reading the emails between you and Baolin and checking the code, >> I think the patch looks good to me. Feel free to add >> Reviewed-by: Zi Yan <ziy@nvidia.com> >> >> BTW, this subtlety is very error prone, as Matthew, Ryan, and I all >> encountered errors because of this[1][2]. Matthew has a good summary >> of the subtlety: >> >> "the (undocumented) logic in deferred_split_scan() that a folio >> with a positive refcount will not be removed from the list." >> >> [1] https://lore.kernel.org/linux-mm/Ze9EFdFLXQEUVtKl@casper.infradead.org/ >> [2] https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/ >> >> diff --git a/mm/migrate.c b/mm/migrate.c >> index a8c6f466e33a..afcc0653dcb7 100644 >> --- a/mm/migrate.c >> +++ b/mm/migrate.c >> @@ -412,17 +412,15 @@ int folio_migrate_mapping(struct address_space *mapping, >> >> if (!mapping) { >> /* Anonymous page without mapping */ >> - if (folio_ref_count(folio) != expected_count) >> + if (!folio_ref_freeze(folio, expected_count)) >> return -EAGAIN; >> >> /* Take off deferred split queue while frozen and memcg set */ >> if (folio_test_large(folio) && >> - folio_test_large_rmappable(folio)) { >> - if (!folio_ref_freeze(folio, expected_count)) >> - return -EAGAIN; >> + folio_test_large_rmappable(folio)) >> folio_undo_large_rmappable(folio); >> - folio_ref_unfreeze(folio, expected_count); >> - } >> + >> + folio_ref_unfreeze(folio, expected_count); >> > > The downside is freezing order-0, where we don't need to freeze, right? Right. I missed that part. Forget about my change above. Thanks. -- Best Regards, Yan, Zi
On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
> flags when freeing, yet the flags shown are not bad: PG_locked had been
> set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
> deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
> symptoms implying double free by deferred split and large folio migration.
>
> 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
> folio migration") was right to fix the memcg-dependent locking broken in
> 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
> but missed a subtlety of deferred_split_scan(): it moves folios to its own
> local list to work on them without split_queue_lock, during which time
> folio->_deferred_list is not empty, but even the "right" lock does nothing
> to secure the folio and the list it is on.
>
> Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
> folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
> while the old folio's reference count is temporarily frozen to 0 - adding
> such a freeze in the !mapping case too (originally, folio lock and
> unmapping and no swap cache left an anon folio unreachable, so no freezing
> was needed there: but the deferred split queue offers a way to reach it).

There's a conflict when applying Kefeng's "mm: refactor
folio_undo_large_rmappable()"
(https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com)
on top of this hotfix.

--- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
+++ mm/memcontrol.c
@@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
 	 * In addition, the old folio is about to be freed after migration, so
 	 * removing from the split queue a bit earlier seems reasonable.
 	 */
-	if (folio_test_large(old) && folio_test_large_rmappable(old))
-		folio_undo_large_rmappable(old);
+	folio_undo_large_rmappable(old);
 	old->memcg_data = 0;
 }

I'm resolving this by simply dropping the above hunk. So Kefeng's
patch is now as below. Please check.

--- a/mm/huge_memory.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/huge_memory.c
@@ -3258,22 +3258,11 @@ out:
 	return ret;
 }

-void folio_undo_large_rmappable(struct folio *folio)
+void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;

-	if (folio_order(folio) <= 1)
-		return;
-
-	/*
-	 * At this point, there is no one trying to add the folio to
-	 * deferred_list. If folio is not in deferred_list, it's safe
-	 * to check without acquiring the split_queue_lock.
-	 */
-	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
-
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {

--- a/mm/internal.h~mm-refactor-folio_undo_large_rmappable
+++ a/mm/internal.h
@@ -622,7 +622,22 @@ static inline void folio_set_order(struc
 #endif
 }

-void folio_undo_large_rmappable(struct folio *folio);
+void __folio_undo_large_rmappable(struct folio *folio);
+static inline void folio_undo_large_rmappable(struct folio *folio)
+{
+	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
+		return;
+
+	/*
+	 * At this point, there is no one trying to add the folio to
+	 * deferred_list. If folio is not in deferred_list, it's safe
+	 * to check without acquiring the split_queue_lock.
+	 */
+	if (data_race(list_empty(&folio->_deferred_list)))
+		return;
+
+	__folio_undo_large_rmappable(folio);
+}

 static inline struct folio *page_rmappable_folio(struct page *page)
 {

--- a/mm/page_alloc.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/page_alloc.c
@@ -2661,8 +2661,7 @@ void free_unref_folios(struct folio_batc
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);

-		if (order > 0 && folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
+		folio_undo_large_rmappable(folio);
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*

--- a/mm/swap.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/swap.c
@@ -123,8 +123,7 @@ void __folio_put(struct folio *folio)
 	}

 	page_cache_release(folio);
-	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
-		folio_undo_large_rmappable(folio);
+	folio_undo_large_rmappable(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -1021,10 +1020,7 @@ void folios_put_refs(struct folio_batch
 			free_huge_folio(folio);
 			continue;
 		}
-		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
-
+		folio_undo_large_rmappable(folio);
 		__page_cache_release(folio, &lruvec, &flags);

 		if (j != i)

--- a/mm/vmscan.c~mm-refactor-folio_undo_large_rmappable
+++ a/mm/vmscan.c
@@ -1439,9 +1439,7 @@ free_it:
 		 */
 		nr_reclaimed += nr_pages;

-		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio))
-			folio_undo_large_rmappable(folio);
+		folio_undo_large_rmappable(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1848,9 +1846,7 @@ static unsigned int move_folios_to_lru(s
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);

-			if (folio_test_large(folio) &&
-			    folio_test_large_rmappable(folio))
-				folio_undo_large_rmappable(folio);
+			folio_undo_large_rmappable(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);
On Wed, 3 Jul 2024, Andrew Morton wrote: > On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote: > > > Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on > > flags when freeing, yet the flags shown are not bad: PG_locked had been > > set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from > > deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN > > symptoms implying double free by deferred split and large folio migration. > > > > 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large > > folio migration") was right to fix the memcg-dependent locking broken in > > 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), > > but missed a subtlety of deferred_split_scan(): it moves folios to its own > > local list to work on them without split_queue_lock, during which time > > folio->_deferred_list is not empty, but even the "right" lock does nothing > > to secure the folio and the list it is on. > > > > Fortunately, deferred_split_scan() is careful to use folio_try_get(): so > > folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() > > while the old folio's reference count is temporarily frozen to 0 - adding > > such a freeze in the !mapping case too (originally, folio lock and > > unmapping and no swap cache left an anon folio unreachable, so no freezing > > was needed there: but the deferred split queue offers a way to reach it). > > There's a conflict when applying Kefeng's "mm: refactor > folio_undo_large_rmappable()" > (https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com) > on top of this hotfix. Yes, anticipated in my "below the --- line" comments: sorry for giving you this nuisance. And perhaps a conflict with another one of Kefeng's, which deletes a hunk in mm/migrate.c just above where I add a hunk: and that's indeed how it should end up, hunk deleted by Kefeng, hunk added by me. > > --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable > +++ mm/memcontrol.c > @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol > * In addition, the old folio is about to be freed after migration, so > * removing from the split queue a bit earlier seems reasonable. > */ > - if (folio_test_large(old) && folio_test_large_rmappable(old)) > - folio_undo_large_rmappable(old); > + folio_undo_large_rmappable(old); > old->memcg_data = 0; > } > > I'm resolving this by simply dropping the above hunk. So Kefeng's > patch is now as below. Please check. Checked, and that is correct, thank you Andrew. Correct, but not quite complete: because I'm sure that if Kefeng had written his patch after mine, he would have made the equivalent change in mm/migrate.c: --- a/mm/migrate.c +++ b/mm/migrate.c @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping, } /* Take off deferred split queue while frozen and memcg set */ - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); /* * Now we know that no one else is looking at the folio: But there's no harm done if you push out a tree without that additional mod: we can add it as a fixup afterwards, it's no more than a cleanup. (I'm on the lookout for an mm.git update, hope to give it a try when it appears.) Thanks, Hugh
On Wed, 3 Jul 2024 20:21:22 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote: > On Wed, 3 Jul 2024, Andrew Morton wrote: > > On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote: > > > > > Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on > > > flags when freeing, yet the flags shown are not bad: PG_locked had been > > > set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from > > > deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN > > > symptoms implying double free by deferred split and large folio migration. > > > > > > 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large > > > folio migration") was right to fix the memcg-dependent locking broken in > > > 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), > > > but missed a subtlety of deferred_split_scan(): it moves folios to its own > > > local list to work on them without split_queue_lock, during which time > > > folio->_deferred_list is not empty, but even the "right" lock does nothing > > > to secure the folio and the list it is on. > > > > > > Fortunately, deferred_split_scan() is careful to use folio_try_get(): so > > > folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() > > > while the old folio's reference count is temporarily frozen to 0 - adding > > > such a freeze in the !mapping case too (originally, folio lock and > > > unmapping and no swap cache left an anon folio unreachable, so no freezing > > > was needed there: but the deferred split queue offers a way to reach it). > > > > There's a conflict when applying Kefeng's "mm: refactor > > folio_undo_large_rmappable()" > > (https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com) > > on top of this hotfix. > > Yes, anticipated in my "below the --- line" comments: > sorry for giving you this nuisance. np > And perhaps a conflict with another one of Kefeng's, which deletes a hunk > in mm/migrate.c just above where I add a hunk: and that's indeed how it > should end up, hunk deleted by Kefeng, hunk added by me. Sorted, I hope. > > > > --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable > > +++ mm/memcontrol.c > > @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol > > * In addition, the old folio is about to be freed after migration, so > > * removing from the split queue a bit earlier seems reasonable. > > */ > > - if (folio_test_large(old) && folio_test_large_rmappable(old)) > > - folio_undo_large_rmappable(old); > > + folio_undo_large_rmappable(old); > > old->memcg_data = 0; > > } > > > > I'm resolving this by simply dropping the above hunk. So Kefeng's > > patch is now as below. Please check. > > Checked, and that is correct, thank you Andrew. great. > Correct, but not quite > complete: because I'm sure that if Kefeng had written his patch after > mine, he would have made the equivalent change in mm/migrate.c: > > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping, > } > > /* Take off deferred split queue while frozen and memcg set */ > - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) > - folio_undo_large_rmappable(folio); > + folio_undo_large_rmappable(folio); > > /* > * Now we know that no one else is looking at the folio: > > But there's no harm done if you push out a tree without that additional > mod: we can add it as a fixup afterwards, it's no more than a cleanup. 
OK, someone please send that along? I'll queue it as a -fix so a single
line of changelog is all that I shall retain (but more is welcome!
People can follow the Link:)

> (I'm on the lookout for an mm.git update, hope to give it a try when it
> appears.)

12 seconds ago.
On 2024/7/4 11:21, Hugh Dickins wrote:
> On Wed, 3 Jul 2024, Andrew Morton wrote:
>> On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>
...
>
> And perhaps a conflict with another one of Kefeng's, which deletes a hunk
> in mm/migrate.c just above where I add a hunk: and that's indeed how it
> should end up, hunk deleted by Kefeng, hunk added by me.
>
>>
>> --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable
>> +++ mm/memcontrol.c
>> @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol
>>  	 * In addition, the old folio is about to be freed after migration, so
>>  	 * removing from the split queue a bit earlier seems reasonable.
>>  	 */
>> -	if (folio_test_large(old) && folio_test_large_rmappable(old))
>> -		folio_undo_large_rmappable(old);
>> +	folio_undo_large_rmappable(old);
>>  	old->memcg_data = 0;
>>  }
>>
>> I'm resolving this by simply dropping the above hunk. So Kefeng's
>> patch is now as below. Please check.
>
> Checked, and that is correct, thank you Andrew. Correct, but not quite
> complete: because I'm sure that if Kefeng had written his patch after
> mine, he would have made the equivalent change in mm/migrate.c:
>

Yes,

> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	}
>
>  	/* Take off deferred split queue while frozen and memcg set */
> -	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
> -		folio_undo_large_rmappable(folio);
> +	folio_undo_large_rmappable(folio);
>
>  	/*
>  	 * Now we know that no one else is looking at the folio:
>
> But there's no harm done if you push out a tree without that additional
> mod: we can add it as a fixup afterwards, it's no more than a cleanup.
>

Maybe we could convert to __folio_undo_large_rmappable() for the !mapping
part, which will avoid the unnecessary freeze/unfreeze for an empty
deferred list.

diff --git a/mm/migrate.c b/mm/migrate.c
index ca65f03acb31..e6af9c25c25b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -412,10 +412,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,

 	if (!mapping) {
 		/* Take off deferred split queue while frozen and memcg set */
 		if (folio_test_large(folio) &&
-		    folio_test_large_rmappable(folio)) {
+		    folio_test_large_rmappable(folio) &&
+		    !data_race(list_empty(&folio->_deferred_list))) {
 			if (!folio_ref_freeze(folio, expected_count))
 				return -EAGAIN;
-			folio_undo_large_rmappable(folio);
+			__folio_undo_large_rmappable(folio);
 			folio_ref_unfreeze(folio, expected_count);
 		}

> (I'm on the lookout for an mm.git update, hope to give it a try when it
> appears.)
>
> Thanks,
> Hugh
>
On Thu, 4 Jul 2024, Kefeng Wang wrote: > On 2024/7/4 11:21, Hugh Dickins wrote: > > On Wed, 3 Jul 2024, Andrew Morton wrote: > >> On Tue, 2 Jul 2024 00:40:55 -0700 (PDT) Hugh Dickins <hughd@google.com> > >> wrote: > >> > ... > > > > And perhaps a conflict with another one of Kefeng's, which deletes a hunk > > in mm/migrate.c just above where I add a hunk: and that's indeed how it > > should end up, hunk deleted by Kefeng, hunk added by me. > > > >> > >> --- mm/memcontrol.c~mm-refactor-folio_undo_large_rmappable > >> +++ mm/memcontrol.c > >> @@ -7832,8 +7832,7 @@ void mem_cgroup_migrate(struct folio *ol > >> * In addition, the old folio is about to be freed after migration, so > >> * removing from the split queue a bit earlier seems reasonable. > >> */ > >> - if (folio_test_large(old) && folio_test_large_rmappable(old)) > >> - folio_undo_large_rmappable(old); > >> + folio_undo_large_rmappable(old); > >> old->memcg_data = 0; > >> } > >> > >> I'm resolving this by simply dropping the above hunk. So Kefeng's > >> patch is now as below. Please check. > > > > Checked, and that is correct, thank you Andrew. Correct, but not quite > > complete: because I'm sure that if Kefeng had written his patch after > > mine, he would have made the equivalent change in mm/migrate.c: > > > > Yes, > > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping, > > } > > > > /* Take off deferred split queue while frozen and memcg set */ > > - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) > > - folio_undo_large_rmappable(folio); > > + folio_undo_large_rmappable(folio); > > > > /* > > * Now we know that no one else is looking at the folio: > > > > But there's no harm done if you push out a tree without that additional > > mod: we can add it as a fixup afterwards, it's no more than a cleanup. > > > Maybe we could convert to __folio_undo_large_rmappable() for !maping part, > which will avoid unnecessary freeze/unfreeze for empty deferred > list. > > diff --git a/mm/migrate.c b/mm/migrate.c > index ca65f03acb31..e6af9c25c25b 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -412,10 +412,11 @@ static int __folio_migrate_mapping(struct address_space > *mapping, > if (!mapping) { > /* Take off deferred split queue while frozen and memcg set */ > if (folio_test_large(folio) && > - folio_test_large_rmappable(folio)) { > + folio_test_large_rmappable(folio) && > + !data_race(list_empty(&folio->_deferred_list))) { > if (!folio_ref_freeze(folio, expected_count)) > return -EAGAIN; > - folio_undo_large_rmappable(folio); > + __folio_undo_large_rmappable(folio); > folio_ref_unfreeze(folio, expected_count); > } > What you show above is exactly what I had when I was originally testing over the top of mm-everything (well, not quite exactly, I don't think I bothered with the data_race()). But I grew to feel that probably everyone else would be happier with less of those internals _deferred_list and __folio_undo_large_rmappable() spread about. There are many ways to play it. I had also considered doing it Zi Yan's way, freezing always in the !mapping case as well as in the mapping case: what overhead it adds would probably get lost amidst all the other overhead of page migration. It will not be surprising if changes come later requiring us always to freeze in the anon !swapcache case too, it always seemed a bit surprising not to need freezing there. 
But for now I decided it's best to
keep the freezing to the case where it's known to be needed (but without
getting into __s).

Many ways to play it, and I've no objection if someone then changes it
around later; but we've no need to depart from what Andrew already has.

Except, he did ask one of us to send along the -fix removing the unnecessary
checks before its second folio_undo_large_rmappable() once your refactor
patch goes in: here it is below.

[I guess this is the wrong place to say so, but folio_undo_large_rmappable()
is a dreadful name: it completely obscures what the function actually does,
and gives the false impression that the folio would be !large_rmappable
afterwards. I hope that one day the name gets changed to something like
folio_unqueue_deferred_split() or folio_cancel_deferred_split().]

[PATCH] mm: refactor folio_undo_large_rmappable() fix

Now that folio_undo_large_rmappable() is an inline function checking
order and large_rmappable for itself (and __folio_undo_large_rmappable()
is now declared even when CONFIG_TRANSPARENT_HUGEPAGE is off) there is
no need for folio_migrate_mapping() to check large and large_rmappable
first (in the mapping case when it has had to freeze anyway).

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
---
For folding in to mm-unstable's "mm: refactor folio_undo_large_rmappable()",
unless I'm too late and it's already mm-stable (no problem, just a cleanup).

 mm/migrate.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -443,8 +443,7 @@ int folio_migrate_mapping(struct address_space *mapping,
 	}

 	/* Take off deferred split queue while frozen and memcg set */
-	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
-		folio_undo_large_rmappable(folio);
+	folio_undo_large_rmappable(folio);

 	/*
 	 * Now we know that no one else is looking at the folio:
On Sat, 6 Jul 2024 14:29:00 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

>
> What you show above is exactly what I had when I was originally testing
> over the top of mm-everything (well, not quite exactly, I don't think I
> bothered with the data_race()). But I grew to feel that probably everyone
> else would be happier with less of those internals _deferred_list and
> __folio_undo_large_rmappable() spread about.
>
> There are many ways to play it. I had also considered doing it Zi Yan's
> way, freezing always in the !mapping case as well as in the mapping case:
> what overhead it adds would probably get lost amidst all the other overhead
> of page migration. It will not be surprising if changes come later requiring
> us always to freeze in the anon !swapcache case too, it always seemed a bit
> surprising not to need freezing there.
>
> Many ways to play it, and I've no objection if someone then changes it
> around later; but we've no need to depart from what Andrew already has.
>
> Except, he did ask one of us to send along the -fix removing the unnecessary
> checks before its second folio_undo_large_rmappable() once your refactor
> patch goes in: here it is below.

Grabbed, thanks.

> [I guess this is the wrong place to say so, but folio_undo_large_rmappable()
> is a dreadful name: it completely obscures what the function actually does,
> and gives the false impression that the folio would be !large_rmappable
> afterwards. I hope that one day the name gets changed to something like
> folio_unqueue_deferred_split() or folio_cancel_deferred_split().]

Naming is important, but so also is commentary.
folio_undo_large_rmappable() lacks any.

> [PATCH] mm: refactor folio_undo_large_rmappable() fix
>
> Now that folio_undo_large_rmappable() is an inline function checking
> order and large_rmappable for itself (and __folio_undo_large_rmappable()
> is now declared even when CONFIG_TRANSPARENT_HUGEPAGE is off) there is
> no need for folio_migrate_mapping() to check large and large_rmappable
> first (in the mapping case when it has had to freeze anyway).
>
> ...
>
> For folding in to mm-unstable's "mm: refactor folio_undo_large_rmappable()",
> unless I'm too late and it's already mm-stable (no problem, just a cleanup).

Missed the mm-stable mergification by >that much<. I'll queue it
separately, thanks.
On 2024/7/7 10:11, Andrew Morton wrote:
> On Sat, 6 Jul 2024 14:29:00 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>
>>
>> What you show above is exactly what I had when I was originally testing
>> over the top of mm-everything (well, not quite exactly, I don't think I
>> bothered with the data_race()). But I grew to feel that probably everyone
>> else would be happier with less of those internals _deferred_list and
>> __folio_undo_large_rmappable() spread about.

Maybe some helper to check whether or not we should unqueue the
_deferred_list, but that is out of scope for this patch, and maybe
not worth it.

>>
>> There are many ways to play it. I had also considered doing it Zi Yan's
>> way, freezing always in the !mapping case as well as in the mapping case:
>> what overhead it adds would probably get lost amidst all the other overhead
>> of page migration. It will not be surprising if changes come later requiring
>> us always to freeze in the anon !swapcache case too, it always seemed a bit
>> surprising not to need freezing there. But for now I decided it's best to
>> keep the freezing to the case where it's known to be needed (but without
>> getting into __s).
>>
>> Many ways to play it, and I've no objection if someone then changes it
>> around later; but we've no need to depart from what Andrew already has.
>>
>> Except, he did ask one of us to send along the -fix removing the unnecessary
>> checks before its second folio_undo_large_rmappable() once your refactor
>> patch goes in: here it is below.

OK, let's keep it simple; thank you for pushing it out.

>
> Grabbed, thanks.
>
>> [I guess this is the wrong place to say so, but folio_undo_large_rmappable()
>> is a dreadful name: it completely obscures what the function actually does,
>> and gives the false impression that the folio would be !large_rmappable
>> afterwards. I hope that one day the name gets changed to something like
>> folio_unqueue_deferred_split() or folio_cancel_deferred_split().]
>
> Naming is important, but so also is commentary.
> folio_undo_large_rmappable() lacks any.
>
>> [PATCH] mm: refactor folio_undo_large_rmappable() fix
>>
>> Now that folio_undo_large_rmappable() is an inline function checking
>> order and large_rmappable for itself (and __folio_undo_large_rmappable()
>> is now declared even when CONFIG_TRANSPARENT_HUGEPAGE is off) there is
>> no need for folio_migrate_mapping() to check large and large_rmappable
>> first (in the mapping case when it has had to freeze anyway).
>>
>> ...
>>
>> For folding in to mm-unstable's "mm: refactor folio_undo_large_rmappable()",
>> unless I'm too late and it's already mm-stable (no problem, just a cleanup).
>
> Missed the mm-stable mergification by >that much<. I'll queue it
> separately, thanks.
> [I guess this is the wrong place to say so, but folio_undo_large_rmappable()
> is a dreadful name: it completely obscures what the function actually does,
> and gives the false impression that the folio would be !large_rmappable
> afterwards. I hope that one day the name gets changed to something like
> folio_unqueue_deferred_split() or folio_cancel_deferred_split().]

Fully agreed, that is absolutely confusing. (and I thought for a second
that the folio would be !large_rmappable afterwards!)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71fe2a95b8bd..8f2f1bb18c9c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)

 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
-	/*
-	 * If the old folio is a large folio and is in the split queue, it needs
-	 * to be removed from the split queue now, in case getting an incorrect
-	 * split queue in destroy_large_folio() after the memcg of the old folio
-	 * is cleared.
-	 *
-	 * In addition, the old folio is about to be freed after migration, so
-	 * removing from the split queue a bit earlier seems reasonable.
-	 */
-	if (folio_test_large(old) && folio_test_large_rmappable(old))
-		folio_undo_large_rmappable(old);
 	old->memcg_data = 0;
 }

diff --git a/mm/migrate.c b/mm/migrate.c
index 20cb9f5f7446..a8c6f466e33a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address_space *mapping,
 	if (folio_ref_count(folio) != expected_count)
 		return -EAGAIN;

+	/* Take off deferred split queue while frozen and memcg set */
+	if (folio_test_large(folio) &&
+	    folio_test_large_rmappable(folio)) {
+		if (!folio_ref_freeze(folio, expected_count))
+			return -EAGAIN;
+		folio_undo_large_rmappable(folio);
+		folio_ref_unfreeze(folio, expected_count);
+	}
+
 	/* No turning back from here */
 	newfolio->index = folio->index;
 	newfolio->mapping = folio->mapping;
@@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address_space *mapping,
 		return -EAGAIN;
 	}

+	/* Take off deferred split queue while frozen and memcg set */
+	if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+		folio_undo_large_rmappable(folio);
+
 	/*
 	 * Now we know that no one else is looking at the folio:
 	 * no turning back from here.
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.

6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.

Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).

Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---
This patch against 6.10-rc6: Kefeng has commits in the mm-tree which
will need adjustment to go over this, but we can both check the
result. I have wondered whether just reverting 85ce2c517ade and its
subsequent fixups would be better: but that would be a bigger job,
and probably not the right choice.

 mm/memcontrol.c | 11 -----------
 mm/migrate.c    | 13 +++++++++++++
 2 files changed, 13 insertions(+), 11 deletions(-)
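A closing note on the primitive this fix leans on. The sketch below models
the semantics on include/linux/page_ref.h (the real helpers sit behind
page_ref wrappers and tracepoints, and the _sketch names are ours):
folio_ref_freeze() succeeds only when the caller accounts for every
reference in existence, and while the count sits at 0, folio_try_get() in
deferred_split_scan() must fail, which is exactly the window in which
folio_undo_large_rmappable() can safely unqueue the folio.

/* Simplified semantics, modeled on include/linux/page_ref.h */
static inline bool folio_ref_freeze_sketch(struct folio *folio, int count)
{
	/* succeeds only if the refcount is exactly "count"; leaves it at 0 */
	return atomic_cmpxchg(&folio->_refcount, count, 0) == count;
}

static inline bool folio_try_get_sketch(struct folio *folio)
{
	/* increments the refcount unless it is 0: fails while frozen */
	return atomic_add_unless(&folio->_refcount, 1, 0);
}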