diff mbox series

[hotfix,2/2] mm/thp: fix deferred split unqueue naming and locking

Message ID 7dc6b280-cd87-acd1-1124-e512e3d2217d@google.com (mailing list archive)
State New
Headers show
Series [hotfix,1/2] mm/thp: fix deferred split queue not partially_mapped | expand

Commit Message

Hugh Dickins Oct. 24, 2024, 4:13 a.m. UTC
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does.  But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list.  __remove_mapping() has frozen refcount to
0 here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.

That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue (which can now be removed), but no check on deferred
queue before adding THP to swapcache (maybe other circumstances prevented
it at that time, but not now).

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails.  Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.

The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
care deferred split queue in memcg charge move path"), which argued that
the existence of a PMD mapping guarantees that the THP cannot be on a
deferred list.  I'm not sure if that was true at the time (swapcache
remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
("mm: split underused THPs").

[Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
large but not PMD-mapped: that's safe, but perhaps not intended: it's
arguable whether the deprecated feature should be updated to work
better with the new feature; but certainly not in this patch.]

Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included.  It
is unclear whether these fixes are realistically needed before 6.12.

Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
 mm/internal.h      | 10 +++++-----
 mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
 mm/memcontrol.c    |  8 +++++---
 mm/migrate.c       |  4 ++--
 mm/page_alloc.c    |  4 +++-
 mm/swap.c          |  4 ++--
 mm/vmscan.c        |  4 ++--
 8 files changed, 65 insertions(+), 29 deletions(-)

Comments

Yang Shi Oct. 24, 2024, 8 p.m. UTC | #1
On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
>
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> Before fixing locking: rename misleading folio_undo_large_rmappable(),
> which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> which is what it does.  But that and its out-of-line __callee are mm
> internals of very limited usability: add comment and WARN_ON_ONCEs to
> check usage; and return a bool to say if a deferred split was unqueued,
> which can then be used in WARN_ON_ONCEs around safety checks (sparing
> callers the arcane conditionals in __folio_unqueue_deferred_split()).
>
> Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> without checking and unqueueing a THP folio from deferred split list;
> which is unfortunate, since the split_queue_lock depends on the memcg
> (when memcg is enabled); so swapout has been unqueueing such THPs later,
> when freeing the folio, using the pgdat's lock instead: potentially
> corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> 0 here, so no problem with calling folio_unqueue_deferred_split() before
> resetting memcg_data.
>
> That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> shrinker memcg aware"): which included a check on swapcache before adding
> to deferred queue (which can now be removed), but no check on deferred
> queue before adding THP to swapcache (maybe other circumstances prevented
> it at that time, but not now).

If I remember correctly, THP just can be added to deferred list when
there is no PMD map before mTHP swapout, so shrink_page_list() did
check THP's compound_mapcount (called _entire_mapcount now) before
adding it to swap cache.

Now the code just checked whether the large folio is on deferred list or not.

>
> Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> folio->memcg_data without checking and unqueueing a THP folio from the
> deferred list, sometimes corrupting "from" memcg's list, like swapout.
> Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> of course unqueues), or skip it if that fails.  Not ideal, but moving
> charge has been requested, and khugepaged should repair the THP later:
> nobody wants new custom unqueueing code just for this deprecated case.
>
> The 87eaceb3faa5 commit did have the code to move from one deferred list
> to another (but was not conscious of its unsafety while refcount non-0);
> but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> care deferred split queue in memcg charge move path"), which argued that
> the existence of a PMD mapping guarantees that the THP cannot be on a
> deferred list.  I'm not sure if that was true at the time (swapcache
> remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> ("mm: split underused THPs").

Same reason as above.

>
> [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> large but not PMD-mapped: that's safe, but perhaps not intended: it's
> arguable whether the deprecated feature should be updated to work
> better with the new feature; but certainly not in this patch.]
>
> Backport to 6.11 should be straightforward. Earlier backports must take
> care that other _deferred_list fixes and dependencies are included.  It
> is unclear whether these fixes are realistically needed before 6.12.
>
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
>  mm/internal.h      | 10 +++++-----
>  mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
>  mm/memcontrol.c    |  8 +++++---
>  mm/migrate.c       |  4 ++--
>  mm/page_alloc.c    |  4 +++-
>  mm/swap.c          |  4 ++--
>  mm/vmscan.c        |  4 ++--
>  8 files changed, 65 insertions(+), 29 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a1d345f1680c..dc7d5bb76495 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>         return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
> -void __folio_undo_large_rmappable(struct folio *folio)
> +/*
> + * __folio_unqueue_deferred_split() is not to be called directly:
> + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> + * limits its calls to those folios which may have a _deferred_list for
> + * queueing THP splits, and that list is (racily observed to be) non-empty.
> + *
> + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> + *
> + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> + * therefore important to unqueue deferred split before changing folio memcg.
> + */
> +bool __folio_unqueue_deferred_split(struct folio *folio)
>  {
>         struct deferred_split *ds_queue;
>         unsigned long flags;
> +       bool unqueued = false;
> +
> +       WARN_ON_ONCE(folio_ref_count(folio));
> +       WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
>
>         ds_queue = get_deferred_split_queue(folio);
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
>                                       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>                 }
>                 list_del_init(&folio->_deferred_list);
> +               unqueued = true;
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +       return unqueued;        /* useful for debug warnings */
>  }
>
>  /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>         if (!partially_mapped && !split_underused_thp)
>                 return;
>
> -       /*
> -        * The try_to_unmap() in page reclaim path might reach here too,
> -        * this may cause a race condition to corrupt deferred split queue.
> -        * And, if page reclaim is already handling the same folio, it is
> -        * unnecessary to handle it again in shrinker.
> -        *
> -        * Check the swapcache flag to determine if the folio is being
> -        * handled by page reclaim since THP swap would add the folio into
> -        * swap cache before calling try_to_unmap().
> -        */
> -       if (folio_test_swapcache(folio))
> -               return;
> -
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (partially_mapped) {
>                 if (!folio_test_partially_mapped(folio)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index 93083bbeeefa..16c1f3cd599e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
>  #endif
>  }
>
> -void __folio_undo_large_rmappable(struct folio *folio);
> -static inline void folio_undo_large_rmappable(struct folio *folio)
> +bool __folio_unqueue_deferred_split(struct folio *folio);
> +static inline bool folio_unqueue_deferred_split(struct folio *folio)
>  {
>         if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> -               return;
> +               return false;
>
>         /*
>          * At this point, there is no one trying to add the folio to
> @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
>          * to check without acquiring the split_queue_lock.
>          */
>         if (data_race(list_empty(&folio->_deferred_list)))
> -               return;
> +               return false;
>
> -       __folio_undo_large_rmappable(folio);
> +       return __folio_unqueue_deferred_split(folio);
>  }
>
>  static inline struct folio *page_rmappable_folio(struct page *page)
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 81d8819f13cd..f8744f5630bb 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
>         css_get(&to->css);
>         css_put(&from->css);
>
> +       /* Warning should never happen, so don't worry about refcount non-0 */
> +       WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>         folio->memcg_data = (unsigned long)to;
>
>         __folio_memcg_unlock(from);
> @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>         enum mc_target_type target_type;
>         union mc_target target;
>         struct folio *folio;
> +       bool tried_split_before = false;
>
> +retry_pmd:
>         ptl = pmd_trans_huge_lock(pmd, vma);
>         if (ptl) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
> @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>                 target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
>                 if (target_type == MC_TARGET_PAGE) {
>                         folio = target.folio;
> +                       /*
> +                        * Deferred split queue locking depends on memcg,
> +                        * and unqueue is unsafe unless folio refcount is 0:
> +                        * split or skip if on the queue? first try to split.
> +                        */
> +                       if (!list_empty(&folio->_deferred_list)) {
> +                               spin_unlock(ptl);
> +                               if (!tried_split_before)
> +                                       split_folio(folio);
> +                               folio_unlock(folio);
> +                               folio_put(folio);
> +                               if (tried_split_before)
> +                                       return 0;
> +                               tried_split_before = true;
> +                               goto retry_pmd;
> +                       }
> +                       /*
> +                        * So long as that pmd lock is held, the folio cannot
> +                        * be racily added to the _deferred_list, because
> +                        * __folio_remove_rmap() will find !partially_mapped.
> +                        */
>                         if (folio_isolate_lru(folio)) {
>                                 if (!mem_cgroup_move_account(folio, true,
>                                                              mc.from, mc.to)) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2703227cce88..06df2af97415 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>         struct obj_cgroup *objcg;
>
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> -       VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> -                       !folio_test_hugetlb(folio) &&
> -                       !list_empty(&folio->_deferred_list), folio);
>
>         /*
>          * Nobody should be changing or seriously looking at
> @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>                         ug->nr_memory += nr_pages;
>                 ug->pgpgout++;
>
> +               WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>                 folio->memcg_data = 0;
>         }
>
> @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>         /* Transfer the charge and the css ref */
>         commit_charge(new, memcg);
> +
> +       /* Warning should never happen, so don't worry about refcount non-0 */
> +       WARN_ON_ONCE(folio_unqueue_deferred_split(old));
>         old->memcg_data = 0;
>  }
>
> @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
>         VM_BUG_ON_FOLIO(oldid, folio);
>         mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
>
> +       folio_unqueue_deferred_split(folio);
>         folio->memcg_data = 0;
>
>         if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..691f25ee2489 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>                     folio_test_large_rmappable(folio)) {
>                         if (!folio_ref_freeze(folio, expected_count))
>                                 return -EAGAIN;
> -                       folio_undo_large_rmappable(folio);
> +                       folio_unqueue_deferred_split(folio);
>                         folio_ref_unfreeze(folio, expected_count);
>                 }
>
> @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         }
>
>         /* Take off deferred split queue while frozen and memcg set */
> -       folio_undo_large_rmappable(folio);
> +       folio_unqueue_deferred_split(folio);
>
>         /*
>          * Now we know that no one else is looking at the folio:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4b21a368b4e2..57f64b5d0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
>                 unsigned long pfn = folio_pfn(folio);
>                 unsigned int order = folio_order(folio);
>
> -               folio_undo_large_rmappable(folio);
> +               if (mem_cgroup_disabled())
> +                       folio_unqueue_deferred_split(folio);

This looks confusing. It looks all callsites of free_unref_folios()
have folio_unqueue_deferred_split() and memcg uncharge called before
it. If there is any problem, memcg uncharge should catch it. Did I
miss something?

> +
>                 if (!free_pages_prepare(&folio->page, order))
>                         continue;
>                 /*
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..b8e3259ea2c4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
>         }
>
>         page_cache_release(folio);
> -       folio_undo_large_rmappable(folio);
> +       folio_unqueue_deferred_split(folio);
>         mem_cgroup_uncharge(folio);
>         free_unref_page(&folio->page, folio_order(folio));
>  }
> @@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>                         free_huge_folio(folio);
>                         continue;
>                 }
> -               folio_undo_large_rmappable(folio);
> +               folio_unqueue_deferred_split(folio);
>                 __page_cache_release(folio, &lruvec, &flags);
>
>                 if (j != i)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb4e8440c507..635d45745b73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                  */
>                 nr_reclaimed += nr_pages;
>
> -               folio_undo_large_rmappable(folio);
> +               folio_unqueue_deferred_split(folio);
>                 if (folio_batch_add(&free_folios, folio) == 0) {
>                         mem_cgroup_uncharge_folios(&free_folios);
>                         try_to_unmap_flush();
> @@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>                 if (unlikely(folio_put_testzero(folio))) {
>                         __folio_clear_lru_flags(folio);
>
> -                       folio_undo_large_rmappable(folio);
> +                       folio_unqueue_deferred_split(folio);
>                         if (folio_batch_add(&free_folios, folio) == 0) {
>                                 spin_unlock_irq(&lruvec->lru_lock);
>                                 mem_cgroup_uncharge_folios(&free_folios);
> --
> 2.35.3
David Hildenbrand Oct. 24, 2024, 8:52 p.m. UTC | #2
On 24.10.24 06:13, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> Before fixing locking: rename misleading folio_undo_large_rmappable(),
> which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> which is what it does.  

Yes please. I stumbled into that myself recently -- leftover from 
previous rework.

It would have been reasonable to move that into a separate (follow-up?) 
patch.

> But that and its out-of-line __callee are mm
> internals of very limited usability: add comment and WARN_ON_ONCEs to
> check usage; and return a bool to say if a deferred split was unqueued,
> which can then be used in WARN_ON_ONCEs around safety checks (sparing
> callers the arcane conditionals in __folio_unqueue_deferred_split()).
> 
> Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> without checking and unqueueing a THP folio from deferred split list;
> which is unfortunate, since the split_queue_lock depends on the memcg
> (when memcg is enabled); so swapout has been unqueueing such THPs later,
> when freeing the folio, using the pgdat's lock instead: potentially
> corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> 0 here, so no problem with calling folio_unqueue_deferred_split() before
> resetting memcg_data.
> 
> That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> shrinker memcg aware"): which included a check on swapcache before adding
> to deferred queue (which can now be removed), but no check on deferred
> queue before adding THP to swapcache (maybe other circumstances prevented
> it at that time, but not now).
> 
> Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> folio->memcg_data without checking and unqueueing a THP folio from the
> deferred list, sometimes corrupting "from" memcg's list, like swapout.
> Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> of course unqueues), or skip it if that fails.  Not ideal, but moving
> charge has been requested, and khugepaged should repair the THP later:
> nobody wants new custom unqueueing code just for this deprecated case.
> 
> The 87eaceb3faa5 commit did have the code to move from one deferred list
> to another (but was not conscious of its unsafety while refcount non-0);
> but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> care deferred split queue in memcg charge move path"), which argued that
> the existence of a PMD mapping guarantees that the THP cannot be on a
> deferred list.  

I recall this can happen, not sure on 5.6 though: assume we have an anon 
THP with 1 PMD mapping and a single PTE mapping for simplicity.

Assume we want to migrate that folio and first remove the PMD mapping, 
then the PTE mapping. After removing the PMD mapping, we add it to the 
deferred split queue (single PTE mapped).

Now assume migration fails and we remove migration entries -> remap.

We now have a PMD-mapped THP on the deferred split queue.

(again, I might be wrong but that's from memory without digging into the 
code)


> I'm not sure if that was true at the time (swapcache
> remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> ("mm: split underused THPs").

We only remap PTEs from the swapcache, never PMDs.

> 
> [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> large but not PMD-mapped: that's safe, but perhaps not intended: it's
> arguable whether the deprecated feature should be updated to work
> better with the new feature; but certainly not in this patch.]
> 
> Backport to 6.11 should be straightforward. Earlier backports must take
> care that other _deferred_list fixes and dependencies are included.  It
> is unclear whether these fixes are realistically needed before 6.12.
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>   mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
>   mm/internal.h      | 10 +++++-----
>   mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
>   mm/memcontrol.c    |  8 +++++---
>   mm/migrate.c       |  4 ++--
>   mm/page_alloc.c    |  4 +++-
>   mm/swap.c          |  4 ++--
>   mm/vmscan.c        |  4 ++--
>   8 files changed, 65 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a1d345f1680c..dc7d5bb76495 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>   	return split_huge_page_to_list_to_order(&folio->page, list, ret);
>   }
>   
> -void __folio_undo_large_rmappable(struct folio *folio)
> +/*
> + * __folio_unqueue_deferred_split() is not to be called directly:
> + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> + * limits its calls to those folios which may have a _deferred_list for
> + * queueing THP splits, and that list is (racily observed to be) non-empty.
> + *
> + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> + *
> + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> + * therefore important to unqueue deferred split before changing folio memcg.
> + */
> +bool __folio_unqueue_deferred_split(struct folio *folio)
>   {
>   	struct deferred_split *ds_queue;
>   	unsigned long flags;
> +	bool unqueued = false;
> +
> +	WARN_ON_ONCE(folio_ref_count(folio));
> +	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
>   
>   	ds_queue = get_deferred_split_queue(folio);
>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
>   				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>   		}
>   		list_del_init(&folio->_deferred_list);
> +		unqueued = true;
>   	}
>   	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +	return unqueued;	/* useful for debug warnings */
>   }
>   
>   /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>   	if (!partially_mapped && !split_underused_thp)
>   		return;
>   
> -	/*
> -	 * The try_to_unmap() in page reclaim path might reach here too,
> -	 * this may cause a race condition to corrupt deferred split queue.
> -	 * And, if page reclaim is already handling the same folio, it is
> -	 * unnecessary to handle it again in shrinker.
> -	 *
> -	 * Check the swapcache flag to determine if the folio is being
> -	 * handled by page reclaim since THP swap would add the folio into
> -	 * swap cache before calling try_to_unmap().
> -	 */
> -	if (folio_test_swapcache(folio))
> -		return;
> -
>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>   	if (partially_mapped) {
>   		if (!folio_test_partially_mapped(folio)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index 93083bbeeefa..16c1f3cd599e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
>   #endif
>   }
>   
> -void __folio_undo_large_rmappable(struct folio *folio);
> -static inline void folio_undo_large_rmappable(struct folio *folio)
> +bool __folio_unqueue_deferred_split(struct folio *folio);
> +static inline bool folio_unqueue_deferred_split(struct folio *folio)
>   {
>   	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))

The rmappable check here is still confusing for me. I assume we want to 
exclude hugetlb or others that reuse the field for other purposes ...

> -		return;
> +		return false;
>   
>   	/*
>   	 * At this point, there is no one trying to add the folio to
> @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
>   	 * to check without acquiring the split_queue_lock.
>   	 */
>   	if (data_race(list_empty(&folio->_deferred_list)))
> -		return;
> +		return false;
>   
> -	__folio_undo_large_rmappable(folio);
> +	return __folio_unqueue_deferred_split(folio);
>   }
>   
>   static inline struct folio *page_rmappable_folio(struct page *page)
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 81d8819f13cd..f8744f5630bb 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
>   	css_get(&to->css);
>   	css_put(&from->css);
>   
> +	/* Warning should never happen, so don't worry about refcount non-0 */
> +	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>   	folio->memcg_data = (unsigned long)to;
>   
>   	__folio_memcg_unlock(from);
> @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   	enum mc_target_type target_type;
>   	union mc_target target;
>   	struct folio *folio;
> +	bool tried_split_before = false;
>   
> +retry_pmd:
>   	ptl = pmd_trans_huge_lock(pmd, vma);
>   	if (ptl) {
>   		if (mc.precharge < HPAGE_PMD_NR) {
> @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
>   		if (target_type == MC_TARGET_PAGE) {
>   			folio = target.folio;
> +			/*
> +			 * Deferred split queue locking depends on memcg,
> +			 * and unqueue is unsafe unless folio refcount is 0:
> +			 * split or skip if on the queue? first try to split.
> +			 */
> +			if (!list_empty(&folio->_deferred_list)) {
> +				spin_unlock(ptl);
> +				if (!tried_split_before)
> +					split_folio(folio);
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				if (tried_split_before)
> +					return 0;
> +				tried_split_before = true;
> +				goto retry_pmd;
> +			}
> +			/*
> +			 * So long as that pmd lock is held, the folio cannot
> +			 * be racily added to the _deferred_list, because
> +			 * __folio_remove_rmap() will find !partially_mapped.
> +			 */
>   			if (folio_isolate_lru(folio)) {
>   				if (!mem_cgroup_move_account(folio, true,
>   							     mc.from, mc.to)) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2703227cce88..06df2af97415 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>   	struct obj_cgroup *objcg;
>   
>   	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> -	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> -			!folio_test_hugetlb(folio) &&
> -			!list_empty(&folio->_deferred_list), folio);
>   
>   	/*
>   	 * Nobody should be changing or seriously looking at
> @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>   			ug->nr_memory += nr_pages;
>   		ug->pgpgout++;
>   
> +		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>   		folio->memcg_data = 0;
>   	}
>   
> @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>   
>   	/* Transfer the charge and the css ref */
>   	commit_charge(new, memcg);
> +
> +	/* Warning should never happen, so don't worry about refcount non-0 */
> +	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
>   	old->memcg_data = 0;
>   }
>   
> @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
>   	VM_BUG_ON_FOLIO(oldid, folio);
>   	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
>   
> +	folio_unqueue_deferred_split(folio);
>   	folio->memcg_data = 0;
>   
>   	if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..691f25ee2489 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   		    folio_test_large_rmappable(folio)) {
>   			if (!folio_ref_freeze(folio, expected_count))
>   				return -EAGAIN;
> -			folio_undo_large_rmappable(folio);
> +			folio_unqueue_deferred_split(folio);
>   			folio_ref_unfreeze(folio, expected_count);
>   		}
>   
> @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   	}
>   
>   	/* Take off deferred split queue while frozen and memcg set */
> -	folio_undo_large_rmappable(folio);
> +	folio_unqueue_deferred_split(folio);
>   
>   	/*
>   	 * Now we know that no one else is looking at the folio:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4b21a368b4e2..57f64b5d0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
>   		unsigned long pfn = folio_pfn(folio);
>   		unsigned int order = folio_order(folio);
>   
> -		folio_undo_large_rmappable(folio);
> +		if (mem_cgroup_disabled())
> +			folio_unqueue_deferred_split(folio);
> +

Is it worth adding a comment here where we take care of ths with memcg 
enabled?

It's complicated stuff, nothing jumped at me, but it's late here so my 
brain is not fully functional ...
Yang Shi Oct. 25, 2024, 1:21 a.m. UTC | #3
On Thu, Oct 24, 2024 at 1:00 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.  But that and its out-of-line __callee are mm
> > internals of very limited usability: add comment and WARN_ON_ONCEs to
> > check usage; and return a bool to say if a deferred split was unqueued,
> > which can then be used in WARN_ON_ONCEs around safety checks (sparing
> > callers the arcane conditionals in __folio_unqueue_deferred_split()).
> >
> > Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> > without checking and unqueueing a THP folio from deferred split list;
> > which is unfortunate, since the split_queue_lock depends on the memcg
> > (when memcg is enabled); so swapout has been unqueueing such THPs later,
> > when freeing the folio, using the pgdat's lock instead: potentially
> > corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> > 0 here, so no problem with calling folio_unqueue_deferred_split() before
> > resetting memcg_data.
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
>
> If I remember correctly, THP just can be added to deferred list when
> there is no PMD map before mTHP swapout, so shrink_page_list() did
> check THP's compound_mapcount (called _entire_mapcount now) before
> adding it to swap cache.

But as David pointed out there may be PMD mapped THP on deferred list.
We should just not hit it by luck. The migration failure may be very
rare because migration may split the THP or retry a few times, and
migration is called with holding extra refcount and having THP off
lru.

>
> Now the code just checked whether the large folio is on deferred list or not.
>
> >
> > Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> > folio->memcg_data without checking and unqueueing a THP folio from the
> > deferred list, sometimes corrupting "from" memcg's list, like swapout.
> > Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> > used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> > mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> > of course unqueues), or skip it if that fails.  Not ideal, but moving
> > charge has been requested, and khugepaged should repair the THP later:
> > nobody wants new custom unqueueing code just for this deprecated case.
> >
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.  I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> Same reason as above.
>
> >
> > [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> > large but not PMD-mapped: that's safe, but perhaps not intended: it's
> > arguable whether the deprecated feature should be updated to work
> > better with the new feature; but certainly not in this patch.]
> >
> > Backport to 6.11 should be straightforward. Earlier backports must take
> > care that other _deferred_list fixes and dependencies are included.  It
> > is unclear whether these fixes are realistically needed before 6.12.
> >
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
> >  mm/internal.h      | 10 +++++-----
> >  mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
> >  mm/memcontrol.c    |  8 +++++---
> >  mm/migrate.c       |  4 ++--
> >  mm/page_alloc.c    |  4 +++-
> >  mm/swap.c          |  4 ++--
> >  mm/vmscan.c        |  4 ++--
> >  8 files changed, 65 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a1d345f1680c..dc7d5bb76495 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
> >         return split_huge_page_to_list_to_order(&folio->page, list, ret);
> >  }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio)
> > +/*
> > + * __folio_unqueue_deferred_split() is not to be called directly:
> > + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> > + * limits its calls to those folios which may have a _deferred_list for
> > + * queueing THP splits, and that list is (racily observed to be) non-empty.
> > + *
> > + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> > + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> > + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> > + *
> > + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> > + * therefore important to unqueue deferred split before changing folio memcg.
> > + */
> > +bool __folio_unqueue_deferred_split(struct folio *folio)
> >  {
> >         struct deferred_split *ds_queue;
> >         unsigned long flags;
> > +       bool unqueued = false;
> > +
> > +       WARN_ON_ONCE(folio_ref_count(folio));
> > +       WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> >
> >         ds_queue = get_deferred_split_queue(folio);
> >         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >                                       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> >                 }
> >                 list_del_init(&folio->_deferred_list);
> > +               unqueued = true;
> >         }
> >         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > +
> > +       return unqueued;        /* useful for debug warnings */
> >  }
> >
> >  /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >         if (!partially_mapped && !split_underused_thp)
> >                 return;
> >
> > -       /*
> > -        * The try_to_unmap() in page reclaim path might reach here too,
> > -        * this may cause a race condition to corrupt deferred split queue.
> > -        * And, if page reclaim is already handling the same folio, it is
> > -        * unnecessary to handle it again in shrinker.
> > -        *
> > -        * Check the swapcache flag to determine if the folio is being
> > -        * handled by page reclaim since THP swap would add the folio into
> > -        * swap cache before calling try_to_unmap().
> > -        */
> > -       if (folio_test_swapcache(folio))
> > -               return;
> > -
> >         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >         if (partially_mapped) {
> >                 if (!folio_test_partially_mapped(folio)) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> >  #endif
> >  }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >  {
> >         if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> > -               return;
> > +               return false;
> >
> >         /*
> >          * At this point, there is no one trying to add the folio to
> > @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
> >          * to check without acquiring the split_queue_lock.
> >          */
> >         if (data_race(list_empty(&folio->_deferred_list)))
> > -               return;
> > +               return false;
> >
> > -       __folio_undo_large_rmappable(folio);
> > +       return __folio_unqueue_deferred_split(folio);
> >  }
> >
> >  static inline struct folio *page_rmappable_folio(struct page *page)
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 81d8819f13cd..f8744f5630bb 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> >         css_get(&to->css);
> >         css_put(&from->css);
> >
> > +       /* Warning should never happen, so don't worry about refcount non-0 */
> > +       WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >         folio->memcg_data = (unsigned long)to;
> >
> >         __folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >         enum mc_target_type target_type;
> >         union mc_target target;
> >         struct folio *folio;
> > +       bool tried_split_before = false;
> >
> > +retry_pmd:
> >         ptl = pmd_trans_huge_lock(pmd, vma);
> >         if (ptl) {
> >                 if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >                 target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> >                 if (target_type == MC_TARGET_PAGE) {
> >                         folio = target.folio;
> > +                       /*
> > +                        * Deferred split queue locking depends on memcg,
> > +                        * and unqueue is unsafe unless folio refcount is 0:
> > +                        * split or skip if on the queue? first try to split.
> > +                        */
> > +                       if (!list_empty(&folio->_deferred_list)) {
> > +                               spin_unlock(ptl);
> > +                               if (!tried_split_before)
> > +                                       split_folio(folio);
> > +                               folio_unlock(folio);
> > +                               folio_put(folio);
> > +                               if (tried_split_before)
> > +                                       return 0;
> > +                               tried_split_before = true;
> > +                               goto retry_pmd;
> > +                       }
> > +                       /*
> > +                        * So long as that pmd lock is held, the folio cannot
> > +                        * be racily added to the _deferred_list, because
> > +                        * __folio_remove_rmap() will find !partially_mapped.
> > +                        */
> >                         if (folio_isolate_lru(folio)) {
> >                                 if (!mem_cgroup_move_account(folio, true,
> >                                                              mc.from, mc.to)) {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2703227cce88..06df2af97415 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >         struct obj_cgroup *objcg;
> >
> >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > -       VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > -                       !folio_test_hugetlb(folio) &&
> > -                       !list_empty(&folio->_deferred_list), folio);
> >
> >         /*
> >          * Nobody should be changing or seriously looking at
> > @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >                         ug->nr_memory += nr_pages;
> >                 ug->pgpgout++;
> >
> > +               WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >                 folio->memcg_data = 0;
> >         }
> >
> > @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >
> >         /* Transfer the charge and the css ref */
> >         commit_charge(new, memcg);
> > +
> > +       /* Warning should never happen, so don't worry about refcount non-0 */
> > +       WARN_ON_ONCE(folio_unqueue_deferred_split(old));
> >         old->memcg_data = 0;
> >  }
> >
> > @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
> >         VM_BUG_ON_FOLIO(oldid, folio);
> >         mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
> >
> > +       folio_unqueue_deferred_split(folio);
> >         folio->memcg_data = 0;
> >
> >         if (!mem_cgroup_is_root(memcg))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..691f25ee2489 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >                     folio_test_large_rmappable(folio)) {
> >                         if (!folio_ref_freeze(folio, expected_count))
> >                                 return -EAGAIN;
> > -                       folio_undo_large_rmappable(folio);
> > +                       folio_unqueue_deferred_split(folio);
> >                         folio_ref_unfreeze(folio, expected_count);
> >                 }
> >
> > @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >         }
> >
> >         /* Take off deferred split queue while frozen and memcg set */
> > -       folio_undo_large_rmappable(folio);
> > +       folio_unqueue_deferred_split(folio);
> >
> >         /*
> >          * Now we know that no one else is looking at the folio:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >                 unsigned long pfn = folio_pfn(folio);
> >                 unsigned int order = folio_order(folio);
> >
> > -               folio_undo_large_rmappable(folio);
> > +               if (mem_cgroup_disabled())
> > +                       folio_unqueue_deferred_split(folio);
>
> This looks confusing. It looks all callsites of free_unref_folios()
> have folio_unqueue_deferred_split() and memcg uncharge called before
> it. If there is any problem, memcg uncharge should catch it. Did I
> miss something?
>
> > +
> >                 if (!free_pages_prepare(&folio->page, order))
> >                         continue;
> >                 /*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 835bdf324b76..b8e3259ea2c4 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
> >         }
> >
> >         page_cache_release(folio);
> > -       folio_undo_large_rmappable(folio);
> > +       folio_unqueue_deferred_split(folio);
> >         mem_cgroup_uncharge(folio);
> >         free_unref_page(&folio->page, folio_order(folio));
> >  }
> > @@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> >                         free_huge_folio(folio);
> >                         continue;
> >                 }
> > -               folio_undo_large_rmappable(folio);
> > +               folio_unqueue_deferred_split(folio);
> >                 __page_cache_release(folio, &lruvec, &flags);
> >
> >                 if (j != i)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eb4e8440c507..635d45745b73 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                  */
> >                 nr_reclaimed += nr_pages;
> >
> > -               folio_undo_large_rmappable(folio);
> > +               folio_unqueue_deferred_split(folio);
> >                 if (folio_batch_add(&free_folios, folio) == 0) {
> >                         mem_cgroup_uncharge_folios(&free_folios);
> >                         try_to_unmap_flush();
> > @@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> >                 if (unlikely(folio_put_testzero(folio))) {
> >                         __folio_clear_lru_flags(folio);
> >
> > -                       folio_undo_large_rmappable(folio);
> > +                       folio_unqueue_deferred_split(folio);
> >                         if (folio_batch_add(&free_folios, folio) == 0) {
> >                                 spin_unlock_irq(&lruvec->lru_lock);
> >                                 mem_cgroup_uncharge_folios(&free_folios);
> > --
> > 2.35.3
Yang Shi Oct. 25, 2024, 1:25 a.m. UTC | #4
On Thu, Oct 24, 2024 at 1:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.
>
> Yes please. I stumbled into that myself recently -- leftover from
> previous rework.
>
> It would have been reasonable to move that into a separate (follow-up?)
> patch.
>
> > But that and its out-of-line __callee are mm
> > internals of very limited usability: add comment and WARN_ON_ONCEs to
> > check usage; and return a bool to say if a deferred split was unqueued,
> > which can then be used in WARN_ON_ONCEs around safety checks (sparing
> > callers the arcane conditionals in __folio_unqueue_deferred_split()).
> >
> > Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> > without checking and unqueueing a THP folio from deferred split list;
> > which is unfortunate, since the split_queue_lock depends on the memcg
> > (when memcg is enabled); so swapout has been unqueueing such THPs later,
> > when freeing the folio, using the pgdat's lock instead: potentially
> > corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> > 0 here, so no problem with calling folio_unqueue_deferred_split() before
> > resetting memcg_data.
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
> >
> > Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> > folio->memcg_data without checking and unqueueing a THP folio from the
> > deferred list, sometimes corrupting "from" memcg's list, like swapout.
> > Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> > used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> > mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> > of course unqueues), or skip it if that fails.  Not ideal, but moving
> > charge has been requested, and khugepaged should repair the THP later:
> > nobody wants new custom unqueueing code just for this deprecated case.
> >
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.
>
> I recall this can happen, not sure on 5.6 though: assume we have an anon
> THP with 1 PMD mapping and a single PTE mapping for simplicity.
>
> Assume we want to migrate that folio and first remove the PMD mapping,
> then the PTE mapping. After removing the PMD mapping, we add it to the
> deferred split queue (single PTE mapped).
>
> Now assume migration fails and we remove migration entries -> remap.
>
> We now have a PMD-mapped THP on the deferred split queue.
>
> (again, I might be wrong but that's from memory without digging into the
> code)

It sounds possible to me on 5.6 too. I didn't see
remove_migration_ptes() remove PMD-mapped THP from deferred list. As I
said in earlier email, this case should be not common and we were luck
not to hit it.

>
>
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> We only remap PTEs from the swapcache, never PMDs.
>
> >
> > [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> > large but not PMD-mapped: that's safe, but perhaps not intended: it's
> > arguable whether the deprecated feature should be updated to work
> > better with the new feature; but certainly not in this patch.]
> >
> > Backport to 6.11 should be straightforward. Earlier backports must take
> > care that other _deferred_list fixes and dependencies are included.  It
> > is unclear whether these fixes are realistically needed before 6.12.
> >
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
> >   mm/internal.h      | 10 +++++-----
> >   mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
> >   mm/memcontrol.c    |  8 +++++---
> >   mm/migrate.c       |  4 ++--
> >   mm/page_alloc.c    |  4 +++-
> >   mm/swap.c          |  4 ++--
> >   mm/vmscan.c        |  4 ++--
> >   8 files changed, 65 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a1d345f1680c..dc7d5bb76495 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
> >       return split_huge_page_to_list_to_order(&folio->page, list, ret);
> >   }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio)
> > +/*
> > + * __folio_unqueue_deferred_split() is not to be called directly:
> > + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> > + * limits its calls to those folios which may have a _deferred_list for
> > + * queueing THP splits, and that list is (racily observed to be) non-empty.
> > + *
> > + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> > + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> > + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> > + *
> > + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> > + * therefore important to unqueue deferred split before changing folio memcg.
> > + */
> > +bool __folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >       struct deferred_split *ds_queue;
> >       unsigned long flags;
> > +     bool unqueued = false;
> > +
> > +     WARN_ON_ONCE(folio_ref_count(folio));
> > +     WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> >
> >       ds_queue = get_deferred_split_queue(folio);
> >       spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >                                     MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> >               }
> >               list_del_init(&folio->_deferred_list);
> > +             unqueued = true;
> >       }
> >       spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > +
> > +     return unqueued;        /* useful for debug warnings */
> >   }
> >
> >   /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >       if (!partially_mapped && !split_underused_thp)
> >               return;
> >
> > -     /*
> > -      * The try_to_unmap() in page reclaim path might reach here too,
> > -      * this may cause a race condition to corrupt deferred split queue.
> > -      * And, if page reclaim is already handling the same folio, it is
> > -      * unnecessary to handle it again in shrinker.
> > -      *
> > -      * Check the swapcache flag to determine if the folio is being
> > -      * handled by page reclaim since THP swap would add the folio into
> > -      * swap cache before calling try_to_unmap().
> > -      */
> > -     if (folio_test_swapcache(folio))
> > -             return;
> > -
> >       spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >       if (partially_mapped) {
> >               if (!folio_test_partially_mapped(folio)) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> >   #endif
> >   }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >       if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
>
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...
>
> > -             return;
> > +             return false;
> >
> >       /*
> >        * At this point, there is no one trying to add the folio to
> > @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
> >        * to check without acquiring the split_queue_lock.
> >        */
> >       if (data_race(list_empty(&folio->_deferred_list)))
> > -             return;
> > +             return false;
> >
> > -     __folio_undo_large_rmappable(folio);
> > +     return __folio_unqueue_deferred_split(folio);
> >   }
> >
> >   static inline struct folio *page_rmappable_folio(struct page *page)
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 81d8819f13cd..f8744f5630bb 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> >       css_get(&to->css);
> >       css_put(&from->css);
> >
> > +     /* Warning should never happen, so don't worry about refcount non-0 */
> > +     WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >       folio->memcg_data = (unsigned long)to;
> >
> >       __folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >       enum mc_target_type target_type;
> >       union mc_target target;
> >       struct folio *folio;
> > +     bool tried_split_before = false;
> >
> > +retry_pmd:
> >       ptl = pmd_trans_huge_lock(pmd, vma);
> >       if (ptl) {
> >               if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >               target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> >               if (target_type == MC_TARGET_PAGE) {
> >                       folio = target.folio;
> > +                     /*
> > +                      * Deferred split queue locking depends on memcg,
> > +                      * and unqueue is unsafe unless folio refcount is 0:
> > +                      * split or skip if on the queue? first try to split.
> > +                      */
> > +                     if (!list_empty(&folio->_deferred_list)) {
> > +                             spin_unlock(ptl);
> > +                             if (!tried_split_before)
> > +                                     split_folio(folio);
> > +                             folio_unlock(folio);
> > +                             folio_put(folio);
> > +                             if (tried_split_before)
> > +                                     return 0;
> > +                             tried_split_before = true;
> > +                             goto retry_pmd;
> > +                     }
> > +                     /*
> > +                      * So long as that pmd lock is held, the folio cannot
> > +                      * be racily added to the _deferred_list, because
> > +                      * __folio_remove_rmap() will find !partially_mapped.
> > +                      */
> >                       if (folio_isolate_lru(folio)) {
> >                               if (!mem_cgroup_move_account(folio, true,
> >                                                            mc.from, mc.to)) {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2703227cce88..06df2af97415 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >       struct obj_cgroup *objcg;
> >
> >       VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > -     VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > -                     !folio_test_hugetlb(folio) &&
> > -                     !list_empty(&folio->_deferred_list), folio);
> >
> >       /*
> >        * Nobody should be changing or seriously looking at
> > @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >                       ug->nr_memory += nr_pages;
> >               ug->pgpgout++;
> >
> > +             WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >               folio->memcg_data = 0;
> >       }
> >
> > @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >
> >       /* Transfer the charge and the css ref */
> >       commit_charge(new, memcg);
> > +
> > +     /* Warning should never happen, so don't worry about refcount non-0 */
> > +     WARN_ON_ONCE(folio_unqueue_deferred_split(old));
> >       old->memcg_data = 0;
> >   }
> >
> > @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
> >       VM_BUG_ON_FOLIO(oldid, folio);
> >       mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
> >
> > +     folio_unqueue_deferred_split(folio);
> >       folio->memcg_data = 0;
> >
> >       if (!mem_cgroup_is_root(memcg))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..691f25ee2489 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >                   folio_test_large_rmappable(folio)) {
> >                       if (!folio_ref_freeze(folio, expected_count))
> >                               return -EAGAIN;
> > -                     folio_undo_large_rmappable(folio);
> > +                     folio_unqueue_deferred_split(folio);
> >                       folio_ref_unfreeze(folio, expected_count);
> >               }
> >
> > @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >       }
> >
> >       /* Take off deferred split queue while frozen and memcg set */
> > -     folio_undo_large_rmappable(folio);
> > +     folio_unqueue_deferred_split(folio);
> >
> >       /*
> >        * Now we know that no one else is looking at the folio:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >               unsigned long pfn = folio_pfn(folio);
> >               unsigned int order = folio_order(folio);
> >
> > -             folio_undo_large_rmappable(folio);
> > +             if (mem_cgroup_disabled())
> > +                     folio_unqueue_deferred_split(folio);
> > +
>
> Is it worth adding a comment here where we take care of ths with memcg
> enabled?
>
> It's complicated stuff, nothing jumped at me, but it's late here so my
> brain is not fully functional ...
>
> --
> Cheers,
>
> David / dhildenb
>
Hugh Dickins Oct. 25, 2024, 6:57 a.m. UTC | #5
On Thu, 24 Oct 2024, Yang Shi wrote:
> On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
> 
> If I remember correctly, THP just can be added to deferred list when
> there is no PMD map before mTHP swapout, so shrink_page_list() did
> check THP's compound_mapcount (called _entire_mapcount now) before
> adding it to swap cache.
> 
> Now the code just checked whether the large folio is on deferred list or not.

I've continued to find it hard to think about, hard to be convinced by
that sequence of checks, without an actual explicit _deferred_list check.

David has brilliantly come up with the failed THP migration example;
and I think now perhaps 5.8's 5503fbf2b0b8 ("khugepaged: allow to
collapse PTE-mapped compound pages") introduced another way?

But I certainly need to reword that wagging finger pointing to your
commit: these are much more exceptional cases than I was thinking there.

I have this evening tried running swapping load on 5.10 and 6.6 and 6.11,
each with just a BUG_ON(!list_empty(the deferred list)) before resetting
memcg in mem_cgroup_swapout() - it would of course be much easier to hit
such a BUG_ON() than for the consequent wrong locking to be so unlucky
as to actually result in list corruption.

None of those BUG_ONs hit; though I was only running each for 1.5 hour,
and looking at vmstats at the end, see the were really not exercising
deferred split very much at all.  I'd been hoping for an immediate hit
(as on 6.12-rc) to confirm my doubt, but no.  That doesn't *prove* you're
right, but (excepting David's and my weird cases) I bet you are right.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >                 unsigned long pfn = folio_pfn(folio);
> >                 unsigned int order = folio_order(folio);
> >
> > -               folio_undo_large_rmappable(folio);
> > +               if (mem_cgroup_disabled())
> > +                       folio_unqueue_deferred_split(folio);
> 
> This looks confusing. It looks all callsites of free_unref_folios()
> have folio_unqueue_deferred_split() and memcg uncharge called before
> it. If there is any problem, memcg uncharge should catch it. Did I
> miss something?

I don't understand what you're suggesting there.  But David remarked
on it too, so it seems that I do need at least to add some comment.

I'd better re-examine the memcg/non-memcg forking paths again: but by
strange coincidence (or suggestion?), I'm suddenly now too tired here,
precisely where David stopped too.  I'll have to come back to this
tomorrow, sorry.

Hugh
Yang Shi Oct. 25, 2024, 4:34 p.m. UTC | #6
On Thu, Oct 24, 2024 at 11:57 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 24 Oct 2024, Yang Shi wrote:
> > On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > > shrinker memcg aware"): which included a check on swapcache before adding
> > > to deferred queue (which can now be removed), but no check on deferred
> > > queue before adding THP to swapcache (maybe other circumstances prevented
> > > it at that time, but not now).
> >
> > If I remember correctly, THP just can be added to deferred list when
> > there is no PMD map before mTHP swapout, so shrink_page_list() did
> > check THP's compound_mapcount (called _entire_mapcount now) before
> > adding it to swap cache.
> >
> > Now the code just checked whether the large folio is on deferred list or not.
>
> I've continued to find it hard to think about, hard to be convinced by
> that sequence of checks, without an actual explicit _deferred_list check.

You meant the swap cache check? I was trying to recall the reason. If
I remember correctly (sorry, memory is still vague), if the THP was
PMD-mapped and PTE-mapped by two processes, the THP may be added to
swap cache since just compound_mapcount was checked. Then
try_to_unmap() in shrink_page_list() may add it to deferred list if
PMD mapping was unmapped first. The potential list corruption fixed by
you now may be triggered.

But this was based on the assumption that there can't be PMD-mapped
THP on deferred list. If this happens (as David's migration fail
example), the swap cache check should be not enough. This case was
overlooked.

>
> David has brilliantly come up with the failed THP migration example;
> and I think now perhaps 5.8's 5503fbf2b0b8 ("khugepaged: allow to
> collapse PTE-mapped compound pages") introduced another way?
>
> But I certainly need to reword that wagging finger pointing to your
> commit: these are much more exceptional cases than I was thinking there.
>
> I have this evening tried running swapping load on 5.10 and 6.6 and 6.11,
> each with just a BUG_ON(!list_empty(the deferred list)) before resetting
> memcg in mem_cgroup_swapout() - it would of course be much easier to hit
> such a BUG_ON() than for the consequent wrong locking to be so unlucky
> as to actually result in list corruption.
>
> None of those BUG_ONs hit; though I was only running each for 1.5 hour,
> and looking at vmstats at the end, see the were really not exercising
> deferred split very much at all.  I'd been hoping for an immediate hit
> (as on 6.12-rc) to confirm my doubt, but no.  That doesn't *prove* you're
> right, but (excepting David's and my weird cases) I bet you are right.

Maybe it just happened rarely and was hard to hit. But still
theoretically possible. Your fix is more reliable.

>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 4b21a368b4e2..57f64b5d0004 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > >                 unsigned long pfn = folio_pfn(folio);
> > >                 unsigned int order = folio_order(folio);
> > >
> > > -               folio_undo_large_rmappable(folio);
> > > +               if (mem_cgroup_disabled())
> > > +                       folio_unqueue_deferred_split(folio);
> >
> > This looks confusing. It looks all callsites of free_unref_folios()
> > have folio_unqueue_deferred_split() and memcg uncharge called before
> > it. If there is any problem, memcg uncharge should catch it. Did I
> > miss something?
>
> I don't understand what you're suggesting there.  But David remarked
> on it too, so it seems that I do need at least to add some comment.
>
> I'd better re-examine the memcg/non-memcg forking paths again: but by
> strange coincidence (or suggestion?), I'm suddenly now too tired here,
> precisely where David stopped too.  I'll have to come back to this
> tomorrow, sorry.

I perhaps misunderstood this code. Just feel free to correct me if it
doesn't make sense to you. But, yes, some comments are definitely
welcome and helpful for understanding the code and review.

>
> Hugh
Hugh Dickins Oct. 27, 2024, 5:35 a.m. UTC | #7
On Fri, 25 Oct 2024, Yang Shi wrote:
> On Thu, Oct 24, 2024 at 11:57 PM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 24 Oct 2024, Yang Shi wrote:
> > > On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > > >                 unsigned long pfn = folio_pfn(folio);
> > > >                 unsigned int order = folio_order(folio);
> > > >
> > > > -               folio_undo_large_rmappable(folio);
> > > > +               if (mem_cgroup_disabled())
> > > > +                       folio_unqueue_deferred_split(folio);
> > >
> > > This looks confusing. It looks all callsites of free_unref_folios()
> > > have folio_unqueue_deferred_split() and memcg uncharge called before
> > > it. If there is any problem, memcg uncharge should catch it. Did I
> > > miss something?
> >
> > I don't understand what you're suggesting there.  But David remarked
> > on it too, so it seems that I do need at least to add some comment.
> >
> > I'd better re-examine the memcg/non-memcg forking paths again: but by
> > strange coincidence (or suggestion?), I'm suddenly now too tired here,
> > precisely where David stopped too.  I'll have to come back to this
> > tomorrow, sorry.
> 
> I perhaps misunderstood this code. Just feel free to correct me if it
> doesn't make sense to you. But, yes, some comments are definitely
> welcome and helpful for understanding the code and review.

Thanks a lot for challenging that: it was me who misunderstood, not you.

I might just be inventing this excuse, but I think what happened was,
I'd been staring at an earlier release tree, and in that earlier tree
the prior unqueueing was tucked away inside a memcg function, but not
done in the #ifndef CONFIG_MEMCG stub: so I thought that this
folio_unqueue_deferred_split() in free_unref_folios() was needed just
to do it when mem_cgroup_disabled() (either by CONFIG or bootoption).

And I thought the "if (mem_cgroup_disabled())" was comment enough:
except it made no sense to you and David who saw what I was blind to
(and what you describe perfectly clearly above - it depresses me
sometimes, how I cannot even read what someone wrote, until I've
arrived at the same conclusion myself!).

If my story about !memcg stubs is true, then I think Matthew has
been cleaning all that up recently.  Except for put_pages_list()
(where I now see he wanted to insert a VM_BUG_ON(folio_memcg) in
April, but was forced to retreat): that one does not have a
folio_unqueue_deferred_split() in, but the good news is that
it no longer has any callers - I'll send a patch to delete it.

And instead of my misunderstood code above in free_unref_folios(),
just deleting the folio_undo_large_unmappable() line, with a
comment in the commit message.

Thanks!
Hugh
Hugh Dickins Oct. 27, 2024, 7:07 a.m. UTC | #8
On Thu, 24 Oct 2024, David Hildenbrand wrote:
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> > 
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.  
> 
> Yes please. I stumbled into that myself recently -- leftover from previous
> rework.
> 
> It would have been reasonable to move that into a separate (follow-up?) patch.

I was expecting someone to ask for a breakup: I plead, as the one who
will try some backports, to keep this all in one to make that job easier.

I doubt I'll go back further than 6.6 LTS.  Porting this patch is easy,
the hard part is gathering together the little bits and pieces it also
needs to be good (I'm thinking of folio migration freezing anon to 0
in particular, that will lead to others).

> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.  
> 
> I recall this can happen, not sure on 5.6 though: assume we have an anon THP
> with 1 PMD mapping and a single PTE mapping for simplicity.
> 
> Assume we want to migrate that folio and first remove the PMD mapping, then
> the PTE mapping. After removing the PMD mapping, we add it to the deferred
> split queue (single PTE mapped).
> 
> Now assume migration fails and we remove migration entries -> remap.
> 
> We now have a PMD-mapped THP on the deferred split queue.
> 
> (again, I might be wrong but that's from memory without digging into the code)

I believe you're right, thank you for doing the heavy thinking for me.
Then I came up with 5.8's 5503fbf2b0b8 ("khugepaged: allow to
collapse PTE-mapped compound pages") introducing another way.

It looks to me like there is a case for backporting this patch,
but not a very strong case: worth some effort, but not too much -
I've not heard of the predicted issues in practice.

> 
> 
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
> 
> We only remap PTEs from the swapcache, never PMDs.

Right, thanks.

> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool
> > partially_mapped)
> >    if (!partially_mapped && !split_underused_thp)
> >     return;
> >   -	/*
> > -	 * The try_to_unmap() in page reclaim path might reach here too,
> > -	 * this may cause a race condition to corrupt deferred split queue.
> > -	 * And, if page reclaim is already handling the same folio, it is
> > -	 * unnecessary to handle it again in shrinker.
> > -	 *
> > -	 * Check the swapcache flag to determine if the folio is being
> > -	 * handled by page reclaim since THP swap would add the folio into
> > -	 * swap cache before calling try_to_unmap().
> > -	 */
> > -	if (folio_test_swapcache(folio))
> > -		return;
> > -
> >    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >    if (partially_mapped) {
> >   		if (!folio_test_partially_mapped(folio)) {

I'm changing this part in v2: keeping the return on swapcache, updating
the comment to say it's no longer essential, but a likely optimization.

I noticed the stats for deferred split were very different when I made
this change, largely because I'm doing a lot of unrealistic swapping
no doubt, but I don't want to take the risk of changing behaviour
when all we're trying to fix is list corruption.

> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio
> > *folio, unsigned int order)
> >   #endif
> >   }
> >   
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >    if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> 
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...

Yes, I believe that is what it's about.  Somewhere else excludes hugetlb
explicitly instead.  I know Matthew would have liked to find a better name
than "large_rmappable" (aren't hugetlb folios large and rmappable?), but
nobody suggested a better.

You've reminded me to worry about how hugetlb pages get through
free_tail_page_prepare()'s case 2 check for list_empty(_deferred_list).
Ah, there's an INIT_LIST_HEAD(&folio->_deferred_list) in mm/hugetlb.c,
before freeing a hugetlb folio.  (The kind of detail which may be
different in backports.)

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >     unsigned long pfn = folio_pfn(folio);
> >     unsigned int order = folio_order(folio);
> >   -		folio_undo_large_rmappable(folio);
> > +		if (mem_cgroup_disabled())
> > +			folio_unqueue_deferred_split(folio);
> > +
> 
> Is it worth adding a comment here where we take care of ths with memcg
> enabled?

I've confessed to Yang Shi that I got this wrong: so in v2 there is
not even a line to add such a comment on (but comment in commit message).

> 
> It's complicated stuff, nothing jumped at me, but it's late here so my brain
> is not fully functional ...

Even your not fully functional brain is so much quicker than mine:
many thanks for applying it.

Hugh
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a1d345f1680c..dc7d5bb76495 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,10 +3588,27 @@  int split_folio_to_list(struct folio *folio, struct list_head *list)
 	return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }
 
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
+	bool unqueued = false;
+
+	WARN_ON_ONCE(folio_ref_count(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@  void __folio_undo_large_rmappable(struct folio *folio)
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
 		list_del_init(&folio->_deferred_list);
+		unqueued = true;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+	return unqueued;	/* useful for debug warnings */
 }
 
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3626,19 +3646,6 @@  void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (!partially_mapped && !split_underused_thp)
 		return;
 
-	/*
-	 * The try_to_unmap() in page reclaim path might reach here too,
-	 * this may cause a race condition to corrupt deferred split queue.
-	 * And, if page reclaim is already handling the same folio, it is
-	 * unnecessary to handle it again in shrinker.
-	 *
-	 * Check the swapcache flag to determine if the folio is being
-	 * handled by page reclaim since THP swap would add the folio into
-	 * swap cache before calling try_to_unmap().
-	 */
-	if (folio_test_swapcache(folio))
-		return;
-
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..16c1f3cd599e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -639,11 +639,11 @@  static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
 	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-		return;
+		return false;
 
 	/*
 	 * At this point, there is no one trying to add the folio to
@@ -651,9 +651,9 @@  static inline void folio_undo_large_rmappable(struct folio *folio)
 	 * to check without acquiring the split_queue_lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
+		return false;
 
-	__folio_undo_large_rmappable(folio);
+	return __folio_unqueue_deferred_split(folio);
 }
 
 static inline struct folio *page_rmappable_folio(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 81d8819f13cd..f8744f5630bb 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@  static int mem_cgroup_move_account(struct folio *folio,
 	css_get(&to->css);
 	css_put(&from->css);
 
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 	folio->memcg_data = (unsigned long)to;
 
 	__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@  static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct folio *folio;
+	bool tried_split_before = false;
 
+retry_pmd:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@  static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			folio = target.folio;
+			/*
+			 * Deferred split queue locking depends on memcg,
+			 * and unqueue is unsafe unless folio refcount is 0:
+			 * split or skip if on the queue? first try to split.
+			 */
+			if (!list_empty(&folio->_deferred_list)) {
+				spin_unlock(ptl);
+				if (!tried_split_before)
+					split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (tried_split_before)
+					return 0;
+				tried_split_before = true;
+				goto retry_pmd;
+			}
+			/*
+			 * So long as that pmd lock is held, the folio cannot
+			 * be racily added to the _deferred_list, because
+			 * __folio_remove_rmap() will find !partially_mapped.
+			 */
 			if (folio_isolate_lru(folio)) {
 				if (!mem_cgroup_move_account(folio, true,
 							     mc.from, mc.to)) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2703227cce88..06df2af97415 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,9 +4629,6 @@  static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@  static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
 
+		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 		folio->memcg_data = 0;
 	}
 
@@ -4789,6 +4787,9 @@  void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
+
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
 	old->memcg_data = 0;
 }
 
@@ -4975,6 +4976,7 @@  void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..691f25ee2489 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -489,7 +489,7 @@  static int __folio_migrate_mapping(struct address_space *mapping,
 		    folio_test_large_rmappable(folio)) {
 			if (!folio_ref_freeze(folio, expected_count))
 				return -EAGAIN;
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			folio_ref_unfreeze(folio, expected_count);
 		}
 
@@ -514,7 +514,7 @@  static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	/* Take off deferred split queue while frozen and memcg set */
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 
 	/*
 	 * Now we know that no one else is looking at the folio:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b21a368b4e2..57f64b5d0004 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,7 +2681,9 @@  void free_unref_folios(struct folio_batch *folios)
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);
 
-		folio_undo_large_rmappable(folio);
+		if (mem_cgroup_disabled())
+			folio_unqueue_deferred_split(folio);
+
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..b8e3259ea2c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -121,7 +121,7 @@  void __folio_put(struct folio *folio)
 	}
 
 	page_cache_release(folio);
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -988,7 +988,7 @@  void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			free_huge_folio(folio);
 			continue;
 		}
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		__page_cache_release(folio, &lruvec, &flags);
 
 		if (j != i)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb4e8440c507..635d45745b73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1475,7 +1475,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 */
 		nr_reclaimed += nr_pages;
 
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1863,7 +1863,7 @@  static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);