[v2,0/9] support large folio swap-out and swap-in for shmem

Message ID	cover.1718690645.git.baolin.wang@linux.alibaba.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Baolin Wang <baolin.wang@linux.alibaba.com> To: akpm@linux-foundation.org, hughd@google.com Cc: willy@infradead.org, david@redhat.com, wangkefeng.wang@huawei.com, chrisl@kernel.org, ying.huang@intel.com, 21cnbao@gmail.com, ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com, ioworker0@gmail.com, da.gomez@samsung.com, p.raghav@samsung.com, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Date: Tue, 18 Jun 2024 14:54:12 +0800 Message-Id: <cover.1718690645.git.baolin.wang@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	support large folio swap-out and swap-in for shmem \| expand [v2,0/9] support large folio swap-out and swap-in for shmem [v2,1/9] mm: vmscan: add validation before spliting shmem large folio [v2,2/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting [v2,3/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap [v2,4/9] mm: shmem: return number of pages beeing freed in shmem_free_swap [v2,5/9] mm: filemap: use xa_get_order() to get the swap entry order [v2,6/9] mm: shmem: use swap_free_nr() to free shmem swap entries [v2,7/9] mm: shmem: support large folio allocation for shmem_replace_folio() [v2,8/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() [v2,9/9] mm: shmem: support large folio swap out

Baolin Wang June 18, 2024, 6:54 a.m. UTC

Shmem will support large folio allocation [1] [2] to get a better performance,
however, the memory reclaim still splits the precious large folios when trying
to swap-out shmem, which may lead to the memory fragmentation issue and can not
take advantage of the large folio for shmeme.

Moreover, the swap code already supports for swapping out large folio without
split, and large folio swap-in[3] series is queued into mm-unstable branch.
Hence this patch set also supports the large folio swap-out and swap-in for
shmem.

Functional testing
==================
I use the latest mm-unstable branch to test with reverting Chris's
"mm: swap: mTHP swap allocator base on swap cluster order" series which
can cause some problems (Hugh also reported in [4]).

Machine environment: 32 Arm cores, 120G memory and 50G swap device.

1. Run xfstests suite to test tmpfs filesystem, and I did not catch any
regressions with this patch set.
FSTYP=tmpfs
export TEST_DIR=/mnt/tempfs_mnt
export TEST_DEV=/mnt/tempfs_mnt
export SCRATCH_MNT=/mnt/scratchdir
export SCRATCH_DEV=/mnt/scratchdir

2. Run all mm selftests in tools/testing/selftests/mm/, and no
regressions found.

3. I also wrote several shmem swap test cases, including shmem splitting,
shmem swapout, shmem swapin, swapoff during shmem swapout, shmem reclaim,
shmem swapin replacement, etc. I tested these cases under 4K and 64K
shmem folio sizes with a swap device, and shmem swap functionality works
well on my machine (I can share these test cases if needed).

Hugh, I think this version has fixed the reference issue you pointed out
before, and I have also fixed some issues related to shmem swapin replacement.
So could you help to take a look at this series when you find some time?
Thanks a lot.

[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
[3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
[4] https://lore.kernel.org/all/8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com/

Changes from v1:
 - Remove useless 'order' variable in shmem_partial_swap_usage(), per Daniel.
 - Add a new patch to return number of pages beeing freed in shmem_free_swap(),
 per Daniel.
 - Drop 'orders' parameter for find_get_entries() and find_lock_entries().
 - Round down the index when adding the swapin folio into the pagecache,
 suggested by Hugh.
 - Fix the reference issue when removing folio from pagecache in patch 8.
 - Fix replacing old folio in swap cache in patch 7.

Changes from RFC:
 - Rebased to the latest mm-unstable.
 - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
 branch.

Baolin Wang (8):
  mm: vmscan: add validation before spliting shmem large folio
  mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
    flag setting
  mm: shmem: extend shmem_partial_swap_usage() to support large folio
    swap
  mm: filemap: use xa_get_order() to get the swap entry order
  mm: shmem: use swap_free_nr() to free shmem swap entries
  mm: shmem: support large folio allocation for shmem_replace_folio()
  mm: shmem: drop folio reference count using 'nr_pages' in
    shmem_delete_from_page_cache()
  mm: shmem: support large folio swap out

Daniel Gomez (1):
  mm: shmem: return number of pages beeing freed in shmem_free_swap

 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |   1 +
 include/linux/swap.h                      |   4 +-
 include/linux/writeback.h                 |   1 +
 mm/filemap.c                              |   4 +
 mm/shmem.c                                | 107 +++++++++++++---------
 mm/swapfile.c                             |  98 ++++++++++----------
 mm/vmscan.c                               |  22 ++++-
 7 files changed, 142 insertions(+), 95 deletions(-)

Andrew Morton June 18, 2024, 8:05 p.m. UTC | #1

On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Shmem will support large folio allocation [1] [2] to get a better performance,
> however, the memory reclaim still splits the precious large folios when trying
> to swap-out shmem, which may lead to the memory fragmentation issue and can not
> take advantage of the large folio for shmeme.
> 
> Moreover, the swap code already supports for swapping out large folio without
> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> Hence this patch set also supports the large folio swap-out and swap-in for
> shmem.

I'll add this to mm-unstable for some exposure, but I wonder how much
testing it will have recieved by the time the next merge window opens?

> Functional testing
> ==================
> I use the latest mm-unstable branch to test with reverting Chris's
> "mm: swap: mTHP swap allocator base on swap cluster order" series which
> can cause some problems (Hugh also reported in [4]).

I dropped Chris's series from mm-unstable.

Baolin Wang June 19, 2024, 1:28 a.m. UTC | #2

On 2024/6/19 04:05, Andrew Morton wrote:
> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Shmem will support large folio allocation [1] [2] to get a better performance,
>> however, the memory reclaim still splits the precious large folios when trying
>> to swap-out shmem, which may lead to the memory fragmentation issue and can not
>> take advantage of the large folio for shmeme.
>>
>> Moreover, the swap code already supports for swapping out large folio without
>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>> Hence this patch set also supports the large folio swap-out and swap-in for
>> shmem.
> 
> I'll add this to mm-unstable for some exposure, but I wonder how much
> testing it will have recieved by the time the next merge window opens?

Thanks Andrew. I am fine with this series going to 6.12 if you are 
concerned about insufficient testing (and let's also wait for Hugh's 
comments). Since we (Daniel and I) have some follow-up patches that will 
rely on this swap series, hope this series can be tested as extensively 
as possible to ensure its stability in the mm branch.

Hugh Dickins June 19, 2024, 8:16 a.m. UTC | #3

On Wed, 19 Jun 2024, Baolin Wang wrote:
> On 2024/6/19 04:05, Andrew Morton wrote:
> > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> > 
> >> Shmem will support large folio allocation [1] [2] to get a better
> >> performance,
> >> however, the memory reclaim still splits the precious large folios when
> >> trying
> >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> >> not
> >> take advantage of the large folio for shmeme.
> >>
> >> Moreover, the swap code already supports for swapping out large folio
> >> without
> >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> >> Hence this patch set also supports the large folio swap-out and swap-in for
> >> shmem.
> > 
> > I'll add this to mm-unstable for some exposure, but I wonder how much
> > testing it will have recieved by the time the next merge window opens?
> 
> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> about insufficient testing (and let's also wait for Hugh's comments). Since we
> (Daniel and I) have some follow-up patches that will rely on this swap series,
> hope this series can be tested as extensively as possible to ensure its
> stability in the mm branch.

Thanks for giving it the exposure, Andrew, but please drop it from
mm-unstable until the next cycle. I'd been about to write to say I
wouldn't be trying it until next cycle, when your mm-commits came in:
so I thought I ought at least to give mm-everything-2024-06-18 a try.

Baolin may have fixed stuff, but he (or the interaction with other mm
work) has broken stuff too: I couldn't get as far with it as with the
previous version. Just "cp -a" of a kernel source tree into a tmpfs
huge=always size=<bigenough> failed with lots of ENOSPCs, and when
"rm -rf"ed lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
find_lock_entries().

Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
I'm trying to chase even when this series is reverted: some kind of page
double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
mostly from page reclaim or from exit_mmap(). I'm still getting a feel
for it, maybe it occurs soon enough for a reliable bisection, maybe not.

(While writing, a run with mm-unstable cut off at 2a9964cc5d27,
drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
has not yet hit any problem: too early to tell but promising.)

And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
Chris Li's mTHP swap series: which worked fairly well, until it locked
up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
on a page with 0 refcount, while a page table lock is held which one
by one the other CPUs come to want for reclaim. On two machines.

None of these problems seen on Stephen's last next-2024-06-13.
I had wanted to see if mm-everything-2024-06-18 fixed that lockup,
but with the new problems I cannot tell (or it could all be the same
problem: but if so, odd that it manifests consistently differently).

There are way too many mTHP shmem and swap patch series floating
around at the moment, in mm and in fs, for me to cope with:
everyone, please slow down and test more.

Hugh

p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
long enough, and deserves promotion to hotfix and Linus soon.

Baolin Wang June 19, 2024, 8:58 a.m. UTC | #4

On 2024/6/19 16:16, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Baolin Wang wrote:
>> On 2024/6/19 04:05, Andrew Morton wrote:
>>> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>> Shmem will support large folio allocation [1] [2] to get a better
>>>> performance,
>>>> however, the memory reclaim still splits the precious large folios when
>>>> trying
>>>> to swap-out shmem, which may lead to the memory fragmentation issue and can
>>>> not
>>>> take advantage of the large folio for shmeme.
>>>>
>>>> Moreover, the swap code already supports for swapping out large folio
>>>> without
>>>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>>>> Hence this patch set also supports the large folio swap-out and swap-in for
>>>> shmem.
>>>
>>> I'll add this to mm-unstable for some exposure, but I wonder how much
>>> testing it will have recieved by the time the next merge window opens?
>>
>> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
>> about insufficient testing (and let's also wait for Hugh's comments). Since we
>> (Daniel and I) have some follow-up patches that will rely on this swap series,
>> hope this series can be tested as extensively as possible to ensure its
>> stability in the mm branch.
> 
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle. I'd been about to write to say I
> wouldn't be trying it until next cycle, when your mm-commits came in:
> so I thought I ought at least to give mm-everything-2024-06-18 a try.
> 
> Baolin may have fixed stuff, but he (or the interaction with other mm
> work) has broken stuff too: I couldn't get as far with it as with the
> previous version. Just "cp -a" of a kernel source tree into a tmpfs
> huge=always size=<bigenough> failed with lots of ENOSPCs, and when
> "rm -rf"ed lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
> and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().

Thanks Hugh for giving it a try. I also can reproduce the 
WARN_ON(inode->i_blocks) issue with today's mm-unstable branch (sadly, 
this issue didn't occur in the older mm-unstable branch), and now I have 
a fix in my local tree. But I will not send out a V3 so quickly, and I 
agree we can drop this series from mm-unstable branch until next cycle.

> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug

I did not encounter the VM_BUG_ON_FOLIO() issue, but let me try your 
testing case...

> I'm trying to chase even when this series is reverted: some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> 
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell but promising.)
> 
> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series: which worked fairly well, until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock is held which one
> by one the other CPUs come to want for reclaim. On two machines.
> 
> None of these problems seen on Stephen's last next-2024-06-13.
> I had wanted to see if mm-everything-2024-06-18 fixed that lockup,
> but with the new problems I cannot tell (or it could all be the same
> problem: but if so, odd that it manifests consistently differently).
> 
> There are way too many mTHP shmem and swap patch series floating
> around at the moment, in mm and in fs, for me to cope with:
> everyone, please slow down and test more.

Sure. I will continue to do more testing. Thanks.

Andrew Morton June 20, 2024, 1:33 a.m. UTC | #5

On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Wed, 19 Jun 2024, Baolin Wang wrote:
> > On 2024/6/19 04:05, Andrew Morton wrote:
> > > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > > <baolin.wang@linux.alibaba.com> wrote:
> > > 
> > >> Shmem will support large folio allocation [1] [2] to get a better
> > >> performance,
> > >> however, the memory reclaim still splits the precious large folios when
> > >> trying
> > >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> > >> not
> > >> take advantage of the large folio for shmeme.
> > >>
> > >> Moreover, the swap code already supports for swapping out large folio
> > >> without
> > >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> > >> Hence this patch set also supports the large folio swap-out and swap-in for
> > >> shmem.
> > > 
> > > I'll add this to mm-unstable for some exposure, but I wonder how much
> > > testing it will have recieved by the time the next merge window opens?
> > 
> > Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> > about insufficient testing (and let's also wait for Hugh's comments). Since we
> > (Daniel and I) have some follow-up patches that will rely on this swap series,
> > hope this series can be tested as extensively as possible to ensure its
> > stability in the mm branch.
> 
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle.

Thanks, dropped.

> p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> long enough, and deserves promotion to hotfix and Linus soon.

Oh, OK, done.

And it's cc:stable.  I didn't get any sens of urgency for this one -
what is your thinking here?

Hugh Dickins June 20, 2024, 3:59 a.m. UTC | #6

On Wed, 19 Jun 2024, Andrew Morton wrote:
> On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > On Wed, 19 Jun 2024, Baolin Wang wrote:
> > > On 2024/6/19 04:05, Andrew Morton wrote:
> > > > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > > > <baolin.wang@linux.alibaba.com> wrote:
> > > > 
> > > >> Shmem will support large folio allocation [1] [2] to get a better
> > > >> performance,
> > > >> however, the memory reclaim still splits the precious large folios when
> > > >> trying
> > > >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> > > >> not
> > > >> take advantage of the large folio for shmeme.
> > > >>
> > > >> Moreover, the swap code already supports for swapping out large folio
> > > >> without
> > > >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> > > >> Hence this patch set also supports the large folio swap-out and swap-in for
> > > >> shmem.
> > > > 
> > > > I'll add this to mm-unstable for some exposure, but I wonder how much
> > > > testing it will have recieved by the time the next merge window opens?
> > > 
> > > Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> > > about insufficient testing (and let's also wait for Hugh's comments). Since we
> > > (Daniel and I) have some follow-up patches that will rely on this swap series,
> > > hope this series can be tested as extensively as possible to ensure its
> > > stability in the mm branch.
> > 
> > Thanks for giving it the exposure, Andrew, but please drop it from
> > mm-unstable until the next cycle.
> 
> Thanks, dropped.

Thanks. I'll add a little more info in other mail, against the further
2024-06-18 problems I reported, but tl;dr is they are still a mystery:
I cannot yet say "drop this" or "drop that" or "here's a fix".

> 
> > p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> > long enough, and deserves promotion to hotfix and Linus soon.
> 
> Oh, OK, done.
> 
> And it's cc:stable.  I didn't get any sens of urgency for this one -
> what is your thinking here?

I thought you were right to add the cc:stable. The current v6.8..v6.10
state does not result in any crashes or warnings, but it totally (well,
511 times out of 512, in some workloads anyway) defeats the purpose of
shmem+file huge pages - the kernel is going to all the trouble of trying
to allocate those huge pages, but then refuses to map them by PMD unless
the fault happens to occur within the first 4096 bytes (talking x86_64).

I imagine that this causes a significant performance degradation in
some workloads which ask for and are used to getting huge pages there;
and they might also exceed their memory quotas, since a page table has
to be allocated where a PMD-mapping needs none (anon THPs reserve a page
table anyway, to rely on later if splitting, but shmem+file THPs do not).
And it's surprising that no tests were reported as failing.

I did forget that khugepaged should come along later and fix this up
(that used not to be done, but I believe we got it working around v6.6).
The tests where I noticed the low ShmemPmdMapped were too quick for
khugepaged to fix them: but if you'd prefer to avoid cc:stable, we
could say that khugepaged should in due course usually fix the
long-running cases, which are the ones most in need of fixing.

Hugh

Hugh Dickins June 20, 2024, 4:42 a.m. UTC | #7

On Wed, 19 Jun 2024, Hugh Dickins wrote:
> 
> and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().
> 
> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> I'm trying to chase even when this series is reverted:

Yes, I doubt now that the VM_BUG_ON_FOLIO(!folio_contains) was related
to Baolin's series: much more likely to be an instance of other problems.

> some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> 
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell but promising.)

Yes, that ran without trouble for many hours on two machines. I didn't
do a formal bisection, but did appear to narrow it down convincingly to
Barry's folio_add_new_anon_rmap() series: crashes soon on both machines
with Barry's in but Baolin's out, no crashes with both out.

Yet while I was studying Barry's patches trying to explain it, one
of the machines did at last crash: it's as if Barry's has opened a
window which makes these crashes more likely, but not itself to blame.

I'll go back to studying that crash now: two CPUs crashed about the
same time, perhaps they interacted and give a hint at root cause.

(I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
was all about optimizing a known-exclusive case, so it surprises me
to see it being extended to non-exclusive; and I worry over how its
atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
I've never caught up with David's exclusive changes, I'm out of date).

But even if those are wrong, I'd expect them to tend towards a mapped
page becoming unreclaimable, then "Bad page map" when munmapped,
not to any of the double-free symptoms I've actually seen.)

> 
> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series: which worked fairly well, until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock is held which one
> by one the other CPUs come to want for reclaim. On two machines.

I've not seen that symptom at all since 2024-06-15: intriguing,
but none of us can afford the time to worry about vanished bugs.

Hugh

David Hildenbrand June 20, 2024, 6:41 a.m. UTC | #8

On 20.06.24 05:59, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Andrew Morton wrote:
>> On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>> On Wed, 19 Jun 2024, Baolin Wang wrote:
>>>> On 2024/6/19 04:05, Andrew Morton wrote:
>>>>> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>
>>>>>> Shmem will support large folio allocation [1] [2] to get a better
>>>>>> performance,
>>>>>> however, the memory reclaim still splits the precious large folios when
>>>>>> trying
>>>>>> to swap-out shmem, which may lead to the memory fragmentation issue and can
>>>>>> not
>>>>>> take advantage of the large folio for shmeme.
>>>>>>
>>>>>> Moreover, the swap code already supports for swapping out large folio
>>>>>> without
>>>>>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>>>>>> Hence this patch set also supports the large folio swap-out and swap-in for
>>>>>> shmem.
>>>>>
>>>>> I'll add this to mm-unstable for some exposure, but I wonder how much
>>>>> testing it will have recieved by the time the next merge window opens?
>>>>
>>>> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
>>>> about insufficient testing (and let's also wait for Hugh's comments). Since we
>>>> (Daniel and I) have some follow-up patches that will rely on this swap series,
>>>> hope this series can be tested as extensively as possible to ensure its
>>>> stability in the mm branch.
>>>
>>> Thanks for giving it the exposure, Andrew, but please drop it from
>>> mm-unstable until the next cycle.
>>
>> Thanks, dropped.
> 
> Thanks. I'll add a little more info in other mail, against the further
> 2024-06-18 problems I reported, but tl;dr is they are still a mystery:
> I cannot yet say "drop this" or "drop that" or "here's a fix".
> 
>>
>>> p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
>>> long enough, and deserves promotion to hotfix and Linus soon.
>>
>> Oh, OK, done.
>>
>> And it's cc:stable.  I didn't get any sens of urgency for this one -
>> what is your thinking here?
> 
> I thought you were right to add the cc:stable. The current v6.8..v6.10
> state does not result in any crashes or warnings, but it totally (well,
> 511 times out of 512, in some workloads anyway) defeats the purpose of
> shmem+file huge pages - the kernel is going to all the trouble of trying
> to allocate those huge pages, but then refuses to map them by PMD unless
> the fault happens to occur within the first 4096 bytes (talking x86_64).
> 
> I imagine that this causes a significant performance degradation in
> some workloads which ask for and are used to getting huge pages there;
> and they might also exceed their memory quotas, since a page table has
> to be allocated where a PMD-mapping needs none (anon THPs reserve a page
> table anyway, to rely on later if splitting, but shmem+file THPs do not).
> And it's surprising that no tests were reported as failing.

Exactly my thinking. Either lack of tests or it doesn't really happen 
that often where khugepaged doesn't fix it up.

After all it's been two kernel releases ....

David Hildenbrand June 20, 2024, 6:52 a.m. UTC | #9

> 
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

We discussed that a while ago: if we wouldn't be holding the folio lock 
in the "folio == swapcache" at that point (which we do for both 
do_swap_page() and unuse_pte()) things would already be pretty broken.

That's I added a while ago:

if (unlikely(!folio_test_anon(folio))) {
	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
	/*
	 * For a PTE-mapped large folio, we only know that the single
	 * PTE is exclusive. Further, __folio_set_anon() might not get
	 * folio->index right when not given the address of the head
	 * page.
	 */
	...

We should probably move that VM_WARN_ON_FOLIO() to 
folio_add_new_anon_rmap() and document that it's required in the 
non-exclusive case.

> 
> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)

What's the first known-good commit?

Hugh Dickins June 20, 2024, 4:27 p.m. UTC | #10

On Thu, 20 Jun 2024, David Hildenbrand wrote:

> > 
> > (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> > was all about optimizing a known-exclusive case, so it surprises me
> > to see it being extended to non-exclusive; and I worry over how its
> > atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> > I've never caught up with David's exclusive changes, I'm out of date).
> 
> We discussed that a while ago: if we wouldn't be holding the folio lock in the
> "folio == swapcache" at that point (which we do for both do_swap_page() and
> unuse_pte()) things would already be pretty broken.

You're thinking of the non-atomic-ness of atomic_set(): I agree that
the folio lock makes that safe (against other adds; but not against
concurrent removes, which could never occur with the old "_new" usage).

But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
once the folio lock has been dropped, another non-exclusive add could
come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
it should be 2)?

> 
> That's I added a while ago:
> 
> if (unlikely(!folio_test_anon(folio))) {
> 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> 	/*
> 	 * For a PTE-mapped large folio, we only know that the single
> 	 * PTE is exclusive. Further, __folio_set_anon() might not get
> 	 * folio->index right when not given the address of the head
> 	 * page.
> 	 */
> 	...
> 
> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
> and document that it's required in the non-exclusive case.
> 
> > 
> > But even if those are wrong, I'd expect them to tend towards a mapped
> > page becoming unreclaimable, then "Bad page map" when munmapped,
> > not to any of the double-free symptoms I've actually seen.)
> 
> What's the first known-good commit?

I cannot answer that with any certainty: we're on the shifting sands
of next and mm-unstable, and the bug itself is looking rather like
something which gets amplified or masked by other changes - witness
my confident arrival at Barry's series as introducing the badness,
only for a longer run then to contradict that conclusion.

There was no sign of a problem in a 20-hour run of the same load on
rc3-based next-2024-06-13 (plus my posted fixes); there has been no
sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
itself, mm.git needing the more urgent attention). mm-everything-
2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
Baolin's mTHP shmem swap) is where I'm hunting it.

Hugh

David Hildenbrand June 20, 2024, 4:51 p.m. UTC | #11

On 20.06.24 18:27, Hugh Dickins wrote:
> On Thu, 20 Jun 2024, David Hildenbrand wrote:
> 
>>>
>>> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
>>> was all about optimizing a known-exclusive case, so it surprises me
>>> to see it being extended to non-exclusive; and I worry over how its
>>> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
>>> I've never caught up with David's exclusive changes, I'm out of date).
>>
>> We discussed that a while ago: if we wouldn't be holding the folio lock in the
>> "folio == swapcache" at that point (which we do for both do_swap_page() and
>> unuse_pte()) things would already be pretty broken.
> 
> You're thinking of the non-atomic-ness of atomic_set(): I agree that
> the folio lock makes that safe (against other adds; but not against
> concurrent removes, which could never occur with the old "_new" usage).
> 
> But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
> once the folio lock has been dropped, another non-exclusive add could
> come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
> it should be 2)?

That would mean that we have someone checking folio_test_anon() before 
doing the folio_add_new*, and *not* performing that check under folio 
lock such that we can race with another folio_add_new*() call. (or not 
checking for folio_test_anon() at all, which would be similarly broken)

When thinking about this in the past I concluded that such code would be 
inherently racy and wrong already and the existing VM_WARN_ON_FOLIO() 
would have revealed that race.

Whoever calls folio_add_new*() either (a) just allocated that thing and 
can be sure that nobody else does something concurrently -- no need for 
the folio lock; or (b) calls it only when !folio_test_anon(), and 
synchronizes against any other possible races using the folio lock.

swapin code falls into category (b), everything else we have in the tree 
into category (a).

So far my thinking :)

> 
>>
>> That's I added a while ago:
>>
>> if (unlikely(!folio_test_anon(folio))) {
>> 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> 	/*
>> 	 * For a PTE-mapped large folio, we only know that the single
>> 	 * PTE is exclusive. Further, __folio_set_anon() might not get
>> 	 * folio->index right when not given the address of the head
>> 	 * page.
>> 	 */
>> 	...
>>
>> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
>> and document that it's required in the non-exclusive case.
>>
>>>
>>> But even if those are wrong, I'd expect them to tend towards a mapped
>>> page becoming unreclaimable, then "Bad page map" when munmapped,
>>> not to any of the double-free symptoms I've actually seen.)
>>
>> What's the first known-good commit?
> 
> I cannot answer that with any certainty: we're on the shifting sands
> of next and mm-unstable, and the bug itself is looking rather like
> something which gets amplified or masked by other changes - witness
> my confident arrival at Barry's series as introducing the badness,
> only for a longer run then to contradict that conclusion.
> 
> There was no sign of a problem in a 20-hour run of the same load on
> rc3-based next-2024-06-13 (plus my posted fixes); there has been no
> sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
> itself, mm.git needing the more urgent attention). mm-everything-
> 2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
> did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
> Baolin's mTHP shmem swap) is where I'm hunting it.

Good, as long as we suspect only something  very recent is affected.

There was this report against rc3 upstream:

https://lore.kernel.org/all/CA+G9fYvKmr84WzTArmfaypKM9+=Aw0uXCtuUKHQKFCNMGJyOgQ@mail.gmail.com/

It's only been observed once, and who knows what's happening in that NFS 
setup. I suspected NUMA hinting code, but I am not sure if that really 
explains the issue ...

Hugh Dickins June 25, 2024, 4:54 a.m. UTC | #12

On Wed, 19 Jun 2024, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Hugh Dickins wrote:
> > 
> > and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> > find_lock_entries().
> > 
> > Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> > I'm trying to chase even when this series is reverted:
> 
> Yes, I doubt now that the VM_BUG_ON_FOLIO(!folio_contains) was related
> to Baolin's series: much more likely to be an instance of other problems.
> 
> > some kind of page
> > double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> > mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> > for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> > 
> > (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> > drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> > has not yet hit any problem: too early to tell but promising.)
> 
> Yes, that ran without trouble for many hours on two machines. I didn't
> do a formal bisection, but did appear to narrow it down convincingly to
> Barry's folio_add_new_anon_rmap() series: crashes soon on both machines
> with Barry's in but Baolin's out, no crashes with both out.
> 
> Yet while I was studying Barry's patches trying to explain it, one
> of the machines did at last crash: it's as if Barry's has opened a
> window which makes these crashes more likely, but not itself to blame.
> 
> I'll go back to studying that crash now: two CPUs crashed about the
> same time, perhaps they interacted and give a hint at root cause.
> 
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

David answered on this, thanks: yes, I hadn't got around to seeing that
it only got to be called this way when page anon was not already set:
so agreed, no problem here with the _mapcount 0.

But eventually I came to see what's wrong with folio_add_new_anon_rmap():
this Baolin thread is the wrong place to post the fix, I'll send it now
in response to Barry's 2/3.  With that fix, and another mentioned below,
mm-unstable becomes very much more stable for me.

There is still something else causing "Bad page"s and maybe double
frees, something perhaps in the intersection of folio migration and
deferred splitting and miniTHP; but it's too rare, happened last night
when I wanted to post, but has refused to reappear since then; not
holding up testing, I'll give it more thought next time it comes.

> 
> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)
> 
> > 
> > And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> > Chris Li's mTHP swap series: which worked fairly well, until it locked
> > up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> > on a page with 0 refcount, while a page table lock is held which one
> > by one the other CPUs come to want for reclaim. On two machines.
> 
> I've not seen that symptom at all since 2024-06-15: intriguing,
> but none of us can afford the time to worry about vanished bugs.

That issue reappeared when I was testing on next-20240621:
I'll send the fix now in response to Kefeng's 2/5.

Hugh

[v2,0/9] support large folio swap-out and swap-in for shmem

Message

Comments