Message ID: cover.1718690645.git.baolin.wang@linux.alibaba.com (mailing list archive)
Series: support large folio swap-out and swap-in for shmem
On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Shmem will support large folio allocation [1] [2] to get better
> performance; however, memory reclaim still splits the precious large
> folios when trying to swap out shmem, which may lead to memory
> fragmentation and cannot take advantage of large folios for shmem.
>
> Moreover, the swap code already supports swapping out large folios
> without splitting, and the large folio swap-in [3] series is queued
> into the mm-unstable branch. Hence this patch set also supports large
> folio swap-out and swap-in for shmem.

I'll add this to mm-unstable for some exposure, but I wonder how much
testing it will have received by the time the next merge window opens?

> Functional testing
> ==================
> I used the latest mm-unstable branch to test, with Chris's
> "mm: swap: mTHP swap allocator base on swap cluster order" series
> reverted, since it can cause some problems (Hugh also reported this
> in [4]).

I dropped Chris's series from mm-unstable.
On 2024/6/19 04:05, Andrew Morton wrote:
> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>
>> [...]
>
> I'll add this to mm-unstable for some exposure, but I wonder how much
> testing it will have received by the time the next merge window opens?

Thanks Andrew. I am fine with this series going into 6.12 if you are
concerned about insufficient testing (and let's also wait for Hugh's
comments). Since we (Daniel and I) have some follow-up patches that
rely on this swap series, I hope this series can be tested as
extensively as possible to ensure its stability in the mm branch.
On Wed, 19 Jun 2024, Baolin Wang wrote:
> On 2024/6/19 04:05, Andrew Morton wrote:
> > [...]
> >
> > I'll add this to mm-unstable for some exposure, but I wonder how much
> > testing it will have received by the time the next merge window opens?
>
> Thanks Andrew. I am fine with this series going into 6.12 if you are
> concerned about insufficient testing (and let's also wait for Hugh's
> comments). Since we (Daniel and I) have some follow-up patches that
> rely on this swap series, I hope this series can be tested as
> extensively as possible to ensure its stability in the mm branch.

Thanks for giving it the exposure, Andrew, but please drop it from
mm-unstable until the next cycle. I'd been about to write to say I
wouldn't be trying it until next cycle when your mm-commits came in,
so I thought I ought at least to give mm-everything-2024-06-18 a try.

Baolin may have fixed stuff, but he (or the interaction with other mm
work) has broken stuff too: I couldn't get as far with it as with the
previous version. Just "cp -a" of a kernel source tree into a tmpfs
huge=always size=<bigenough> failed with lots of ENOSPCs; when it was
"rm -rf"ed, lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
and on a second attempt, a VM_BUG_ON_FOLIO(!folio_contains) from
find_lock_entries().

Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
I'm trying to chase even when this series is reverted: some kind of page
double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
mostly from page reclaim or from exit_mmap(). I'm still getting a feel
for it; maybe it occurs soon enough for a reliable bisection, maybe not.

(While writing, a run with mm-unstable cut off at 2a9964cc5d27,
drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
has not yet hit any problem: too early to tell, but promising.)

And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
Chris Li's mTHP swap series, which worked fairly well until it locked
up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
on a page with 0 refcount, while a page table lock was held which, one
by one, the other CPUs came to want for reclaim. On two machines.

None of these problems were seen on Stephen's last next-2024-06-13.

I had wanted to see whether mm-everything-2024-06-18 fixed that lockup,
but with the new problems I cannot tell (or it could all be the same
problem: if so, it's odd that it manifests consistently differently).

There are way too many mTHP shmem and swap patch series floating
around at the moment, in mm and in fs, for me to cope with:
everyone, please slow down and test more.

Hugh

p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
long enough, and deserves promotion to hotfix and to Linus soon.
On 2024/6/19 16:16, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Baolin Wang wrote:
>> [...]
>
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle. I'd been about to write to say I
> wouldn't be trying it until next cycle when your mm-commits came in,
> so I thought I ought at least to give mm-everything-2024-06-18 a try.
>
> Baolin may have fixed stuff, but he (or the interaction with other mm
> work) has broken stuff too: I couldn't get as far with it as with the
> previous version. Just "cp -a" of a kernel source tree into a tmpfs
> huge=always size=<bigenough> failed with lots of ENOSPCs; when it was
> "rm -rf"ed, lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
> and on a second attempt, a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().

Thanks Hugh for giving it a try. I can also reproduce the
WARN_ON(inode->i_blocks) issue with today's mm-unstable branch (sadly,
this issue did not occur in the older mm-unstable branch), and I now
have a fix in my local tree. But I will not send out a V3 so quickly,
and I agree we can drop this series from the mm-unstable branch until
the next cycle.

> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug

I did not encounter the VM_BUG_ON_FOLIO() issue, but let me try your
test case...

> I'm trying to chase even when this series is reverted: some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it; maybe it occurs soon enough for a reliable bisection, maybe not.
>
> [...]
>
> There are way too many mTHP shmem and swap patch series floating
> around at the moment, in mm and in fs, for me to cope with:
> everyone, please slow down and test more.

Sure. I will continue to do more testing. Thanks.
On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Wed, 19 Jun 2024, Baolin Wang wrote:
> > [...]
>
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle.

Thanks, dropped.

> p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> long enough, and deserves promotion to hotfix and to Linus soon.

Oh, OK, done.

And it's cc:stable. I didn't get any sense of urgency for this one -
what is your thinking here?
On Wed, 19 Jun 2024, Andrew Morton wrote:
> On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > On Wed, 19 Jun 2024, Baolin Wang wrote:
> > > [...]
> >
> > Thanks for giving it the exposure, Andrew, but please drop it from
> > mm-unstable until the next cycle.
>
> Thanks, dropped.

Thanks. I'll add a little more info in another mail, about the further
2024-06-18 problems I reported, but the tl;dr is that they are still a
mystery: I cannot yet say "drop this" or "drop that" or "here's a fix".

> > p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> > long enough, and deserves promotion to hotfix and to Linus soon.
>
> Oh, OK, done.
>
> And it's cc:stable. I didn't get any sense of urgency for this one -
> what is your thinking here?

I thought you were right to add the cc:stable. The current v6.8..v6.10
state does not result in any crashes or warnings, but it totally (well,
511 times out of 512, in some workloads anyway) defeats the purpose of
shmem+file huge pages: the kernel goes to all the trouble of trying to
allocate those huge pages, but then refuses to map them by PMD unless
the fault happens to occur within the first 4096 bytes (talking x86_64).

I imagine that this causes a significant performance degradation in
some workloads which ask for and are used to getting huge pages there;
and they might also exceed their memory quotas, since a page table has
to be allocated where a PMD mapping needs none (anon THPs reserve a page
table anyway, to rely on later if splitting, but shmem+file THPs do not).
And it's surprising that no tests were reported as failing.

I did forget that khugepaged should come along later and fix this up
(that used not to be done, but I believe we got it working around v6.6).
The tests where I noticed the low ShmemPmdMapped were too quick for
khugepaged to fix them: but if you'd prefer to avoid cc:stable, we could
say that khugepaged should in due course usually fix the long-running
cases, which are the ones most in need of fixing.

Hugh
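For context, a minimal sketch of the kind of check being described above, under the assumption that the fault path hands do_set_pmd() the precise page for the faulting address; the helper name and exact condition below are illustrative, not quoted from mm/memory.c:

	/*
	 * Illustrative sketch (hypothetical helper, not the kernel's code):
	 * if only the folio's head page qualifies for a PMD mapping, a fault
	 * landing in any of the other 511 subpages of a PMD-sized folio falls
	 * back to mapping by PTEs plus a freshly allocated page table --
	 * matching the 511-out-of-512 behaviour described above.
	 */
	static bool fault_can_map_pmd(struct folio *folio, struct page *page)
	{
		if (folio_order(folio) != HPAGE_PMD_ORDER)
			return false;
		/* The over-strict condition being fixed: head page only. */
		return page == &folio->page;
	}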
On Wed, 19 Jun 2024, Hugh Dickins wrote:
>
> and on a second attempt, a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().
>
> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> I'm trying to chase even when this series is reverted:

Yes, I now doubt that the VM_BUG_ON_FOLIO(!folio_contains) was related
to Baolin's series: much more likely an instance of the other problems.

> some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it; maybe it occurs soon enough for a reliable bisection, maybe not.
>
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell, but promising.)

Yes, that ran without trouble for many hours on two machines. I didn't
do a formal bisection, but I did appear to narrow it down convincingly
to Barry's folio_add_new_anon_rmap() series: crashes soon on both
machines with Barry's in but Baolin's out; no crashes with both out.

Yet while I was studying Barry's patches trying to explain it, one
of the machines did at last crash: it's as if Barry's has opened a
window which makes these crashes more likely, but is not itself to
blame.

I'll go back to studying that crash now: two CPUs crashed at about the
same time, so perhaps they interacted and will give a hint at the root
cause.

(I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
was all about optimizing a known-exclusive case, so it surprises me
to see it being extended to non-exclusive; and I worry over how its
atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
I've never caught up with David's exclusive changes, I'm out of date).

But even if those are wrong, I'd expect them to tend towards a mapped
page becoming unreclaimable, then "Bad page map" when munmapped,
not to any of the double-free symptoms I've actually seen.)

> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series, which worked fairly well until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock was held which, one
> by one, the other CPUs came to want for reclaim. On two machines.

I've not seen that symptom at all since 2024-06-15: intriguing,
but none of us can afford the time to worry about vanished bugs.

Hugh
On 20.06.24 05:59, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Andrew Morton wrote:
>> [...]
>>
>> And it's cc:stable. I didn't get any sense of urgency for this one -
>> what is your thinking here?
>
> I thought you were right to add the cc:stable. The current v6.8..v6.10
> state does not result in any crashes or warnings, but it totally (well,
> 511 times out of 512, in some workloads anyway) defeats the purpose of
> shmem+file huge pages: the kernel goes to all the trouble of trying to
> allocate those huge pages, but then refuses to map them by PMD unless
> the fault happens to occur within the first 4096 bytes (talking x86_64).
>
> I imagine that this causes a significant performance degradation in
> some workloads which ask for and are used to getting huge pages there;
> and they might also exceed their memory quotas, since a page table has
> to be allocated where a PMD mapping needs none (anon THPs reserve a page
> table anyway, to rely on later if splitting, but shmem+file THPs do not).
> And it's surprising that no tests were reported as failing.

Exactly my thinking. Either a lack of tests, or it doesn't really
happen that often in cases where khugepaged doesn't fix it up. After
all, it's been two kernel releases ...
>
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

We discussed that a while ago: if we weren't holding the folio lock in
the "folio == swapcache" case at that point (which we do for both
do_swap_page() and unuse_pte()), things would already be pretty broken.

That's why I added this a while ago:

	if (unlikely(!folio_test_anon(folio))) {
		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
		/*
		 * For a PTE-mapped large folio, we only know that the single
		 * PTE is exclusive. Further, __folio_set_anon() might not get
		 * folio->index right when not given the address of the head
		 * page.
		 */
		...

We should probably move that VM_WARN_ON_FOLIO() to
folio_add_new_anon_rmap() and document that it's required in the
non-exclusive case.

> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)

What's the first known-good commit?
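A rough sketch of what that move might look like, assuming the flags-taking signature from Barry's series; the placement and exclusivity test here are illustrative, not committed code:

	void folio_add_new_anon_rmap(struct folio *folio,
				     struct vm_area_struct *vma,
				     unsigned long address, rmap_t flags)
	{
		const bool exclusive = flags & RMAP_EXCLUSIVE;

		/*
		 * Non-exclusive (swapin) callers must hold the folio lock to
		 * serialize against concurrent rmap adds; assert and document
		 * that here, once, instead of at each call site.
		 */
		VM_WARN_ON_FOLIO(!exclusive && !folio_test_locked(folio), folio);
		/* ... existing body ... */
	}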
On Thu, 20 Jun 2024, David Hildenbrand wrote:
>
> > (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> > was all about optimizing a known-exclusive case, so it surprises me
> > to see it being extended to non-exclusive; and I worry over how its
> > atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> > I've never caught up with David's exclusive changes, I'm out of date).
>
> We discussed that a while ago: if we weren't holding the folio lock in
> the "folio == swapcache" case at that point (which we do for both
> do_swap_page() and unuse_pte()), things would already be pretty broken.

You're thinking of the non-atomic-ness of atomic_set(): I agree that
the folio lock makes that safe (against other adds; but not against
concurrent removes, which could never occur with the old "_new" usage).

But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
once the folio lock has been dropped, another non-exclusive add could
come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
it should be 2)?

> That's why I added this a while ago:
>
> 	if (unlikely(!folio_test_anon(folio))) {
> 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> 		/*
> 		 * For a PTE-mapped large folio, we only know that the single
> 		 * PTE is exclusive. Further, __folio_set_anon() might not get
> 		 * folio->index right when not given the address of the head
> 		 * page.
> 		 */
> 		...
>
> We should probably move that VM_WARN_ON_FOLIO() to
> folio_add_new_anon_rmap() and document that it's required in the
> non-exclusive case.
>
> > But even if those are wrong, I'd expect them to tend towards a mapped
> > page becoming unreclaimable, then "Bad page map" when munmapped,
> > not to any of the double-free symptoms I've actually seen.)
>
> What's the first known-good commit?

I cannot answer that with any certainty: we're on the shifting sands
of next and mm-unstable, and the bug itself is looking rather like
something which gets amplified or masked by other changes - witness
my confident arrival at Barry's series as introducing the badness,
only for a longer run then to contradict that conclusion.

There was no sign of a problem in a 20-hour run of the same load on
rc3-based next-2024-06-13 (plus my posted fixes); there has been no
sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
itself, mm.git needing the more urgent attention). mm-everything-
2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
did show the filemap_get_folio() hang. mm-everything-2024-06-18
(minus Baolin's mTHP shmem swap) is where I'm hunting it.

Hugh
On 20.06.24 18:27, Hugh Dickins wrote:
> On Thu, 20 Jun 2024, David Hildenbrand wrote:
>
>> We discussed that a while ago: if we weren't holding the folio lock in
>> the "folio == swapcache" case at that point (which we do for both
>> do_swap_page() and unuse_pte()), things would already be pretty broken.
>
> You're thinking of the non-atomic-ness of atomic_set(): I agree that
> the folio lock makes that safe (against other adds; but not against
> concurrent removes, which could never occur with the old "_new" usage).
>
> But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
> once the folio lock has been dropped, another non-exclusive add could
> come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
> it should be 2)?

That would mean that we have someone checking folio_test_anon() before
doing the folio_add_new*, and *not* performing that check under the
folio lock, such that we can race with another folio_add_new*() call
(or not checking folio_test_anon() at all, which would be similarly
broken).

When thinking about this in the past, I concluded that such code would
be inherently racy and wrong already, and that the existing
VM_WARN_ON_FOLIO() would have revealed that race.

Whoever calls folio_add_new*() either

(a) just allocated that thing and can be sure that nobody else does
    something concurrently -- no need for the folio lock; or

(b) calls it only when !folio_test_anon(), and synchronizes against any
    other possible races using the folio lock.

Swapin code falls into category (b); everything else we have in the
tree falls into category (a). So far my thinking :) (there is a sketch
of the category (b) pattern at the end of this mail)

>> That's why I added this a while ago:
>>
>> [...]
>>
>> We should probably move that VM_WARN_ON_FOLIO() to
>> folio_add_new_anon_rmap() and document that it's required in the
>> non-exclusive case.
>>
>>> But even if those are wrong, I'd expect them to tend towards a mapped
>>> page becoming unreclaimable, then "Bad page map" when munmapped,
>>> not to any of the double-free symptoms I've actually seen.)
>>
>> What's the first known-good commit?
>
> I cannot answer that with any certainty: we're on the shifting sands
> of next and mm-unstable, and the bug itself is looking rather like
> something which gets amplified or masked by other changes - witness
> my confident arrival at Barry's series as introducing the badness,
> only for a longer run then to contradict that conclusion.
>
> There was no sign of a problem in a 20-hour run of the same load on
> rc3-based next-2024-06-13 (plus my posted fixes); there has been no
> sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
> itself, mm.git needing the more urgent attention). mm-everything-
> 2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
> did show the filemap_get_folio() hang. mm-everything-2024-06-18
> (minus Baolin's mTHP shmem swap) is where I'm hunting it.

Good, as long as we suspect only something very recent is affected.

There was this report against rc3 upstream:

https://lore.kernel.org/all/CA+G9fYvKmr84WzTArmfaypKM9+=Aw0uXCtuUKHQKFCNMGJyOgQ@mail.gmail.com/

It's only been observed once, and who knows what's happening in that
NFS setup. I suspected the NUMA hinting code, but I am not sure that
really explains the issue ...
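To make category (b) concrete, a hedged sketch of the swapin-style call pattern; the function names follow the current rmap API as I understand it, and the rmap_flags variable plus the flags-taking folio_add_new_anon_rmap() signature (from Barry's series) are assumptions here:

	/*
	 * Category (b) sketch: !folio_test_anon() is only trusted while the
	 * folio lock is held, so a racing second mapper cannot also observe
	 * "not anon" and re-run the "new" path (which would reset _mapcount
	 * to 0 instead of incrementing it, as Hugh worried above).
	 */
	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
	if (!folio_test_anon(folio))
		/* First mapper: set up anon state and initialize _mapcount. */
		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
	else
		/* Later mappers: increment mapcounts, never reset them. */
		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
					 rmap_flags);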
On Wed, 19 Jun 2024, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Hugh Dickins wrote:
> >
> > and on a second attempt, a VM_BUG_ON_FOLIO(!folio_contains) from
> > find_lock_entries().
> >
> > [...]
>
> Yes, I now doubt that the VM_BUG_ON_FOLIO(!folio_contains) was related
> to Baolin's series: much more likely an instance of the other problems.
>
> [...]
>
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

David answered on this, thanks: yes, I hadn't got around to seeing that
it only got to be called this way when page anon was not already set:
so agreed, no problem here with the _mapcount 0.

But eventually I came to see what's wrong with folio_add_new_anon_rmap():
this Baolin thread is the wrong place to post the fix, so I'll send it
now in response to Barry's 2/3. With that fix, and another mentioned
below, mm-unstable becomes very much more stable for me.

There is still something else causing "Bad page"s and maybe double
frees, something perhaps in the intersection of folio migration and
deferred splitting and miniTHP; but it's too rare: it happened last
night when I wanted to post, but has refused to reappear since then.
It's not holding up testing; I'll give it more thought next time it
comes.

> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)
>
> > And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> > Chris Li's mTHP swap series, which worked fairly well until it locked
> > up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> > on a page with 0 refcount, while a page table lock was held which, one
> > by one, the other CPUs came to want for reclaim. On two machines.
>
> I've not seen that symptom at all since 2024-06-15: intriguing,
> but none of us can afford the time to worry about vanished bugs.

That issue reappeared when I was testing on next-20240621: I'll send
the fix now in response to Kefeng's 2/5.
Hugh