Message ID | 20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid (mailing list archive) |
---|---|
State | New |
Series | [v3] migrate_pages: Avoid blocking for IO in MIGRATE_SYNC_LIGHT |
On Fri, Apr 28, 2023 at 01:54:38PM -0700, Douglas Anderson wrote:
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Hillf Danton <hdanton@sina.com>
> Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Most of the actual code in this patch came from emails written by
> Matthew Wilcox and I just cleaned the code up to get it to compile.
> I'm happy to set authorship to him if he would like, but for now I've
> credited him with Suggested-by.

This all looks good to me. I don't care about getting credit for it.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
On 28 Apr 2023 13:54:38 -0700 Douglas Anderson <dianders@chromium.org>
> The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> finish quickly but not for things that will take a long time. Exactly
> how long is too long is not well defined, but waits of tens of
> milliseconds are likely non-ideal.
>
> When putting a Chromebook under memory pressure (opening over 90 tabs
> on a 4GB machine) it was fairly easy to see delays waiting for some
> locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> amazingly usable in this state, it was still limping along and this
> state isn't something artificial. Sometimes we simply end up with a
> lot of memory pressure.

Was kcompactd woken up for PAGE_ALLOC_COSTLY_ORDER?

>
> Putting the same Chromebook under memory pressure while it was running
> Android apps (though not stressing them) showed a much worse result
> (NOTE: this was on an older kernel but the codepaths here are similar).
> Android apps on ChromeOS currently run from a 128K-block,
> zlib-compressed, loopback-mounted squashfs disk. If we get a page
> fault from something backed by the squashfs filesystem we could end up
> holding a folio lock while reading enough from disk to decompress 128K
> (and then decompressing it using the somewhat slow zlib algorithms).
> That reading goes through the ext4 subsystem (because it's a loopback
> mount) before eventually ending up in the block subsystem. This extra
> jaunt adds extra overhead. Without much work I could see cases where
> we ended up blocked on a folio lock for over a second. With more
> extreme memory pressure I could see up to 25 seconds.

In the same kcompactd code path above?

>
> We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for
> the two locks that were seen to be slow [1] and that generated much
> discussion. After discussion, it was decided that we should avoid
> waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being
> held for IO. We'll continue with the unbounded wait for the more full
> SYNC modes.
>
> With this change, I couldn't see any slow waits on these locks with my
> previous testcases.

Well this is the upside after this change, but given the win, what is
the loss/cost paid? For example the changes in compact fail and success [1].

[1] https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/
On Fri, Apr 28, 2023 at 01:54:38PM -0700, Douglas Anderson wrote:
> The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> finish quickly but not for things that will take a long time. Exactly
> how long is too long is not well defined, but waits of tens of
> milliseconds are likely non-ideal.
>
> When putting a Chromebook under memory pressure (opening over 90 tabs
> on a 4GB machine) it was fairly easy to see delays waiting for some
> locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> amazingly usable in this state, it was still limping along and this
> state isn't something artificial. Sometimes we simply end up with a
> lot of memory pressure.
>
> Putting the same Chromebook under memory pressure while it was running
> Android apps (though not stressing them) showed a much worse result
> (NOTE: this was on an older kernel but the codepaths here are similar).
> Android apps on ChromeOS currently run from a 128K-block,
> zlib-compressed, loopback-mounted squashfs disk. If we get a page
> fault from something backed by the squashfs filesystem we could end up
> holding a folio lock while reading enough from disk to decompress 128K
> (and then decompressing it using the somewhat slow zlib algorithms).
> That reading goes through the ext4 subsystem (because it's a loopback
> mount) before eventually ending up in the block subsystem. This extra
> jaunt adds extra overhead. Without much work I could see cases where
> we ended up blocked on a folio lock for over a second. With more
> extreme memory pressure I could see up to 25 seconds.
>
> We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for
> the two locks that were seen to be slow [1] and that generated much
> discussion. After discussion, it was decided that we should avoid
> waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being
> held for IO. We'll continue with the unbounded wait for the more full
> SYNC modes.
>
> With this change, I couldn't see any slow waits on these locks with my
> previous testcases.
>
> NOTE: The reason I started digging into this originally isn't because
> some benchmark had gone awry, but because we've received in-the-field
> crash reports where we have a hung task waiting on the page lock
> (which is the equivalent code path on old kernels). While the root
> cause of those crashes is likely unrelated and won't be fixed by this
> patch, analyzing those crash reports did point out these very long
> waits seemed like something good to fix. With this patch we should no
> longer hang waiting on these locks, but presumably the system will
> still be in a bad shape and hang somewhere else.
>
> [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Hillf Danton <hdanton@sina.com>
> Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Hi,

On Sat, Apr 29, 2023 at 3:14 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On 28 Apr 2023 13:54:38 -0700 Douglas Anderson <dianders@chromium.org>
> > The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> > finish quickly but not for things that will take a long time. Exactly
> > how long is too long is not well defined, but waits of tens of
> > milliseconds are likely non-ideal.
> >
> > When putting a Chromebook under memory pressure (opening over 90 tabs
> > on a 4GB machine) it was fairly easy to see delays waiting for some
> > locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> > amazingly usable in this state, it was still limping along and this
> > state isn't something artificial. Sometimes we simply end up with a
> > lot of memory pressure.
>
> Was kcompactd woken up for PAGE_ALLOC_COSTLY_ORDER?

I put some more traces in and reproduced it again. I saw something
that looked like this:

1. balance_pgdat() called wakeup_kcompactd() with order=10 and that
   caused us to get all the way to the end and wakeup kcompactd (there
   were previous calls to wakeup_kcompactd() that returned early).

2. kcompactd started and completed kcompactd_do_work() without blocking.

3. kcompactd called proactive_compact_node() and there blocked for
   ~92ms in one case, ~120ms in another case, ~131ms in another case.

> > Putting the same Chromebook under memory pressure while it was running
> > Android apps (though not stressing them) showed a much worse result
> > (NOTE: this was on an older kernel but the codepaths here are similar).
> > Android apps on ChromeOS currently run from a 128K-block,
> > zlib-compressed, loopback-mounted squashfs disk. If we get a page
> > fault from something backed by the squashfs filesystem we could end up
> > holding a folio lock while reading enough from disk to decompress 128K
> > (and then decompressing it using the somewhat slow zlib algorithms).
> > That reading goes through the ext4 subsystem (because it's a loopback
> > mount) before eventually ending up in the block subsystem. This extra
> > jaunt adds extra overhead. Without much work I could see cases where
> > we ended up blocked on a folio lock for over a second. With more
> > extreme memory pressure I could see up to 25 seconds.
>
> In the same kcompactd code path above?

It was definitely in kcompactd. I can go back and trace through this
too, if it's useful, but I suspect it's the same.

> > We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for
> > the two locks that were seen to be slow [1] and that generated much
> > discussion. After discussion, it was decided that we should avoid
> > waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being
> > held for IO. We'll continue with the unbounded wait for the more full
> > SYNC modes.
> >
> > With this change, I couldn't see any slow waits on these locks with my
> > previous testcases.
>
> Well this is the upside after this change, but given the win, what is
> the loss/cost paid? For example the changes in compact fail and success [1].
>
> [1] https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

That looks like an interesting series. Obviously it would need to be
tested, but my hunch is that ${SUBJECT} patch would work well with
that series. Specifically with Johannes's series it seems more
important for the kcompactd thread to be working fruitfully. Having it
blocked for a long time when there is other useful work it could be
doing still seems wrong. With ${SUBJECT} patch it's not that we'll
never come back and try again, but we'll just wait until a future
iteration when (hopefully) the locks are easier to acquire. In the
meantime, we're looking for other pages to migrate.

-Doug
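
Editorial illustration (not part of the thread): the "skip now, retry in a future iteration" behaviour described in the reply above can be pictured with a toy user-space model. This is only a sketch under stated assumptions. `struct toy_page`, `toy_scan_pass()` and `toy_passes_until_done()` are hypothetical names invented here and are not kernel code; the "passes until I/O finishes" counter is an arbitrary stand-in for real I/O latency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one migration candidate (not a kernel type). */
struct toy_page {
	int io_passes_left;	/* scan passes until the lock holder's I/O completes */
	bool migrated;		/* set once the page has been "migrated" */
};

/*
 * One scan pass in the spirit of the patched MIGRATE_SYNC_LIGHT: a page whose
 * lock is held for I/O is skipped instead of waited on, and every other
 * pending page is migrated. Returns how many pages were migrated in this pass.
 */
static size_t toy_scan_pass(struct toy_page *pages, size_t n)
{
	size_t done = 0;

	assert(pages != NULL);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].migrated)
			continue;
		if (pages[i].io_passes_left > 0) {
			/* The I/O keeps progressing while we look at other pages. */
			pages[i].io_passes_left--;
			continue;
		}
		pages[i].migrated = true;
		done++;
	}
	return done;
}

/* Repeat passes until all pages are migrated; returns passes used, or -1. */
static int toy_passes_until_done(struct toy_page *pages, size_t n, int max_passes)
{
	size_t remaining = 0;

	for (size_t i = 0; i < n; i++)
		if (!pages[i].migrated)
			remaining++;

	for (int pass = 1; pass <= max_passes; pass++) {
		remaining -= toy_scan_pass(pages, n);
		if (remaining == 0)
			return pass;
	}
	return -1;
}
```

In this model a page stuck behind I/O never stalls the scan: the other pages are handled in the first pass and the busy page is picked up once its counter reaches zero.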
Hi,

On Sun, Apr 30, 2023 at 1:53 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On 28 Apr 2023 13:54:38 -0700 Douglas Anderson <dianders@chromium.org>
> > The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> > finish quickly but not for things that will take a long time. Exactly
> > how long is too long is not well defined, but waits of tens of
> > milliseconds is likely non-ideal.
> >
> > When putting a Chromebook under memory pressure (opening over 90 tabs
> > on a 4GB machine) it was fairly easy to see delays waiting for some
> > locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> > amazingly usable in this state, it was still limping along and this
> > state isn't something artificial. Sometimes we simply end up with a
> > lot of memory pressure.
>
> Given longer than 100ms stall, this can not be a correct fix if the
> hardware fails to do more than ten IOs a second.
>
> OTOH given some pages reclaimed for compaction to make forward progress
> before kswapd wakes kcompactd up, this can not be a fix without spotting
> the cause of the stall.

Right that the system is in pretty bad shape when this happens and
it's not very effective at doing IO or much of anything because it's
under bad memory pressure.

I guess my first thought is that, when this happens then a process
holding the lock gets preempted and doesn't get scheduled back in for
a while. That _should_ be possible, right? In the case where I'm
reproducing this then all the CPUs would be super busy madly trying to
compress / decompress zram, so it doesn't surprise me that a process
could get context switched out for a while.

-Doug
On 2 May 2023 14:20:54 -0700 Douglas Anderson <dianders@chromium.org>
> On Sun, Apr 30, 2023 at 1:53 AM Hillf Danton <hdanton@sina.com> wrote:
> > On 28 Apr 2023 13:54:38 -0700 Douglas Anderson <dianders@chromium.org>
> > > The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> > > finish quickly but not for things that will take a long time. Exactly
> > > how long is too long is not well defined, but waits of tens of
> > > milliseconds is likely non-ideal.
> > >
> > > When putting a Chromebook under memory pressure (opening over 90 tabs
> > > on a 4GB machine) it was fairly easy to see delays waiting for some
> > > locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> > > amazingly usable in this state, it was still limping along and this
> > > state isn't something artificial. Sometimes we simply end up with a
> > > lot of memory pressure.
> >
> > Given longer than 100ms stall, this can not be a correct fix if the
> > hardware fails to do more than ten IOs a second.
> >
> > OTOH given some pages reclaimed for compaction to make forward progress
> > before kswapd wakes kcompactd up, this can not be a fix without spotting
> > the cause of the stall.
>
> Right that the system is in pretty bad shape when this happens and
> it's not very effective at doing IO or much of anything because it's
> under bad memory pressure.

Based on the info in another reply [1]

| I put some more traces in and reproduced it again. I saw something
| that looked like this:
|
| 1. balance_pgdat() called wakeup_kcompactd() with order=10 and that
| caused us to get all the way to the end and wakeup kcompactd (there
| were previous calls to wakeup_kcompactd() that returned early).
|
| 2. kcompactd started and completed kcompactd_do_work() without blocking.
|
| 3. kcompactd called proactive_compact_node() and there blocked for
| ~92ms in one case, ~120ms in another case, ~131ms in another case.

I see fragmentation given order=10 and proactive_compact_node(). Can you
specify the evidence of bad memory pressure?

[1] https://lore.kernel.org/lkml/CAD=FV=V8m-mpJsFntCciqtq7xnvhmnvPdTvxNuBGBT3-cDdabQ@mail.gmail.com/

>
> I guess my first thought is that, when this happens then a process
> holding the lock gets preempted and doesn't get scheduled back in for
> a while. That _should_ be possible, right? In the case where I'm
> reproducing this then all the CPUs would be super busy madly trying to
> compress / decompress zram, so it doesn't surprise me that a process
> could get context switched out for a while.

Could switchout turn the below I/O upside down?

	/*
	 * In "light" mode, we can wait for transient locks (eg
	 * inserting a page into the page table), but it's not
	 * worth waiting for I/O.
	 */
Hi,

On Tue, May 2, 2023 at 6:45 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 2 May 2023 14:20:54 -0700 Douglas Anderson <dianders@chromium.org>
> > On Sun, Apr 30, 2023 at 1:53 AM Hillf Danton <hdanton@sina.com> wrote:
> > > On 28 Apr 2023 13:54:38 -0700 Douglas Anderson <dianders@chromium.org>
> > > > The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
> > > > finish quickly but not for things that will take a long time. Exactly
> > > > how long is too long is not well defined, but waits of tens of
> > > > milliseconds is likely non-ideal.
> > > >
> > > > When putting a Chromebook under memory pressure (opening over 90 tabs
> > > > on a 4GB machine) it was fairly easy to see delays waiting for some
> > > > locks in the kcompactd code path of > 100 ms. While the laptop wasn't
> > > > amazingly usable in this state, it was still limping along and this
> > > > state isn't something artificial. Sometimes we simply end up with a
> > > > lot of memory pressure.
> > >
> > > Given longer than 100ms stall, this can not be a correct fix if the
> > > hardware fails to do more than ten IOs a second.
> > >
> > > OTOH given some pages reclaimed for compaction to make forward progress
> > > before kswapd wakes kcompactd up, this can not be a fix without spotting
> > > the cause of the stall.
> >
> > Right that the system is in pretty bad shape when this happens and
> > it's not very effective at doing IO or much of anything because it's
> > under bad memory pressure.
>
> Based on the info in another reply [1]
>
> | I put some more traces in and reproduced it again. I saw something
> | that looked like this:
> |
> | 1. balance_pgdat() called wakeup_kcompactd() with order=10 and that
> | caused us to get all the way to the end and wakeup kcompactd (there
> | were previous calls to wakeup_kcompactd() that returned early).
> |
> | 2. kcompactd started and completed kcompactd_do_work() without blocking.
> |
> | 3. kcompactd called proactive_compact_node() and there blocked for
> | ~92ms in one case, ~120ms in another case, ~131ms in another case.
>
> I see fragmentation given order=10 and proactive_compact_node(). Can you
> specify the evidence of bad memory pressure?

What type of evidence are you looking for? When I'm reproducing these
problems, I'm running a test that specifically puts the system under
memory pressure by opening up lots of tabs in the Chrome browser. When
I start seeing these printouts, I can take a look at the system and I
can see that it's pretty much constantly swapping in and swapping out.

> > I guess my first thought is that, when this happens then a process
> > holding the lock gets preempted and doesn't get scheduled back in for
> > a while. That _should_ be possible, right? In the case where I'm
> > reproducing this then all the CPUs would be super busy madly trying to
> > compress / decompress zram, so it doesn't surprise me that a process
> > could get context switched out for a while.
>
> Could switchout turn the below I/O upside down?
> /*
>  * In "light" mode, we can wait for transient locks (eg
>  * inserting a page into the page table), but it's not
>  * worth waiting for I/O.
>  */

I'm not sure I understand what you're asking, sorry!

-Doug
On 5 May 2023 10:11:41 -0700 Douglas Anderson <dianders@chromium.org>
> What type of evidence are you looking for?

Because kswapd is responsible for maintaining watermarks, there is no
memory pressure if it decides to take a nap. Anyway defragmentation can
not make forward progress without enough free pages, see
kswapd_is_running() in should_proactive_compact_node().

> When I'm reproducing these
> problems, I'm running a test that specifically puts the system under
> memory pressure by opening up lots of tabs in the Chrome browser. When
> I start seeing these printouts, I can take a look at the system and I
> can see that it's pretty much constantly swapping in and swapping out.

Now turn to what is important. The constant swapin and swapout say more
tabs are opened than designed, and simply closing enough tabs is a better
fix, because a fix like this one fails again with 20 more tabs opened,
for instance.
diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af..4a384eb32917 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -698,37 +698,32 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
+	struct buffer_head *failed_bh;
 
-	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
-		do {
-			lock_buffer(bh);
-			bh = bh->b_this_page;
-
-		} while (bh != head);
-
-		return true;
-	}
-
-	/* async case, we cannot block on lock_buffer so use trylock_buffer */
 	do {
 		if (!trylock_buffer(bh)) {
-			/*
-			 * We failed to lock the buffer and cannot stall in
-			 * async migration. Release the taken locks
-			 */
-			struct buffer_head *failed_bh = bh;
-			bh = head;
-			while (bh != failed_bh) {
-				unlock_buffer(bh);
-				bh = bh->b_this_page;
-			}
-			return false;
+			if (mode == MIGRATE_ASYNC)
+				goto unlock;
+			if (mode == MIGRATE_SYNC_LIGHT && !buffer_uptodate(bh))
+				goto unlock;
+			lock_buffer(bh);
 		}
 
 		bh = bh->b_this_page;
 	} while (bh != head);
+
 	return true;
+
+unlock:
+	/* We failed to lock the buffer and cannot stall. */
+	failed_bh = bh;
+	bh = head;
+	while (bh != failed_bh) {
+		unlock_buffer(bh);
+		bh = bh->b_this_page;
+	}
+
+	return false;
 }
 
 static int __buffer_migrate_folio(struct address_space *mapping,
@@ -1162,6 +1157,14 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 		if (current->flags & PF_MEMALLOC)
 			goto out;
 
+		/*
+		 * In "light" mode, we can wait for transient locks (eg
+		 * inserting a page into the page table), but it's not
+		 * worth waiting for I/O.
+		 */
+		if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
+			goto out;
+
 		folio_lock(src);
 	}
 	locked = true;
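
Editorial illustration (not part of the patch): the control flow of the patched buffer_migrate_lock_buffers() can be tried out in user space with a small model. This is a sketch under assumptions, not the kernel implementation. `struct mock_bh`, `mock_trylock()`, `mock_lock()`, `mock_unlock()`, `init_ring()` and `lock_all_buffers()` are hypothetical stand-ins invented here, the local `enum migrate_mode` only mirrors the three mode names used in the thread, and the blocking lock is modelled as succeeding immediately because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the three modes discussed in the thread. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical mock of a buffer head: just enough state for the lock loop. */
struct mock_bh {
	bool locked;		/* the buffer lock is currently held */
	bool uptodate;		/* false models "locked because a read is in flight" */
	struct mock_bh *next;	/* circular list, playing the role of b_this_page */
};

static bool mock_trylock(struct mock_bh *bh)
{
	if (bh->locked)
		return false;
	bh->locked = true;
	return true;
}

/* Blocking lock: the model pretends the previous holder released it at once. */
static void mock_lock(struct mock_bh *bh)
{
	bh->locked = true;
}

static void mock_unlock(struct mock_bh *bh)
{
	bh->locked = false;
}

/* Same shape as the patched loop: lock every buffer or roll back and fail. */
static bool lock_all_buffers(struct mock_bh *head, enum migrate_mode mode)
{
	struct mock_bh *bh = head;
	struct mock_bh *failed_bh;

	assert(head != NULL);
	do {
		if (!mock_trylock(bh)) {
			if (mode == MIGRATE_ASYNC)
				goto unlock;
			if (mode == MIGRATE_SYNC_LIGHT && !bh->uptodate)
				goto unlock;
			mock_lock(bh);
		}
		bh = bh->next;
	} while (bh != head);

	return true;

unlock:
	/* Release only the locks taken before the buffer that failed. */
	failed_bh = bh;
	bh = head;
	while (bh != failed_bh) {
		mock_unlock(bh);
		bh = bh->next;
	}
	return false;
}

/* Build a 3-buffer ring whose middle buffer is already locked by another owner. */
static void init_ring(struct mock_bh ring[3], bool middle_uptodate)
{
	for (int i = 0; i < 3; i++) {
		ring[i].locked = false;
		ring[i].uptodate = true;
		ring[i].next = &ring[(i + 1) % 3];
	}
	ring[1].locked = true;
	ring[1].uptodate = middle_uptodate;
}
```

With a ring prepared by `init_ring(ring, false)`, the model gives up in MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT and leaves the first buffer unlocked again, while MIGRATE_SYNC takes the blocking path and ends with all three buffers locked.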
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
finish quickly but not for things that will take a long time. Exactly
how long is too long is not well defined, but waits of tens of
milliseconds are likely non-ideal.

When putting a Chromebook under memory pressure (opening over 90 tabs
on a 4GB machine) it was fairly easy to see delays waiting for some
locks in the kcompactd code path of > 100 ms. While the laptop wasn't
amazingly usable in this state, it was still limping along and this
state isn't something artificial. Sometimes we simply end up with a
lot of memory pressure.

Putting the same Chromebook under memory pressure while it was running
Android apps (though not stressing them) showed a much worse result
(NOTE: this was on an older kernel but the codepaths here are similar).
Android apps on ChromeOS currently run from a 128K-block,
zlib-compressed, loopback-mounted squashfs disk. If we get a page
fault from something backed by the squashfs filesystem we could end up
holding a folio lock while reading enough from disk to decompress 128K
(and then decompressing it using the somewhat slow zlib algorithms).
That reading goes through the ext4 subsystem (because it's a loopback
mount) before eventually ending up in the block subsystem. This extra
jaunt adds extra overhead. Without much work I could see cases where
we ended up blocked on a folio lock for over a second. With more
extreme memory pressure I could see up to 25 seconds.

We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for
the two locks that were seen to be slow [1] and that generated much
discussion. After discussion, it was decided that we should avoid
waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being
held for IO. We'll continue with the unbounded wait for the more full
SYNC modes.

With this change, I couldn't see any slow waits on these locks with my
previous testcases.

NOTE: The reason I started digging into this originally isn't because
some benchmark had gone awry, but because we've received in-the-field
crash reports where we have a hung task waiting on the page lock
(which is the equivalent code path on old kernels). While the root
cause of those crashes is likely unrelated and won't be fixed by this
patch, analyzing those crash reports did point out these very long
waits seemed like something good to fix. With this patch we should no
longer hang waiting on these locks, but presumably the system will
still be in a bad shape and hang somewhere else.

[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid

Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Most of the actual code in this patch came from emails written by
Matthew Wilcox and I just cleaned the code up to get it to compile.
I'm happy to set authorship to him if he would like, but for now I've
credited him with Suggested-by.

This patch has changed pretty significantly between versions, so
adding a link to previous versions to help anyone needing to find the
history:
v1 - https://lore.kernel.org/r/20230413182313.RFC.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
v2 - https://lore.kernel.org/r/20230421221249.1616168-1-dianders@chromium.org/

Changes in v3:
- Combine patches for buffers and folios.
- Use buffer_uptodate() and folio_test_uptodate() instead of timeout.

Changes in v2:
- Keep unbounded delay in "SYNC", delay with a timeout in "SYNC_LIGHT".
- Also add a timeout for locking of buffers.

 mm/migrate.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)
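
Editorial illustration (not part of the patch): the rule from the commit message for a contended folio lock can be written as a pure decision function. This is a sketch only. `enum lock_action` and `contended_folio_action()` are hypothetical names invented here, the local `enum migrate_mode` merely mirrors the mode names from the thread, and the early exit for MIGRATE_ASYNC is an assumption about the pre-existing behaviour that the diff hunk does not show. The PF_MEMALLOC check and the new "not uptodate" check are taken from the hunk in migrate_folio_unmap().

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the three modes discussed in the thread. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical result type: what to do when the trylock already failed. */
enum lock_action {
	LOCK_SKIP,	/* give up on this folio for now ("goto out") */
	LOCK_BLOCK	/* fall through to the blocking folio_lock() */
};

/*
 * Decide how to treat a folio whose trylock failed.
 *   uptodate    - models folio_test_uptodate(src)
 *   in_memalloc - models (current->flags & PF_MEMALLOC)
 */
static enum lock_action contended_folio_action(enum migrate_mode mode,
					       bool uptodate, bool in_memalloc)
{
	/* Assumption: async migration never blocks on a contended lock. */
	if (mode == MIGRATE_ASYNC)
		return LOCK_SKIP;

	/* Shown as context in the hunk: do not block inside a memalloc context. */
	if (in_memalloc)
		return LOCK_SKIP;

	/* New in this patch: "light" mode does not wait for a lock held for I/O. */
	if (mode == MIGRATE_SYNC_LIGHT && !uptodate)
		return LOCK_SKIP;

	assert(mode == MIGRATE_SYNC_LIGHT || mode == MIGRATE_SYNC);
	return LOCK_BLOCK;
}
```

Read this way, the only case that changes relative to the old behaviour is MIGRATE_SYNC_LIGHT with a folio that is not uptodate: it used to block and is now skipped, while MIGRATE_SYNC keeps the unbounded wait.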