Message ID | 20220221111749.1928222-1-cgel.zte@gmail.com |
---|---|
State | New |
Series | [linux-next] mm: swap: get rid of deadloop in swapin readahead |
On Mon, 21 Feb 2022 11:17:49 +0000 cgel.zte@gmail.com wrote:

> From: Guo Ziliang <guo.ziliang@zte.com.cn>
>
> In our testing, a deadloop task was found. Through sysrq printing, the
> same stack was found every time, as follows:
>
>   __swap_duplicate+0x58/0x1a0
>   swapcache_prepare+0x24/0x30
>   __read_swap_cache_async+0xac/0x220
>   read_swap_cache_async+0x58/0xa0
>   swapin_readahead+0x24c/0x628
>   do_swap_page+0x374/0x8a0
>   __handle_mm_fault+0x598/0xd60
>   handle_mm_fault+0x114/0x200
>   do_page_fault+0x148/0x4d0
>   do_translation_fault+0xb0/0xd4
>   do_mem_abort+0x50/0xb0
>
> The reason for the deadloop is that swapcache_prepare() always returns
> -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> task cannot jump out of the loop. We suspect that the task that clears
> the SWAP_HAS_CACHE flag never gets a chance to run. We tried lowering
> the priority of the task stuck in the deadloop so that the task that
> clears the SWAP_HAS_CACHE flag could run; the system returned to
> normal after the priority was lowered.
>
> In our testing, multiple real-time tasks were bound to the same core,
> and the task in the deadloop was the highest-priority task on that
> core, so it could not be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async(), it is an
> empty function on a preemptive kernel and so cannot serve to release
> the CPU. A high-priority task cannot release the CPU unless preempted
> by a higher-priority task, and when this task is already the
> highest-priority task on its core, no other task can be scheduled. So
> we think we should replace cond_resched() with
> schedule_timeout_uninterruptible(1): schedule_timeout_uninterruptible()
> calls set_current_state() first to set the task state and remove the
> task from the run queue, giving up the CPU and preventing the task
> from running in kernel mode for too long.
>
> ...
>
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>  		 * in swap_map, but not yet added its page to swap cache.
>  		 */
> -		cond_resched();
> +		schedule_timeout_uninterruptible(1);
>  	}
>
>  	/*

Sigh. I guess yes, we should do this, at least in a short-term,
backportable-to-stable way.

But busy-waiting while hoping that someone else will save us isn't an
attractive design. Hugh, have you ever thought about something more
deterministic in there?

Thanks.
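The mechanism the commit message relies on is easiest to see in the helper
itself. The following is essentially the v5.17-era implementation from
kernel/time/timer.c, lightly simplified:

    /*
     * kernel/time/timer.c (lightly simplified): unlike cond_resched(),
     * this unconditionally takes the task off the run queue, so even
     * the highest-priority task on the core yields for at least one
     * jiffy.
     */
    signed long __sched schedule_timeout_uninterruptible(signed long timeout)
    {
            __set_current_state(TASK_UNINTERRUPTIBLE);  /* leave the run queue */
            return schedule_timeout(timeout);           /* sleep, then return */
    }

Because the task state is set before schedule_timeout() runs, the task is
dequeued regardless of its priority, which is exactly what the spinning
reader needs here.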
On Mon 21-02-22 11:17:49, cgel.zte@gmail.com wrote:

> From: Guo Ziliang <guo.ziliang@zte.com.cn>
>
> In our testing, a deadloop task was found. Through sysrq printing, the
> same stack was found every time, as follows:
>
>   __swap_duplicate+0x58/0x1a0
>   swapcache_prepare+0x24/0x30
>   __read_swap_cache_async+0xac/0x220
>   read_swap_cache_async+0x58/0xa0
>   swapin_readahead+0x24c/0x628
>   do_swap_page+0x374/0x8a0
>   __handle_mm_fault+0x598/0xd60
>   handle_mm_fault+0x114/0x200
>   do_page_fault+0x148/0x4d0
>   do_translation_fault+0xb0/0xd4
>   do_mem_abort+0x50/0xb0
>
> The reason for the deadloop is that swapcache_prepare() always returns
> -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> task cannot jump out of the loop. We suspect that the task that clears
> the SWAP_HAS_CACHE flag never gets a chance to run. We tried lowering
> the priority of the task stuck in the deadloop so that the task that
> clears the SWAP_HAS_CACHE flag could run; the system returned to
> normal after the priority was lowered.
>
> In our testing, multiple real-time tasks were bound to the same core,
> and the task in the deadloop was the highest-priority task on that
> core, so it could not be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async(), it is an
> empty function on a preemptive kernel and so cannot serve to release
> the CPU. A high-priority task cannot release the CPU unless preempted
> by a higher-priority task, and when this task is already the
> highest-priority task on its core, no other task can be scheduled. So
> we think we should replace cond_resched() with
> schedule_timeout_uninterruptible(1): schedule_timeout_uninterruptible()
> calls set_current_state() first to set the task state and remove the
> task from the run queue, giving up the CPU and preventing the task
> from running in kernel mode for too long.

I am sorry but I really do not see how this case is any different from
any other kernel code path being hogged by an RT task. We surely
shouldn't put sleeps into all the random paths which are doing
cond_resched() at the moment.

> Reported-by: Zeal Robot <zealci@zte.com.cn>
> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
> Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
> Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
> ---
>  mm/swap_state.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8d4104242100..ee67164531c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>  		 * in swap_map, but not yet added its page to swap cache.
>  		 */
> -		cond_resched();
> +		schedule_timeout_uninterruptible(1);
>  	}
>
>  	/*
> --
> 2.15.2
On Mon, 28 Feb 2022 08:57:49 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Mon 21-02-22 11:17:49, cgel.zte@gmail.com wrote:
> > From: Guo Ziliang <guo.ziliang@zte.com.cn>
> >
> > In our testing, a deadloop task was found. Through sysrq printing, the
> > same stack was found every time, as follows:
> >
> >   __swap_duplicate+0x58/0x1a0
> >   swapcache_prepare+0x24/0x30
> >   __read_swap_cache_async+0xac/0x220
> >   read_swap_cache_async+0x58/0xa0
> >   swapin_readahead+0x24c/0x628
> >   do_swap_page+0x374/0x8a0
> >   __handle_mm_fault+0x598/0xd60
> >   handle_mm_fault+0x114/0x200
> >   do_page_fault+0x148/0x4d0
> >   do_translation_fault+0xb0/0xd4
> >   do_mem_abort+0x50/0xb0
> >
> > The reason for the deadloop is that swapcache_prepare() always returns
> > -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> > task cannot jump out of the loop. We suspect that the task that clears
> > the SWAP_HAS_CACHE flag never gets a chance to run. We tried lowering
> > the priority of the task stuck in the deadloop so that the task that
> > clears the SWAP_HAS_CACHE flag could run; the system returned to
> > normal after the priority was lowered.
> >
> > In our testing, multiple real-time tasks were bound to the same core,
> > and the task in the deadloop was the highest-priority task on that
> > core, so it could not be preempted.
> >
> > Although cond_resched() is used by __read_swap_cache_async(), it is an
> > empty function on a preemptive kernel and so cannot serve to release
> > the CPU. A high-priority task cannot release the CPU unless preempted
> > by a higher-priority task, and when this task is already the
> > highest-priority task on its core, no other task can be scheduled. So
> > we think we should replace cond_resched() with
> > schedule_timeout_uninterruptible(1): schedule_timeout_uninterruptible()
> > calls set_current_state() first to set the task state and remove the
> > task from the run queue, giving up the CPU and preventing the task
> > from running in kernel mode for too long.
>
> I am sorry but I really do not see how this case is any different from
> any other kernel code path being hogged by an RT task. We surely
> shouldn't put sleeps into all the random paths which are doing
> cond_resched() at the moment.

But this cond_resched() is different from most. This one is attempting
to yield the CPU so that this task can make progress. And cond_resched()
simply isn't an appropriate way of doing that, because in this fairly
common situation it's a no-op.
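"It's a no-op" is close to literal on a fully preemptive kernel. A
simplified sketch of the pre-CONFIG_PREEMPT_DYNAMIC definitions in
include/linux/sched.h (the real macro also carries a might_sleep() debug
check, and dynamic preemption later complicated the picture):

    /* include/linux/sched.h, simplified (non-PREEMPT_DYNAMIC case) */
    #ifndef CONFIG_PREEMPTION
    extern int _cond_resched(void);     /* may call schedule() */
    #else
    static inline int _cond_resched(void)
    {
            return 0;   /* preemptible kernel: explicit points compile away */
    }
    #endif

    #define cond_resched() _cond_resched()

The rationale for the no-op is that a preemptible kernel can reschedule
anywhere, so explicit points are redundant - but that assumption fails for
the highest-priority RT task on a core, which nothing is allowed to preempt.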
On Fri, 25 Feb 2022, Andrew Morton wrote:

> On Mon, 21 Feb 2022 11:17:49 +0000 cgel.zte@gmail.com wrote:
> > From: Guo Ziliang <guo.ziliang@zte.com.cn>
> >
> > In our testing, a deadloop task was found. Through sysrq printing, the
> > same stack was found every time, as follows:
> >
> >   __swap_duplicate+0x58/0x1a0
> >   swapcache_prepare+0x24/0x30
> >   __read_swap_cache_async+0xac/0x220
> >   read_swap_cache_async+0x58/0xa0
> >   swapin_readahead+0x24c/0x628
> >   do_swap_page+0x374/0x8a0
> >   __handle_mm_fault+0x598/0xd60
> >   handle_mm_fault+0x114/0x200
> >   do_page_fault+0x148/0x4d0
> >   do_translation_fault+0xb0/0xd4
> >   do_mem_abort+0x50/0xb0
> >
> > The reason for the deadloop is that swapcache_prepare() always returns
> > -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> > task cannot jump out of the loop. We suspect that the task that clears
> > the SWAP_HAS_CACHE flag never gets a chance to run. We tried lowering
> > the priority of the task stuck in the deadloop so that the task that
> > clears the SWAP_HAS_CACHE flag could run; the system returned to
> > normal after the priority was lowered.
> >
> > In our testing, multiple real-time tasks were bound to the same core,
> > and the task in the deadloop was the highest-priority task on that
> > core, so it could not be preempted.
> >
> > Although cond_resched() is used by __read_swap_cache_async(), it is an
> > empty function on a preemptive kernel and so cannot serve to release
> > the CPU. A high-priority task cannot release the CPU unless preempted
> > by a higher-priority task, and when this task is already the
> > highest-priority task on its core, no other task can be scheduled. So
> > we think we should replace cond_resched() with
> > schedule_timeout_uninterruptible(1): schedule_timeout_uninterruptible()
> > calls set_current_state() first to set the task state and remove the
> > task from the run queue, giving up the CPU and preventing the task
> > from running in kernel mode for too long.
> >
> > ...
> >
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> >  		 * in swap_map, but not yet added its page to swap cache.
> >  		 */
> > -		cond_resched();
> > +		schedule_timeout_uninterruptible(1);
> >  	}
> >
> >  	/*
>
> Sigh. I guess yes, we should do this, at least in a short-term,
> backportable-to-stable way.
>
> But busy-waiting while hoping that someone else will save us isn't an
> attractive design. Hugh, have you ever thought about something more
> deterministic in there?

Not something more deterministic, no: I think that would entail
heavier locking, perhaps slowing down hotter paths, just to avoid
this swap oddity.

This loop was written long before there was a preemptive kernel:
it was appropriate then, and almost never needed more than one retry
to complete; but preemption changed the story without us realizing.

Sigh here too. I commend the thread on it from July 2018:
https://lore.kernel.org/linux-mm/2018072514403228778860@wingtech.com/

There the 4.9-stable user proposed preempt_disable(); I agreed, but
found the patch provided insufficient, and offered another 4.9 patch
further down the thread. Your preference at the time was msleep(1).

I was working on a similar patch for 4.18, but have not completed it
yet ;) and don't remember how satisfied or not I was with that one;
and I wonder if I'm any more likely to get it finished by 2026. It's
clear that I put much more thought into it back then than just now.

Maybe someone else would have a go: my 4.9 patch in that thread
shows most of it, but might need a lot of work to update to 5.17.

And it also gathered some temporary debug stats on how often this
happens: I'm not conscious of using RT at all, but was disturbed to see
how long an ordinary preemptive kernel was sometimes spinning there.
So I think I agree with you more than with Michal on that: RT just
makes the bad behaviour more obvious.

Hugh
On Mon, 28 Feb 2022 20:07:33 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> > >  		 * in swap_map, but not yet added its page to swap cache.
> > >  		 */
> > > -		cond_resched();
> > > +		schedule_timeout_uninterruptible(1);
> > >  	}
> > >
> > >  	/*
> >
> > Sigh. I guess yes, we should do this, at least in a short-term,
> > backportable-to-stable way.
> >
> > But busy-waiting while hoping that someone else will save us isn't an
> > attractive design. Hugh, have you ever thought about something more
> > deterministic in there?
>
> Not something more deterministic, no: I think that would entail
> heavier locking, perhaps slowing down hotter paths, just to avoid
> this swap oddity.
>
> This loop was written long before there was a preemptive kernel:
> it was appropriate then, and almost never needed more than one retry
> to complete; but preemption changed the story without us realizing.
>
> Sigh here too. I commend the thread on it from July 2018:
> https://lore.kernel.org/linux-mm/2018072514403228778860@wingtech.com/
>
> There the 4.9-stable user proposed preempt_disable(); I agreed, but
> found the patch provided insufficient, and offered another 4.9 patch
> further down the thread. Your preference at the time was msleep(1).
>
> I was working on a similar patch for 4.18, but have not completed it
> yet ;) and don't remember how satisfied or not I was with that one;
> and I wonder if I'm any more likely to get it finished by 2026. It's
> clear that I put much more thought into it back then than just now.
>
> Maybe someone else would have a go: my 4.9 patch in that thread
> shows most of it, but might need a lot of work to update to 5.17.
>
> And it also gathered some temporary debug stats on how often this
> happens: I'm not conscious of using RT at all, but was disturbed to see
> how long an ordinary preemptive kernel was sometimes spinning there.
> So I think I agree with you more than with Michal on that: RT just
> makes the bad behaviour more obvious.

Thanks as always. Using msleep() seems pretty pointless, so I plan to
go ahead with the patch as-is, with a cc:stable. None of it is pretty,
but it's better than what we have now, yes?
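For reference on the msleep() point: msleep() is itself built on the same
primitive, so it would change nothing but the rounding. From
kernel/time/timer.c, essentially verbatim:

    /*
     * kernel/time/timer.c: a loop around schedule_timeout_uninterruptible(),
     * with msecs rounded up to jiffies - so msleep(1) sleeps at least as
     * long as the patch's schedule_timeout_uninterruptible(1), and usually
     * longer.
     */
    void msleep(unsigned int msecs)
    {
            unsigned long timeout = msecs_to_jiffies(msecs) + 1;

            while (timeout)
                    timeout = schedule_timeout_uninterruptible(timeout);
    }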
On Mon 28-02-22 07:33:15, Andrew Morton wrote:

> On Mon, 28 Feb 2022 08:57:49 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > On Mon 21-02-22 11:17:49, cgel.zte@gmail.com wrote:
> > > From: Guo Ziliang <guo.ziliang@zte.com.cn>
> > >
> > > In our testing, a deadloop task was found. Through sysrq printing, the
> > > same stack was found every time, as follows:
> > >
> > >   __swap_duplicate+0x58/0x1a0
> > >   swapcache_prepare+0x24/0x30
> > >   __read_swap_cache_async+0xac/0x220
> > >   read_swap_cache_async+0x58/0xa0
> > >   swapin_readahead+0x24c/0x628
> > >   do_swap_page+0x374/0x8a0
> > >   __handle_mm_fault+0x598/0xd60
> > >   handle_mm_fault+0x114/0x200
> > >   do_page_fault+0x148/0x4d0
> > >   do_translation_fault+0xb0/0xd4
> > >   do_mem_abort+0x50/0xb0
> > >
> > > The reason for the deadloop is that swapcache_prepare() always returns
> > > -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> > > task cannot jump out of the loop. We suspect that the task that clears
> > > the SWAP_HAS_CACHE flag never gets a chance to run. We tried lowering
> > > the priority of the task stuck in the deadloop so that the task that
> > > clears the SWAP_HAS_CACHE flag could run; the system returned to
> > > normal after the priority was lowered.
> > >
> > > In our testing, multiple real-time tasks were bound to the same core,
> > > and the task in the deadloop was the highest-priority task on that
> > > core, so it could not be preempted.
> > >
> > > Although cond_resched() is used by __read_swap_cache_async(), it is an
> > > empty function on a preemptive kernel and so cannot serve to release
> > > the CPU. A high-priority task cannot release the CPU unless preempted
> > > by a higher-priority task, and when this task is already the
> > > highest-priority task on its core, no other task can be scheduled. So
> > > we think we should replace cond_resched() with
> > > schedule_timeout_uninterruptible(1): schedule_timeout_uninterruptible()
> > > calls set_current_state() first to set the task state and remove the
> > > task from the run queue, giving up the CPU and preventing the task
> > > from running in kernel mode for too long.
> >
> > I am sorry but I really do not see how this case is any different from
> > any other kernel code path being hogged by an RT task. We surely
> > shouldn't put sleeps into all the random paths which are doing
> > cond_resched() at the moment.
>
> But this cond_resched() is different from most. This one is attempting
> to yield the CPU so that this task can make progress. And cond_resched()
> simply isn't an appropriate way of doing that, because in this fairly
> common situation it's a no-op.

I might be really missing something, but I really do not see how this is
any different from the page allocator path, which only does cond_resched
as well (well, except for throttling, but that might just not trigger).
Or other paths which just do cond_resched while waiting for progress
somewhere else.

Not that I like this situation, but a !PREEMPT kernel with RT priority
tasks is rather limited and full of potential problems IMHO.
On Tue, 1 Mar 2022, Andrew Morton wrote:
>
> Using msleep() seems pretty pointless, so I plan to go ahead with the
> patch as-is, with a cc:stable. None of it is pretty, but it's better
> than what we have now, yes?

Yes, I've no objection to that.

Hugh
On Wed, 2 Mar 2022, Michal Hocko wrote:
>
> I might be really missing something, but I really do not see how this is
> any different from the page allocator path, which only does cond_resched
> as well (well, except for throttling, but that might just not trigger).
> Or other paths which just do cond_resched while waiting for progress
> somewhere else.
>
> Not that I like this situation, but a !PREEMPT kernel with RT priority
> tasks is rather limited and full of potential problems IMHO.

As I said in my previous mail, I have really not given this as much
thought this time as I did in the 2018 mail thread linked there; but I
have seen that it behaves more badly than I had imagined, in any
preemptive kernel - no need for RT. We just don't have the stats to show
when this code here spins waiting on code elsewhere that is sleeping.

I think the difference from most cond_resched() places is that swapin
is trying to collect together several factors with minimal locking, and
we should have added preempt_disable()s when preemption was invented.
But it's only swap, so we didn't notice.

Hugh
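To make the preempt_disable() remark concrete, here is one possible reading
of the idea - an illustrative fragment only, not Hugh's 4.9 patch from the
linked thread. The names (swapcache_prepare(), add_to_swap_cache(), entry,
page, gfp_mask, shadow) follow v5.17's __read_swap_cache_async(), and the
fragment assumes that context:

    /*
     * Illustrative sketch, NOT the 2018 patch: the task that wins
     * swapcache_prepare() would keep preemption disabled across the
     * short window before its page reaches the swap cache, so that
     * other tasks spinning on -EEXIST wait only briefly.
     */
    err = swapcache_prepare(entry);         /* we now own SWAP_HAS_CACHE */
    if (!err) {
            preempt_disable();              /* hypothetical: pin the winner */
            err = add_to_swap_cache(page, entry,
                                    gfp_mask & GFP_RECLAIM_MASK, &shadow);
            preempt_enable();
    }

The obvious catch is that add_to_swap_cache() can allocate, and hence
sleep, with these flags, which is not allowed with preemption disabled -
consistent with Hugh's report that the simple patch proposed in 2018 was
insufficient and a real version needs more restructuring.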
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its page to swap cache.
 		 */
-		cond_resched();
+		schedule_timeout_uninterruptible(1);
 	}
 
 	/*