Message ID: 20210311021321.127500-1-mike.kravetz@oracle.com
State: New, archived
Series: hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
On Wed, 10 Mar 2021 18:13:21 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> put_page does not correctly handle all calling contexts for hugetlb
> pages. This was recently discussed in the threads [1] and [2].
>
> free_huge_page is the routine called for the final put_page of hugetlb
> pages. Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
>
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context. By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path. In
> addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time. Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context. That commit hands off free_huge_page processing to a
> workqueue if !in_task.
>
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
>
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations. The
> hugetlb_lock is of course dropped before potentially blocking.
>
> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing. This idea was briefly
> discussed here [3], but has some undesirable side effects.
>
> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account. However, this has not happened. The code today does have
> the ability to hand off requests to a workqueue. It does this for calls
> from irq context. Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held. This is needed because the code could block/sleep.
>
> Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> used to detect all atomic contexts where sleeping is not possible.
>
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
>
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -235,6 +235,7 @@ config HUGETLBFS
>
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
> +	select PREEMPT_COUNT

Well this is unfortunate. hugetlb is forcing PREEMPT_COUNT because we
screwed things up. Did we consider changing the networking code to call
a new free_huge_tlb_from_irq()? So the callee doesn't need to guess. Or
something else?

Is anyone looking into fixing this for real?
On Wed 10-03-21 21:43:16, Andrew Morton wrote:
> On Wed, 10 Mar 2021 18:13:21 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> > put_page does not correctly handle all calling contexts for hugetlb
> > pages. This was recently discussed in the threads [1] and [2].
[...]
> Well this is unfortunate. hugetlb is forcing PREEMPT_COUNT because we
> screwed things up.

Yes this is far from ideal but we have tried to explore other ways, all
looking much more complex. [1] shows that this is a problem already and
needs a reasonable fix to be backported for older kernels.

> Did we consider changing the networking code to call a new
> free_huge_tlb_from_irq()? So the callee doesn't need to guess.

I do not think we want to pollute networking or any other code that
simply wants to put_page with hugetlb specific knowledge.

> Or something else?
>
> Is anyone looking into fixing this for real?

Mike said he would be looking into making hugetlb_lock irq safe, but
that is a non trivial way to go and it would not be a great candidate
for backporting.

Btw. RCU already wants a reliable in_atomic as well and that
effectively means enabling PREEMPT_COUNT for everybody. The overhead of
a per-cpu preempt counter should be pretty much invisible AFAIK.
On Wed 10-03-21 18:13:21, Mike Kravetz wrote:
> put_page does not correctly handle all calling contexts for hugetlb
> pages. This was recently discussed in the threads [1] and [2].
[...]
> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing. This idea was briefly
> discussed here [3], but has some undesirable side effects.

s@undesirable side effects@undesirable user visible side effects@

> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account. However, this has not happened. The code today does have
> the ability to hand off requests to a workqueue. It does this for calls
> from irq context. Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held. This is needed because the code could block/sleep.
>
> Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> used to detect all atomic contexts where sleeping is not possible.
>
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

While not an ideal solution, I believe this is the most straightforward
one wrt backporting to the older kernels which are affected. I have a
hope that a preemption model independent in_atomic() is going to become
the norm. RCU is very much interested in the same thing as well. Now we
have two core kernel users requiring this, so hopefully this will make
the case stronger.

That being said,
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/Kconfig   |  1 +
>  mm/hugetlb.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 462253ae483a..403d7a7a619a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -235,6 +235,7 @@ config HUGETLBFS
>
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
> +	select PREEMPT_COUNT
>
>  config MEMFD_CREATE
>  	def_bool TMPFS || HUGETLBFS
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 33b0d8778551..5407e77ca803 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1437,9 +1437,9 @@ static void __free_huge_page(struct page *page)
>  }
>
>  /*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> + * If free_huge_page() is called from an atomic context, we have to defer
> + * the actual freeing in a workqueue. This is to prevent possible sleeping
> + * while in atomic and potential hugetlb_lock deadlock.
>   *
>   * free_hpage_workfn() locklessly retrieves the linked list of pages to
>   * be freed and frees them one-by-one. As the page->mapping pointer is
> @@ -1467,9 +1467,9 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
>  void free_huge_page(struct page *page)
>  {
>  	/*
> -	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> +	 * Defer freeing if in atomic context and sleeping is not allowed
>  	 */
> -	if (!in_task()) {
> +	if (in_atomic()) {
>  		/*
>  		 * Only call schedule_work() if hpage_freelist is previously
>  		 * empty. Otherwise, schedule_work() had been called but the
> --
> 2.29.2
On Thu 11-03-21 09:26:03, Michal Hocko wrote:
> On Wed 10-03-21 18:13:21, Mike Kravetz wrote:
[...]
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> While not an ideal solution, I believe this is the most straightforward
> one wrt backporting to the older kernels which are affected. I have a
> hope that a preemption model independent in_atomic() is going to become
> the norm. RCU is very much interested in the same thing as well. Now we
> have two core kernel users requiring this, so hopefully this will make
> the case stronger.
>
> That being said,
> Acked-by: Michal Hocko <mhocko@suse.com>

Btw. we very likely want Cc: stable
On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> from irq context. Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.

It does not. local_irq_disable() does not change preempt_count().
On Thu 11-03-21 09:46:30, Peter Zijlstra wrote:
> On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> > from irq context. Changing the check in the code from !in_task to
> > in_atomic would handle the situations when called with irqs disabled.
>
> It does not. local_irq_disable() does not change preempt_count().

You are right. Earlier I was suggesting to check for irq_disabled() as
well: http://lkml.kernel.org/r/YD4I+VPr3UNt063H@dhcp22.suse.cz

Back then it was not really clear to me that in fact we care about
spin locks more than irq disabled code. I am not even sure whether we
need to care about irq disabled regions without any locks held that
wouldn't be covered by in_atomic. But it would be safer to add an
irq_disabled check as well.
On Thu, Mar 11, 2021 at 10:01:22AM +0100, Michal Hocko wrote:
> On Thu 11-03-21 09:46:30, Peter Zijlstra wrote:
[...]
> You are right. Earlier I was suggesting to check for irq_disabled() as
> well: http://lkml.kernel.org/r/YD4I+VPr3UNt063H@dhcp22.suse.cz
>
> Back then it was not really clear to me that in fact we care about
> spin locks more than irq disabled code. I am not even sure whether we
> need to care about irq disabled regions without any locks held that
> wouldn't be covered by in_atomic. But it would be safer to add an
> irq_disabled check as well.

Safer still is always doing it; replace it with if (true).

What's the purpose of doing the minimal 'correct' over the maximal safe
solution?

The whole changelog reads like a trainwreck, but akpm already commented
on that. I picked out a small factual incorrectness, simply because if
you can't get that right, the whole argument loses weight.

That said, I don't think you actually need it; if, as you write, the
lock should be IRQ-safe, then you're worried about the IRQ recursion
deadlock:

	spin_lock(&A);
	<IRQ>
	  spin_lock(&A);

And that obviously cannot happen when IRQs are disabled, even if you
were holding a lock.

Anyway, I think I support Andrew here, this needs proper fixing, not
bandaids.
On Thu 11-03-21 10:32:24, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 10:01:22AM +0100, Michal Hocko wrote:
[...]
> Safer still is always doing it; replace it with if (true).
>
> What's the purpose of doing the minimal 'correct' over the maximal safe
> solution?

If we always defer to a WQ context then an admin wouldn't have any
feedback from the syscall when releasing the pool.

> The whole changelog reads like a trainwreck, but akpm already commented
> on that. I picked out a small factual incorrectness, simply because if
> you can't get that right, the whole argument loses weight.

Is there any reason why in_atomic || irq_disabled wouldn't work
universally?

> That said, I don't think you actually need it; if, as you write, the
> lock should be IRQ-safe, then you're worried about the IRQ recursion
> deadlock:

Making hugetlb_lock irqsafe is a long way off, as explained by Mike
elsewhere. Not only that. The upcoming hugetlb feature to have a sparse
vmemmap for hugetlb pages will need to allocate the vmemmap when a
hugetlb page is to be freed back to the allocator. That cannot happen
in any atomic context, so there will be a need to tell those contexts
apart for special casing.
On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> put_page does not correctly handle all calling contexts for hugetlb
> pages. This was recently discussed in the threads [1] and [2].
>
> free_huge_page is the routine called for the final put_page of hugetlb
> pages. Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
>
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context. By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path.

AFAICT there's no actual problem with making spool->lock IRQ-safe too.

> In addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time.

cgroups, always bloody cgroups. The scheduler (and a fair number of
other places) get to deal with cgroups with IRQs disabled, so I'm sure
this can too.

> Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context. That commit hands off free_huge_page processing to a
> workqueue if !in_task.
>
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
>
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations. The
> hugetlb_lock is of course dropped before potentially blocking.

AFAICT that's because of CMA, right? Only hstate_is_gigantic() and
free_gigantic_page() have that particular trainwreck. So you could move
the workqueue there, and leave all the other hugetlb sizes unaffected.

Afaict if you limit the workqueue crud to cma_clear_bitmap(), you don't
get your...

> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing. This idea was briefly
> discussed here [3], but has some undesirable side effects.

... user visible side effects either.

> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account. However, this has not happened. The code today does have
> the ability to hand off requests to a workqueue. It does this for calls
> from irq context. Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held. This is needed because the code could block/sleep.

I'll argue the current workqueue thing is in the wrong place to begin
with. So how about you make hugetlb_lock and spool->lock IRQ-safe, move
the workqueue thingy into cma_release(), and then worry about
optimizing the cgroup crap?

Correctness first, performance second.

Also, if you really care about performance, not using cgroups is a very
good option anyway.
On Thu, Mar 11, 2021 at 10:44:56AM +0100, Michal Hocko wrote:
> On Thu 11-03-21 10:32:24, Peter Zijlstra wrote:
> > The whole changelog reads like a trainwreck, but akpm already
> > commented on that. I picked out a small factual incorrectness, simply
> > because if you can't get that right, the whole argument loses weight.
>
> Is there any reason why in_atomic || irq_disabled wouldn't work
> universally?

I just explained to you how you really wanted:

	in_atomic() && !irq_disabled()

> > That said, I don't think you actually need it; if, as you write, the
> > lock should be IRQ-safe, then you're worried about the IRQ recursion
> > deadlock:
>
> Making hugetlb_lock irqsafe is a long way off, as explained by Mike
> elsewhere. Not only that. The upcoming hugetlb feature to have a sparse
> vmemmap for hugetlb pages will need to allocate the vmemmap when a
> hugetlb page is to be freed back to the allocator. That cannot happen
> in any atomic context, so there will be a need to tell those contexts
> apart for special casing.

Then scrap the vmemmap code *NOW*. Do not merge more shit before fixing
existing problems. Especially not if that's known to make it harder to
fix the problems.
On Thu 11-03-21 10:52:50, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 10:44:56AM +0100, Michal Hocko wrote:
> > Is there any reason why in_atomic || irq_disabled wouldn't work
> > universally?
>
> I just explained to you how you really wanted:
>
> 	in_atomic() && !irq_disabled()

Sorry for being dense but I do not follow. You have provided the
following example:

	spin_lock(&A);
	<IRQ>
	  spin_lock(&A);

If A == hugetlb_lock then we should never reenter with free_huge_page:

	if (in_atomic() || irq_disabled())
		schedule_in_wq();
	else
		free_directly();

because hugetlb_lock is never held in irq context other than from the
put_page (aka the above) path, which will explicitly defer the handling
and thus the lock to a different context. We need to check for
irq_disabled because of the sleeping paths in the freeing path. Or do I
miss something?

From the code simplicity POV (and hugetlb has grown a lot of
complexity), it would be really easiest to make sure __free_huge_page
is called from a non-atomic process context. There are a few ways to do
that:
- defer each call to a WQ - user visible which sucks
- defer from atomic or otherwise non-sleeping contexts - requires a
  reliable in_atomic AFAICS
- defer sleeping operations - makes the code flow more complex and it
  would be again user visible in some cases.

So I would say we are in a "pick your own poison" kind of situation.
On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:
> Sorry for being dense but I do not follow. You have provided the
> following example:
>
> 	spin_lock(&A);
> 	<IRQ>
> 	  spin_lock(&A);
>
> If A == hugetlb_lock then we should never reenter with free_huge_page

What I'm saying is that if irqs_disabled(), then that interrupt cannot
happen, so the second spin_lock cannot happen, so the deadlock cannot
happen.

So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
recursion deadlock.

Also, Linus hates constructs like this:

  https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

> From the code simplicity POV (and hugetlb has grown a lot of
> complexity), it would be really easiest to make sure __free_huge_page
> is called from a non-atomic process context. There are a few ways to do
> that:
> - defer each call to a WQ - user visible which sucks
> - defer from atomic or otherwise non-sleeping contexts - requires a
>   reliable in_atomic AFAICS
> - defer sleeping operations - makes the code flow more complex and it
>   would be again user visible in some cases.
>
> So I would say we are in a "pick your own poison" kind of situation.

Just to be clear:

NAK on this patch and any and all ductape crap. Fix it properly, make
hugetlb_lock and spool->lock IRQ-safe, move the workqueue into the CMA
thing.

The code really doesn't look _that_ complicated.
On Thu 11-03-21 12:36:51, Peter Zijlstra wrote:
> What I'm saying is that if irqs_disabled(), then that interrupt cannot
> happen, so the second spin_lock cannot happen, so the deadlock cannot
> happen.
>
> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
> recursion deadlock.

OK, then I understand your point now. I thought you were arguing an
actual deadlock scenario. As I've said, the irq_disabled check would be
needed for the sleeping operations that we already do.

> Just to be clear:
>
> NAK on this patch and any and all ductape crap. Fix it properly, make
> hugetlb_lock and spool->lock IRQ-safe, move the workqueue into the CMA
> thing.
>
> The code really doesn't look _that_ complicated.

Fair enough. As I've said, I am not a great fan of this patch either,
but it is a quick fix for a likely long term problem. If reworking the
hugetlb locking is preferable then be it.
On Thu 11-03-21 12:36:51, Peter Zijlstra wrote:
[...]
> Also, Linus hates constructs like this:
>
>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

Btw. I would really appreciate it if somebody would explain why it is
_fundamentally broken_ to check for an atomic context and choose a
different handling in a code path, like put_page, which is out of the
hands of the calling context. This can be called from a wide variety of
contexts. There is no way to pass context information to the called
function. I do recognize that this is not an act of beauty, but why
fundamentally broken? The put_page context can certainly work towards
robustness and operate on the most restrictive context grounds (I
really hope nobody will ever come up with an idea that put_page can be
called from nmi context). This can make the code more complex and less
optimal in the normal case (e.g. hugetlb is almost never freed from an
atomic context - one has to be really creative to achieve that). So
where do we draw a line?
On 3/11/21 4:02 AM, Michal Hocko wrote: > On Thu 11-03-21 12:36:51, Peter Zijlstra wrote: >> On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote: >> >>> Sorry for being dense but I do not follow. You have provided the >>> following example >>> spin_lock(&A); >>> <IRQ> >>> spin_lock(&A); >>> >>> if A == hugetlb_lock then we should never reenter with >>> free_huge_page >> >> What I'm saying is that if irq_disabled(), the that interrupt cannot >> happen, so the second spin_lock cannot happen, so the deadlock cannot >> happen. >> >> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ >> recursion deadlock. > > OK, then I understand your point now. I thought you were arguing > an actual deadlock scenario. As I've said irq_disabled check would be > needed for sleeping operations that we already do. > >> Also, Linus hates constructs like this: >> >> https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com >> >>> From the code simplicity POV (and hugetlb has grown a lot of complexity) >>> it would be really easiest to make sure __free_huge_page to be called >>> from a non-atomic process context. There are few ways to do that >>> - defer each call to a WQ - user visible which sucks >>> - defer from atomic or otherwise non-sleeping contextx - requires >>> reliable in_atomic AFAICS >>> - defer sleeping operations - makes the code flow more complex and it >>> would be again user visible in some cases. >>> >>> So I would say we are in "pick your own poison" kind of situation. >> >> Just to be clear: >> >> NAK on this patch and any and all ductape crap. Fix it properly, make >> hugetlb_lock, spool->lock IRQ-safe, move the workqueue into the CMA >> thing. >> >> The code really doesn't look _that_ complicated. > > Fair enough. As I've said I am not a great fan of this patch either > but it is a quick fix for a likely long term problem. If reworking the > hugetlb locking is preferable then be it. 
Thank you, Michal and Peter. This patch was mostly about starting a
discussion, as this topic came up in a couple of different places. I
included the 'train wreck' of how we got here just for a bit of history.
I'll start working on a proper fix.
On Thu, Mar 11, 2021 at 12:36:51PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:
>
> > Sorry for being dense but I do not follow. You have provided the
> > following example
> > 	spin_lock(&A);
> > 	<IRQ>
> > 	spin_lock(&A);
> >
> > if A == hugetlb_lock then we should never reenter with
> > free_huge_page
>
> What I'm saying is that if irqs_disabled(), then that interrupt cannot
> happen, so the second spin_lock cannot happen, so the deadlock cannot
> happen.
>
> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
> recursion deadlock.
>
> Also, Linus hates constructs like this:
>
>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

To be fair, later in that same thread Linus states that his main concern
is not core code, but rather driver code:

https://lore.kernel.org/lkml/CAHk-=wjsMycgMHJrCmeetR3r+K5bpSRtmVWfd8iaoQCYd_VYAg@mail.gmail.com/

Nevertheless, if the job can be done reasonably without checking the
preemption/interrupt state, why not? And Mike's patch is still useful
for people hitting this bug.

							Thanx, Paul

> > From the code simplicity POV (and hugetlb has grown a lot of complexity)
> > it would be really easiest to make sure __free_huge_page is called
> > from a non-atomic process context. There are a few ways to do that:
> > - defer each call to a WQ - user visible, which sucks
> > - defer from atomic or otherwise non-sleeping contexts - requires
> >   reliable in_atomic AFAICS
> > - defer sleeping operations - makes the code flow more complex and it
> >   would again be user visible in some cases.
> >
> > So I would say we are in a "pick your own poison" kind of situation.
>
> Just to be clear:
>
> NAK on this patch and any and all duct-tape crap. Fix it properly: make
> hugetlb_lock and spool->lock IRQ-safe, move the workqueue into the CMA
> thing.
>
> The code really doesn't look _that_ complicated.
diff --git a/fs/Kconfig b/fs/Kconfig
index 462253ae483a..403d7a7a619a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -235,6 +235,7 @@ config HUGETLBFS
 
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
+	select PREEMPT_COUNT
 
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 33b0d8778551..5407e77ca803 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1437,9 +1437,9 @@ static void __free_huge_page(struct page *page)
 }
 
 /*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
+ * If free_huge_page() is called from an atomic context, we have to defer
+ * the actual freeing in a workqueue.  This is to prevent possible sleeping
+ * while in atomic and potential hugetlb_lock deadlock.
  *
  * free_hpage_workfn() locklessly retrieves the linked list of pages to
  * be freed and frees them one-by-one. As the page->mapping pointer is
@@ -1467,9 +1467,9 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
 void free_huge_page(struct page *page)
 {
 	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+	 * Defer freeing if in atomic context and sleeping is not allowed
 	 */
-	if (!in_task()) {
+	if (in_atomic()) {
 		/*
 		 * Only call schedule_work() if hpage_freelist is previously
 		 * empty. Otherwise, schedule_work() had been called but the
put_page does not correctly handle all calling contexts for hugetlb
pages. This was recently discussed in the threads [1] and [2].

free_huge_page is the routine called for the final put_page of hugetlb
pages. Since at least the beginning of git history, free_huge_page has
acquired the hugetlb_lock to move the page to a free list and possibly
perform other processing. When this code was originally written, the
hugetlb_lock should have been made irq safe.

For many years, nobody noticed this situation until lockdep code caught
free_huge_page being called from irq context. By this time, another
lock (the hugetlb subpool lock) was also taken in the free_huge_page
path. In addition, hugetlb cgroup code had been added which could hold
hugetlb_lock for a considerable period of time. Because of this, commit
c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
context") was added to address the issue of free_huge_page being called
from irq context. That commit hands off free_huge_page processing to a
workqueue if !in_task.

The !in_task check handles the case of being called from irq context.
However, it does not take into account the case when called with irqs
disabled as in [1].

To complicate matters, functionality has been added to hugetlb such that
free_huge_page may block/sleep in certain situations. The hugetlb_lock
is of course dropped before potentially blocking.

One way to handle all calling contexts is to have free_huge_page always
send pages to the workqueue for processing. This idea was briefly
discussed here [3], but has some undesirable side effects.

Ideally, the hugetlb_lock should have been irq safe from the beginning
and any code added to the free_huge_page path should have taken this
into account. However, this has not happened. The code today does have
the ability to hand off requests to a workqueue. It does this for calls
from irq context.
Changing the check in the code from !in_task to in_atomic would handle
the situations when called with irqs disabled. However, it does not
handle the case when called with a spinlock held. This is needed
because the code could block/sleep.

Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
used to detect all atomic contexts where sleeping is not possible.

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
[3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/Kconfig   |  1 +
 mm/hugetlb.c | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)