vunmap() on large regions may trigger soft lockup warnings

Message ID 52A899AB.3010506@citrix.com (mailing list archive)
State Not Applicable, archived

Commit Message

David Vrabel Dec. 11, 2013, 4:58 p.m. UTC
Andrew,

Dietmar Hahn reported an issue where calling vunmap() on a large (50 GB)
region would trigger soft lockup warnings.

The following patch would resolve this (by adding a cond_resched() call
to vunmap_pmd_range()). Almost all calls of vunmap() and
unmap_kernel_range() are from process context (as far as I could tell);
the exception is an ACPI driver (drivers/acpi/apei/ghes.c) that calls
unmap_kernel_range_noflush() from interrupt and NMI contexts.

Can you advise on a preferred solution?

For example, an unmap_kernel_page() function (callable from atomic
context) could be provided since the GHES driver only maps/unmaps a
single page.
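
Roughly what I have in mind (an untested sketch -- no such helper exists
today; it assumes the page table pages are already populated, and the
TLB flush itself would still need care in NMI context):

	void unmap_kernel_page(unsigned long addr)
	{
		pgd_t *pgd = pgd_offset_k(addr);
		pud_t *pud = pud_offset(pgd, addr);
		pmd_t *pmd = pmd_offset(pud, addr);
		pte_t *pte = pte_offset_kernel(pmd, addr);

		/* No allocations and no sleeping anywhere. */
		pte_clear(&init_mm, addr, pte);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}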

8<-------------------------
mm/vmalloc: avoid soft lockup warnings when vunmap()'ing large ranges

From: David Vrabel <david.vrabel@citrix.com>

If vunmap() is used to unmap a large (e.g., 50 GB) region, it may take
sufficiently long that it triggers soft lockup warnings.

Add a cond_resched() into vunmap_pmd_range() so the calling task may
be rescheduled after unmapping each PMD entry.  This is how
zap_pmd_range() fixes the same problem for userspace mappings.

Reported-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 mm/vmalloc.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

Comments

Andrew Morton Dec. 11, 2013, 9:39 p.m. UTC | #1
On Wed, 11 Dec 2013 16:58:19 +0000 David Vrabel <david.vrabel@citrix.com> wrote:

> Andrew,
> 
> Dietmar Hahn reported an issue where calling vunmap() on a large (50 GB)
> region would trigger soft lockup warnings.
> 
> The following patch would resolve this (by adding a cond_resched() call
> to vunmap_pmd_range()). Almost all calls of vunmap() and
> unmap_kernel_range() are from process context (as far as I could tell);
> the exception is an ACPI driver (drivers/acpi/apei/ghes.c) that calls
> unmap_kernel_range_noflush() from interrupt and NMI contexts.
> 
> Can you advise on a preferred solution?
> 
> For example, an unmap_kernel_page() function (callable from atomic
> context) could be provided since the GHES driver only maps/unmaps a
> single page.
> 
> 8<-------------------------
> mm/vmalloc: avoid soft lockup warnings when vunmap()'ing large ranges
> 
> From: David Vrabel <david.vrabel@citrix.com>
> 
> If vunmap() is used to unmap a large (e.g., 50 GB) region, it may take
> sufficiently long that it triggers soft lockup warnings.
> 
> Add a cond_resched() into vunmap_pmd_range() so the calling task may
> be rescheduled after unmapping each PMD entry.  This is how
> zap_pmd_range() fixes the same problem for userspace mappings.
> 
> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -75,6 +75,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
>  		if (pmd_none_or_clear_bad(pmd))
>  			continue;
>  		vunmap_pte_range(pmd, addr, next);
> +		cond_resched();
>  	} while (pmd++, addr = next, addr != end);
>  }

Well that's ugly.

We could redo unmap_kernel_range() so it takes an `atomic' flag then
loops around unmapping N MB at a time, doing

	if (!atomic)
		cond_resched()

each time.  But that would require difficult tuning of N.
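
Something along these lines (illustrative only, untested; the 16MB
chunk here is exactly the N that would need tuning):

	void unmap_kernel_range(unsigned long addr, unsigned long size,
				bool atomic)
	{
		unsigned long start = addr, end = addr + size;
		unsigned long chunk = 16UL << 20;	/* N MB: needs tuning */

		flush_cache_vunmap(start, end);
		while (addr < end) {
			unsigned long next = min(addr + chunk, end);

			vunmap_page_range(addr, next);
			if (!atomic)
				cond_resched();
			addr = next;
		}
		flush_tlb_kernel_range(start, end);
	}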

I suppose we could just do

	if (!in_interrupt())
		cond_resched();

in vunmap_pmd_range(), but that's pretty specific to ghes.c and doesn't
permit unmap-inside-spinlock.

So I can't immediately think of a suitable fix apart from adding a new
unmap_kernel_range_atomic().  Then add a `bool atomic' arg to
vunmap_page_range() and pass that all the way down.
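
i.e. something like (sketch only; the third vunmap_page_range() argument
is the new one):

	void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
	{
		/*
		 * No flushes here -- as with unmap_kernel_range_noflush(),
		 * the caller handles TLB flushing in a context-safe way.
		 */
		vunmap_page_range(addr, addr + size, true);	/* atomic */
	}

with the existing paths passing `false' so vunmap_pmd_range() knows it
may cond_resched().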



David Vrabel Dec. 12, 2013, 12:50 p.m. UTC | #2
On 11/12/13 21:39, Andrew Morton wrote:
> On Wed, 11 Dec 2013 16:58:19 +0000 David Vrabel <david.vrabel@citrix.com> wrote:
> 
>> Andrew,
>>
>> Dietmar Hahn reported an issue where calling vunmap() on a large (50 GB)
>> region would trigger soft lockup warnings.
>>
>> The following patch would resolve this (by adding a cond_resched() call
>> to vunmap_pmd_range()). Almost all calls of vunmap() and
>> unmap_kernel_range() are from process context (as far as I could tell);
>> the exception is an ACPI driver (drivers/acpi/apei/ghes.c) that calls
>> unmap_kernel_range_noflush() from interrupt and NMI contexts.
>>
>> Can you advise on a preferred solution?
>>
>> For example, an unmap_kernel_page() function (callable from atomic
>> context) could be provided since the GHES driver only maps/unmaps a
>> single page.
>>
>> 8<-------------------------
>> mm/vmalloc: avoid soft lockup warnings when vunmap()'ing large ranges
>>
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> If vunmap() is used to unmap a large (e.g., 50 GB) region, it may take
>> sufficiently long that it triggers soft lockup warnings.
>>
>> Add a cond_resched() into vunmap_pmd_range() so the calling task may
>> be rescheduled after unmapping each PMD entry.  This is how
>> zap_pmd_range() fixes the same problem for userspace mappings.
>>
>> ...
>>
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -75,6 +75,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
>>  		if (pmd_none_or_clear_bad(pmd))
>>  			continue;
>>  		vunmap_pte_range(pmd, addr, next);
>> +		cond_resched();
>>  	} while (pmd++, addr = next, addr != end);
>>  }
> 
> Well that's ugly.
> 
> We could redo unmap_kernel_range() so it takes an `atomic' flag then
> loops around unmapping N MB at a time, doing
> 
> 	if (!atomic)
> 		cond_resched()
> 
> each time.  But that would require difficult tuning of N.
> 
> I suppose we could just do
> 
> 	if (!in_interrupt())
> 		cond_resched();
> 
> in vunmap_pmd_range(), but that's pretty specific to ghes.c and doesn't
> permit unmap-inside-spinlock.
> 
> So I can't immediately think of a suitable fix apart from adding a new
> unmap_kernel_range_atomic().  Then add a `bool atomic' arg to
> vunmap_page_range() and pass that all the way down.

That would work for the unmap, but looking at the GHES driver some more,
it looks like its call to ioremap_page_range() is already unsafe -- it
may need to allocate a new PTE page with a non-atomic alloc in
pte_alloc_one_kernel().

Perhaps what's needed here is a pair of ioremap_page_atomic() and
iounmap_page_atomic() calls?  With some prep function to ensure the PTE
pages (etc.) are preallocated.
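
Roughly this shape (nothing below exists; the names are invented for
illustration):

	/*
	 * Process context, at driver init: walk/allocate down to the PTE
	 * for vaddr so the atomic calls below never need to allocate.
	 */
	int ioremap_page_prealloc(unsigned long vaddr);

	/* Atomic/NMI-safe: only write or clear the preallocated PTE. */
	void __iomem *ioremap_page_atomic(unsigned long vaddr,
					  phys_addr_t paddr, pgprot_t prot);
	void iounmap_page_atomic(unsigned long vaddr);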

David
Andrew Morton Dec. 14, 2013, 8:32 a.m. UTC | #3
On Thu, 12 Dec 2013 12:50:47 +0000 David Vrabel <david.vrabel@citrix.com> wrote:

> > each time.  But that would require difficult tuning of N.
> > 
> > I suppose we could just do
> > 
> > 	if (!in_interrupt())
> > 		cond_resched();
> > 
> > in vunmap_pmd_range(), but that's pretty specific to ghes.c and doesn't
> > permit unmap-inside-spinlock.
> > 
> > So I can't immediately think of a suitable fix apart from adding a new
> > unmap_kernel_range_atomic().  Then add a `bool atomic' arg to
> > vunmap_page_range() and pass that all the way down.
> 
> That would work for the unmap, but looking at the GHES driver some more,
> it looks like its call to ioremap_page_range() is already unsafe -- it
> may need to allocate a new PTE page with a non-atomic alloc in
> pte_alloc_one_kernel().
> 
> Perhaps what's needed here is a pair of ioremap_page_atomic() and
> iounmap_page_atomic() calls?  With some prep function to ensure the PTE
> pages (etc.) are preallocated.

Is ghes.c the only problem source here?  If so then a suitable solution
would be to declare that driver hopelessly busted and proceed as if it
didn't exist :(

Just from a quick look, the thing is doing ioremap() from NMI context!
ioremap() has to do a bunch of memory allocations, take spinlocks, etc.

David Vrabel Dec. 16, 2013, 12:56 p.m. UTC | #4
On 14/12/13 08:32, Andrew Morton wrote:
> On Thu, 12 Dec 2013 12:50:47 +0000 David Vrabel <david.vrabel@citrix.com> wrote:
> 
>>> each time.  But that would require difficult tuning of N.
>>>
>>> I suppose we could just do
>>>
>>> 	if (!in_interrupt())
>>> 		cond_resched();
>>>
>>> in vunmap_pmd_range(), but that's pretty specific to ghes.c and doesn't
>>> permit unmap-inside-spinlock.
>>>
>>> So I can't immediately think of a suitable fix apart from adding a new
>>> unmap_kernel_range_atomic().  Then add a `bool atomic' arg to
>>> vunmap_page_range() and pass that all the way down.
>>
>> That would work for the unmap, but looking at the GHES driver some more,
>> it looks like its call to ioremap_page_range() is already unsafe -- it
>> may need to allocate a new PTE page with a non-atomic alloc in
>> pte_alloc_one_kernel().
>>
>> Perhaps what's needed here is a pair of ioremap_page_atomic() and
>> iounmap_page_atomic() calls?  With some prep function to ensure the PTE
>> pages (etc.) are preallocated.
> 
> Is ghes.c the only problem source here?  If so then a suitable solution
> would be to declare that driver hopelessly busted and proceed as if it
> didn't exist :(

All the other callers do so from non-atomic context.  ghes.c is the only
broken caller.

Shall I resend or are you happy to take the patch off the first email in
this thread?

David

Andrew Morton Dec. 16, 2013, 10:57 p.m. UTC | #5
On Mon, 16 Dec 2013 12:56:13 +0000 David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/12/13 08:32, Andrew Morton wrote:
> > On Thu, 12 Dec 2013 12:50:47 +0000 David Vrabel <david.vrabel@citrix.com> wrote:
> > 
> >>> each time.  But that would require difficult tuning of N.
> >>>
> >>> I suppose we could just do
> >>>
> >>> 	if (!in_interrupt())
> >>> 		cond_resched();
> >>>
> >>> in vunmap_pmd_range(), but that's pretty specific to ghes.c and doesn't
> >>> permit unmap-inside-spinlock.
> >>>
> >>> So I can't immediately think of a suitable fix apart from adding a new
> >>> unmap_kernel_range_atomic().  Then add a `bool atomic' arg to
> >>> vunmap_page_range() and pass that all the way down.
> >>
> >> That would work for the unmap, but looking at the GHES driver some more,
> >> it looks like its call to ioremap_page_range() is already unsafe -- it
> >> may need to allocate a new PTE page with a non-atomic alloc in
> >> pte_alloc_one_kernel().
> >>
> >> Perhaps what's needed here is a pair of ioremap_page_atomic() and
> >> iounmap_page_atomic() calls?  With some prep function to ensure the PTE
> >> pages (etc.) are preallocated.
> > 
> > Is ghes.c the only problem source here?  If so then a suitable solution
> > would be to declare that driver hopelessly busted and proceed as if it
> > didn't exist :(
> 
> All the other callers do so from non-atomic context.  ghes.c is the only
> broken caller.
> 
> Shall I resend or are you happy to take the patch off the first email in
> this thread?

Well first we should attempt to wake up the ghes maintainers and tell
them we're about to break their stuff. (Which I believe is already broken).

The fix won't be easy - presumably ghes will need to punt all its IRQ-
and NMI-context operations up into kernel thread context.
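
Something like this shape, perhaps (untested sketch; irq_work_queue()
is NMI-safe while schedule_work() is only hardirq-safe, hence the two
hops):

	static struct work_struct ghes_proc_work;	/* does ioremap() etc. */
	static struct irq_work ghes_nmi_work;

	static void ghes_nmi_work_fn(struct irq_work *w)
	{
		schedule_work(&ghes_proc_work);		/* safe from hardirq */
	}

	/* In the NMI handler: record the bare minimum and queue the rest. */
	irq_work_queue(&ghes_nmi_work);

(with INIT_WORK()/init_irq_work() done at probe time.)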


Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fdf968..b1b5b39 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -75,6 +75,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next);
+		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }