Message ID | 20250302145555.3236789-2-ryan.roberts@arm.com (mailing list archive) |
---|---|
State | New |
Series | Fix lazy mmu mode |
On 02.03.25 15:55, Ryan Roberts wrote:
> The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode()
> is a bit of a mess (to put it politely). There are a number of issues
> related to nesting of lazy mmu regions and confusion over whether the
> task, when in a lazy mmu region, is preemptible or not. Fix all the
> issues relating to the core-mm. Follow up commits will fix the
> arch-specific implementations.
[snip]

All looking good to me!

Acked-by: David Hildenbrand <david@redhat.com>
On 03.03.25 09:49, David Hildenbrand wrote:
> On 02.03.25 15:55, Ryan Roberts wrote:
[snip]
> All looking good to me!
>
> Acked-by: David Hildenbrand <david@redhat.com>

... but I do wonder if the set_ptes change should be split from the pagemap change.
On 03/03/2025 08:52, David Hildenbrand wrote:
> On 03.03.25 09:49, David Hildenbrand wrote:
>> On 02.03.25 15:55, Ryan Roberts wrote:
[snip]
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> ... but I do wonder if the set_ptes change should be split from the pagemap change.

So set_ptes + docs changes in one patch, and pagemap change in another? I can do that.

I didn't actually cc stable on these, I'm wondering if I should do that? Perhaps for all patches except the pagemap change?

Thanks for the quick review!
On 03.03.25 11:22, Ryan Roberts wrote:
> On 03/03/2025 08:52, David Hildenbrand wrote:
[snip]
>> ... but I do wonder if the set_ptes change should be split from the pagemap change.
>
> So set_ptes + docs changes in one patch, and pagemap change in another? I can do that.

Yes.

> I didn't actually cc stable on these, I'm wondering if I should do that? Perhaps for all patches except the pagemap change?

That would make sense to me. CC stable likely doesn't hurt here. (although I wonder if anybody cares about stable on sparc :))
On 2025-03-03 11:30, David Hildenbrand wrote:
> On 03.03.25 11:22, Ryan Roberts wrote: [snip]
>> I didn't actually cc stable on these, I'm wondering if I should do that? Perhaps for all patches except the pagemap change?
>
> That would make sense to me. CC stable likely doesn't hurt here. (although I wonder if anybody cares about stable on sparc :))

Yes, stable is important for sparc just as much as for other architectures.

Cheers,
Andreas
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c17615e21a5d..b0f189815512 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2459,22 +2459,19 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	spinlock_t *ptl;
 	int ret;
 
-	arch_enter_lazy_mmu_mode();
-
 	ret = pagemap_scan_thp_entry(pmd, start, end, walk);
-	if (ret != -ENOENT) {
-		arch_leave_lazy_mmu_mode();
+	if (ret != -ENOENT)
 		return ret;
-	}
 
 	ret = 0;
 	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
 	if (!pte) {
-		arch_leave_lazy_mmu_mode();
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
 
+	arch_enter_lazy_mmu_mode();
+
 	if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
 		/* Fast path for performing exclusive WP */
 		for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
@@ -2543,8 +2540,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (flush_end)
 		flush_tlb_range(vma, start, addr);
 
-	pte_unmap_unlock(start_pte, ptl);
 	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(start_pte, ptl);
 
 	cond_resched();
 	return ret;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372..787c632ee2c9 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -222,10 +222,14 @@ static inline int pmd_dirty(pmd_t pmd)
  * hazard could result in the direct mode hypervisor case, since the actual
  * write to the page tables may not yet have taken place, so reads though
  * a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date. This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified. In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
+ * up to date.
+ *
+ * In the general case, no lock is guaranteed to be held between entry and exit
+ * of the lazy mode. So the implementation must assume preemption may be enabled
+ * and cpu migration is possible; it must take steps to be robust against this.
+ * (In practice, for user PTE updates, the appropriate page table lock(s) are
+ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
+ * and the mode cannot be used in interrupt context.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 #define arch_enter_lazy_mmu_mode()	do {} while (0)
@@ -287,7 +291,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
 
-	arch_enter_lazy_mmu_mode();
 	for (;;) {
 		set_pte(ptep, pte);
 		if (--nr == 0)
@@ -295,7 +298,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		ptep++;
 		pte = pte_next_pfn(pte);
 	}
-	arch_leave_lazy_mmu_mode();
 }
 #endif
 #define set_pte_at(mm, addr, ptep, pte)	set_ptes(mm, addr, ptep, pte, 1)
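The net effect of the task_mmu.c hunks, reduced to a skeleton (an illustration only, not a verbatim excerpt: the THP handling, the scan loop body and most error paths are elided), is that the lazy mmu region now sits strictly inside the pte map/lock pair:

	/* pagemap_scan_pmd_entry(), post-patch shape (sketch only) */
	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
	if (!pte) {
		walk->action = ACTION_AGAIN;
		return 0;
	}

	arch_enter_lazy_mmu_mode();

	/* ... walk and update the pte table; writes may be deferred ... */

	if (flush_end)
		flush_tlb_range(vma, start, addr);

	arch_leave_lazy_mmu_mode();	/* leave before dropping the PTL */
	pte_unmap_unlock(start_pte, ptl);

	cond_resched();
	return ret;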
The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() is a bit of a mess (to put it politely). There are a number of issues related to nesting of lazy mmu regions and confusion over whether the task, when in a lazy mmu region, is preemptible or not. Fix all the issues relating to the core-mm. Follow-up commits will fix the arch-specific implementations. 3 arches implement lazy mmu: powerpc, sparc and x86.

When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit 6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was expected that lazy mmu regions would never nest and that the appropriate page table lock(s) would be held while in the region, thus ensuring the region is non-preemptible. Additionally, lazy mmu regions were only used during manipulation of user mappings.

Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates") started invoking the lazy mmu mode in apply_to_pte_range(), which is used for both user and kernel mappings. For kernel mappings the region is no longer protected by any lock, so there is no longer any guarantee about non-preemptibility. Additionally, for RT configs, holding the PTL only implies no CPU migration; it doesn't prevent preemption.

Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added arch_[enter|leave]_lazy_mmu_mode() to the default implementation of set_ptes(), used by x86. So after this commit, lazy mmu regions can be nested. Additionally, commit 1a10a44dfc1d ("sparc64: implement the new page table range API") and commit 9fee28baa601 ("powerpc: implement the new page table range API") did the same for the sparc and powerpc set_ptes() overrides.

powerpc couldn't deal with preemption, so avoids it in commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu mode"), which explicitly disables preemption for the whole region in its implementation. x86 can support preemption (or at least it could until it tried to add support for nesting; more on this below). Sparc looks to be totally broken in the face of preemption, as far as I can tell.

powerpc can't deal with nesting, so avoids it in commit 47b8def9358c ("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"), which removes the lazy mmu calls from its implementation of set_ptes(). x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow nesting of same lazy mode") but, as far as I can tell, this breaks its support for preemption.

In short, it's all a mess; the semantics for arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result the implementations all have different expectations, sticking plasters and bugs.

arm64 is aiming to start using these hooks, so let's clean everything up before adding an arm64 implementation. Update the documentation to state that lazy mmu regions can never be nested, that the mode must not be used in interrupt context, and that preemption may or may not be enabled for the duration of the region.

Additionally, update the way arch_[enter|leave]_lazy_mmu_mode() is called in pagemap_scan_pmd_entry() to follow the normal pattern of holding the ptl for user space mappings. As a result the scope is reduced to only the pte table, but that's where most of the performance win is. While I believe there wasn't technically a bug here, the original scope made it easier to accidentally nest or, worse, accidentally call something like kmap() which would expect an immediate mode pte modification but it would end up deferred.
Arch-specific fixes to conform to the new spec will follow this one.

These issues were spotted by code review and I have no evidence of issues being reported in the wild.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 fs/proc/task_mmu.c      | 11 ++++-------
 include/linux/pgtable.h | 14 ++++++++------
 2 files changed, 12 insertions(+), 13 deletions(-)
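For illustration, a minimal sketch of a caller that conforms to the updated pgtable.h rules. The function name and the loop body below are hypothetical (not part of this patch); only the enter/leave placement reflects the documented contract: take the PTL for user mappings, enter and leave exactly once with no nesting, never use the mode from interrupt context, and don't rely on preemption being disabled.

	/* Hypothetical example only; not code from this series. */
	static void example_wrprotect_range(struct vm_area_struct *vma, pmd_t *pmd,
					    unsigned long addr, unsigned long end)
	{
		pte_t *start_pte, *pte;
		spinlock_t *ptl;

		start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
		if (!pte)
			return;

		arch_enter_lazy_mmu_mode();	/* no nested enter inside this region */

		for (; addr != end; pte++, addr += PAGE_SIZE) {
			pte_t old = ptep_get(pte);

			/* pte updates in here may be deferred by the arch */
			if (pte_present(old))
				ptep_set_wrprotect(vma->vm_mm, addr, pte);
		}

		arch_leave_lazy_mmu_mode();	/* flush any deferred pte updates */
		pte_unmap_unlock(start_pte, ptl);

		/* a real caller would also flush the TLB for the range as needed */
	}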