
[v3,1/5] mm: pagewalk: add the ability to install PTEs

Message ID 9be732fd0e897453116b433fe2f468ef7795602e.1729699916.git.lorenzo.stoakes@oracle.com (mailing list archive)
State: New
Series: implement lightweight guard pages

Commit Message

Lorenzo Stoakes Oct. 23, 2024, 4:24 p.m. UTC
The existing generic pagewalk logic permits the walking of page tables,
invoking callbacks at individual page table levels via user-provided
mm_walk_ops callbacks.

This is useful for traversing existing page table entries, but precludes
the ability to establish new ones.

Existing mechanisms for performing a walk which also installs page table
entries if necessary are heavily duplicated throughout the kernel, each
with semantic differences from one another and largely unavailable for use
elsewhere.

Rather than add yet another implementation, we extend the generic pagewalk
logic to enable the installation of page table entries by adding a new
install_pte() callback in mm_walk_ops. If this is specified, then upon
encountering a missing page table entry, we allocate and install a new one
and continue the traversal.
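
For illustration, a caller opts in simply by providing the new callback in its
mm_walk_ops. The following is a hypothetical sketch only: the names are made
up, and the guard PTE marker shown is introduced later in this series.

/* Hypothetical sketch: install a marker PTE into every hole. */
static int sketch_install_pte(unsigned long addr, unsigned long next,
			      pte_t *ptep, struct mm_walk *walk)
{
	/*
	 * The callback only provides the PTE value; the walker itself
	 * performs set_pte_at() and update_mmu_cache().
	 */
	*ptep = make_pte_marker(PTE_MARKER_GUARD);
	return 0;
}

static const struct mm_walk_ops sketch_install_ops = {
	.install_pte	= sketch_install_pte,
	.walk_lock	= PGWALK_RDLOCK,
};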

If a THP is encountered at either the PMD or PUD level, we split it only if
there is an ops->pte_entry() (or an ops->pmd_entry() at the PUD level);
otherwise, if there is only an ops->install_pte(), we avoid the unnecessary
split.

We do not support hugetlb at this stage.

If install_pte() returns an error, or an allocation fails during the
operation, we abort the operation altogether. It is up to the caller to
deal appropriately with partially populated page table ranges.

If install_pte() is defined, the semantics of pte_entry() change - this
callback is then only invoked if the entry already exists. This is a useful
property, as it allows a caller to handle existing PTEs while installing
new ones where necessary in the specified range.
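
As a hypothetical sketch of that combination (names are made up;
sketch_install_pte is the illustrative callback above):

/* With .install_pte set, called only for PTEs that already exist. */
static int sketch_pte_entry(pte_t *pte, unsigned long addr,
			    unsigned long next, struct mm_walk *walk)
{
	unsigned long *nr_present = walk->private;

	(*nr_present)++;	/* e.g. count already-populated entries */
	return 0;
}

static const struct mm_walk_ops sketch_mixed_ops = {
	.pte_entry	= sketch_pte_entry,
	.install_pte	= sketch_install_pte,
	.walk_lock	= PGWALK_RDLOCK,
};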

If install_pte() is not defined, then there is no functional difference to
this patch, so all existing logic will work precisely as it did before.

As we only permit the installation of PTEs where a mapping does not already
exist, there is no need for TLB management; however, we do invoke
update_mmu_cache() for architectures which require manual maintenance of
mappings for other CPUs.

We explicitly do not allow the existing page walk API to expose this
feature as it is dangerous and intended for internal mm use only. Therefore
we provide a new walk_page_range_mm() function exposed only to
mm/internal.h.
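
A minimal hypothetical caller from within mm might then look as follows (the
wrapper name is made up; only walk_page_range_mm() itself is provided by this
patch):

/* Hypothetical mm-internal helper using the sketched ops above. */
static int sketch_install_range(struct mm_struct *mm, unsigned long start,
				unsigned long end)
{
	unsigned long nr_present = 0;
	int err;

	mmap_read_lock(mm);
	err = walk_page_range_mm(mm, start, end, &sketch_mixed_ops,
				 &nr_present);
	mmap_read_unlock(mm);

	/*
	 * On error the range may be partially populated; the caller decides
	 * how to handle that.
	 */
	return err;
}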

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/pagewalk.h |  18 +++-
 mm/internal.h            |   6 ++
 mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
 3 files changed, 182 insertions(+), 69 deletions(-)

Comments

Andrew Morton Oct. 23, 2024, 11:04 p.m. UTC | #1
On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> 
> ...
>
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel,

How complicated is it to migrate those to use this?

> ...
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>

Is this decision compatible with the above migrations?
Lorenzo Stoakes Oct. 24, 2024, 7:34 a.m. UTC | #2
On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> >
> > ...
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel,
>
> How complicated is it to migrate those to use this?

I would say probably somewhat difficult as very often people are doing quite
custom things, but I will take a look at seeing if we can't make things a little
more generic.

I am also mildly motivated to look at trying to find a generic way to do
replaces...

Both on the TODO!

>
> > ...
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
>
> Is this decision compatible with the above migrations?

Yes, nobody but mm should be doing this. You have to be enormously careful about
locks, caches, accounting and the various other machinery around actually
setting a PTE, and this allows you to establish entirely new mappings, so could
very much be abused.
David Hildenbrand Oct. 24, 2024, 7:45 a.m. UTC | #3
On 24.10.24 09:34, Lorenzo Stoakes wrote:
> On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
>> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>>
>>>
>>> ...
>>>
>>> Existing mechanism for performing a walk which also installs page table
>>> entries if necessary are heavily duplicated throughout the kernel,
>>
>> How complicated is it to migrate those to use this?
> 
> I would say probably somewhat difficult as very often people are doing quite
> custom things, but I will take a look at seeing if we can't make things a little
> more generic.
> 
> I am also mildly motivated to look at trying to find a generic way to do
> replaces...
> 
> Both on the TODO!

I'm not super happy about extending the rusty old pagewalk API, because 
it's inefficient (indirect calls) and not future proof (batching, large 
folios).

But I see how we ended up with this patch, and it will be easy to 
convert to something better once we have it.

We already discussed in the past that we need a better and more 
efficient way to walk page tables. I have part of that on my TODO list, 
but I'm getting distracted.

*Inserting* (not walking/modifying existing things as most users do) as
done in this patch is slightly different though; likely "one thing that
fits all" will not apply to all page table walker use cases.
Lorenzo Stoakes Oct. 24, 2024, 8:07 a.m. UTC | #4
On Thu, Oct 24, 2024 at 09:45:37AM +0200, David Hildenbrand wrote:
> On 24.10.24 09:34, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> > > On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > Existing mechanism for performing a walk which also installs page table
> > > > entries if necessary are heavily duplicated throughout the kernel,
> > >
> > > How complicated is it to migrate those to use this?
> >
> > I would say probably somewhat difficult as very often people are doing quite
> > custom things, but I will take a look at seeing if we can't make things a little
> > more generic.
> >
> > I am also mildly motivated to look at trying to find a generic way to do
> > replaces...
> >
> > Both on the TODO!
>
> I'm not super happy about extending the rusty old pagewalk API, because it's
> inefficient (indirect calls) and not future proof (batching, large folios).

Yeah it could be improved, but I think the ideal way would be to genericise as
much as we can and 'upgrade' this logic.

>
> But I see how we ended up with this patch, and it will be easy to convert to
> something better once we have it.

Yes, I was quite happy with how small a delta this all ultimately ended up
being vs. the various other alternatives (changing zap logic, introducing a
custom page fault mechanism, duplicating page walk code _yet again_, porting
uffd logic, etc. etc.)

But in an ideal world we'd have _one_ place that does this.

>
> We already discussed in the past that we need a better and more efficient
> way to walk page tables. I have part of that on my TODO list, but I'm
> getting distracted.

Yes, I remember an LSF session on this; it's a really obvious area of improvement
that stands out at the moment for sure.

Having worked several 12+ hour days in a row recently, I can relate to
workload making this difficult though :)

>
> *Inserting* (not walking/modifying existing things as most users to) as done
> in this patch is slightly different though, likely "on thing that fits all"
> will not apply to all page table walker user cases.

Yeah, there are also replace scenarios which then have to do egregious amounts of
work to make sure we do everything right; in fact there are duplicates of this in
mm/madvise.c *grumble grumble*.

>
> --
> Cheers,
>
> David / dhildenb
>

OK so I guess I'll hold off my TODOs on this as you are looking in this area and
I trust you :)

Cheers!
Vlastimil Babka Oct. 25, 2024, 6:13 p.m. UTC | #5
On 10/23/24 18:24, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
> 
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
> 
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
> 
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
> 
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.
> 
> We do not support hugetlb at this stage.
> 
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
> 
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
> 
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
> 
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
> 
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
> 
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Just a small subjective suggestion in case you agree and there's a respin or
followups:

> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */

Much better now, thanks, but maybe the last 2 parts could be:

if (!ops->pte_entry) {
	if (!ops->install_pte)
		continue; /* Nothing to do. */
	else if (pmd_present(*pmd)
		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
		continue; /* Avoid unnecessary split. */
}

Or at least put !ops->pte_entry first in both conditions?
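
For concreteness, that ordering would look something like (sketch only):

if (!ops->pte_entry && !ops->install_pte)
	continue; /* Nothing to do. */
if (!ops->pte_entry && ops->install_pte &&
    pmd_present(*pmd) &&
    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
	continue; /* Avoid unnecessary split. */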

>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>  
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue;  /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */

Ditto.

Thanks!
David Hildenbrand Oct. 25, 2024, 7:08 p.m. UTC | #6
>>
>> We already discussed in the past that we need a better and more efficient
>> way to walk page tables. I have part of that on my TODO list, but I'm
>> getting distracted.
> 
> Yes I remember an LSF session on this, it's a really obvious area of improvement
> that stands out at the moment for sure.
> 
> Having worked several 12+ hour days in a row though recently I can relate to
> workload making this difficult though :)

Yes :)

> 
>>
>> *Inserting* (not walking/modifying existing things as most users to) as done
>> in this patch is slightly different though, likely "on thing that fits all"
>> will not apply to all page table walker user cases.
> 
> Yeah, there's also replace scenarios which then have to do egregious amounts of
> work to make sure we do everything right, in fact there's duplicates of this in
> mm/madvise.c *grumble grumble*.
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> OK so I guess I'll hold off my TODOs on this as you are looking in this area and
> I trust you :)

It will probably take me a while until I get to it, though. I'd focus on 
walking (and batching what we can) first, then on top modifying existing 
entries.

The "install something where there is nothing yet" (incl. populating 
fresh page tables etc.) case probably deserves a separate "walker".

If you end up having spare cycles and want to sync on a possible design 
for some part of that bigger task -- removing the old pagewalk logic -- 
please do reach out! :)
Lorenzo Stoakes Oct. 25, 2024, 9:58 p.m. UTC | #7
On Fri, Oct 25, 2024 at 08:13:26PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

>
> Just a small subjective suggestion in case you agree and there's a respin or
> followups:
>
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
>
> Much better now, thanks, but maybe the last 2 parts could be:
>
> if (!ops->pte_entry) {
> 	if (!ops->install_pte)
> 		continue; /* Nothing to do. */
> 	else if (pmd_present(*pmd)
> 		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> 		continue; /* Avoid unnecessary split. */
> }

I quite liked separating out the 'nothing to do' vs. the 'unnecessary split'
cases, but I agree it makes it harder to see that the 2nd case is an 'install
pte ONLY' case.

Yeah so I think your version is better, but maybe we can find a way to be more
expressive somehow... if we could declare vars mid-way through it'd be easier
:P

Will improve on respin if it comes up

>
> Or at least put !ops->pte_entry first in both conditions?

Ack yeah that'd be better!

>
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue;  /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
>
> Ditto.

Ack!

>
> Thanks!
>

Cheers!
Lorenzo Stoakes Oct. 26, 2024, 7:42 a.m. UTC | #8
On Fri, Oct 25, 2024 at 09:08:24PM +0200, David Hildenbrand wrote:
> > >
> > > We already discussed in the past that we need a better and more efficient
> > > way to walk page tables. I have part of that on my TODO list, but I'm
> > > getting distracted.
> >
> > Yes I remember an LSF session on this, it's a really obvious area of improvement
> > that stands out at the moment for sure.
> >
> > Having worked several 12+ hour days in a row though recently I can relate to
> > workload making this difficult though :)
>
> Yes :)
>
> >
> > >
> > > *Inserting* (not walking/modifying existing things as most users to) as done
> > > in this patch is slightly different though, likely "on thing that fits all"
> > > will not apply to all page table walker user cases.
> >
> > Yeah, there's also replace scenarios which then have to do egregious amounts of
> > work to make sure we do everything right, in fact there's duplicates of this in
> > mm/madvise.c *grumble grumble*.
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > OK so I guess I'll hold off my TODOs on this as you are looking in this area and
> > I trust you :)
>
> It will probably take me a while until I get to it, though. I'd focus on
> walking (and batching what we can) first, then on top modifying existing
> entries.

Yeah, entirely understandable, and that's the right order I think; the
modifying path is the trickier one, especially with all the special cases
all over the place...

>
> The "install something where there is nothing yet" (incl. populating fresh
> page tables etc.) case probably deserves a separate "walker".

Yes this would avoid all the heavy handling a replace handler needs.

>
> If you end up having spare cycles and want to sync on a possible design for
> some part of that bigger task -- removing the old pagewalk logic -- please
> do reach out! :)

Thanks, I may have a play when I have a brief moment, as I feel quite
strongly we need to attack this (as do you, I feel! :) - will send an RFC or
some thoughts or whatever should I do so!

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers!
Jarkko Sakkinen Oct. 28, 2024, 8:29 p.m. UTC | #9
On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
>
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
>
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
>
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
>
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.

Just for interest: does this mean that the operation is always
"destructive" (i.e. modifying state) even when install_pte() does not
do anything, i.e. does the split always happen despite what the callback
does? Not really an expert on this but this paragraph won't leave me
alone :-)

>
> We do not support hugetlb at this stage.
>
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
>
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
>
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
>
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/pagewalk.h |  18 +++-
>  mm/internal.h            |   6 ++
>  mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
>  3 files changed, 182 insertions(+), 69 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index f5eb5a32aeed..9700a29f8afb 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -25,12 +25,15 @@ enum page_walk_lock {
>   *			this handler is required to be able to handle
>   *			pmd_trans_huge() pmds.  They may simply choose to
>   *			split_huge_page() instead of handling it explicitly.
> - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> - *			including empty ones
> + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> + *			including empty ones, except if @install_pte is set.
> + *			If @install_pte is set, @pte_entry is called only for
> + *			existing PTEs.
>   * @pte_hole:		if set, called for each hole at all levels,
>   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
>   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> - *			are skipped.
> + *			are skipped. If @install_pte is specified, this will
> + *			not trigger for any populated ranges.
>   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
>   *			function is called with the vma lock held, in order to
>   *			protect against a concurrent freeing of the pte_t* or
> @@ -51,6 +54,13 @@ enum page_walk_lock {
>   * @pre_vma:            if set, called before starting walk on a non-null vma.
>   * @post_vma:           if set, called after a walk on a non-null vma, provided
>   *                      that @pre_vma and the vma walk succeeded.
> + * @install_pte:        if set, missing page table entries are installed and
> + *                      thus all levels are always walked in the specified
> + *                      range. This callback is then invoked at the PTE level
> + *                      (having split any THP pages prior), providing the PTE to
> + *                      install. If allocations fail, the walk is aborted. This
> + *                      operation is only available for userland memory. Not
> + *                      usable for hugetlb ranges.

Given that walk_page_range_novma() especially has a bunch of call sites,
it would not hurt to simply mention here that this is only for mm-internal
use, without much other explanation.

>   *
>   * p?d_entry callbacks are called even if those levels are folded on a
>   * particular architecture/configuration.
> @@ -76,6 +86,8 @@ struct mm_walk_ops {
>  	int (*pre_vma)(unsigned long start, unsigned long end,
>  		       struct mm_walk *walk);
>  	void (*post_vma)(struct mm_walk *walk);
> +	int (*install_pte)(unsigned long addr, unsigned long next,
> +			   pte_t *ptep, struct mm_walk *walk);
>  	enum page_walk_lock walk_lock;
>  };
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 508f7802dd2b..fb1fb0c984e4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/pagemap.h>
> +#include <linux/pagewalk.h>
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
>  }
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
>  
> +/* pagewalk.c */
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private);
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 5f9f01532e67..f3cbad384344 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -3,9 +3,14 @@
>  #include <linux/highmem.h>
>  #include <linux/sched.h>
>  #include <linux/hugetlb.h>
> +#include <linux/mmu_context.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> +#include <asm/tlbflush.h>
> +
> +#include "internal.h"
> +
>  /*
>   * We want to know the real level where a entry is located ignoring any
>   * folding of levels which may be happening. For example if p4d is folded then
> @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>  	int err = 0;
>  
>  	for (;;) {
> -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> -		if (err)
> -		       break;
> +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> +			pte_t new_pte;
> +
> +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> +					       walk);
> +			if (err)
> +				break;
> +
> +			set_pte_at(walk->mm, addr, pte, new_pte);
> +			/* Non-present before, so for arches that need it. */
> +			if (!WARN_ON_ONCE(walk->no_vma))
> +				update_mmu_cache(walk->vma, addr, pte);
> +		} else {
> +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> +			if (err)
> +				break;
> +		}
>  		if (addr >= end - PAGE_SIZE)
>  			break;
>  		addr += PAGE_SIZE;
> @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  again:
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none(*pmd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pte_alloc(walk->mm, pmd);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */
>  
>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>  
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue;  /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */
>  
>  		if (walk->vma)
>  			split_huge_pud(walk->vma, pud, addr);
> +		else if (pud_leaf(*pud) || !pud_present(*pud))
> +			continue; /* Nothing to do. */
> +
>  		if (pud_none(*pud))
>  			goto again;
>  
> @@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	do {
>  		next = p4d_addr_end(addr, end);
>  		if (p4d_none_or_clear_bad(p4d)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pud_alloc(walk->mm, p4d, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->p4d_entry) {
>  			err = ops->p4d_entry(p4d, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> +		    ops->install_pte)
>  			err = walk_pud_range(p4d, addr, next, walk);
>  		if (err)
>  			break;
> @@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __p4d_alloc(walk->mm, pgd, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, 0, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->pgd_entry) {
>  			err = ops->pgd_entry(pgd, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> +		    ops->pte_entry || ops->install_pte)
>  			err = walk_p4d_range(pgd, addr, next, walk);
>  		if (err)
>  			break;
> @@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  	int err = 0;
>  	struct vm_area_struct *vma = walk->vma;
>  	const struct mm_walk_ops *ops = walk->ops;
> +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> +
> +	/* We do not support hugetlb PTE installation. */
> +	if (ops->install_pte && is_hugetlb)
> +		return -EINVAL;
>  
>  	if (ops->pre_vma) {
>  		err = ops->pre_vma(start, end, walk);
> @@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  			return err;
>  	}
>  
> -	if (is_vm_hugetlb_page(vma)) {
> +	if (is_hugetlb) {
>  		if (ops->hugetlb_entry)
>  			err = walk_hugetlb_range(start, end, walk);
>  	} else
> @@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>  #endif
>  }
>  
> -/**
> - * walk_page_range - walk page table with caller specific callbacks
> - * @mm:		mm_struct representing the target process of page table walk
> - * @start:	start address of the virtual address range
> - * @end:	end address of the virtual address range
> - * @ops:	operation to call during the walk
> - * @private:	private data for callbacks' usage
> - *
> - * Recursively walk the page table tree of the process represented by @mm
> - * within the virtual address range [@start, @end). During walking, we can do
> - * some caller-specific works for each entry, by setting up pmd_entry(),
> - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> - * callbacks, the associated entries/pages are just ignored.
> - * The return values of these callbacks are commonly defined like below:
> - *
> - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> - *         end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - *         with caller specific value.
> - *  - <0 : failed to handle the current entry, and return to the caller
> - *         with error code.
> - *
> - * Before starting to walk page table, some callers want to check whether
> - * they really want to walk over the current vma, typically by checking
> - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> - * purpose.
> - *
> - * If operations need to be staged before and committed after a vma is walked,
> - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> - * since it is intended to handle commit-type operations, can't return any
> - * errors.
> - *
> - * struct mm_walk keeps current values of some common data like vma and pmd,
> - * which are useful for the access from callbacks. If you want to pass some
> - * caller-specific data to callbacks, @private should be helpful.
> +/*
> + * See the comment for walk_page_range(), this performs the heavy lifting of the
> + * operation, only sets no restrictions on how the walk proceeds.
>   *
> - * Locking:
> - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> - *   because these function traverse vma list and/or access to vma's data.
> + * We usually restrict the ability to install PTEs, but this functionality is
> + * available to internal memory management code and provided in mm/internal.h.
>   */
> -int walk_page_range(struct mm_struct *mm, unsigned long start,
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)
>  {
> @@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>  	return err;
>  }
>  
> +/*
> + * Determine if the walk operations specified are permitted to be used for a
> + * page table walk.
> + *
> + * This check is performed on all functions which are parameterised by walk
> + * operations and exposed in include/linux/pagewalk.h.
> + *
> + * Internal memory management code can use the walk_page_range_mm() function to
> + * be able to use all page walking operations.
> + */
> +static bool check_ops_valid(const struct mm_walk_ops *ops)
> +{
> +	/*
> +	 * The installation of PTEs is solely under the control of memory
> +	 * management logic and subject to many subtle locking, security and
> +	 * cache considerations so we cannot permit other users to do so, and
> +	 * certainly not for exported symbols.
> +	 */
> +	if (ops->install_pte)
> +		return false;
> +
> +	return true;

or "return !!(ops->install_pte);"

> +}

Alternatively one could consider defining "struct mm_walk_internal_ops",
which would be only available in internal.h, but I guess there are good
reasons to do it the way it is.

> +
> +/**
> + * walk_page_range - walk page table with caller specific callbacks
> + * @mm:		mm_struct representing the target process of page table walk
> + * @start:	start address of the virtual address range
> + * @end:	end address of the virtual address range
> + * @ops:	operation to call during the walk
> + * @private:	private data for callbacks' usage
> + *
> + * Recursively walk the page table tree of the process represented by @mm
> + * within the virtual address range [@start, @end). During walking, we can do
> + * some caller-specific works for each entry, by setting up pmd_entry(),
> + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> + * callbacks, the associated entries/pages are just ignored.
> + * The return values of these callbacks are commonly defined like below:
> + *
> + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> + *         end address yet, continue to walk.
> + *  - >0 : succeeded to handle the current entry, and return to the caller
> + *         with caller specific value.
> + *  - <0 : failed to handle the current entry, and return to the caller
> + *         with error code.
> + *
> + * Before starting to walk page table, some callers want to check whether
> + * they really want to walk over the current vma, typically by checking
> + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> + * purpose.
> + *
> + * If operations need to be staged before and committed after a vma is walked,
> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> + * since it is intended to handle commit-type operations, can't return any
> + * errors.
> + *
> + * struct mm_walk keeps current values of some common data like vma and pmd,
> + * which are useful for the access from callbacks. If you want to pass some
> + * caller-specific data to callbacks, @private should be helpful.
> + *
> + * Locking:
> + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> + *   because these function traverse vma list and/or access to vma's data.
> + */
> +int walk_page_range(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private)
> +{
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
> +
> +	return walk_page_range_mm(mm, start, end, ops, private);
> +}
> +
>  /**
>   * walk_page_range_novma - walk a range of pagetables not backed by a vma
>   * @mm:		mm_struct representing the target process of page table walk
> @@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * walking the kernel pages tables or page tables for firmware.
>   *
>   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> - * take other effective approache (mmap lock may be insufficient) to prevent
> + * take other effective approaches (mmap lock may be insufficient) to prevent
>   * the intermediate kernel page tables belonging to the specified address range
>   * from being freed (e.g. memory hot-remove).
>   */
> @@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	/*
>  	 * 1) For walking the user virtual address space:
> @@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>  		return -EINVAL;
>  	if (start < vma->vm_start || end > vma->vm_end)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>  
>  	if (!walk.mm)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>  	unsigned long start_addr, end_addr;
>  	int err = 0;
>  
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
> +
>  	lockdep_assert_held(&mapping->i_mmap_rwsem);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
>  				  first_index + nr - 1) {

So I took a random patch set in order to learn how to compile Fedora Ark
kernel out of any upstream tree (mm in this case), thus making noise
here.

With this goal, which is mainly to be able to do such a thing once or twice
per release cycle:

Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko
Lorenzo Stoakes Oct. 28, 2024, 9:49 p.m. UTC | #10
Note there's a v4 :)

On Mon, Oct 28, 2024 at 10:29:44PM +0200, Jarkko Sakkinen wrote:
> On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
>
> Just for interest: does this mean that the operation is always
> "destructive" (i.e. modifying state) even when install_pte() does not
> do anything, i.e. does the split alway happen despite what the callback
> does? Not really expert on this but this paragraph won't leave me
> alone :-)

Well, no: as I said (perhaps not clearly), if something already exists we don't
even split huge pages. We only do so if you explicitly ask to examine page
table levels below those where a huge page may exist.

The guard page code goes to great lengths to avoid this in all cases and
doesn't split at all.

>
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/pagewalk.h |  18 +++-
> >  mm/internal.h            |   6 ++
> >  mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
> >  3 files changed, 182 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > index f5eb5a32aeed..9700a29f8afb 100644
> > --- a/include/linux/pagewalk.h
> > +++ b/include/linux/pagewalk.h
> > @@ -25,12 +25,15 @@ enum page_walk_lock {
> >   *			this handler is required to be able to handle
> >   *			pmd_trans_huge() pmds.  They may simply choose to
> >   *			split_huge_page() instead of handling it explicitly.
> > - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> > - *			including empty ones
> > + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> > + *			including empty ones, except if @install_pte is set.
> > + *			If @install_pte is set, @pte_entry is called only for
> > + *			existing PTEs.
> >   * @pte_hole:		if set, called for each hole at all levels,
> >   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
> >   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> > - *			are skipped.
> > + *			are skipped. If @install_pte is specified, this will
> > + *			not trigger for any populated ranges.
> >   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
> >   *			function is called with the vma lock held, in order to
> >   *			protect against a concurrent freeing of the pte_t* or
> > @@ -51,6 +54,13 @@ enum page_walk_lock {
> >   * @pre_vma:            if set, called before starting walk on a non-null vma.
> >   * @post_vma:           if set, called after a walk on a non-null vma, provided
> >   *                      that @pre_vma and the vma walk succeeded.
> > + * @install_pte:        if set, missing page table entries are installed and
> > + *                      thus all levels are always walked in the specified
> > + *                      range. This callback is then invoked at the PTE level
> > + *                      (having split any THP pages prior), providing the PTE to
> > + *                      install. If allocations fail, the walk is aborted. This
> > + *                      operation is only available for userland memory. Not
> > + *                      usable for hugetlb ranges.
>
> Given that especially walk_page_range_novma() has bunch of call sites,
> it would not hurt to mention here simply that only for mm-internal use
> with not much other explanation.

We explicitly document this in multiple places. A user will very quickly
discover this is not available.

I will adjust this blurb if I need to do a respin.

>
> >   *
> >   * p?d_entry callbacks are called even if those levels are folded on a
> >   * particular architecture/configuration.
> > @@ -76,6 +86,8 @@ struct mm_walk_ops {
> >  	int (*pre_vma)(unsigned long start, unsigned long end,
> >  		       struct mm_walk *walk);
> >  	void (*post_vma)(struct mm_walk *walk);
> > +	int (*install_pte)(unsigned long addr, unsigned long next,
> > +			   pte_t *ptep, struct mm_walk *walk);
> >  	enum page_walk_lock walk_lock;
> >  };
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 508f7802dd2b..fb1fb0c984e4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> >  #include <linux/pagemap.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> > @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
> >  }
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> >
> > +/* pagewalk.c */
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private);
> > +
> >  #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 5f9f01532e67..f3cbad384344 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -3,9 +3,14 @@
> >  #include <linux/highmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> >  /*
> >   * We want to know the real level where a entry is located ignoring any
> >   * folding of levels which may be happening. For example if p4d is folded then
> > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >  	int err = 0;
> >
> >  	for (;;) {
> > -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > -		if (err)
> > -		       break;
> > +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> > +			pte_t new_pte;
> > +
> > +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> > +					       walk);
> > +			if (err)
> > +				break;
> > +
> > +			set_pte_at(walk->mm, addr, pte, new_pte);
> > +			/* Non-present before, so for arches that need it. */
> > +			if (!WARN_ON_ONCE(walk->no_vma))
> > +				update_mmu_cache(walk->vma, addr, pte);
> > +		} else {
> > +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > +			if (err)
> > +				break;
> > +		}
> >  		if (addr >= end - PAGE_SIZE)
> >  			break;
> >  		addr += PAGE_SIZE;
> > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pmd_addr_end(addr, end);
> >  		if (pmd_none(*pmd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pte_alloc(walk->mm, pmd);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue;  /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pud(walk->vma, pud, addr);
> > +		else if (pud_leaf(*pud) || !pud_present(*pud))
> > +			continue; /* Nothing to do. */
> > +
> >  		if (pud_none(*pud))
> >  			goto again;
> >
> > @@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> >  	do {
> >  		next = p4d_addr_end(addr, end);
> >  		if (p4d_none_or_clear_bad(p4d)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pud_alloc(walk->mm, p4d, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->p4d_entry) {
> >  			err = ops->p4d_entry(p4d, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> > +		    ops->install_pte)
> >  			err = walk_pud_range(p4d, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
> >  	do {
> >  		next = pgd_addr_end(addr, end);
> >  		if (pgd_none_or_clear_bad(pgd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __p4d_alloc(walk->mm, pgd, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, 0, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->pgd_entry) {
> >  			err = ops->pgd_entry(pgd, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> > +		    ops->pte_entry || ops->install_pte)
> >  			err = walk_p4d_range(pgd, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  	int err = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	const struct mm_walk_ops *ops = walk->ops;
> > +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> > +
> > +	/* We do not support hugetlb PTE installation. */
> > +	if (ops->install_pte && is_hugetlb)
> > +		return -EINVAL;
> >
> >  	if (ops->pre_vma) {
> >  		err = ops->pre_vma(start, end, walk);
> > @@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  			return err;
> >  	}
> >
> > -	if (is_vm_hugetlb_page(vma)) {
> > +	if (is_hugetlb) {
> >  		if (ops->hugetlb_entry)
> >  			err = walk_hugetlb_range(start, end, walk);
> >  	} else
> > @@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> >  #endif
> >  }
> >
> > -/**
> > - * walk_page_range - walk page table with caller specific callbacks
> > - * @mm:		mm_struct representing the target process of page table walk
> > - * @start:	start address of the virtual address range
> > - * @end:	end address of the virtual address range
> > - * @ops:	operation to call during the walk
> > - * @private:	private data for callbacks' usage
> > - *
> > - * Recursively walk the page table tree of the process represented by @mm
> > - * within the virtual address range [@start, @end). During walking, we can do
> > - * some caller-specific works for each entry, by setting up pmd_entry(),
> > - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > - * callbacks, the associated entries/pages are just ignored.
> > - * The return values of these callbacks are commonly defined like below:
> > - *
> > - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > - *         end address yet, continue to walk.
> > - *  - >0 : succeeded to handle the current entry, and return to the caller
> > - *         with caller specific value.
> > - *  - <0 : failed to handle the current entry, and return to the caller
> > - *         with error code.
> > - *
> > - * Before starting to walk page table, some callers want to check whether
> > - * they really want to walk over the current vma, typically by checking
> > - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > - * purpose.
> > - *
> > - * If operations need to be staged before and committed after a vma is walked,
> > - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > - * since it is intended to handle commit-type operations, can't return any
> > - * errors.
> > - *
> > - * struct mm_walk keeps current values of some common data like vma and pmd,
> > - * which are useful for the access from callbacks. If you want to pass some
> > - * caller-specific data to callbacks, @private should be helpful.
> > +/*
> > + * See the comment for walk_page_range(); this performs the heavy lifting of
> > + * the operation, but sets no restrictions on how the walk proceeds.
> >   *
> > - * Locking:
> > - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > - *   because these function traverse vma list and/or access to vma's data.
> > + * We usually restrict the ability to install PTEs, but this functionality is
> > + * available to internal memory management code and provided in mm/internal.h.
> >   */
> > -int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> >  		unsigned long end, const struct mm_walk_ops *ops,
> >  		void *private)
> >  {
> > @@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >  	return err;
> >  }
> >
> > +/*
> > + * Determine if the walk operations specified are permitted to be used for a
> > + * page table walk.
> > + *
> > + * This check is performed on all functions which are parameterised by walk
> > + * operations and exposed in include/linux/pagewalk.h.
> > + *
> > + * Internal memory management code can use the walk_page_range_mm() function
> > + * to gain access to all page walking operations.
> > + */
> > +static bool check_ops_valid(const struct mm_walk_ops *ops)
> > +{
> > +	/*
> > +	 * The installation of PTEs is solely under the control of memory
> > +	 * management logic and subject to many subtle locking, security and
> > +	 * cache considerations so we cannot permit other users to do so, and
> > +	 * certainly not for exported symbols.
> > +	 */
> > +	if (ops->install_pte)
> > +		return false;
> > +
> > +	return true;
>
> or "return !!(ops->install_pte);"
>
> > +}
>
> Alternatively one could consider defining "struct mm_walk_internal_ops",
> which would only be available in internal.h, but I guess there are good
> reasons to do it the way it is.

Yes.
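
For reference, the alternative mentioned above would look roughly like the
following (hypothetical sketch only, not something this series implements):

struct mm_walk_internal_ops {
	struct mm_walk_ops ops;		/* the exported callbacks */
	int (*install_pte)(unsigned long addr, unsigned long next,
			   pte_t *ptep, struct mm_walk *walk);
};

The series instead keeps a single mm_walk_ops and has check_ops_valid()
reject install_pte in the exported entry points, which avoids threading a
second ops type through all of the walker plumbing.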

>
> > +
> > +/**
> > + * walk_page_range - walk page table with caller specific callbacks
> > + * @mm:		mm_struct representing the target process of page table walk
> > + * @start:	start address of the virtual address range
> > + * @end:	end address of the virtual address range
> > + * @ops:	operation to call during the walk
> > + * @private:	private data for callbacks' usage
> > + *
> > + * Recursively walk the page table tree of the process represented by @mm
> > + * within the virtual address range [@start, @end). During walking, we can do
> > + * some caller-specific works for each entry, by setting up pmd_entry(),
> > + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > + * callbacks, the associated entries/pages are just ignored.
> > + * The return values of these callbacks are commonly defined like below:
> > + *
> > + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > + *         end address yet, continue to walk.
> > + *  - >0 : succeeded to handle the current entry, and return to the caller
> > + *         with caller specific value.
> > + *  - <0 : failed to handle the current entry, and return to the caller
> > + *         with error code.
> > + *
> > + * Before starting to walk page table, some callers want to check whether
> > + * they really want to walk over the current vma, typically by checking
> > + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > + * purpose.
> > + *
> > + * If operations need to be staged before and committed after a vma is walked,
> > + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > + * since it is intended to handle commit-type operations, can't return any
> > + * errors.
> > + *
> > + * struct mm_walk keeps current values of some common data like vma and pmd,
> > + * which are useful for the access from callbacks. If you want to pass some
> > + * caller-specific data to callbacks, @private should be helpful.
> > + *
> > + * Locking:
> > + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > + *   because these function traverse vma list and/or access to vma's data.
> > + */
> > +int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private)
> > +{
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> > +	return walk_page_range_mm(mm, start, end, ops, private);
> > +}
> > +
> >  /**
> >   * walk_page_range_novma - walk a range of pagetables not backed by a vma
> >   * @mm:		mm_struct representing the target process of page table walk
> > @@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >   * walking the kernel pages tables or page tables for firmware.
> >   *
> >   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> > - * take other effective approache (mmap lock may be insufficient) to prevent
> > + * take other effective approaches (mmap lock may be insufficient) to prevent
> >   * the intermediate kernel page tables belonging to the specified address range
> >   * from being freed (e.g. memory hot-remove).
> >   */
> > @@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> >
> >  	if (start >= end || !walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	/*
> >  	 * 1) For walking the user virtual address space:
> > @@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> >  		return -EINVAL;
> >  	if (start < vma->vm_start || end > vma->vm_end)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> >
> >  	if (!walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	unsigned long start_addr, end_addr;
> >  	int err = 0;
> >
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> >  	lockdep_assert_held(&mapping->i_mmap_rwsem);
> >  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
> >  				  first_index + nr - 1) {
>
> So I took a random patch set in order to learn how to compile the Fedora
> Ark kernel out of an arbitrary upstream tree (mm in this case), hence the
> noise here.
>
> With this goal, which is mainly to be able to do such a thing once or
> twice per release cycle:
>
> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
>
> BR, Jarkko

Thanks!
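
For readers following the thread, below is a minimal sketch of how an
internal mm caller is expected to combine install_pte() with
walk_page_range_mm(). Everything here is illustrative only: the example_*()
and make_guard_pte() names are made up for this sketch, and it assumes a
file under mm/ that includes <linux/pagewalk.h> and "internal.h".

/* Hypothetical callback: provide the PTE value to install for each
 * currently-empty PTE in the range.  The walker itself performs the
 * set_pte_at() and update_mmu_cache() calls.
 */
static int example_install_pte(unsigned long addr, unsigned long next,
			       pte_t *ptep, struct mm_walk *walk)
{
	*ptep = make_guard_pte();	/* made-up helper for illustration */
	return 0;
}

static const struct mm_walk_ops example_ops = {
	.install_pte	= example_install_pte,
	/* Lock mode shown is illustrative; the caller must hold mmap_lock. */
	.walk_lock	= PGWALK_RDLOCK,
};

static int example_populate(struct mm_struct *mm, unsigned long start,
			    unsigned long end)
{
	/*
	 * Only reachable from mm-internal code: walk_page_range_mm() is
	 * declared in mm/internal.h.  Passing ops with install_pte set to
	 * the exported walk_page_range() would instead fail
	 * check_ops_valid() and return -EINVAL.
	 */
	return walk_page_range_mm(mm, start, end, &example_ops, NULL);
}

If the callback or an allocation fails, the walk is aborted, so a real
caller has to be prepared to deal with a partially populated range.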

Patch

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index f5eb5a32aeed..9700a29f8afb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -25,12 +25,15 @@  enum page_walk_lock {
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
- * @pte_entry:		if set, called for each PTE (lowest-level) entry,
- *			including empty ones
+ * @pte_entry:		if set, called for each PTE (lowest-level) entry
+ *			including empty ones, except if @install_pte is set.
+ *			If @install_pte is set, @pte_entry is called only for
+ *			existing PTEs.
  * @pte_hole:		if set, called for each hole at all levels,
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
- *			are skipped.
+ *			are skipped. If @install_pte is specified, this will
+ *			not trigger for any populated ranges.
  * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
  *			function is called with the vma lock held, in order to
  *			protect against a concurrent freeing of the pte_t* or
@@ -51,6 +54,13 @@  enum page_walk_lock {
  * @pre_vma:            if set, called before starting walk on a non-null vma.
  * @post_vma:           if set, called after a walk on a non-null vma, provided
  *                      that @pre_vma and the vma walk succeeded.
+ * @install_pte:        if set, missing page table entries are installed and
+ *                      thus all levels are always walked in the specified
+ *                      range. This callback is then invoked at the PTE level
+ *                      (having split any THP pages prior), providing the PTE to
+ *                      install. If allocations fail, the walk is aborted. This
+ *                      operation is only available for userland memory. Not
+ *                      usable for hugetlb ranges.
  *
  * p?d_entry callbacks are called even if those levels are folded on a
  * particular architecture/configuration.
@@ -76,6 +86,8 @@  struct mm_walk_ops {
 	int (*pre_vma)(unsigned long start, unsigned long end,
 		       struct mm_walk *walk);
 	void (*post_vma)(struct mm_walk *walk);
+	int (*install_pte)(unsigned long addr, unsigned long next,
+			   pte_t *ptep, struct mm_walk *walk);
 	enum page_walk_lock walk_lock;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 508f7802dd2b..fb1fb0c984e4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,7 @@ 
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/pagemap.h>
+#include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -1451,4 +1452,9 @@  static inline void accept_page(struct page *page)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+/* pagewalk.c */
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 5f9f01532e67..f3cbad384344 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,9 +3,14 @@ 
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/mmu_context.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
 /*
  * We want to know the real level where a entry is located ignoring any
  * folding of levels which may be happening. For example if p4d is folded then
@@ -29,9 +34,23 @@  static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 	int err = 0;
 
 	for (;;) {
-		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
-		if (err)
-		       break;
+		if (ops->install_pte && pte_none(ptep_get(pte))) {
+			pte_t new_pte;
+
+			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
+					       walk);
+			if (err)
+				break;
+
+			set_pte_at(walk->mm, addr, pte, new_pte);
+			/* Non-present before, so for arches that need it. */
+			if (!WARN_ON_ONCE(walk->no_vma))
+				update_mmu_cache(walk->vma, addr, pte);
+		} else {
+			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
+			if (err)
+				break;
+		}
 		if (addr >= end - PAGE_SIZE)
 			break;
 		addr += PAGE_SIZE;
@@ -89,11 +108,14 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 again:
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pte_alloc(walk->mm, pmd);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -109,18 +131,19 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		/*
-		 * Check this here so we only break down trans_huge
-		 * pages when we _need_ to
-		 */
-		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+		if (!ops->install_pte && !ops->pte_entry)
+			continue; /* Nothing to do. */
+		if (!ops->pte_entry && ops->install_pte &&
+		    pmd_present(*pmd) &&
+		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+			continue; /* Avoid unnecessary split. */
 
 		if (walk->vma)
 			split_huge_pmd(walk->vma, pmd, addr);
+		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
@@ -148,11 +171,14 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
  again:
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pmd_alloc(walk->mm, pud, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -164,14 +190,20 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pmd_entry || ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
+			continue;  /* Nothing to do. */
+		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
+		    pud_present(*pud) &&
+		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
+			continue; /* Avoid unnecessary split. */
 
 		if (walk->vma)
 			split_huge_pud(walk->vma, pud, addr);
+		else if (pud_leaf(*pud) || !pud_present(*pud))
+			continue; /* Nothing to do. */
+
 		if (pud_none(*pud))
 			goto again;
 
@@ -196,18 +228,22 @@  static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pud_alloc(walk->mm, p4d, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->p4d_entry) {
 			err = ops->p4d_entry(p4d, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
+		    ops->install_pte)
 			err = walk_pud_range(p4d, addr, next, walk);
 		if (err)
 			break;
@@ -231,18 +267,22 @@  static int walk_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __p4d_alloc(walk->mm, pgd, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->pgd_entry) {
 			err = ops->pgd_entry(pgd, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
+		    ops->pte_entry || ops->install_pte)
 			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
@@ -334,6 +374,11 @@  static int __walk_page_range(unsigned long start, unsigned long end,
 	int err = 0;
 	struct vm_area_struct *vma = walk->vma;
 	const struct mm_walk_ops *ops = walk->ops;
+	bool is_hugetlb = is_vm_hugetlb_page(vma);
+
+	/* We do not support hugetlb PTE installation. */
+	if (ops->install_pte && is_hugetlb)
+		return -EINVAL;
 
 	if (ops->pre_vma) {
 		err = ops->pre_vma(start, end, walk);
@@ -341,7 +386,7 @@  static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
-	if (is_vm_hugetlb_page(vma)) {
+	if (is_hugetlb) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
@@ -380,47 +425,14 @@  static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 #endif
 }
 
-/**
- * walk_page_range - walk page table with caller specific callbacks
- * @mm:		mm_struct representing the target process of page table walk
- * @start:	start address of the virtual address range
- * @end:	end address of the virtual address range
- * @ops:	operation to call during the walk
- * @private:	private data for callbacks' usage
- *
- * Recursively walk the page table tree of the process represented by @mm
- * within the virtual address range [@start, @end). During walking, we can do
- * some caller-specific works for each entry, by setting up pmd_entry(),
- * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
- * callbacks, the associated entries/pages are just ignored.
- * The return values of these callbacks are commonly defined like below:
- *
- *  - 0  : succeeded to handle the current entry, and if you don't reach the
- *         end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- *         with caller specific value.
- *  - <0 : failed to handle the current entry, and return to the caller
- *         with error code.
- *
- * Before starting to walk page table, some callers want to check whether
- * they really want to walk over the current vma, typically by checking
- * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
- * purpose.
- *
- * If operations need to be staged before and committed after a vma is walked,
- * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
- * since it is intended to handle commit-type operations, can't return any
- * errors.
- *
- * struct mm_walk keeps current values of some common data like vma and pmd,
- * which are useful for the access from callbacks. If you want to pass some
- * caller-specific data to callbacks, @private should be helpful.
+/*
+ * See the comment for walk_page_range(); this performs the heavy lifting of
+ * the operation, but sets no restrictions on how the walk proceeds.
  *
- * Locking:
- *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
- *   because these function traverse vma list and/or access to vma's data.
+ * We usually restrict the ability to install PTEs, but this functionality is
+ * available to internal memory management code and provided in mm/internal.h.
  */
-int walk_page_range(struct mm_struct *mm, unsigned long start,
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private)
 {
@@ -479,6 +491,80 @@  int walk_page_range(struct mm_struct *mm, unsigned long start,
 	return err;
 }
 
+/*
+ * Determine if the walk operations specified are permitted to be used for a
+ * page table walk.
+ *
+ * This check is performed on all functions which are parameterised by walk
+ * operations and exposed in include/linux/pagewalk.h.
+ *
+ * Internal memory management code can use the walk_page_range_mm() function
+ * to gain access to all page walking operations.
+ */
+static bool check_ops_valid(const struct mm_walk_ops *ops)
+{
+	/*
+	 * The installation of PTEs is solely under the control of memory
+	 * management logic and subject to many subtle locking, security and
+	 * cache considerations so we cannot permit other users to do so, and
+	 * certainly not for exported symbols.
+	 */
+	if (ops->install_pte)
+		return false;
+
+	return true;
+}
+
+/**
+ * walk_page_range - walk page table with caller specific callbacks
+ * @mm:		mm_struct representing the target process of page table walk
+ * @start:	start address of the virtual address range
+ * @end:	end address of the virtual address range
+ * @ops:	operation to call during the walk
+ * @private:	private data for callbacks' usage
+ *
+ * Recursively walk the page table tree of the process represented by @mm
+ * within the virtual address range [@start, @end). During walking, we can do
+ * some caller-specific works for each entry, by setting up pmd_entry(),
+ * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
+ * callbacks, the associated entries/pages are just ignored.
+ * The return values of these callbacks are commonly defined like below:
+ *
+ *  - 0  : succeeded to handle the current entry, and if you don't reach the
+ *         end address yet, continue to walk.
+ *  - >0 : succeeded to handle the current entry, and return to the caller
+ *         with caller specific value.
+ *  - <0 : failed to handle the current entry, and return to the caller
+ *         with error code.
+ *
+ * Before starting to walk page table, some callers want to check whether
+ * they really want to walk over the current vma, typically by checking
+ * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
+ * purpose.
+ *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
+ * struct mm_walk keeps current values of some common data like vma and pmd,
+ * which are useful for the access from callbacks. If you want to pass some
+ * caller-specific data to callbacks, @private should be helpful.
+ *
+ * Locking:
+ *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
+ *   because these function traverse vma list and/or access to vma's data.
+ */
+int walk_page_range(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private)
+{
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
+	return walk_page_range_mm(mm, start, end, ops, private);
+}
+
 /**
  * walk_page_range_novma - walk a range of pagetables not backed by a vma
  * @mm:		mm_struct representing the target process of page table walk
@@ -494,7 +580,7 @@  int walk_page_range(struct mm_struct *mm, unsigned long start,
  * walking the kernel pages tables or page tables for firmware.
  *
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
- * take other effective approache (mmap lock may be insufficient) to prevent
+ * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
  * from being freed (e.g. memory hot-remove).
  */
@@ -513,6 +599,8 @@  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 
 	if (start >= end || !walk.mm)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	/*
 	 * 1) For walking the user virtual address space:
@@ -556,6 +644,8 @@  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -574,6 +664,8 @@  int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 
 	if (!walk.mm)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -623,6 +715,9 @@  int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	unsigned long start_addr, end_addr;
 	int err = 0;
 
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
 				  first_index + nr - 1) {