
[v3,3/7] mm: Add write-protect and clean utilities for address space ranges

Message ID 20191002134730.40985-4-thomas_os@shipmail.org (mailing list archive)
State New, archived
Series Emulated coherent graphics memory take 2

Commit Message

Thomas Hellström (Intel) Oct. 2, 2019, 1:47 p.m. UTC
From: Thomas Hellstrom <thellstrom@vmware.com>

Add two utilities to a) write-protect and b) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or PCI device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to take advantage
of hardware dirty-bit tracking and avoid the overhead of page-faults,
typically on large accesses into small memory regions.
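
A minimal sketch of the intended call pattern (the driver-side names below
are hypothetical; only the two exported helpers are the ones added by this
patch):

static void my_drv_flush_dirty(struct address_space *mapping, pgoff_t first,
			       pgoff_t nr, unsigned long *bitmap)
{
	pgoff_t start = 0, end = 0;	/* *start >= *end: no bits set yet */

	/* Force new writes through page_mkwrite() / pfn_mkwrite(). */
	as_dirty_wrprotect(mapping, first, nr);

	/* Clean already-dirty ptes; record their page offsets in bitmap[]. */
	as_dirty_clean(mapping, first, nr, first, bitmap, &start, &end);

	/* Bits start..end-1 of bitmap[] (relative to first) were dirty. */
}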

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/mm.h    |  13 +-
 mm/Kconfig            |   3 +
 mm/Makefile           |   1 +
 mm/as_dirty_helpers.c | 315 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 331 insertions(+), 1 deletion(-)
 create mode 100644 mm/as_dirty_helpers.c

Comments

Linus Torvalds Oct. 2, 2019, 6:06 p.m. UTC | #1
On Wed, Oct 2, 2019 at 6:48 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> From: Thomas Hellstrom <thellstrom@vmware.com>
>
> Add two utilities to a) write-protect and b) clean all ptes pointing into
> a range of an address space.

This one I still don't exactly love.

I'm not entirely sure what rubs me the wrong way, but part of it is
naming. We don't use the name "as", because it reads as (sic) an
English word.

The name we use for address_space pointers is "mapping", both for
variables and for existing functions.

See for example "pte_same_as_swp()" which uses "as" as the _word_ 'as'.

Contrast that with "unmap_mapping_range()" or
"mapping_set_unevictable()" or "read_mapping_page()" or
"invalidate_mapping_pages()", that all work on address spaces.

So please don't use 'as' as shorthand for that - either in the
function names or the filename.

I'm not sure if that's the _only_ thing that raises my hackles when I
read this patch, but I think it might be. So I'm not saying "ack with
naming change", but I suspect that if the naming was changed, it would
look much better to me.

Yes, it's a bit more typing. But I really think
"clean_mapping_dirty_pages()" is just not only more in line with the
mm naming, I think it's a lot more legible and understandable than
"as_dirty_clean()", which just makes me go "what the heck does that
function do?"

And I really think it needs more than just "as" -> "mapping".
"mapping_dirty_clean()" still makes me go "what?" in a way that
"clean_mapping_dirty_pages()" does not. One name reads as a series or
random words, the other reads as a "this is what the function does".

I know I sometimes get hung up about naming, but I do think naming
matters.  A descriptive name that just reads as what the function does
makes it much easier to read the logic of code, imnsho.

              Linus
Matthew Wilcox Oct. 2, 2019, 6:13 p.m. UTC | #2
On Wed, Oct 02, 2019 at 11:06:43AM -0700, Linus Torvalds wrote:
> On Wed, Oct 2, 2019 at 6:48 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
> >
> > From: Thomas Hellstrom <thellstrom@vmware.com>
> >
> > Add two utilities to a) write-protect and b) clean all ptes pointing into
> > a range of an address space.
[...]
> Yes, it's a bit more typing. But I really think
> "clean_mapping_dirty_pages()" is just not only more in line with the
> mm naming, I think it's a lot more legible and understandable than
> "as_dirty_clean()", which just makes me go "what the heck does that
> function do?"
> 
> And I really think it needs more than just "as" -> "mapping".
> "mapping_dirty_clean()" still makes me go "what?" in a way that
> "clean_mapping_dirty_pages()" does not. One name reads as a series or
> random words, the other reads as a "this is what the function does".

I'd suggest clean_mapping_pages() -- a function which does that would
naturally skip the non-dirty pages, and that doesn't need to be in the
function name.
Thomas Hellström (Intel) Oct. 2, 2019, 7:09 p.m. UTC | #3
On 10/2/19 8:06 PM, Linus Torvalds wrote:
> On Wed, Oct 2, 2019 at 6:48 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> From: Thomas Hellstrom <thellstrom@vmware.com>
>>
>> Add two utilities to a) write-protect and b) clean all ptes pointing into
>> a range of an address space.
> This one I still don't exactly love.
>
> I'm not entirely sure what rubs me the wrong way, but part of it is
> naming. We don't use the name "as", because it reads as (sic) an
> English word.
>
> The name we use for address_space pointers is "mapping", both for
> variables and for existing functions.
>
> See for example "pte_same_as_swp()" which uses "as" as the _word_ 'as'.
>
> Contrast that with "unmap_mapping_range()" or
> "mapping_set_unevictable()" or "read_mapping_page()" or
> "invalidate_mapping_pages()", that all work on address spaces.
>
> So please don't use 'as' as shorthand for that - either in the
> function names or the filename.
>
> I'm not sure if that's the _only_ thing that raises my hackles when I
> read this patch, but I think it might be. So I'm not saying "ack with
> naming change", but I suspect that if the naming was changed, it would
> look much better to me.
>
> Yes, it's a bit more typing. But I really think
> "clean_mapping_dirty_pages()" is just not only more in line with the
> mm naming, I think it's a lot more legible and understandable than
> "as_dirty_clean()", which just makes me go "what the heck does that
> function do?"
>
> And I really think it needs more than just "as" -> "mapping".
> "mapping_dirty_clean()" still makes me go "what?" in a way that
> "clean_mapping_dirty_pages()" does not. One name reads as a series or
> random words, the other reads as a "this is what the function does".
>
> I know I sometimes get hung up about naming, but I do think naming
> matters.  A descriptive name that just reads as what the function does
> makes it much easier to read the logic of code, imnsho.
>
>                Linus

Yes, I typically tend towards using a "namespace_object_operation" naming
scheme, with "as_dirty" being the namespace here.

But I'll give it a shot to see if I can rename it more in line with the 
above.

Looking at Matthew's suggestion but lining up with 
"unmap_mapping_range()", perhaps we could use "clean_mapping_range" and 
"wp_mapping_range"?

Thanks,

Thomas
Linus Torvalds Oct. 2, 2019, 8:27 p.m. UTC | #4
On Wed, Oct 2, 2019 at 12:09 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Yes I typically tend towards using a "namespace_object_operation" naming
> scheme, with "as_dirty" being the namespace here,

We discourage that kind of mindless namespacing for core stuff.

It makes sense in a driver or a filesystem: when there are 20+
different filesystems all implementing the same operation, you can't
have descriptive and natural names that are unique. So having a
namespace prefix for the filesystem or driver makes sense. But even
then it tends to be just a simple name, not the op it does.

> Looking at Matthew's suggestion but lining up with
> "unmap_mapping_range()", perhaps we could use "clean_mapping_range" and
> "wp_mapping_range"?

Yes, I agree with Willy that "dirty" is kind of implicit when the
operation is to clean something, so the above sounds sane to me.

The wp part I'm not entirely sure about: you're not actually
write-protecting the range. You're doing something different. You're
only doing it for shared writable mappings. And I'd rather see
separate 'struct mm_walk_ops' than shared ones that then have a flag
in a different structure to change behavior.

Yes, yes, some of the levels share the same logic, but that just means
that you can use the same function pointer for that level in the
different "clean" vs "wp" mm_walk_op.

Also, looking closer at this patch, this makes me go "Whaa?":

+       pte = (mm == &init_mm) ?
+               pte_offset_kernel(pmd, addr) :
+               pte_offset_map_lock(mm, pmd, addr, &ptl);

because I don't think that's sensible. When do you have a vma in kernel space?

Also, why do you loop inside the pmd_entry function, instead of just
having a pte_entry function?

           Linus
Thomas Hellstrom Oct. 3, 2019, 7:56 a.m. UTC | #5
On 10/2/19 10:28 PM, Linus Torvalds wrote:
> On Wed, Oct 2, 2019 at 12:09 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Yes I typically tend towards using a "namespace_object_operation" naming
>> scheme, with "as_dirty" being the namespace here,
> We discourage that kind of mindless namespacing for core stuff.
>
> It makes sense in a driver or a filesystem: when there are 20+
> different filesystems all implementing the same operation, you can't
> have descriptive and natural names that are unique. So having a
> namespace prefix for the filesystem or driver makes sense. But even
> then it tends to be just a simple name, not the op it does.

Understood.

>
>> Looking at Matthew's suggestion but lining up with
>> "unmap_mapping_range()", perhaps we could use "clean_mapping_range" and
>> "wp_mapping_range"?
> Yes, I agree with Willy that "dirty" is kind of implicit when the
> operation is to clean something, so the above sounds sane to me.
>
> The wp part I'm not entirely sure about: you're not actually
> write-protecting the range. You're doing something different. You're
> only doing it for shared writable mappings.

Both the cleaning operation and the wp operation operate on shared
writable mappings, and both are restricted to entries that may take part
in dirty-tracking (huge pmds and puds are ignored), so perhaps a "dirty"
makes sense anyway, and points out the similarity:

clean_mapping_dirty_range() and wp_mapping_dirty_range()"?

>  And I'd rather see
> separate 'struct mm_walk_ops' than shared ones that then have a flag
> in a different structure to change behavior.
>
> Yes, yes, some of the levels share the same logic, but that just means
> that you can use the same function pointer for that level in the
> different "clean" vs "wp" mm_walk_op.

I think that this comment and this last one:

> Also, why do you loop inside the pmd_entry function, instead of just
> having a pte_entry function?"

are tied together. The pagewalk code is kind of awkward since if you
provide a pte_entry function, then huge pmds are unconditionally split,
even if handled in pmd_entry, forcing pmd-aware users to also handle
ptes in pmd_entry(). I'm not entirely sure why, but it looks like it's
to avoid a race where huge pmds could otherwise be unintentionally
split if appearing *after* the pmd_entry() call. Other pagewalk users
do the same here, see for example clear_refs_pte_range():

https://elixir.bootlin.com/linux/latest/source/fs/proc/task_mmu.c#L1040

Also the pagewalk walk_pte_range() for some reason doesn't take the page
table lock, which means that a pte may well be cleared under us while we
have it cached for modification.

For these reasons we can't use the pte_entry, even internally and this
means we have three choices:

a) Selecting the pte function using a bool. Saves code and avoids extra
indirect function call.
b) Duplicating the pmd_entry with a different pte function, also
duplicating the ops. Avoids extra indirect function call but some extra
code.
c) Instead of the bool, use another function pointer in yet another ops
pointed to by the private structure. Saves some code.

I opted for a) here, but I can of course change that if needed?

>
> Also, looking closer at this patch, this makes me go "Whaa?":
>
> +       pte = (mm == &init_mm) ?
> +               pte_offset_kernel(pmd, addr) :
> +               pte_offset_map_lock(mm, pmd, addr, &ptl);
>
> because I don't think that's sensible. When do you have a vma in kernel space?
>
> Also, why do you loop inside the pmd_entry function, instead of just
> having a pte_entry function?

Yes, that was a blind copy from v1 of the code that used
"apply_to_page_range()". I'll fix that up.

Thanks,

/Thomas

>
>            Linus
>
Linus Torvalds Oct. 3, 2019, 4:55 p.m. UTC | #6
On Thu, Oct 3, 2019 at 12:56 AM Thomas Hellstrom <thellstrom@vmware.com> wrote:
>
> Both the cleaning operation and the wp operation operate on shared
> writable mappings, and both are restricted to entries that may take part
> in dirty-tracking (huge pmds and puds are ignored), so perhaps a "dirty"
> makes sense anyway, and points out the similarity:
>
> clean_mapping_dirty_range() and wp_mapping_dirty_range()"?

Hmm. Yeah, either that, or "shared", as in "clean_shared_mapping_range()"?

I don't know. Maybe this isn't a huge deal, and the "don't care about
private dirty/write" is kind of implied by the fact that it's about
the backing mapping rather than the vma.

But on the other hand we do have things like "unmap_mapping_range()"
which will unmap all the private mappings too, so maybe it's a good
idea to clarify that this is about shared mappings.

I suspect it's a bit of a bike shed, and I could go either way.

> >  And I'd rather see
> > separate 'struct mm_walk_ops' than shared ones that then have a flag
> > in a different structure to change behavior.
> >
> > Yes, yes, some of the levels share the same logic, but that just means
> > that you can use the same function pointer for that level in the
> > different "clean" vs "wp" mm_walk_op.
>
> I think that this comment and this last one:
>
> > Also, why do you loop inside the pmd_entry function, instead of just
> > having a pte_entry function?"
>
> are tied together. The pagewalk code is kind of awkward since if you
> provide a pte_entry function, then huge pmds are unconditionally split,

So I think that we could handle that differently.

But even if you don't want to do the pte_entry function, the "use two
different sets of walker op structures" stands. Instead of having a
test for "am I walking for wp or not", just make it implicit.

Also, I do think that you're seeing what is a weakness in the pte
walkers wrt splitting. I do think we shouldn't split unconditionally.

I suspect we should instead change the logic to split only if we
didn't have a pmd_entry. Because looking at the existing cases, if you
have a pmd_entry, you currently never have a pte_entry, so it wouldn't
change semantics for current cases.

> even if handled in pmd_entry,
> forcing pmd-aware users to also handle ptes in pmd_entry(). I'm not
> entirely sure why, but it looks like it's to avoid a race where huge
> pmds could otherwise be unintentionally split if appearing *after* the
> pmd_entry() call.

See above: we never have both pmd_entry and pte_entry walkers as it is
right now. So you can't have that race.

> Other pagewalk users do the same here, see for
> example clear_refs_pte_range():
>
> https://elixir.bootlin.com/linux/latest/source/fs/proc/task_mmu.c#L1040
>
> Also the pagewalk walk_pte_range() for some reason doesn't take the page
> table lock, which means that a pte may well be cleared under us while we
> have it cached for modification.

Ho humm. That looks really sketchy to me. We actually have very few
pte_entry cases, perhaps exactly because it's all kinds of broken.

There's a couple of odd architectures, and then there is
prot_none_walk_ops. And that one purely looks at the pte's, and does a
special "can we do this" phase before doing the real modification.
It's limited to special PFNMAP/MIXEDMAP things.

So it does look like the pte walking code is accidentally (or
intentionally) ok in that case, but looking at the architecture users
(s390, at least), I do get the feeling that the lack of locking is an
actual and real bug.

It's probably impossible to hit in practice because it's limited to
special ops and you'd have to race with swapout and even then it
might not much matter. But it looks _wrong_.

And honestly, it looks like the clear_refs_pte_range() example you
point at comes from this for _exactly_ the same reason your code does
this: the pte walker is buggy, and the pmd case doesn't want to split.

The pte walker code in question has been there since 2008. This is not
a new thing. The clear_refs_pte_range() code partly goes all the way
back to that, although it's been changed some since. But the actual
pte_offset_map_lock() call in there predates the generic walker, so
what I think happened is that the generic walker was done only
partially, and intentionally didn't go all the way to the pte level
exactly because the walker was made to split things up unconditionally
if you did the pte case.

> For these reasons we can't use the pte_entry, even internally and this
> means we have three choices:
>
> a) Selecting the pte function using a bool. Saves code and avoids extra
> indirect function call.
> b) Duplicating the pmd_entry with a different pte function, also
> duplicating the ops. Avoids extra indirect function call but some extra
> code.
> c) Instead of the bool, use another function pointer in yet another ops
> pointed to by the private structure. Saves some code.

So considering that

 - there is one current user of the pte code that doesn't care about
the proper pte lock, because what it does is a racy pre-check anyway

 - that one user is also documented to not care too much about
performance (it's a crazy special case)

 - the other users _do_ look like they are currently buggy and want a
stable pte with proper locking

 - there are currently no cases of "I have a pmd walker _and_ a pte
walker" because that case was always broken due to splitting

I would say that the current walker code is simply buggy in this
respect, and people have worked around it for over a decade.

End result: since you have an actual test-case that wants this, I'd
like to at least try

 d) Fix the pte walker to do the right thing, then just use separate
pte walkers in your code

The fix would be those two conceptual changes:

 1) don't split if the walker asks for a pmd_entry (the walker itself
can then decide to split, of course, but right now no walkers want it
since there are no pmd _and_ pte walkers, because people who want that
do the pte walk themselves)

 2) get the proper page table lock if you do walk the pte, since
otherwise it's racy

Then there won't be any code duplication, because all the duplication
you now have at the pmd level is literally just workarounds for the
fact that our current walker has this bug.
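
As a rough sketch only (not the patch attached to this mail), change 2)
would have walk_pte_range() take the page table lock around its pte loop,
and change 1) would make walk_pmd_range() skip split_huge_pmd() whenever
the walker supplies a pmd_entry, leaving any splitting to that callback:

static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
			  struct mm_walk *walk)
{
	spinlock_t *ptl;
	pte_t *pte;
	int err = 0;

	/* Change 2): hold the pte lock so pte_entry sees stable ptes. */
	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	for (;;) {
		err = walk->ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
		if (err)
			break;
		addr += PAGE_SIZE;
		if (addr == end)
			break;
		pte++;
	}
	pte_unmap_unlock(pte, ptl);

	return err;
}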

That "fix the pte walker" would be one preliminary patch that would
look something like the attached TOTALLY UNTESTED garbage.

I call it "garbage" because I really hope people take it just as what
it is: "something like this". It compiles for me, and I did try to
think it through, but I might have missed some big piece of the
picture when writing that patch.

And yes, this is a much bigger conceptual change for the VM layer, but
I really think our pagewalk code is actively buggy right now, and is
forcing users to do bad things because they work around the existing
limitations.

Hmm? Could some of the core mm people look over that patch?

And yes, I was tempted to move the proper pmd locking into the walker
too, and do

        ptl = pmd_trans_huge_lock(pmd, vma);
        if (ptl) {
                err = ops->pmd_entry(pmd, addr, next, walk);
                spin_unlock(ptl);
                ...

but while I think that's the correct thing to do in the long run, that
would have to be done together with changing all the existing
pmd_entry users. It would make the pmd_entry _solely_ handle the
hugepage case, and then you'd have to remove the locking in the
pmd_entry, and have to make the pte walking be a walker entry. But
that would _really_ clean things up, and would make things like
smaps_pte_range() much easier to read, and much more obvious (it would
be split into a smaps_pmd_range and smaps_pte_range, and the callbacks
wouldn't need to know about the complex locking).

So I think this is the right direction to move into, but I do want
people to think about this, and think about that next phase of doing
the pmd_trans_huge_lock too.

Comments?

                Linus
Thomas Hellström (Intel) Oct. 3, 2019, 6:03 p.m. UTC | #7
On 10/3/19 6:55 PM, Linus Torvalds wrote:
>
>   d) Fix the pte walker to do the right thing, then just use separate
> pte walkers in your code
>
> The fix would be those two conceptual changes:
>
>   1) don't split if the walker asks for a pmd_entry (the walker itself
> can then decide to split, of course, but right now no walkers want it
> since there are no pmd _and_ pte walkers, because people who want that
> do the pte walk themselves)
>
>   2) get the proper page table lock if you do walk the pte, since
> otherwise it's racy
>
> Then there won't be any code duplication, because all the duplication
> you now have at the pmd level is literally just workarounds for the
> fact that our current walker has this bug.

I actually started on d) already when Kirill asked me to unify the 
pud_entry() and pmd_entry()
callbacks.

>
> That "fix the pte walker" would be one preliminary patch that would
> look something like the attached TOTALLY UNTESTED garbage.
>
> I call it "garbage" because I really hope people take it just as what
> it is: "something like this". It compiles for me, and I did try to
> think it through, but I might have missed some big piece of the
> picture when writing that patch.
>
> And yes, this is a much bigger conceptual change for the VM layer, but
> I really think our pagewalk code is actively buggy right now, and is
> forcing users to do bad things because they work around the existing
> limitations.
>
> Hmm? Could some of the core mm people look over that patch?
>
> And yes, I was tempted to move the proper pmd locking into the walker
> too, and do
>
>          ptl = pmd_trans_huge_lock(pmd, vma);
>          if (ptl) {
>                  err = ops->pmd_entry(pmd, addr, next, walk);
>                  spin_unlock(ptl);
>                  ...
>
> but while I think that's the correct thing to do in the long run, that
> would have to be done together with changing all the existing
> pmd_entry users. It would make the pmd_entry _solely_ handle the
> hugepage case, and then you'd have to remove the locking in the
> pmd_entry, and have to make the pte walking be a walker entry. But
> that would _really_ clean things up, and would make things like
> smaps_pte_range() much easier to read, and much more obvious (it would
> be split into a smaps_pmd_range and smaps_pte_range, and the callbacks
> wouldn't need to know about the complex locking).
>
> So I think this is the right direction to move into, but I do want
> people to think about this, and think about that next phase of doing
> the pmd_trans_huge_lock too.

>
> Comments?
>
>                  Linus

I think if we take the ptl lock outside the callback, we'd need to allow 
the callback to unlock to do non-atomic things or to avoid recursive 
locking if it decides to split in the callback. FWIW, the pud_entry call 
is being made locked, but the only current implementation appears to 
happily ignore that from what I can tell.

And if we allow unlocking or call the callback unlocked, the callback 
needs to tell us whether the entry was actually handled or if we need to 
fall back to the next level. Perhaps using a positive PAGE_WALK_FALLBACK 
return value? That would allow current implementations to be unmodified.
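
A minimal sketch of that idea (PAGE_WALK_FALLBACK and the callback name are
made up for illustration):

/* Positive return value: ask the walker to fall back to the next level. */
#define PAGE_WALK_FALLBACK	1

static int example_pud_entry(pud_t *pud, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	if (!pud_trans_huge(*pud))
		return PAGE_WALK_FALLBACK;	/* let the walker do pmds/ptes */

	/* ... handle the huge pud here, under the lock the walker took ... */
	return 0;
}

and in walk_pud_range() something like:

	err = ops->pud_entry(pud, addr, next, walk);
	if (err == PAGE_WALK_FALLBACK)
		err = 0;		/* fall through to walk_pmd_range() */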

/Thomas
Linus Torvalds Oct. 3, 2019, 6:11 p.m. UTC | #8
On Thu, Oct 3, 2019 at 11:03 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> >
> > So I think this is the right direction to move into, but I do want
> > people to think about this, and think about that next phase of doing
> > the pmd_trans_huge_lock too.
>
> I think if we take the ptl lock outside the callback, we'd need to allow
> the callback to unlock to do non-atomic things or to avoid recursive
> locking if it decides to split in the callback.

Note that I think that the particular pmd locking case we should leave
for later, simply because it's a separate issue, and it comes with
more worries.

So I just wanted to mention it to see what people thought and keep it
in mind for later, but I don't think it should be part of this series.
Your use case doesn't need it (at least yet), and existing users
already do their own locking.

The "change pte_entry" to do the proper locking I think is safe.
Probably exactly *because* pte_entry is so broken, we literally have
only five users of it in the whole kernel, and they are fairly simple
and certainly don't block.

(Ok, the s390 has some funky hw locking, so maybe "simple" is the
wrong word to use, but it doesn't seem to have any interaction with
the locking).

End result: I absolutely agree that changing the pmd locking to be
done by the walker would be a much bigger change. I don't think we
need to do that part yet. It's not the current pain-point.

             Linus

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..4a9b02f7f91c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2637,7 +2637,6 @@  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
 
-
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2878,5 +2877,17 @@  static inline int pages_identical(struct page *page1, struct page *page2)
 	return !memcmp_pages(page1, page2);
 }
 
+#ifdef CONFIG_AS_DIRTY_HELPERS
+unsigned long as_dirty_clean(struct address_space *mapping,
+			     pgoff_t first_index, pgoff_t nr,
+			     pgoff_t bitmap_pgoff,
+			     unsigned long *bitmap,
+			     pgoff_t *start,
+			     pgoff_t *end);
+
+unsigned long as_dirty_wrprotect(struct address_space *mapping,
+				 pgoff_t first_index, pgoff_t nr);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..7a0538fe507f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@  config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config AS_DIRTY_HELPERS
+        bool
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..828bf4278c9d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,3 +107,4 @@  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_AS_DIRTY_HELPERS) += as_dirty_helpers.o
diff --git a/mm/as_dirty_helpers.c b/mm/as_dirty_helpers.c
new file mode 100644
index 000000000000..2cbf23a86fb4
--- /dev/null
+++ b/mm/as_dirty_helpers.c
@@ -0,0 +1,315 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pagewalk.h>
+#include <linux/hugetlb.h>
+#include <linux/bitops.h>
+#include <linux/mmu_notifier.h>
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
+
+/**
+ * struct as_dirty_walk - Private struct for pagetable walk callbacks
+ * @range: Range for mmu notifiers
+ * @tlbflush_start: Address of first modified pte
+ * @tlbflush_end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ * @wrprotect: Whether this is a write-protect or a clean operation
+ */
+struct as_dirty_walk {
+	struct mmu_notifier_range range;
+	unsigned long tlbflush_start;
+	unsigned long tlbflush_end;
+	unsigned long total;
+	unsigned int wrprotect;
+};
+
+/**
+ * as_dirty_pte_wrprotect - Write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static void as_dirty_pte_wrprotect(pte_t *pte, unsigned long addr,
+				   struct mm_walk *walk)
+{
+	struct as_dirty_walk *adw = walk->private;
+	pte_t ptent = *pte;
+
+	if (pte_write(ptent)) {
+		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+		ptent = pte_wrprotect(old_pte);
+		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+		adw->total++;
+		adw->tlbflush_start = min(adw->tlbflush_start, addr);
+		adw->tlbflush_end = max(adw->tlbflush_end, addr + PAGE_SIZE);
+	}
+}
+
+/**
+ * struct as_dirty_walk_clean - Private struct for the as_dirty_clean()
+ * function.
+ * @base: struct as_dirty_walk we derive from
+ * @bitmap_pgoff: Address_space page offset of the first bit in @bitmap
+ * @bitmap: Bitmap with one bit for each page offset in the address_space range
+ * covered.
+ * @start: Address_space page offset of first modified pte relative
+ * to @bitmap_pgoff
+ * @end: Address_space page offset of last modified pte relative
+ * to @bitmap_pgoff
+ */
+struct as_dirty_walk_clean {
+	struct as_dirty_walk base;
+	pgoff_t bitmap_pgoff;
+	unsigned long *bitmap;
+	pgoff_t start;
+	pgoff_t end;
+};
+
+#define to_as_dirty_walk_clean(_adw)			\
+	container_of(_adw, struct as_dirty_walk_clean, base)
+
+/**
+ * as_dirty_pte_clean - Clean a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function cleans a pte and records the range in
+ * virtual address space of touched ptes for efficient TLB flushes.
+ * It also records dirty ptes in a bitmap representing page offsets
+ * in the address_space, as well as the first and last of the bits
+ * touched.
+ */
+static void as_dirty_pte_clean(pte_t *pte, unsigned long addr,
+			       struct mm_walk *walk)
+{
+	struct as_dirty_walk *adw = walk->private;
+	struct as_dirty_walk_clean *clean = to_as_dirty_walk_clean(adw);
+	pte_t ptent = *pte;
+
+	if (pte_dirty(ptent)) {
+		pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
+			walk->vma->vm_pgoff - clean->bitmap_pgoff;
+		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+		ptent = pte_mkclean(old_pte);
+		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+
+		adw->total++;
+		adw->tlbflush_start = min(adw->tlbflush_start, addr);
+		adw->tlbflush_end = max(adw->tlbflush_end, addr + PAGE_SIZE);
+
+		__set_bit(pgoff, clean->bitmap);
+		clean->start = min(clean->start, pgoff);
+		clean->end = max(clean->end, pgoff + 1);
+	}
+}
+
+/*
+ * as_dirty_pmd_entry - The pagewalk pmd callback.
+ *
+ * Loops over ptes and calls the appropriate PTE callback.
+ * The pmd callback is needed to take the page-table lock and to
+ * avoid unnecessarily splitting huge pmd entries.
+ */
+static int as_dirty_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct mm_struct *mm = walk->mm;
+	struct as_dirty_walk *adw = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	/* Ignore huge pmds. Dirty tracking is done on the PTE level! */
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = (mm == &init_mm) ?
+		pte_offset_kernel(pmd, addr) :
+		pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		if (adw->wrprotect)
+			as_dirty_pte_wrprotect(pte++, addr, walk);
+		else
+			as_dirty_pte_clean(pte++, addr, walk);
+	} while (addr += PAGE_SIZE, addr != end);
+
+	arch_leave_lazy_mmu_mode();
+
+	if (mm != &init_mm)
+		pte_unmap_unlock(pte - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * as_dirty_pud_entry - The pagewalk pud callback.
+ *
+ * The pud callback is needed solely to avoid unnecessarily splitting huge
+ * pud entries.
+ */
+static int as_dirty_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
+			      struct mm_walk *walk)
+{
+	/* Ignore huge puds. Dirty tracking is done on the PTE level. */
+	return 0;
+}
+
+/*
+ * as_dirty_pre_vma - The pagewalk pre_vma callback.
+ *
+ * The pre_vma callback performs the cache flush, stages the tlb flush
+ * and calls the necessary mmu notifiers.
+ */
+static int as_dirty_pre_vma(unsigned long start, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct as_dirty_walk *adw = walk->private;
+
+	adw->tlbflush_start = end;
+	adw->tlbflush_end = start;
+
+	mmu_notifier_range_init(&adw->range, MMU_NOTIFY_PROTECTION_PAGE, 0,
+				walk->vma, walk->mm, start, end);
+	mmu_notifier_invalidate_range_start(&adw->range);
+	flush_cache_range(walk->vma, start, end);
+
+	/*
+	 * We're not using tlb_gather_mmu() since typically
+	 * only a small subrange of PTEs are affected, whereas
+	 * tlb_gather_mmu() records the full range.
+	 */
+	inc_tlb_flush_pending(walk->mm);
+
+	return 0;
+}
+
+/*
+ * as_dirty_post_vma - The pagewalk post_vma callback.
+ *
+ * The post_vma callback performs the tlb flush and calls necessary mmu
+ * notifiers.
+ */
+static void as_dirty_post_vma(struct mm_walk *walk)
+{
+	struct as_dirty_walk *adw = walk->private;
+
+	if (adw->tlbflush_end > adw->tlbflush_start)
+		flush_tlb_range(walk->vma, adw->tlbflush_start,
+				adw->tlbflush_end);
+
+	mmu_notifier_invalidate_range_end(&adw->range);
+	dec_tlb_flush_pending(walk->mm);
+}
+
+/*
+ * as_dirty_test_walk - The pagewalk test_walk callback.
+ *
+ * Won't perform dirty-tracking on COW, read-only or HUGETLB vmas.
+ */
+static int as_dirty_test_walk(unsigned long start, unsigned long end,
+			      struct mm_walk *walk)
+{
+	/* Skip non-applicable VMAs */
+	if ((walk->vma->vm_flags & (VM_SHARED | VM_WRITE | VM_HUGETLB)) !=
+	    (VM_SHARED | VM_WRITE))
+		return 1;
+
+	return 0;
+}
+
+static const struct mm_walk_ops walk_ops = {
+		.pmd_entry = as_dirty_pmd_entry,
+		.pud_entry = as_dirty_pud_entry,
+		.test_walk = as_dirty_test_walk,
+		.pre_vma = as_dirty_pre_vma,
+		.post_vma = as_dirty_post_vma
+};
+
+/**
+ * as_dirty_wrprotect - Write-protect all ptes in an address_space range
+ * @mapping: The address_space we want to write protect
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ *
+ * Return: The number of ptes actually write-protected. Note that
+ * already write-protected ptes are not counted.
+ */
+unsigned long as_dirty_wrprotect(struct address_space *mapping,
+				 pgoff_t first_index, pgoff_t nr)
+{
+	struct as_dirty_walk adw = { .total = 0,
+				     .wrprotect = 1};
+
+	i_mmap_lock_read(mapping);
+	WARN_ON(walk_page_mapping(mapping, first_index, nr, &walk_ops, &adw));
+	i_mmap_unlock_read(mapping);
+
+	return adw.total;
+}
+EXPORT_SYMBOL_GPL(as_dirty_wrprotect);
+
+/**
+ * as_dirty_clean - Clean all ptes in an address_space range
+ * @mapping: The address_space we want to clean
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ * @bitmap_pgoff: The page offset of the first bit in @bitmap
+ * @bitmap: Pointer to a bitmap of at least @nr bits. The bitmap needs to
+ * cover the whole range @first_index..@first_index + @nr.
+ * @start: Pointer to the number of the first set bit in @bitmap. The value
+ * is modified as new bits are set by the function.
+ * @end: Pointer to one past the number of the last set bit in @bitmap, or
+ * zero if none set. The value is modified as new bits are set by the function.
+ *
+ * Note: When this function returns there is no guarantee that a CPU has
+ * not already dirtied new ptes. However it will not clean any ptes not
+ * reported in the bitmap.
+ *
+ * If a caller needs to make sure all dirty ptes are picked up and none
+ * additional are added, it first needs to write-protect the address-space
+ * range and make sure new writers are blocked in page_mkwrite() or
+ * pfn_mkwrite(). Then, after a TLB flush following the write-protection,
+ * pick up all dirty bits.
+ *
+ * WARNING: This function should only be used for address spaces whose
+ * vmas are marked VM_IO and that do not contain huge pages.
+ * To avoid interference with COW'd pages, vmas not marked VM_SHARED are
+ * simply skipped.
+ *
+ * Return: The number of dirty ptes actually cleaned.
+ */
+unsigned long as_dirty_clean(struct address_space *mapping,
+			     pgoff_t first_index, pgoff_t nr,
+			     pgoff_t bitmap_pgoff,
+			     unsigned long *bitmap,
+			     pgoff_t *start,
+			     pgoff_t *end)
+{
+	bool none_set = (*start >= *end);
+	struct as_dirty_walk_clean clean = {
+		.base = { .total = 0,
+			  .wrprotect = 0},
+		.bitmap_pgoff = bitmap_pgoff,
+		.bitmap = bitmap,
+		.start = none_set ? nr : *start,
+		.end = none_set ? 0 : *end,
+	};
+
+	i_mmap_lock_read(mapping);
+	WARN_ON(walk_page_mapping(mapping, first_index, nr, &walk_ops,
+				  &clean.base));
+	i_mmap_unlock_read(mapping);
+
+	*start = clean.start;
+	*end = clean.end;
+
+	return clean.base.total;
+}
+EXPORT_SYMBOL_GPL(as_dirty_clean);