
[2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

Message ID 20210107200402.31095-3-aarcange@redhat.com (mailing list archive)
State New, archived
Series page_count can't be used to decide when wp_page_copy

Commit Message

Andrea Arcangeli Jan. 7, 2021, 8:04 p.m. UTC
NOTE: the "Fixes" tag used here is meant to optimize the backporting,
but 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is completely
correct. Despite being correct, it happened to uncover implicit
assumptions that some other code made about a very specific behavior
in do_wp_page that had to be altered by such a commit.

The page_mapcount is always guaranteed == 1 for an exclusive anon
page, so when it was used to decide whether an exclusive page under
wrprotection could be reused (as in wp_page_reuse), the check would
always succeed.

The page_mapcount had to be replaced with the page_count because it
couldn't account for GUP pins, so after that change, for the first
time, wp_page_copy can now be called also for exclusive anon pages
that are undergoing wrprotection.

Even then, everything is still fine for all cases that wrprotect with
the mmap_write_lock, since the COW faults cannot run concurrently in
that case.

However, there are two cases that wrprotect exclusive anon pages
while holding only the mmap_read_lock: soft dirty's clear_refs_write
and UFFDIO_WRITEPROTECT. Both of them benefit from keeping their
wrprotection runtime scalable and from keeping the mmap_read_lock
without having to switch to the mmap_write_lock.

To stick to the mmap_read_lock, for both UFFDIO_WRITEPROTECT and
clear_refs_write we need to handle the new reality that there can be
COWs (as in wp_page_copy) happening on exclusive anon pages that are
under wrprotection, but with the respective TLB flush still deferred.

An example of the problematic UFFDIO_WRITEPROTECT runtime is shown
below.

 CPU0			CPU 1		CPU 2
 ------			--------	-------
 userfaultfd_wrprotect(mode_wp = true)
 PT lock
 atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
 PT unlock

			do_page_fault FAULT_FLAG_WRITE
					userfaultfd_wrprotect(mode_wp = false)
					PT lock
					ATOMIC clear _PAGE_UFFD_WP <- problem
					/* _PAGE_WRITE not set */
					PT unlock
					XXXXXXXXXXXXXX BUG RACE window open here

			PT lock
			FAULT_FLAG_WRITE is set by CPU
			_PAGE_WRITE is still clear in pte
			PT unlock

			wp_page_copy
			cow_user_page runs with stale TLB

 deferred tlb flush <- too late
 XXXXXXXXXXXXXX BUG RACE window close here

userfaultfd_wrprotect(mode_wp = true) is never a problem because as
long as the uffd-wp flag is set in the pte/hugepmd the page fault is
guaranteed to reach a dead end in handle_userfault(). The window where
uffd-wp is not set while the pte has been wrprotected but the TLB
flush is still pending, opens when userfaultfd_wrprotect(mode_wp =
false) releases the PT-lock as shown above, and it closes when the
first deferred TLB flush runs. If do_wp_page->wp_page_copy runs within
such a window, some userland writes can get lost in the copy and they
can end up in the original page that gets discarded.

The soft dirty runtime is similar and would look like below:

 CPU0			CPU 1			CPU 2
 ------			--------		-------
						instantiate writable TLB
 clear_refs_write
 PT lock
 pte_wrprotect
 PT unlock
			do_page_fault FAULT_FLAG_WRITE
			PT lock
			FAULT_FLAG_WRITE is set by CPU
			_PAGE_WRITE is still clear in pte
			PT unlock

			wp_page_copy
			cow_user_page...
						writes through the TLB
			...cow_user_page

To handle this race, a wrprotect_tlb_flush_pending atomic counter is
added to the vma.

This counter needs to be elevated while holding the mmap_read_lock
before taking the PT lock to wrprotect the pagetable and it can only
be decreased after the deferred TLB flush is complete.

This way the page fault can trivially serialize against pending TLB
flushes using a new helper: sync_wrprotect_tlb_flush_pending().
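
Distilled from the change_protection_range() hunk in the patch below,
the wrprotect-side protocol looks like this in sketch form (the
helper placement here is illustrative; the exact hunks are in the
patch):

	mmap_read_lock(mm);
	inc_tlb_flush_pending(mm);
	inc_wrprotect_tlb_flush_pending(vma);
	/* PT lock; clear _PAGE_WRITE in the ptes; PT unlock */
	flush_tlb_range(vma, start, end);      /* deferred TLB flush */
	dec_wrprotect_tlb_flush_pending(vma);  /* only after the flush ran */
	dec_tlb_flush_pending(mm);
	mmap_read_unlock(mm);

and the COW fault serializes against it before copying:

	sync_wrprotect_tlb_flush_pending(vma, vmf->address);
	cow_user_page(new_page, old_page, vmf);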

Before this commit, testing with the userfaultfd selftest showed 100%
reproducible mm corruption with writes getting lost.

$ ./userfaultfd anon 100 100
nr_pages: 25600, nr_pages_per_cpu: 3200
bounces: 99, mode: rnd racing read, userfaults: 773 missing (215+205+58+114+72+85+18+6), 9011 wp (1714+1714+886+1227+1009+1278+646+537)
[..]
bounces: 72, mode: poll, userfaults: 720 missing (187+148+102+49+92+103+24+15), 9885 wp (1452+1175+1104+1667+1101+1365+913+1108)
bounces: 71, mode: rnd racing ver read, page_nr 25241 memory corruption 6 7

After the commit the userland memory corruption is gone as expected.

Cc: stable@kernel.org
Reported-by: Nadav Amit <namit@vmware.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/proc/task_mmu.c       | 17 +++++++++-
 include/linux/mm.h       | 46 +++++++++++++++++++++++++++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c            |  1 +
 mm/memory.c              | 69 ++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c            |  4 +++
 6 files changed, 141 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Jan. 7, 2021, 8:17 p.m. UTC | #1
On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> However, there are two cases that wrprotect exclusive anon
> pages while holding only the mmap_read_lock:

I still think the real fix is "Don't do that then", and just take the
write lock.

The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
normal operation. Same goes for softdirty.

Why have those become _so_ magical that they can break the VM for
everybody else?

         Linus
Linus Torvalds Jan. 7, 2021, 8:25 p.m. UTC | #2
On Thu, Jan 7, 2021 at 12:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I still think the real fix is "Don't do that then", and just take the
> write lock.

The alternative, of course, is to just make sure the page table flush
is done inside the page table lock (and then we make COW do the copy
inside of it).

But this whole "we know UFFD breaks all rules, we'll add even more
crap to it" approach is horrendous.

               Linus
Andrea Arcangeli Jan. 7, 2021, 8:58 p.m. UTC | #3
Hi Linus,

On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > However, there are two cases that wrprotect exclusive anon
> > pages while holding only the mmap_read_lock:
> 
> I still think the real fix is "Don't do that then", and just take the
> write lock.
> 
> The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
> normal operation. Same goes for softdirty.
> 
> Why have those become _so_ magical that they can break the VM for
> everybody else?

I see what you mean above and I agree. As I said below:

==
In simple terms: the page_count check in do_wp_page makes it
impossible to wrprotect memory, if such memory is under a !FOLL_WRITE
GUP pin.
==

So to simplify let's ignore UFFDIO_WRITEPROTECT here, which is new and
adds no dependency on top of clear_refs in this respect.

So yes, if we drop any code that has to wrprotect memory in place in
the kernel (since all userland memory can be under GUP pin in read
mode) and we make such an operation illegal, then it's fine, but that
means clear_refs has to go or it has to fail if there's a GUP pin
during the wrprotection.

The problem is it's not even possible to detect reliably if there's
really a long term GUP pin because of speculative pagecache lookups.

We would need to declare that any secondary MMU which is supposed to
be VM-neutral using mmu notifier, like a GPU or a RDMA device, cannot
be used in combination with clear_refs, and it would need to be
enforced fully in userland. Most mmu notifier users drop the GUP pin
during the invalidate for extra safety in case an invalidate goes
missing: they would all need to drop FOLL_GET to be compliant and stop
causing memory corruption if clear_refs shall still be allowed to
happen on mmu notifier capable secondary MMUs. Even then, how does
userland know which devices attach to the memory with mmu notifier and
never use FOLL_GET and which don't? It doesn't sound reliable to
enforce this in userland. So I don't see how clear_refs can be saved.

Now let's take another example that shows at least some fundamental
inefficiency that has nothing to do with clear_refs.

Let's suppose a GUP pin is taken on subpageA by a RDMA device in
process A (the parent). Let's now assume subpageB is mapped in process
B (a child of process A). Both subpageA and subpageB are exclusive
(mapcount == 1) and read-write, but they share the same page_count
atomic counter (only the page_mapcounts are subpage granular). To
still tame the zygote concern with the page_count in do_wp_page,
process B, when it forks a child (process C), would forever have to do
an extra superfluous COW even after process C exits. Is that what we
want on top of the fundamental unsafety added to clear_refs?
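
In code terms the granularity mismatch boils down to this (purely
illustrative, using the existing helpers):

	/* the refcount is compound-wide: both reads return one counter */
	page_count(subpageA);	/* THP head refcount, shared by 512 subpages */
	page_count(subpageB);	/* same atomic counter */
	/* the mapcount instead is per 4k subpage */
	page_mapcount(subpageA);	/* 1: exclusive in process A */
	page_mapcount(subpageB);	/* 1: exclusive in process B */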

Thanks,
Andrea
Linus Torvalds Jan. 7, 2021, 9:29 p.m. UTC | #4
On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> The problem is it's not even possible to detect reliably if there's
> really a long term GUP pin because of speculative pagecache lookups.

So none of the normal code _needs_ that any more these days, which is
what I think is so nice. Any pinning will do the COW, and then we have
the logic to make sure it stays writable, and that keeps everything
nicely coherent and is all fairly simple.

And yes, it does mean that if somebody then explicitly write-protects
a page, it may end up being COW'ed after all, but if you first pinned
it, and then started playing with the protections of that page, why
should you be surprised?

So to me, this sounds like a "don't do that then" situation.

Anybody who does page pinning and wants coherency should NOT TOUCH THE
MAPPING IT PINNED.

(And if you do touch it, it's your own fault, and you get to keep both
of the broken pieces)

Now, I do agree that from a QoI standpoint, it would be really lovely
if we actually enforced it. I'm not entirely sure we can, but maybe it
would be reasonable to use that

  mm->has_pinned && page_maybe_dma_pinned(page)

at least as the beginning of a heuristic.

In fact, I do think that "page_maybe_dma_pinned()" could possibly be
made stronger than it is. Because at *THAT* point, we might say "we
know a pinned page always must have a page_mapcount() of 1" - since as
part of pinning it and doing the GUP_PIN, we forced the COW, and then
subsequent fork() operations enforce it too.

So I do think that it might be possible to make that clear_refs code
notice "this page is pinned, I can't mark it WP without the pinning
coherency breaking".
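
Something like this in the clear_refs pte walk, maybe (a handwaved,
untested sketch, using the existing helpers on the page found by the
walk):

	if (atomic_read(&mm->has_pinned) && page_maybe_dma_pinned(page))
		continue;	/* don't wrprotect: it'd break pin coherency */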

It might not even be hard. But admittedly I'm somewhat handwaving
here, and I might not have thought of some situation.

              Linus
kernel test robot Jan. 7, 2021, 9:36 p.m. UTC | #5
Hi Andrea,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master hnaz-linux-mm/master v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/page_count-can-t-be-used-to-decide-when-wp_page_copy/20210108-040616
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 09162bc32c880a791c6c0668ce0745cf7958f576
compiler: nds32le-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"cppcheck warnings: (new ones prefixed by >>)"
>> fs/proc/task_mmu.c:1248:33: warning: Uninitialized variable: tmp [uninitvar]
       for (vma = mm->mmap; vma != tmp;
                                   ^

vim +1248 fs/proc/task_mmu.c

  1183	
  1184	static ssize_t clear_refs_write(struct file *file, const char __user *buf,
  1185					size_t count, loff_t *ppos)
  1186	{
  1187		struct task_struct *task;
  1188		char buffer[PROC_NUMBUF];
  1189		struct mm_struct *mm;
  1190		struct vm_area_struct *vma;
  1191		enum clear_refs_types type;
  1192		int itype;
  1193		int rv;
  1194	
  1195		memset(buffer, 0, sizeof(buffer));
  1196		if (count > sizeof(buffer) - 1)
  1197			count = sizeof(buffer) - 1;
  1198		if (copy_from_user(buffer, buf, count))
  1199			return -EFAULT;
  1200		rv = kstrtoint(strstrip(buffer), 10, &itype);
  1201		if (rv < 0)
  1202			return rv;
  1203		type = (enum clear_refs_types)itype;
  1204		if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
  1205			return -EINVAL;
  1206	
  1207		task = get_proc_task(file_inode(file));
  1208		if (!task)
  1209			return -ESRCH;
  1210		mm = get_task_mm(task);
  1211		if (mm) {
  1212			struct mmu_notifier_range range;
  1213			struct clear_refs_private cp = {
  1214				.type = type,
  1215			};
  1216	
  1217			if (type == CLEAR_REFS_MM_HIWATER_RSS) {
  1218				if (mmap_write_lock_killable(mm)) {
  1219					count = -EINTR;
  1220					goto out_mm;
  1221				}
  1222	
  1223				/*
  1224				 * Writing 5 to /proc/pid/clear_refs resets the peak
  1225				 * resident set size to this mm's current rss value.
  1226				 */
  1227				reset_mm_hiwater_rss(mm);
  1228				mmap_write_unlock(mm);
  1229				goto out_mm;
  1230			}
  1231	
  1232			if (mmap_read_lock_killable(mm)) {
  1233				count = -EINTR;
  1234				goto out_mm;
  1235			}
  1236			if (type == CLEAR_REFS_SOFT_DIRTY) {
  1237				for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1238					struct vm_area_struct *tmp;
  1239					if (!(vma->vm_flags & VM_SOFTDIRTY)) {
  1240						inc_wrprotect_tlb_flush_pending(vma);
  1241						continue;
  1242					}
  1243	
  1244					/*
  1245					 * Rollback wrprotect_tlb_flush_pending before
  1246					 * releasing the mmap_lock.
  1247					 */
> 1248					for (vma = mm->mmap; vma != tmp;
  1249					     vma = vma->vm_next)
  1250						dec_wrprotect_tlb_flush_pending(vma);
  1251	
  1252					mmap_read_unlock(mm);
  1253					if (mmap_write_lock_killable(mm)) {
  1254						count = -EINTR;
  1255						goto out_mm;
  1256					}
  1257					for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1258						vma->vm_flags &= ~VM_SOFTDIRTY;
  1259						vma_set_page_prot(vma);
  1260						inc_wrprotect_tlb_flush_pending(vma);
  1261					}
  1262					mmap_write_downgrade(mm);
  1263					break;
  1264				}
  1265	
  1266				inc_tlb_flush_pending(mm);
  1267				mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
  1268							0, NULL, mm, 0, -1UL);
  1269				mmu_notifier_invalidate_range_start(&range);
  1270			}
  1271			walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
  1272					&cp);
  1273			if (type == CLEAR_REFS_SOFT_DIRTY) {
  1274				mmu_notifier_invalidate_range_end(&range);
  1275				flush_tlb_mm(mm);
  1276				for (vma = mm->mmap; vma; vma = vma->vm_next)
  1277					dec_wrprotect_tlb_flush_pending(vma);
  1278				dec_tlb_flush_pending(mm);
  1279			}
  1280			mmap_read_unlock(mm);
  1281	out_mm:
  1282			mmput(mm);
  1283		}
  1284		put_task_struct(task);
  1285	
  1286		return count;
  1287	}
  1288	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
John Hubbard Jan. 7, 2021, 9:53 p.m. UTC | #6
On 1/7/21 1:29 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> The problem is it's not even possible to detect reliably if there's
>> really a long term GUP pin because of speculative pagecache lookups.
> 
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
> 
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
> 
> So to me, this sounds like a "don't do that then" situation.
> 
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
> 
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
> 
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
> 
>    mm->has_pinned && page_maybe_dma_pinned(page)
> 
> at least as the beginning of a heuristic.
> 
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we

What exactly did you have in mind, to make it stronger? I think the
answer is in this email but I don't quite see it yet...

Also, now seems to be a good time to mention that I've been thinking about
a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer
to peer with GUP/NIC, and HMM support for atomic operations from a device).
And it seems like the following approach would help:

* Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks
to Jason for this point) means "user space owns the lifetime". We might
even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would
make them mean the same thing. The idea is that there are no "short term"
pins of this kind of memory.

* Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans,
because several of us had thought that Direct IO needs FOLL_PIN. However, this
recent conversation, plus my list of cases above, seems to indicate otherwise.
That's because we only have one refcount approach for marking pages in this way,
and we should spend it on the long-term pinned pages. Those are both hard to
identify otherwise, and actionable once we identify them.

Direct IO pins, on the other hand, are more transient. We can probably live
without tagging Direct IO pages as FOLL_PIN. I think.

This is all assuming that we make progress in the area of "if it's not a
page_maybe_dma_pinned() page, then we can wait for it or otherwise do reasonable
things about the refcount". So we end up with a clear (-ish) difference between
pages that can be waited for, and pages that should not be waited for in the
kernel.

I hope this helps, but if it's too much of a side-track, please disregard.


thanks,
Linus Torvalds Jan. 7, 2021, 10 p.m. UTC | #7
On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> >
> > Now, I do agree that from a QoI standpoint, it would be really lovely
> > if we actually enforced it. I'm not entirely sure we can, but maybe it
> > would be reasonable to use that
> >
> >    mm->has_pinned && page_maybe_dma_pinned(page)
> >
> > at least as the beginning of a heuristic.
> >
> > In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> > made stronger than it is. Because at *THAT* point, we might say "we
>
> What exactly did you have in mind, to make it stronger? I think the
> answer is in this email but I don't quite see it yet...

Literally just adding a " && page_mapcount(page) == 1" in there
(probably best done inside page_maybe_dma_pinned() itself)

> Direct IO pins, on the other hand, are more transient. We can probably live
> without tagging Direct IO pages as FOLL_PIN. I think.

Yes. I think direct-IO writes should be able to just do a transient
GUP, and if it causes a COW fault that isn't coherent, that's the
correct semantics, I think (ie the direct-IO will see the original
data, the COW faulter will get its own private copy to make changes
to).

I think pinning should be primarily limited to things that _require_
coherency (ie you pin because you're going to do some active two-way
communication using that page)

Does that match your thinking?

               Linus
John Hubbard Jan. 7, 2021, 10:14 p.m. UTC | #8
On 1/7/21 2:00 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>>>
>>> Now, I do agree that from a QoI standpoint, it would be really lovely
>>> if we actually enforced it. I'm not entirely sure we can, but maybe it
>>> would be reasonable to use that
>>>
>>>     mm->has_pinned && page_maybe_dma_pinned(page)
>>>
>>> at least as the beginning of a heuristic.
>>>
>>> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
>>> made stronger than it is. Because at *THAT* point, we might say "we
>>
>> What exactly did you have in mind, to make it stronger? I think the
>> answer is in this email but I don't quite see it yet...
> 
> Literally just adding a " && page_mapcount(page) == 1" in there
> (probably best done inside page_maybe_dma_pinned() itself)

Well, that means that pages that are used for pinned DMA like this
cannot be shared with other processes. Is that an acceptable
limitation for the RDMA users? It seems a bit constraining, at first
glance anyway.

> 
>> Direct IO pins, on the other hand, are more transient. We can probably live
>> without tagging Direct IO pages as FOLL_PIN. I think.
> 
> Yes. I think direct-IO writes should be able to just do a transient
> GUP, and if it causes a COW fault that isn't coherent, that's the
> correct semantics, I think (ie the direct-IO will see the original
> data, the COW faulter will get its own private copy to make changes
> to).
> 
> I think pinning should be primarily limited to things that _require_
> coherency (ie you pin because you're going to do some active two-way
> communication using that page)
> 
> Does that match your thinking?
> 

Yes, perfectly. I'm going to update Documentation/core-api/pin_user_pages.rst
accordingly, once the dust settles on these discussions.

thanks,
Linus Torvalds Jan. 7, 2021, 10:20 p.m. UTC | #9
On Thu, Jan 7, 2021 at 2:14 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > Literally just adding a " && page_mapcount(page) == 1" in there
> > (probably best done inside page_maybe_dma_pinned() itself)
>
> Well, that means that pages that are used for pinned DMA like this, can
> not be shared with other processes. Is that an acceptable limitation
> for the RDMA users? It seems a bit constraining, at first glance anyway.

Hmm, add a check for the page being PageAnon(), perhaps?

If it's a shared vma, then the page can be pinned shared with multiple
mappings, I agree.

So yeah, I didn't think it through entirely.. And maybe I'm still
missing something else..

          Linus
Linus Torvalds Jan. 7, 2021, 10:24 p.m. UTC | #10
On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm, add a check for the page being PageAnon(), perhaps?
>
> If it's a shared vma, then the page can be pinned shared with multiple
> mappings, I agree.

Or check the vma directly for whether it's a COW vma. That's probably
a more obvious test, but would have to be done outside of
page_maybe_dma_pinned().

For example, in copy_present_page(), we've already done that COW-vma
test, so if we want to strengthen just _that_ test, then it would be
sufficient to just add a

        /* This cannot be a pinned page if it has more than one mapping */
        if (page_mapcount(page) != 1)
                return 1;

to copy_present_page() along with the existing page_maybe_dma_pinned() test.
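
In context the checks would then read something like this (untested,
the surrounding lines approximate the current copy_present_page()):

	if (likely(!atomic_read(&src_mm->has_pinned)))
		return 1;
	/* This cannot be a pinned page if it has more than one mapping */
	if (page_mapcount(page) != 1)
		return 1;
	if (likely(!page_maybe_dma_pinned(page)))
		return 1;
	/* otherwise pre-copy the page here to keep the pin coherent */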

No?

            Linus
Andrea Arcangeli Jan. 7, 2021, 10:31 p.m. UTC | #11
On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > The problem is it's not even possible to detect reliably if there's
> > really a long term GUP pin because of speculative pagecache lookups.
> 
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
> 
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
> 
> So to me, this sounds like a "don't do that then" situation.
> 
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
> 
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
> 
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
> 
>   mm->has_pinned && page_maybe_dma_pinned(page)
> 
> at least as the beginning of a heuristic.
> 
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we
> know a pinned page always must have a page_mapcount() of 1" - since as
> part of pinning it and doing the GUP_PIN, we forced the COW, and then
> subsequent fork() operations enforce it too.
> 
> So I do think that it might be possible to make that clear_refs code
> notice "this page is pinned, I can't mark it WP without the pinning
> coherency breaking".
> 
> It might not even be hard. But admittedly I'm somewhat handwaving
> here, and I might not have thought of some situation.

I suppose the objective would be to detect when it's a transient pin
(as in an O_DIRECT write) and fail clear_refs with an error for all
other cases of real long term pins that need to keep reading at full
PCI bandwidth, without extra GUP invocations after the wp_page_copy
run.

Because the speculative lookups are making the count unstable, it's
not even enough to use mmu notifier and never use FOLL_GET in GUP to
make it safe again (unlike what I mentioned in a previous email).

Random memory corruption will still silently materialize as result of
the speculative lookups in the above scenario.

My whole point here in starting this new thread to suggest page_count
in do_wp_page is an untenable solution, is that such a commit silently
broke every single long term GUP pin user that may be used in
combination with clear_refs since 2013.

Undetected silent memory corruption, or a detectable error out of
clear_refs, are just two different side effects of the same technical
ABI break that rendered clear_refs fundamentally incompatible with
long term GUP pins: detected or not, an ABI break it still is.

I felt obliged to report that there's something much deeper and
fundamentally incompatible between the page_count in do_wp_page and
any wrprotection of exclusive memory in place, e.g. if used in
combination with any RDMA. The TLB flushing and the
mmap_read/write_lock were just the tip of the iceberg and they're not
the main concern anymore.

In addition, the inefficiency caused by the fact that the page_count
effect is multiplied 512 or 262144 times (the number of 4k subpages in
a 2M or 1G page) while the mapcount remains 4k granular, makes me
think the page_count is unsuitable to be used there, and this specific
cure with page_count in do_wp_page looks worse than the initial zygote
disease.

Thanks,
Andrea
John Hubbard Jan. 7, 2021, 10:37 p.m. UTC | #12
On 1/7/21 2:24 PM, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm, add a check for the page being PageAnon(), perhaps?
>>
>> If it's a shared vma, then the page can be pinned shared with multiple
>> mappings, I agree.
> 
> Or check the vma directly for whether it's a COW vma. That's probably
> a more obvious test, but would have to be done outside of
> page_maybe_dma_pinned().
> 
> For example, in copy_present_page(), we've already done that COW-vma
> test, so if we want to strengthen just _that_ test, then it would be
> sufficient to just add a
> 
>          /* This cannot be a pinned page if it has more than one mapping */
>          if (page_mapcount(page) != 1)
>                  return 1;
> 
> to copy_present_page() along with the existing page_maybe_dma_pinned() test.
> 
> No?
> 
>              Linus

That approach makes me a lot happier, yes. Because it doesn't add constraints
to the RDMA cases, but still does what we need here.


thanks,
Linus Torvalds Jan. 7, 2021, 10:42 p.m. UTC | #13
On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Random memory corruption will still silently materialize as result of
> the speculative lookups in the above scenario.

Explain.

Yes, you'll get random memory corruption if you keep doing wrprotect()
without mmap_sem held for writing.

But I thought we agreed earlier that that is a bug. And I thought the
softdirty code already got it for writing.

           Linus
Linus Torvalds Jan. 7, 2021, 10:51 p.m. UTC | #14
On Thu, Jan 7, 2021 at 2:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I thought we agreed earlier that that is a bug. And I thought the
> softdirty code already got it for writing.

Ho humm. I had obviously not looked very much at that code. I had done
a quick git grep, but now that I look closer, it *does* get the
mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
then it does a mmap_write_downgrade().

So that's why I had looked more at the UFFD code, because that one was
the one I was aware of doing this all with just the read lock. I
_thought_ the softdirty code already took the write lock and wouldn't
race with page faults.

But I had missed that write_downgrade. So yeah, this code has the same issue.

Anyway, the fix is - I think - the same I outlined earlier when I was
talking about UFFD: take the thing for writing, so that you can't
race.

The alternate fix remains "make sure we always flush the TLB before
releasing the page table lock, and make COW do the copy under the page
table lock". But I really would prefer to just have this code follow
all the usual rules, and if it does a write protect, then it should
take the mmap_sem for writing.

Why is that very simple rule so bad?

(And see my unrelated but incidental note on it being a good idea to
try to minimize latency by making sure we don't do any IO under the
mmap lock - whether held for reading _or_ writing. Because I do think
we can improve in that area, if you have some good test-case).

              Linus
Andrea Arcangeli Jan. 7, 2021, 11:28 p.m. UTC | #15
On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Random memory corruption will still silently materialize as result of
> > the speculative lookups in the above scenario.
> 
> Explain.
> 
> Yes, you'll get random memory corruption if you keep doing wrprotect()
> without mmap_sem held for writing.

I didn't meant that.

> But I thought we agreed earlier that that is a bug. And I thought the
> softdirty code already got it for writing.

softdirty used mmap_read_lock too but this again isn't relevant here
and for the sake of discussion we can safely assume mmap_read_lock
doesn't exist in the kernel, and everything takes the mmap_write_lock
whenever a mmap_lock is taken at all.

I mean something bad will happen if a write happens but soft dirty
cannot register it because we didn't wrprotect the pte? Some dirty
page won't be transferred to the destination and it will be assumed
there was no soft dirty event for such a page? Otherwise it would mean
that wrprotecting is simply optional for all pages under clear_refs?

Not doing the final TLB flush in softdirty caused some issues even
when there was no COW and the deferred flush would only delay the
wrprotect fault:

       https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=RDJNU6vGF3hLO+j-g@mail.gmail.com/
       https://lore.kernel.org/linux-mm/20210105221628.GA12854@willie-the-truck/

From the userland point of view, I guessed that skipping the
wrprotection of the pte, because of a speculative pagecache lookup
elevating a random page_count, would behave like missing the final TLB
flush before clear_refs returns to userland, only worse.

Thanks,
Andrea
Andrea Arcangeli Jan. 7, 2021, 11:48 p.m. UTC | #16
On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote:
> Ho humm. I had obviously not looked very much at that code. I had done
> a quick git grep, but now that I look closer, it *does* get the
> mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
> then it does a mmap_write_downgrade().
> 
> So that's why I had looked more at the UFFD code, because that one was
> the one I was aware of doing this all with just the read lock. I
> _thought_ the softdirty code already took the write lock and wouldn't
> race with page faults.
> 
> But I had missed that write_downgrade. So yeah, this code has the same issue.

I overlooked the same thing initially. It's only when I noticed it
also used mmap_read_lock that I figured the group lock thingy, the
uffd-wp ad-hoc solution, despite being fully self contained thanks to
the handle_userfault() catcher for the uffd-wp bit in the pagetable,
wasn't worth it, since uffd-wp could always use whatever clear_refs
used to solve it.

> Anyway, the fix is - I think - the same I outlined earlier when I was
> talking about UFFD: take the thing for writing, so that you can't
> race.

Sure.

> The alternate fix remains "make sure we always flush the TLB before
> releasing the page table lock, and make COW do the copy under the page
> table lock". But I really would prefer to just have this code follow

The copy under PT lock isn't enough.

Flush TLB before releasing is enough of course.

Note also the 2/2 patch that I sent:

     https://lkml.kernel.org/r/20210107200402.31095-3-aarcange@redhat.com

2/2 would have been my preferred solution for both and it works
fine. All the corruption that was trivially reproducible with the
heavy selftest program in the kernel is gone.

If the pending TLB flush issue were the only regression that
page_count in do_wp_page introduced, I would have never suggested we
re-evaluate it. It'd be a good tradeoff in such a case, even if it'd
penalize the soft-dirty runtime, especially if we were allowed to
deploy 2/2 as a non-blocking solution.

Until yesterday I fully intended to just propose 1/2 and 2/2, with a
whole different cover letter, CC stable and close this case.

> all the usual rules, and if it does a write protect, then it should
> take the mmap_sem for writing.

The problem isn't about performance anymore, the problem is a silent
ABI break to long term GUP pin users attached to an mm under clear_refs.

> Why is that very simple rule so bad?
> 
> (And see my unrelated but incidental note on it being a good idea to
> try to minimize latency by making surfe we don't do any IO under the
> mmap lock - whether held for reading _or_ writing. Because I do think
> we can improve in that area, if you have some good test-case).

That would be great indeed.

Thanks,
Andrea
Linus Torvalds Jan. 8, 2021, 12:25 a.m. UTC | #17
On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> > The alternate fix remains "make sure we always flush the TLB before
> > releasing the page table lock, and make COW do the copy under the page
> > table lock". But I really would prefer to just have this code follow

> The copy under PT lock isn't enough.
>
> Flush TLB before releasing is enough of course.

Right. That's why I said "and". You need both, afaik.

But if we just do the mmap write lock, you need neither - then you
just need to flush before you release the write lock.

> Note also the patch in 2/2 patch that I sent:

Yes, yes, and that's what I'm objecting to.

All these stupid games with "flush_pending(" counts are complete garbage.

They all come from the fact that this code doesn't hold the right lock.

I don't understand you: in one breath you seem to say "yes, taking the
write lock is the right thing to do", and then in the next one you
point to this patch that adds all this garbage *because* it's not
holding the write lock.

All of those "tlb_flush_pending" things are wrong. They should not
exist. The code in clear_refs_write() should hold the mmap_sem for
writing, and do the TLB flush before it releases the mmap sem, and
then it *cannot* race with the page faults.

See what I'm saying? I refuse to apply your patch 2/2, because it all
seems entirely wrong.

What's doubly ludicrous about that is that the code already _took_
the mmap_sem for writing, and spent extra cycles to downgrade it -
INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't
even do anything expensive inside that (now downgraded) region, so the
downgrading was

 (a) buggy

 (b) slower than just keeping the lock the way it had

and (b) is because it had already done the expensive part (which was
to get the lock in the first place).

Just as an example, the whole "Rollback wrprotect_tlb_flush_pending"
is all because it got the lock - again wrongly - as a read-lock
initially, then it says "oh, I need to get a write lock", releases it,
re-takes it as a write lock, does a tiny amount of work, and then -
again incorrectly - downgrades it to a read-lock.

To make it even worse (if that is possible) it actually had YET
ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap
sem for writing, did its thing, and then released it.

So there's like *four* different locking mistakes in that single
function. And it's not even an important function to begin with.

It should just have done a single

        mmap_write_lock_killable(mm);
        ...
        mmap_write_unlock(mm);

around the whole thing, instead of _any_ of that crazy stuff.

That code is WRONG.

And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is
that magic fs/proc/ interface allowed to get VM internals so wrong,
and make things so much worse?

Can you not see why I'm arguing with you?

Please. Why is the correct patch not the attached one (apart from the
obvious fact that I haven't tested it and maybe just screwed up
completely - but you get the idea)?

            Linus
Will Deacon Jan. 8, 2021, 12:48 p.m. UTC | #18
On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> Please. Why is the correct patch not the attached one (apart from the
> obvious fact that I haven't tested it and maybe just screwed up
> completely - but you get the idea)?

It certainly looks simple and correct to me, although it means we're now
taking the mmap sem for write in the case where we only want to clear the
access flag, which should be fine with the thing only held for read, no?

Will
Andrea Arcangeli Jan. 8, 2021, 4:14 p.m. UTC | #19
Hello everyone,

On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote:
> On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> > Please. Why is the correct patch not the attached one (apart from the
> > obvious fact that I haven't tested it and maybe just screwed up
> > completely - but you get the idea)?
> 
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

I'm curious, would you also suggest that fixing just the TLB flushing
symptom is enough and we can forget about the ABI break coming from
page_count used in do_wp_page?

One random example: clear_refs will still break all long term GUP
pins, are you ok with that too?

page_count in do_wp_page is a fix for the original security issue
from vmsplice (where the child fools the parent into reusing the
exclusive page in do_wp_page) that appears worse than the bug itself.

page_count in do_wp_page, instead of isolating as malicious only the
parent reusing a page still queued in the vmsplice pipe, treats as
malicious also all the legit cases that had to reliably reuse the page
to avoid the secondary MMUs going out of sync.

page_count in do_wp_page is like a credit card provider blocking all
credit cards of all customers, because one credit card may have been
cloned (by vmsplice), but nobody can know which one was it. Of course
this technique will work perfectly as security fix because it will
treat all credit card users as malicious and it'll block them all
("block as in preventing re-use of the anon page").

The problem is all the other credit card users that weren't malicious
and get their COW broken too. Those are the long term GUP pins, if any
anon page can still be wrprotected anywhere in the VM.

At the same time the real low hanging fruit (vmsplice) that, if taken
care of, would have rendered the bug purely theoretical in security
terms, hasn't been fixed yet, despite those unprivileged long term GUP
pins causing more reproducible security issues than just the COW,
since they can still DoS the OOM killer and they bypass at least the
mlock enforcement, even for non compound pages.

Of course just fixing vmsplice to require some privilege won't fix
the bug in full, so it's not a suitable long term solution, but it has
to happen orthogonally for other reasons, and it'd at least remove the
short term security concern.

In addition you're not experiencing the full fallout of the side
effects of page_count used to decide whether to re-use all anon COW
pages, because the bug is still there (with enterprise default config
options at least). Not all credit cards are blocked yet with only
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you block
them all will you experience all the side effects of replacing the
per-subpage fine-grained mapcount with the compound-wide page count.

The two statements above, combined, result in my recommendation at
this point to resolve this in userland: render the security issue
theoretical by removing vmsplice from the OCI schema allowlist, or
enforce a fix in userland by always using execve after dropping privs
(as crun always does when it starts the container, of course).

For the long term, I can't see how using page_count in do_wp_page is a
tenable proposition, unless we either drop all secondary MMUs from the
kernel or VM features like clear_refs are dropped or unless the
page_count is magically stabilized and the speculative pagecache
lookups are also dropped.

If we try to manage the fallout by enforcing that no anon page can
ever be wrprotected in place (i.e. dropping the clear_refs feature or
rendering it unreliable by skipping elevated counts caused by spurious
pagecache lookups), it still sounds like too fragile a design, too
prone to break, to rely on. There's random arch code wrprotecting
memory; even vm86 does it under the hood (vm86 memory is unlikely to
have a long term GUP pin on it of course, but still, who knows?). I
mean the VM core cannot make assumptions like: "this vm86 case can
still wrprotect without worry because probably vm86 isn't used anymore
with any advanced secondary MMU, so if there's a GUP pin it probably
is a malicious vmsplice and not a RDMA or GPU or Virtualization
secondary MMU".

Then there's the secondary concern of the inefficiency it introduces
with extra unnecessary copies: a single GUP pin will prevent the reuse
of 512 or 262144 subpages, in the 512 case also potentially mapped in
different processes. The TLB flushing discussion registers as the
last concern in my view.

Thanks,
Andrea
Linus Torvalds Jan. 8, 2021, 5:30 p.m. UTC | #20
On Fri, Jan 8, 2021 at 4:48 AM Will Deacon <will@kernel.org> wrote:
>
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

When I was looking at that code, I was thinking that the whole
function should be split up to get rid of some of the indentation and
the "goto out_mm".

And yes, it would probably be good to split up up even more than that
"initial mm lookup and error handling", and have an actual case
statement for the different clear_ref 'type' cases.

And then it would be fairly simple and clean to say "this case only
needs the mmap_sem for read, that case needs it for write".
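
Roughly (handwaving, not a real patch):

	switch (type) {
	case CLEAR_REFS_MM_HIWATER_RSS:
	case CLEAR_REFS_SOFT_DIRTY:
		/* updates vm_flags or write-protects: lock for writing */
		if (mmap_write_lock_killable(mm))
			return -EINTR;
		break;
	default:
		/* only clears the accessed bit: reading is enough */
		if (mmap_read_lock_killable(mm))
			return -EINTR;
		break;
	}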

So I don't disagree, but I think it should be a separate patch - if it
even matters. Is this strange /proc case something that is even
commonly done?

            Linus
Linus Torvalds Jan. 8, 2021, 5:39 p.m. UTC | #21
On Fri, Jan 8, 2021 at 8:14 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> page_count in do_wp_page is a fix for the original security issue

Not just that.

page_count() is simply the right and efficient thing to do.

You talk about all these theoretical inefficiencies for cases like
zygote and page pinning, which have never ever been seen except as a
possible attack vector.

Stop talking about irrelevant things. Stop trying to "optimize" things
that never happen and don't matter.

Instead, what matters is the *NORMAL* VM flow.

Things like COW.

Things like "oh, now that we check just the page count, we don't even
need the page lock for the common case any more".

> For the long term, I can't see how using page_count in do_wp_page is a
> tenable proposition,

I think you should re-calibrate your expectations, and accept that
page_count() is the right thing to do, and live with it.

And instead of worrying about irrelevant special-case code, start
worrying about the code that gets triggered tens of thousands of times
a second, on regular loads, without anybody doing anything odd or
special at all, just running plain and normal shell scripts or any
other normal traditional load.

Those irrelevant special cases should be simple and work, not badly
optimized to the point where they are buggy. And they are MUCH LESS
IMPORTANT than the normal VM code, so if somebody does something odd,
and it's slow, then that is the problem for the _odd_ case, not for
the normal codepaths.

This is why I refuse to add crazy new special cases to core code. Make
the rules simple and straightforward, and make the code VM work well.

                  Linus
Andrea Arcangeli Jan. 8, 2021, 5:53 p.m. UTC | #22
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
> 
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not?
Because in current upstream it's not fixed with the enterprise
default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
> 
> Instead, what matters is the *NORMAL* VM flow.
> 
> Things like COW.
> 
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
> 
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
> 
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
> 
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory?

Or irrelevant special case as in: causing secondary MMUs to hit
silent data loss if a pte is ever wrprotected (arch code, vm86,
whatever, all under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
> 
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
> 
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the code VM work well.

New special cases? Which new cases?

There's nothing new here besides the zygote that wasn't fully fixed
with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 and is actually the only
new case I can imagine where page_count actually isn't a regression.

All old cases that you seem to refer as irrelevant and are in
production in v4.18, I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with
COWs happening in different processes would forever hit wp_page_copy,
since the count is always > 1 even though the mapcount can be 1 for
all subpages. A simple app doing fork/exec would forever copy all the
memory in the parent even after the exec is finished.

Thanks,
Andrea
Linus Torvalds Jan. 8, 2021, 7:25 p.m. UTC | #23
On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Do you intend to eventually fix the zygote vmsplice case or not?
> Because in current upstream it's not fixed currently using the
> enterprise default config.

Is this the hugepage case? Neither of your patches actually touched
that, so I've forgotten the details.

> Irrelevant special case as in: long term GUP pin on the memory?

Irrelevant special case in that

 (a) an extra COW shouldn't be a correctness issue unless somebody
does something horribly wrong (and obviously the code that hasn't
taken the mmap_lock for writing is then an example of that)

and

 (b) it's not a performance issue either unless you can find a real
load that does it.

Hmm?

               Linus
Andrea Arcangeli Jan. 9, 2021, 12:12 a.m. UTC | #24
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
> 
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and
uffd-wp.

So I didn't actually try to fix the hugepage case by adding the
page_count checks there too. I could try to do that, at least it'd be
consistent, but I'd still try to find an alternate solution later.

> > Irrelevant special case as in: long term GUP pin on the memory?
> 
> Irrelevant special case in that
> 
>  (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
> 
> and
> 
>  (b) it's not a performance issue either unless you can find a real
> load that does it.
> 
> Hmm?

For b) I don't have a hard time imagining `ps` hanging for seconds
if clear_refs is touched on a 4T mm, but b) is not the main
concern.

Having to rely on a) is the main concern and it's not about tlb
flushes but the long term GUP pins.

Thanks,
Andrea
Jan Kara Jan. 15, 2021, 11:27 a.m. UTC | #25
On Thu 07-01-21 13:53:18, John Hubbard wrote:
> On 1/7/21 1:29 PM, Linus Torvalds wrote:
> > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > 
> > > The problem is it's not even possible to detect reliably if there's
> > > really a long term GUP pin because of speculative pagecache lookups.
> > 
> > So none of the normal code _needs_ that any more these days, which is
> > what I think is so nice. Any pinning will do the COW, and then we have
> > the logic to make sure it stays writable, and that keeps everything
> > nicely coherent and is all fairly simple.
> > 
> > And yes, it does mean that if somebody then explicitly write-protects
> > a page, it may end up being COW'ed after all, but if you first pinned
> > it, and then started playing with the protections of that page, why
> > should you be surprised?
> > 
> > So to me, this sounds like a "don't do that then" situation.
> > 
> > Anybody who does page pinning and wants coherency should NOT TOUCH THE
> > MAPPING IT PINNED.
> > 
> > (And if you do touch it, it's your own fault, and you get to keep both
> > of the broken pieces)
> > 
> > Now, I do agree that from a QoI standpoint, it would be really lovely
> > if we actually enforced it. I'm not entirely sure we can, but maybe it
> > would be reasonable to use that
> > 
> >    mm->has_pinned && page_maybe_dma_pinned(page)
> > 
> > at least as the beginning of a heuristic.
> > 
> > In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> > made stronger than it is. Because at *THAT* point, we might say "we
> 
> What exactly did you have in mind, to make it stronger? I think the
> answer is in this email but I don't quite see it yet...
> 
> Also, now seems to be a good time to mention that I've been thinking about
> a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer
> to peer with GUP/NIC, and HMM support for atomic operations from a device).
> And it seems like the following approach would help:
> 
> * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks
> to Jason for this point) means "user space owns the lifetime". We might
> even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would
> make them mean the same thing. The idea is that there are no "short term"
> pins of this kind of memory.
> 
> * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans,
> because several of us had thought that Direct IO needs FOLL_PIN. However, this
> recent conversation, plus my list of cases above, seems to indicate otherwise.
> That's because we only have one refcount approach for marking pages in this way,
> and we should spend it on the long-term pinned pages. Those are both hard to
> identify otherwise, and actionable once we identify them.

Somewhat late to the game but I disagree here. I think direct IO still
needs FOLL_PIN so that page_maybe_dma_pinned() returns true for it. At
least for shared pages. Because filesystems/mm in the writeback path need
to detect whether the page is pinned and thus its contents can change
anytime without noticing, the page can be dirtied at random times etc. In
that case we need to bounce the page during writeback (to avoid checksum
failures), keep the page dirty in internal filesystem bookkeeping
(and in MM as well) etc...
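
A sketch of the writeback-side check being described (hypothetical:
bounce_page() here is a made-up placeholder, only
page_maybe_dma_pinned() and bio_add_page() are real helpers):

	if (page_maybe_dma_pinned(page))
		page = bounce_page(page); /* write out a stable snapshot */
	bio_add_page(bio, page, PAGE_SIZE, 0);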

								Honza

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a127262ba517..e75cb135db02 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1235,8 +1235,20 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
-				if (!(vma->vm_flags & VM_SOFTDIRTY))
+				struct vm_area_struct *tmp;
+				if (!(vma->vm_flags & VM_SOFTDIRTY)) {
+					inc_wrprotect_tlb_flush_pending(vma);
 					continue;
+				}
+
+				/*
+				 * Rollback wrprotect_tlb_flush_pending before
+				 * releasing the mmap_lock.
+				 */
+				for (vma = mm->mmap; vma != tmp;
+				     vma = vma->vm_next)
+					dec_wrprotect_tlb_flush_pending(vma);
+
 				mmap_read_unlock(mm);
 				if (mmap_write_lock_killable(mm)) {
 					count = -EINTR;
@@ -1245,6 +1257,7 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
+					inc_wrprotect_tlb_flush_pending(vma);
 				}
 				mmap_write_downgrade(mm);
 				break;
@@ -1260,6 +1273,8 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
+			for (vma = mm->mmap; vma; vma = vma->vm_next)
+				dec_wrprotect_tlb_flush_pending(vma);
 			dec_tlb_flush_pending(mm);
 		}
 		mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..caa1d9a71cb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,51 @@  unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+/*
+ * NOTE: the mmap_lock must be held and it cannot be released at any
+ * time in between inc_wrprotect_tlb_flush_pending() and
+ * dec_wrprotect_tlb_flush_pending().
+ *
+ * This counter has to be elevated before taking the PT-lock to
+ * wrprotect pagetables, if the TLB isn't flushed before the
+ * PT-unlock.
+ *
+ * The only reader is the page fault so this has to be elevated (in
+ * addition to the mm->tlb_flush_pending) only when the mmap_read_lock
+ * is taken instead of the mmap_write_lock (otherwise the page fault
+ * couldn't run concurrently in the first place).
+ *
+ * This doesn't need to be elevated when clearing pagetables even if
+ * only holding the mmap_read_lock (as in MADV_DONTNEED). The page
+ * fault doesn't risk accessing the data of the page that is still
+ * under tlb-gather deferred flushing, if the pagetable is none,
+ * because the pagetable doesn't point to it anymore.
+ *
+ * This counter is read more specifically by the page fault when it
+ * has to issue a COW that doesn't result in page re-use because of
+ * the lack of stability of the page_count (vs speculative pagecache
+ * lookups) or because a GUP pin exists on an exclusive and writable
+ * anon page.
+ *
+ * If this counter is elevated and the pagetable is wrprotected (as
+ * in !pte/pmd_write) and present, it means the page may still be
+ * modified by userland through a stale TLB entry that was
+ * instantiated before the wrprotection. In such case the COW fault,
+ * if it decides not to re-use the page, will have to either wait for
+ * this counter to return zero, or flush the TLB before proceeding to
+ * copy the page.
+ */
+static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_inc(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) <= 0);
+}
+
+static inline void dec_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_dec(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..e3f412c43c30 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -369,6 +369,11 @@  struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	/*
+	 * When elevated, it indicates that a deferred TLB flush may
+	 * be pending after a pagetable write protection on this vma.
+	 */
+	atomic_t wrprotect_tlb_flush_pending;
 } __randomize_layout;
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index 37720a6d04ea..7a658c608f3a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -365,6 +365,7 @@  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
+		VM_WARN_ON_ONCE(atomic_read(&new->wrprotect_tlb_flush_pending));
 	}
 	return new;
 }
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..e8e407443119 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2809,6 +2809,73 @@  static inline void wp_page_reuse(struct vm_fault *vmf)
 	count_vm_event(PGREUSE);
 }
 
+/*
+ * This synchronization helper, by the time it returns, has to enforce
+ * there cannot be stale writable TLB entries left, on any page mapped
+ * as wrprotected in the pagetables in this vma range.
+ *
+ * This is normally used only by the COW fault immediately before
+ * copying the page to make it proof against stale TLB entries (as the
+ * one pictured below in CPU 2).
+ *
+ *  CPU 0		CPU 1			CPU 2
+ *  -----		-----			-----
+ *						writable TLB instantiated
+ *			mmap_lock_read
+ *			inc_wrprotect_tlb_flush_pending()
+ *			PT_lock
+ *			wrprotect the pte
+ *			PT unlock
+ *  mmap_lock_read
+ *  PT_lock
+ *  vmf->orig_pte = pte
+ *  do_wp_page()
+ *  PT_unlock
+ *  wp_page_copy()
+ *  sync_wrprotect_tlb_flush_pending()
+ *  found wrprotect_tlb_flush_pending elevated
+ *  flush_tlb_page()
+ *						writable TLB invalidated [1]
+ *  kret of sync_wrprotect_tlb_flush_pending()
+ *  cow_user_page() [2]
+ *
+ * The whole objective of the wrprotect_tlb_flush_pending atomic
+ * counter is to enforce [1] happens before [2] in the above sequence.
+ *
+ * Without having to alter the caller of this helper, it'd also be
+ * possible to replace the flush_tlb_page with a wait for the
+ * wrprotect_tlb_flush_pending counter to return zero using the same
+ * logic as above. In such case the point [1] would be replaced by
+ * dec_wrprotect_tlb_flush_pending() happening in CPU 1.
+ *
+ * In terms of memory ordering guarantees: all page payload reads of
+ * a page mapped by a wrprotected pagetable, executed after this
+ * function returns, must not be allowed to be reordered before the
+ * read of the wrprotect_tlb_flush_pending atomic counter at the start
+ * of the function. So this function has to provide acquire semantics
+ * to those page payload reads.
+ */
+static __always_inline
+void sync_wrprotect_tlb_flush_pending(struct vm_area_struct *vma,
+				      unsigned long address)
+{
+	int val = atomic_read(&vma->wrprotect_tlb_flush_pending);
+	if (val) {
+		/*
+		 * flush_tlb_page() needs to deliver acquire semantics
+		 * implicitly. Archs using IPIs to flush remote TLBs
+		 * provide those with csd_lock_wait().
+		 */
+		flush_tlb_page(vma, address);
+	} else {
+		/*
+		 * Prevent the read of the wrprotect page payload to be
+		 * reordered before the above atomic_read().
+		 */
+		smp_rmb();
+	}
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2849,6 +2916,8 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 
+		sync_wrprotect_tlb_flush_pending(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..6b7a52662de8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -335,6 +335,8 @@  static unsigned long change_protection_range(struct vm_area_struct *vma,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		inc_wrprotect_tlb_flush_pending(vma);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -346,6 +348,8 @@  static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		dec_wrprotect_tlb_flush_pending(vma);
 	dec_tlb_flush_pending(mm);
 
 	return pages;