[0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

Message ID: 20230726214103.3261108-1-jannh@google.com

Message

Jann Horn July 26, 2023, 9:41 p.m. UTC
Hi!

Patch 1 here is a straightforward fix for a race in per-VMA locking code
that can lead to use-after-free; I hope we can get this one into
mainline and stable quickly.

Patch 2 is a fix for what I believe is a longstanding memory ordering
issue in how vma->anon_vma is used across the MM subsystem; I expect
that this one will have to go through a few iterations of review and
potentially rewrites, because memory ordering is tricky.
(If someone else wants to take over patch 2, I would be very happy.)

These patches don't really belong together all that much, I'm just
sending them as a series because they'd otherwise conflict.

I am CCing:

 - Suren because patch 1 touches his code
 - Matthew Wilcox because he is also currently working on per-VMA
   locking stuff
 - all the maintainers/reviewers for the Kernel Memory Consistency Model
   so they can help figure out the READ_ONCE() vs smp_load_acquire()
   thing
 - people involved in the previous discussion on the security list


Jann Horn (2):
  mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
  mm: Fix anon_vma memory ordering

 include/linux/rmap.h | 15 ++++++++++++++-
 mm/huge_memory.c     |  4 +++-
 mm/khugepaged.c      |  2 +-
 mm/ksm.c             | 16 +++++++++++-----
 mm/memory.c          | 32 ++++++++++++++++++++------------
 mm/mmap.c            | 13 ++++++++++---
 mm/rmap.c            |  6 ++++--
 mm/swapfile.c        |  3 ++-
 8 files changed, 65 insertions(+), 26 deletions(-)


base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f

Comments

Paul E. McKenney July 26, 2023, 11:19 p.m. UTC | #1
On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> Hi!
> 
> Patch 1 here is a straightforward fix for a race in per-VMA locking code
> that can lead to use-after-free; I hope we can get this one into
> mainline and stable quickly.
> 
> Patch 2 is a fix for what I believe is a longstanding memory ordering
> issue in how vma->anon_vma is used across the MM subsystem; I expect
> that this one will have to go through a few iterations of review and
> potentially rewrites, because memory ordering is tricky.
> (If someone else wants to take over patch 2, I would be very happy.)
> 
> These patches don't really belong together all that much, I'm just
> sending them as a series because they'd otherwise conflict.
> 
> I am CCing:
> 
>  - Suren because patch 1 touches his code
>  - Matthew Wilcox because he is also currently working on per-VMA
>    locking stuff
>  - all the maintainers/reviewers for the Kernel Memory Consistency Model
>    so they can help figure out the READ_ONCE() vs smp_load_acquire()
>    thing

READ_ONCE() has weaker ordering properties than smp_load_acquire().

For example, given a pointer gp:

	p = whichever(gp);
	a = 1;
	r1 = p->b;
	if ((uintptr_t)p & 0x1)
		WRITE_ONCE(b, 1);
	WRITE_ONCE(c, 1);

Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
"READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
ordered after the load from gp (the former due to an address dependency
and the latter due to a (fragile) control dependency).  The compiler
is within its rights to reorder the store to "a" to precede the load
from gp.  The compiler is forbidden from reordering the store to "c"
with the load from gp (because both are volatile accesses), but the CPU
is completely within its rights to do this reordering.

But if "whichever" is "smp_load_acquire()", all four of the subsequent
memory accesses are ordered after the load from gp.

Similarly, for WRITE_ONCE() and smp_store_release():

	p = READ_ONCE(gp);
	r1 = READ_ONCE(gi);
	r2 = READ_ONCE(gj);
	a = 1;
	WRITE_ONCE(b, 1);
	if (r1 & 0x1)
		whichever(p->q, r2);

Again leaving aside the "&" needed by smp_store_release(), if "whichever"
is WRITE_ONCE(), then the load from gp, the load from gi, and the load
from gj are all ordered before the store to p->q (by address dependency,
control dependency, and data dependency, respectively).  The store to "a"
can be reordered with the store to p->q by the compiler.  The store to
"b" cannot be reordered with the store to p->q by the compiler (again,
both are volatile), but the CPU is free to reorder them, especially when
whichever() is implemented as a conditional store.

But if "whichever" is "smp_store_release()", all five of the earlier
memory accesses are ordered before the store to p->q.
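
To make the release/acquire pairing concrete, here is a minimal
two-thread sketch (hypothetical globals, not code from this series):

	struct foo {
		int b;
	};

	int gdata;
	struct foo *gp;

	/* Writer: publish a fully initialized structure. */
	void writer(struct foo *p)
	{
		p->b = 1;
		gdata = 42;
		/* Orders both stores above before the store to gp. */
		smp_store_release(&gp, p);
	}

	/* Reader: seeing the published pointer implies seeing its setup. */
	int reader(void)
	{
		/* Orders everything below after the load from gp. */
		struct foo *p = smp_load_acquire(&gp);

		if (!p)
			return -1;
		return p->b + gdata;	/* sees 1 and 42 */
	}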

Does that help, or am I missing the point of your question?

							Thanx, Paul

>  - people involved in the previous discussion on the security list
> 
> 
> Jann Horn (2):
>   mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
>   mm: Fix anon_vma memory ordering
> 
>  include/linux/rmap.h | 15 ++++++++++++++-
>  mm/huge_memory.c     |  4 +++-
>  mm/khugepaged.c      |  2 +-
>  mm/ksm.c             | 16 +++++++++++-----
>  mm/memory.c          | 32 ++++++++++++++++++++------------
>  mm/mmap.c            | 13 ++++++++++---
>  mm/rmap.c            |  6 ++++--
>  mm/swapfile.c        |  3 ++-
>  8 files changed, 65 insertions(+), 26 deletions(-)
> 
> 
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> -- 
> 2.41.0.487.g6d72f3e995-goog
>
Jann Horn July 27, 2023, 2:39 p.m. UTC | #2
On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> > Hi!
> >
> > Patch 1 here is a straightforward fix for a race in per-VMA locking code
> > that can lead to use-after-free; I hope we can get this one into
> > mainline and stable quickly.
> >
> > Patch 2 is a fix for what I believe is a longstanding memory ordering
> > issue in how vma->anon_vma is used across the MM subsystem; I expect
> > that this one will have to go through a few iterations of review and
> > potentially rewrites, because memory ordering is tricky.
> > (If someone else wants to take over patch 2, I would be very happy.)
> >
> > These patches don't really belong together all that much, I'm just
> > sending them as a series because they'd otherwise conflict.
> >
> > I am CCing:
> >
> >  - Suren because patch 1 touches his code
> >  - Matthew Wilcox because he is also currently working on per-VMA
> >    locking stuff
> >  - all the maintainers/reviewers for the Kernel Memory Consistency Model
> >    so they can help figure out the READ_ONCE() vs smp_load_acquire()
> >    thing
>
> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>
> For example, given a pointer gp:
>
>         p = whichever(gp);
>         a = 1;
>         r1 = p->b;
>         if ((uintptr_t)p & 0x1)
>                 WRITE_ONCE(b, 1);
>         WRITE_ONCE(c, 1);
>
> Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
> "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
> ordered after the load from gp (the former due to an address dependency
> and the latter due to a (fragile) control dependency).  The compiler
> is within its rights to reorder the store to "a" to precede the load
> from gp.  The compiler is forbidden from reordering the store to "c"
> with the load from gp (because both are volatile accesses), but the CPU
> is completely within its rights to do this reordering.
>
> But if "whichever" is "smp_load_acquire()", all four of the subsequent
> memory accesses are ordered after the load from gp.
>
> Similarly, for WRITE_ONCE() and smp_store_release():
>
>         p = READ_ONCE(gp);
>         r1 = READ_ONCE(gi);
>         r2 = READ_ONCE(gj);
>         a = 1;
>         WRITE_ONCE(b, 1);
>         if (r1 & 0x1)
>                 whichever(p->q, r2);
>
> Again leaving aside the "&" needed by smp_store_release(), if "whichever"
> is WRITE_ONCE(), then the load from gp, the load from gi, and the load
> from gj are all ordered before the store to p->q (by address dependency,
> control dependency, and data dependency, respectively).  The store to "a"
> can be reordered with the store to p->q by the compiler.  The store to
> "b" cannot be reordered with the store to p->q by the compiler (again,
> both are volatile), but the CPU is free to reorder them, especially when
> whichever() is implemented as a conditional store.
>
> But if "whichever" is "smp_store_release()", all five of the earlier
> memory accesses are ordered before the store to p->q.
>
> Does that help, or am I missing the point of your question?

My main question is how permissible/ugly you think the following use
of READ_ONCE() would be, and whether you think it ought to be an
smp_load_acquire() instead.

Assume that we are holding some kind of lock that ensures that the
only possible concurrent update to "vma->anon_vma" is that it changes
from a NULL pointer to a non-NULL pointer (using smp_store_release()).


if (READ_ONCE(vma->anon_vma) != NULL) {
  // we now know that vma->anon_vma cannot change anymore

  // access the same memory location again with a plain load
  struct anon_vma *a = vma->anon_vma;

  // this needs to be address-dependency-ordered against one of
  // the loads from vma->anon_vma
  struct anon_vma *root = a->root;
}


Is this fine? If it is not fine just because the compiler might
reorder the plain load of vma->anon_vma before the READ_ONCE() load,
would it be fine after adding a barrier() directly after the
READ_ONCE()?

I initially suggested using READ_ONCE() for this, and then Linus and
I tried to reason it out and Linus suggested (if I understood him
correctly) that you could make the ugly argument that this works
because loads from the same location will not be reordered by the
hardware. So on anything other than alpha, we'd still have the
required address-dependency ordering because that happens for all
loads, even plain loads, while on alpha, the READ_ONCE() includes a
memory barrier. But that argument is weirdly reliant on
architecture-specific implementation details.

The other option is to replace the READ_ONCE() with a
smp_load_acquire(), at which point it becomes a lot simpler to show
that the code is correct.
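
For concreteness, a sketch of that second option (hypothetical helper
name, not the actual patch):

static struct anon_vma *anon_vma_root_if_set(struct vm_area_struct *vma)
{
	struct anon_vma *a = smp_load_acquire(&vma->anon_vma);

	if (!a)
		return NULL;
	/* Ordered after the acquire load; no dependency argument needed. */
	return a->root;
}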


>                                                         Thanx, Paul
>
> >  - people involved in the previous discussion on the security list
> >
> >
> > Jann Horn (2):
> >   mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
> >   mm: Fix anon_vma memory ordering
> >
> >  include/linux/rmap.h | 15 ++++++++++++++-
> >  mm/huge_memory.c     |  4 +++-
> >  mm/khugepaged.c      |  2 +-
> >  mm/ksm.c             | 16 +++++++++++-----
> >  mm/memory.c          | 32 ++++++++++++++++++++------------
> >  mm/mmap.c            | 13 ++++++++++---
> >  mm/rmap.c            |  6 ++++--
> >  mm/swapfile.c        |  3 ++-
> >  8 files changed, 65 insertions(+), 26 deletions(-)
> >
> >
> > base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> > --
> > 2.41.0.487.g6d72f3e995-goog
> >
Will Deacon July 27, 2023, 2:57 p.m. UTC | #3
On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> > > Hi!
> > >
> > > Patch 1 here is a straightforward fix for a race in per-VMA locking code
> > > that can lead to use-after-free; I hope we can get this one into
> > > mainline and stable quickly.
> > >
> > > Patch 2 is a fix for what I believe is a longstanding memory ordering
> > > issue in how vma->anon_vma is used across the MM subsystem; I expect
> > > that this one will have to go through a few iterations of review and
> > > potentially rewrites, because memory ordering is tricky.
> > > (If someone else wants to take over patch 2, I would be very happy.)
> > >
> > > These patches don't really belong together all that much, I'm just
> > > sending them as a series because they'd otherwise conflict.
> > >
> > > I am CCing:
> > >
> > >  - Suren because patch 1 touches his code
> > >  - Matthew Wilcox because he is also currently working on per-VMA
> > >    locking stuff
> > >  - all the maintainers/reviewers for the Kernel Memory Consistency Model
> > >    so they can help figure out the READ_ONCE() vs smp_load_acquire()
> > >    thing
> >
> > READ_ONCE() has weaker ordering properties than smp_load_acquire().
> >
> > For example, given a pointer gp:
> >
> >         p = whichever(gp);
> >         a = 1;
> >         r1 = p->b;
> >         if ((uintptr_t)p & 0x1)
> >                 WRITE_ONCE(b, 1);
> >         WRITE_ONCE(c, 1);
> >
> > Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
> > "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
> > ordered after the load from gp (the former due to an address dependency
> > and the latter due to a (fragile) control dependency).  The compiler
> > is within its rights to reorder the store to "a" to precede the load
> > from gp.  The compiler is forbidden from reordering the store to "c"
> > with the load from gp (because both are volatile accesses), but the CPU
> > is completely within its rights to do this reordering.
> >
> > But if "whichever" is "smp_load_acquire()", all four of the subsequent
> > memory accesses are ordered after the load from gp.
> >
> > Similarly, for WRITE_ONCE() and smp_store_release():
> >
> >         p = READ_ONCE(gp);
> >         r1 = READ_ONCE(gi);
> >         r2 = READ_ONCE(gj);
> >         a = 1;
> >         WRITE_ONCE(b, 1);
> >         if (r1 & 0x1)
> >                 whichever(p->q, r2);
> >
> > Again leaving aside the "&" needed by smp_store_release(), if "whichever"
> > is WRITE_ONCE(), then the load from gp, the load from gi, and the load
> > from gj are all ordered before the store to p->q (by address dependency,
> > control dependency, and data dependency, respectively).  The store to "a"
> > can be reordered with the store to p->q by the compiler.  The store to
> > "b" cannot be reordered with the store to p->q by the compiler (again,
> > both are volatile), but the CPU is free to reorder them, especially when
> > whichever() is implemented as a conditional store.
> >
> > But if "whichever" is "smp_store_release()", all five of the earlier
> > memory accesses are ordered before the store to p->q.
> >
> > Does that help, or am I missing the point of your question?
> 
> My main question is how permissible/ugly you think the following use
> of READ_ONCE() would be, and whether you think it ought to be an
> smp_load_acquire() instead.
> 
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> 
> 
> if (READ_ONCE(vma->anon_vma) != NULL) {
>   // we now know that vma->anon_vma cannot change anymore
> 
>   // access the same memory location again with a plain load
>   struct anon_vma *a = vma->anon_vma;
> 
>   // this needs to be address-dependency-ordered against one of
>   // the loads from vma->anon_vma
>   struct anon_vma *root = a->root;
> }
> 
> 
> Is this fine? If it is not fine just because the compiler might
> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> would it be fine after adding a barrier() directly after the
> READ_ONCE()?

I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
as I've run into cases where you have sequences such as:

	// Assume *ptr is initially 0 and somebody else writes it to 1
	// concurrently

	foo = *ptr;
	bar = READ_ONCE(*ptr);
	baz = *ptr;

and you can get foo == baz == 0 but bar == 1 because the compiler only
ends up reading from memory twice.

That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
when dereferencing pointer to pte table"), which was very unpleasant to
debug.
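
(For comparison, the pattern that avoids mixing the two is to do the
READ_ONCE() once, into a local, and only use the local afterwards,
along the lines of:

	int val = READ_ONCE(*ptr);

	foo = val;
	bar = val;
	baz = val;	/* foo == bar == baz by construction */

rather than re-reading *ptr.)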

Will
Matthew Wilcox July 27, 2023, 3:07 p.m. UTC | #4
On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> 
> 
> if (READ_ONCE(vma->anon_vma) != NULL) {
>   // we now know that vma->anon_vma cannot change anymore
> 
>   // access the same memory location again with a plain load
>   struct anon_vma *a = vma->anon_vma;
> 
>   // this needs to be address-dependency-ordered against one of
>   // the loads from vma->anon_vma
>   struct anon_vma *root = a->root;
> }
> 
> 
> Is this fine? If it is not fine just because the compiler might
> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> would it be fine after adding a barrier() directly after the
> READ_ONCE()?
> 
> I initially suggested using READ_ONCE() for this, and then Linus and
> I tried to reason it out and Linus suggested (if I understood him
> correctly) that you could make the ugly argument that this works
> because loads from the same location will not be reordered by the
> hardware. So on anything other than alpha, we'd still have the
> required address-dependency ordering because that happens for all
> loads, even plain loads, while on alpha, the READ_ONCE() includes a
> memory barrier. But that argument is weirdly reliant on
> architecture-specific implementation details.
> 
> The other option is to replace the READ_ONCE() with a
> smp_load_acquire(), at which point it becomes a lot simpler to show
> that the code is correct.

Aren't we straining at gnats here?  The context of this is handling a
page fault, and we used to take an entire rwsem for read.  I'm having
a hard time caring about "the extra expense" of an unnecessarily broad
barrier.

Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
barrier is ... tens?
Jann Horn July 27, 2023, 3:15 p.m. UTC | #5
On Thu, Jul 27, 2023 at 5:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
>
> Aren't we straining at gnats here?  The context of this is handling a
> page fault, and we used to take an entire rwsem for read.  I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
>
> Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
> barrier is ... tens?

Yeah, fair point. If it's hard to show correctness with READ_ONCE() we
can just use smp_load_acquire() and call it a day.
Alan Stern July 27, 2023, 3:44 p.m. UTC | #6
On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:

> > Assume that we are holding some kind of lock that ensures that the
> > only possible concurrent update to "vma->anon_vma" is that it changes
> > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > 
> > 
> > if (READ_ONCE(vma->anon_vma) != NULL) {
> >   // we now know that vma->anon_vma cannot change anymore
> > 
> >   // access the same memory location again with a plain load
> >   struct anon_vma *a = vma->anon_vma;
> > 
> >   // this needs to be address-dependency-ordered against one of
> >   // the loads from vma->anon_vma
> >   struct anon_vma *root = a->root;
> > }

This reads a little oddly, perhaps because it's a fragment from a larger 
piece of code.  Still, if I were doing something like this, I'd write it 
as:

struct anon_vma *a;

a = READ_ONCE(vma->anon_vma);
if (a != NULL) {
	struct anon_vma *root = a->root;
	...

thus eliminating the possibility of confusion from multiple reads of the 
same address.

In this situation, the ordering of the two reads is guaranteed by the 
address dependency.  And people shouldn't worry too much about using 
that sort of ordering; RCU relies on it critically, all the time.

> > Is this fine? If it is not fine just because the compiler might
> > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > would it be fine after adding a barrier() directly after the
> > READ_ONCE()?
> 
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
> 
> 	// Assume *ptr is initially 0 and somebody else writes it to 1
> 	// concurrently
> 
> 	foo = *ptr;
> 	bar = READ_ONCE(*ptr);
> 	baz = *ptr;
> 
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
> 
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Indeed, that's the sort of thing that can happen when plain accesses are 
involved in a race.

Alan Stern
Paul E. McKenney July 27, 2023, 4:09 p.m. UTC | #7
On Thu, Jul 27, 2023 at 04:07:32PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > Assume that we are holding some kind of lock that ensures that the
> > only possible concurrent update to "vma->anon_vma" is that it changes
> > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > 
> > 
> > if (READ_ONCE(vma->anon_vma) != NULL) {
> >   // we now know that vma->anon_vma cannot change anymore
> > 
> >   // access the same memory location again with a plain load
> >   struct anon_vma *a = vma->anon_vma;
> > 
> >   // this needs to be address-dependency-ordered against one of
> >   // the loads from vma->anon_vma
> >   struct anon_vma *root = a->root;
> > }
> > 
> > 
> > Is this fine? If it is not fine just because the compiler might
> > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > would it be fine after adding a barrier() directly after the
> > READ_ONCE()?
> > 
> > I initially suggested using READ_ONCE() for this, and then Linus and
> > I tried to reason it out and Linus suggested (if I understood him
> > correctly) that you could make the ugly argument that this works
> > because loads from the same location will not be reordered by the
> > hardware. So on anything other than alpha, we'd still have the
> > required address-dependency ordering because that happens for all
> > loads, even plain loads, while on alpha, the READ_ONCE() includes a
> > memory barrier. But that argument is weirdly reliant on
> > architecture-specific implementation details.
> > 
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
> 
> Aren't we straining at gnats here?  The context of this is handling a
> page fault, and we used to take an entire rwsem for read.  I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
> 
> Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
> barrier is ... tens?

Couldn't agree more!

							Thanx, Paul
Jann Horn July 27, 2023, 4:10 p.m. UTC | #8
On Thu, Jul 27, 2023 at 5:44 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>
> > > Assume that we are holding some kind of lock that ensures that the
> > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > >
> > >
> > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > >   // we now know that vma->anon_vma cannot change anymore
> > >
> > >   // access the same memory location again with a plain load
> > >   struct anon_vma *a = vma->anon_vma;
> > >
> > >   // this needs to be address-dependency-ordered against one of
> > >   // the loads from vma->anon_vma
> > >   struct anon_vma *root = a->root;
> > > }
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
a helper used to ensure that a VMA is associated with an anon_vma, and
then the vma->anon_vma is used further down inside the fault handling
path. Something like:

do_cow_fault
  anon_vma_prepare
    READ_ONCE(vma->anon_vma)
    barrier()
  finish_fault
    do_set_pte
      page_add_new_anon_rmap
        folio_add_new_anon_rmap
          __page_set_anon_rmap
            [reads vma->anon_vma]

Anyway, I guess I'll follow what Paul and Matthew said and go with
smp_load_acquire().
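
Roughly, and not necessarily what the final patch will look like, the
check in anon_vma_prepare() would then become something like:

static inline int anon_vma_prepare(struct vm_area_struct *vma)
{
	/*
	 * Pairs with an smp_store_release() installing vma->anon_vma;
	 * orders the later dereferences in the fault path after this load.
	 */
	if (likely(smp_load_acquire(&vma->anon_vma)))
		return 0;

	return __anon_vma_prepare(vma);
}

with the store side converted to smp_store_release() accordingly.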
Paul E. McKenney July 27, 2023, 4:16 p.m. UTC | #9
On Thu, Jul 27, 2023 at 11:44:02AM -0400, Alan Stern wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> 
> > > Assume that we are holding some kind of lock that ensures that the
> > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > > 
> > > 
> > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > >   // we now know that vma->anon_vma cannot change anymore
> > > 
> > >   // access the same memory location again with a plain load
> > >   struct anon_vma *a = vma->anon_vma;
> > > 
> > >   // this needs to be address-dependency-ordered against one of
> > >   // the loads from vma->anon_vma
> > >   struct anon_vma *root = a->root;
> > > }
> 
> This reads a little oddly, perhaps because it's a fragment from a larger 
> piece of code.  Still, if I were doing something like this, I'd write it 
> as:
> 
> struct anon_vma *a;
> 
> a = READ_ONCE(vma->anon_vma);
> if (a != NULL) {
> 	struct anon_vma *root = a->root;
> 	...
> 
> thus eliminating the possibility of confusion from multiple reads of the 
> same address.
> 
> In this situation, the ordering of the two reads is guaranteed by the 
> address dependency.  And people shouldn't worry too much about using 
> that sort of ordering; RCU relies on it critically, all the time.

Agreed.  In contrast, control dependencies require quite a bit more care
and feeding, and are usually best avoided.

But even with the normal RCU address/data dependencies, it is possible
to get in trouble.  For but one example, comparing a pointer obtained
from rcu_dereference() to the address of a static structure is a good
way to break your address dependency.  (Just yesterday evening I talked
to someone who had spent quite a bit of time chasing one of these down,
so yes, this is quite real.)
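
Something along these lines (hypothetical names):

	struct foo { int a; };

	static struct foo default_foo;
	struct foo __rcu *gfp;

	int reader(void)
	{
		struct foo *p;
		int val;

		rcu_read_lock();
		p = rcu_dereference(gfp);
		if (p == &default_foo) {
			/*
			 * The compiler may substitute &default_foo for p
			 * here, so this load no longer carries an address
			 * dependency on the rcu_dereference() above.
			 */
			val = p->a;
		} else {
			val = p->a;
		}
		rcu_read_unlock();
		return val;
	}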

> > > Is this fine? If it is not fine just because the compiler might
> > > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > > would it be fine after adding a barrier() directly after the
> > > READ_ONCE()?
> > 
> > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > as I've run into cases where you have sequences such as:
> > 
> > 	// Assume *ptr is initially 0 and somebody else writes it to 1
> > 	// concurrently
> > 
> > 	foo = *ptr;
> > 	bar = READ_ONCE(*ptr);
> > 	baz = *ptr;
> > 
> > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > ends up reading from memory twice.
> > 
> > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > when dereferencing pointer to pte table"), which was very unpleasant to
> > debug.
> 
> Indeed, that's the sort of thing that can happen when plain accesses are 
> involved in a race.

Agreed.  Furthermore, it is more important to comment plain C-language
accesses to shared variables than to comment the likes of READ_ONCE().
"OK, tell me again exactly why you think the compiler cannot mess you
up here?"

							Thanx, Paul
Paul E. McKenney July 27, 2023, 4:17 p.m. UTC | #10
On Thu, Jul 27, 2023 at 06:10:12PM +0200, Jann Horn wrote:
> On Thu, Jul 27, 2023 at 5:44 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> >
> > > > Assume that we are holding some kind of lock that ensures that the
> > > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > > >
> > > >
> > > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > > >   // we now know that vma->anon_vma cannot change anymore
> > > >
> > > >   // access the same memory location again with a plain load
> > > >   struct anon_vma *a = vma->anon_vma;
> > > >
> > > >   // this needs to be address-dependency-ordered against one of
> > > >   // the loads from vma->anon_vma
> > > >   struct anon_vma *root = a->root;
> > > > }
> >
> > This reads a little oddly, perhaps because it's a fragment from a larger
> > piece of code.
> 
> Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
> a helper used to ensure that a VMA is associated with an anon_vma, and
> then the vma->anon_vma is used further down inside the fault handling
> path. Something like:
> 
> do_cow_fault
>   anon_vma_prepare
>     READ_ONCE(vma->anon_vma)
>     barrier()
>   finish_fault
>     do_set_pte
>       page_add_new_anon_rmap
>         folio_add_new_anon_rmap
>           __page_set_anon_rmap
>             [reads vma->anon_vma]
> 
> Anyway, I guess I'll follow what Paul and Matthew said and go with
> smp_load_acquire().

I thank you now, and you will thank yourself later.  ;-)

							Thanx, Paul
Linus Torvalds July 27, 2023, 5:11 p.m. UTC | #11
On Thu, 27 Jul 2023 at 08:44, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes. As Jann already said, this is basically a preparatory step in a
much longer sequence, and the code simply wants to make sure that any
later code (possibly quite a bit later) will not see a NULL value.

I do believe it happens to be safe to use READ_ONCE() for a number of
reasons, but I will argue that we should *never* use a bare READ_ONCE
if there is any actual real question about what any memory ordering
might be.

And yes, the RCU code obviously does use READ_ONCE(), so that's why I
say "bare" - the RCU code wraps it in helper accessors with strict
semantics.

The reason I think it would be technically *safe* to do here is that

 (a) code like this:

        if (READ_ONCE(..))

     may end up having the conditional speculated by the CPU, and have
the actual read done without any ordering by the CPU, but we do have
*one* guarantee: the memory load instruction will be before the
conditional branch (or whatever) in the instruction stream.

     So the code may be *executed* out of order, but we know the
memory load can not be moved after the conditional (whatever form that
conditional then takes) by a compiler.

     (We do have various barriers like "rcu_read_unlock()" that
guarantee that the READ_ONCE() couldn't be moved much later even in
the absence of the conditional, but we can ignore that).

 (b) the later use of the anon_vma (that depends on the value being
stable) is *so* much later, and behind things that the compiler sees
as barriers (again, that rcu_read_unlock() acts at a minimum as a
compiler barrier) that any subsequent use would not have its load
moved up to before the READ_ONCE() in the instruction stream.

     Again, this is purely a "location in the instruction stream"
ordering argument, not a "execution order" ordering argument.

And finally

 (c) I do think that all our architectures have the rule that when
you read from the *same* memory location from the same CPU, the
accesses are ordered.

Now, I didn't actually find (c) spelled out anywhere, but even alpha -
the worst memory ordering ever devised - had that particular rule (the
alpha architecture manual calls it "Location Access Order").

Now, with that said, I did argue to Jann that we should use
smp_store_release and smp_load_acquire pairs here, because honestly,
the above (a)-(c) argument is too damn subtle for me, and I don't
think this particular use requires it.

With smp_load_acquire(), you don't need any of the (a)-(c) rules. The
rule is "the load is done before any subsequent memory operations".
End of story.

So while I do think READ_ONCE() is sufficient here, I actually think
that once you start going down that path you can argue that
READ_ONCE() is actually entirely unnecessary, because we also have
other ordering rules that mean that the compiler can't really do
anything else even *without* the READ_ONCE().

End result: I can trivially extend the (a)-(c) to say "even
READ_ONCE() isn't strictly necessary here, because even any access
tearing - which won't happen anyway - wouldn't actually change the
result".

So if we want to make it *obvious* that it's safe, we should use
smp_load_acquire().

And if we do find that there are situations where we care so much
about the (generally fairly cheap) possible additional
synchronization, and we really want to use READ_ONCE() rather than
smp_load_acquire(), I'd rather try to wrap a READ_ONCE in some kind of
better abstraction (maybe make the conditional part of the operation,
and make it clear that you are doing a "one-time test which depends on
the same-location rule").

Do we even have the same-location rule in the LKMM?

                       Linus
Alan Stern July 27, 2023, 5:41 p.m. UTC | #12
On Thu, Jul 27, 2023 at 10:11:29AM -0700, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 08:44, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > This reads a little oddly, perhaps because it's a fragment from a larger
> > piece of code.
> 
> Yes. As Jann already said, this is basically a preparatory step in a
> much longer sequence, and the code simply wants to make sure that any
> later code (possibly quite a bit later) will not see a NULL value.

...

> Do we even have the same-location rule in the LKMM?

Yes.  The comment in the source file calls it "Sequential Consistency 
Per Variable", under the category of "Fundamental coherence ordering".  
It applies even to plain accesses, not just to READ_ONCE or stronger.

But in the presence of data races (as in the example that Will posted 
earlier), all bets are off.  So if you want to use a plain access rather 
than READ_ONCE, you need to be certain that it won't race with anything.

Alan Stern
Linus Torvalds July 27, 2023, 6:01 p.m. UTC | #13
On Thu, 27 Jul 2023 at 10:41, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> But in the presence of data races (as in the example that Will posted
> earlier), all bets are off.  So if you want to use a plain access rather
> than READ_ONCE, you need to be certain that it won't race with anything.

So in this case, the initial NULL check really is just checking for
"has the smp_store_release() happened". The reason even tearing
wouldn't matter is that seeing *any* value other than all-zeroes (aka
NULL) is already sufficient information.

Because once the smp_store_release() has been seen by this CPU, the
data race no longer exists, and the value is stable.

Which is why I argue that even without READ_ONCE() the code would
*work* due to all the circumstances around it (one of them is that we
just took a lock, after doing an optimistic check that really has no
semantic effect at all except for the "don't even take the lock if it
looks like things are going to fail").

But if we want to have the code be obvious, and not have to refer to
those kinds of arguments, I think smp_load_acquire() is the only
actual "obvious" thing to use. At that point it's no longer some chain
of "because of X, Y and Z".

              Linus
Nadav Amit July 27, 2023, 7:05 p.m. UTC | #14
> On Jul 27, 2023, at 7:57 AM, Will Deacon <will@kernel.org> wrote:
> 
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
>>>> Hi!
>>>> 
>>>> Patch 1 here is a straightforward fix for a race in per-VMA locking code
>>>> that can lead to use-after-free; I hope we can get this one into
>>>> mainline and stable quickly.
>>>> 
>>>> Patch 2 is a fix for what I believe is a longstanding memory ordering
>>>> issue in how vma->anon_vma is used across the MM subsystem; I expect
>>>> that this one will have to go through a few iterations of review and
>>>> potentially rewrites, because memory ordering is tricky.
>>>> (If someone else wants to take over patch 2, I would be very happy.)
>>>> 
>>>> These patches don't really belong together all that much, I'm just
>>>> sending them as a series because they'd otherwise conflict.
>>>> 
>>>> I am CCing:
>>>> 
>>>> - Suren because patch 1 touches his code
>>>> - Matthew Wilcox because he is also currently working on per-VMA
>>>>   locking stuff
>>>> - all the maintainers/reviewers for the Kernel Memory Consistency Model
>>>>   so they can help figure out the READ_ONCE() vs smp_load_acquire()
>>>>   thing
>>> 
>>> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>>> 
>>> For example, given a pointer gp:
>>> 
>>>        p = whichever(gp);
>>>        a = 1;
>>>        r1 = p->b;
>>>        if ((uintptr_t)p & 0x1)
>>>                WRITE_ONCE(b, 1);
>>>        WRITE_ONCE(c, 1);
>>> 
>>> Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
>>> "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
>>> ordered after the load from gp (the former due to an address dependency
>>> and the latter due to a (fragile) control dependency).  The compiler
>>> is within its rights to reorder the store to "a" to precede the load
>>> from gp.  The compiler is forbidden from reordering the store to "c"
>>> with the load from gp (because both are volatile accesses), but the CPU
>>> is completely within its rights to do this reordering.
>>> 
>>> But if "whichever" is "smp_load_acquire()", all four of the subsequent
>>> memory accesses are ordered after the load from gp.
>>> 
>>> Similarly, for WRITE_ONCE() and smp_store_release():
>>> 
>>>        p = READ_ONCE(gp);
>>>        r1 = READ_ONCE(gi);
>>>        r2 = READ_ONCE(gj);
>>>        a = 1;
>>>        WRITE_ONCE(b, 1);
>>>        if (r1 & 0x1)
>>>                whichever(p->q, r2);
>>> 
>>> Again leaving aside the "&" needed by smp_store_release(), if "whichever"
>>> is WRITE_ONCE(), then the load from gp, the load from gi, and the load
>>> from gj are all ordered before the store to p->q (by address dependency,
>>> control dependency, and data dependency, respectively).  The store to "a"
>>> can be reordered with the store to p->q by the compiler.  The store to
>>> "b" cannot be reordered with the store to p->q by the compiler (again,
>>> both are volatile), but the CPU is free to reorder them, especially when
>>> whichever() is implemented as a conditional store.
>>> 
>>> But if "whichever" is "smp_store_release()", all five of the earlier
>>> memory accesses are ordered before the store to p->q.
>>> 
>>> Does that help, or am I missing the point of your question?
>> 
>> My main question is how permissible/ugly you think the following use
>> of READ_ONCE() would be, and whether you think it ought to be an
>> smp_load_acquire() instead.
>> 
>> Assume that we are holding some kind of lock that ensures that the
>> only possible concurrent update to "vma->anon_vma" is that it changes
>> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>> 
>> 
>> if (READ_ONCE(vma->anon_vma) != NULL) {
>>  // we now know that vma->anon_vma cannot change anymore
>> 
>>  // access the same memory location again with a plain load
>>  struct anon_vma *a = vma->anon_vma;
>> 
>>  // this needs to be address-dependency-ordered against one of
>>  // the loads from vma->anon_vma
>>  struct anon_vma *root = a->root;
>> }
>> 
>> 
>> Is this fine? If it is not fine just because the compiler might
>> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
>> would it be fine after adding a barrier() directly after the
>> READ_ONCE()?
> 
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
> 
> // Assume *ptr is initially 0 and somebody else writes it to 1
> // concurrently
> 
> foo = *ptr;
> bar = READ_ONCE(*ptr);
> baz = *ptr;
> 
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
> 
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Interesting. I wonder if you considered adding to READ_ONCE() something
like:

	asm volatile("" : "+g" (x) );

So later loads (such as baz = *ptr) would reload the updated value.
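
For example (a standalone sketch, not the READ_ONCE() macro itself):

int demo(int *ptr)
{
	int foo = *ptr;

	/*
	 * The empty asm claims to read and possibly modify *ptr, so the
	 * compiler can no longer assume *ptr still holds the value it
	 * loaded into foo; unlike barrier(), only this one object is
	 * affected.
	 */
	asm volatile("" : "+g" (*ptr));

	return foo + *ptr;
}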
Linus Torvalds July 27, 2023, 7:39 p.m. UTC | #15
On Thu, 27 Jul 2023 at 12:10, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Interesting. I wonder if you considered adding to READ_ONCE() something
> like:
>
>         asm volatile("" : "+g" (x) );
>
> So later loads (such as baz = *ptr) would reload the updated value.

Not necessarily a bad idea.  Although I suspect you'd want to add
*two* of them - on either side - to make sure any previous loads
wouldn't be moved around it either.

                 Linus
Nadav Amit July 27, 2023, 8:11 p.m. UTC | #16
> On Jul 27, 2023, at 12:39 PM, Linus Torvalds <torvalds@linuxfoundation.org> wrote:
> 
> On Thu, 27 Jul 2023 at 12:10, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> Interesting. I wonder if you considered adding to READ_ONCE() something
>> like:
>> 
>>        asm volatile("" : "+g" (x) );
>> 
>> So later loads (such as baz = *ptr) would reload the updated value.
> 
> Not necessarily a bad idea.  Although I suspect you'd want to add
> *two* of them - on either side - to make sure any previous loads
> wouldn't be moved around it either.

You are right, two are needed.

I’ll give it a shot and see if I see changes to the binary.
Nadav Amit July 28, 2023, 9:18 a.m. UTC | #17
> On Jul 27, 2023, at 1:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
> 
>> On Jul 27, 2023, at 12:39 PM, Linus Torvalds <torvalds@linuxfoundation.org> wrote:
>> 
>> On Thu, 27 Jul 2023 at 12:10, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> Interesting. I wonder if you considered adding to READ_ONCE() something
>>> like:
>>> 
>>>       asm volatile("" : "+g" (x) );
>>> 
>>> So later loads (such as baz = *ptr) would reload the updated value.
>> 
>> Not necessarily a bad idea.  Although I suspect you'd want to add
>> *two* of them - on either side - to make sure any previous loads
>> wouldn't be moved around it either.
> 
> You are right, two are needed.
> 
> I’ll give it a shot and see if I see changes to the binary.

Just for the record (so nobody will make my mistakes):

I implemented it, but it works poorly. It appears to communicate the
constraints but the generated code is much worse.

The problem is that if the GCC inline asm decides to use a memory operand
(e.g. “+m”), it computes the address (uses LEA first before the MOV) and
allocates a register for the address. Using “+g” also causes the compiler
to allocate a register.

-- >8 --

---
 include/asm-generic/rwonce.h | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwonce.h b/include/asm-generic/rwonce.h
index 8d0a6280e982..c6a2f8e3013c 100644
--- a/include/asm-generic/rwonce.h
+++ b/include/asm-generic/rwonce.h
@@ -44,10 +44,34 @@
 #define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
 #endif
 
-#define READ_ONCE(x)							\
-({									\
-	compiletime_assert_rwonce_type(x);				\
-	__READ_ONCE(x);							\
+#define as_scalar(x)							\
+    __builtin_choose_expr(sizeof(x) == sizeof(char),  			\
+				(__force char *)&(x), 			\
+    __builtin_choose_expr(sizeof(x) == sizeof(short),			\
+				(__force short *)&(x),			\
+    __builtin_choose_expr(sizeof(x) == sizeof(int),			\
+				(__force int *)&(x), 			\
+    __builtin_choose_expr(sizeof(x) == sizeof(long),			\
+				(__force long *)&(x),			\
+    __builtin_choose_expr(sizeof(x) == sizeof(long long),		\
+				(__force long long *)&(x),		\
+    (void*)0)))))
+
+#define READ_ONCE(x)                                                    \
+({                                                                      \
+    typeof(x) * px = &(x);						\
+    union {								\
+        typeof(x) __orig;						\
+        typeof(*as_scalar(x)) __alias;					\
+    } __u;								\
+									\
+    compiletime_assert_rwonce_type(x);					\
+    asm volatile ("" :							\
+               "+g" (*(__force typeof(*as_scalar(x)) *)(px)));		\
+    __u.__alias = __READ_ONCE(*as_scalar(*px));				\
+    asm volatile ("" :							\
+               "+g" (*(__force typeof(*as_scalar(x)) *)(px)));		\
+    __u.__orig;								\
 })
 
 #define __WRITE_ONCE(x, val)						\