Message ID: 20230726214103.3261108-1-jannh@google.com (mailing list archive)
Series: fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering
On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> Hi!
>
> Patch 1 here is a straightforward fix for a race in per-VMA locking code
> that can lead to use-after-free; I hope we can get this one into
> mainline and stable quickly.
>
> Patch 2 is a fix for what I believe is a longstanding memory ordering
> issue in how vma->anon_vma is used across the MM subsystem; I expect
> that this one will have to go through a few iterations of review and
> potentially rewrites, because memory ordering is tricky.
> (If someone else wants to take over patch 2, I would be very happy.)
>
> These patches don't really belong together all that much, I'm just
> sending them as a series because they'd otherwise conflict.
>
> I am CCing:
>
> - Suren because patch 1 touches his code
> - Matthew Wilcox because he is also currently working on per-VMA
>   locking stuff
> - all the maintainers/reviewers for the Kernel Memory Consistency Model
>   so they can help figure out the READ_ONCE() vs smp_load_acquire()
>   thing

READ_ONCE() has weaker ordering properties than smp_load_acquire().

For example, given a pointer gp:

	p = whichever(gp);
	a = 1;
	r1 = p->b;
	if ((uintptr_t)p & 0x1)
		WRITE_ONCE(b, 1);
	WRITE_ONCE(c, 1);

Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
"READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
ordered after the load from gp (the former due to an address dependency
and the latter due to a (fragile) control dependency).  The compiler
is within its rights to reorder the store to "a" to precede the load
from gp.  The compiler is forbidden from reordering the store to "c"
with the load from gp (because both are volatile accesses), but the CPU
is completely within its rights to do this reordering.

But if "whichever" is "smp_load_acquire()", all four of the subsequent
memory accesses are ordered after the load from gp.

Similarly, for WRITE_ONCE() and smp_store_release():

	p = READ_ONCE(gp);
	r1 = READ_ONCE(gi);
	r2 = READ_ONCE(gj);
	a = 1;
	WRITE_ONCE(b, 1);
	if (r1 & 0x1)
		whichever(p->q, r2);

Again leaving aside the "&" needed by smp_store_release(), if "whichever"
is WRITE_ONCE(), then the load from gp, the load from gi, and the load
from gj are all ordered before the store to p->q (by address dependency,
control dependency, and data dependency, respectively).  The store to "a"
can be reordered with the store to p->q by the compiler.  The store to
"b" cannot be reordered with the store to p->q by the compiler (again,
both are volatile), but the CPU is free to reorder them, especially when
whichever() is implemented as a conditional store.

But if "whichever" is "smp_store_release()", all five of the earlier
memory accesses are ordered before the store to p->q.

Does that help, or am I missing the point of your question?

							Thanx, Paul

> - people involved in the previous discussion on the security list
>
>
> Jann Horn (2):
>   mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
>   mm: Fix anon_vma memory ordering
>
>  include/linux/rmap.h | 15 ++++++++++++++-
>  mm/huge_memory.c     |  4 +++-
>  mm/khugepaged.c      |  2 +-
>  mm/ksm.c             | 16 +++++++++++-----
>  mm/memory.c          | 32 ++++++++++++++++++++------------
>  mm/mmap.c            | 13 ++++++++++---
>  mm/rmap.c            |  6 ++++--
>  mm/swapfile.c        |  3 ++-
>  8 files changed, 65 insertions(+), 26 deletions(-)
>
>
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> --
> 2.41.0.487.g6d72f3e995-goog
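Condensed into the pointer-publication pattern that the rest of the
thread revolves around, the contrast looks like this (a minimal
kernel-style sketch; publish() and consume() are hypothetical helpers,
not code from the series):

	struct foo {
		int b;
	};

	struct foo *gp;	/* shared; NULL until published */

	/* Writer: initialize first, then publish with release semantics,
	 * so the store to f->b is ordered before gp becomes non-NULL. */
	void publish(struct foo *f)
	{
		f->b = 42;
		smp_store_release(&gp, f);
	}

	/* Reader: the acquire load orders *all* later accesses after the
	 * load from gp, independent accesses included; READ_ONCE() would
	 * give only the address/control-dependency ordering described
	 * above. */
	int consume(void)
	{
		struct foo *p = smp_load_acquire(&gp);

		if (!p)
			return -1;
		return p->b;	/* sees 42 whenever p is non-NULL */
	}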
On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> > [...]
> > - all the maintainers/reviewers for the Kernel Memory Consistency Model
> >   so they can help figure out the READ_ONCE() vs smp_load_acquire()
> >   thing
>
> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>
> [... examples contrasting READ_ONCE()/WRITE_ONCE() with
> smp_load_acquire()/smp_store_release() snipped ...]
>
> Does that help, or am I missing the point of your question?

My main question is how permissible/ugly you think the following use
of READ_ONCE() would be, and whether you think it ought to be an
smp_load_acquire() instead.

Assume that we are holding some kind of lock that ensures that the
only possible concurrent update to "vma->anon_vma" is that it changes
from a NULL pointer to a non-NULL pointer (using smp_store_release()).

if (READ_ONCE(vma->anon_vma) != NULL) {
	// we now know that vma->anon_vma cannot change anymore

	// access the same memory location again with a plain load
	struct anon_vma *a = vma->anon_vma;

	// this needs to be address-dependency-ordered against one of
	// the loads from vma->anon_vma
	struct anon_vma *root = a->root;
}

Is this fine? If it is not fine just because the compiler might
reorder the plain load of vma->anon_vma before the READ_ONCE() load,
would it be fine after adding a barrier() directly after the
READ_ONCE()?

I initially suggested using READ_ONCE() for this, and then Linus and I
tried to reason it out and Linus suggested (if I understood him
correctly) that you could make the ugly argument that this works
because loads from the same location will not be reordered by the
hardware. So on anything other than alpha, we'd still have the
required address-dependency ordering because that happens for all
loads, even plain loads, while on alpha, the READ_ONCE() includes a
memory barrier. But that argument is weirdly reliant on
architecture-specific implementation details.

The other option is to replace the READ_ONCE() with a
smp_load_acquire(), at which point it becomes a lot simpler to show
that the code is correct.
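Spelled out, the barrier() variant being asked about would look roughly
like the sketch below. This is only an illustration of the question:
barrier() is a pure compiler barrier, so whether it suffices is exactly
what the thread goes on to discuss.

	if (READ_ONCE(vma->anon_vma) != NULL) {
		/*
		 * barrier() stops the compiler from hoisting the plain
		 * reload below above the READ_ONCE(); it does nothing
		 * about CPU reordering, which the same-location argument
		 * (or smp_load_acquire()) would have to cover.
		 */
		barrier();

		struct anon_vma *a = vma->anon_vma;
		struct anon_vma *root = a->root;
	}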
On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [...]
> > Does that help, or am I missing the point of your question?
>
> My main question is how permissible/ugly you think the following use
> of READ_ONCE() would be, and whether you think it ought to be an
> smp_load_acquire() instead.
>
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>
> if (READ_ONCE(vma->anon_vma) != NULL) {
> 	// we now know that vma->anon_vma cannot change anymore
>
> 	// access the same memory location again with a plain load
> 	struct anon_vma *a = vma->anon_vma;
>
> 	// this needs to be address-dependency-ordered against one of
> 	// the loads from vma->anon_vma
> 	struct anon_vma *root = a->root;
> }
>
> Is this fine? If it is not fine just because the compiler might
> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> would it be fine after adding a barrier() directly after the
> READ_ONCE()?

I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
as I've run into cases where you have sequences such as:

	// Assume *ptr is initially 0 and somebody else writes it to 1
	// concurrently

	foo = *ptr;
	bar = READ_ONCE(*ptr);
	baz = *ptr;

and you can get foo == baz == 0 but bar == 1 because the compiler only
ends up reading from memory twice.

That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
when dereferencing pointer to pte table"), which was very unpleasant to
debug.

Will
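A compact illustration of the miscompilation Will describes (a
hypothetical reader() function, not the arm64 case itself): the compiler
may satisfy both plain loads with a single memory access issued before
the volatile one, while READ_ONCE() must perform its own load.

	int foo, bar, baz;

	void reader(int *ptr)
	{
		foo = *ptr;		/* plain load: may be merged with the  */
					/* other plain load and hoisted...     */
		bar = READ_ONCE(*ptr);	/* volatile: must access memory        */
		baz = *ptr;		/* ...so a concurrent store of 1 can   */
					/* yield foo == baz == 0 but bar == 1  */
	}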
On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>
> [... code example snipped ...]
>
> The other option is to replace the READ_ONCE() with a
> smp_load_acquire(), at which point it becomes a lot simpler to show
> that the code is correct.

Aren't we straining at gnats here?  The context of this is handling a
page fault, and we used to take an entire rwsem for read.  I'm having
a hard time caring about "the extra expense" of an unnecessarily broad
barrier.

Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
barrier is ... tens?
On Thu, Jul 27, 2023 at 5:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
>
> Aren't we straining at gnats here?  The context of this is handling a
> page fault, and we used to take an entire rwsem for read.  I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
>
> Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
> barrier is ... tens?

Yeah, fair point. If it's hard to show correctness with READ_ONCE()
we can just use smp_load_acquire() and call it a day.
On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > Assume that we are holding some kind of lock that ensures that the
> > only possible concurrent update to "vma->anon_vma" is that it changes
> > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> >
> > if (READ_ONCE(vma->anon_vma) != NULL) {
> > 	// we now know that vma->anon_vma cannot change anymore
> >
> > 	// access the same memory location again with a plain load
> > 	struct anon_vma *a = vma->anon_vma;
> >
> > 	// this needs to be address-dependency-ordered against one of
> > 	// the loads from vma->anon_vma
> > 	struct anon_vma *root = a->root;
> > }

This reads a little oddly, perhaps because it's a fragment from a larger
piece of code.  Still, if I were doing something like this, I'd write it
as:

	struct anon_vma *a;

	a = READ_ONCE(vma->anon_vma);
	if (a != NULL) {
		struct anon_vma *root = a->root;
		...

thus eliminating the possibility of confusion from multiple reads of the
same address.

In this situation, the ordering of the two reads is guaranteed by the
address dependency.  And people shouldn't worry too much about using
that sort of ordering; RCU relies on it critically, all the time.

> > Is this fine? If it is not fine just because the compiler might
> > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > would it be fine after adding a barrier() directly after the
> > READ_ONCE()?
>
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
>
> 	// Assume *ptr is initially 0 and somebody else writes it to 1
> 	// concurrently
>
> 	foo = *ptr;
> 	bar = READ_ONCE(*ptr);
> 	baz = *ptr;
>
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
>
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Indeed, that's the sort of thing that can happen when plain accesses are
involved in a race.

Alan Stern
On Thu, Jul 27, 2023 at 04:07:32PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > [... example and discussion snipped ...]
> >
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
>
> Aren't we straining at gnats here?  The context of this is handling a
> page fault, and we used to take an entire rwsem for read.  I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
>
> Cost of an L3 cacheline miss is in the thousands of cycles.  Cost of a
> barrier is ... tens?

Couldn't agree more!

							Thanx, Paul
On Thu, Jul 27, 2023 at 5:44 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > > [... example snipped ...]
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
a helper used to ensure that a VMA is associated with an anon_vma, and
then the vma->anon_vma is used further down inside the fault handling
path. Something like:

do_cow_fault
  anon_vma_prepare
    READ_ONCE(vma->anon_vma)
    barrier()
  finish_fault
    do_set_pte
      page_add_new_anon_rmap
        folio_add_new_anon_rmap
          __page_set_anon_rmap
            [reads vma->anon_vma]

Anyway, I guess I'll follow what Paul and Matthew said and go with
smp_load_acquire().
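With that decision, the fast path would look roughly like the sketch
below (simplified; the real anon_vma_prepare() and its slow path
__anon_vma_prepare() contain more than this, so treat it as an
illustration of the pattern rather than the final patch):

	static inline int anon_vma_prepare(struct vm_area_struct *vma)
	{
		/*
		 * Pairs with the smp_store_release() that publishes
		 * vma->anon_vma: everything the fault path does after
		 * this point, including later plain reads of
		 * vma->anon_vma, is ordered after this acquire load.
		 */
		if (likely(smp_load_acquire(&vma->anon_vma)))
			return 0;

		return __anon_vma_prepare(vma);	/* allocate and publish */
	}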
On Thu, Jul 27, 2023 at 11:44:02AM -0400, Alan Stern wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > > [... example snipped ...]
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.  Still, if I were doing something like this, I'd write it
> as:
>
> 	struct anon_vma *a;
>
> 	a = READ_ONCE(vma->anon_vma);
> 	if (a != NULL) {
> 		struct anon_vma *root = a->root;
> 		...
>
> thus eliminating the possibility of confusion from multiple reads of the
> same address.
>
> In this situation, the ordering of the two reads is guaranteed by the
> address dependency.  And people shouldn't worry too much about using
> that sort of ordering; RCU relies on it critically, all the time.

Agreed.  In contrast, control dependencies require quite a bit more care
and feeding, and are usually best avoided.

But even with the normal RCU address/data dependencies, it is possible
to get in trouble.  For but one example, comparing a pointer obtained
from rcu_dereference() to the address of a static structure is a good
way to break your address dependency.  (Just yesterday evening I talked
to someone who had spent quite a bit of time chasing one of these down,
so yes, this is quite real.)

> > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > as I've run into cases where you have sequences such as:
> >
> > [... example snipped ...]
> >
> > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > when dereferencing pointer to pte table"), which was very unpleasant to
> > debug.
>
> Indeed, that's the sort of thing that can happen when plain accesses are
> involved in a race.

Agreed.  Furthermore, it is more important to comment plain C-language
accesses to shared variables than to comment the likes of READ_ONCE().
"OK, tell me again exactly why you think the compiler cannot mess you
up here?"

							Thanx, Paul
On Thu, Jul 27, 2023 at 06:10:12PM +0200, Jann Horn wrote:
> [...]
> Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
> a helper used to ensure that a VMA is associated with an anon_vma, and
> then the vma->anon_vma is used further down inside the fault handling
> path.
> [...]
> Anyway, I guess I'll follow what Paul and Matthew said and go with
> smp_load_acquire().

I thank you now, and you will thank yourself later.  ;-)

							Thanx, Paul
On Thu, 27 Jul 2023 at 08:44, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes. As Jann already said, this is basically a preparatory step in a
much longer sequence, and the code simply wants to make sure that any
later code (possibly quite a bit later) will not see a NULL value.

I do believe it happens to be safe to use READ_ONCE() for a number of
reasons, but I will argue that we should *never* use a bare READ_ONCE()
if there is any actual real question about what any memory ordering
might be.

And yes, the RCU code obviously does use READ_ONCE(), so that's why I
say "bare" - the RCU code wraps it in helper accessors with strict
semantics.

The reason I think it would be technically *safe* to do here is that

 (a) code like this:

	if (READ_ONCE(..))

     may end up having the conditional speculated by the CPU, and have
     the actual read done without any ordering by the CPU, but we do
     have *one* guarantee: the memory load instruction will be before
     the conditional branch (or whatever) in the instruction stream.

     So the code may be *executed* out of order, but we know the memory
     load cannot be moved after the conditional (whatever form that
     conditional then takes) by a compiler.

     (We do have various barriers like "rcu_read_unlock()" that
     guarantee that the READ_ONCE() couldn't be moved much later even
     in the absence of the conditional, but we can ignore that.)

 (b) the later use of the anon_vma (that depends on the value being
     stable) is *so* much later, and behind things that the compiler
     sees as barriers (again, that rcu_read_unlock() acts at a minimum
     as a compiler barrier) that any subsequent use would not have its
     load moved to before the READ_ONCE() in the instruction stream.

     Again, this is purely a "location in the instruction stream"
     ordering argument, not an "execution order" ordering argument.

And finally

 (c) I do think that all our architectures have the rule that when you
     read from the *same* memory location from the same CPU, the
     accesses are ordered.

Now, I didn't actually find (c) spelled out anywhere, but even alpha -
the worst memory ordering ever devised - had that particular rule (the
alpha architecture manual calls it "Location Access Order").

Now, with that said, I did argue to Jann that we should use
smp_store_release() and smp_load_acquire() pairs here, because honestly,
the above (a)-(c) argument is too damn subtle for me, and I don't think
this particular use requires it.

With smp_load_acquire(), you don't need any of the (a)-(c) rules. The
rule is "the load is done before any subsequent memory operations".
End of story.

So while I do think READ_ONCE() is sufficient here, I actually think
that once you start going down that path you can argue that READ_ONCE()
is actually entirely unnecessary, because we also have other ordering
rules that mean that the compiler can't really do anything else even
*without* the READ_ONCE().

End result: I can trivially extend the (a)-(c) argument to say "even
READ_ONCE() isn't strictly necessary here", because even any access
tearing - which won't happen anyway - wouldn't actually change the
result.

So if we want to make it *obvious* that it's safe, we should use
smp_load_acquire().

And if we do find that there are situations where we care so much about
the (generally fairly cheap) possible additional synchronization, and
we really want to use READ_ONCE() rather than smp_load_acquire(), I'd
rather try to wrap a READ_ONCE() in some kind of better abstraction
(maybe make the conditional part of the operation, and make it clear
that you are doing a "one-time test which depends on the same-location
rule").

Do we even have the same-location rule in the LKMM?

             Linus
On Thu, Jul 27, 2023 at 10:11:29AM -0700, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 08:44, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > This reads a little oddly, perhaps because it's a fragment from a larger
> > piece of code.
>
> Yes. As Jann already said, this is basically a preparatory step in a
> much longer sequence, and the code simply wants to make sure that any
> later code (possibly quite a bit later) will not see a NULL value.

...

> Do we even have the same-location rule in the LKMM?

Yes.  The comment in the source file calls it "Sequential Consistency
Per Variable", under the category of "Fundamental coherence ordering".
It applies even to plain accesses, not just to READ_ONCE or stronger.

But in the presence of data races (as in the example that Will posted
earlier), all bets are off.  So if you want to use a plain access rather
than READ_ONCE, you need to be certain that it won't race with anything.

Alan Stern
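That rule can be exercised directly with a herd7 litmus test along the
following lines (a sketch; the names are arbitrary). The LKMM reports
the "exists" clause as never satisfied: once P1 has observed the store,
a later read of the same location cannot see the older value.

	C coherence-same-location

	{}

	P0(int *x)
	{
		WRITE_ONCE(*x, 1);
	}

	P1(int *x)
	{
		int r0;
		int r1;

		r0 = READ_ONCE(*x);
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0)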
On Thu, 27 Jul 2023 at 10:41, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> But in the presence of data races (as in the example that Will posted
> earlier), all bets are off.  So if you want to use a plain access rather
> than READ_ONCE, you need to be certain that it won't race with anything.

So in this case, the initial NULL check really is just checking for
"has the smp_store_release() happened".

The reason even tearing wouldn't matter is that seeing *any* value
other than all-zeroes (aka NULL) is already sufficient information.
Because once the smp_store_release() has been seen by this CPU, the
data race no longer exists, and the value is stable.

Which is why I argue that even without READ_ONCE() the code would
*work* due to all the circumstances around it (one of them is that we
just took a lock, after doing an optimistic check that really has no
semantic effect at all except for the "don't even take the lock if it
looks like things are going to fail").

But if we want to have the code be obvious, and not have to refer to
those kinds of arguments, I think smp_load_acquire() is the only
actual "obvious" thing to use. At that point it's no longer some
chain of "because of X, Y and Z".

             Linus
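The store side being relied on here would look something like the
following sketch (vma_install_anon_vma() is a hypothetical helper name;
the real installation happens in the anon_vma setup code under the
appropriate lock):

	/* Caller holds the lock that serializes anon_vma installation. */
	static void vma_install_anon_vma(struct vm_area_struct *vma,
					 struct anon_vma *av)
	{
		/*
		 * Release store: all initialization of *av (e.g. av->root)
		 * is ordered before vma->anon_vma becomes non-NULL, so a
		 * reader that observes any non-NULL value also observes a
		 * fully initialized anon_vma. Since NULL -> non-NULL is
		 * the only transition, even a torn read that is merely
		 * non-zero would prove the release store has happened.
		 */
		smp_store_release(&vma->anon_vma, av);
	}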
> On Jul 27, 2023, at 7:57 AM, Will Deacon <will@kernel.org> wrote:
>
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>> [...]
>> Is this fine? If it is not fine just because the compiler might
>> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
>> would it be fine after adding a barrier() directly after the
>> READ_ONCE()?
>
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
>
> 	// Assume *ptr is initially 0 and somebody else writes it to 1
> 	// concurrently
>
> 	foo = *ptr;
> 	bar = READ_ONCE(*ptr);
> 	baz = *ptr;
>
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
>
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Interesting. I wonder if you considered adding to READ_ONCE() something
like:

	asm volatile("" : "+g" (x) );

So later loads (such as baz = *ptr) would reload the updated value.
On Thu, 27 Jul 2023 at 12:10, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Interesting. I wonder if you considered adding to READ_ONCE() something
> like:
>
> 	asm volatile("" : "+g" (x) );
>
> So later loads (such as baz = *ptr) would reload the updated value.

Not necessarily a bad idea. Although I suspect you'd want to add *two*
of them - on either side - to make sure any previous loads wouldn't be
moved around it either.

             Linus
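In macro form, the two-barrier idea would be roughly this (a concept
sketch with a made-up name, READ_ONCE_FRESH(); Nadav's actual attempt
follows below):

	#define READ_ONCE_FRESH(x)					\
	({								\
		__typeof__(x) __v;					\
									\
		/* forbid reuse of earlier plain loads of x */		\
		asm volatile("" : "+g" (x));				\
		__v = READ_ONCE(x);					\
		/* force later plain loads of x to reload */		\
		asm volatile("" : "+g" (x));				\
		__v;							\
	})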
> On Jul 27, 2023, at 12:39 PM, Linus Torvalds <torvalds@linuxfoundation.org> wrote:
>
> On Thu, 27 Jul 2023 at 12:10, Nadav Amit <nadav.amit@gmail.com> wrote:
>>
>> Interesting. I wonder if you considered adding to READ_ONCE() something
>> like:
>>
>> 	asm volatile("" : "+g" (x) );
>>
>> So later loads (such as baz = *ptr) would reload the updated value.
>
> Not necessarily a bad idea. Although I suspect you'd want to add *two*
> of them - on either side - to make sure any previous loads wouldn't be
> moved around it either.

You are right, two are needed.

I'll give it a shot and see if I see changes to the binary.
> On Jul 27, 2023, at 1:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> [...]
> You are right, two are needed.
>
> I'll give it a shot and see if I see changes to the binary.

Just for the record (so nobody will repeat my mistakes): I implemented
it, but it works poorly. It appears to communicate the constraints, but
the generated code is much worse.

The problem is that if the GCC inline asm decides to use a memory
operand (e.g. "+m"), it computes the address (uses LEA first before the
MOV) and allocates a register for the address. Using "+g" also causes
the compiler to allocate a register.

-- >8 --
---
 include/asm-generic/rwonce.h | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwonce.h b/include/asm-generic/rwonce.h
index 8d0a6280e982..c6a2f8e3013c 100644
--- a/include/asm-generic/rwonce.h
+++ b/include/asm-generic/rwonce.h
@@ -44,10 +44,34 @@
 #define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
 #endif
 
-#define READ_ONCE(x)							\
-({									\
-	compiletime_assert_rwonce_type(x);				\
-	__READ_ONCE(x);							\
+#define as_scalar(x)							\
+	__builtin_choose_expr(sizeof(x) == sizeof(char),		\
+		(__force char *)&(x),					\
+	__builtin_choose_expr(sizeof(x) == sizeof(short),		\
+		(__force short *)&(x),					\
+	__builtin_choose_expr(sizeof(x) == sizeof(int),			\
+		(__force int *)&(x),					\
+	__builtin_choose_expr(sizeof(x) == sizeof(long),		\
+		(__force long *)&(x),					\
+	__builtin_choose_expr(sizeof(x) == sizeof(long long),		\
+		(__force long long *)&(x),				\
+		(void*)0)))))
+
+#define READ_ONCE(x)							\
+({									\
+	typeof(x) * px = &(x);						\
+	union {								\
+		typeof(x) __orig;					\
+		typeof(*as_scalar(x)) __alias;				\
+	} __u;								\
+									\
+	compiletime_assert_rwonce_type(x);				\
+	asm volatile ("" :						\
+		      "+g" (*(__force typeof(*as_scalar(x)) *)(px)));	\
+	__u.__alias = __READ_ONCE(*as_scalar(*px));			\
+	asm volatile ("" :						\
+		      "+g" (*(__force typeof(*as_scalar(x)) *)(px)));	\
+	__u.__orig;							\
 })
 
 #define __WRITE_ONCE(x, val)						\