mbox series

[WIP,0/3] Memory model and atomic API in Rust

Message ID	20240322233838.868874-1-boqun.feng@gmail.com (mailing list archive)
Headers	show Received: from mail-qt1-f169.google.com (mail-qt1-f169.google.com [209.85.160.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C30CF1D6AE; Fri, 22 Mar 2024 23:39:08 +0000 (UTC) Feedback-ID: iad51458e:Fastmail From: Boqun Feng <boqun.feng@gmail.com> To: rust-for-linux@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, llvm@lists.linux.dev Cc: Miguel Ojeda <ojeda@kernel.org>, Alex Gaynor <alex.gaynor@gmail.com>, Wedson Almeida Filho <wedsonaf@gmail.com>, Boqun Feng <boqun.feng@gmail.com>, Gary Guo <gary@garyguo.net>, =?utf-8?q?Bj=C3=B6rn_Roy_Baron?= <bjorn3_gh@protonmail.com>, Benno Lossin <benno.lossin@proton.me>, Andreas Hindborg <a.hindborg@samsung.com>, Alice Ryhl <aliceryhl@google.com>, Alan Stern <stern@rowland.harvard.edu>, Andrea Parri <parri.andrea@gmail.com>, Will Deacon <will@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Nicholas Piggin <npiggin@gmail.com>, David Howells <dhowells@redhat.com>, Jade Alglave <j.alglave@ucl.ac.uk>, Luc Maranget <luc.maranget@inria.fr>, "Paul E. McKenney" <paulmck@kernel.org>, Akira Yokosawa <akiyks@gmail.com>, Daniel Lustig <dlustig@nvidia.com>, Joel Fernandes <joel@joelfernandes.org>, Nathan Chancellor <nathan@kernel.org>, Nick Desaulniers <ndesaulniers@google.com>, kent.overstreet@gmail.com, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, elver@google.com, Mark Rutland <mark.rutland@arm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Catalin Marinas <catalin.marinas@arm.com>, torvalds@linux-foundation.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org Subject: [WIP 0/3] Memory model and atomic API in Rust Date: Fri, 22 Mar 2024 16:38:35 -0700 Message-ID: <20240322233838.868874-1-boqun.feng@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Memory model and atomic API in Rust \| expand [WIP,0/3] Memory model and atomic API in Rust [WIP,1/3] rust: Introduce atomic module [WIP,2/3] rust: atomic: Add ARM64 fetch_add_relaxed() [WIP,3/3] rust: atomic: Add fetch_sub_release()

Message ID

20240322233838.868874-1-boqun.feng@gmail.com (mailing list archive)

Headers

Feedback-ID: iad51458e:Fastmail
From: Boqun Feng <boqun.feng@gmail.com>
To: rust-for-linux@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org,
	llvm@lists.linux.dev
Cc: Miguel Ojeda <ojeda@kernel.org>,	Alex Gaynor <alex.gaynor@gmail.com>,
 Wedson Almeida Filho <wedsonaf@gmail.com>, Boqun Feng <boqun.feng@gmail.com>,
	Gary Guo <gary@garyguo.net>,
 =?utf-8?q?Bj=C3=B6rn_Roy_Baron?= <bjorn3_gh@protonmail.com>,
 Benno Lossin <benno.lossin@proton.me>,
 Andreas Hindborg <a.hindborg@samsung.com>, Alice Ryhl <aliceryhl@google.com>,
 Alan Stern <stern@rowland.harvard.edu>,
 Andrea Parri <parri.andrea@gmail.com>,	Will Deacon <will@kernel.org>,
 Peter Zijlstra <peterz@infradead.org>, Nicholas Piggin <npiggin@gmail.com>,
	David Howells <dhowells@redhat.com>, Jade Alglave <j.alglave@ucl.ac.uk>,
	Luc Maranget <luc.maranget@inria.fr>,
 "Paul E. McKenney" <paulmck@kernel.org>, Akira Yokosawa <akiyks@gmail.com>,
	Daniel Lustig <dlustig@nvidia.com>, Joel Fernandes <joel@joelfernandes.org>,
 Nathan Chancellor <nathan@kernel.org>,
 Nick Desaulniers <ndesaulniers@google.com>,	kent.overstreet@gmail.com,
 Greg Kroah-Hartman <gregkh@linuxfoundation.org>,	elver@google.com,
 Mark Rutland <mark.rutland@arm.com>, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
 Dave Hansen <dave.hansen@linux.intel.com>,	x86@kernel.org,
 "H. Peter Anvin" <hpa@zytor.com>, Catalin Marinas <catalin.marinas@arm.com>,
	torvalds@linux-foundation.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org
Subject: [WIP 0/3] Memory model and atomic API in Rust
Date: Fri, 22 Mar 2024 16:38:35 -0700
Message-ID: <20240322233838.868874-1-boqun.feng@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Memory model and atomic API in Rust | expand

Message

Boqun Feng March 22, 2024, 11:38 p.m. UTC

Hi,

Since I see more and more Rust code is comming in, I feel like this
should be sent sooner rather than later, so here is a WIP to open the
discussion and get feedback.

One of the most important questions we need to answer is: which
memory (ordering) model we should use when developing Rust in Linux
kernel, given Rust has its own memory ordering model[1]. I had some
discussion with Rust language community to understand their position
on this:

	https://github.com/rust-lang/unsafe-code-guidelines/issues/348#issuecomment-1218407557
	https://github.com/rust-lang/unsafe-code-guidelines/issues/476#issue-2001382992

My takeaway from these discussions, along with other offline discussion
is that supporting two memory models is challenging for both correctness
reasoning (some one needs to provide a model) and implementation (one
model needs to be aware of the other model). So that's not wise to do
(at least at the beginning). So the most reasonable option to me is:

	we only use LKMM for Rust code in kernel (i.e. avoid using
	Rust's own atomic).

Because kernel developers are more familiar with LKMM and when Rust code
interacts with C code, it has to use the model that C code uses.


And this patchset is the result of that option. I introduced an atomic
library to wrap and implement LKMM atomics (of course, given it's a WIP,
so it's unfinished). Things to notice:

* I know I could use Rust macro to generate the whole set of atomics,
  but I choose not to in the beginning, as I want to make it easier to
  review.

* Very likely, we will only have AtomicI32, AtomicI64 and AtomicUsize
  (i.e no atomic for bool, u8, u16, etc), with limited support for
  atomic load and store on 8/16 bits.

* I choose to re-implement atomics in Rust `asm` because we are still
  figuring out how we can make it easy and maintainable for Rust to call
  a C function _inlinely_ (Gary makes some progress [2]). Otherwise,
  atomic primitives would be function calls, and that can be performance
  bottleneck in a few cases.

* I only have two API implemented and two architecture supported yet,
  the complete support surely can be added when everyone is on the same
  page.


Any suggestion, question, review, help is welcome!

Regards,
Boqun

[1]: https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses
[2]: https://rust-for-linux.zulipchat.com/#narrow/stream/288089-General/topic/LTO.20Rust.20modules.20with.20C.20helpers/near/425361365

Boqun Feng (3):
  rust: Introduce atomic module
  rust: atomic: Add ARM64 fetch_add_relaxed()
  rust: atomic: Add fetch_sub_release()

 rust/kernel/sync.rs                   |  1 +
 rust/kernel/sync/atomic.rs            | 65 +++++++++++++++++++++++++++
 rust/kernel/sync/atomic/arch.rs       | 15 +++++++
 rust/kernel/sync/atomic/arch/arm64.rs | 46 +++++++++++++++++++
 rust/kernel/sync/atomic/arch/x86.rs   | 48 ++++++++++++++++++++
 5 files changed, 175 insertions(+)
 create mode 100644 rust/kernel/sync/atomic.rs
 create mode 100644 rust/kernel/sync/atomic/arch.rs
 create mode 100644 rust/kernel/sync/atomic/arch/arm64.rs
 create mode 100644 rust/kernel/sync/atomic/arch/x86.rs

Comments

Kent Overstreet March 22, 2024, 11:57 p.m. UTC | #1

On Fri, Mar 22, 2024 at 04:38:35PM -0700, Boqun Feng wrote:
> Hi,
> 
> Since I see more and more Rust code is comming in, I feel like this
> should be sent sooner rather than later, so here is a WIP to open the
> discussion and get feedback.
> 
> One of the most important questions we need to answer is: which
> memory (ordering) model we should use when developing Rust in Linux
> kernel, given Rust has its own memory ordering model[1]. I had some
> discussion with Rust language community to understand their position
> on this:
> 
> 	https://github.com/rust-lang/unsafe-code-guidelines/issues/348#issuecomment-1218407557
> 	https://github.com/rust-lang/unsafe-code-guidelines/issues/476#issue-2001382992
> 
> My takeaway from these discussions, along with other offline discussion
> is that supporting two memory models is challenging for both correctness
> reasoning (some one needs to provide a model) and implementation (one
> model needs to be aware of the other model). So that's not wise to do
> (at least at the beginning). So the most reasonable option to me is:
> 
> 	we only use LKMM for Rust code in kernel (i.e. avoid using
> 	Rust's own atomic).
> 
> Because kernel developers are more familiar with LKMM and when Rust code
> interacts with C code, it has to use the model that C code uses.

I wonder about that. The disadvantage of only supporting LKMM atomics is
that we'll be incompatible with third party code, and we don't want to
be rolling all of our own data structures forever.

Do we see a path towards eventually supporting the standard Rust model?

Perhaps LKMM atomics could be reworked to be a layer on top of C/C++
atomics. When I last looked, they didn't look completely incompatible;
rather, there is a common subset that both support with the same
semantics, and either has some things that it supports and the other
doesn't (i.e., LKMLL atomics have smp_mb__after_atomic(); this is just a
straightforward optimization to avoid an unnecessary barrier on
architectures where the atomic already provided it).

Linus Torvalds March 23, 2024, 12:12 a.m. UTC | #2

On Fri, 22 Mar 2024 at 16:57, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> I wonder about that. The disadvantage of only supporting LKMM atomics is
> that we'll be incompatible with third party code, and we don't want to
> be rolling all of our own data structures forever.

Honestly, having seen the shit-show that is language standards bodies
and incomplete compiler support, I do not understand why people think
that we wouldn't want to roll our own.

The C++ memory model may be reliable in another decade. And then a
decade after *that*, we can drop support for the pre-reliable
compilers.

People who think that compilers do things right just because they are
automated simply don't know what they are talking about.

It was just a couple of days ago that I was pointed at

    https://github.com/llvm/llvm-project/issues/64188

which is literally the compiler completely missing a C++ memory barrier.

And when the compiler itself is fundamentally buggy, you're kind of
screwed. When you roll your own, you can work around the bugs in
compilers.

And this is all doubly true when it is something that the kernel does,
and very few other projects do. For example, we're often better off
using inline asm over dubious builtins that have "native" compiler
support for them, but little actual real coverage. It really is often
a "ok, this builtin has actually been used for a decade, so it's
hopefully stable now".

We have years of examples of builtins either being completely broken
(as in "immediate crash" broken), or simply generating crap code that
is actively worse than using the inline asm.

The memory ordering isn't going to be at all different. Moving it into
the compiler doesn't solve problems. It creates them.

                 Linus

Boqun Feng March 23, 2024, 12:15 a.m. UTC | #3

On Fri, Mar 22, 2024 at 07:57:41PM -0400, Kent Overstreet wrote:
> On Fri, Mar 22, 2024 at 04:38:35PM -0700, Boqun Feng wrote:
> > Hi,
> > 
> > Since I see more and more Rust code is comming in, I feel like this
> > should be sent sooner rather than later, so here is a WIP to open the
> > discussion and get feedback.
> > 
> > One of the most important questions we need to answer is: which
> > memory (ordering) model we should use when developing Rust in Linux
> > kernel, given Rust has its own memory ordering model[1]. I had some
> > discussion with Rust language community to understand their position
> > on this:
> > 
> > 	https://github.com/rust-lang/unsafe-code-guidelines/issues/348#issuecomment-1218407557
> > 	https://github.com/rust-lang/unsafe-code-guidelines/issues/476#issue-2001382992
> > 
> > My takeaway from these discussions, along with other offline discussion
> > is that supporting two memory models is challenging for both correctness
> > reasoning (some one needs to provide a model) and implementation (one
> > model needs to be aware of the other model). So that's not wise to do
> > (at least at the beginning). So the most reasonable option to me is:
> > 
> > 	we only use LKMM for Rust code in kernel (i.e. avoid using
> > 	Rust's own atomic).
> > 
> > Because kernel developers are more familiar with LKMM and when Rust code
> > interacts with C code, it has to use the model that C code uses.
> 
> I wonder about that. The disadvantage of only supporting LKMM atomics is
> that we'll be incompatible with third party code, and we don't want to
> be rolling all of our own data structures forever.
> 

A possible solution to that is a set of C++ memory model atomics
implemented by LKMM atomics. That should be possible.

> Do we see a path towards eventually supporting the standard Rust model?
> 

Things that Rust/C++ memory model don't suppor but we use are at least:
mixed size atomics (cmpxchg a u64, but read a u8 from another thread),
dependencies (we used a lot in fast path), so it's not trivial.

There are also issues like where one Rust thread does a store(..,
RELEASE), and a C thread does a rcu_deference(), in practice, it
probably works but no one works out (and no one would work out) a model
to describe such an interaction.

Regards,
Boqun

> Perhaps LKMM atomics could be reworked to be a layer on top of C/C++
> atomics. When I last looked, they didn't look completely incompatible;
> rather, there is a common subset that both support with the same
> semantics, and either has some things that it supports and the other
> doesn't (i.e., LKMLL atomics have smp_mb__after_atomic(); this is just a
> straightforward optimization to avoid an unnecessary barrier on
> architectures where the atomic already provided it).

Kent Overstreet March 23, 2024, 12:21 a.m. UTC | #4

On Fri, Mar 22, 2024 at 05:12:29PM -0700, Linus Torvalds wrote:
> On Fri, 22 Mar 2024 at 16:57, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > I wonder about that. The disadvantage of only supporting LKMM atomics is
> > that we'll be incompatible with third party code, and we don't want to
> > be rolling all of our own data structures forever.
> 
> Honestly, having seen the shit-show that is language standards bodies
> and incomplete compiler support, I do not understand why people think
> that we wouldn't want to roll our own.
> 
> The C++ memory model may be reliable in another decade. And then a
> decade after *that*, we can drop support for the pre-reliable
> compilers.
> 
> People who think that compilers do things right just because they are
> automated simply don't know what they are talking about.
> 
> It was just a couple of days ago that I was pointed at
> 
>     https://github.com/llvm/llvm-project/issues/64188

Besides that there's cross arch support to think about - it's hard to
imagine us ever ditching our own atomics.

I was thinking about something more incremental - just an optional mode
where our atomics were C atomics underneath. It'd probably give the
compiler people a much more effective way to test their stuff than
anything they have now.

Linus Torvalds March 23, 2024, 12:36 a.m. UTC | #5

On Fri, 22 Mar 2024 at 17:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Besides that there's cross arch support to think about - it's hard to
> imagine us ever ditching our own atomics.

Well, that's one of the advantages of using compiler builtins -
projects that do want cross-architecture support, but that aren't
actually maintaining their _own_ architecture support.

So I very much see the lure of compiler support for that kind of
situation - to write portable code without having to know or care
about architecture details.

This is one reason I think the kernel is kind of odd and special -
because in the kernel, we obviously very fundamentally have to care
about the architecture details _anyway_, so then having the
architecture also define things like atomics is just a pretty small
(and relatively straightforward) detail.

The same argument goes for compiler builtins vs inline asm. In the
kernel, we have to have people who are intimately familiar with the
architecture _anyway_, so inline asms and architecture-specific header
files aren't some big pain-point: they'd be needed _anyway_.

But in some random user level program, where all you want is an
efficient way to do "find first bit"? Then using a compiler intrinsic
makes a lot more sense.

> I was thinking about something more incremental - just an optional mode
> where our atomics were C atomics underneath. It'd probably give the
> compiler people a much more effective way to test their stuff than
> anything they have now.

I suspect it might be painful, and some compiler people would throw
their hands up in horror, because the C++ atomics model is based
fairly solidly on atomic types, and the kernel memory model is much
more fluid.

Boqun already mentioned the "mixing access sizes", which is actually
quite fundamental in the kernel, where we play lots of games with that
(typically around locking, where you find patterns line unlock writing
a zero to a single byte, even though the whole lock data structure is
a word). And sometimes the access size games are very explicit (eg
lib/lockref.c).

But it actually goes deeper than that. While we do have "atomic_t" etc
for arithmetic atomics, and that probably would map fairly well to C++
atomics, in other cases we simply base our atomics not on _types_, but
on code.

IOW, we do things like "cmpxchg()", and the target of that atomic
access is just a regular data structure field.

It's kind of like our "volatile" usage. If you read the C (and C++)
standards, you'll find that you should use "volatile" on data types.
That's almost *never* what the kernel does. The kernel uses "volatile"
in _code_ (ie READ_ONCE() etc), and uses it by casting etc.

Compiler people don't tend to really like those kinds of things.

            Linus

Boqun Feng March 23, 2024, 12:49 a.m. UTC | #6

On Fri, Mar 22, 2024 at 05:15:08PM -0700, Boqun Feng wrote:
[...]
> > 
> > I wonder about that. The disadvantage of only supporting LKMM atomics is
> > that we'll be incompatible with third party code, and we don't want to
> > be rolling all of our own data structures forever.
> > 
> 
> A possible solution to that is a set of C++ memory model atomics
> implemented by LKMM atomics. That should be possible.
> 

Another possible "solution" works in the opposite direction, since the
folder rust/kernel/sync/atomic is quite stand-alone, we can export that
as a Rust crate (library), and third party code can support using LKMM
atomics instead of Rust own atomics ;-) Of course if the project is
supposed to work with Linux kernel.

Regards,
Boqun

Kent Overstreet March 23, 2024, 1:42 a.m. UTC | #7

On Fri, Mar 22, 2024 at 05:49:23PM -0700, Boqun Feng wrote:
> On Fri, Mar 22, 2024 at 05:15:08PM -0700, Boqun Feng wrote:
> [...]
> > > 
> > > I wonder about that. The disadvantage of only supporting LKMM atomics is
> > > that we'll be incompatible with third party code, and we don't want to
> > > be rolling all of our own data structures forever.
> > > 
> > 
> > A possible solution to that is a set of C++ memory model atomics
> > implemented by LKMM atomics. That should be possible.
> > 
> 
> Another possible "solution" works in the opposite direction, since the
> folder rust/kernel/sync/atomic is quite stand-alone, we can export that
> as a Rust crate (library), and third party code can support using LKMM
> atomics instead of Rust own atomics ;-) Of course if the project is
> supposed to work with Linux kernel.

Not just from the Rust side, the C side would be useful as well. I've
got a quicky, dirty, hacky version of that in bcachefs-tools, and I
would switch immediately if someone took that over and made it a real
project, and I'm quite sure other projects would as well.

Kent Overstreet March 23, 2024, 2:07 a.m. UTC | #8

On Fri, Mar 22, 2024 at 05:36:00PM -0700, Linus Torvalds wrote:
> On Fri, 22 Mar 2024 at 17:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Besides that there's cross arch support to think about - it's hard to
> > imagine us ever ditching our own atomics.
> 
> Well, that's one of the advantages of using compiler builtins -
> projects that do want cross-architecture support, but that aren't
> actually maintaining their _own_ architecture support.
> 
> So I very much see the lure of compiler support for that kind of
> situation - to write portable code without having to know or care
> about architecture details.
> 
> This is one reason I think the kernel is kind of odd and special -
> because in the kernel, we obviously very fundamentally have to care
> about the architecture details _anyway_, so then having the
> architecture also define things like atomics is just a pretty small
> (and relatively straightforward) detail.
> 
> The same argument goes for compiler builtins vs inline asm. In the
> kernel, we have to have people who are intimately familiar with the
> architecture _anyway_, so inline asms and architecture-specific header
> files aren't some big pain-point: they'd be needed _anyway_.
> 
> But in some random user level program, where all you want is an
> efficient way to do "find first bit"? Then using a compiler intrinsic
> makes a lot more sense.

We've got a whole spectrum of kernel code though, and a lot of it is
code that - honestly, we'd be better off if it wasn't specific to the
kernel.

rhashtable comes to mind; it's a fully generic, excellent at what it
does, but it's had a number of annoyingly subtle bugs and sharp edges
over the years that are really just a result of it not having enough
users.

So I see some real value in regularizing things.

> > I was thinking about something more incremental - just an optional mode
> > where our atomics were C atomics underneath. It'd probably give the
> > compiler people a much more effective way to test their stuff than
> > anything they have now.
> 
> I suspect it might be painful, and some compiler people would throw
> their hands up in horror, because the C++ atomics model is based
> fairly solidly on atomic types, and the kernel memory model is much
> more fluid.
> 
> Boqun already mentioned the "mixing access sizes", which is actually
> quite fundamental in the kernel, where we play lots of games with that
> (typically around locking, where you find patterns line unlock writing
> a zero to a single byte, even though the whole lock data structure is
> a word). And sometimes the access size games are very explicit (eg
> lib/lockref.c).

I don't think mixing access sizes should be a real barrier. On the read
side we can obviously do that with a helper; the write side needs
compiler help, but "writing just a byte out of a word" is no different
from a compiler POV that "write a single bit", and we can already mix
atomic_or() with atomic_add(), with both C atomics and LKMM atomics.

> But it actually goes deeper than that. While we do have "atomic_t" etc
> for arithmetic atomics, and that probably would map fairly well to C++
> atomics, in other cases we simply base our atomics not on _types_, but
> on code.
> 
> IOW, we do things like "cmpxchg()", and the target of that atomic
> access is just a regular data structure field.

Well, some of that's historical cruft; cmpxchg() and atomic_cmpxchg()
have different orderings, and we can specify that more directly now.

But we definitely need the ability to cmpxchg() any struct of a size the
machine supports atomic access to. Rust should be able to manage that
more easily than C/C++ though - they've got a type system that can
sanely represent that.

Boqun Feng March 23, 2024, 2:26 a.m. UTC | #9

On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
[...]
> > Boqun already mentioned the "mixing access sizes", which is actually
> > quite fundamental in the kernel, where we play lots of games with that
> > (typically around locking, where you find patterns line unlock writing
> > a zero to a single byte, even though the whole lock data structure is
> > a word). And sometimes the access size games are very explicit (eg
> > lib/lockref.c).
> 
> I don't think mixing access sizes should be a real barrier. On the read

Well, it actually is, since mixing access sizes is, guess what,
an undefined behavior:

(example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)

	thread::scope(|s| {
	    // This is UB: using different-sized atomic accesses to the same data
	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
	    s.spawn(|| unsafe {
		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
		differently_sized.store(2, Ordering::Relaxed);
	    });
	});

Of course, you can say "I will just ignore the UB", but if you have to
ignore "compiler rules" to make your code work, why bother use compiler
builtin in the first place? Being UB means they are NOT guaranteed to
work.

> side we can obviously do that with a helper; the write side needs
> compiler help, but "writing just a byte out of a word" is no different
> from a compiler POV that "write a single bit", and we can already mix
> atomic_or() with atomic_add(), with both C atomics and LKMM atomics.
> 

I totally agree with your reasoning here, but maybe the standard doesn't
;-)

Regards,
Boqun

Kent Overstreet March 23, 2024, 2:33 a.m. UTC | #10

On Fri, Mar 22, 2024 at 07:26:28PM -0700, Boqun Feng wrote:
> On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
> [...]
> > > Boqun already mentioned the "mixing access sizes", which is actually
> > > quite fundamental in the kernel, where we play lots of games with that
> > > (typically around locking, where you find patterns line unlock writing
> > > a zero to a single byte, even though the whole lock data structure is
> > > a word). And sometimes the access size games are very explicit (eg
> > > lib/lockref.c).
> > 
> > I don't think mixing access sizes should be a real barrier. On the read
> 
> Well, it actually is, since mixing access sizes is, guess what,
> an undefined behavior:
> 
> (example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)
> 
> 	thread::scope(|s| {
> 	    // This is UB: using different-sized atomic accesses to the same data
> 	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
> 	    s.spawn(|| unsafe {
> 		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
> 		differently_sized.store(2, Ordering::Relaxed);
> 	    });
> 	});
> 
> Of course, you can say "I will just ignore the UB", but if you have to
> ignore "compiler rules" to make your code work, why bother use compiler
> builtin in the first place? Being UB means they are NOT guaranteed to
> work.

That's not what I'm proposing - you'd need additional compiler support.
but the new intrinsic would be no different, semantics wise for the
compiler to model, than a "lock orb".

Boqun Feng March 23, 2024, 2:57 a.m. UTC | #11

On Fri, Mar 22, 2024 at 10:33:13PM -0400, Kent Overstreet wrote:
> On Fri, Mar 22, 2024 at 07:26:28PM -0700, Boqun Feng wrote:
> > On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
> > [...]
> > > > Boqun already mentioned the "mixing access sizes", which is actually
> > > > quite fundamental in the kernel, where we play lots of games with that
> > > > (typically around locking, where you find patterns line unlock writing
> > > > a zero to a single byte, even though the whole lock data structure is
> > > > a word). And sometimes the access size games are very explicit (eg
> > > > lib/lockref.c).
> > > 
> > > I don't think mixing access sizes should be a real barrier. On the read
> > 
> > Well, it actually is, since mixing access sizes is, guess what,
> > an undefined behavior:
> > 
> > (example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)
> > 
> > 	thread::scope(|s| {
> > 	    // This is UB: using different-sized atomic accesses to the same data
> > 	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
> > 	    s.spawn(|| unsafe {
> > 		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
> > 		differently_sized.store(2, Ordering::Relaxed);
> > 	    });
> > 	});
> > 
> > Of course, you can say "I will just ignore the UB", but if you have to
> > ignore "compiler rules" to make your code work, why bother use compiler
> > builtin in the first place? Being UB means they are NOT guaranteed to
> > work.
> 
> That's not what I'm proposing - you'd need additional compiler support.

Ah, OK.

> but the new intrinsic would be no different, semantics wise for the
> compiler to model, than a "lock orb".

Be ready to be disappointed:

	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402078545
	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402082631

;-)

In fact, if you get a chance to read the previous discussion links I
shared, you will find I was just like you in the beginning: hope we
could extend the model to support more kernel code properly. But my
overall feeling is that it's either very challenging or lack of
motivation to do.

Regards,
Boqun

Kent Overstreet March 23, 2024, 3:10 a.m. UTC | #12

On Fri, Mar 22, 2024 at 07:57:20PM -0700, Boqun Feng wrote:
> On Fri, Mar 22, 2024 at 10:33:13PM -0400, Kent Overstreet wrote:
> > On Fri, Mar 22, 2024 at 07:26:28PM -0700, Boqun Feng wrote:
> > > On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
> > > [...]
> > > > > Boqun already mentioned the "mixing access sizes", which is actually
> > > > > quite fundamental in the kernel, where we play lots of games with that
> > > > > (typically around locking, where you find patterns line unlock writing
> > > > > a zero to a single byte, even though the whole lock data structure is
> > > > > a word). And sometimes the access size games are very explicit (eg
> > > > > lib/lockref.c).
> > > > 
> > > > I don't think mixing access sizes should be a real barrier. On the read
> > > 
> > > Well, it actually is, since mixing access sizes is, guess what,
> > > an undefined behavior:
> > > 
> > > (example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)
> > > 
> > > 	thread::scope(|s| {
> > > 	    // This is UB: using different-sized atomic accesses to the same data
> > > 	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
> > > 	    s.spawn(|| unsafe {
> > > 		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
> > > 		differently_sized.store(2, Ordering::Relaxed);
> > > 	    });
> > > 	});
> > > 
> > > Of course, you can say "I will just ignore the UB", but if you have to
> > > ignore "compiler rules" to make your code work, why bother use compiler
> > > builtin in the first place? Being UB means they are NOT guaranteed to
> > > work.
> > 
> > That's not what I'm proposing - you'd need additional compiler support.
> 
> Ah, OK.
> 
> > but the new intrinsic would be no different, semantics wise for the
> > compiler to model, than a "lock orb".
> 
> Be ready to be disappointed:
> 
> 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402078545
> 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402082631
> 
> ;-)
> 
> In fact, if you get a chance to read the previous discussion links I
> shared, you will find I was just like you in the beginning: hope we
> could extend the model to support more kernel code properly. But my
> overall feeling is that it's either very challenging or lack of
> motivation to do.

That's casting - that doesn't work because compiler people hate
aliasing.

But intrinsics for e.g.
__atomic32_read_u8(atomic_u32_t *a, unsigned byte)
__atomic32_write_u8(atomic_u32_t a*, unsigned byte)

should be doable - that's perfectly fine for the compiler to model.

That would admittedly be ugly to use. But, if Rust ever allowed for
marking any struct up to word size as atomic (which we want anyways...),
it could use that under the hood for setting a member variable without
cmpxchg.

Boqun Feng March 23, 2024, 3:51 a.m. UTC | #13

On Fri, Mar 22, 2024 at 11:10:36PM -0400, Kent Overstreet wrote:
> On Fri, Mar 22, 2024 at 07:57:20PM -0700, Boqun Feng wrote:
> > On Fri, Mar 22, 2024 at 10:33:13PM -0400, Kent Overstreet wrote:
> > > On Fri, Mar 22, 2024 at 07:26:28PM -0700, Boqun Feng wrote:
> > > > On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
> > > > [...]
> > > > > > Boqun already mentioned the "mixing access sizes", which is actually
> > > > > > quite fundamental in the kernel, where we play lots of games with that
> > > > > > (typically around locking, where you find patterns line unlock writing
> > > > > > a zero to a single byte, even though the whole lock data structure is
> > > > > > a word). And sometimes the access size games are very explicit (eg
> > > > > > lib/lockref.c).
> > > > > 
> > > > > I don't think mixing access sizes should be a real barrier. On the read
> > > > 
> > > > Well, it actually is, since mixing access sizes is, guess what,
> > > > an undefined behavior:
> > > > 
> > > > (example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)
> > > > 
> > > > 	thread::scope(|s| {
> > > > 	    // This is UB: using different-sized atomic accesses to the same data
> > > > 	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
> > > > 	    s.spawn(|| unsafe {
> > > > 		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
> > > > 		differently_sized.store(2, Ordering::Relaxed);
> > > > 	    });
> > > > 	});
> > > > 
> > > > Of course, you can say "I will just ignore the UB", but if you have to
> > > > ignore "compiler rules" to make your code work, why bother use compiler
> > > > builtin in the first place? Being UB means they are NOT guaranteed to
> > > > work.
> > > 
> > > That's not what I'm proposing - you'd need additional compiler support.
> > 
> > Ah, OK.
> > 
> > > but the new intrinsic would be no different, semantics wise for the
> > > compiler to model, than a "lock orb".
> > 
> > Be ready to be disappointed:
> > 
> > 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402078545
> > 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402082631
> > 
> > ;-)
> > 
> > In fact, if you get a chance to read the previous discussion links I
> > shared, you will find I was just like you in the beginning: hope we
> > could extend the model to support more kernel code properly. But my
> > overall feeling is that it's either very challenging or lack of
> > motivation to do.
> 
> That's casting - that doesn't work because compiler people hate
> aliasing.
> 
> But intrinsics for e.g.
> __atomic32_read_u8(atomic_u32_t *a, unsigned byte)
> __atomic32_write_u8(atomic_u32_t a*, unsigned byte)
> 

so "byte" here is the byte indexing in the u32? Hmm... I guess that'll
work. But I really don't know whether LLVM/Rust will support such an
intrinsic...

Regards,
Boqun

> should be doable - that's perfectly fine for the compiler to model.
> 
> That would admittedly be ugly to use. But, if Rust ever allowed for
> marking any struct up to word size as atomic (which we want anyways...),
> it could use that under the hood for setting a member variable without
> cmpxchg.

Kent Overstreet March 23, 2024, 4:16 a.m. UTC | #14

On Fri, Mar 22, 2024 at 08:51:03PM -0700, Boqun Feng wrote:
> On Fri, Mar 22, 2024 at 11:10:36PM -0400, Kent Overstreet wrote:
> > On Fri, Mar 22, 2024 at 07:57:20PM -0700, Boqun Feng wrote:
> > > On Fri, Mar 22, 2024 at 10:33:13PM -0400, Kent Overstreet wrote:
> > > > On Fri, Mar 22, 2024 at 07:26:28PM -0700, Boqun Feng wrote:
> > > > > On Fri, Mar 22, 2024 at 10:07:31PM -0400, Kent Overstreet wrote:
> > > > > [...]
> > > > > > > Boqun already mentioned the "mixing access sizes", which is actually
> > > > > > > quite fundamental in the kernel, where we play lots of games with that
> > > > > > > (typically around locking, where you find patterns line unlock writing
> > > > > > > a zero to a single byte, even though the whole lock data structure is
> > > > > > > a word). And sometimes the access size games are very explicit (eg
> > > > > > > lib/lockref.c).
> > > > > > 
> > > > > > I don't think mixing access sizes should be a real barrier. On the read
> > > > > 
> > > > > Well, it actually is, since mixing access sizes is, guess what,
> > > > > an undefined behavior:
> > > > > 
> > > > > (example in https://doc.rust-lang.org/std/sync/atomic/#memory-model-for-atomic-accesses)
> > > > > 
> > > > > 	thread::scope(|s| {
> > > > > 	    // This is UB: using different-sized atomic accesses to the same data
> > > > > 	    s.spawn(|| atomic.store(1, Ordering::Relaxed));
> > > > > 	    s.spawn(|| unsafe {
> > > > > 		let differently_sized = transmute::<&AtomicU16, &AtomicU8>(&atomic);
> > > > > 		differently_sized.store(2, Ordering::Relaxed);
> > > > > 	    });
> > > > > 	});
> > > > > 
> > > > > Of course, you can say "I will just ignore the UB", but if you have to
> > > > > ignore "compiler rules" to make your code work, why bother use compiler
> > > > > builtin in the first place? Being UB means they are NOT guaranteed to
> > > > > work.
> > > > 
> > > > That's not what I'm proposing - you'd need additional compiler support.
> > > 
> > > Ah, OK.
> > > 
> > > > but the new intrinsic would be no different, semantics wise for the
> > > > compiler to model, than a "lock orb".
> > > 
> > > Be ready to be disappointed:
> > > 
> > > 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402078545
> > > 	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/is.20atomic.20aliasing.20allowed.3F/near/402082631
> > > 
> > > ;-)
> > > 
> > > In fact, if you get a chance to read the previous discussion links I
> > > shared, you will find I was just like you in the beginning: hope we
> > > could extend the model to support more kernel code properly. But my
> > > overall feeling is that it's either very challenging or lack of
> > > motivation to do.
> > 
> > That's casting - that doesn't work because compiler people hate
> > aliasing.
> > 
> > But intrinsics for e.g.
> > __atomic32_read_u8(atomic_u32_t *a, unsigned byte)
> > __atomic32_write_u8(atomic_u32_t a*, unsigned byte)
> > 
> 
> so "byte" here is the byte indexing in the u32? Hmm... I guess that'll
> work. But I really don't know whether LLVM/Rust will support such an
> intrinsic...

They're going to need this eventually - really, entire structs that can
be marked as atomic. Types aren't limited to the integers.

Andrew Lunn March 23, 2024, 2:29 p.m. UTC | #15

> There are also issues like where one Rust thread does a store(..,
> RELEASE), and a C thread does a rcu_deference(), in practice, it
> probably works but no one works out (and no one would work out) a model
> to describe such an interaction.

Isn't that what Paul E. McKenney litmus tests are all about?

tools/memory-model/litmus-test

	Andrew

Boqun Feng March 23, 2024, 2:41 p.m. UTC | #16

On Sat, Mar 23, 2024 at 03:29:11PM +0100, Andrew Lunn wrote:
> > There are also issues like where one Rust thread does a store(..,
> > RELEASE), and a C thread does a rcu_deference(), in practice, it
> > probably works but no one works out (and no one would work out) a model
> > to describe such an interaction.
> 
> Isn't that what Paul E. McKenney litmus tests are all about?
> 

Litmus tests (or herd, or any other memory model tools) works for either
LKMM or C++ memory model. But there is no model I'm aware of works for
the communication between two memory models. So for example:

	Rust thread:

	let mut foo: Box<Foo> = ...;
	foo.a = 1;
	let global_ptr: &AtomicPtr = ...;
	global_ptr.store(foo.leak() as _, RELEASE);

	C thread:

	rcu_read_lock();

	foo = rcu_dereference(global_ptr);
	if (foo) {
		r1 = foo->a;
	}

	rcu_read_unlock();

no tool or model yet to guarantee "r1" is 1, but yeah, in practice for
the case we care, it's probably guaranteed. But no tool or model means
challenging for code reasoning.

Regards,
Boqun

> tools/memory-model/litmus-test
> 
> 	Andrew

Boqun Feng March 23, 2024, 2:55 p.m. UTC | #17

On Sat, Mar 23, 2024 at 07:41:28AM -0700, Boqun Feng wrote:
> On Sat, Mar 23, 2024 at 03:29:11PM +0100, Andrew Lunn wrote:
> > > There are also issues like where one Rust thread does a store(..,
> > > RELEASE), and a C thread does a rcu_deference(), in practice, it
> > > probably works but no one works out (and no one would work out) a model
> > > to describe such an interaction.
> > 
> > Isn't that what Paul E. McKenney litmus tests are all about?
> > 
> 
> Litmus tests (or herd, or any other memory model tools) works for either
> LKMM or C++ memory model. But there is no model I'm aware of works for
> the communication between two memory models. So for example:
> 
> 	Rust thread:
> 
> 	let mut foo: Box<Foo> = ...;
> 	foo.a = 1;
> 	let global_ptr: &AtomicPtr = ...;
> 	global_ptr.store(foo.leak() as _, RELEASE);
> 
> 	
> 	C thread:
> 
> 	rcu_read_lock();
> 
> 	foo = rcu_dereference(global_ptr);
> 	if (foo) {
> 		r1 = foo->a;
> 	}
> 	
> 	rcu_read_unlock();
> 
> no tool or model yet to guarantee "r1" is 1, but yeah, in practice for
> the case we care, it's probably guaranteed. But no tool or model means
> challenging for code reasoning.
> 

There are also cases where two similar APIs from C++ memory model and
LKMM have different semantics, for example, a SeqCst atomic in C++
memory model doesn't imply a full barrier, while a fully ordered LKMM
atomic does:

	Rust:

	a.store(1, RELAXED);
	x.fetch_add(1, SeqCst);
	b.store(2, RELAXED);

	// ^ writes to a and b are not ordered.

	C:

	WRITE_ONCE(*a, 1);
	atomic_fetch_add(x, 1);
	WRITE_ONCE(*b, 2);

	// ^ writes to a and b are ordered.

So if you used to have two parts synchronizing each other with LKMM
atomics, converting one side to Rust *and* using Rust atomics requires
much caution.

Regards,
Boqun

> Regards,
> Boqun
> 
> > tools/memory-model/litmus-test
> > 
> > 	Andrew

comex March 23, 2024, 9:40 p.m. UTC | #18

> On Mar 22, 2024, at 8:12 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> And when the compiler itself is fundamentally buggy, you're kind of
> screwed. When you roll your own, you can work around the bugs in
> compilers.

That may be true, but the LLVM issue you cited isn’t a good example.  In that issue, the function being miscompiled doesn’t actually use any barriers or atomics itself; only the scaffolding around it does.  The same issue would happen even if the scaffolding used LKMM atomics.

For anyone curious: The problematic optimization involves an allocation (‘p’) that is initially private to the function, but is returned at the end of the function.   LLVM moves a non-atomic store to that allocation across an external function call (to ‘foo’).  This reordering would be blatantly invalid if any other code could observe the contents of the allocation, but is valid if the allocation is private to the function.  LLVM assumes the latter: after all, the pointer to it hasn’t escaped.  Yet.  Except that in a weak memory model, the escape can ‘time travel’...

Alan Stern March 24, 2024, 3:22 p.m. UTC | #19

On Sat, Mar 23, 2024 at 05:40:23PM -0400, comex wrote:
> That may be true, but the LLVM issue you cited isn’t a good example.  
> In that issue, the function being miscompiled doesn’t actually use any 
> barriers or atomics itself; only the scaffolding around it does.  The 
> same issue would happen even if the scaffolding used LKMM atomics.
> 
> For anyone curious: The problematic optimization involves an 
> allocation (‘p’) that is initially private to the function, but is 
> returned at the end of the function.  LLVM moves a non-atomic store to 
> that allocation across an external function call (to ‘foo’).  This 
> reordering would be blatantly invalid if any other code could observe 
> the contents of the allocation, but is valid if the allocation is 
> private to the function.  LLVM assumes the latter: after all, the 
> pointer to it hasn’t escaped.  Yet.  Except that in a weak memory 
> model, the escape can ‘time travel’...

It's hard to understand exactly what you mean, but consider the 
following example:

int *globalptr;
int x;

int *f() {
	int *p = kzalloc(sizeof(int));

	L1: *p = 1;
	L2: foo();
	return p;
}

void foo() {
	smp_store_release(&x, 2);
}

void thread0() {
	WRITE_ONCE(globalptr, f());
}

void thread1() {
	int m, n;
	int *q;

	m = smp_load_acquire(&x);
	q = READ_ONCE(globalptr);
	if (m && q)
		n = *q;
}

(If you like, pretend each of these function definitions lives in a 
different source file -- it doesn't matter.)

With no optimization, whenever thread1() reads *q it will always obtain 
1, thanks to the store-release in foo() and the load-acquire() in 
thread1().  But if the compiler swaps L1 and L2 in f() then this is not 
guaranteed.  On a weakly ordered architecture, thread1() could then get 
0 from *q.

I don't know if this is what you meant by "in a weak memory model, the 
escape can ‘time travel'".  Regardless, it seems very clear that any 
compiler which swaps L1 and L2 in f() has a genuine bug.

Alan Stern

comex March 24, 2024, 5:37 p.m. UTC | #20

> On Mar 24, 2024, at 11:22 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> 
> I don't know if this is what you meant by "in a weak memory model, the 
> escape can ‘time travel'".  Regardless, it seems very clear that any 
> compiler which swaps L1 and L2 in f() has a genuine bug.

Yes, that’s what I meant.  Clang thinks it’s valid to swap L1 and L2.  Though, for it to actually happen, they would have to be in a loop, since the problematic optimization is “loop-invariant code motion".  Here’s a modified version of your f() that shows the optimization in action:

https://godbolt.org/z/bdaTjjvMs

Anyway, my point is just that using LKMM doesn’t save you from the bug.

Mark Rutland March 25, 2024, 10:44 a.m. UTC | #21

On Fri, Mar 22, 2024 at 04:38:35PM -0700, Boqun Feng wrote:
> Hi,
> 
> Since I see more and more Rust code is comming in, I feel like this
> should be sent sooner rather than later, so here is a WIP to open the
> discussion and get feedback.
> 
> One of the most important questions we need to answer is: which
> memory (ordering) model we should use when developing Rust in Linux
> kernel, given Rust has its own memory ordering model[1]. I had some
> discussion with Rust language community to understand their position
> on this:
> 
> 	https://github.com/rust-lang/unsafe-code-guidelines/issues/348#issuecomment-1218407557
> 	https://github.com/rust-lang/unsafe-code-guidelines/issues/476#issue-2001382992
> 
> My takeaway from these discussions, along with other offline discussion
> is that supporting two memory models is challenging for both correctness
> reasoning (some one needs to provide a model) and implementation (one
> model needs to be aware of the other model). So that's not wise to do
> (at least at the beginning). So the most reasonable option to me is:
> 
> 	we only use LKMM for Rust code in kernel (i.e. avoid using
> 	Rust's own atomic).
> 
> Because kernel developers are more familiar with LKMM and when Rust code
> interacts with C code, it has to use the model that C code uses.

I think that makes sense; if nothing else it's consistent with how we handle
the C atomics today.

> And this patchset is the result of that option. I introduced an atomic
> library to wrap and implement LKMM atomics (of course, given it's a WIP,
> so it's unfinished). Things to notice:
> 
> * I know I could use Rust macro to generate the whole set of atomics,
>   but I choose not to in the beginning, as I want to make it easier to
>   review.
> 
> * Very likely, we will only have AtomicI32, AtomicI64 and AtomicUsize
>   (i.e no atomic for bool, u8, u16, etc), with limited support for
>   atomic load and store on 8/16 bits.
> 
> * I choose to re-implement atomics in Rust `asm` because we are still
>   figuring out how we can make it easy and maintainable for Rust to call
>   a C function _inlinely_ (Gary makes some progress [2]). Otherwise,
>   atomic primitives would be function calls, and that can be performance
>   bottleneck in a few cases.

I don't think we want to maintain two copies of each architecture's atomics.
This gets painful very quickly (e.g. as arm64's atomics get patched between
LL/SC and LSE forms).

Can we start off with out-of-line atomics, and see where the bottlenecks are?

It's relatively easy to do that today, at least for the atomic*_*() APIs:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=atomics/outlined&id=e0a77bfa63e7416d610769aa4ab62bc06993ce56

... which IIUC covers the "AtomicI32, AtomicI64 and AtomicUsize" cases you
mention above.

Mark.

Philipp Stanner March 25, 2024, 1:56 p.m. UTC | #22

On Fri, 2024-03-22 at 17:36 -0700, Linus Torvalds wrote:
> On Fri, 22 Mar 2024 at 17:21, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> > 
> > Besides that there's cross arch support to think about - it's hard
> > to
> > imagine us ever ditching our own atomics.
> 
> > [... SNIP ...]
> > 
> > I was thinking about something more incremental - just an optional
> > mode
> > where our atomics were C atomics underneath. It'd probably give the
> > compiler people a much more effective way to test their stuff than
> > anything they have now.
> 
> I suspect it might be painful, and some compiler people would throw
> their hands up in horror, because the C++ atomics model is based
> fairly solidly on atomic types, and the kernel memory model is much
> more fluid.
> 
> Boqun already mentioned the "mixing access sizes", which is actually
> quite fundamental in the kernel, where we play lots of games with
> that
> (typically around locking, where you find patterns line unlock
> writing
> a zero to a single byte, even though the whole lock data structure is
> a word). And sometimes the access size games are very explicit (eg
> lib/lockref.c).
> 
> But it actually goes deeper than that. While we do have "atomic_t"
> etc
> for arithmetic atomics, and that probably would map fairly well to
> C++
> atomics, in other cases we simply base our atomics not on _types_,
> but
> on code.
> 
> IOW, we do things like "cmpxchg()", and the target of that atomic
> access is just a regular data structure field.
> 
> It's kind of like our "volatile" usage. If you read the C (and C++)
> standards, you'll find that you should use "volatile" on data types.
> That's almost *never* what the kernel does. The kernel uses
> "volatile"
> in _code_ (ie READ_ONCE() etc), and uses it by casting etc.
> 
> Compiler people don't tend to really like those kinds of things.

Just for my understanding: Why don't they like it?

I guess since compiler people have to support volatile pointers
anyways, temporarily casting something to such a volatile pointer
shouldn't be a problem either – so they don't dislike it because it's
more difficult to implement, but because it's more difficult to verify
for correctness?


P.

> 
>             Linus
>

Linus Torvalds March 25, 2024, 5:44 p.m. UTC | #23

On Mon, 25 Mar 2024 at 06:57, Philipp Stanner <pstanner@redhat.com> wrote:
>
> On Fri, 2024-03-22 at 17:36 -0700, Linus Torvalds wrote:
> >
> > It's kind of like our "volatile" usage. If you read the C (and C++)
> > standards, you'll find that you should use "volatile" on data types.
> > That's almost *never* what the kernel does. The kernel uses
> > "volatile"
> > in _code_ (ie READ_ONCE() etc), and uses it by casting etc.
> >
> > Compiler people don't tend to really like those kinds of things.
>
> Just for my understanding: Why don't they like it?

So I actually think most compiler people are perfectly fine with the
kernel model of mostly doing 'volatile' not on the data structures
themselves, but as accesses through casts.

It's very traditional C, and there's actually nothing particularly odd
about it. Not even from a compiler standpoint.

In fact, I personally will argue that it is fundamentally wrong to
think that the underlying data has to be volatile. A variable may be
entirely stable in some cases (ie locks held), but not in others.

So it's not the *variable* (aka "object") that is 'volatile', it's the
*context* that makes a particular access volatile.

That explains why the kernel has basically zero actual volatile
objects, and 99% of all volatile accesses are done through accessor
functions that use a cast to mark a particular access volatile.

But I've had negative comments from compiler people who read the
standards as language lawyers (which honestly, I despise - it's always
possible to try to argue what the meaning of some wording is), and
particularly C++ people used to be very very antsy about "volatile".

They had some truly _serious_ problems with volatile.

The C++ people spent absolutely insane amounts of time arguing about
"volatile objects" vs "accesses", and how an access through a cast
didn't make the underlying object volatile etc.

There were endless discussions because a lvalue isn't supposed to be
an access (an lvalue is something that is being acted on, and it
shouldn't imply an access because an access will then cause other
things in C++). So a statement expression that was just an lvalue
shouldn't imply an access in C++ originally, but obviously when the
thing was volatile it *had* to do so, and there was gnashing of teeth
over this all.

And all of it was purely semantic nitpicking about random wording. The
C++ people finally tried to save face by claiming that it was always
the C (not C++) rules that were unclear, and introduced the notion of
"glvalue", and it's all good now, but there's literally decades of
language lawyering and pointless nitpicking about the difference
between "objects" and "accesses".

Sane people didn't care, but if you reported a compiler bug about
volatile use, you had better be ready to sit back and be flamed for
how your volatile pointer cast wasn't an "object" and that the
compiler that clearly generated wrong code was technically correct,
and that your mother was a hamster.

It's a bit like the NULL debacle. Another thing that took the C++
people a couple of decades to admit they were wrong all along, and
that NULL isn't actually 'integer zero' in any sane language that
claims to care deeply about types.

[ And again, to save face, at no point did they say "ok, '(void *)0'
is fine" - they introduced a new __nullptr thing just so that they
wouldn't have to admit that their decades of arguing was just them
being wrong. You'll find another decade of arguments explaining the
finer details about _that_ difference ]

It turns out that the people who are language-lawyering nitpickers are
then happy to be "proven right" by adding some more pointless
syntacting language-lawyering language.

Which I guess makes sense, but to the rest of us it all looks a bit pointless.

                  Linus

Kent Overstreet March 25, 2024, 6:59 p.m. UTC | #24

On Mon, Mar 25, 2024 at 10:44:43AM -0700, Linus Torvalds wrote:
> On Mon, 25 Mar 2024 at 06:57, Philipp Stanner <pstanner@redhat.com> wrote:
> >
> > On Fri, 2024-03-22 at 17:36 -0700, Linus Torvalds wrote:
> > >
> > > It's kind of like our "volatile" usage. If you read the C (and C++)
> > > standards, you'll find that you should use "volatile" on data types.
> > > That's almost *never* what the kernel does. The kernel uses
> > > "volatile"
> > > in _code_ (ie READ_ONCE() etc), and uses it by casting etc.
> > >
> > > Compiler people don't tend to really like those kinds of things.
> >
> > Just for my understanding: Why don't they like it?
> 
> So I actually think most compiler people are perfectly fine with the
> kernel model of mostly doing 'volatile' not on the data structures
> themselves, but as accesses through casts.
> 
> It's very traditional C, and there's actually nothing particularly odd
> about it. Not even from a compiler standpoint.
> 
> In fact, I personally will argue that it is fundamentally wrong to
> think that the underlying data has to be volatile. A variable may be
> entirely stable in some cases (ie locks held), but not in others.
> 
> So it's not the *variable* (aka "object") that is 'volatile', it's the
> *context* that makes a particular access volatile.
> 
> That explains why the kernel has basically zero actual volatile
> objects, and 99% of all volatile accesses are done through accessor
> functions that use a cast to mark a particular access volatile.
> 
> But I've had negative comments from compiler people who read the
> standards as language lawyers (which honestly, I despise - it's always
> possible to try to argue what the meaning of some wording is), and
> particularly C++ people used to be very very antsy about "volatile".
> 
> They had some truly _serious_ problems with volatile.
> 
> The C++ people spent absolutely insane amounts of time arguing about
> "volatile objects" vs "accesses", and how an access through a cast
> didn't make the underlying object volatile etc.

To be fair, "volatile" dates from an era when we didn't have the haziest
understanding of what a working memory model for C would look like or
why we'd even want one.

(Someone might want to think about depracating volatile on objects and
adding compiler flag to disable it; I suspect it'd be a useful cleanup
for the compiler guys if they could get rid of it.)

The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
with modern thinking, just done with the tools available at the time. A
more modern version would be just

__atomic_load_n(ptr, __ATOMIC_RELAXED)

Except C atomics only support pointer and integer types, READ_ONCE()
supports anything up to machine word size - that's something the C
people need to fix.

Linus Torvalds March 25, 2024, 7:44 p.m. UTC | #25

On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> To be fair, "volatile" dates from an era when we didn't have the haziest
> understanding of what a working memory model for C would look like or
> why we'd even want one.

I don't disagree, but I find it very depressing that now that we *do*
know about memory models etc, the C++ memory model basically doubled
down on the same "object" model.

> The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> with modern thinking, just done with the tools available at the time. A
> more modern version would be just
>
> __atomic_load_n(ptr, __ATOMIC_RELAXED)

Yes. Again, that's the *right* model in many ways, where you mark the
*access*, not the variable. You make it completely and utterly clear
that this is a very explicit access to memory.

But that's not what C++ actually did. They went down the same old
"volatile object" road, and instead of marking the access, they mark
the object, and the way you do the above is

    std::atomic_int value;

and then you just access 'value' and magic happens.

EXACTLY the same way that

   volatile int value;

works, in other words. With exactly the same downsides.

And yes, I think that model is a nice shorthand. But it should be a
*shorthand*, not the basis of the model.

I do find it annoying, because the C++ people literally started out
with shorthands. The whole "pass by reference" is literally nothing
but a shorthand for pointers (ooh, scary scary pointers), where the
address-of is implied at the call site, and the 'dereference'
operation is implied at use.

So it's not that shorthands are wrong. And it's not that C++ isn't
already very fundamentally used to them. But despite that, the C++
memory model is very much designed around the broken object model, and
as already shown in this thread, it causes actual immediate problems.

And it's not just C++. Rust is supposed to be the really moden thing.
And it made the *SAME* fundamental design mistake.

IOW, the whole access size problem that Boqun described is
*inherently* tied to the fact that the C++ and Rust memory model is
badly designed from the wrong principles.

Instead of designing it as a "this is an atomic object that you can do
these operations on", it should have been "this is an atomic access,
and you can use this simple object model to have the compiler generate
the accesses for you".

This is why I claim that LKMM is fundamentally better. It didn't start
out from a bass-ackwards starting point of marking objects "atomic".

And yes, the LKMM is a bit awkward, because we don't have the
shorthands, so you have to write out "atomic_read()" and friends.

Tough. It's better to be correct than to be simple.

             Linus

Boqun Feng March 25, 2024, 8:59 p.m. UTC | #26

On Mon, Mar 25, 2024 at 10:44:45AM +0000, Mark Rutland wrote:
[...]
> > 
> > * I choose to re-implement atomics in Rust `asm` because we are still
> >   figuring out how we can make it easy and maintainable for Rust to call
> >   a C function _inlinely_ (Gary makes some progress [2]). Otherwise,
> >   atomic primitives would be function calls, and that can be performance
> >   bottleneck in a few cases.
> 
> I don't think we want to maintain two copies of each architecture's atomics.
> This gets painful very quickly (e.g. as arm64's atomics get patched between
> LL/SC and LSE forms).
> 

No argument here ;-)

> Can we start off with out-of-line atomics, and see where the bottlenecks are?
> 
> It's relatively easy to do that today, at least for the atomic*_*() APIs:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=atomics/outlined&id=e0a77bfa63e7416d610769aa4ab62bc06993ce56
> 
> ... which IIUC covers the "AtomicI32, AtomicI64 and AtomicUsize" cases you
> mention above.
> 

Thanks! Yes, I know I should check with you before I finalize the
implementation ;-) I will try to integrate that but things to notice:

* For module usage, we need to EXPORT_SYMBOL_GPL() all the atomics, I'm
  OK with that, but I don't know how others feel about it.

* Alice reported performance gap between inline and out-of-line refcount
  operations in Rust binder driver:

	https://github.com/Darksonn/linux/commit/b4be1bd6c44225bf7276a4666fd30b8da9cba517	

  I don't know how much worse since I don't have the data, but that's
  one of the reasons I started with inline asm.

That being said, I totally agree that we could start with out-of-line
atomics, and maybe provide inline version for performance critical
paths. Hoping is we can figure out how Rust could inline a C function
eventually.

Regards,
Boqun

> Mark.

Kent Overstreet March 25, 2024, 9:14 p.m. UTC | #27

On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > To be fair, "volatile" dates from an era when we didn't have the haziest
> > understanding of what a working memory model for C would look like or
> > why we'd even want one.
> 
> I don't disagree, but I find it very depressing that now that we *do*
> know about memory models etc, the C++ memory model basically doubled
> down on the same "object" model.
> 
> > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > with modern thinking, just done with the tools available at the time. A
> > more modern version would be just
> >
> > __atomic_load_n(ptr, __ATOMIC_RELAXED)
> 
> Yes. Again, that's the *right* model in many ways, where you mark the
> *access*, not the variable. You make it completely and utterly clear
> that this is a very explicit access to memory.
> 
> But that's not what C++ actually did. They went down the same old
> "volatile object" road, and instead of marking the access, they mark
> the object, and the way you do the above is
> 
>     std::atomic_int value;
> 
> and then you just access 'value' and magic happens.
> 
> EXACTLY the same way that
> 
>    volatile int value;
> 
> works, in other words. With exactly the same downsides.

Yeah that's crap. Unfortunate too, because this does need to be a type
system thing and we have all the tools to do it correctly now.

What we need is for loads and stores to be explict, and that absolutely
can and should be a type system thing.

In Rust terminology, what we want is

  Volatile<T>

where T is any type that fits in a machine word, and the only operations
it supports are get(), set(), xchg() and cmpxchG().

You DO NOT want it to be possible to transparantly use Volatile<T> in
place of a regular T - in exactly the same way as an atomic_t can't be
used in place of a regular integer.

Boqun Feng March 25, 2024, 9:37 p.m. UTC | #28

On Mon, Mar 25, 2024 at 05:14:41PM -0400, Kent Overstreet wrote:
> On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> > On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > To be fair, "volatile" dates from an era when we didn't have the haziest
> > > understanding of what a working memory model for C would look like or
> > > why we'd even want one.
> > 
> > I don't disagree, but I find it very depressing that now that we *do*
> > know about memory models etc, the C++ memory model basically doubled
> > down on the same "object" model.
> > 
> > > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > > with modern thinking, just done with the tools available at the time. A
> > > more modern version would be just
> > >
> > > __atomic_load_n(ptr, __ATOMIC_RELAXED)

Note that Rust does have something similiar:

	https://doc.rust-lang.org/std/ptr/fn.read_volatile.html

	pub unsafe fn read_volatile<T>(src: *const T) -> T

(and also write_volatile()). So they made a good design putting the
volatile on the accesses rather than the type. However, per the current
Rust memory model these two primitives will be UB when data races happen
:-(

I mean, sure, if I use read_volatile() on an enum (whose valid values
are only 0, 1, 2), and I get a value 3, and the compiler says "you have
a logic bug and I refuse to compile the program correctly", I'm OK. But
if I use read_volatile() to read something like a u32, and I know it's
racy so my program actually handle that, I don't know any sane compiler
would miss-compile, so I don't know why that has to be a UB.

> > 
> > Yes. Again, that's the *right* model in many ways, where you mark the
> > *access*, not the variable. You make it completely and utterly clear
> > that this is a very explicit access to memory.
> > 
> > But that's not what C++ actually did. They went down the same old
> > "volatile object" road, and instead of marking the access, they mark
> > the object, and the way you do the above is
> > 
> >     std::atomic_int value;
> > 
> > and then you just access 'value' and magic happens.
> > 
> > EXACTLY the same way that
> > 
> >    volatile int value;
> > 
> > works, in other words. With exactly the same downsides.
> 
> Yeah that's crap. Unfortunate too, because this does need to be a type
> system thing and we have all the tools to do it correctly now.
> 
> What we need is for loads and stores to be explict, and that absolutely
> can and should be a type system thing.
> 
> In Rust terminology, what we want is
> 
>   Volatile<T>
> 
> where T is any type that fits in a machine word, and the only operations
> it supports are get(), set(), xchg() and cmpxchG().
> 
> You DO NOT want it to be possible to transparantly use Volatile<T> in
> place of a regular T - in exactly the same way as an atomic_t can't be
> used in place of a regular integer.

Yes, this is useful. But no it's not that useful, how could you use that
to read another CPU's stack during some debug functions in a way you
know it's racy?

Regards,
Boqun

Kent Overstreet March 25, 2024, 10:09 p.m. UTC | #29

On Mon, Mar 25, 2024 at 02:37:14PM -0700, Boqun Feng wrote:
> On Mon, Mar 25, 2024 at 05:14:41PM -0400, Kent Overstreet wrote:
> > On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> > > On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > >
> > > > To be fair, "volatile" dates from an era when we didn't have the haziest
> > > > understanding of what a working memory model for C would look like or
> > > > why we'd even want one.
> > > 
> > > I don't disagree, but I find it very depressing that now that we *do*
> > > know about memory models etc, the C++ memory model basically doubled
> > > down on the same "object" model.
> > > 
> > > > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > > > with modern thinking, just done with the tools available at the time. A
> > > > more modern version would be just
> > > >
> > > > __atomic_load_n(ptr, __ATOMIC_RELAXED)
> 
> Note that Rust does have something similiar:
> 
> 	https://doc.rust-lang.org/std/ptr/fn.read_volatile.html
> 
> 	pub unsafe fn read_volatile<T>(src: *const T) -> T
> 
> (and also write_volatile()). So they made a good design putting the
> volatile on the accesses rather than the type. However, per the current
> Rust memory model these two primitives will be UB when data races happen
> :-(
> 
> I mean, sure, if I use read_volatile() on an enum (whose valid values
> are only 0, 1, 2), and I get a value 3, and the compiler says "you have
> a logic bug and I refuse to compile the program correctly", I'm OK. But
> if I use read_volatile() to read something like a u32, and I know it's
> racy so my program actually handle that, I don't know any sane compiler
> would miss-compile, so I don't know why that has to be a UB.

Well, if T is too big to read/write atomically then you'll get torn
reads, including potentially a bit representation that is not a valid T.

Which is why the normal read_volatile<> or Volatile<> should disallow
that.

> > where T is any type that fits in a machine word, and the only operations
> > it supports are get(), set(), xchg() and cmpxchG().
> > 
> > You DO NOT want it to be possible to transparantly use Volatile<T> in
> > place of a regular T - in exactly the same way as an atomic_t can't be
> > used in place of a regular integer.
> 
> Yes, this is useful. But no it's not that useful, how could you use that
> to read another CPU's stack during some debug functions in a way you
> know it's racy?

That's a pretty difficult thing to do, because you don't know the
_layout_ of the other CPU's stack, and even if you do it's going to be
changing underneath you without locking.

So the races thare are equivalent to a bad mem::transmute(), and that is
very much UB.

For a more typical usage of volatile, consider a ringbuffer with one
thread producing and another thread consuming. Then you've got head and
tail pointers, each written by one thread and read by another.

You don't need any locking, just memory barriers and
READ_ONCE()/WRITE_ONCE() to update the head and tail pointers. If you
were writing this in Rust today the easy way would be an atomic integer,
but that's not really correct - you're not doing atomic operations
(locked arithmetic), just volatile reads and writes.

Volatile<T> would be Send and Sync, just like atomic integers. You don't
need locking if you're just working with single values that are small
enough for the machine to read/write atomically.

Boqun Feng March 25, 2024, 10:38 p.m. UTC | #30

On Mon, Mar 25, 2024 at 06:09:19PM -0400, Kent Overstreet wrote:
> On Mon, Mar 25, 2024 at 02:37:14PM -0700, Boqun Feng wrote:
> > On Mon, Mar 25, 2024 at 05:14:41PM -0400, Kent Overstreet wrote:
> > > On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> > > > On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > To be fair, "volatile" dates from an era when we didn't have the haziest
> > > > > understanding of what a working memory model for C would look like or
> > > > > why we'd even want one.
> > > > 
> > > > I don't disagree, but I find it very depressing that now that we *do*
> > > > know about memory models etc, the C++ memory model basically doubled
> > > > down on the same "object" model.
> > > > 
> > > > > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > > > > with modern thinking, just done with the tools available at the time. A
> > > > > more modern version would be just
> > > > >
> > > > > __atomic_load_n(ptr, __ATOMIC_RELAXED)
> > 
> > Note that Rust does have something similiar:
> > 
> > 	https://doc.rust-lang.org/std/ptr/fn.read_volatile.html
> > 
> > 	pub unsafe fn read_volatile<T>(src: *const T) -> T
> > 
> > (and also write_volatile()). So they made a good design putting the
> > volatile on the accesses rather than the type. However, per the current
> > Rust memory model these two primitives will be UB when data races happen
> > :-(
> > 
> > I mean, sure, if I use read_volatile() on an enum (whose valid values
> > are only 0, 1, 2), and I get a value 3, and the compiler says "you have
> > a logic bug and I refuse to compile the program correctly", I'm OK. But
> > if I use read_volatile() to read something like a u32, and I know it's
> > racy so my program actually handle that, I don't know any sane compiler
> > would miss-compile, so I don't know why that has to be a UB.
> 
> Well, if T is too big to read/write atomically then you'll get torn
> reads, including potentially a bit representation that is not a valid T.
> 
> Which is why the normal read_volatile<> or Volatile<> should disallow
> that.
> 

Well, why a racy read_volatile<> is UB on a T who is valid for all bit
representations is what I was complaining about ;-)

> > > where T is any type that fits in a machine word, and the only operations
> > > it supports are get(), set(), xchg() and cmpxchG().
> > > 
> > > You DO NOT want it to be possible to transparantly use Volatile<T> in
> > > place of a regular T - in exactly the same way as an atomic_t can't be
> > > used in place of a regular integer.
> > 
> > Yes, this is useful. But no it's not that useful, how could you use that
> > to read another CPU's stack during some debug functions in a way you
> > know it's racy?
> 
> That's a pretty difficult thing to do, because you don't know the
> _layout_ of the other CPU's stack, and even if you do it's going to be
> changing underneath you without locking.
> 

It's a debug function, I don't care whether the data is accurate, I just
want to get much information as possible. This kinda of usage, along
with cases where the alorigthms are racy themselves are the primary
reasons of volatile _accesses_ instead of volatile _types_. For example,
you want to read ahead of a counter protected by a lock:

	if (unlikely(READ_ONCE(cnt))) {
		spin_lock(lock);
		int c = cnt; // update of the cnt is protected by a lock.
		...
	}

because you want to skip the case where cnt == 0 in a hotpath, and you
know someone is going to check this again in some slowpath, so
inaccurate data doesn't matter.

> So the races thare are equivalent to a bad mem::transmute(), and that is
> very much UB.
> 
> For a more typical usage of volatile, consider a ringbuffer with one
> thread producing and another thread consuming. Then you've got head and
> tail pointers, each written by one thread and read by another.
> 
> You don't need any locking, just memory barriers and
> READ_ONCE()/WRITE_ONCE() to update the head and tail pointers. If you
> were writing this in Rust today the easy way would be an atomic integer,
> but that's not really correct - you're not doing atomic operations
> (locked arithmetic), just volatile reads and writes.
> 

Confused, I don't see how Volatile<T> is better than just atomic in this
case, since atomc_load() and atomic_store() are also not locked in any
memory model if lockless implementation is available.

> Volatile<T> would be Send and Sync, just like atomic integers. You don't
> need locking if you're just working with single values that are small
> enough for the machine to read/write atomically.

So to me Volatile<T> can help in the cases where we know some memory is
"external", for example a MMIO address, or ringbuffer between guests and
hypervisor. But it doesn't really fix the missing functionality here:
allow generating a plain "mov" instruction on x86 for example on _any
valid memory_, and programmers can take care of the result.

Regards,
Boqun

Kent Overstreet March 25, 2024, 11:02 p.m. UTC | #31

On Mon, Mar 25, 2024 at 03:38:32PM -0700, Boqun Feng wrote:
> On Mon, Mar 25, 2024 at 06:09:19PM -0400, Kent Overstreet wrote:
> > On Mon, Mar 25, 2024 at 02:37:14PM -0700, Boqun Feng wrote:
> > > On Mon, Mar 25, 2024 at 05:14:41PM -0400, Kent Overstreet wrote:
> > > > On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> > > > > On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > To be fair, "volatile" dates from an era when we didn't have the haziest
> > > > > > understanding of what a working memory model for C would look like or
> > > > > > why we'd even want one.
> > > > > 
> > > > > I don't disagree, but I find it very depressing that now that we *do*
> > > > > know about memory models etc, the C++ memory model basically doubled
> > > > > down on the same "object" model.
> > > > > 
> > > > > > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > > > > > with modern thinking, just done with the tools available at the time. A
> > > > > > more modern version would be just
> > > > > >
> > > > > > __atomic_load_n(ptr, __ATOMIC_RELAXED)
> > > 
> > > Note that Rust does have something similiar:
> > > 
> > > 	https://doc.rust-lang.org/std/ptr/fn.read_volatile.html
> > > 
> > > 	pub unsafe fn read_volatile<T>(src: *const T) -> T
> > > 
> > > (and also write_volatile()). So they made a good design putting the
> > > volatile on the accesses rather than the type. However, per the current
> > > Rust memory model these two primitives will be UB when data races happen
> > > :-(
> > > 
> > > I mean, sure, if I use read_volatile() on an enum (whose valid values
> > > are only 0, 1, 2), and I get a value 3, and the compiler says "you have
> > > a logic bug and I refuse to compile the program correctly", I'm OK. But
> > > if I use read_volatile() to read something like a u32, and I know it's
> > > racy so my program actually handle that, I don't know any sane compiler
> > > would miss-compile, so I don't know why that has to be a UB.
> > 
> > Well, if T is too big to read/write atomically then you'll get torn
> > reads, including potentially a bit representation that is not a valid T.
> > 
> > Which is why the normal read_volatile<> or Volatile<> should disallow
> > that.
> > 
> 
> Well, why a racy read_volatile<> is UB on a T who is valid for all bit
> representations is what I was complaining about ;-)

yeah, that should not be considered UB; that should be an easy fix. Are
you talking to Rust compiler people about this stuff? I've been meaning
to make my own contacts there, but - sadly, busy as hell.

> > > > where T is any type that fits in a machine word, and the only operations
> > > > it supports are get(), set(), xchg() and cmpxchG().
> > > > 
> > > > You DO NOT want it to be possible to transparantly use Volatile<T> in
> > > > place of a regular T - in exactly the same way as an atomic_t can't be
> > > > used in place of a regular integer.
> > > 
> > > Yes, this is useful. But no it's not that useful, how could you use that
> > > to read another CPU's stack during some debug functions in a way you
> > > know it's racy?
> > 
> > That's a pretty difficult thing to do, because you don't know the
> > _layout_ of the other CPU's stack, and even if you do it's going to be
> > changing underneath you without locking.
> > 
> 
> It's a debug function, I don't care whether the data is accurate, I just
> want to get much information as possible.

yeah, if you just want the typical backtrace functionality where you're
just looking for instruction pointers - that's perfectly
straightforward.

> This kinda of usage, along
> with cases where the alorigthms are racy themselves are the primary
> reasons of volatile _accesses_ instead of volatile _types_. For example,
> you want to read ahead of a counter protected by a lock:
> 
> 	if (unlikely(READ_ONCE(cnt))) {
> 		spin_lock(lock);
> 		int c = cnt; // update of the cnt is protected by a lock.
> 		...
> 	}
> 
> because you want to skip the case where cnt == 0 in a hotpath, and you
> know someone is going to check this again in some slowpath, so
> inaccurate data doesn't matter.

That's an interesting one because in Rust cnt is "inside" the lock, you
can't access it at all without taking the lock - and usually that's
exactly right.

So to allow this we'd annotate in the type definition (with an
attribute) which fields we allow read access to without the lock, then
with some proc macro wizardry we'd get accessors that we can call without
the lock held.

So that probably wouldn't be a Volatile<T> thing, that'd be coupled with
the lock implementation because that's where the accessors would hang
off of and they'd internally probably just use mem::volatile_read().

> > So the races thare are equivalent to a bad mem::transmute(), and that is
> > very much UB.
> > 
> > For a more typical usage of volatile, consider a ringbuffer with one
> > thread producing and another thread consuming. Then you've got head and
> > tail pointers, each written by one thread and read by another.
> > 
> > You don't need any locking, just memory barriers and
> > READ_ONCE()/WRITE_ONCE() to update the head and tail pointers. If you
> > were writing this in Rust today the easy way would be an atomic integer,
> > but that's not really correct - you're not doing atomic operations
> > (locked arithmetic), just volatile reads and writes.
> > 
> 
> Confused, I don't see how Volatile<T> is better than just atomic in this
> case, since atomc_load() and atomic_store() are also not locked in any
> memory model if lockless implementation is available.

It certainly compiles to the same code, yeah. But Volatile<T> really is
the more primitive/generic concept, Atomic<T> is a specialization.

> > Volatile<T> would be Send and Sync, just like atomic integers. You don't
> > need locking if you're just working with single values that are small
> > enough for the machine to read/write atomically.
> 
> So to me Volatile<T> can help in the cases where we know some memory is
> "external", for example a MMIO address, or ringbuffer between guests and
> hypervisor. But it doesn't really fix the missing functionality here:
> allow generating a plain "mov" instruction on x86 for example on _any
> valid memory_, and programmers can take care of the result.

You're talking about going completely outside the type system, though.
There is a need for that, but it's very rare and something we really
want to discourage. Usually, even with volatile access, you do know the
type - and even if you don't, you have to treat it as _something_ so
Volatile<ulong> is probably as good as anything.

Boqun Feng March 25, 2024, 11:41 p.m. UTC | #32

On Mon, Mar 25, 2024 at 07:02:12PM -0400, Kent Overstreet wrote:
> On Mon, Mar 25, 2024 at 03:38:32PM -0700, Boqun Feng wrote:
> > On Mon, Mar 25, 2024 at 06:09:19PM -0400, Kent Overstreet wrote:
> > > On Mon, Mar 25, 2024 at 02:37:14PM -0700, Boqun Feng wrote:
> > > > On Mon, Mar 25, 2024 at 05:14:41PM -0400, Kent Overstreet wrote:
> > > > > On Mon, Mar 25, 2024 at 12:44:34PM -0700, Linus Torvalds wrote:
> > > > > > On Mon, 25 Mar 2024 at 11:59, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > To be fair, "volatile" dates from an era when we didn't have the haziest
> > > > > > > understanding of what a working memory model for C would look like or
> > > > > > > why we'd even want one.
> > > > > > 
> > > > > > I don't disagree, but I find it very depressing that now that we *do*
> > > > > > know about memory models etc, the C++ memory model basically doubled
> > > > > > down on the same "object" model.
> > > > > > 
> > > > > > > The way the kernel uses volatile in e.g. READ_ONCE() is fully in line
> > > > > > > with modern thinking, just done with the tools available at the time. A
> > > > > > > more modern version would be just
> > > > > > >
> > > > > > > __atomic_load_n(ptr, __ATOMIC_RELAXED)
> > > > 
> > > > Note that Rust does have something similiar:
> > > > 
> > > > 	https://doc.rust-lang.org/std/ptr/fn.read_volatile.html
> > > > 
> > > > 	pub unsafe fn read_volatile<T>(src: *const T) -> T
> > > > 
> > > > (and also write_volatile()). So they made a good design putting the
> > > > volatile on the accesses rather than the type. However, per the current
> > > > Rust memory model these two primitives will be UB when data races happen
> > > > :-(
> > > > 
> > > > I mean, sure, if I use read_volatile() on an enum (whose valid values
> > > > are only 0, 1, 2), and I get a value 3, and the compiler says "you have
> > > > a logic bug and I refuse to compile the program correctly", I'm OK. But
> > > > if I use read_volatile() to read something like a u32, and I know it's
> > > > racy so my program actually handle that, I don't know any sane compiler
> > > > would miss-compile, so I don't know why that has to be a UB.
> > > 
> > > Well, if T is too big to read/write atomically then you'll get torn
> > > reads, including potentially a bit representation that is not a valid T.
> > > 
> > > Which is why the normal read_volatile<> or Volatile<> should disallow
> > > that.
> > > 
> > 
> > Well, why a racy read_volatile<> is UB on a T who is valid for all bit
> > representations is what I was complaining about ;-)
> 
> yeah, that should not be considered UB; that should be an easy fix. Are
> you talking to Rust compiler people about this stuff? I've been meaning

Here you go:

	https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/UB.20caused.20by.20races.20on.20.60.7Bread.2Cwrite.7D_volatile.60/near/399343771	

but maybe instead of Rust, LLVM is the best one to talk with on this.
Because my take from the communication with Rust folks is more like
"it's more up to LLVM on this".

> to make my own contacts there, but - sadly, busy as hell.
> 
> > > > > where T is any type that fits in a machine word, and the only operations
> > > > > it supports are get(), set(), xchg() and cmpxchG().
> > > > > 
> > > > > You DO NOT want it to be possible to transparantly use Volatile<T> in
> > > > > place of a regular T - in exactly the same way as an atomic_t can't be
> > > > > used in place of a regular integer.
> > > > 
> > > > Yes, this is useful. But no it's not that useful, how could you use that
> > > > to read another CPU's stack during some debug functions in a way you
> > > > know it's racy?
> > > 
> > > That's a pretty difficult thing to do, because you don't know the
> > > _layout_ of the other CPU's stack, and even if you do it's going to be
> > > changing underneath you without locking.
> > > 
> > 
> > It's a debug function, I don't care whether the data is accurate, I just
> > want to get much information as possible.
> 
> yeah, if you just want the typical backtrace functionality where you're
> just looking for instruction pointers - that's perfectly
> straightforward.
> 
> > This kinda of usage, along
> > with cases where the alorigthms are racy themselves are the primary
> > reasons of volatile _accesses_ instead of volatile _types_. For example,
> > you want to read ahead of a counter protected by a lock:
> > 
> > 	if (unlikely(READ_ONCE(cnt))) {
> > 		spin_lock(lock);
> > 		int c = cnt; // update of the cnt is protected by a lock.
> > 		...
> > 	}
> > 
> > because you want to skip the case where cnt == 0 in a hotpath, and you
> > know someone is going to check this again in some slowpath, so
> > inaccurate data doesn't matter.
> 
> That's an interesting one because in Rust cnt is "inside" the lock, you
> can't access it at all without taking the lock - and usually that's
> exactly right.
> 

(Now you mention that, once I was trying to construct a deadlock case
with some Rust code, but I couldn't since the locks are naturally
hierarchical because of the types, therefore it's impossible to reverse
the lock ordering. Yes, you still can have deadlocks in Rust, but that
hierarchial type trees really help a lot).

> So to allow this we'd annotate in the type definition (with an
> attribute) which fields we allow read access to without the lock, then
> with some proc macro wizardry we'd get accessors that we can call without
> the lock held.
> 

Right, that's field projection:

	https://internals.rust-lang.org/t/pre-rfc-field-projection/17383	

> So that probably wouldn't be a Volatile<T> thing, that'd be coupled with
> the lock implementation because that's where the accessors would hang
> off of and they'd internally probably just use mem::volatile_read().
> 

So we can play the type game as deep as we can, and I'm sure it'll be
helpful, but in the same time, having a reasonable
{read,write}_volatile() semantics on UB and races is also what we need.

> > > So the races thare are equivalent to a bad mem::transmute(), and that is
> > > very much UB.
> > > 
> > > For a more typical usage of volatile, consider a ringbuffer with one
> > > thread producing and another thread consuming. Then you've got head and
> > > tail pointers, each written by one thread and read by another.
> > > 
> > > You don't need any locking, just memory barriers and
> > > READ_ONCE()/WRITE_ONCE() to update the head and tail pointers. If you
> > > were writing this in Rust today the easy way would be an atomic integer,
> > > but that's not really correct - you're not doing atomic operations
> > > (locked arithmetic), just volatile reads and writes.
> > > 
> > 
> > Confused, I don't see how Volatile<T> is better than just atomic in this
> > case, since atomc_load() and atomic_store() are also not locked in any
> > memory model if lockless implementation is available.
> 
> It certainly compiles to the same code, yeah. But Volatile<T> really is
> the more primitive/generic concept, Atomic<T> is a specialization.
> 
> > > Volatile<T> would be Send and Sync, just like atomic integers. You don't
> > > need locking if you're just working with single values that are small
> > > enough for the machine to read/write atomically.
> > 
> > So to me Volatile<T> can help in the cases where we know some memory is
> > "external", for example a MMIO address, or ringbuffer between guests and
> > hypervisor. But it doesn't really fix the missing functionality here:
> > allow generating a plain "mov" instruction on x86 for example on _any
> > valid memory_, and programmers can take care of the result.
> 
> You're talking about going completely outside the type system, though.
> There is a need for that, but it's very rare and something we really
> want to discourage. Usually, even with volatile access, you do know the

Hey, in memory ordering model areas, we are supposed to work on these
rare cases ;-) These are building blocks for high level synchronization
constructions, so I'm entitled to complain ;-) But yes, to your point,
type system can help a lot, however, there are still cases we need to
roll our own.

Regard,
Boqun

> type - and even if you don't, you have to treat it as _something_ so
> Volatile<ulong> is probably as good as anything.

Dr. David Alan Gilbert March 26, 2024, 12:05 a.m. UTC | #33

* Linus Torvalds (torvalds@linux-foundation.org) wrote:

<snip>

> IOW, the whole access size problem that Boqun described is
> *inherently* tied to the fact that the C++ and Rust memory model is
> badly designed from the wrong principles.
> 
> Instead of designing it as a "this is an atomic object that you can do
> these operations on", it should have been "this is an atomic access,
> and you can use this simple object model to have the compiler generate
> the accesses for you".

Isn't one of the aims of the Rust/C++ idea that you can't forget to access
a shared piece of data atomically?

If you want to have 'atomic accesses' explicitly, how do you tell the compiler
what you can use them on, and when it should stop you mixing them with
normal accesses on the same object?

Dave

> This is why I claim that LKMM is fundamentally better. It didn't start
> out from a bass-ackwards starting point of marking objects "atomic".
> 
> And yes, the LKMM is a bit awkward, because we don't have the
> shorthands, so you have to write out "atomic_read()" and friends.
> 
> Tough. It's better to be correct than to be simple.
> 
>              Linus
>

Kent Overstreet March 26, 2024, 12:36 a.m. UTC | #34

On Tue, Mar 26, 2024 at 12:05:48AM +0000, Dr. David Alan Gilbert wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> <snip>
> 
> > IOW, the whole access size problem that Boqun described is
> > *inherently* tied to the fact that the C++ and Rust memory model is
> > badly designed from the wrong principles.
> > 
> > Instead of designing it as a "this is an atomic object that you can do
> > these operations on", it should have been "this is an atomic access,
> > and you can use this simple object model to have the compiler generate
> > the accesses for you".
> 
> Isn't one of the aims of the Rust/C++ idea that you can't forget to access
> a shared piece of data atomically?
> 
> If you want to have 'atomic accesses' explicitly, how do you tell the compiler
> what you can use them on, and when it should stop you mixing them with
> normal accesses on the same object?

"can't forget to access data atomically" - that's only half of it. And
atomic accesses loads/stores are not a thing under the hood, they're
just loads and stores (possibly, but not necessarily, with memory
barriers).

The other half is at the _source_ level you don't want to treat accesses
to volatiles/atomics like accesses to normal variables, you really want
those to be explicit, and not look like normal variable accesses.

std:atomic_int is way better than volatile in the sense that it's not a
barely specified mess, but adding operator overloading was just
gratuitious and unnecessary.

This is a theme with C++ - they add a _ton_ of magic to make things
concise and pretty, but you have to understand in intimate detail what
all that magic is doing or you're totally fucked.

std::atomic_int makes it such that just changing a single line of code
in a single location in your program will change the semantics of your
_entire_ program and the only obserable result will be that it's faster
but a ticking time bomb because you just introduced a ton of races.

With Rust - I honestly haven't looked at whether they added operator
overlaoding for their atomics, but it's _much_ less of a concern because
changing the type to the non-atomic version means your program won't
compile if it's now racy.

Dr. David Alan Gilbert March 26, 2024, 1:35 a.m. UTC | #35

* Kent Overstreet (kent.overstreet@linux.dev) wrote:
> On Tue, Mar 26, 2024 at 12:05:48AM +0000, Dr. David Alan Gilbert wrote:
> > * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > 
> > <snip>
> > 
> > > IOW, the whole access size problem that Boqun described is
> > > *inherently* tied to the fact that the C++ and Rust memory model is
> > > badly designed from the wrong principles.
> > > 
> > > Instead of designing it as a "this is an atomic object that you can do
> > > these operations on", it should have been "this is an atomic access,
> > > and you can use this simple object model to have the compiler generate
> > > the accesses for you".
> > 
> > Isn't one of the aims of the Rust/C++ idea that you can't forget to access
> > a shared piece of data atomically?
> > 
> > If you want to have 'atomic accesses' explicitly, how do you tell the compiler
> > what you can use them on, and when it should stop you mixing them with
> > normal accesses on the same object?
> 
> "can't forget to access data atomically" - that's only half of it. And
> atomic accesses loads/stores are not a thing under the hood, they're
> just loads and stores (possibly, but not necessarily, with memory
> barriers).

That's quite architecturally specific isn't it?
Or is this the distinction between accesses that are implicitly atomic
(i.e. naturally aligned word) and things that are locked/exclusive?
(either with a 'lock' on x86 or load-exclusive/store exclusive on some others)?
Which are we talking about here?

> The other half is at the _source_ level you don't want to treat accesses
> to volatiles/atomics like accesses to normal variables, you really want
> those to be explicit, and not look like normal variable accesses.
> 
> std:atomic_int is way better than volatile in the sense that it's not a
> barely specified mess, but adding operator overloading was just
> gratuitious and unnecessary.
> 
> This is a theme with C++ - they add a _ton_ of magic to make things
> concise and pretty, but you have to understand in intimate detail what
> all that magic is doing or you're totally fucked.
> 
> std::atomic_int makes it such that just changing a single line of code
> in a single location in your program will change the semantics of your
> _entire_ program and the only obserable result will be that it's faster
> but a ticking time bomb because you just introduced a ton of races.
> 
> With Rust - I honestly haven't looked at whether they added operator
> overlaoding for their atomics, but it's _much_ less of a concern because
> changing the type to the non-atomic version means your program won't
> compile if it's now racy.

OK, so that's essentially the opposite worry of what I was saying; I was
worrying about people forgetting to use an atomic access to a shared
variable; I think you're worrying about people forgetting to mark
a variable shared and since the accesses are the same nothing shouts?

Dave

Boqun Feng March 26, 2024, 2:51 a.m. UTC | #36

On Tue, Mar 26, 2024 at 12:05:48AM +0000, Dr. David Alan Gilbert wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> <snip>
> 
> > IOW, the whole access size problem that Boqun described is
> > *inherently* tied to the fact that the C++ and Rust memory model is
> > badly designed from the wrong principles.
> > 
> > Instead of designing it as a "this is an atomic object that you can do
> > these operations on", it should have been "this is an atomic access,
> > and you can use this simple object model to have the compiler generate
> > the accesses for you".
> 
> Isn't one of the aims of the Rust/C++ idea that you can't forget to access
> a shared piece of data atomically?
> 
> If you want to have 'atomic accesses' explicitly, how do you tell the compiler
> what you can use them on, and when it should stop you mixing them with
> normal accesses on the same object?
> 

Well, you can just wrap it in your own atomic types, can't you?

If the atomic primitives that a language provides is access-based, users
can create their own atomic types or language can provide via standard
library, but mixed usage is still allowed when it makes sense (debug
functionality, low level concurrent code that utilizes races, etc.) But
if the atomic primitives that a language provides is type-based, then
you're limited to what you can do. It might be totally fine as Linus
pointed out, if you just write a portable library, and don't want to
care about architectural details. But that's not the case in Linux
kernel.

Regards,
Boqun

> Dave
> 
> > This is why I claim that LKMM is fundamentally better. It didn't start
> > out from a bass-ackwards starting point of marking objects "atomic".
> > 
> > And yes, the LKMM is a bit awkward, because we don't have the
> > shorthands, so you have to write out "atomic_read()" and friends.
> > 
> > Tough. It's better to be correct than to be simple.
> > 
> >              Linus
> > 
> -- 
>  -----Open up your eyes, open up your mind, open up your code -------   
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/

Kent Overstreet March 26, 2024, 3:28 a.m. UTC | #37

On Tue, Mar 26, 2024 at 01:35:46AM +0000, Dr. David Alan Gilbert wrote:
> OK, so that's essentially the opposite worry of what I was saying; I was
> worrying about people forgetting to use an atomic access to a shared
> variable; I think you're worrying about people forgetting to mark
> a variable shared and since the accesses are the same nothing shouts?

In biological evolution, novel useful traits are generally not
accessible via a single mutation; many neutral mutations are required
first.

Evolution is able to proceed quickly because there are a great many
neutral mutations (that is, evolution quickly searches all possible
paths to find accessible positive mutations), and because negative
mutations are culled quickly - often before the first cell division.

(The most common mutation being the addition or deletion of a base pair;
but amino acids are coded for by groups of three base pairs, so that
shifts everything on the chromosone after the mutation so that it codes
for completely different amino acids. That cell won't live to divide
again).

Actual genetic diseases that significantly impair fitness are quite
rare, and if they weren't we'd have a major problem.

Programming at scale is million monkeys stuff - we're all hammering on
our keyboards at random, the good programs survive and the bad programs
are forgotten.

Similarly to biological evolution, we want most edits to a program to
result in a program that either still works, or fails immediately -
fails to compile, or is caught immediately by basic testing.

If edits can result in latent undefined behaviour or programs that
_mostly_ work, and then explode in unpredictable ways weeks/months/years
later - that's a huge problem. In the worst case, those bugs/negative
mutations accumulate faster than they can be culled.

Thank god we have source control.

Places where we're working with extremely loose synchronization - no
locking, raw memory barriers - are the worst kind of hand grenade, they
result in bugs that are impossible to cull quickly.

In kernel programming, we're always walking around with live hand
grenades.

So what do we do?

We slow down, we take every step slowly and intentionally while telling
everyone not to bump us because we're holding a live hand grenade - raw
atomics, raw unlocked variables, memory barriers, they all get special
care and extra comments.

And we all have fuck-tons of code we need to be able to understand,
review, debug and maintain, so we always try to write our code in a
style where the if it's wrong, we'll see that _locally_, without having
to go through and remember how _everything_ out of the possibly
thousands of relevant lines work.

I'm personally responsible for over 100k LOC of highly intricate code
with high consequences for failure, and regularly have to debug issues
arising somewhere in north of a million LOC - and when something goes
wrong I have to be able to fully debug it _quickly_.

What C++ does is like taking those hand grenades, with the pin already
out - and leaving one under the couch cushions, another in the
silverware drawer, another in the laundry basket - and expecting you to
remember where you put them.

Going back to the C++ example, the really fatal thing with how they do
it is how a change in one line of code can completely change the
semantics of a bunch of different code, and no human reviewer can be
expected to catch bugs that might introduce and the compiler certainly
won't.

Now imagine multiple people working on the same code, at different
times.

Now imagine patches getting mixed up, reordered, one of them getting
lost, merge conflicts - i.e. shit that happens all the time, and what
happens if you're using C++ style atomics.

Terrifying stuff.

Linus Torvalds March 26, 2024, 3:49 a.m. UTC | #38

On Mon, 25 Mar 2024 at 17:05, Dr. David Alan Gilbert <dave@treblig.org> wrote:
>
> Isn't one of the aims of the Rust/C++ idea that you can't forget to access
> a shared piece of data atomically?

If that is an aim, it's a really *bad* one.

Really.

It very much should never have been an aim, and I hope it wasn't. I
think, and hope, that the source of the C++ and Rust bad decisions is
cluelessness, not active malice.

Take Rust - one big point of Rust is the whole "safe" thing, but it's
very much not a straightjacket like Pascal was. There's a "safe" part
to Rust, but equally importantly, there's also the "unsafe" part to
Rust.

The safe part is the one that most programmers are supposed to use.
It's the one that allows you to not have to worry too much about
things. It's the part that makes it much harder to screw up.

But the *unsafe* part is what makes Rust powerful. It's the part that
works behind the curtain. It's the part that may be needed to make the
safe parts *work*.

And yes, an application programmer might never actually need to use
it, and in fact in many projects the rule might be that unsafe Rust is
simply never even an option - but that doesn't mean that the unsafe
parts don't exist.

Because those unsafe parts are needed to make it all work in reality.

And you should never EVER base your whole design around the "safe"
part. Then you get a language that is a straight-jacket.

So I'd very strongly argue that the core atomics should be done the
"unsafe" way - allow people to specify exactly when they want exactly
what access. Allow people to mix and match and have overlapping
partial aliases, because if you implement things like locking, you
*need* those partially aliasing accesses, and you need to make
overlapping atomics where sometimes you access only one part of the
field.

And yes, that will be unsafe, and it might even be unportable, but
it's exactly the kind of thing you need in order to avoid having to
use assembly language to do your locking.

And by all means, you should relegate that to the "unsafe corner" of
the language. And maybe don't talk about the unsafe sharp edges in the
first chapter of the book about the language.

But you should _start_ the design of your language memory model around
the unsafe "raw atomic access operations" model.

Then you can use those strictly more powerful operations, and you
create an object model *around* it.

So you create safe objects like just an atomic counter. In *that*
corner of the language, you have the "safe atomics" - they aren't the
fundamental implementation, but they are the safe wrappers *around*
the more powerful (but unsafe) core.

With that "atomic counter" you cannot forget to do atomic accesses,
because that safe corner of the world doesn't _have_ anything but the
safe atomic accesses for every time you use the object.

See? Having the capability to do powerful and maybe unsafe things does
not force people to expose and use all that power. You can - and
should - wrap the powerful model with safer and simpler interfaces.

This isn't something specific to atomics. Not even remotely. This is
quite fundamental. You often literally _cannot_ do interesting things
using only safe interfaces. You want safe memory allocations - but to
actually write the allocator itself, you want to have all those unsafe
escape methods - all those raw pointers with arbitrary arithmetic etc.

And if you don't have unsafe escapes, you end up doing what so many
languages did: the libraries are written in something more powerful
like C, because C literally can do things that other languages
*cannot* do.

Don't let people fool you with talk about Turing machines and similar
smoke-and-mirror garbage. It's a bedtime story for first-year CS
students. It's not true.

Not all languages are created equal. Not all languages can do the same
things. If your language doesn't have those unsafe escapes, your
language is inherently weaker, and inherently worse for it.

           Linus

Dr. David Alan Gilbert March 26, 2024, 2:35 p.m. UTC | #39

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Mon, 25 Mar 2024 at 17:05, Dr. David Alan Gilbert <dave@treblig.org> wrote:
> >
> > Isn't one of the aims of the Rust/C++ idea that you can't forget to access
> > a shared piece of data atomically?
> 
> If that is an aim, it's a really *bad* one.
> 
> Really.
>
> It very much should never have been an aim, and I hope it wasn't. I
> think, and hope, that the source of the C++ and Rust bad decisions is
> cluelessness, not active malice.

Oh that hit a nerve :-)

> Take Rust - one big point of Rust is the whole "safe" thing, but it's
> very much not a straightjacket like Pascal was. There's a "safe" part
> to Rust, but equally importantly, there's also the "unsafe" part to
> Rust.
> 
> The safe part is the one that most programmers are supposed to use.
> It's the one that allows you to not have to worry too much about
> things. It's the part that makes it much harder to screw up.
> 
> But the *unsafe* part is what makes Rust powerful. It's the part that
> works behind the curtain. It's the part that may be needed to make the
> safe parts *work*.
> 
> And yes, an application programmer might never actually need to use
> it, and in fact in many projects the rule might be that unsafe Rust is
> simply never even an option - but that doesn't mean that the unsafe
> parts don't exist.
> 
> Because those unsafe parts are needed to make it all work in reality.
> 
> And you should never EVER base your whole design around the "safe"
> part. Then you get a language that is a straight-jacket.
> 
> So I'd very strongly argue that the core atomics should be done the
> "unsafe" way - allow people to specify exactly when they want exactly
> what access. Allow people to mix and match and have overlapping
> partial aliases, because if you implement things like locking, you
> *need* those partially aliasing accesses, and you need to make
> overlapping atomics where sometimes you access only one part of the
> field.
> 
> And yes, that will be unsafe, and it might even be unportable, but
> it's exactly the kind of thing you need in order to avoid having to
> use assembly language to do your locking.
> 
> And by all means, you should relegate that to the "unsafe corner" of
> the language. And maybe don't talk about the unsafe sharp edges in the
> first chapter of the book about the language.
> 
> But you should _start_ the design of your language memory model around
> the unsafe "raw atomic access operations" model.
> 
> Then you can use those strictly more powerful operations, and you
> create an object model *around* it.
> 
> So you create safe objects like just an atomic counter. In *that*
> corner of the language, you have the "safe atomics" - they aren't the
> fundamental implementation, but they are the safe wrappers *around*
> the more powerful (but unsafe) core.
> 
> With that "atomic counter" you cannot forget to do atomic accesses,
> because that safe corner of the world doesn't _have_ anything but the
> safe atomic accesses for every time you use the object.
> 
> See? Having the capability to do powerful and maybe unsafe things does
> not force people to expose and use all that power. You can - and
> should - wrap the powerful model with safer and simpler interfaces.

I'd agree it's important to get the primitives right; but 
I'd argue that from a design point of view it's probably better to keep
both in mind from early on; you need to create a safe interface which
people can actually use most of the time, otherwise you're not getting
any benefit; so yes get the bases right, but just keep a feel for how
they'll get encapsulated so most of the more boring code can be safe.

> This isn't something specific to atomics. Not even remotely. This is
> quite fundamental. You often literally _cannot_ do interesting things
> using only safe interfaces. You want safe memory allocations - but to
> actually write the allocator itself, you want to have all those unsafe
> escape methods - all those raw pointers with arbitrary arithmetic etc.
> 
> And if you don't have unsafe escapes, you end up doing what so many
> languages did: the libraries are written in something more powerful
> like C, because C literally can do things that other languages
> *cannot* do.

Yeh that's fine, I'm not at all arguing against that; but it doesn't
mean you shouldn't keep an eye on how the safe side should look; even in the
kernel.
Get it right and those unsafe escapes shouldn't be needed too commonly;
get it wrong and you'll either have painful abstractions or end up with
unsafes shotgunned all over the place.

> Don't let people fool you with talk about Turing machines and similar
> smoke-and-mirror garbage. It's a bedtime story for first-year CS
> students. It's not true.

My infinitely long tape is still on back order.

Dave

> things. If your language doesn't have those unsafe escapes, your
> language is inherently weaker, and inherently worse for it.
> 
>            Linus
>

comex March 27, 2024, 4:16 p.m. UTC | #40

On Mar 25, 2024, at 8:49 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But you should _start_ the design of your language memory model around
> the unsafe "raw atomic access operations" model.
> 
> Then you can use those strictly more powerful operations, and you
> create an object model *around* it.

To some extent Rust does this already, unlike C++.

C++ allows atomics to be implemented using locks.  Partly for this reason,
`std::atomic<T>` is documented as not necessarily having the same
representation as `T` [1].  C++ also has strict aliasing, so even if those types
do have the same representation, you still can't cast `T *` to
`std::atomic<T> *`.

But Rust atomics are lower-level.  First, they are guaranteed lock-free [2].
Second, they are documented as having "the same in-memory representation as the
underlying" type [3].  (They also usually have the same alignment, except on
x86 where u64 is only 4-byte aligned but AtomicU64 of course needs to be 8-byte
aligned.)  Meanwhile, Rust intentionally lacks strict aliasing.

Combined, this means it's perfectly legal in Rust to cast e.g. `&mut u32` to
`&AtomicU32` and perform atomic accesses on it.  Or the same with u64/AtomicU64
if you know the pointer is validly aligned.  This is by design; the Atomic
types' methods are considered the official way to perform atomic operations on
arbitrary memory, making it unnecessary to also stabilize 'lower-level'
intrinsics.

That said, there *are* currently some holes in Rust's atomics model, based on
the fact that it's mostly inherited from C++.  From the documentation:

> Since C++ does not support mixing atomic and non-atomic accesses, or
> non-synchronized different-sized accesses to the same data, Rust does not
> support those operations either. Note that both of those restrictions only
> apply if the accesses are non-synchronized.
https://doc.rust-lang.org/std/sync/atomic/index.html

There are some open issues around this:

- "How can we allow read-read races between atomic and non-atomic accesses?"
  https://github.com/rust-lang/unsafe-code-guidelines/issues/483

  > [..] I do think we should allow such code. However, then we have to change
  > the way we document our atomics [..]

- "What about: mixed-size atomic accesses"
  https://github.com/rust-lang/unsafe-code-guidelines/issues/345"

  > Apparently the x86 manual says you "should" not do this [..] It is unclear
  > what "should" means (or what anything else here really means, operationally
  > speaking...)

[1] https://eel.is/c++draft/atomics#types.generic.general-3
[2] https://doc.rust-lang.org/std/sync/atomic/index.html#portability
[3] https://doc.rust-lang.org/nightly/std/sync/atomic/struct.AtomicU64.html

Kent Overstreet March 27, 2024, 6:50 p.m. UTC | #41

On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> Meanwhile, Rust intentionally lacks strict aliasing.

I wasn't aware of this. Given that unrestricted pointers are a real
impediment to compiler optimization, I thought that with Rust we were
finally starting to nail down a concrete enough memory model to tackle
this safely. But I guess not?

Linus Torvalds March 27, 2024, 7:07 p.m. UTC | #42

On Wed, 27 Mar 2024 at 11:51, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> > Meanwhile, Rust intentionally lacks strict aliasing.
>
> I wasn't aware of this. Given that unrestricted pointers are a real
> impediment to compiler optimization, I thought that with Rust we were
> finally starting to nail down a concrete enough memory model to tackle
> this safely. But I guess not?

Strict aliasing is a *horrible* mistake.

It's not even *remotely* "tackle this safely". It's the exact
opposite. It's completely broken.

Anybody who thinks strict aliasing is a good idea either

 (a) doesn't understand what it means

 (b) has been brainwashed by incompetent compiler people.

it's a horrendous crock that was introduced by people who thought it
was too complicated to write out "restrict" keywords, and that thought
that "let's break old working programs and make it harder to write new
programs" was a good idea.

Nobody should ever do it. The fact that Rust doesn't do the C strict
aliasing is a good thing. Really.

I suspect you have been fooled by the name. Because "strict aliasing"
sounds like a good thing. It sounds like "I know these strictly can't
alias". But despite that name, it's the complete opposite of that, and
means "I will ignore actual real aliasing even if it exists, because I
will make aliasing decisions on entirely made-up grounds".

Just say no to strict aliasing. Thankfully, there's an actual compiler
flag for that: -fno-strict-aliasing. It should absolutely have been
the default.

                 Linus

Kent Overstreet March 27, 2024, 7:41 p.m. UTC | #43

On Wed, Mar 27, 2024 at 12:07:26PM -0700, Linus Torvalds wrote:
> On Wed, 27 Mar 2024 at 11:51, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> > > Meanwhile, Rust intentionally lacks strict aliasing.
> >
> > I wasn't aware of this. Given that unrestricted pointers are a real
> > impediment to compiler optimization, I thought that with Rust we were
> > finally starting to nail down a concrete enough memory model to tackle
> > this safely. But I guess not?
> 
> Strict aliasing is a *horrible* mistake.
> 
> It's not even *remotely* "tackle this safely". It's the exact
> opposite. It's completely broken.
> 
> Anybody who thinks strict aliasing is a good idea either
> 
>  (a) doesn't understand what it means
> 
>  (b) has been brainwashed by incompetent compiler people.
> 
> it's a horrendous crock that was introduced by people who thought it
> was too complicated to write out "restrict" keywords, and that thought
> that "let's break old working programs and make it harder to write new
> programs" was a good idea.

Strict aliasing is crap in C and C++ because we started out with
unrestricetd pointers, and it just doesn't work in C and C++ with the
realities of the kind of code we have to write, and we never got any
kind of a model that would have made it workable. Never mind trying to
graft that onto existing codebases...

(Restrict was crap too... no scoping, nothing but a single f*cking
keyword? Who ever thought _that_ was going to work?)

_But_: the lack of any aliasing guarantees means that writing through
any pointer can invalidate practically anything, and this is a real
problem. A lot of C programmers have stockholm syndrome when it comes to
this, we end up writing a lot of code in weirdly baroque and artificial
styles to partially work around this when we care about performance -
saving things into locals because at least the _stack_ generally can't
alias to avoid forced reloads, or passing and returning things by
reference instead of by value when that's _not the semantics we want_
because otherwise the compiler is going to do an unnecessary copy -
again, that's fundamentally because of aliasing.

Linus Torvalds March 27, 2024, 8:45 p.m. UTC | #44

On Wed, 27 Mar 2024 at 12:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> _But_: the lack of any aliasing guarantees means that writing through
> any pointer can invalidate practically anything, and this is a real
> problem.

It's actually much less of a problem than you make it out to be.

A lot of common aliasing information is statically visible (offsets
off the same struct pointer etc).

The big problems tend to be

 (a) old in-order hardware that really wants the compiler to schedule
memory operations

 (b) vectorization and HPC

and honestly, (a) is irrelevant, and (b) is where 'restrict' and
actual real vector extensions come in. In fact, the type-based
aliasing often doesn't help (because you have arrays of the same FP
types), and so you really just need to tell the compiler that your
arrays are disjoint.

Yes, yes, possible aliasing means that the compiler won't generate
nice-looking code in many situations and will end up reloading values
from memory etc.

AND NONE OF THAT MATTERS IN REALITY.

Performance issues to a close approximation come from cache misses and
branch mispredicts. The aliasing issue just isn't the horrendous issue
people claim it is. It's most *definitely* not worth the absolute
garbage that is C type-based aliasing.

And yes, I do think it might be nice to have a nicer 'restrict' model,
because yes, I look at the generated asm and I see the silly code
generation too. But describing aliasing sanely in general is just hard
(both for humans _and_ for some sane machine interface), and it's very
very seldom worth the pain.

            Linus

Boqun Feng March 27, 2024, 9:21 p.m. UTC | #45

On Wed, Mar 27, 2024 at 03:41:16PM -0400, Kent Overstreet wrote:
> On Wed, Mar 27, 2024 at 12:07:26PM -0700, Linus Torvalds wrote:
> > On Wed, 27 Mar 2024 at 11:51, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> > > > Meanwhile, Rust intentionally lacks strict aliasing.
> > >
> > > I wasn't aware of this. Given that unrestricted pointers are a real
> > > impediment to compiler optimization, I thought that with Rust we were
> > > finally starting to nail down a concrete enough memory model to tackle
> > > this safely. But I guess not?
> > 
> > Strict aliasing is a *horrible* mistake.
> > 
> > It's not even *remotely* "tackle this safely". It's the exact
> > opposite. It's completely broken.
> > 
> > Anybody who thinks strict aliasing is a good idea either
> > 
> >  (a) doesn't understand what it means
> > 
> >  (b) has been brainwashed by incompetent compiler people.
> > 
> > it's a horrendous crock that was introduced by people who thought it
> > was too complicated to write out "restrict" keywords, and that thought
> > that "let's break old working programs and make it harder to write new
> > programs" was a good idea.
> 
> Strict aliasing is crap in C and C++ because we started out with
> unrestricetd pointers, and it just doesn't work in C and C++ with the
> realities of the kind of code we have to write, and we never got any
> kind of a model that would have made it workable. Never mind trying to
> graft that onto existing codebases...
> 
> (Restrict was crap too... no scoping, nothing but a single f*cking
> keyword? Who ever thought _that_ was going to work?)
> 
> _But_: the lack of any aliasing guarantees means that writing through
> any pointer can invalidate practically anything, and this is a real

I don't know whether I'm 100% correct on this, but Rust has references,
so things like "you have a unique reference to a part of memory, no one
would touch it in the meanwhile" are represented by `&mut`, to get a
`&mut` from a raw pointer, you need unsafe, where programmers can
provide the reasoning of the safety of the accesses. More like "pointers
can alias anyone but references cannot" to me.

Regards,
Boqun

> problem. A lot of C programmers have stockholm syndrome when it comes to
> this, we end up writing a lot of code in weirdly baroque and artificial
> styles to partially work around this when we care about performance -
> saving things into locals because at least the _stack_ generally can't
> alias to avoid forced reloads, or passing and returning things by
> reference instead of by value when that's _not the semantics we want_
> because otherwise the compiler is going to do an unnecessary copy -
> again, that's fundamentally because of aliasing.

Kent Overstreet March 27, 2024, 9:41 p.m. UTC | #46

On Wed, Mar 27, 2024 at 01:45:46PM -0700, Linus Torvalds wrote:
> On Wed, 27 Mar 2024 at 12:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > _But_: the lack of any aliasing guarantees means that writing through
> > any pointer can invalidate practically anything, and this is a real
> > problem.
> 
> It's actually much less of a problem than you make it out to be.
> 
> A lot of common aliasing information is statically visible (offsets
> off the same struct pointer etc).
> 
> The big problems tend to be
> 
>  (a) old in-order hardware that really wants the compiler to schedule
> memory operations
> 
>  (b) vectorization and HPC

Yeah, I was being a bit dramatic given the current state of the world -
OOO machines _mostly_ do a reasonable job of hiding the negative effects of
this (although not always, when memory barriers are involved).

But I think this is going to become important again in the future, for a
couple reasons.

On the hardware end, the Mill guys were pointing out years ago that
register renaming is a big power bottleneck in modern processors; the
actual functional units consume practically nothing compared to just
moving data around. They never built anything, but there's at least one
new startup I know of that's continuing their work. Yes, VLIW has been a
bust repeatedly, but as we keep going wider and wider (because what else
are we going to do?) I think it's going to happen eventually.

On the compiler side, I'm taking the long view: it's really not just
about a few reloads here and there, it's that you fundamentally can't do
that much with code that's mutating arbitrary state; analysis that
optimizations depend on is impossible. In the compiler you really want
to be working with code that's pure functional - then all your
optimizations are just algebra. And internally, compilers do this as
much as they can (SSA form...), but - unrestricted pointers really put a
limit on what they can do.

The beautiful thing about Rust to me is that we finally might have a
path out of this; the rules of Rust references constrain mutability in
such a way that's almost as good as being purely functional, yet it's
something we can actually do systems programming with.

I think this is going to lead to substantially better code optimization
in the future, and based on a much more sound model that means we won't
constantly be fighting with compiler people because they're doing shady
shit and breaking previously working code. Based on my experiences with
writing iterator heavy code in both C and Rust, I think it already is.

Kent Overstreet March 27, 2024, 9:49 p.m. UTC | #47

On Wed, Mar 27, 2024 at 02:21:03PM -0700, Boqun Feng wrote:
> On Wed, Mar 27, 2024 at 03:41:16PM -0400, Kent Overstreet wrote:
> > On Wed, Mar 27, 2024 at 12:07:26PM -0700, Linus Torvalds wrote:
> > > On Wed, 27 Mar 2024 at 11:51, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> > > > > Meanwhile, Rust intentionally lacks strict aliasing.
> > > >
> > > > I wasn't aware of this. Given that unrestricted pointers are a real
> > > > impediment to compiler optimization, I thought that with Rust we were
> > > > finally starting to nail down a concrete enough memory model to tackle
> > > > this safely. But I guess not?
> > > 
> > > Strict aliasing is a *horrible* mistake.
> > > 
> > > It's not even *remotely* "tackle this safely". It's the exact
> > > opposite. It's completely broken.
> > > 
> > > Anybody who thinks strict aliasing is a good idea either
> > > 
> > >  (a) doesn't understand what it means
> > > 
> > >  (b) has been brainwashed by incompetent compiler people.
> > > 
> > > it's a horrendous crock that was introduced by people who thought it
> > > was too complicated to write out "restrict" keywords, and that thought
> > > that "let's break old working programs and make it harder to write new
> > > programs" was a good idea.
> > 
> > Strict aliasing is crap in C and C++ because we started out with
> > unrestricetd pointers, and it just doesn't work in C and C++ with the
> > realities of the kind of code we have to write, and we never got any
> > kind of a model that would have made it workable. Never mind trying to
> > graft that onto existing codebases...
> > 
> > (Restrict was crap too... no scoping, nothing but a single f*cking
> > keyword? Who ever thought _that_ was going to work?)
> > 
> > _But_: the lack of any aliasing guarantees means that writing through
> > any pointer can invalidate practically anything, and this is a real
> 
> I don't know whether I'm 100% correct on this, but Rust has references,
> so things like "you have a unique reference to a part of memory, no one
> would touch it in the meanwhile" are represented by `&mut`, to get a
> `&mut` from a raw pointer, you need unsafe, where programmers can
> provide the reasoning of the safety of the accesses. More like "pointers
> can alias anyone but references cannot" to me.

That's not really a workable rule because in practice every data
structure has unsafe Rust underneath. Strict aliasing would mean that
unsafe Rust very much has to follow the aliasing rules too.

comex March 27, 2024, 9:56 p.m. UTC | #48

> On Mar 27, 2024, at 2:21 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> I don't know whether I'm 100% correct on this, but Rust has references,
> so things like "you have a unique reference to a part of memory, no one
> would touch it in the meanwhile" are represented by `&mut`, to get a
> `&mut` from a raw pointer, you need unsafe, where programmers can
> provide the reasoning of the safety of the accesses. More like "pointers
> can alias anyone but references cannot" to me.

Right.  When I said “strict aliasing” I meant type-based aliasing rules, which is what GCC calls “strict aliasing".  But Rust does have stricter aliasing rules than C in a different way.  Both mutable and immutable references are annotated with LLVM `noalias` by default, equivalent to C `restrict`. For mutable references it’s justified because those references should be unique.  For immutable references it's justified because the memory behind the reference shouldn’t be mutated at all.  (There’s an exception for types with ‘interior mutability’, where ‘immutable' references actually can be used for mutation.)

The hope has always been that this gives Rust better overall optimizability than C or C++ and makes up for the losses from the lack of type-based aliasing rules.  If there’s any empirical data to justify or refute this, I’m not aware of it.  But that’s the hope, and by this point Rust is committed to the approach.

(Why only function parameters?  Mainly because of limitations of what LLVM IR can express.  From the perspective of the work-in-progress Rust memory model spec, function parameters are special in *some* ways, but many of the optimizations could apply to all uses of references.  That's just not currently implemented.)

comex March 27, 2024, 10:02 p.m. UTC | #49

> On Mar 27, 2024, at 2:56 PM, comex <comexk@gmail.com> wrote:
> 
> Right.  When I said “strict aliasing” I meant type-based aliasing rules, which is what GCC calls “strict aliasing".  But Rust does have stricter aliasing rules than C in a different way.  Both mutable and immutable references are annotated with LLVM `noalias` by default, equivalent to C `restrict`.

…oops, this should say “reference-typed function parameters”.

> On Mar 27, 2024, at 2:49 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> That's not really a workable rule because in practice every data
> structure has unsafe Rust underneath. Strict aliasing would mean that
> unsafe Rust very much has to follow the aliasing rules too.

There have indeed been a lot of issues where some innocent-seeming piece of unsafe Rust turns out to violate the reference aliasing rules.  Miri helps (it’s a tool that can detect violations at runtime), and there have been attempts to loosen the rules where possible.  But it is definitely a case where Rust’s UB rules are more subtle than one would like.  At least it only applies to unsafe code.

Boqun Feng March 27, 2024, 10:26 p.m. UTC | #50

On Wed, Mar 27, 2024 at 05:49:41PM -0400, Kent Overstreet wrote:
[...]
> > > Strict aliasing is crap in C and C++ because we started out with
> > > unrestricetd pointers, and it just doesn't work in C and C++ with the
> > > realities of the kind of code we have to write, and we never got any
> > > kind of a model that would have made it workable. Never mind trying to
> > > graft that onto existing codebases...
> > > 
> > > (Restrict was crap too... no scoping, nothing but a single f*cking
> > > keyword? Who ever thought _that_ was going to work?)
> > > 
> > > _But_: the lack of any aliasing guarantees means that writing through
> > > any pointer can invalidate practically anything, and this is a real
> > 
> > I don't know whether I'm 100% correct on this, but Rust has references,
> > so things like "you have a unique reference to a part of memory, no one
> > would touch it in the meanwhile" are represented by `&mut`, to get a
> > `&mut` from a raw pointer, you need unsafe, where programmers can
> > provide the reasoning of the safety of the accesses. More like "pointers
> > can alias anyone but references cannot" to me.
> 
> That's not really a workable rule because in practice every data
> structure has unsafe Rust underneath. Strict aliasing would mean that

I don't follow, a plain data structure like:

	struct Point { x: i64, y: i64 }

doesn't have any unsafe Rust underneath I think.

> unsafe Rust very much has to follow the aliasing rules too.
> 

The point I was trying to say, the aliasing rule on Rust raw pointers is 
relatively relaxed compared to strict aliasing in C since Rust has
references which should have more accurate informatio on aliasing.

(but not a language expert).

Regards,
Boqun

Linus Torvalds March 27, 2024, 10:57 p.m. UTC | #51

On Wed, 27 Mar 2024 at 14:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
>
> On the hardware end, the Mill guys were pointing out years ago that
> register renaming is a big power bottleneck in modern processors;

LOL.

The Mill guys took the arguments from the Itanium people, and turned
the crazy up to 11, with "the belt" and seemingly trying to do a
dataflow machine but not worrying over-much about memory accesses etc.

The whole "we'll deal with it in the compiler" is crazy talk.

In other words, I'll believe it when I see it. And I doubt we'll ever see it.

               Linus

Kent Overstreet March 27, 2024, 11:35 p.m. UTC | #52

On Wed, Mar 27, 2024 at 03:57:12PM -0700, Linus Torvalds wrote:
> On Wed, 27 Mar 2024 at 14:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> >
> > On the hardware end, the Mill guys were pointing out years ago that
> > register renaming is a big power bottleneck in modern processors;
> 
> LOL.
> 
> The Mill guys took the arguments from the Itanium people, and turned
> the crazy up to 11, with "the belt" and seemingly trying to do a
> dataflow machine but not worrying over-much about memory accesses etc.
> 
> The whole "we'll deal with it in the compiler" is crazy talk.

And Itanium did way better on Fortran, but all those spills and reloads
due to aliasing that an OoO processor can hide are death when you're
statically scheduled.

Unrestricted pointers are fundamentally a real barrier to improved ILP.
It's not _all_ branches and cache effects, though I will grant you that
cache effects do dominate.

But I think there's hope for improvement on that one too. A _lot_ of
kernel code defaults to lists instead of vectors even though we know
that vectors are better for performance - beacuse people are afraid of
memory allocations and error paths. Rust makes it way harder to fuck up
your error paths, and also makes it safer to refactor to improve your
data structures.

Philipp Stanner April 5, 2024, 5:13 p.m. UTC | #53

On Wed, 2024-03-27 at 12:07 -0700, Linus Torvalds wrote:
> On Wed, 27 Mar 2024 at 11:51, Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> > 
> > On Wed, Mar 27, 2024 at 09:16:09AM -0700, comex wrote:
> > > Meanwhile, Rust intentionally lacks strict aliasing.
> > 
> > I wasn't aware of this. Given that unrestricted pointers are a real
> > impediment to compiler optimization, I thought that with Rust we
> > were
> > finally starting to nail down a concrete enough memory model to
> > tackle
> > this safely. But I guess not?
> 
> Strict aliasing is a *horrible* mistake.
> 
> It's not even *remotely* "tackle this safely". It's the exact
> opposite. It's completely broken.
> 
> Anybody who thinks strict aliasing is a good idea either
> 
>  (a) doesn't understand what it means
> 
>  (b) has been brainwashed by incompetent compiler people.
> 
> it's a horrendous crock that was introduced by people who thought it
> was too complicated to write out "restrict" keywords, and that
> thought
> that "let's break old working programs and make it harder to write
> new
> programs" was a good idea.
> 
> Nobody should ever do it. The fact that Rust doesn't do the C strict
> aliasing is a good thing. Really.

Btw, for the interested, that's a nice article on strict aliasing:
https://blog.regehr.org/archives/1307

Dennis Ritchie, the Man Himself, back in the 1980s pushed back quite
strongly on (different?) aliasing experiments:
https://www.yodaiken.com/2021/03/19/dennis-ritchie-on-alias-analysis-in-the-c-programming-language-1988/


No idea why they can't just leave C alone... It's not without reason
that new languages like Zig and Hare want to freeze the language
(standard) once they are released.

P.

> 
> I suspect you have been fooled by the name. Because "strict aliasing"
> sounds like a good thing. It sounds like "I know these strictly can't
> alias". But despite that name, it's the complete opposite of that,
> and
> means "I will ignore actual real aliasing even if it exists, because
> I
> will make aliasing decisions on entirely made-up grounds".
> 
> Just say no to strict aliasing. Thankfully, there's an actual
> compiler
> flag for that: -fno-strict-aliasing. It should absolutely have been
> the default.
> 
>                  Linus
>

Matthew Wilcox April 8, 2024, 4:02 p.m. UTC | #54

On Mon, Mar 25, 2024 at 10:44:43AM -0700, Linus Torvalds wrote:
> So I actually think most compiler people are perfectly fine with the
> kernel model of mostly doing 'volatile' not on the data structures
> themselves, but as accesses through casts.
> 
> It's very traditional C, and there's actually nothing particularly odd
> about it. Not even from a compiler standpoint.
> 
> In fact, I personally will argue that it is fundamentally wrong to
> think that the underlying data has to be volatile. A variable may be
> entirely stable in some cases (ie locks held), but not in others.
> 
> So it's not the *variable* (aka "object") that is 'volatile', it's the
> *context* that makes a particular access volatile.
> 
> That explains why the kernel has basically zero actual volatile
> objects, and 99% of all volatile accesses are done through accessor
> functions that use a cast to mark a particular access volatile.

What annoys me is that 'volatile' accesses have (at least) two distinct
meanings:
 - Make this access untorn
 - Prevent various optimisations (code motion,
   common-subexpression-elimination, ...)

As an example, folio_migrate_flags() (in mm/migrate.c):

        if (folio_test_error(folio))
                folio_set_error(newfolio);
        if (folio_test_referenced(folio))
                folio_set_referenced(newfolio);
        if (folio_test_uptodate(folio))
                folio_mark_uptodate(newfolio);

... which becomes...

      1f:       f6 c4 04                test   $0x4,%ah
      22:       74 05                   je     29 <folio_migrate_flags+0x19>
      24:       f0 80 4f 01 04          lock orb $0x4,0x1(%rdi)
      29:       48 8b 03                mov    (%rbx),%rax
      2c:       a8 04                   test   $0x4,%al
      2e:       74 05                   je     35 <folio_migrate_flags+0x25>
      30:       f0 80 4d 00 04          lock orb $0x4,0x0(%rbp)
      35:       48 8b 03                mov    (%rbx),%rax
      38:       a8 08                   test   $0x8,%al
      3a:       74 05                   je     41 <folio_migrate_flags+0x31>
      3c:       f0 80 4d 00 08          lock orb $0x8,0x0(%rbp)

In my ideal world, the compiler would turn this into:

	newfolio->flags |= folio->flags & MIGRATE_MASK;

but because folio_test_foo() and folio_set_foo() contain all manner of
volatile casts, the compiler is forced to do individual tests and sets.

Part of that is us being dumb; folio_set_foo() should be __folio_set_foo()
because this folio is newly allocated and nobody else can be messing
with its flags word yet.  I failed to spot that at the time I was doing
the conversion from SetPageFoo to folio_set_foo.

But if the compiler people could give us something a little more
granular than "scary volatile access disable everything", that would
be nice.  Also hard, because now you have to figure out what this new
thing interacts with and when is it safe to do what.

Paul E. McKenney April 8, 2024, 4:55 p.m. UTC | #55

On Mon, Apr 08, 2024 at 05:02:37PM +0100, Matthew Wilcox wrote:
> On Mon, Mar 25, 2024 at 10:44:43AM -0700, Linus Torvalds wrote:
> > So I actually think most compiler people are perfectly fine with the
> > kernel model of mostly doing 'volatile' not on the data structures
> > themselves, but as accesses through casts.
> > 
> > It's very traditional C, and there's actually nothing particularly odd
> > about it. Not even from a compiler standpoint.
> > 
> > In fact, I personally will argue that it is fundamentally wrong to
> > think that the underlying data has to be volatile. A variable may be
> > entirely stable in some cases (ie locks held), but not in others.
> > 
> > So it's not the *variable* (aka "object") that is 'volatile', it's the
> > *context* that makes a particular access volatile.
> > 
> > That explains why the kernel has basically zero actual volatile
> > objects, and 99% of all volatile accesses are done through accessor
> > functions that use a cast to mark a particular access volatile.
> 
> What annoys me is that 'volatile' accesses have (at least) two distinct
> meanings:
>  - Make this access untorn
>  - Prevent various optimisations (code motion,
>    common-subexpression-elimination, ...)
> 
> As an example, folio_migrate_flags() (in mm/migrate.c):
> 
>         if (folio_test_error(folio))
>                 folio_set_error(newfolio);
>         if (folio_test_referenced(folio))
>                 folio_set_referenced(newfolio);
>         if (folio_test_uptodate(folio))
>                 folio_mark_uptodate(newfolio);
> 
> ... which becomes...
> 
>       1f:       f6 c4 04                test   $0x4,%ah
>       22:       74 05                   je     29 <folio_migrate_flags+0x19>
>       24:       f0 80 4f 01 04          lock orb $0x4,0x1(%rdi)
>       29:       48 8b 03                mov    (%rbx),%rax
>       2c:       a8 04                   test   $0x4,%al
>       2e:       74 05                   je     35 <folio_migrate_flags+0x25>
>       30:       f0 80 4d 00 04          lock orb $0x4,0x0(%rbp)
>       35:       48 8b 03                mov    (%rbx),%rax
>       38:       a8 08                   test   $0x8,%al
>       3a:       74 05                   je     41 <folio_migrate_flags+0x31>
>       3c:       f0 80 4d 00 08          lock orb $0x8,0x0(%rbp)
> 
> In my ideal world, the compiler would turn this into:
> 
> 	newfolio->flags |= folio->flags & MIGRATE_MASK;
> 
> but because folio_test_foo() and folio_set_foo() contain all manner of
> volatile casts, the compiler is forced to do individual tests and sets.
> 
> Part of that is us being dumb; folio_set_foo() should be __folio_set_foo()
> because this folio is newly allocated and nobody else can be messing
> with its flags word yet.  I failed to spot that at the time I was doing
> the conversion from SetPageFoo to folio_set_foo.
> 
> But if the compiler people could give us something a little more
> granular than "scary volatile access disable everything", that would
> be nice.  Also hard, because now you have to figure out what this new
> thing interacts with and when is it safe to do what.

OK, I will bite...

Why not accumulate the changes in a mask, and then apply the mask the
one time?  (In situations where __folio_set_foo() need not apply.)

If it turns out that we really do need a not-quite-volatile, what exactly
does it do?  You clearly want it to be able to be optimized so as to merge
similar accesses.  Is there a limit to the number of accesses that can
be merged or to the region of code over which such merging is permitted?
Either way, how is the compiler informed of these limits?

(I admit that I am not crazy about this sort of proposal, but that might
have something to do with the difficulty of repeatedly convincing
people that volatile is necessary and must be retained...)

							Thanx, Paul

Linus Torvalds April 8, 2024, 5:01 p.m. UTC | #56

On Mon, 8 Apr 2024 at 09:02, Matthew Wilcox <willy@infradead.org> wrote:
>
> What annoys me is that 'volatile' accesses have (at least) two distinct
> meanings:
>  - Make this access untorn
>  - Prevent various optimisations (code motion,
>    common-subexpression-elimination, ...)

Oh, I'm not at all trying to say that "volatile" is great.

My argument was that the C (and C++, and Rust) model of attaching
memory ordering to objects is actively bad. and limiting.

Because the whole "the access rules are context-dependent" is really
fundamental. Anybody who designs an atomic model around the object is
simply not doing it right.

Now, the "volatile" rules actually make sense in a historical
"hardware access" context. So I do not think "volatile" is great, but
I also don't think K&R were incompetent. "volatile" makes perfect
sense in the historical setting of "direct hardware access".

It just so happens that there weren't other tools, so then you end up
using "volatile" for cached memory too when you want to get "access
once" semantics, and then it isn't great.

And then you have *too* many tools on the standards bodies, and they
don't understand atomics, and don't understand volatile, and they have
been told that "volatile" isn't great for atomics because it doesn't
have memory ordering semantics, but do not understand the actual
problem space.

So those people - who in some cases spent decades arguing about (and
sometimes against) "volatile" think that despite all the problems, the
solution for atomics is to make the *same* mistake, and tie it to the
data and the type system, not the action.

Which is honestly just plain *stupid*. What made sense for 'volatile'
in a historical setting, absolutely does not make sense for atomics.

> As an example, folio_migrate_flags() (in mm/migrate.c):
>
>         if (folio_test_error(folio))
>                 folio_set_error(newfolio);
>         if (folio_test_referenced(folio))
>                 folio_set_referenced(newfolio);
>         if (folio_test_uptodate(folio))
>                 folio_mark_uptodate(newfolio);
>
> ... which becomes...

[ individual load and store code generation removed ]

> In my ideal world, the compiler would turn this into:
>
>         newfolio->flags |= folio->flags & MIGRATE_MASK;

Well, honestly, we should just write the code that way, and not expect
too much of the compiler.

We don't currently have a "generic atomic or" operation, but we
probably should have one.

For our own historical reasons, while we have a few generic atomic
operations: bit operations, cmpxchg, etc, most of our arithmetic and
logical ops all rely on a special "atomic_t" type (later extended with
"atomic_long_t").

The reason? The garbage that is legacy Sparc atomics.

Sparc historically basically didn't have any atomics outside of the
'test and set byte' one, so if you wanted an atomic counter thing, and
you cared about sparc, you had to play games with "some bits of the
counter are the atomic byte lock".

And we do not care about that Sparc horror any *more*, but we used to.

End result: instead of having "do atomic ops on a normal type" - which
would be a lot more powerful - we have this model of "do atomic ops on
atomic_t".

We could fix that now. Instead of having architectures define

   arch_atomic_or(int i, atomic_t *v)

operations, we could - and should - make the 'arch' atomics be

   arch_atomic_or(int i, unsigned int *v)

and then we'd still keep the "atomic_t" around for type safety
reasons, but code that just wants to act on an "int" (or a "long")
atomically could just do so.

But in your case, I don't think you actually need it:

> Part of that is us being dumb; folio_set_foo() should be __folio_set_foo()
> because this folio is newly allocated and nobody else can be messing
> with its flags word yet.  I failed to spot that at the time I was doing
> the conversion from SetPageFoo to folio_set_foo.

This is part of my "context matters" rant and why I do *not* think
atomics should be tied to the object, but to the operation.

The compiler generally doesn't know the context rules (insert "some
languages do know in some cases" here), which is why we as programmers
should just use different operations when we do.

In this case, since it's a new folio that hasn't been exposed to
anybody, you should just have done exactly that kind of

    newfolio->flags |= folio->flags & MIGRATE_MASK;

which we already do in the page initialization code when we know we
own the flags (set_page_zone, set_page_zone, set_page_section).

We've generally avoided doing this in general, though - even the buddy
allocator seldom does it. The only case of manual "I know I own the
flags" I know if (apart from the initialization itself) is

        ->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
     ...
        ->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

kinds of things at free/alloc time.

> But if the compiler people could give us something a little more
> granular than "scary volatile access disable everything", that would
> be nice.  Also hard, because now you have to figure out what this new
> thing interacts with and when is it safe to do what.

I think it would be lovely to have some kind of "atomic access"
operations that the compiler could still combine when it can see that
"this is invisible at a cache access level".

But as things are now, we do have most of those in the kernel, and
what you ask for can either be done today, or could be done (like that
"arch_atomic_or()") with a bit of re-org.

                       Linus

Matthew Wilcox April 8, 2024, 5:03 p.m. UTC | #57

On Mon, Apr 08, 2024 at 09:55:23AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 05:02:37PM +0100, Matthew Wilcox wrote:
> > In my ideal world, the compiler would turn this into:
> > 
> > 	newfolio->flags |= folio->flags & MIGRATE_MASK;
> 
> Why not accumulate the changes in a mask, and then apply the mask the
> one time?  (In situations where __folio_set_foo() need not apply.)

Yes, absolutely, we can, should and probably eventually will do this
when it gets to the top of somebody's todo list.  But it irks me that
we can't tell the compiler this is a safe transformation for it to make.
There are a number of places where similar things happen.

$ git grep folio_test.*folio_test

will find you 82 of them (where they happen to be on the same line)

                if (folio_test_dirty(folio) || folio_test_locked(folio) ||
                                folio_test_writeback(folio))
                        break;

turns into:

    1f41:       48 8b 29                mov    (%rcx),%rbp
    1f44:       48 c1 ed 04             shr    $0x4,%rbp
    1f48:       83 e5 01                and    $0x1,%ebp
    1f4b:       0f 85 d5 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>
    1f51:       48 8b 29                mov    (%rcx),%rbp
    1f54:       83 e5 01                and    $0x1,%ebp
    1f57:       0f 85 c9 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>
    1f5d:       48 8b 29                mov    (%rcx),%rbp
    1f60:       48 d1 ed                shr    $1,%rbp
    1f63:       83 e5 01                and    $0x1,%ebp
    1f66:       0f 85 ba 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>

rather than _one_ load from rcx and a test against a mask.

> If it turns out that we really do need a not-quite-volatile, what exactly
> does it do?  You clearly want it to be able to be optimized so as to merge
> similar accesses.  Is there a limit to the number of accesses that can
> be merged or to the region of code over which such merging is permitted?
> Either way, how is the compiler informed of these limits?

Right, like I said, it's not going to be easy to define exactly what we
want.

Al Viro April 8, 2024, 6:14 p.m. UTC | #58

On Mon, Apr 08, 2024 at 10:01:32AM -0700, Linus Torvalds wrote:

> For our own historical reasons, while we have a few generic atomic
> operations: bit operations, cmpxchg, etc, most of our arithmetic and
> logical ops all rely on a special "atomic_t" type (later extended with
> "atomic_long_t").
> 
> The reason? The garbage that is legacy Sparc atomics.
> 
> Sparc historically basically didn't have any atomics outside of the
> 'test and set byte' one, so if you wanted an atomic counter thing, and
> you cared about sparc, you had to play games with "some bits of the
> counter are the atomic byte lock".
> 
> And we do not care about that Sparc horror any *more*, but we used to.

FWIW, PA-RISC is no better - the same "fetch and replace with constant"
kind of primitive as for sparc32, only the constant is (u32)0 instead
of (u8)~0.  And unlike sparc64, 64bit variant didn't get better.

Paul E. McKenney April 8, 2024, 6:47 p.m. UTC | #59

On Mon, Apr 08, 2024 at 06:03:11PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 08, 2024 at 09:55:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 05:02:37PM +0100, Matthew Wilcox wrote:
> > > In my ideal world, the compiler would turn this into:
> > > 
> > > 	newfolio->flags |= folio->flags & MIGRATE_MASK;
> > 
> > Why not accumulate the changes in a mask, and then apply the mask the
> > one time?  (In situations where __folio_set_foo() need not apply.)
> 
> Yes, absolutely, we can, should and probably eventually will do this
> when it gets to the top of somebody's todo list.  But it irks me that
> we can't tell the compiler this is a safe transformation for it to make.
> There are a number of places where similar things happen.
> 
> $ git grep folio_test.*folio_test
> 
> will find you 82 of them (where they happen to be on the same line)
> 
>                 if (folio_test_dirty(folio) || folio_test_locked(folio) ||
>                                 folio_test_writeback(folio))
>                         break;
> 
> turns into:
> 
>     1f41:       48 8b 29                mov    (%rcx),%rbp
>     1f44:       48 c1 ed 04             shr    $0x4,%rbp
>     1f48:       83 e5 01                and    $0x1,%ebp
>     1f4b:       0f 85 d5 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>
>     1f51:       48 8b 29                mov    (%rcx),%rbp
>     1f54:       83 e5 01                and    $0x1,%ebp
>     1f57:       0f 85 c9 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>
>     1f5d:       48 8b 29                mov    (%rcx),%rbp
>     1f60:       48 d1 ed                shr    $1,%rbp
>     1f63:       83 e5 01                and    $0x1,%ebp
>     1f66:       0f 85 ba 00 00 00       jne    2026 <filemap_range_has_writeback+0x1a6>
> 
> rather than _one_ load from rcx and a test against a mask.

Agreed, it would be nice if we could convince the compiler to do this
for us, preferably without breaking anything.

> > If it turns out that we really do need a not-quite-volatile, what exactly
> > does it do?  You clearly want it to be able to be optimized so as to merge
> > similar accesses.  Is there a limit to the number of accesses that can
> > be merged or to the region of code over which such merging is permitted?
> > Either way, how is the compiler informed of these limits?
> 
> Right, like I said, it's not going to be easy to define exactly what we
> want.

Or to convince the usual suspects that any definition we might come up
with is useful/implementable/teacheable/...  :-/

							Thanx, Paul

Linus Torvalds April 8, 2024, 8:05 p.m. UTC | #60

On Mon, 8 Apr 2024 at 11:14, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, PA-RISC is no better - the same "fetch and replace with constant"
> kind of primitive as for sparc32, only the constant is (u32)0 instead
> of (u8)~0.  And unlike sparc64, 64bit variant didn't get better.

Heh. The thing about PA-RISC is that it is actually *so* much worse
that it was never useful for an arithmetic type.

IOW, the fact that sparc used just a byte meant that the aotmic_t
hackery on sparc still gave us 24 useful bits in a 32-bit atomic_t.

So long ago, we used to have an arithmetic atomic_t that was 32-bit on
all sane architectures, but only had a 24-bit range on sparc.

And I know you know all this, I'm just explaining the horror for the audience.

On PA-RISC you couldn't do that horrendous trick, so parist just used
the "we use a hashed spinlock for all atomics", and "atomic_t" was a
regular full-sized integer type.

Anyway, the sparc 24-bit atomics were actually replaced by the PA-RISC
version back twenty years ago (almost to the day):

   https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=373f1583c5c5

and while we still had some left-over of that horror in the git tree
up until 2011 (until commit 348738afe530: "sparc32: drop unused
atomic24 support") we probably should have made the
"arch_atomic_xyz()" ops work on generic types rather than "atomic_t"
for a long long time, so that you could use them on other things than
"atomic_t" and friends.

You can see the casting horror here, for example:

   include/asm-generic/bitops/atomic.h

where we do that cast from "volatile unsigned long *p" to
"atomic_long_t *" just to use the raw_atomic_long_xyz() operations.

It would make more sense if the raw atomics took that "native"
volatile unsigned long pointer directly.

(And here that "volatile" is not because it's necessary used as a
volatile - it is - but simply because it's the most permissive type of
pointer. You can see other places using "const volatile unsigned long"
pointers for the same reason: passing in a non-const or non-volatile
pointer is perfectly fine).

              Linus

Kent Overstreet April 9, 2024, 12:58 a.m. UTC | #61

On Mon, Apr 08, 2024 at 06:03:11PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 08, 2024 at 09:55:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 05:02:37PM +0100, Matthew Wilcox wrote:
> > > In my ideal world, the compiler would turn this into:
> > > 
> > > 	newfolio->flags |= folio->flags & MIGRATE_MASK;
> > 
> > Why not accumulate the changes in a mask, and then apply the mask the
> > one time?  (In situations where __folio_set_foo() need not apply.)
> 
> But it irks me that we can't tell the compiler this is a safe
> transformation for it to make. There are a number of places where
> similar things happen.

Same thing comes up with bignum code - you really want to be able to
tell the compiler "you can apply x/y/z optimizations for these
functions", e.g. replace add(mul(a, b), c) with fma(a, b, c).

Compiler optimizations are just algebraic transformations, we just need
a way to tell the compiler what the algebraic properties of our
functions are.

Paul E. McKenney April 9, 2024, 4:47 a.m. UTC | #62

On Mon, Apr 08, 2024 at 08:58:18PM -0400, Kent Overstreet wrote:
> On Mon, Apr 08, 2024 at 06:03:11PM +0100, Matthew Wilcox wrote:
> > On Mon, Apr 08, 2024 at 09:55:23AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 08, 2024 at 05:02:37PM +0100, Matthew Wilcox wrote:
> > > > In my ideal world, the compiler would turn this into:
> > > > 
> > > > 	newfolio->flags |= folio->flags & MIGRATE_MASK;
> > > 
> > > Why not accumulate the changes in a mask, and then apply the mask the
> > > one time?  (In situations where __folio_set_foo() need not apply.)
> > 
> > But it irks me that we can't tell the compiler this is a safe
> > transformation for it to make. There are a number of places where
> > similar things happen.
> 
> Same thing comes up with bignum code - you really want to be able to
> tell the compiler "you can apply x/y/z optimizations for these
> functions", e.g. replace add(mul(a, b), c) with fma(a, b, c).
> 
> Compiler optimizations are just algebraic transformations, we just need
> a way to tell the compiler what the algebraic properties of our
> functions are.

That might indeed be more straightforward than doing this on a per-type
basis.  But the C++ guys would likely just start shouting "template
metaprogramming!!!"  ;-)

							Thanx, Paul

Peter Zijlstra April 9, 2024, 10:50 a.m. UTC | #63

On Mon, Mar 25, 2024 at 01:59:55PM -0700, Boqun Feng wrote:
> On Mon, Mar 25, 2024 at 10:44:45AM +0000, Mark Rutland wrote:
> [...]
> > > 
> > > * I choose to re-implement atomics in Rust `asm` because we are still
> > >   figuring out how we can make it easy and maintainable for Rust to call
> > >   a C function _inlinely_ (Gary makes some progress [2]). Otherwise,
> > >   atomic primitives would be function calls, and that can be performance
> > >   bottleneck in a few cases.
> > 
> > I don't think we want to maintain two copies of each architecture's atomics.
> > This gets painful very quickly (e.g. as arm64's atomics get patched between
> > LL/SC and LSE forms).
> > 
> 
> No argument here ;-)

Didn't we talk about bindgen being able to convert inline C functions
into equivalent inline Rust functions? ISTR that getting stuck on Rust
not having a useful inline asm.

But fixing all that in a hurry seems like the much saner path forward.

Boqun Feng April 16, 2024, 6:12 p.m. UTC | #64

On Tue, Apr 09, 2024 at 12:50:15PM +0200, Peter Zijlstra wrote:
> On Mon, Mar 25, 2024 at 01:59:55PM -0700, Boqun Feng wrote:
> > On Mon, Mar 25, 2024 at 10:44:45AM +0000, Mark Rutland wrote:
> > [...]
> > > > 
> > > > * I choose to re-implement atomics in Rust `asm` because we are still
> > > >   figuring out how we can make it easy and maintainable for Rust to call
> > > >   a C function _inlinely_ (Gary makes some progress [2]). Otherwise,
> > > >   atomic primitives would be function calls, and that can be performance
> > > >   bottleneck in a few cases.
> > > 
> > > I don't think we want to maintain two copies of each architecture's atomics.
> > > This gets painful very quickly (e.g. as arm64's atomics get patched between
> > > LL/SC and LSE forms).
> > > 
> > 
> > No argument here ;-)
> 
> Didn't we talk about bindgen being able to convert inline C functions
> into equivalent inline Rust functions? ISTR that getting stuck on Rust

Yes, we did.

> not having a useful inline asm.
> 

Mostly two features were missing: 1) asm goto and 2) memory operands,
#1 gets implemented[1] by Gary, and should be available in Rust 1.78
(plan to release at May 2, 2024); For #2, my understanding is that
arch-specific effort is needed (since different architectures may have
different contraints on memory operands), I haven't yet found anyone is
working on this.

(background explanation for broader audience: in GCC's inline asm, you
can specify an memory location, other than a register location, as an
input or output of an asm block's operand[2], but current Rust inline
asm doesn't provide this functionality, by default, without option
"pure", "nomem", etc, every asm block in Rust can be thought as a C asm
block with "memory" clobber)

That being said, if you look at the link I shared or this gist from
Gary:

	https://gist.github.com/nbdd0121/d4bf7dd7f9b6d6b50fa18b1092f45a3c

there is another way (yeah, we probably also have discussed this
previously), basically what it does is compiling the functions in a C
file as LLVM IR, so that Rust can call these functions at LLVM IR level.
This in theory is doing some local LTO, and I've tested that it works
for asm blocks. We still need to tweak our build system to make this
work, but should it work, it would mean that Rust can call a C function
in a pretty efficient way.

> But fixing all that in a hurry seems like the much saner path forward.

So a sane plan to me is wiring our atomics into Rust functions via what
Mark has (i.e. starting off as FFI calls), then we can switch to the
"local LTO" approach when it's ready. In case that "local LTO" needs
more work and we do have performance need, we can always either 1)
manually implement some primitives in Rust asm, or 2) look into how
bindgen or other tools can translate simple C functions (asm blocks)
into Rust.

Regards,
Boqun

[1]: https://github.com/rust-lang/rust/pull/119365
[2]: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#index-m-in-constraint