
[v5,11/23] xen/riscv: introduce cmpxchg.h

Message ID ce7604de39b3480553eeaeafc11138494016983f.1708962629.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Series Enable build of full Xen for RISC-V

Commit Message

Oleksii Kurochko Feb. 26, 2024, 5:38 p.m. UTC
The header was taken from Linux kernel 6.4.0-rc1.

Additionally, the following updates were made:
* add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
  access (a sketch of the idea follows this list).
* replace tabs with spaces
* replace __* variables with *__
* introduce generic versions of xchg_* and cmpxchg_*.
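
As an illustration of the 1/2-byte emulation approach (not the patch code
itself), here is a minimal C sketch; the helper name is made up, and a
compiler builtin stands in for the actual inline-asm lr.w/sc.w loop:

```
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: exchange a 1- or 2-byte value by operating atomically on the
 * aligned 32-bit word containing it, selecting the sub-word via shift/mask.
 */
static unsigned int xchg_small_sketch(void *ptr, unsigned int new, size_t size)
{
    uint32_t *aligned = (uint32_t *)((uintptr_t)ptr & ~(0x4 - size));
    unsigned int pos = ((uintptr_t)ptr & (0x4 - size)) * 8; /* BITS_PER_BYTE */
    uint32_t mask = ((1U << (size * 8)) - 1) << pos;
    uint32_t old, tmp;

    do {
        old = *aligned;                              /* lr.w in the real code */
        tmp = (old & ~mask) | ((new << pos) & mask);
    } while ( !__atomic_compare_exchange_n(aligned, &old, tmp, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST) ); /* sc.w + bnez */

    return (old & mask) >> pos;      /* previous 1/2-byte value, as xchg does */
}
```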

The implementation of the 4- and 8-byte cases was left as it is done in the
Linux kernel, since according to the RISC-V spec:
```
Table A.5 (only part of the table was copied here)

Linux Construct       RVWMO Mapping
atomic <op> relaxed    amo<op>.{w|d}
atomic <op> acquire    amo<op>.{w|d}.aq
atomic <op> release    amo<op>.{w|d}.rl
atomic <op>            amo<op>.{w|d}.aqrl

Linux Construct       RVWMO LR/SC Mapping
atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR
                       fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop

The Linux mappings for release operations may seem stronger than necessary,
but these mappings are needed to cover some cases in which Linux requires
stronger orderings than the more intuitive mappings would provide.
In particular, as of the time this text is being written, Linux is actively
debating whether to require load-load, load-store, and store-store orderings
between accesses in one critical section and accesses in a subsequent critical
section in the same hart and protected by the same synchronization object.
Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings
combine to provide such orderings.
There are a few ways around this problem, including:
1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices
   but is undesirable, as it defeats the purpose of the aq/rl modifiers.
2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not
   currently work due to the lack of load and store opcodes with aq and rl
   modifiers.
3. Strengthen the mappings of release operations such that they would
   enforce sufficient orderings in the presence of either type of acquire mapping.
   This is the currently-recommended solution, and the one shown in Table A.5.
```

But in the Linux kernel, atomics were strengthened with fences:
```
Atomics present the same issue with locking: release and acquire
variants need to be strengthened to meet the constraints defined
by the Linux-kernel memory consistency model [1].

Atomics present a further issue: implementations of atomics such
as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
which do not give full-ordering with .aqrl; for example, current
implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
below to end up with the state indicated in the "exists" clause.

In order to "synchronize" LKMM and RISC-V's implementation, this
commit strengthens the implementations of the atomics operations
by replacing .rl and .aq with the use of ("lightweigth") fences,
and by replacing .aqrl LR/SC pairs in sequences such as:

0:      lr.w.aqrl  %0, %addr
        bne        %0, %old, 1f
        ...
        sc.w.aqrl  %1, %new, %addr
        bnez       %1, 0b
1:

with sequences of the form:

0:      lr.w       %0, %addr
        bne        %0, %old, 1f
              ...
        sc.w.rl    %1, %new, %addr   /* SC-release   */
        bnez       %1, 0b
        fence      rw, rw            /* "full" fence */
1:

following Daniel's suggestion.

These modifications were validated with simulation of the RISC-V
memory consistency model.

C lr-sc-aqrl-pair-vs-full-barrier

{}

P0(int *x, int *y, atomic_t *u)
{
        int r0;
        int r1;

        WRITE_ONCE(*x, 1);
        r0 = atomic_cmpxchg(u, 0, 1);
        r1 = READ_ONCE(*y);
}

P1(int *x, int *y, atomic_t *v)
{
        int r0;
        int r1;

        WRITE_ONCE(*y, 1);
        r0 = atomic_cmpxchg(v, 0, 1);
        r1 = READ_ONCE(*x);
}

exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)

[1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2
https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
https://marc.info/?l=linux-kernel&m=151633436614259&w=2
```

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - update the commit message.
 - drop ALIGN_DOWN().
 - update the definition of emulate_xchg_1_2(): 
   - lr.d -> lr.w, sc.d -> sc.w.
   - drop ret argument.
   - code style fixes around asm volatile.
   - update prototype.
   - use asm named operands.
   - rename local variables.
   - add comment above the macros
 - update the definition of __xchg_generic:
   - drop local ptr__ variable.
   - code style fixes around switch()
   - update prototype.
 - introduce RISCV_FULL_BARRIES.
 - redefine cmpxchg()
 - update emulate_cmpxchg_1_2():
   - update prototype
   - update local variable names and their usage
   - use named asm operands.
   - add comment above the macros
---
Changes in V4:
 - Code style fixes.
 - enforce that new and *ptr have the same type in __xchg_*(); also "\n"
   was removed at the end of the asm instruction.
 - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
 - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
 - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
 - drop cmpxchg{32,64}_{local} as they aren't used.
 - introduce generic version of xchg_* and cmpxchg_*.
 - update the commit message.
---
Changes in V3:
 - update the commit message
 - add emulation of {cmp}xchg_... for 1- and 2-byte types
---
Changes in V2:
 - update the comment at the top of the header.
 - change xen/lib.h to xen/bug.h.
 - sort inclusion of headers properly.
---
 xen/arch/riscv/include/asm/cmpxchg.h | 258 +++++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h

Comments

Jan Beulich March 6, 2024, 2:56 p.m. UTC | #1
On 26.02.2024 18:38, Oleksii Kurochko wrote:
> The header was taken from Linux kernl 6.4.0-rc1.
> 
> Addionally, were updated:
> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>   access.
> * replace tabs with spaces
> * replace __* variale with *__
> * introduce generic version of xchg_* and cmpxchg_*.
> 
> Implementation of 4- and 8-byte cases were left as it is done in
> Linux kernel as according to the RISC-V spec:
> ```
> Table A.5 ( only part of the table was copied here )
> 
> Linux Construct       RVWMO Mapping
> atomic <op> relaxed    amo<op>.{w|d}
> atomic <op> acquire    amo<op>.{w|d}.aq
> atomic <op> release    amo<op>.{w|d}.rl
> atomic <op>            amo<op>.{w|d}.aqrl
> 
> Linux Construct       RVWMO LR/SC Mapping
> atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
> atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR
>                        fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
> atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
> 
> The Linux mappings for release operations may seem stronger than necessary,
> but these mappings are needed to cover some cases in which Linux requires
> stronger orderings than the more intuitive mappings would provide.
> In particular, as of the time this text is being written, Linux is actively
> debating whether to require load-load, load-store, and store-store orderings
> between accesses in one critical section and accesses in a subsequent critical
> section in the same hart and protected by the same synchronization object.
> Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings
> combine to provide such orderings.
> There are a few ways around this problem, including:
> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices
>    but is undesirable, as it defeats the purpose of the aq/rl modifiers.
> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not
>    currently work due to the lack of load and store opcodes with aq and rl
>    modifiers.

As before I don't understand this point. Can you give an example of what
sort of opcode / instruction is missing?

> 3. Strengthen the mappings of release operations such that they would
>    enforce sufficient orderings in the presence of either type of acquire mapping.
>    This is the currently-recommended solution, and the one shown in Table A.5.
> ```
> 
> But in Linux kenrel atomics were strengthen with fences:
> ```
> Atomics present the same issue with locking: release and acquire
> variants need to be strengthened to meet the constraints defined
> by the Linux-kernel memory consistency model [1].
> 
> Atomics present a further issue: implementations of atomics such
> as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
> which do not give full-ordering with .aqrl; for example, current
> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
> below to end up with the state indicated in the "exists" clause.
> 
> In order to "synchronize" LKMM and RISC-V's implementation, this
> commit strengthens the implementations of the atomics operations
> by replacing .rl and .aq with the use of ("lightweigth") fences,
> and by replacing .aqrl LR/SC pairs in sequences such as:
> 
> 0:      lr.w.aqrl  %0, %addr
>         bne        %0, %old, 1f
>         ...
>         sc.w.aqrl  %1, %new, %addr
>         bnez       %1, 0b
> 1:
> 
> with sequences of the form:
> 
> 0:      lr.w       %0, %addr
>         bne        %0, %old, 1f
>               ...
>         sc.w.rl    %1, %new, %addr   /* SC-release   */
>         bnez       %1, 0b
>         fence      rw, rw            /* "full" fence */
> 1:
> 
> following Daniel's suggestion.
> 
> These modifications were validated with simulation of the RISC-V
> memory consistency model.
> 
> C lr-sc-aqrl-pair-vs-full-barrier
> 
> {}
> 
> P0(int *x, int *y, atomic_t *u)
> {
>         int r0;
>         int r1;
> 
>         WRITE_ONCE(*x, 1);
>         r0 = atomic_cmpxchg(u, 0, 1);
>         r1 = READ_ONCE(*y);
> }
> 
> P1(int *x, int *y, atomic_t *v)
> {
>         int r0;
>         int r1;
> 
>         WRITE_ONCE(*y, 1);
>         r0 = atomic_cmpxchg(v, 0, 1);
>         r1 = READ_ONCE(*x);
> }
> 
> exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)

While I'm entirely willing to trust this can happen, I can't bring this
in line with the A extension spec.

Additionally it's not clear to me in how far all of this applies when
you don't really use LR/SC in the 4- and 8-byte cases (and going forward
likely also not in the 1- and 2-byte case, utilizing Zahba when available).

> ---
> Changes in V5:
>  - update the commit message.
>  - drop ALIGN_DOWN().
>  - update the definition of emulate_xchg_1_2(): 
>    - lr.d -> lr.w, sc.d -> sc.w.
>    - drop ret argument.
>    - code style fixes around asm volatile.
>    - update prototype.
>    - use asm named operands.
>    - rename local variables.
>    - add comment above the macros
>  - update the definition of __xchg_generic:
>    - drop local ptr__ variable.
>    - code style fixes around switch()
>    - update prototype.
>  - introduce RISCV_FULL_BARRIES.
>  - redefine cmpxchg()
>  - update emulate_cmpxchg_1_2():
>    - update prototype
>    - update local variables names and usage of them
>    - use name asm operands.
>    - add comment above the macros
> ---
> Changes in V4:
>  - Code style fixes.
>  - enforce in __xchg_*() has the same type for new and *ptr, also "\n"
>    was removed at the end of asm instruction.
>  - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
>  - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
>  - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
>  - drop cmpxcg{32,64}_{local} as they aren't used.
>  - introduce generic version of xchg_* and cmpxchg_*.
>  - update the commit message.
> ---
> Changes in V3:
>  - update the commit message
>  - add emulation of {cmp}xchg_... for 1 and 2 bytes types
> ---
> Changes in V2:
>  - update the comment at the top of the header.
>  - change xen/lib.h to xen/bug.h.
>  - sort inclusion of headers properly.
> ---
>  xen/arch/riscv/include/asm/cmpxchg.h | 258 +++++++++++++++++++++++++++
>  1 file changed, 258 insertions(+)
>  create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h
> 
> diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h
> new file mode 100644
> index 0000000000..66cbe26737
> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/cmpxchg.h
> @@ -0,0 +1,258 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (C) 2014 Regents of the University of California */
> +
> +#ifndef _ASM_RISCV_CMPXCHG_H
> +#define _ASM_RISCV_CMPXCHG_H
> +
> +#include <xen/compiler.h>
> +#include <xen/lib.h>
> +
> +#include <asm/fence.h>
> +#include <asm/io.h>
> +#include <asm/system.h>
> +
> +#define __amoswap_generic(ptr, new, ret, sfx, pre, post) \
> +({ \
> +    asm volatile( \

Nit: In Xen style this is lacking a blank ahead of the opening parenthesis.

> +        pre \
> +        " amoswap" sfx " %0, %2, %1\n" \
> +        post \
> +        : "=r" (ret), "+A" (*ptr) \
> +        : "r" (new) \
> +        : "memory" ); \
> +})
> +
> +/*
> + * For LR and SC, the A extension requires that the address held in rs1 be
> + * naturally aligned to the size of the operand (i.e., eight-byte aligned
> + * for 64-bit words and four-byte aligned for 32-bit words).
> + * If the address is not naturally aligned, an address-misaligned exception
> + * or an access-fault exception will be generated.
> + * 
> + * Thereby:
> + * - for 1-byte xchg access the containing word by clearing low two bits
> + * - for 2-byte xchg ccess the containing word by clearing first bit.

"first bit" can still be ambiguous. Better say "bit 1".

> + * 

Here and apparently also elsewhere: Stray trailing blank. Git has a config
setting to warn you about (maybe even to automatically strip?) such blanks.
> + * If resulting 4-byte access is still misalgined, it will fault just as
> + * non-emulated 4-byte access would.
> + */
> +#define emulate_xchg_1_2(ptr, new, sc_sfx, pre, post) \
> +({ \
> +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*ptr))); \

Here and elsewhere: sizeof(*(ptr)) (i.e. the inner parentheses are needed
also there).
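
For instance (purely illustrative, with a hypothetical invocation where the
macro argument is an expression rather than a plain identifier):

    short buf[4];
    short *p = buf;
    /* if invoked as emulate_xchg_1_2(p + 1, ...), ptr expands to "p + 1": */
    sizeof(*ptr)   /* -> sizeof(*p + 1)   == sizeof(int), after integer promotion */
    sizeof(*(ptr)) /* -> sizeof(*(p + 1)) == sizeof(short), the intended operand  */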

> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \

Why uint8_t?

> +    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
> +    unsigned int new_ = new << new_val_pos; \
> +    unsigned int old_val; \
> +    unsigned int xchged_val; \
> +    \
> +    asm volatile ( \
> +        pre \
> +        "0: lr.w %[op_oldval], %[op_aligned_ptr]\n" \
> +        "   and  %[op_xchged_val], %[op_oldval], %z[op_nmask]\n" \
> +        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n" \
> +        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val], %[op_aligned_ptr]\n" \
> +        "   bnez %[op_xchged_val], 0b\n" \
> +        post \
> +        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r" (xchged_val), [op_aligned_ptr]"+A" (*aligned_ptr) \

Too long line. Partly because you have op_ prefixes here whose purpose I
can't recognize. The val / _val suffixes also don't appear to carry much
useful information. And "xchged", being explicitly past tense, doesn't
look to fit even up to and including the SC, not to speak of afterwards.
Anything wrong with calling this just tmp, aux, or scratch?

> +        : [op_new] "rJ" (new_), [op_nmask] "rJ" (~mask) \
> +        : "memory" ); \
> +    \
> +    (__typeof__(*(ptr)))((old_val & mask) >> new_val_pos); \
> +})
> +
> +#define __xchg_generic(ptr, new, size, sfx, pre, post) \
> +({ \
> +    __typeof__(*(ptr)) new__ = (new); \
> +    __typeof__(*(ptr)) ret__; \
> +    switch ( size ) \

Can't this use sizeof(*(ptr)), allowing for one less macro parameter?

> +    { \
> +    case 1: \
> +    case 2: \
> +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
> +        break; \
> +    case 4: \
> +        __amoswap_generic(ptr, new__, ret__,\
> +                          ".w" sfx,  pre, post); \
> +        break; \
> +    case 8: \
> +        __amoswap_generic(ptr, new__, ret__,\
> +                          ".d" sfx,  pre, post); \
> +        break; \

In io.h you make sure to avoid rv64-only insns. Here you don't. The build
would fail either way, but this still looks inconsistent.

Also nit: Stray double blanks (twice) ahead of "pre". Plus with this style
of line continuation you want to consistently have exactly one blank ahead
of each backslash.

> +    default: \
> +        STATIC_ASSERT_UNREACHABLE(); \
> +    } \
> +    ret__; \
> +})
> +
> +#define xchg_relaxed(ptr, x) \
> +({ \
> +    __typeof__(*(ptr)) x_ = (x); \

What is the purpose of this, when __xchg_generic() already does this same
type conversion?

> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \
> +})
> +
> +#define xchg_acquire(ptr, x) \
> +({ \
> +    __typeof__(*(ptr)) x_ = (x); \
> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
> +                                       "", "", RISCV_ACQUIRE_BARRIER); \
> +})
> +
> +#define xchg_release(ptr, x) \
> +({ \
> +    __typeof__(*(ptr)) x_ = (x); \
> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
> +                                       "", RISCV_RELEASE_BARRIER, ""); \
> +})

As asked before: Are there going to be any uses of these three? Common
code doesn't require them. And not needing to provide them would
simplify things quite a bit, it seems.

> +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
> +                                    ".aqrl", "", "")

According to the earlier comment (where I don't follow the example given),
is .aqrl sufficient here? And even if it was for the 4- and 8-byte cases,
is it sufficient in the 1- and 2-byte emulation case (where it then is
appended to just the SC)?

> +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, pre, post)	\
> + ({ \
> +    register unsigned int rc; \
> +    asm volatile( \
> +        pre \
> +        "0: lr" lr_sfx " %0, %2\n" \
> +        "   bne  %0, %z3, 1f\n" \
> +        "   sc" sc_sfx " %1, %z4, %2\n" \
> +        "   bnez %1, 0b\n" \
> +        post \
> +        "1:\n" \
> +        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
> +        : "rJ" (old), "rJ" (new) \
> +        : "memory"); \
> + })
> +
> +/*
> + * For LR and SC, the A extension requires that the address held in rs1 be
> + * naturally aligned to the size of the operand (i.e., eight-byte aligned
> + * for 64-bit words and four-byte aligned for 32-bit words).
> + * If the address is not naturally aligned, an address-misaligned exception
> + * or an access-fault exception will be generated.
> + * 
> + * Thereby:
> + * - for 1-byte xchg access the containing word by clearing low two bits
> + * - for 2-byte xchg ccess the containing word by clearing first bit.
> + * 
> + * If resulting 4-byte access is still misalgined, it will fault just as
> + * non-emulated 4-byte access would.
> + *
> + * old_val was casted to unsigned long at the end of the define because of
> + * the following issue:
> + * ./arch/riscv/include/asm/cmpxchg.h:166:5: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> + * 166 |     (__typeof__(*(ptr)))(old_val >> new_val_pos); \
> + *     |     ^
> + * ./arch/riscv/include/asm/cmpxchg.h:184:17: note: in expansion of macro 'emulate_cmpxchg_1_2'
> + * 184 |         ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
> + *     |                 ^~~~~~~~~~~~~~~~~~~
> + * ./arch/riscv/include/asm/cmpxchg.h:227:5: note: in expansion of macro '__cmpxchg_generic'
> + * 227 |     __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
> + *     |     ^~~~~~~~~~~~~~~~~
> + * ./include/xen/lib.h:141:26: note: in expansion of macro '__cmpxchg'
> + * 141 |     ((__typeof__(*(ptr)))__cmpxchg(ptr, (unsigned long)o_,              \
> + *     |                          ^~~~~~~~~
> + * common/event_channel.c:109:13: note: in expansion of macro 'cmpxchgptr'
> + * 109 |             cmpxchgptr(&xen_consumers[i], NULL, fn);
> + */

This is too much detail on the compile issue. Just mentioning that said
cast is needed for cmpxchgptr() ought to be sufficient.
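
For example, a shortened comment along these lines would do (just one possible
wording, keeping only the essential reason):

    /*
     * The cast of old_val to unsigned long is needed so that cmpxchgptr()
     * can cast the result back to a pointer without triggering
     * -Wint-to-pointer-cast.
     */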

> +#define emulate_cmpxchg_1_2(ptr, old, new, sc_sfx, pre, post) \
> +({ \
> +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*ptr))); \
> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \
> +    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
> +    unsigned int old_ = old << new_val_pos; \
> +    unsigned int new_ = new << new_val_pos; \
> +    unsigned int old_val; \
> +    unsigned int xchged_val; \
> +    \
> +    __asm__ __volatile__ ( \
> +        pre \
> +        "0: lr.w %[op_xchged_val], %[op_aligned_ptr]\n" \
> +        "   and  %[op_oldval], %[op_xchged_val], %z[op_mask]\n" \
> +        "   bne  %[op_oldval], %z[op_old], 1f\n" \
> +        "   xor  %[op_xchged_val], %[op_oldval], %[op_xchged_val]\n" \
> +        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n" \
> +        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val], %[op_aligned_ptr]\n" \
> +        "   bnez %[op_xchged_val], 0b\n" \
> +        post \
> +        "1:\n" \
> +        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r" (xchged_val), [op_aligned_ptr] "+A" (*aligned_ptr) \
> +        : [op_old] "rJ" (old_), [op_new] "rJ" (new_), \
> +          [op_mask] "rJ" (mask) \
> +        : "memory" ); \
> +    \
> +    (__typeof__(*(ptr)))((unsigned long)old_val >> new_val_pos); \
> +})
> +
> +/*
> + * Atomic compare and exchange.  Compare OLD with MEM, if identical,
> + * store NEW in MEM.  Return the initial value in MEM.  Success is
> + * indicated by comparing RETURN with OLD.
> + */
> +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, pre, post) \
> +({ \
> +    __typeof__(ptr) ptr__ = (ptr); \
> +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
> +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
> +    __typeof__(*(ptr)) ret__; \
> +    switch ( size ) \
> +    { \
> +    case 1: \
> +    case 2: \
> +        ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
> +                            sc_sfx, pre, post); \
> +        break; \
> +    case 4: \
> +        __generic_cmpxchg(ptr__, old__, new__, ret__, \
> +                          ".w", ".w"sc_sfx, pre, post); \
> +        break; \
> +    case 8: \
> +        __generic_cmpxchg(ptr__, old__, new__, ret__, \
> +                          ".d", ".d"sc_sfx, pre, post); \
> +        break; \
> +    default: \
> +        STATIC_ASSERT_UNREACHABLE(); \
> +    } \
> +    ret__; \
> +})
> +
> +#define cmpxchg_relaxed(ptr, o, n) \
> +({ \
> +    __typeof__(*(ptr)) o_ = (o); \
> +    __typeof__(*(ptr)) n_ = (n); \
> +    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \
> +                    o_, n_, sizeof(*(ptr)), "", "", ""); \
> +})
> +
> +#define cmpxchg_acquire(ptr, o, n) \
> +({ \
> +    __typeof__(*(ptr)) o_ = (o); \
> +    __typeof__(*(ptr)) n_ = (n); \
> +    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \
> +                                          "", "", RISCV_ACQUIRE_BARRIER); \
> +})
> +
> +#define cmpxchg_release(ptr, o, n) \
> +({ \
> +    __typeof__(*(ptr)) o_ = (o); \
> +    __typeof__(*(ptr)) n_ = (n); \
> +    (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_, sizeof(*(ptr)), \

There's no __cmpxchg_release() afaics; dym __cmpxchg_generic()?

Jan
Oleksii Kurochko March 7, 2024, 10:35 a.m. UTC | #2
On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
> On 26.02.2024 18:38, Oleksii Kurochko wrote:
> > The header was taken from Linux kernl 6.4.0-rc1.
> > 
> > Addionally, were updated:
> > * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
> >   access.
> > * replace tabs with spaces
> > * replace __* variale with *__
> > * introduce generic version of xchg_* and cmpxchg_*.
> > 
> > Implementation of 4- and 8-byte cases were left as it is done in
> > Linux kernel as according to the RISC-V spec:
> > ```
> > Table A.5 ( only part of the table was copied here )
> > 
> > Linux Construct       RVWMO Mapping
> > atomic <op> relaxed    amo<op>.{w|d}
> > atomic <op> acquire    amo<op>.{w|d}.aq
> > atomic <op> release    amo<op>.{w|d}.rl
> > atomic <op>            amo<op>.{w|d}.aqrl
> > 
> > Linux Construct       RVWMO LR/SC Mapping
> > atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> > atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
> > atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
> > loop OR
> >                        fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ;
> > bnez loop
> > atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
> > loop
> > 
> > The Linux mappings for release operations may seem stronger than
> > necessary,
> > but these mappings are needed to cover some cases in which Linux
> > requires
> > stronger orderings than the more intuitive mappings would provide.
> > In particular, as of the time this text is being written, Linux is
> > actively
> > debating whether to require load-load, load-store, and store-store
> > orderings
> > between accesses in one critical section and accesses in a
> > subsequent critical
> > section in the same hart and protected by the same synchronization
> > object.
> > Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl
> > mappings
> > combine to provide such orderings.
> > There are a few ways around this problem, including:
> > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
> > suffices
> >    but is undesirable, as it defeats the purpose of the aq/rl
> > modifiers.
> > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
> > not
> >    currently work due to the lack of load and store opcodes with aq
> > and rl
> >    modifiers.
> 
> As before I don't understand this point. Can you give an example of
> what
> sort of opcode / instruction is missing?
If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
instructions don't have aq or rl annotation. Here is text from the
spec:
   ARM Operation                  RVWMO Mapping
   Load                           l{b|h|w|d}
   Load-Acquire                   fence rw, rw; l{b|h|w|d}; fence r,rw 
   Load-Exclusive                 lr.{w|d}
   Load-Acquire-Exclusive         lr.{w|d}.aqrl
   Store                          s{b|h|w|d}
   Store-Release                  fence rw,w; s{b|h|w|d}
   Store-Exclusive                sc.{w|d}
   Store-Release-Exclusive        sc.{w|d}.rl
   dmb                            fence rw,rw
   dmb.ld                         fence r,rw
   dmb.st                         fence w,w
   isb                            fence.i; fence r,r
     Table A.4: Mappings from ARM operations to RISC-V operations

   Table A.4 provides a mapping from ARM memory operations onto RISC-V
   memory instructions. Since RISC-V does not currently have plain load
   and store opcodes with aq or rl annotations, ARM load-acquire and
   store-release operations should be mapped using fences instead.
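
For example (illustrative RISC-V assembly following Table A.4 above; register
choices are arbitrary):

    # ARM Load-Acquire: no "lw.aq" opcode exists, so fences are used instead
    fence rw, rw
    lw    t0, 0(a0)
    fence r, rw

    # ARM Store-Release: no "sw.rl" opcode exists either
    fence rw, w
    sw    t1, 0(a0)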

> 
> > 3. Strengthen the mappings of release operations such that they
> > would
> >    enforce sufficient orderings in the presence of either type of
> > acquire mapping.
> >    This is the currently-recommended solution, and the one shown in
> > Table A.5.
> > ```
> > 
> > But in Linux kenrel atomics were strengthen with fences:
> > ```
> > Atomics present the same issue with locking: release and acquire
> > variants need to be strengthened to meet the constraints defined
> > by the Linux-kernel memory consistency model [1].
> > 
> > Atomics present a further issue: implementations of atomics such
> > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
> > which do not give full-ordering with .aqrl; for example, current
> > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
> > below to end up with the state indicated in the "exists" clause.
> > 
> > In order to "synchronize" LKMM and RISC-V's implementation, this
> > commit strengthens the implementations of the atomics operations
> > by replacing .rl and .aq with the use of ("lightweigth") fences,
> > and by replacing .aqrl LR/SC pairs in sequences such as:
> > 
> > 0:      lr.w.aqrl  %0, %addr
> >         bne        %0, %old, 1f
> >         ...
> >         sc.w.aqrl  %1, %new, %addr
> >         bnez       %1, 0b
> > 1:
> > 
> > with sequences of the form:
> > 
> > 0:      lr.w       %0, %addr
> >         bne        %0, %old, 1f
> >               ...
> >         sc.w.rl    %1, %new, %addr   /* SC-release   */
> >         bnez       %1, 0b
> >         fence      rw, rw            /* "full" fence */
> > 1:
> > 
> > following Daniel's suggestion.
> > 
> > These modifications were validated with simulation of the RISC-V
> > memory consistency model.
> > 
> > C lr-sc-aqrl-pair-vs-full-barrier
> > 
> > {}
> > 
> > P0(int *x, int *y, atomic_t *u)
> > {
> >         int r0;
> >         int r1;
> > 
> >         WRITE_ONCE(*x, 1);
> >         r0 = atomic_cmpxchg(u, 0, 1);
> >         r1 = READ_ONCE(*y);
> > }
> > 
> > P1(int *x, int *y, atomic_t *v)
> > {
> >         int r0;
> >         int r1;
> > 
> >         WRITE_ONCE(*y, 1);
> >         r0 = atomic_cmpxchg(v, 0, 1);
> >         r1 = READ_ONCE(*x);
> > }
> > 
> > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
> 
> While I'm entirely willing to trust this can happen, I can't bring
> this
> in line with the A extension spec.
> 
> Additionally it's not clear to me in how far all of this applies when
> you don't really use LR/SC in the 4- and 8-byte cases (and going
> forward
> likely also not in the 1- and 2-byte case, utilizing Zahba when
> available).
It just explains which combinations of fences, lr/sc, amoswap and .aq/.rl
annotations can be used together, and why the combinations introduced in
this patch were chosen.

> 
> > ---
> > Changes in V5:
> >  - update the commit message.
> >  - drop ALIGN_DOWN().
> >  - update the definition of emulate_xchg_1_2(): 
> >    - lr.d -> lr.w, sc.d -> sc.w.
> >    - drop ret argument.
> >    - code style fixes around asm volatile.
> >    - update prototype.
> >    - use asm named operands.
> >    - rename local variables.
> >    - add comment above the macros
> >  - update the definition of __xchg_generic:
> >    - drop local ptr__ variable.
> >    - code style fixes around switch()
> >    - update prototype.
> >  - introduce RISCV_FULL_BARRIES.
> >  - redefine cmpxchg()
> >  - update emulate_cmpxchg_1_2():
> >    - update prototype
> >    - update local variables names and usage of them
> >    - use name asm operands.
> >    - add comment above the macros
> > ---
> > Changes in V4:
> >  - Code style fixes.
> >  - enforce in __xchg_*() has the same type for new and *ptr, also
> > "\n"
> >    was removed at the end of asm instruction.
> >  - dependency from
> > https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
> >  - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
> >  - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
> >  - drop cmpxcg{32,64}_{local} as they aren't used.
> >  - introduce generic version of xchg_* and cmpxchg_*.
> >  - update the commit message.
> > ---
> > Changes in V3:
> >  - update the commit message
> >  - add emulation of {cmp}xchg_... for 1 and 2 bytes types
> > ---
> > Changes in V2:
> >  - update the comment at the top of the header.
> >  - change xen/lib.h to xen/bug.h.
> >  - sort inclusion of headers properly.
> > ---
> >  xen/arch/riscv/include/asm/cmpxchg.h | 258
> > +++++++++++++++++++++++++++
> >  1 file changed, 258 insertions(+)
> >  create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h
> > 
> > diff --git a/xen/arch/riscv/include/asm/cmpxchg.h
> > b/xen/arch/riscv/include/asm/cmpxchg.h
> > new file mode 100644
> > index 0000000000..66cbe26737
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/cmpxchg.h
> > @@ -0,0 +1,258 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/* Copyright (C) 2014 Regents of the University of California */
> > +
> > +#ifndef _ASM_RISCV_CMPXCHG_H
> > +#define _ASM_RISCV_CMPXCHG_H
> > +
> > +#include <xen/compiler.h>
> > +#include <xen/lib.h>
> > +
> > +#include <asm/fence.h>
> > +#include <asm/io.h>
> > +#include <asm/system.h>
> > +
> > +#define __amoswap_generic(ptr, new, ret, sfx, pre, post) \
> > +({ \
> > +    asm volatile( \
> 
> Nit: In Xen style this is lacking a blank ahead of the opening
> parenthesis.
> 
> > +        pre \
> > +        " amoswap" sfx " %0, %2, %1\n" \
> > +        post \
> > +        : "=r" (ret), "+A" (*ptr) \
> > +        : "r" (new) \
> > +        : "memory" ); \
> > +})
> > +
> > +/*
> > + * For LR and SC, the A extension requires that the address held
> > in rs1 be
> > + * naturally aligned to the size of the operand (i.e., eight-byte
> > aligned
> > + * for 64-bit words and four-byte aligned for 32-bit words).
> > + * If the address is not naturally aligned, an address-misaligned
> > exception
> > + * or an access-fault exception will be generated.
> > + * 
> > + * Thereby:
> > + * - for 1-byte xchg access the containing word by clearing low
> > two bits
> > + * - for 2-byte xchg ccess the containing word by clearing first
> > bit.
> 
> "first bit" can still be ambiguous. Better say "bit 1".
> 
> > + * 
> 
> Here and apparently also elsewhere: Stray trailing blank. Git has a
> config
> setting to warn you about (maybe even to automatically strip? such.
It would be useful for me. Thanks a lot for the recommendation.

> 
> > + * If resulting 4-byte access is still misalgined, it will fault
> > just as
> > + * non-emulated 4-byte access would.
> > + */
> > +#define emulate_xchg_1_2(ptr, new, sc_sfx, pre, post) \
> > +({ \
> > +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr &
> > ~(0x4 - sizeof(*ptr))); \
> 
> Here and elsewhere: sizeof(*(ptr)) (i.e. the inner parentheses are
> needed
> also there).
> 
> > +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 -
> > sizeof(*ptr))) * BITS_PER_BYTE; \
> 
> Why uint8_t?
It is enough to cover the possible start bit positions of the value that
should be updated, so I decided to use uint8_t.

> 
> > +    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE)
> > - 1, 0) << new_val_pos; \
> > +    unsigned int new_ = new << new_val_pos; \
> > +    unsigned int old_val; \
> > +    unsigned int xchged_val; \
> > +    \
> > +    asm volatile ( \
> > +        pre \
> > +        "0: lr.w %[op_oldval], %[op_aligned_ptr]\n" \
> > +        "   and  %[op_xchged_val], %[op_oldval], %z[op_nmask]\n" \
> > +        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n"
> > \
> > +        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val],
> > %[op_aligned_ptr]\n" \
> > +        "   bnez %[op_xchged_val], 0b\n" \
> > +        post \
> > +        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r"
> > (xchged_val), [op_aligned_ptr]"+A" (*aligned_ptr) \
> 
> Too long line. Partly because you have op_ prefixes here which I
> can't
> recognized what they would be good for. The val / _val suffixes also
> don't appear to carry much useful information. And "xchged", being
> explicitly past tense, doesn't look to fit even up and until the SC,
> not to speak of afterwards. Anything wrong with calling this just
> tmp,
> aux, or scratch?
op_ can be dropped, and a named operand can have the same name as the
local variable. I thought it would be useful to make clear that it is a
named operand, but on reflection that looks like unneeded overhead.

In the case of emulate_xchg_1_2() the val/_val suffixes indeed carry no
useful meaning, as the local variables don't clash with any macro argument;
the suffixes were only added to stay in sync with the emulate_cmpxchg_1_2()
macro. In emulate_cmpxchg_1_2(ptr, old, new, sc_sfx, pre, post), however,
the macro has an "old" argument, so _val was added to distinguish the two.
Probably it would be better to rename it to read or read_old.


> 
> > +        : [op_new] "rJ" (new_), [op_nmask] "rJ" (~mask) \
> > +        : "memory" ); \
> > +    \
> > +    (__typeof__(*(ptr)))((old_val & mask) >> new_val_pos); \
> > +})
> > +
> > +#define __xchg_generic(ptr, new, size, sfx, pre, post) \
> > +({ \
> > +    __typeof__(*(ptr)) new__ = (new); \
> > +    __typeof__(*(ptr)) ret__; \
> > +    switch ( size ) \
> 
> Can't this use sizeof(*(ptr)), allowing for one less macro parameter?
> 
> > +    { \
> > +    case 1: \
> > +    case 2: \
> > +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
> > +        break; \
> > +    case 4: \
> > +        __amoswap_generic(ptr, new__, ret__,\
> > +                          ".w" sfx,  pre, post); \
> > +        break; \
> > +    case 8: \
> > +        __amoswap_generic(ptr, new__, ret__,\
> > +                          ".d" sfx,  pre, post); \
> > +        break; \
> 
> In io.h you make sure to avoid rv64-only insns. Here you don't. The
> build
> would fail either way, but this still looks inconsistent.
> 
> Also nit: Stray double blands (twice) ahead of "pre". Plus with this
> style
> of line continuation you want to consistently have exactly one blank
> ahead
> of each backslash.
> 
> > +    default: \
> > +        STATIC_ASSERT_UNREACHABLE(); \
> > +    } \
> > +    ret__; \
> > +})
> > +
> > +#define xchg_relaxed(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> 
> What is the purpose of this, when __xchg_generic() already does this
> same
> type conversion?
> 
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),
> > "", "", ""); \
> > +})
> > +
> > +#define xchg_acquire(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
> > +                                       "", "",
> > RISCV_ACQUIRE_BARRIER); \
> > +})
> > +
> > +#define xchg_release(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
> > +                                       "", RISCV_RELEASE_BARRIER,
> > ""); \
> > +})
> 
> As asked before: Are there going to be any uses of these three?
> Common
> code doesn't require them. And not needing to provide them would
> simplify things quite a bit, it seems.
I checked my private branches and it looks to me that I introduced them
only for the corresponding atomic operations (which were copied from the
Linux kernel), which are also not used.

So we could definitely drop these macros for now, but should
__xchg_generic() be updated as well? If we look at:
 #define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
                                     ".aqrl", "", "")
the last two arguments become unneeded, but I wanted to leave them in
case someone needs to bring back xchg_{release, acquire, ...}. Does
that make sense?

> 
> > +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x),
> > sizeof(*(ptr)), \
> > +                                    ".aqrl", "", "")
> 
> According to the earlier comment (where I don't follow the example
> given),
> is .aqrl sufficient here? And even if it was for the 4- and 8-byte
> cases,
> is it sufficient in the 1- and 2-byte emulation case (where it then
> is
> appended to just the SC)?
If I understand your question correctly then, according to the spec,
.aqrl is enough for amo<op>.{w|d} instructions:
   Linux Construct        RVWMO AMO Mapping
   atomic <op> relaxed    amo<op>.{w|d}
   atomic <op> acquire    amo<op>.{w|d}.aq
   atomic <op> release    amo<op>.{w|d}.rl
   atomic <op>            amo<op>.{w|d}.aqrl
but in the case of lr/sc you are right, the sc requires a suffix too:
   Linux Construct        RVWMO LR/SC Mapping
   atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
   atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
   atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop
                          OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
   atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop

I will add sc_sfx to emulate_xchg_1_2(). The only remaining question is
whether __xchg_generic(ptr, new, size, sfx, pre, post) should be changed to
__xchg_generic(ptr, new, size, sfx1, sfx2, pre, post) to cover both the
amo<op>.{w|d}.sfx1 case and the lr.{w|d}.sfx1 ... sc.{w|d}.sfx2 case?
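
A sketch of what that adjusted shape might look like (illustration of the
proposal only; the amo_sfx/sc_sfx parameter names are placeholders):

    #define __xchg_generic(ptr, new, size, amo_sfx, sc_sfx, pre, post) \
    ({ \
        __typeof__(*(ptr)) new__ = (new); \
        __typeof__(*(ptr)) ret__; \
        switch ( size ) \
        { \
        case 1: \
        case 2: \
            /* LR/SC path: the suffix ends up on the SC */ \
            ret__ = emulate_xchg_1_2(ptr, new__, sc_sfx, pre, post); \
            break; \
        case 4: \
            __amoswap_generic(ptr, new__, ret__, ".w" amo_sfx, pre, post); \
            break; \
        case 8: \
            __amoswap_generic(ptr, new__, ret__, ".d" amo_sfx, pre, post); \
            break; \
        default: \
            STATIC_ASSERT_UNREACHABLE(); \
        } \
        ret__; \
    })

Which concrete suffix (or fence) combinations the xchg()/cmpxchg() wrappers
then pass is exactly the open question above.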

~ Oleksii

> 
> > +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, pre,
> > post)	\
> > + ({ \
> > +    register unsigned int rc; \
> > +    asm volatile( \
> > +        pre \
> > +        "0: lr" lr_sfx " %0, %2\n" \
> > +        "   bne  %0, %z3, 1f\n" \
> > +        "   sc" sc_sfx " %1, %z4, %2\n" \
> > +        "   bnez %1, 0b\n" \
> > +        post \
> > +        "1:\n" \
> > +        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
> > +        : "rJ" (old), "rJ" (new) \
> > +        : "memory"); \
> > + })
> > +
> > +/*
> > + * For LR and SC, the A extension requires that the address held
> > in rs1 be
> > + * naturally aligned to the size of the operand (i.e., eight-byte
> > aligned
> > + * for 64-bit words and four-byte aligned for 32-bit words).
> > + * If the address is not naturally aligned, an address-misaligned
> > exception
> > + * or an access-fault exception will be generated.
> > + * 
> > + * Thereby:
> > + * - for 1-byte xchg access the containing word by clearing low
> > two bits
> > + * - for 2-byte xchg ccess the containing word by clearing first
> > bit.
> > + * 
> > + * If resulting 4-byte access is still misalgined, it will fault
> > just as
> > + * non-emulated 4-byte access would.
> > + *
> > + * old_val was casted to unsigned long at the end of the define
> > because of
> > + * the following issue:
> > + * ./arch/riscv/include/asm/cmpxchg.h:166:5: error: cast to
> > pointer from integer of different size [-Werror=int-to-pointer-
> > cast]
> > + * 166 |     (__typeof__(*(ptr)))(old_val >> new_val_pos); \
> > + *     |     ^
> > + * ./arch/riscv/include/asm/cmpxchg.h:184:17: note: in expansion
> > of macro 'emulate_cmpxchg_1_2'
> > + * 184 |         ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
> > + *     |                 ^~~~~~~~~~~~~~~~~~~
> > + * ./arch/riscv/include/asm/cmpxchg.h:227:5: note: in expansion of
> > macro '__cmpxchg_generic'
> > + * 227 |     __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned
> > long)(n), \
> > + *     |     ^~~~~~~~~~~~~~~~~
> > + * ./include/xen/lib.h:141:26: note: in expansion of macro
> > '__cmpxchg'
> > + * 141 |     ((__typeof__(*(ptr)))__cmpxchg(ptr, (unsigned
> > long)o_,              \
> > + *     |                          ^~~~~~~~~
> > + * common/event_channel.c:109:13: note: in expansion of macro
> > 'cmpxchgptr'
> > + * 109 |             cmpxchgptr(&xen_consumers[i], NULL, fn);
> > + */
> 
> This is too much detail on the compile issue. Just mentioning that
> said
> cast is needed for cmpxchgptr() ought to be sufficient.
> 
> > +#define emulate_cmpxchg_1_2(ptr, old, new, sc_sfx, pre, post) \
> > +({ \
> > +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr &
> > ~(0x4 - sizeof(*ptr))); \
> > +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 -
> > sizeof(*ptr))) * BITS_PER_BYTE; \
> > +    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE)
> > - 1, 0) << new_val_pos; \
> > +    unsigned int old_ = old << new_val_pos; \
> > +    unsigned int new_ = new << new_val_pos; \
> > +    unsigned int old_val; \
> > +    unsigned int xchged_val; \
> > +    \
> > +    __asm__ __volatile__ ( \
> > +        pre \
> > +        "0: lr.w %[op_xchged_val], %[op_aligned_ptr]\n" \
> > +        "   and  %[op_oldval], %[op_xchged_val], %z[op_mask]\n" \
> > +        "   bne  %[op_oldval], %z[op_old], 1f\n" \
> > +        "   xor  %[op_xchged_val], %[op_oldval],
> > %[op_xchged_val]\n" \
> > +        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n"
> > \
> > +        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val],
> > %[op_aligned_ptr]\n" \
> > +        "   bnez %[op_xchged_val], 0b\n" \
> > +        post \
> > +        "1:\n" \
> > +        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r"
> > (xchged_val), [op_aligned_ptr] "+A" (*aligned_ptr) \
> > +        : [op_old] "rJ" (old_), [op_new] "rJ" (new_), \
> > +          [op_mask] "rJ" (mask) \
> > +        : "memory" ); \
> > +    \
> > +    (__typeof__(*(ptr)))((unsigned long)old_val >> new_val_pos); \
> > +})
> > +
> > +/*
> > + * Atomic compare and exchange.  Compare OLD with MEM, if
> > identical,
> > + * store NEW in MEM.  Return the initial value in MEM.  Success is
> > + * indicated by comparing RETURN with OLD.
> > + */
> > +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, pre, post)
> > \
> > +({ \
> > +    __typeof__(ptr) ptr__ = (ptr); \
> > +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
> > +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
> > +    __typeof__(*(ptr)) ret__; \
> > +    switch ( size ) \
> > +    { \
> > +    case 1: \
> > +    case 2: \
> > +        ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
> > +                            sc_sfx, pre, post); \
> > +        break; \
> > +    case 4: \
> > +        __generic_cmpxchg(ptr__, old__, new__, ret__, \
> > +                          ".w", ".w"sc_sfx, pre, post); \
> > +        break; \
> > +    case 8: \
> > +        __generic_cmpxchg(ptr__, old__, new__, ret__, \
> > +                          ".d", ".d"sc_sfx, pre, post); \
> > +        break; \
> > +    default: \
> > +        STATIC_ASSERT_UNREACHABLE(); \
> > +    } \
> > +    ret__; \
> > +})
> > +
> > +#define cmpxchg_relaxed(ptr, o, n) \
> > +({ \
> > +    __typeof__(*(ptr)) o_ = (o); \
> > +    __typeof__(*(ptr)) n_ = (n); \
> > +    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \
> > +                    o_, n_, sizeof(*(ptr)), "", "", ""); \
> > +})
> > +
> > +#define cmpxchg_acquire(ptr, o, n) \
> > +({ \
> > +    __typeof__(*(ptr)) o_ = (o); \
> > +    __typeof__(*(ptr)) n_ = (n); \
> > +    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_,
> > sizeof(*(ptr)), \
> > +                                          "", "",
> > RISCV_ACQUIRE_BARRIER); \
> > +})
> > +
> > +#define cmpxchg_release(ptr, o, n) \
> > +({ \
> > +    __typeof__(*(ptr)) o_ = (o); \
> > +    __typeof__(*(ptr)) n_ = (n); \
> > +    (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_,
> > sizeof(*(ptr)), \
> 
> There's no __cmpxchg_release() afaics; dym __cmpxchg_generic()?
> 
> Jan
Jan Beulich March 7, 2024, 10:46 a.m. UTC | #3
On 07.03.2024 11:35, Oleksii wrote:
> On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
>> On 26.02.2024 18:38, Oleksii Kurochko wrote:
>>> The header was taken from Linux kernl 6.4.0-rc1.
>>>
>>> Addionally, were updated:
>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>>>   access.
>>> * replace tabs with spaces
>>> * replace __* variale with *__
>>> * introduce generic version of xchg_* and cmpxchg_*.
>>>
>>> Implementation of 4- and 8-byte cases were left as it is done in
>>> Linux kernel as according to the RISC-V spec:
>>> ```
>>> Table A.5 ( only part of the table was copied here )
>>>
>>> Linux Construct       RVWMO Mapping
>>> atomic <op> relaxed    amo<op>.{w|d}
>>> atomic <op> acquire    amo<op>.{w|d}.aq
>>> atomic <op> release    amo<op>.{w|d}.rl
>>> atomic <op>            amo<op>.{w|d}.aqrl
>>>
>>> Linux Construct       RVWMO LR/SC Mapping
>>> atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>>> atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>>> atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
>>> loop OR
>>>                        fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ;
>>> bnez loop
>>> atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
>>> loop
>>>
>>> The Linux mappings for release operations may seem stronger than
>>> necessary,
>>> but these mappings are needed to cover some cases in which Linux
>>> requires
>>> stronger orderings than the more intuitive mappings would provide.
>>> In particular, as of the time this text is being written, Linux is
>>> actively
>>> debating whether to require load-load, load-store, and store-store
>>> orderings
>>> between accesses in one critical section and accesses in a
>>> subsequent critical
>>> section in the same hart and protected by the same synchronization
>>> object.
>>> Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl
>>> mappings
>>> combine to provide such orderings.
>>> There are a few ways around this problem, including:
>>> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
>>> suffices
>>>    but is undesirable, as it defeats the purpose of the aq/rl
>>> modifiers.
>>> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
>>> not
>>>    currently work due to the lack of load and store opcodes with aq
>>> and rl
>>>    modifiers.
>>
>> As before I don't understand this point. Can you give an example of
>> what
>> sort of opcode / instruction is missing?
> If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
> instructions don't have aq or rl annotation.

How would load insns other than LR and store insns other than SC come
into play here?

>>> 3. Strengthen the mappings of release operations such that they
>>> would
>>>    enforce sufficient orderings in the presence of either type of
>>> acquire mapping.
>>>    This is the currently-recommended solution, and the one shown in
>>> Table A.5.
>>> ```
>>>
>>> But in Linux kenrel atomics were strengthen with fences:
>>> ```
>>> Atomics present the same issue with locking: release and acquire
>>> variants need to be strengthened to meet the constraints defined
>>> by the Linux-kernel memory consistency model [1].
>>>
>>> Atomics present a further issue: implementations of atomics such
>>> as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
>>> which do not give full-ordering with .aqrl; for example, current
>>> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>>> below to end up with the state indicated in the "exists" clause.
>>>
>>> In order to "synchronize" LKMM and RISC-V's implementation, this
>>> commit strengthens the implementations of the atomics operations
>>> by replacing .rl and .aq with the use of ("lightweigth") fences,
>>> and by replacing .aqrl LR/SC pairs in sequences such as:
>>>
>>> 0:      lr.w.aqrl  %0, %addr
>>>         bne        %0, %old, 1f
>>>         ...
>>>         sc.w.aqrl  %1, %new, %addr
>>>         bnez       %1, 0b
>>> 1:
>>>
>>> with sequences of the form:
>>>
>>> 0:      lr.w       %0, %addr
>>>         bne        %0, %old, 1f
>>>               ...
>>>         sc.w.rl    %1, %new, %addr   /* SC-release   */
>>>         bnez       %1, 0b
>>>         fence      rw, rw            /* "full" fence */
>>> 1:
>>>
>>> following Daniel's suggestion.
>>>
>>> These modifications were validated with simulation of the RISC-V
>>> memory consistency model.
>>>
>>> C lr-sc-aqrl-pair-vs-full-barrier
>>>
>>> {}
>>>
>>> P0(int *x, int *y, atomic_t *u)
>>> {
>>>         int r0;
>>>         int r1;
>>>
>>>         WRITE_ONCE(*x, 1);
>>>         r0 = atomic_cmpxchg(u, 0, 1);
>>>         r1 = READ_ONCE(*y);
>>> }
>>>
>>> P1(int *x, int *y, atomic_t *v)
>>> {
>>>         int r0;
>>>         int r1;
>>>
>>>         WRITE_ONCE(*y, 1);
>>>         r0 = atomic_cmpxchg(v, 0, 1);
>>>         r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>>
>> While I'm entirely willing to trust this can happen, I can't bring
>> this
>> in line with the A extension spec.
>>
>> Additionally it's not clear to me in how far all of this applies when
>> you don't really use LR/SC in the 4- and 8-byte cases (and going
>> forward
>> likely also not in the 1- and 2-byte case, utilizing Zahba when
>> available).
> It just explain what combination of fences, lr/sc, amoswap, .aq and .rl
> annotation can be combined, and why combinations introduced in this
> patch are used.

Except that I don't understand that explanation, iow why said combination
of values could be observed even when using suffixes properly.

>>> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 -
>>> sizeof(*ptr))) * BITS_PER_BYTE; \
>>
>> Why uint8_t?
> It is enough to cover possible start bit position of value that should
> be updated, so I decided to use uint8_t.

Please take a look at the "Types" section in ./CODING_STYLE.

>>> +    { \
>>> +    case 1: \
>>> +    case 2: \
>>> +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
>>> +        break; \
>>> +    case 4: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".w" sfx,  pre, post); \
>>> +        break; \
>>> +    case 8: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".d" sfx,  pre, post); \
>>> +        break; \
>>
>> In io.h you make sure to avoid rv64-only insns. Here you don't. The
>> build
>> would fail either way, but this still looks inconsistent.
>>
>> Also nit: Stray double blands (twice) ahead of "pre". Plus with this
>> style
>> of line continuation you want to consistently have exactly one blank
>> ahead
>> of each backslash.
>>
>>> +    default: \
>>> +        STATIC_ASSERT_UNREACHABLE(); \
>>> +    } \
>>> +    ret__; \
>>> +})
>>> +
>>> +#define xchg_relaxed(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>
>> What is the purpose of this, when __xchg_generic() already does this
>> same
>> type conversion?
>>
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),
>>> "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "",
>>> RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
>>> +                                       "", RISCV_RELEASE_BARRIER,
>>> ""); \
>>> +})
>>
>> As asked before: Are there going to be any uses of these three?
>> Common
>> code doesn't require them. And not needing to provide them would
>> simplify things quite a bit, it seems.
> I checked my private branches and it looks to me that I introduced them
> only for the correspondent atomic operations ( which was copied from
> Linux Kernel ) which are not also used.
> 
> So we could definitely drop these macros for now, but should
> xchg_generic() be updated as well? If to look at:
>  #define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*
> (ptr)), \
>                                     ".aqrl", "", "")
> Last two arguments start to be unneeded, but I've wanted to leave them,
> in case someone will needed to back xchg_{release, acquire, ...}. Does
> it make any sense?

It all depends on how it's justified in the description.

>>> +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x),
>>> sizeof(*(ptr)), \
>>> +                                    ".aqrl", "", "")
>>
>> According to the earlier comment (where I don't follow the example
>> given),
>> is .aqrl sufficient here? And even if it was for the 4- and 8-byte
>> cases,
>> is it sufficient in the 1- and 2-byte emulation case (where it then
>> is
>> appended to just the SC)?
> If I understand your question correctly then, according to the spec,
> .aqrl is enough for amo<op>.{w|d} instructions:
>    Linux Construct        RVWMO AMO Mapping
>    atomic <op> relaxed    amo<op>.{w|d}
>    atomic <op> acquire    amo<op>.{w|d}.aq
>    atomic <op> release    amo<op>.{w|d}.rl
>    atomic <op>            amo<op>.{w|d}.aqrl
> but in the case of lr/sc you are right, sc requires a suffix too:
>    Linux Construct        RVWMO LR/SC Mapping
>    atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>    atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>    atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR
>                           fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
>    atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
> 
> I will add sc_sfx to emulate_xchg_1_2(). The only remaining question is
> whether __xchg_generic(ptr, new, size, sfx, pre, post) should be changed to
> __xchg_generic(ptr, new, size, sfx1, sfx2, pre, post) to cover both
> cases, amo<op>.{w|d}.sfx1 and lr.{w|d}.sfx1 ... sc.{w|d}.sfx2?

I expect that's going to be necessary. In the end you'll see what's needed
when making the code adjustment.
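E.g. something along these lines (merely a sketch derived from the current
__xchg_generic(); the parameter names are picked arbitrarily here):

    #define __xchg_generic(ptr, new, size, amo_sfx, sc_sfx, pre, post) \
    ({ \
        __typeof__(*(ptr)) new__ = (new); \
        __typeof__(*(ptr)) ret__; \
        switch ( size ) \
        { \
        case 1: \
        case 2: \
            ret__ = emulate_xchg_1_2(ptr, new__, sc_sfx, pre, post); \
            break; \
        case 4: \
            __amoswap_generic(ptr, new__, ret__, ".w" amo_sfx, pre, post); \
            break; \
        case 8: \
            __amoswap_generic(ptr, new__, ret__, ".d" amo_sfx, pre, post); \
            break; \
        default: \
            STATIC_ASSERT_UNREACHABLE(); \
        } \
        ret__; \
    })

with xchg() then e.g. passing ".aqrl" in both slots for the fully ordered
variant.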

Jan
Oleksii Kurochko March 7, 2024, 11:01 a.m. UTC | #4
On Thu, 2024-03-07 at 11:46 +0100, Jan Beulich wrote:
> On 07.03.2024 11:35, Oleksii wrote:
> > On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
> > > On 26.02.2024 18:38, Oleksii Kurochko wrote:
> > > > The header was taken from Linux kernl 6.4.0-rc1.
> > > > 
> > > > Addionally, were updated:
> > > > * add emulation of {cmp}xchg for 1/2 byte types using 32-bit
> > > > atomic
> > > >   access.
> > > > * replace tabs with spaces
> > > > * replace __* variale with *__
> > > > * introduce generic version of xchg_* and cmpxchg_*.
> > > > 
> > > > Implementation of 4- and 8-byte cases were left as it is done
> > > > in
> > > > Linux kernel as according to the RISC-V spec:
> > > > ```
> > > > Table A.5 ( only part of the table was copied here )
> > > > 
> > > > Linux Construct       RVWMO Mapping
> > > > atomic <op> relaxed    amo<op>.{w|d}
> > > > atomic <op> acquire    amo<op>.{w|d}.aq
> > > > atomic <op> release    amo<op>.{w|d}.rl
> > > > atomic <op>            amo<op>.{w|d}.aqrl
> > > > 
> > > > Linux Construct       RVWMO LR/SC Mapping
> > > > atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez
> > > > loop
> > > > atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
> > > > loop
> > > > atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ;
> > > > bnez
> > > > loop OR
> > > >                        fence.tso; loop: lr.{w|d}; <op>;
> > > > sc.{w|d}∗ ;
> > > > bnez loop
> > > > atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> > > > bnez
> > > > loop
> > > > 
> > > > The Linux mappings for release operations may seem stronger
> > > > than
> > > > necessary,
> > > > but these mappings are needed to cover some cases in which
> > > > Linux
> > > > requires
> > > > stronger orderings than the more intuitive mappings would
> > > > provide.
> > > > In particular, as of the time this text is being written, Linux
> > > > is
> > > > actively
> > > > debating whether to require load-load, load-store, and store-
> > > > store
> > > > orderings
> > > > between accesses in one critical section and accesses in a
> > > > subsequent critical
> > > > section in the same hart and protected by the same
> > > > synchronization
> > > > object.
> > > > Not all combinations of FENCE RW,W/FENCE R,RW mappings with
> > > > aq/rl
> > > > mappings
> > > > combine to provide such orderings.
> > > > There are a few ways around this problem, including:
> > > > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
> > > > suffices
> > > >    but is undesirable, as it defeats the purpose of the aq/rl
> > > > modifiers.
> > > > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This
> > > > does
> > > > not
> > > >    currently work due to the lack of load and store opcodes
> > > > with aq
> > > > and rl
> > > >    modifiers.
> > > 
> > > As before I don't understand this point. Can you give an example of
> > > what sort of opcode / instruction is missing?
> > If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
> > instructions don't have aq or rl annotation.
> 
> How would load insns other than LR and store insns other than SC come
> into play here?

This part of the spec is not only about LR and SC, which cover the Load-
Exclusive and Store-Exclusive cases, but also about the non-exclusive cases,
for which plain l{b|h|w|d} and s{b|h|w|d} (with no .aq/.rl forms) are used.
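I.e. there simply is no encoding along the lines of

    lw.aq t0, 0(a0)
    sw.rl t1, 0(a0)

(instructions shown only for illustration); only LR/SC and the AMOs carry
the .aq/.rl bits.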

~ Oleksii
Jan Beulich March 7, 2024, 11:11 a.m. UTC | #5
On 07.03.2024 12:01, Oleksii wrote:
> On Thu, 2024-03-07 at 11:46 +0100, Jan Beulich wrote:
>> On 07.03.2024 11:35, Oleksii wrote:
>>> On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
>>>> On 26.02.2024 18:38, Oleksii Kurochko wrote:
>>>>> The header was taken from Linux kernl 6.4.0-rc1.
>>>>>
>>>>> Addionally, were updated:
>>>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit
>>>>> atomic
>>>>>   access.
>>>>> * replace tabs with spaces
>>>>> * replace __* variale with *__
>>>>> * introduce generic version of xchg_* and cmpxchg_*.
>>>>>
>>>>> Implementation of 4- and 8-byte cases were left as it is done
>>>>> in
>>>>> Linux kernel as according to the RISC-V spec:
>>>>> ```
>>>>> Table A.5 ( only part of the table was copied here )
>>>>>
>>>>> Linux Construct       RVWMO Mapping
>>>>> atomic <op> relaxed    amo<op>.{w|d}
>>>>> atomic <op> acquire    amo<op>.{w|d}.aq
>>>>> atomic <op> release    amo<op>.{w|d}.rl
>>>>> atomic <op>            amo<op>.{w|d}.aqrl
>>>>>
>>>>> Linux Construct       RVWMO LR/SC Mapping
>>>>> atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez
>>>>> loop
>>>>> atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
>>>>> loop
>>>>> atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ;
>>>>> bnez
>>>>> loop OR
>>>>>                        fence.tso; loop: lr.{w|d}; <op>;
>>>>> sc.{w|d}∗ ;
>>>>> bnez loop
>>>>> atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
>>>>> bnez
>>>>> loop

Note the load and store forms mentioned here. How would ...

>>>>> The Linux mappings for release operations may seem stronger
>>>>> than
>>>>> necessary,
>>>>> but these mappings are needed to cover some cases in which
>>>>> Linux
>>>>> requires
>>>>> stronger orderings than the more intuitive mappings would
>>>>> provide.
>>>>> In particular, as of the time this text is being written, Linux
>>>>> is
>>>>> actively
>>>>> debating whether to require load-load, load-store, and store-
>>>>> store
>>>>> orderings
>>>>> between accesses in one critical section and accesses in a
>>>>> subsequent critical
>>>>> section in the same hart and protected by the same
>>>>> synchronization
>>>>> object.
>>>>> Not all combinations of FENCE RW,W/FENCE R,RW mappings with
>>>>> aq/rl
>>>>> mappings
>>>>> combine to provide such orderings.
>>>>> There are a few ways around this problem, including:
>>>>> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
>>>>> suffices
>>>>>    but is undesirable, as it defeats the purpose of the aq/rl
>>>>> modifiers.
>>>>> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This
>>>>> does
>>>>> not
>>>>>    currently work due to the lack of load and store opcodes
>>>>> with aq
>>>>> and rl
>>>>>    modifiers.
>>>>
>>>> As before I don't understand this point. Can you give an example of
>>>> what sort of opcode / instruction is missing?
>>> If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
>>> instructions don't have aq or rl annotation.
>>
>> How would load insns other than LR and store insns other than SC come
>> into play here?
> 
> This part of the spec is not only about LR and SC, which cover the Load-
> Exclusive and Store-Exclusive cases, but also about the non-exclusive cases,
> for which plain l{b|h|w|d} and s{b|h|w|d} (with no .aq/.rl forms) are used.

... the spec (obviously covering other forms, too) be relevant when
reasoning whether just suffixes or actual barrier insns need using?

Jan
Oleksii Kurochko March 7, 2024, 12:28 p.m. UTC | #6
On Thu, 2024-03-07 at 12:11 +0100, Jan Beulich wrote:
> On 07.03.2024 12:01, Oleksii wrote:
> > On Thu, 2024-03-07 at 11:46 +0100, Jan Beulich wrote:
> > > On 07.03.2024 11:35, Oleksii wrote:
> > > > On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
> > > > > On 26.02.2024 18:38, Oleksii Kurochko wrote:
> > > > > > The header was taken from Linux kernl 6.4.0-rc1.
> > > > > > 
> > > > > > Addionally, were updated:
> > > > > > * add emulation of {cmp}xchg for 1/2 byte types using 32-
> > > > > > bit
> > > > > > atomic
> > > > > >   access.
> > > > > > * replace tabs with spaces
> > > > > > * replace __* variale with *__
> > > > > > * introduce generic version of xchg_* and cmpxchg_*.
> > > > > > 
> > > > > > Implementation of 4- and 8-byte cases were left as it is
> > > > > > done
> > > > > > in
> > > > > > Linux kernel as according to the RISC-V spec:
> > > > > > ```
> > > > > > Table A.5 ( only part of the table was copied here )
> > > > > > 
> > > > > > Linux Construct       RVWMO Mapping
> > > > > > atomic <op> relaxed    amo<op>.{w|d}
> > > > > > atomic <op> acquire    amo<op>.{w|d}.aq
> > > > > > atomic <op> release    amo<op>.{w|d}.rl
> > > > > > atomic <op>            amo<op>.{w|d}.aqrl
> > > > > > 
> > > > > > Linux Construct       RVWMO LR/SC Mapping
> > > > > > atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez
> > > > > > loop
> > > > > > atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d};
> > > > > > bnez
> > > > > > loop
> > > > > > atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗
> > > > > > ;
> > > > > > bnez
> > > > > > loop OR
> > > > > >                        fence.tso; loop: lr.{w|d}; <op>;
> > > > > > sc.{w|d}∗ ;
> > > > > > bnez loop
> > > > > > atomic <op>            loop: lr.{w|d}.aq; <op>;
> > > > > > sc.{w|d}.aqrl;
> > > > > > bnez
> > > > > > loop
> 
> Note the load and store forms mentioned here. How would ...
> 
> > > > > > The Linux mappings for release operations may seem stronger
> > > > > > than
> > > > > > necessary,
> > > > > > but these mappings are needed to cover some cases in which
> > > > > > Linux
> > > > > > requires
> > > > > > stronger orderings than the more intuitive mappings would
> > > > > > provide.
> > > > > > In particular, as of the time this text is being written,
> > > > > > Linux
> > > > > > is
> > > > > > actively
> > > > > > debating whether to require load-load, load-store, and
> > > > > > store-
> > > > > > store
> > > > > > orderings
> > > > > > between accesses in one critical section and accesses in a
> > > > > > subsequent critical
> > > > > > section in the same hart and protected by the same
> > > > > > synchronization
> > > > > > object.
> > > > > > Not all combinations of FENCE RW,W/FENCE R,RW mappings with
> > > > > > aq/rl
> > > > > > mappings
> > > > > > combine to provide such orderings.
> > > > > > There are a few ways around this problem, including:
> > > > > > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl.
> > > > > > This
> > > > > > suffices
> > > > > >    but is undesirable, as it defeats the purpose of the
> > > > > > aq/rl
> > > > > > modifiers.
> > > > > > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW.
> > > > > > This
> > > > > > does
> > > > > > not
> > > > > >    currently work due to the lack of load and store opcodes
> > > > > > with aq
> > > > > > and rl
> > > > > >    modifiers.
> > > > > 
> > > > > As before I don't understand this point. Can you give an example
> > > > > of what sort of opcode / instruction is missing?
> > > > If I understand the spec correctly then l{b|h|w|d} and
> > > > s{b|h|w|d}
> > > > instructions don't have aq or rl annotation.
> > > 
> > > How would load insns other than LR and store insns other than SC come
> > > into play here?
> > 
> > This part of the spec is not only about LR and SC, which cover the Load-
> > Exclusive and Store-Exclusive cases, but also about the non-exclusive
> > cases, for which plain l{b|h|w|d} and s{b|h|w|d} (with no .aq/.rl forms)
> > are used.
> 
> ... the spec (obviously covering other forms, too) be relevant when
> reasoning whether just suffixes or actual barrier insns need using?
Based on the three options given in the commit message and in the spec,
there is no strict rule on which one should be used (at least, I wasn't
able to find an explanation in that paragraph), but based on the tables
provided in the same section (and partially in the commit message), if an
instruction has an .aq or .rl annotation, that annotation should be used.

And speaking about the xchg and cmpxchg cases and their implementations,
all the instructions involved have .aq/.rl suffixes, so we'd rather prefer
suffixes instead of barriers.
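For the 4-byte xchg, for instance, that means a single

    amoswap.w.aqrl t0, a1, (a0)

(registers picked arbitrarily) rather than a plain amoswap.w wrapped in
explicit fence instructions, as option 1 above would require.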

Does it make sense?

~ Oleksii
diff mbox series

Patch

diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h
new file mode 100644
index 0000000000..66cbe26737
--- /dev/null
+++ b/xen/arch/riscv/include/asm/cmpxchg.h
@@ -0,0 +1,258 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2014 Regents of the University of California */
+
+#ifndef _ASM_RISCV_CMPXCHG_H
+#define _ASM_RISCV_CMPXCHG_H
+
+#include <xen/compiler.h>
+#include <xen/lib.h>
+
+#include <asm/fence.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+#define __amoswap_generic(ptr, new, ret, sfx, pre, post) \
+({ \
+    asm volatile( \
+        pre \
+        " amoswap" sfx " %0, %2, %1\n" \
+        post \
+        : "=r" (ret), "+A" (*ptr) \
+        : "r" (new) \
+        : "memory" ); \
+})
+
+/*
+ * For LR and SC, the A extension requires that the address held in rs1 be
+ * naturally aligned to the size of the operand (i.e., eight-byte aligned
+ * for 64-bit words and four-byte aligned for 32-bit words).
+ * If the address is not naturally aligned, an address-misaligned exception
+ * or an access-fault exception will be generated.
+ *
+ * Thereby:
+ * - for a 1-byte xchg, access the containing word by clearing the low two bits;
+ * - for a 2-byte xchg, access the containing word by clearing bit 1.
+ *
+ * If the resulting 4-byte access is still misaligned, it will fault just as a
+ * non-emulated 4-byte access would.
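+ *
+ * For example (illustration only): a 1-byte xchg of a byte whose address has
+ * the low two bits set operates on the aligned word 3 bytes below, with
+ * new_val_pos = 24 and mask = 0xff000000, i.e. on the topmost byte lane of
+ * that (little-endian) word.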
+ */
+#define emulate_xchg_1_2(ptr, new, sc_sfx, pre, post) \
+({ \
+    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*ptr))); \
+    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \
+    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
+    unsigned int new_ = new << new_val_pos; \
+    unsigned int old_val; \
+    unsigned int xchged_val; \
+    \
+    asm volatile ( \
+        pre \
+        "0: lr.w %[op_oldval], %[op_aligned_ptr]\n" \
+        "   and  %[op_xchged_val], %[op_oldval], %z[op_nmask]\n" \
+        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n" \
+        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val], %[op_aligned_ptr]\n" \
+        "   bnez %[op_xchged_val], 0b\n" \
+        post \
+        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r" (xchged_val), [op_aligned_ptr]"+A" (*aligned_ptr) \
+        : [op_new] "rJ" (new_), [op_nmask] "rJ" (~mask) \
+        : "memory" ); \
+    \
+    (__typeof__(*(ptr)))((old_val & mask) >> new_val_pos); \
+})
+
+#define __xchg_generic(ptr, new, size, sfx, pre, post) \
+({ \
+    __typeof__(*(ptr)) new__ = (new); \
+    __typeof__(*(ptr)) ret__; \
+    switch ( size ) \
+    { \
+    case 1: \
+    case 2: \
+        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
+        break; \
+    case 4: \
+        __amoswap_generic(ptr, new__, ret__,\
+                          ".w" sfx,  pre, post); \
+        break; \
+    case 8: \
+        __amoswap_generic(ptr, new__, ret__,\
+                          ".d" sfx,  pre, post); \
+        break; \
+    default: \
+        STATIC_ASSERT_UNREACHABLE(); \
+    } \
+    ret__; \
+})
+
+#define xchg_relaxed(ptr, x) \
+({ \
+    __typeof__(*(ptr)) x_ = (x); \
+    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \
+})
+
+#define xchg_acquire(ptr, x) \
+({ \
+    __typeof__(*(ptr)) x_ = (x); \
+    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
+                                       "", "", RISCV_ACQUIRE_BARRIER); \
+})
+
+#define xchg_release(ptr, x) \
+({ \
+    __typeof__(*(ptr)) x_ = (x); \
+    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
+                                       "", RISCV_RELEASE_BARRIER, ""); \
+})
+
+#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
+                                    ".aqrl", "", "")
+
+#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, pre, post)	\
+ ({ \
+    register unsigned int rc; \
+    asm volatile( \
+        pre \
+        "0: lr" lr_sfx " %0, %2\n" \
+        "   bne  %0, %z3, 1f\n" \
+        "   sc" sc_sfx " %1, %z4, %2\n" \
+        "   bnez %1, 0b\n" \
+        post \
+        "1:\n" \
+        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
+        : "rJ" (old), "rJ" (new) \
+        : "memory"); \
+ })
+
+/*
+ * For LR and SC, the A extension requires that the address held in rs1 be
+ * naturally aligned to the size of the operand (i.e., eight-byte aligned
+ * for 64-bit words and four-byte aligned for 32-bit words).
+ * If the address is not naturally aligned, an address-misaligned exception
+ * or an access-fault exception will be generated.
+ *
+ * Thereby:
+ * - for a 1-byte cmpxchg, access the containing word by clearing the low two bits;
+ * - for a 2-byte cmpxchg, access the containing word by clearing bit 1.
+ *
+ * If the resulting 4-byte access is still misaligned, it will fault just as a
+ * non-emulated 4-byte access would.
+ *
+ * old_val is cast to unsigned long at the end of the define because of
+ * the following issue:
+ * ./arch/riscv/include/asm/cmpxchg.h:166:5: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
+ * 166 |     (__typeof__(*(ptr)))(old_val >> new_val_pos); \
+ *     |     ^
+ * ./arch/riscv/include/asm/cmpxchg.h:184:17: note: in expansion of macro 'emulate_cmpxchg_1_2'
+ * 184 |         ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
+ *     |                 ^~~~~~~~~~~~~~~~~~~
+ * ./arch/riscv/include/asm/cmpxchg.h:227:5: note: in expansion of macro '__cmpxchg_generic'
+ * 227 |     __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
+ *     |     ^~~~~~~~~~~~~~~~~
+ * ./include/xen/lib.h:141:26: note: in expansion of macro '__cmpxchg'
+ * 141 |     ((__typeof__(*(ptr)))__cmpxchg(ptr, (unsigned long)o_,              \
+ *     |                          ^~~~~~~~~
+ * common/event_channel.c:109:13: note: in expansion of macro 'cmpxchgptr'
+ * 109 |             cmpxchgptr(&xen_consumers[i], NULL, fn);
+ */
+#define emulate_cmpxchg_1_2(ptr, old, new, sc_sfx, pre, post) \
+({ \
+    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*ptr))); \
+    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \
+    unsigned long mask = GENMASK(((sizeof(*ptr)) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
+    unsigned int old_ = old << new_val_pos; \
+    unsigned int new_ = new << new_val_pos; \
+    unsigned int old_val; \
+    unsigned int xchged_val; \
+    \
+    __asm__ __volatile__ ( \
+        pre \
+        "0: lr.w %[op_xchged_val], %[op_aligned_ptr]\n" \
+        "   and  %[op_oldval], %[op_xchged_val], %z[op_mask]\n" \
+        "   bne  %[op_oldval], %z[op_old], 1f\n" \
+        "   xor  %[op_xchged_val], %[op_oldval], %[op_xchged_val]\n" \
+        "   or   %[op_xchged_val], %[op_xchged_val], %z[op_new]\n" \
+        "   sc.w" sc_sfx " %[op_xchged_val], %[op_xchged_val], %[op_aligned_ptr]\n" \
+        "   bnez %[op_xchged_val], 0b\n" \
+        post \
+        "1:\n" \
+        : [op_oldval] "=&r" (old_val), [op_xchged_val] "=&r" (xchged_val), [op_aligned_ptr] "+A" (*aligned_ptr) \
+        : [op_old] "rJ" (old_), [op_new] "rJ" (new_), \
+          [op_mask] "rJ" (mask) \
+        : "memory" ); \
+    \
+    (__typeof__(*(ptr)))((unsigned long)old_val >> new_val_pos); \
+})
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, pre, post) \
+({ \
+    __typeof__(ptr) ptr__ = (ptr); \
+    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
+    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
+    __typeof__(*(ptr)) ret__; \
+    switch ( size ) \
+    { \
+    case 1: \
+    case 2: \
+        ret__ = emulate_cmpxchg_1_2(ptr, old, new, \
+                            sc_sfx, pre, post); \
+        break; \
+    case 4: \
+        __generic_cmpxchg(ptr__, old__, new__, ret__, \
+                          ".w", ".w"sc_sfx, pre, post); \
+        break; \
+    case 8: \
+        __generic_cmpxchg(ptr__, old__, new__, ret__, \
+                          ".d", ".d"sc_sfx, pre, post); \
+        break; \
+    default: \
+        STATIC_ASSERT_UNREACHABLE(); \
+    } \
+    ret__; \
+})
+
+#define cmpxchg_relaxed(ptr, o, n) \
+({ \
+    __typeof__(*(ptr)) o_ = (o); \
+    __typeof__(*(ptr)) n_ = (n); \
+    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \
+                    o_, n_, sizeof(*(ptr)), "", "", ""); \
+})
+
+#define cmpxchg_acquire(ptr, o, n) \
+({ \
+    __typeof__(*(ptr)) o_ = (o); \
+    __typeof__(*(ptr)) n_ = (n); \
+    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \
+                                          "", "", RISCV_ACQUIRE_BARRIER); \
+})
+
+#define cmpxchg_release(ptr, o, n) \
+({ \
+    __typeof__(*(ptr)) o_ = (o); \
+    __typeof__(*(ptr)) n_ = (n); \
+    (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \
+                                          "", RISCV_RELEASE_BARRIER, ""); \
+})
+
+#define __cmpxchg(ptr, o, n, s) \
+    (__typeof__(*(ptr))) \
+    __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
+                      s, ".rl", "", RISCV_FULL_BARRIER)
+
+#define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)))
+
+#endif /* _ASM_RISCV_CMPXCHG_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */