diff mbox series

[RFC,v4,2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.

Message ID 20241029194348.59574-3-paolo.savini@embecosm.com (mailing list archive)
State New
Series target/riscv: add wrapper for target specific macros in atomicity check.

Commit Message

Paolo Savini Oct. 29, 2024, 7:43 p.m. UTC
This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 16 bytes or more.
The optimization consists of calling __builtin_memcpy on 16-byte chunks of data
between the memory address of the simulated vector register and the destination
memory address, and vice versa.
This is done only if we have direct access to the RAM of the host machine,
the host is little-endian, and it supports atomic 128-bit memory
operations.

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
 target/riscv/vector_helper.c    | 17 ++++++++++++++++-
 target/riscv/vector_internals.h | 12 ++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)

Comments

Richard Henderson Oct. 30, 2024, 11:40 a.m. UTC | #1
On 10/29/24 19:43, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 16 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 16
> bytes between the memory address of the simulated vector register and the
> destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine,
> if the host is little endiand and if it supports atomic 128 bit memory
> operations.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c    | 17 ++++++++++++++++-
>   target/riscv/vector_internals.h | 12 ++++++++++++
>   2 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 75c24653f0..e1c100e907 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
> +     * We need to make sure that these instructions have guarantees of atomicity.
> +     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
> +     * memory operations if the memory operands are 16-byte aligned */
> +    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
> +		    ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
> +      group_size = MO_128;
> +      if (is_load) {
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      } else {
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +      }

I said this last time and I'll say it again:

     __builtin_memcpy DOES NOT equal VMOVDQA

Your comment there about 'if supported' does not really apply.

(1) You'd need a compile-time test not the runtime test that is HOST_128_ATOMIC_MEM_OP to 
ensure that the compiler knows that AVX vector support is present.

(2) Even then, you're not giving the compiler any reason to use VMOVDQA over VMOVDQU or 
ANY OTHER vector load/store.  So you're not really doing what you say you're doing.


Frankly, I think this entire patch set is premature.
We need to get Max Chou's patch set landed first.


r~
Paolo Savini Oct. 30, 2024, 3:25 p.m. UTC | #2
Thanks for the review Richard.

On 10/30/24 11:40, Richard Henderson wrote:
> On 10/29/24 19:43, Paolo Savini wrote:
>> This patch optimizes the emulation of unit-stride load/store RVV instructions
>> when the data being loaded/stored per iteration amounts to 16 bytes or more.
>> The optimization consists of calling __builtin_memcpy on chunks of data of 16
>> bytes between the memory address of the simulated vector register and the
>> destination memory address and vice versa.
>> This is done only if we have direct access to the RAM of the host machine,
>> if the host is little endian and if it supports atomic 128 bit memory
>> operations.
>>
>> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
>> ---
>>   target/riscv/vector_helper.c    | 17 ++++++++++++++++-
>>   target/riscv/vector_internals.h | 12 ++++++++++++
>>   2 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index 75c24653f0..e1c100e907 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>>       }
>>
>>       fn = fns[is_load][group_size];
>> -    fn(vd, byte_offset, host + byte_offset);
>> +
>> +    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
>> +     * We need to make sure that these instructions have guarantees of atomicity.
>> +     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
>> +     * memory operations if the memory operands are 16-byte aligned */
>> +    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
>> +            ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
>> +      group_size = MO_128;
>> +      if (is_load) {
>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
>> +      } else {
>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
>> +      }
>
> I said this last time and I'll say it again:
>
>     __builtin_memcpy DOES NOT equal VMOVDQA
I am aware of this. I took __builtin_memcpy as a generic enough way to 
emulate loads and stores that should allow several hosts to generate the 
widest load/store instructions they can, and on x86 I see it generate 
vmovdqu/movdqu instructions, which are not always guaranteed to be atomic. 
x86 does, however, guarantee them to be atomic if the memory address is 
aligned to 16 bytes; hence the check on the alignment. My x86 knowledge is 
admittedly limited, though, so the check on alignment might be redundant?
>
> Your comment there about 'if supported' does not really apply.

This refers to the fact that, in order to decide whether to use 
__builtin_memcpy (which could generate 16-byte memory operations), we use 
cpuinfo to test for applicable host features (so, for x86, whether the 
host supports AVX or atomic vmovdqa/vmovdqu). The intent here is to use 
__builtin_memcpy only when we know that 16-byte memory operations, if 
generated, are guaranteed to be atomic.

Before I submit a new version of the patch, does this comment describe 
the situation more clearly?

    /* __builtin_memcpy could generate host 16-byte memory operations that
     * would accelerate the emulation of RVV loads and stores of more than
     * 16 bytes. We need to make sure that these instructions have
     * guarantees of atomicity. E.g. x86 processors provide strong
     * guarantees of atomicity for 16-byte memory operations if the
     * memory operands are 16-byte aligned */

Worst case, the memcpy will generate normal 8-byte memory operations.

>
> (1) You'd need a compile-time test not the runtime test that is 
> HOST_128_ATOMIC_MEM_OP to ensure that the compiler knows that AVX 
> vector support is present.
(Lack of QEMU knowledge here.) I assumed it was safe enough to use QEMU's 
cpuinfo as it is initialized for x86 in 
util/cpuinfo-i386.c:cpuinfo_init(), and by building this for AArch64 I 
see that these conditions work as intended (by not using memcpy, because 
we haven't added atomicity checks for AArch64 yet).
>
> (2) Even then, you're not giving the compiler any reason to use 
> VMOVDQA over VMOVDQU or ANY OTHER vector load/store.  So you're not 
> really doing what you say you're doing.

I hope that the new version of the comment addresses this.

If I misinterpreted the point of your feedback and you are saying that 
it would be better to guarantee specific host atomic vector 
loads/stores rather than using memcpy, there's the option of using 
compiler builtins to generate specific host instructions and so 
enhance performance for some hosts, but that didn't seem like a neat 
solution. Any thoughts?

>
>
> Frankly, I think this entire patch set is premature.
> We need to get Max Chou's patch set landed first.

Understood. I'm trying anyway to get to the bottom of this as it can 
prove useful for other optimizations.

Thanks again for the review work.

Paolo
Richard Henderson Nov. 4, 2024, 12:48 p.m. UTC | #3
On 10/30/24 15:25, Paolo Savini wrote:
> Thanks for the review Richard.
> 
> On 10/30/24 11:40, Richard Henderson wrote:
>> On 10/29/24 19:43, Paolo Savini wrote:
>>> This patch optimizes the emulation of unit-stride load/store RVV instructions
>>> when the data being loaded/stored per iteration amounts to 16 bytes or more.
>>> The optimization consists of calling __builtin_memcpy on chunks of data of 16
>>> bytes between the memory address of the simulated vector register and the
>>> destination memory address and vice versa.
>>> This is done only if we have direct access to the RAM of the host machine,
>>> if the host is little endiand and if it supports atomic 128 bit memory
>>> operations.
>>>
>>> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
>>> ---
>>>   target/riscv/vector_helper.c    | 17 ++++++++++++++++-
>>>   target/riscv/vector_internals.h | 12 ++++++++++++
>>>   2 files changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>>> index 75c24653f0..e1c100e907 100644
>>> --- a/target/riscv/vector_helper.c
>>> +++ b/target/riscv/vector_helper.c
>>> @@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>>>       }
>>>         fn = fns[is_load][group_size];
>>> -    fn(vd, byte_offset, host + byte_offset);
>>> +
>>> +    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
>>> +     * We need to make sure that these instructions have guarantees of atomicity.
>>> +     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
>>> +     * memory operations if the memory operands are 16-byte aligned */
>>> +    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
>>> +            ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
>>> +      group_size = MO_128;
>>> +      if (is_load) {
>>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
>>> +      } else {
>>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
>>> +      }
>>
>> I said this last time and I'll say it again:
>>
>>     __builtin_memcpy DOES NOT equal VMOVDQA
> I am aware of this. I took __builtin_memcpy as a generic enough way to emulate loads and 
> stores that should allow several hosts to generate the widest load/store instructions they 
> can and on x86 I see this generates instructions vmovdqu/movdqu that are not always 
> guaranteed to be atomic. x86 though guarantees them to be atomic if the memory address is 
> aligned to 16 bytes.

No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
See the comment in util/cpuinfo-i386.c, and the two CPUINFO_ATOMIC_VMOVDQ[AU] bits.

See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO, and atomic16_read_ro.
Not that I think you should use that here; it's complicated, and I think you're better off 
relying on the code in accel/tcg/ when more than byte atomicity is required.


r~
Daniel Henrique Barboza Nov. 7, 2024, 12:58 p.m. UTC | #4
On 11/4/24 9:48 AM, Richard Henderson wrote:
> On 10/30/24 15:25, Paolo Savini wrote:
>> Thanks for the review Richard.
>>
>> On 10/30/24 11:40, Richard Henderson wrote:
>>> On 10/29/24 19:43, Paolo Savini wrote:
>>>> This patch optimizes the emulation of unit-stride load/store RVV instructions
>>>> when the data being loaded/stored per iteration amounts to 16 bytes or more.
>>>> The optimization consists of calling __builtin_memcpy on chunks of data of 16
>>>> bytes between the memory address of the simulated vector register and the
>>>> destination memory address and vice versa.
>>>> This is done only if we have direct access to the RAM of the host machine,
>>>> if the host is little endiand and if it supports atomic 128 bit memory
>>>> operations.
>>>>
>>>> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
>>>> ---
>>>>   target/riscv/vector_helper.c    | 17 ++++++++++++++++-
>>>>   target/riscv/vector_internals.h | 12 ++++++++++++
>>>>   2 files changed, 28 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>>>> index 75c24653f0..e1c100e907 100644
>>>> --- a/target/riscv/vector_helper.c
>>>> +++ b/target/riscv/vector_helper.c
>>>> @@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>>>>       }
>>>>         fn = fns[is_load][group_size];
>>>> -    fn(vd, byte_offset, host + byte_offset);
>>>> +
>>>> +    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
>>>> +     * We need to make sure that these instructions have guarantees of atomicity.
>>>> +     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
>>>> +     * memory operations if the memory operands are 16-byte aligned */
>>>> +    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
>>>> +            ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
>>>> +      group_size = MO_128;
>>>> +      if (is_load) {
>>>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
>>>> +      } else {
>>>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
>>>> +      }
>>>
>>> I said this last time and I'll say it again:
>>>
>>>     __builtin_memcpy DOES NOT equal VMOVDQA
>> I am aware of this. I took __builtin_memcpy as a generic enough way to emulate loads and stores that should allow several hosts to generate the widest load/store instructions they can and on x86 I see this generates instructions vmovdqu/movdqu that are not always guaranteed to be atomic. x86 though guarantees them to be atomic if the memory address is aligned to 16 bytes.
> 
> No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
> See the comment in util/cpuinfo-i386.c, and the two CPUINFO_ATOMIC_VMOVDQ[AU] bits.
> 
> See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO, and atomic16_read_ro.
> Not that I think you should use that here; it's complicated, and I think you're better off relying on the code in accel/tcg/ when more than byte atomicity is required.
>

Not sure if that's what you meant but I didn't find any clear example of
multi-byte atomicity using qatomic_read() and friends that would be closer
to what memcpy() is doing here. I found one example in bdrv_graph_co_rdlock()
that seems to use a mem barrier via smp_mb() and qatomic_read() inside a
loop, but I don't understand that code enough to say.

I'm also wondering if a common pthread_lock() wrapping up these memcpy() calls
would suffice in this case. Even if we can't guarantee that __builtin_memcpy()
will use arch specific vector insns in the host it would already be a faster
path than falling back to fn(...).

In a quick detour, I'm not sure if we really considered how ARM SVE implements these
helpers. E.g. gen_sve_str():

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/translate-sve.c#L4182

I'm curious to see whether this form of front-end optimization (using TCG ops
instead of a for() loop with ldst_elem() like we're doing today) would yield more
performance with fewer complications from backend specifics, atomicity and whatnot.
In fact I have a feeling this is not the first time we've talked about borrowing
ideas from SVE.


Thanks,

Daniel



> 
> r~
Richard Henderson Nov. 8, 2024, 9:11 a.m. UTC | #5
On 11/7/24 12:58, Daniel Henrique Barboza wrote:
> On 11/4/24 9:48 AM, Richard Henderson wrote:
>> On 10/30/24 15:25, Paolo Savini wrote:
>>> On 10/30/24 11:40, Richard Henderson wrote:
>>>>     __builtin_memcpy DOES NOT equal VMOVDQA
>>> I am aware of this. I took __builtin_memcpy as a generic enough way to emulate loads 
>>> and stores that should allow several hosts to generate the widest load/store 
>>> instructions they can and on x86 I see this generates instructions vmovdqu/movdqu that 
>>> are not always guaranteed to be atomic. x86 though guarantees them to be atomic if the 
>>> memory address is aligned to 16 bytes.
>>
>> No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
>> See the comment in util/cpuinfo-i386.c, and the two CPUINFO_ATOMIC_VMOVDQ[AU] bits.
>>
>> See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO, and atomic16_read_ro.
>> Not that I think you should use that here; it's complicated, and I think you're better 
>> off relying on the code in accel/tcg/ when more than byte atomicity is required.
>>
> 
> Not sure if that's what you meant but I didn't find any clear example of
> multi-byte atomicity using qatomic_read() and friends that would be closer
> to what memcpy() is doing here. I found one example in bdrv_graph_co_rdlock()
> that seems to use a mem barrier via smp_mb() and qatomic_read() inside a
> loop, but I don't understand that code enough to say.

Memory barriers provide ordering between loads and stores, but they cannot be used to 
address atomicity of individual loads and stores.


> I'm also wondering if a common pthread_lock() wrapping up these memcpy() calls
> would suffice in this case. Even if we can't guarantee that __builtin_memcpy()
> will use arch specific vector insns in the host it would already be a faster
> path than falling back to fn(...).

Locks would certainly not be faster than calling the accel/tcg function.


> In a quick detour, I'm not sure if we really considered how ARM SVE implements these
> helpers. E.g gen_sve_str():
> 
> https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/translate-sve.c#L4182

Note that ARM SVE defines these instructions to have byte atomicity.


r~
Paolo Savini Nov. 11, 2024, 4:04 p.m. UTC | #6
Hi Richard, Daniel,

This might be a silly question, but why do we need to ensure atomicity 
when emulating these guest instructions? I might be wrong but I didn't 
see an explicit requirement for the vector instructions to be atomic in 
the documentation of the RISC-V V extension.

Anyway, the patches from Max have landed, and since one of them already 
uses memcpy() where this patch does and achieves a similar performance 
improvement, we should probably drop this particular patch. I'm wondering 
whether we should be concerned about atomicity there too?

https://github.com/qemu/qemu/blob/134b443512825bed401b6e141447b8cdc22d2efe/target/riscv/vector_helper.c#L224

Thanks

Paolo

On 11/8/24 09:11, Richard Henderson wrote:
> On 11/7/24 12:58, Daniel Henrique Barboza wrote:
>> On 11/4/24 9:48 AM, Richard Henderson wrote:
>>> On 10/30/24 15:25, Paolo Savini wrote:
>>>> On 10/30/24 11:40, Richard Henderson wrote:
>>>>>     __builtin_memcpy DOES NOT equal VMOVDQA
>>>> I am aware of this. I took __builtin_memcpy as a generic enough way 
>>>> to emulate loads and stores that should allow several hosts to 
>>>> generate the widest load/store instructions they can and on x86 I 
>>>> see this generates instructions vmovdqu/movdqu that are not always 
>>>> guaranteed to be atomic. x86 though guarantees them to be atomic if 
>>>> the memory address is aligned to 16 bytes.
>>>
>>> No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
>>> See the comment in util/cpuinfo-i386.c, and the two 
>>> CPUINFO_ATOMIC_VMOVDQ[AU] bits.
>>>
>>> See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO, 
>>> and atomic16_read_ro.
>>> Not that I think you should use that here; it's complicated, and I 
>>> think you're better off relying on the code in accel/tcg/ when more 
>>> than byte atomicity is required.
>>>
>>
>> Not sure if that's what you meant but I didn't find any clear example of
>> multi-byte atomicity using qatomic_read() and friends that would be 
>> closer
>> to what memcpy() is doing here. I found one example in 
>> bdrv_graph_co_rdlock()
>> that seems to use a mem barrier via smp_mb() and qatomic_read() inside a
>> loop, but I don't understand that code enough to say.
>
> Memory barriers provide ordering between loads and stores, but they 
> cannot be used to address atomicity of individual loads and stores.
>
>
>> I'm also wondering if a common pthread_lock() wrapping up these 
>> memcpy() calls
>> would suffice in this case. Even if we can't guarantee that 
>> __builtin_memcpy()
>> will use arch specific vector insns in the host it would already be a 
>> faster
>> path than falling back to fn(...).
>
> Locks would certainly not be faster than calling the accel/tcg function.
>
>
>> In a quick detour, I'm not sure if we really considered how ARM SVE 
>> implements these
>> helpers. E.g gen_sve_str():
>>
>> https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/translate-sve.c#L4182 
>>
>
> Note that ARM SVE defines these instructions to have byte atomicity.
>
>
> r~
Richard Henderson Nov. 14, 2024, 4:09 p.m. UTC | #7
On 11/11/24 08:04, Paolo Savini wrote:
> Hi Richard, Daniel,
> 
> This might be a silly question, but why do we need to ensure atomicity when emulating 
> these guest instructions? I might be wrong but I didn't see an explicit requirement for 
> the vector instructions to be atomic in the documentation of the RISC-V V extension.

So that it works with threads?

The rvv extension talks about loads and stores to individual elements.  The risc-v integer 
spec talks about the atomicity of loads and stores.  The rvv extension does not talk about 
*lowering* atomicity requirements.


> Anyway the patches from Max have landed and since one of them already uses memcpy() where 
> this patch does and achieves a similar performance improvement we should probably drop 
> this particular patch. I'm wondering whether we should be concerned about atomicity there 
> too?
> 
> https://github.com/qemu/qemu/blob/134b443512825bed401b6e141447b8cdc22d2efe/target/riscv/vector_helper.c#L224

It *did* go through review for exactly this. You'll notice that the memcpy path is 
restricted to esz == 1, i.e. bytes.


r~

Patch

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 75c24653f0..e1c100e907 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -488,7 +488,22 @@  vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
     }
 
     fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
+     * We need to make sure that these instructions have guarantees of atomicity.
+     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
+     * memory operations if the memory operands are 16-byte aligned */
+    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
+		    ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
+      group_size = MO_128;
+      if (is_load) {
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+      } else {
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+      }
+    } else {
+      fn(vd, byte_offset, host + byte_offset);
+    }
 
     return 1 << group_size;
 }
diff --git a/target/riscv/vector_internals.h b/target/riscv/vector_internals.h
index f59d7d5c19..92694162ce 100644
--- a/target/riscv/vector_internals.h
+++ b/target/riscv/vector_internals.h
@@ -56,6 +56,18 @@  static inline uint32_t vext_nf(uint32_t desc)
 #define H8(x)   (x)
 #endif
 
+/*
+ * If we use host SIMD memory operations to accelerate the emulation we might
+ * want to rely on host specific flags to check that the memory accesses will
+ * be atomic.
+ */
+#if defined(HOST_X86_64)
+#define HOST_128_ATOMIC_MEM_OP \
+	((cpuinfo & (CPUINFO_ATOMIC_VMOVDQA | CPUINFO_ATOMIC_VMOVDQU)) != 0)
+#else
+#define HOST_128_ATOMIC_MEM_OP false
+#endif
+
 /*
  * Encode LMUL to lmul as following:
  *     LMUL    vlmul    lmul