
[RFC] CXL: TCG/KVM instruction alignment issue discussion default

Message ID Y/Cm5nuJl3G2CG2p@memverge.com (mailing list archive)
State New, archived

Commit Message

Gregory Price Feb. 18, 2023, 10:22 a.m. UTC
Breaking this off into a separate thread for archival purposes.

There's a bug with handling execution of instructions held in CXL
memory - specifically when an instruction crosses a page boundary.

The result of this is that type-3 devices cannot use KVM at all at the
moment, and require the attached patch to run in TCG-only mode.


CXL memory devices are presently emulated as MMIO, and MMIO has no
coherency guarantees, so TCG doesn't cache the results of translating
an instruction, meaning execution is incredibly slow (orders of
magnitude slower than KVM).


Request for comments:


First there's the stability issue:

0) TCG cannot handle instructions that cross a page boundary spanning
   RAM and MMIO. See the attached patch for a hotfix.  It basically
   solves the page-boundary issue by reverting the entire block to
   MMIO mode when the problem is detected.

1) KVM needs to be investigated.  It's likely the same or a similar
   issue, but that's not confirmed.



Second there's the performance issue:

0) Do we actually care about performance? How likely are users to
   attempt to run software out of CXL memory?

1) If we do care, is there potential for converting CXL away from the
   MMIO design?  The issue is coherency for shared memory. Emulating
   coherency is a) hard, and b) a ton of work for little gain.

   Presently, marking CXL memory as MMIO basically enforces coherency
   by preventing caching, though it's unclear how (or even whether)
   KVM enforces this; I have to imagine it does.



It might be nice to solve this for non-shared memory regions, but at
this point testing functionality >>> performance, so it might not be
worth the investment.


Just tossing this out for discussion.
~Gregory

Comments

Jørgen Hansen Feb. 27, 2023, 11:06 a.m. UTC | #1
On 2/18/23 11:22, Gregory Price wrote:
> Breaking this off into a separate thread for archival sake.
> 
> There's a bug with handling execution of instructions held in CXL
> memory - specifically when an instruction crosses a page boundary.
> 
> The result of this is that type-3 devices cannot use KVM at all at the
> moment, and require the attached patch to run in TCG-only mode.
> 
> 
> CXL memory devices are presently emulated as MMIO, and MMIO has no
> coherency guarantees, so TCG doesn't cache the results of translating
> an instruction, meaning execution is incredibly slow (orders of
> magnitude slower than KVM).
> 
> 
> Request for comments:
> 
> 
> First there's the stability issue:
> 
> 0) TCG cannot handle instructions across a page boundary spanning ram and
>     MMIO. See attached patch for hotfix.  This basically solves the page
>     boundary issue by reverting the entire block to MMIO-mode if the
>     problem is detected.
> 
> 1) KVM needs to be investigated.  It's likely the same/similar issue,
>     but it's not confirmed.

I ran into an issue with KVM as well. However, it wasn't a page 
boundary spanning issue, since I could hit it when using pure 
CXL-backed memory for a given application. It turned out that (at 
least) certain AVX instructions didn't handle execution from MMIO when 
using QEMU. This generated an illegal instruction exception in the 
application. At that point, I switched to TCG, so I didn't investigate 
whether a non-AVX system would work with KVM.

> Second there's the performance issue:
> 
> 0) Do we actually care about performance? How likely are users to
>     attempt to run software out of CXL memory?
> 
> 1) If we do care, is there a potential for converting CXL away from the
>     MMIO design?  The issue is coherency for shared memory. Emulating
>     coherency is a) hard, and b) a ton of work for little gain.
> 
>     Presently marking CXL memory as MMIO basically enforces coherency by
>     preventing caching, though it's unclear how this is enforced
>     by KVM (or if it is, i have to imagine it is). 

Having the option of doing device-specific processing of accesses to a 
CXL type 3 device (which the MMIO-based access allows) is useful for 
experimentation with device functionality, so I would be sad to see 
that option go away. Emulating cache line access to a type 3 device 
would be interesting, and could potentially be implemented in a way 
that would allow caching of device memory in a shadow page in RAM, but 
that is a rather large project.

> It might be nice to solve this for non-shared memory regions, but
> testing functionality >>> performance at this point so it might not
> worth the investment.

Thanks,
Jorgen
Jonathan Cameron Feb. 28, 2023, 10:49 a.m. UTC | #2
On Mon, 27 Feb 2023 11:06:47 +0000
Jørgen Hansen <Jorgen.Hansen@wdc.com> wrote:

> On 2/18/23 11:22, Gregory Price wrote:
> > Breaking this off into a separate thread for archival sake.
> > 
> > There's a bug with handling execution of instructions held in CXL
> > memory - specifically when an instruction crosses a page boundary.
> > 
> > The result of this is that type-3 devices cannot use KVM at all at the
> > moment, and require the attached patch to run in TCG-only mode.
> > 
> > 
> > CXL memory devices are presently emulated as MMIO, and MMIO has no
> > coherency guarantees, so TCG doesn't cache the results of translating
> > an instruction, meaning execution is incredibly slow (orders of
> > magnitude slower than KVM).
> > 
> > 
> > Request for comments:
> > 
> > 
> > First there's the stability issue:
> > 
> > 0) TCG cannot handle instructions across a page boundary spanning ram and
> >     MMIO. See attached patch for hotfix.  This basically solves the page
> >     boundary issue by reverting the entire block to MMIO-mode if the
> >     problem is detected.
> > 
> > 1) KVM needs to be investigated.  It's likely the same/similar issue,
> >     but it's not confirmed.  
> 
> I ran into an issue with KVM as well. However, it wasn't a page boundary 
> spanning issue, since I could hit it when using pure CXL backed memory 
> for a given application. It turned out that (at least) certain AVX 
> instructions didn't handle execution from MMIO when using qemu. This 
> generated an illegal instruction exception for the application. At that 
> point, I switched to tcg, so I didn't investigate if running a non-AVX 
> system would work with KVM.

Short term, I'm wondering if we should attempt to error out on KVM
unless some override parameter is used alongside the main cxl=on.

> 
> > Second there's the performance issue:
> > 
> > 0) Do we actually care about performance? How likely are users to
> >     attempt to run software out of CXL memory?
> > 
> > 1) If we do care, is there a potential for converting CXL away from the
> >     MMIO design?  The issue is coherency for shared memory. Emulating
> >     coherency is a) hard, and b) a ton of work for little gain.
> > 
> >     Presently marking CXL memory as MMIO basically enforces coherency by
> >     preventing caching, though it's unclear how this is enforced
> >     by KVM (or if it is, i have to imagine it is).   
> 
> Having the option of doing device specific processing of accesses to a 
> CXL type 3 device (that the MMIO based access allows) is useful for 
> experimentation with device functionality, so I would be sad to see that 
> option go away. Emulating cache line access to a type 3 device would be 
> interesting, and could potentially be implemented in a way that would 
> allow caching of device memory in a shadow page in RAM, but that it a 
> rather large project.

Absolutely agree.  I can sketch a solution that is entirely in QEMU
and works with KVM on a whiteboard, but it doesn't feel like a small
job to actually implement, and I'm sure there are nasty corners
(persistence is going to be tricky).

If anyone sees this as a 'fun challenge' and wants to take it on,
though, that would be great!

Jonathan

> 
> > It might be nice to solve this for non-shared memory regions, but
> > testing functionality >>> performance at this point so it might not
> > worth the investment.  
> 
> Thanks,
> Jorgen
Philippe Mathieu-Daudé Feb. 28, 2023, 12:42 p.m. UTC | #3
On 18/2/23 11:22, Gregory Price wrote:
> 
> Breaking this off into a separate thread for archival sake.
> 
> There's a bug with handling execution of instructions held in CXL
> memory - specifically when an instruction crosses a page boundary.
> 
> The result of this is that type-3 devices cannot use KVM at all at the
> moment, and require the attached patch to run in TCG-only mode.
> 
> 
> CXL memory devices are presently emulated as MMIO, and MMIO has no
> coherency guarantees, so TCG doesn't cache the results of translating
> an instruction, meaning execution is incredibly slow (orders of
> magnitude slower than KVM).
> 
> 
> Request for comments:
> 
> 
> First there's the stability issue:
> 
> 0) TCG cannot handle instructions across a page boundary spanning ram and
>     MMIO. See attached patch for hotfix.  This basically solves the page
>     boundary issue by reverting the entire block to MMIO-mode if the
>     problem is detected.
> 
> 1) KVM needs to be investigated.  It's likely the same/similar issue,
>     but it's not confirmed.
> 
> 
> 
> Second there's the performance issue:
> 
> 0) Do we actually care about performance? How likely are users to
>     attempt to run software out of CXL memory?
> 
> 1) If we do care, is there a potential for converting CXL away from the
>     MMIO design?  The issue is coherency for shared memory. Emulating
>     coherency is a) hard, and b) a ton of work for little gain.
> 
>     Presently marking CXL memory as MMIO basically enforces coherency by
>     preventing caching, though it's unclear how this is enforced
>     by KVM (or if it is, i have to imagine it is).
> 
> 
> 
> It might be nice to solve this for non-shared memory regions, but
> testing functionality >>> performance at this point so it might not
> worth the investment.
> 
> 
> Just tossing this out for discussion
> ~Gregory
> 
> 
> 
> 
> diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
> index 061519691f..cd383d7125 100644
> --- a/accel/tcg/translator.c
> +++ b/accel/tcg/translator.c
> @@ -171,8 +171,16 @@ static void *translator_access(CPUArchState *env, DisasContextBase *db,
>           if (host == NULL) {
>               tb_page_addr_t phys_page =
>                   get_page_addr_code_hostp(env, base, &db->host_addr[1]);
> -            /* We cannot handle MMIO as second page. */
> -            assert(phys_page != -1);
> +
> +            /*
> +            * If the second page is MMIO, treat as if the first page
> +            * was MMIO as well, so that we do not cache the TB.
> +            */
> +            if (unlikely(phys_page == -1)) {
> +                tb_set_page_addr0(tb, -1);
> +                return NULL;
> +            }
> +
>               tb_set_page_addr1(tb, phys_page);
>   #ifdef CONFIG_USER_ONLY
>               page_protect(end);
> 

This is:
https://lore.kernel.org/qemu-devel/20230222020023.904232-2-richard.henderson@linaro.org/
Jørgen Hansen March 1, 2023, 2:51 p.m. UTC | #4
On 2/28/23 11:49, Jonathan Cameron wrote:
>>> Second there's the performance issue:
>>>
>>> 0) Do we actually care about performance? How likely are users to
>>>      attempt to run software out of CXL memory?
>>>
>>> 1) If we do care, is there a potential for converting CXL away from the
>>>      MMIO design?  The issue is coherency for shared memory. Emulating
>>>      coherency is a) hard, and b) a ton of work for little gain.
>>>
>>>      Presently marking CXL memory as MMIO basically enforces coherency by
>>>      preventing caching, though it's unclear how this is enforced
>>>      by KVM (or if it is, i have to imagine it is).
>>
>> Having the option of doing device specific processing of accesses to a
>> CXL type 3 device (that the MMIO based access allows) is useful for
>> experimentation with device functionality, so I would be sad to see that
>> option go away. Emulating cache line access to a type 3 device would be
>> interesting, and could potentially be implemented in a way that would
>> allow caching of device memory in a shadow page in RAM, but that it a
>> rather large project.
> 
> Absolutely agree.  Can sketch a solution that is entirely in QEMU and
> works with KVM on a white board, but it doesn't feel like a small job
> to actually implement it and I'm sure there are nasty corners
> (persistency is going to be tricky)
> 
> If anyone sees this as a 'fun challenge' and wants to take it on though
> that would be great!

I'd be interested in getting more details on your thoughts on this and 
potentially working on it. We'd like to get closer to the CXL.mem 
traffic that a physical system would see, ideally seeing read requests 
only on LLC cache misses - although that seems hard to achieve.

Thanks,
Jorgen
Ajay Joshi March 2, 2023, 1:17 a.m. UTC | #5
On Tue, Feb 28, 2023 at 4:54 PM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Mon, 27 Feb 2023 11:06:47 +0000
> Jørgen Hansen <Jorgen.Hansen@wdc.com> wrote:
>
> > On 2/18/23 11:22, Gregory Price wrote:
> > > Breaking this off into a separate thread for archival sake.
> > >
> > > There's a bug with handling execution of instructions held in CXL
> > > memory - specifically when an instruction crosses a page boundary.
> > >
> > > The result of this is that type-3 devices cannot use KVM at all at the
> > > moment, and require the attached patch to run in TCG-only mode.
> > >
> > >
> > > CXL memory devices are presently emulated as MMIO, and MMIO has no
> > > coherency guarantees, so TCG doesn't cache the results of translating
> > > an instruction, meaning execution is incredibly slow (orders of
> > > magnitude slower than KVM).
> > >
> > >
> > > Request for comments:
> > >
> > >
> > > First there's the stability issue:
> > >
> > > 0) TCG cannot handle instructions across a page boundary spanning ram and
> > >     MMIO. See attached patch for hotfix.  This basically solves the page
> > >     boundary issue by reverting the entire block to MMIO-mode if the
> > >     problem is detected.
> > >
> > > 1) KVM needs to be investigated.  It's likely the same/similar issue,
> > >     but it's not confirmed.
> >
> > I ran into an issue with KVM as well. However, it wasn't a page boundary
> > spanning issue, since I could hit it when using pure CXL backed memory
> > for a given application. It turned out that (at least) certain AVX
> > instructions didn't handle execution from MMIO when using qemu. This
> > generated an illegal instruction exception for the application. At that
> > point, I switched to tcg, so I didn't investigate if running a non-AVX
> > system would work with KVM.
>
> Short term I'm wondering if we should attempt to error out on KVM
> unless some override parameter is used alongside the main cxl=on
This seems like a good idea. It avoids the trouble of discovering the
problem much later during execution.
>
> >
> > > Second there's the performance issue:
> > >
> > > 0) Do we actually care about performance? How likely are users to
> > >     attempt to run software out of CXL memory?
> > >
> > > 1) If we do care, is there a potential for converting CXL away from the
> > >     MMIO design?  The issue is coherency for shared memory. Emulating
> > >     coherency is a) hard, and b) a ton of work for little gain.
> > >
> > >     Presently marking CXL memory as MMIO basically enforces coherency by
> > >     preventing caching, though it's unclear how this is enforced
> > >     by KVM (or if it is, i have to imagine it is).
> >
> > Having the option of doing device specific processing of accesses to a
> > CXL type 3 device (that the MMIO based access allows) is useful for
> > experimentation with device functionality, so I would be sad to see that
> > option go away. Emulating cache line access to a type 3 device would be
> > interesting, and could potentially be implemented in a way that would
> > allow caching of device memory in a shadow page in RAM, but that it a
> > rather large project.
>
> Absolutely agree.  Can sketch a solution that is entirely in QEMU and
> works with KVM on a white board, but it doesn't feel like a small job
> to actually implement it and I'm sure there are nasty corners
> (persistency is going to be tricky)
>
> If anyone sees this as a 'fun challenge' and wants to take it on though
> that would be great!
>
> Jonathan
>
> >
> > > It might be nice to solve this for non-shared memory regions, but
> > > testing functionality >>> performance at this point so it might not
> > > worth the investment.
> >
> > Thanks,
> > Jorgen
>

Patch

diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index 061519691f..cd383d7125 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -171,8 +171,16 @@  static void *translator_access(CPUArchState *env, DisasContextBase *db,
         if (host == NULL) {
             tb_page_addr_t phys_page =
                 get_page_addr_code_hostp(env, base, &db->host_addr[1]);
-            /* We cannot handle MMIO as second page. */
-            assert(phys_page != -1);
+
+            /*
+            * If the second page is MMIO, treat as if the first page
+            * was MMIO as well, so that we do not cache the TB.
+            */
+            if (unlikely(phys_page == -1)) {
+                tb_set_page_addr0(tb, -1);
+                return NULL;
+            }
+
             tb_set_page_addr1(tb, phys_page);
 #ifdef CONFIG_USER_ONLY
             page_protect(end);