Message ID | Y/Cm5nuJl3G2CG2p@memverge.com |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] CXL: TCG/KVM instruction alignment issue discussion default | expand |
On 2/18/23 11:22, Gregory Price wrote: > Breaking this off into a separate thread for archival sake. > > There's a bug with handling execution of instructions held in CXL > memory - specifically when an instruction crosses a page boundary. > > The result of this is that type-3 devices cannot use KVM at all at the > moment, and require the attached patch to run in TCG-only mode. > > > CXL memory devices are presently emulated as MMIO, and MMIO has no > coherency guarantees, so TCG doesn't cache the results of translating > an instruction, meaning execution is incredibly slow (orders of > magnitude slower than KVM). > > > Request for comments: > > > First there's the stability issue: > > 0) TCG cannot handle instructions across a page boundary spanning ram and > MMIO. See attached patch for hotfix. This basically solves the page > boundary issue by reverting the entire block to MMIO-mode if the > problem is detected. > > 1) KVM needs to be investigated. It's likely the same/similar issue, > but it's not confirmed. I ran into an issue with KVM as well. However, it wasn't a page boundary spanning issue, since I could hit it when using pure CXL backed memory for a given application. It turned out that (at least) certain AVX instructions didn't handle execution from MMIO when using qemu. This generated an illegal instruction exception for the application. At that point, I switched to tcg, so I didn't investigate if running a non-AVX system would work with KVM. > Second there's the performance issue: > > 0) Do we actually care about performance? How likely are users to > attempt to run software out of CXL memory? > > 1) If we do care, is there a potential for converting CXL away from the > MMIO design? The issue is coherency for shared memory. Emulating > coherency is a) hard, and b) a ton of work for little gain. > > Presently marking CXL memory as MMIO basically enforces coherency by > preventing caching, though it's unclear how this is enforced > by KVM (or if it is, i have to imagine it is). Having the option of doing device specific processing of accesses to a CXL type 3 device (that the MMIO based access allows) is useful for experimentation with device functionality, so I would be sad to see that option go away. Emulating cache line access to a type 3 device would be interesting, and could potentially be implemented in a way that would allow caching of device memory in a shadow page in RAM, but that it a rather large project. > It might be nice to solve this for non-shared memory regions, but > testing functionality >>> performance at this point so it might not > worth the investment. Thanks, Jorgen
On Mon, 27 Feb 2023 11:06:47 +0000 Jørgen Hansen <Jorgen.Hansen@wdc.com> wrote: > On 2/18/23 11:22, Gregory Price wrote: > > Breaking this off into a separate thread for archival sake. > > > > There's a bug with handling execution of instructions held in CXL > > memory - specifically when an instruction crosses a page boundary. > > > > The result of this is that type-3 devices cannot use KVM at all at the > > moment, and require the attached patch to run in TCG-only mode. > > > > > > CXL memory devices are presently emulated as MMIO, and MMIO has no > > coherency guarantees, so TCG doesn't cache the results of translating > > an instruction, meaning execution is incredibly slow (orders of > > magnitude slower than KVM). > > > > > > Request for comments: > > > > > > First there's the stability issue: > > > > 0) TCG cannot handle instructions across a page boundary spanning ram and > > MMIO. See attached patch for hotfix. This basically solves the page > > boundary issue by reverting the entire block to MMIO-mode if the > > problem is detected. > > > > 1) KVM needs to be investigated. It's likely the same/similar issue, > > but it's not confirmed. > > I ran into an issue with KVM as well. However, it wasn't a page boundary > spanning issue, since I could hit it when using pure CXL backed memory > for a given application. It turned out that (at least) certain AVX > instructions didn't handle execution from MMIO when using qemu. This > generated an illegal instruction exception for the application. At that > point, I switched to tcg, so I didn't investigate if running a non-AVX > system would work with KVM. Short term I'm wondering if we should attempt to error out on KVM unless some override parameter is used alongside the main cxl=on > > > Second there's the performance issue: > > > > 0) Do we actually care about performance? How likely are users to > > attempt to run software out of CXL memory? > > > > 1) If we do care, is there a potential for converting CXL away from the > > MMIO design? The issue is coherency for shared memory. Emulating > > coherency is a) hard, and b) a ton of work for little gain. > > > > Presently marking CXL memory as MMIO basically enforces coherency by > > preventing caching, though it's unclear how this is enforced > > by KVM (or if it is, i have to imagine it is). > > Having the option of doing device specific processing of accesses to a > CXL type 3 device (that the MMIO based access allows) is useful for > experimentation with device functionality, so I would be sad to see that > option go away. Emulating cache line access to a type 3 device would be > interesting, and could potentially be implemented in a way that would > allow caching of device memory in a shadow page in RAM, but that it a > rather large project. Absolutely agree. Can sketch a solution that is entirely in QEMU and works with KVM on a white board, but it doesn't feel like a small job to actually implement it and I'm sure there are nasty corners (persistency is going to be tricky) If anyone sees this as a 'fun challenge' and wants to take it on though that would be great! Jonathan > > > It might be nice to solve this for non-shared memory regions, but > > testing functionality >>> performance at this point so it might not > > worth the investment. > > Thanks, > Jorgen
On 18/2/23 11:22, Gregory Price wrote: > > Breaking this off into a separate thread for archival sake. > > There's a bug with handling execution of instructions held in CXL > memory - specifically when an instruction crosses a page boundary. > > The result of this is that type-3 devices cannot use KVM at all at the > moment, and require the attached patch to run in TCG-only mode. > > > CXL memory devices are presently emulated as MMIO, and MMIO has no > coherency guarantees, so TCG doesn't cache the results of translating > an instruction, meaning execution is incredibly slow (orders of > magnitude slower than KVM). > > > Request for comments: > > > First there's the stability issue: > > 0) TCG cannot handle instructions across a page boundary spanning ram and > MMIO. See attached patch for hotfix. This basically solves the page > boundary issue by reverting the entire block to MMIO-mode if the > problem is detected. > > 1) KVM needs to be investigated. It's likely the same/similar issue, > but it's not confirmed. > > > > Second there's the performance issue: > > 0) Do we actually care about performance? How likely are users to > attempt to run software out of CXL memory? > > 1) If we do care, is there a potential for converting CXL away from the > MMIO design? The issue is coherency for shared memory. Emulating > coherency is a) hard, and b) a ton of work for little gain. > > Presently marking CXL memory as MMIO basically enforces coherency by > preventing caching, though it's unclear how this is enforced > by KVM (or if it is, i have to imagine it is). > > > > It might be nice to solve this for non-shared memory regions, but > testing functionality >>> performance at this point so it might not > worth the investment. > > > Just tossing this out for discussion > ~Gregory > > > > > diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c > index 061519691f..cd383d7125 100644 > --- a/accel/tcg/translator.c > +++ b/accel/tcg/translator.c > @@ -171,8 +171,16 @@ static void *translator_access(CPUArchState *env, DisasContextBase *db, > if (host == NULL) { > tb_page_addr_t phys_page = > get_page_addr_code_hostp(env, base, &db->host_addr[1]); > - /* We cannot handle MMIO as second page. */ > - assert(phys_page != -1); > + > + /* > + * If the second page is MMIO, treat as if the first page > + * was MMIO as well, so that we do not cache the TB. > + */ > + if (unlikely(phys_page == -1)) { > + tb_set_page_addr0(tb, -1); > + return NULL; > + } > + > tb_set_page_addr1(tb, phys_page); > #ifdef CONFIG_USER_ONLY > page_protect(end); > This is: https://lore.kernel.org/qemu-devel/20230222020023.904232-2-richard.henderson@linaro.org/
On 2/28/23 11:49, Jonathan Cameron wrote: >>> Second there's the performance issue: >>> >>> 0) Do we actually care about performance? How likely are users to >>> attempt to run software out of CXL memory? >>> >>> 1) If we do care, is there a potential for converting CXL away from the >>> MMIO design? The issue is coherency for shared memory. Emulating >>> coherency is a) hard, and b) a ton of work for little gain. >>> >>> Presently marking CXL memory as MMIO basically enforces coherency by >>> preventing caching, though it's unclear how this is enforced >>> by KVM (or if it is, i have to imagine it is). >> >> Having the option of doing device specific processing of accesses to a >> CXL type 3 device (that the MMIO based access allows) is useful for >> experimentation with device functionality, so I would be sad to see that >> option go away. Emulating cache line access to a type 3 device would be >> interesting, and could potentially be implemented in a way that would >> allow caching of device memory in a shadow page in RAM, but that it a >> rather large project. > > Absolutely agree. Can sketch a solution that is entirely in QEMU and > works with KVM on a white board, but it doesn't feel like a small job > to actually implement it and I'm sure there are nasty corners > (persistency is going to be tricky) > > If anyone sees this as a 'fun challenge' and wants to take it on though > that would be great! I'd be interested in getting more details on your thoughts on this and potentially work on it. We'd like to get closer to the CXL.mem traffic that a physical system would see, ideally seeing read requests only on LLC cache misses - although that seems hard to achieve. Thanks, Jorgen
On Tue, Feb 28, 2023 at 4:54 PM Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > On Mon, 27 Feb 2023 11:06:47 +0000 > Jørgen Hansen <Jorgen.Hansen@wdc.com> wrote: > > > On 2/18/23 11:22, Gregory Price wrote: > > > Breaking this off into a separate thread for archival sake. > > > > > > There's a bug with handling execution of instructions held in CXL > > > memory - specifically when an instruction crosses a page boundary. > > > > > > The result of this is that type-3 devices cannot use KVM at all at the > > > moment, and require the attached patch to run in TCG-only mode. > > > > > > > > > CXL memory devices are presently emulated as MMIO, and MMIO has no > > > coherency guarantees, so TCG doesn't cache the results of translating > > > an instruction, meaning execution is incredibly slow (orders of > > > magnitude slower than KVM). > > > > > > > > > Request for comments: > > > > > > > > > First there's the stability issue: > > > > > > 0) TCG cannot handle instructions across a page boundary spanning ram and > > > MMIO. See attached patch for hotfix. This basically solves the page > > > boundary issue by reverting the entire block to MMIO-mode if the > > > problem is detected. > > > > > > 1) KVM needs to be investigated. It's likely the same/similar issue, > > > but it's not confirmed. > > > > I ran into an issue with KVM as well. However, it wasn't a page boundary > > spanning issue, since I could hit it when using pure CXL backed memory > > for a given application. It turned out that (at least) certain AVX > > instructions didn't handle execution from MMIO when using qemu. This > > generated an illegal instruction exception for the application. At that > > point, I switched to tcg, so I didn't investigate if running a non-AVX > > system would work with KVM. > > Short term I'm wondering if we should attempt to error out on KVM > unless some override parameter is used alongside the main cxl=on This seems like a good idea. Avoids the trouble of discovering a lot later during the execution. > > > > > > Second there's the performance issue: > > > > > > 0) Do we actually care about performance? How likely are users to > > > attempt to run software out of CXL memory? > > > > > > 1) If we do care, is there a potential for converting CXL away from the > > > MMIO design? The issue is coherency for shared memory. Emulating > > > coherency is a) hard, and b) a ton of work for little gain. > > > > > > Presently marking CXL memory as MMIO basically enforces coherency by > > > preventing caching, though it's unclear how this is enforced > > > by KVM (or if it is, i have to imagine it is). > > > > Having the option of doing device specific processing of accesses to a > > CXL type 3 device (that the MMIO based access allows) is useful for > > experimentation with device functionality, so I would be sad to see that > > option go away. Emulating cache line access to a type 3 device would be > > interesting, and could potentially be implemented in a way that would > > allow caching of device memory in a shadow page in RAM, but that it a > > rather large project. > > Absolutely agree. Can sketch a solution that is entirely in QEMU and > works with KVM on a white board, but it doesn't feel like a small job > to actually implement it and I'm sure there are nasty corners > (persistency is going to be tricky) > > If anyone sees this as a 'fun challenge' and wants to take it on though > that would be great! > > Jonathan > > > > > > It might be nice to solve this for non-shared memory regions, but > > > testing functionality >>> performance at this point so it might not > > > worth the investment. > > > > Thanks, > > Jorgen >
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c index 061519691f..cd383d7125 100644 --- a/accel/tcg/translator.c +++ b/accel/tcg/translator.c @@ -171,8 +171,16 @@ static void *translator_access(CPUArchState *env, DisasContextBase *db, if (host == NULL) { tb_page_addr_t phys_page = get_page_addr_code_hostp(env, base, &db->host_addr[1]); - /* We cannot handle MMIO as second page. */ - assert(phys_page != -1); + + /* + * If the second page is MMIO, treat as if the first page + * was MMIO as well, so that we do not cache the TB. + */ + if (unlikely(phys_page == -1)) { + tb_set_page_addr0(tb, -1); + return NULL; + } + tb_set_page_addr1(tb, phys_page); #ifdef CONFIG_USER_ONLY page_protect(end);