
x86/hvm: don't expose XENFEAT_hvm_pirqs by default

Message ID 20240110095311.61809-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Series x86/hvm: don't expose XENFEAT_hvm_pirqs by default

Commit Message

Roger Pau Monné Jan. 10, 2024, 9:53 a.m. UTC
The HVM PIRQ feature allows routing interrupts from both physical and emulated
devices over event channels; this was done as a performance improvement.  However
its usage is fully undocumented, and the only reference implementation is in
Linux.  It defeats the purpose of local APIC hardware virtualization, because
when the feature is used interrupts bypass the local APIC altogether.

It has also been reported to not work properly with certain devices: at least
with some AMD GPUs, Linux attempts to route interrupts over event channels, but
Xen doesn't correctly detect such routing, which leads to the hypervisor
complaining with:

(XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15

When an MSI is routed over an event channel the entry delivery mode is set to
ExtINT, but Xen doesn't detect such routing and attempts to inject the
interrupt following the native MSI path, where the ExtINT delivery mode is not
supported.

Disable HVM PIRQs by default and provide a per-domain option in xl.cfg to
enable the feature.  Also, for backwards compatibility, keep the feature
enabled for any resumed domains that don't have an explicit selection.

Note that the only user of the feature (Linux) is also able to handle native
interrupts fine, as the feature was already not used when Xen reported local
APIC hardware virtualization as active.

Link: https://github.com/QubesOS/qubes-issues/issues/7971
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 docs/man/xl.cfg.5.pod.in          |  7 +++++++
 tools/include/libxl.h             |  7 +++++++
 tools/libs/light/libxl_create.c   |  7 +++++--
 tools/libs/light/libxl_types.idl  |  1 +
 tools/libs/light/libxl_x86.c      | 11 ++++++++---
 tools/python/xen/lowlevel/xc/xc.c |  4 +++-
 tools/xl/xl_parse.c               |  1 +
 xen/arch/x86/domain.c             |  4 +++-
 8 files changed, 35 insertions(+), 7 deletions(-)

Comments

Jan Beulich Jan. 10, 2024, 10:26 a.m. UTC | #1
On 10.01.2024 10:53, Roger Pau Monne wrote:
> The HVM pirq feature allows routing interrupts from both physical and emulated
> devices over event channels, this was done a performance improvement.  However
> its usage is fully undocumented, and the only reference implementation is in
> Linux.  It defeats the purpose of local APIC hardware virtualization, because
> when using it interrupts avoid the usage of the local APIC altogether.

So without sufficient APIC acceleration, isn't this arranging for degraded
performance then? IOW should the new default perhaps be dependent on the
degree of APIC acceleration?

> It has also been reported to not work properly with certain devices, at least
> when using some AMD GPUs Linux attempts to route interrupts over event
> channels, but Xen doesn't correctly detect such routing, which leads to the
> hypervisor complaining with:
> 
> (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> 
> When MSIs are attempted to be routed over event channels the entry delivery
> mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> inject the interrupt following the native MSI path, and the ExtINT delivery
> mode is not supported.

Shouldn't this be properly addressed nevertheless? The way it's described
it sounds as if MSI wouldn't work at all this way; I can't spot why the
issue would only be "with certain devices". Yet that in turn doesn't look
to be very likely - pass-through use cases, in particular SR-IOV ones,
would certainly have noticed.

Jan
Roger Pau Monné Jan. 10, 2024, 11:34 a.m. UTC | #2
On Wed, Jan 10, 2024 at 11:26:02AM +0100, Jan Beulich wrote:
> On 10.01.2024 10:53, Roger Pau Monne wrote:
> > The HVM pirq feature allows routing interrupts from both physical and emulated
> > devices over event channels, this was done a performance improvement.  However
> > its usage is fully undocumented, and the only reference implementation is in
> > Linux.  It defeats the purpose of local APIC hardware virtualization, because
> > when using it interrupts avoid the usage of the local APIC altogether.
> 
> So without sufficient APIC acceleration, isn't this arranging for degraded
> performance then? IOW should the new default perhaps be dependent on the
> degree of APIC acceleration?

Maybe; I certainly have no figures, but given the feature is not
working as expected I don't think we should block disabling it for
performance reasons.  Reliability should always come before
performance.

> > It has also been reported to not work properly with certain devices, at least
> > when using some AMD GPUs Linux attempts to route interrupts over event
> > channels, but Xen doesn't correctly detect such routing, which leads to the
> > hypervisor complaining with:
> > 
> > (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> > 
> > When MSIs are attempted to be routed over event channels the entry delivery
> > mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> > inject the interrupt following the native MSI path, and the ExtINT delivery
> > mode is not supported.
> 
> Shouldn't this be properly addressed nevertheless?

There's no documentation at all about the HVM PIRQ interface, so
every time I had to deal with it I had to guess and reverse engineer
how it's supposed to work.

> The way it's described
> it sounds as if MSI wouldn't work at all this way; I can't spot why the
> issue would only be "with certain devices". Yet that in turn doesn't look
> to be very likely - pass-through use cases, in particular SR-IOV ones,
> would certainly have noticed.

I think it has to do with when/how the MSI is updated.  I can't reproduce
it myself, which makes fixing it even more complicated.

TBH, I think the feature is simply broken, and I don't feel like
spending more time fixing it.  The fact that it was added in the first
place, and enabled by default, was IMO a mistake.

If someone is willing to fix the issue, I'm fine with that, but I
certainly don't want to spend more time fixing issues for an
undocumented feature, especially since, going forward, the feature is
likely to become even less relevant as hardware APIC virtualization
becomes more widely available.

FWIW, this patch has an issue in libxl with PV guests, so I will need
to send a v2.

Thanks, Roger.
Xenia Ragiadakou Jan. 10, 2024, 1:47 p.m. UTC | #3
On 10/1/24 12:26, Jan Beulich wrote:
> On 10.01.2024 10:53, Roger Pau Monne wrote:
>> The HVM pirq feature allows routing interrupts from both physical and emulated
>> devices over event channels, this was done a performance improvement.  However
>> its usage is fully undocumented, and the only reference implementation is in
>> Linux.  It defeats the purpose of local APIC hardware virtualization, because
>> when using it interrupts avoid the usage of the local APIC altogether.
> 
> So without sufficient APIC acceleration, isn't this arranging for degraded
> performance then? IOW should the new default perhaps be dependent on the
> degree of APIC acceleration?
> 
>> It has also been reported to not work properly with certain devices, at least
>> when using some AMD GPUs Linux attempts to route interrupts over event
>> channels, but Xen doesn't correctly detect such routing, which leads to the
>> hypervisor complaining with:
>>
>> (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
>>
>> When MSIs are attempted to be routed over event channels the entry delivery
>> mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
>> inject the interrupt following the native MSI path, and the ExtINT delivery
>> mode is not supported.
> 
> Shouldn't this be properly addressed nevertheless? The way it's described
> it sounds as if MSI wouldn't work at all this way; I can't spot why the
> issue would only be "with certain devices". Yet that in turn doesn't look
> to be very likely - pass-through use cases, in particular SR-IOV ones,
> would certainly have noticed.

The issue gets triggered when the guest performs save/restore of MSIs,
because PHYSDEVOP_map_pirq is not implemented for MSIs, and thus QEMU
cannot remap the MSI to the event channel once it has been unmapped.
So, to fix this issue, either QEMU would need to be changed to not unmap
pirq-emulated MSIs, or PHYSDEVOP_map_pirq would need to be implemented for MSIs.

But still, even when no device has been passed through, scheduling
latencies (of hundreds of ms) were observed in the guest even when
running a simple loop application, and they disappear once the flag is
disabled.  We did not have the chance to root cause it further.

> 
> Jan
>
Andrew Cooper Jan. 10, 2024, 3:20 p.m. UTC | #4
On 10/01/2024 10:26 am, Jan Beulich wrote:
> On 10.01.2024 10:53, Roger Pau Monne wrote:
>> The HVM pirq feature allows routing interrupts from both physical and emulated
>> devices over event channels, this was done a performance improvement.  However
>> its usage is fully undocumented, and the only reference implementation is in
>> Linux.  It defeats the purpose of local APIC hardware virtualization, because
>> when using it interrupts avoid the usage of the local APIC altogether.
> So without sufficient APIC acceleration, isn't this arranging for degraded
> performance then?

Maybe.  Quite possibly.

>  IOW should the new default perhaps be dependent on the
> degree of APIC acceleration?

No.  Getting things working is strictly more important than making
things fast.

HVM_PIRQ is an entirely undocumented pile of Xen-specific deviation from
standards, with a single reference implementation and 0 people who
understand it.  It has also been shown to be the cause of real breakage,
and disabling it has been shown to make guests function better.

If and when we have two working schemes, we can choose a default based
on performance.  We are not in this position right now.

~Andrew
Roger Pau Monné Jan. 10, 2024, 3:21 p.m. UTC | #5
On Wed, Jan 10, 2024 at 03:47:12PM +0200, Xenia Ragiadakou wrote:
> 
> 
> On 10/1/24 12:26, Jan Beulich wrote:
> > On 10.01.2024 10:53, Roger Pau Monne wrote:
> > > The HVM pirq feature allows routing interrupts from both physical and emulated
> > > devices over event channels, this was done a performance improvement.  However
> > > its usage is fully undocumented, and the only reference implementation is in
> > > Linux.  It defeats the purpose of local APIC hardware virtualization, because
> > > when using it interrupts avoid the usage of the local APIC altogether.
> > 
> > So without sufficient APIC acceleration, isn't this arranging for degraded
> > performance then? IOW should the new default perhaps be dependent on the
> > degree of APIC acceleration?
> > 
> > > It has also been reported to not work properly with certain devices, at least
> > > when using some AMD GPUs Linux attempts to route interrupts over event
> > > channels, but Xen doesn't correctly detect such routing, which leads to the
> > > hypervisor complaining with:
> > > 
> > > (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> > > 
> > > When MSIs are attempted to be routed over event channels the entry delivery
> > > mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> > > inject the interrupt following the native MSI path, and the ExtINT delivery
> > > mode is not supported.
> > 
> > Shouldn't this be properly addressed nevertheless? The way it's described
> > it sounds as if MSI wouldn't work at all this way; I can't spot why the
> > issue would only be "with certain devices". Yet that in turn doesn't look
> > to be very likely - pass-through use cases, in particular SR-IOV ones,
> > would certainly have noticed.
> 
> The issue gets triggered when the guest performs save/restore of MSIs,
> because PHYSDEVOP_map_pirq is not implemented for MSIs, and thus, QEMU
> cannot remap the MSI to the event channel once unmapped.

I'm kind of confused by this sentence: PHYSDEVOP_map_pirq does support
MSIs, see the xc_physdev_map_pirq_msi() helper in the Xen code base.

> So, to fix this issue either would be needed to change QEMU to not unmap
> pirq-emulated MSIs or to implement PHYSDEVOP_map_pirq for MSIs.
> 
> But still, even when no device has been passed-through, scheduling latencies
> (of hundreds of ms), were observed in the guest even when running a simple
> loop application, that disappear once the flag is disabled. We did not have
> the chance to root cause it further.

So XENFEAT_hvm_pirqs is causing such latency issues?  That I certainly
didn't notice.

Regards, Roger.
Xenia Ragiadakou Jan. 11, 2024, 8:04 a.m. UTC | #6
Hi Roger,

On 10/1/24 17:21, Roger Pau Monné wrote:
> On Wed, Jan 10, 2024 at 03:47:12PM +0200, Xenia Ragiadakou wrote:
>>
>>
>> On 10/1/24 12:26, Jan Beulich wrote:
>>> On 10.01.2024 10:53, Roger Pau Monne wrote:
>>>> The HVM pirq feature allows routing interrupts from both physical and emulated
>>>> devices over event channels, this was done a performance improvement.  However
>>>> its usage is fully undocumented, and the only reference implementation is in
>>>> Linux.  It defeats the purpose of local APIC hardware virtualization, because
>>>> when using it interrupts avoid the usage of the local APIC altogether.
>>>
>>> So without sufficient APIC acceleration, isn't this arranging for degraded
>>> performance then? IOW should the new default perhaps be dependent on the
>>> degree of APIC acceleration?
>>>
>>>> It has also been reported to not work properly with certain devices, at least
>>>> when using some AMD GPUs Linux attempts to route interrupts over event
>>>> channels, but Xen doesn't correctly detect such routing, which leads to the
>>>> hypervisor complaining with:
>>>>
>>>> (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
>>>>
>>>> When MSIs are attempted to be routed over event channels the entry delivery
>>>> mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
>>>> inject the interrupt following the native MSI path, and the ExtINT delivery
>>>> mode is not supported.
>>>
>>> Shouldn't this be properly addressed nevertheless? The way it's described
>>> it sounds as if MSI wouldn't work at all this way; I can't spot why the
>>> issue would only be "with certain devices". Yet that in turn doesn't look
>>> to be very likely - pass-through use cases, in particular SR-IOV ones,
>>> would certainly have noticed.
>>
>> The issue gets triggered when the guest performs save/restore of MSIs,
>> because PHYSDEVOP_map_pirq is not implemented for MSIs, and thus, QEMU
>> cannot remap the MSI to the event channel once unmapped.
> 
> I'm kind of confused by this sentence, PHYSDEVOP_map_pirq does support
> MSIs, see xc_physdev_map_pirq_msi() helper in Xen code base.
> 

Sorry, I should have explained it better.  For an HVM guest with
XENFEAT_hvm_pirqs set, physdev_hvm_map_pirq() will be called, which has
no support for MSI.
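
Roughly, the point is the following (a simplified sketch only, not the actual
Xen code; the MAP_PIRQ_TYPE_* constants are from xen/include/public/physdev.h):

/* Simplified illustration of the PHYSDEVOP_map_pirq path taken for such guests. */
static int hvm_map_pirq_sketch(int type)
{
    switch ( type )
    {
    case MAP_PIRQ_TYPE_GSI:
        /* Emulated GSIs can be bound to a PIRQ/event channel. */
        return 0;
    default:
        /* MAP_PIRQ_TYPE_MSI and friends are not handled on this path. */
        return -EINVAL;
    }
}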

>> So, to fix this issue either would be needed to change QEMU to not unmap
>> pirq-emulated MSIs or to implement PHYSDEVOP_map_pirq for MSIs.
>>
>> But still, even when no device has been passed-through, scheduling latencies
>> (of hundreds of ms), were observed in the guest even when running a simple
>> loop application, that disappear once the flag is disabled. We did not have
>> the chance to root cause it further.
> 
> So XENFEAT_hvm_pirqs is causing such latency issues?  That I certainly
> didn't notice.

We've seen that in a setup with an HVM guest using emulated MSIs.  We
were running an application, like the one below, pinned to an isolated
and pinned vcpu.

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval cur_time = {0}, prev_time = {0};
    int diff;

    gettimeofday(&prev_time, NULL);
    do {
        gettimeofday(&cur_time, NULL);

        /* Time between two consecutive iterations, in msec. */
        diff = (cur_time.tv_sec - prev_time.tv_sec) * 1000 +
               (cur_time.tv_usec - prev_time.tv_usec) / 1000;
        if (diff > 10)
            printf("app scheduled after: %d msec\n", diff);

        gettimeofday(&prev_time, NULL);
    } while (1);
}
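
For reference, built and pinned roughly like this inside the guest (the file
name and CPU number are arbitrary examples):

gcc -O2 loop.c -o loop
taskset -c 2 ./loop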

And we're getting values of hundreds of ms, like:
app scheduled after: 985 msec

> 
> Regards, Roger.
David Woodhouse Jan. 11, 2024, 5:47 p.m. UTC | #7
On Wed, 2024-01-10 at 11:26 +0100, Jan Beulich wrote:
> On 10.01.2024 10:53, Roger Pau Monne wrote:
> > The HVM pirq feature allows routing interrupts from both physical and emulated
> > devices over event channels, this was done a performance improvement.  However
> > its usage is fully undocumented, and the only reference implementation is in
> > Linux.  It defeats the purpose of local APIC hardware virtualization, because
> > when using it interrupts avoid the usage of the local APIC altogether.
> 
> So without sufficient APIC acceleration, isn't this arranging for degraded
> performance then? IOW should the new default perhaps be dependent on the
> degree of APIC acceleration?

In fact Linux already declines to use MSI → PIRQ routing if APIC
acceleration is enabled.

static void __init xen_hvm_msi_init(void)
{
        if (!apic_is_disabled) {
                /*
                 * If hardware supports (x2)APIC virtualization (as indicated
                 * by hypervisor's leaf 4) then we don't need to use pirqs/
                 * event channels for MSI handling and instead use regular
                 * APIC processing
                 */
                uint32_t eax = cpuid_eax(xen_cpuid_base() + 4);

                if (((eax & XEN_HVM_CPUID_X2APIC_VIRT) && x2apic_mode) ||
                    ((eax & XEN_HVM_CPUID_APIC_ACCESS_VIRT) && boot_cpu_has(X86_FEATURE_APIC)))
                        return;
        }
        xen_setup_pci_msi();
}

> > It has also been reported to not work properly with certain devices, at least
> > when using some AMD GPUs Linux attempts to route interrupts over event
> > channels, but Xen doesn't correctly detect such routing, which leads to the
> > hypervisor complaining with:
> > 
> > (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> > 
> > When MSIs are attempted to be routed over event channels the entry delivery
> > mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> > inject the interrupt following the native MSI path, and the ExtINT delivery
> > mode is not supported.
> 
> Shouldn't this be properly addressed nevertheless? The way it's described
> it sounds as if MSI wouldn't work at all this way; I can't spot why the
> issue would only be "with certain devices". Yet that in turn doesn't look
> to be very likely - pass-through use cases, in particular SR-IOV ones,
> would certainly have noticed.

I agree. The MSI to PIRQ routing thing is *awful*, especially the way
that Xen/QEMU snoops on writes to the MSI table while the target is
*masked*, and then Xen unmasks the MSI instead of the guest doing so.

But it does work, and there are three implementations of it on the
hypervisor side now (Xen itself, QEMU and the Nitro hypervisor). We
should fix the bug which is being reported, but I don't see why it's
necessary to completely disable the feature.
Roger Pau Monné Jan. 12, 2024, 11:05 a.m. UTC | #8
On Thu, Jan 11, 2024 at 05:47:30PM +0000, David Woodhouse wrote:
> On Wed, 2024-01-10 at 11:26 +0100, Jan Beulich wrote:
> > On 10.01.2024 10:53, Roger Pau Monne wrote:
> > > The HVM pirq feature allows routing interrupts from both physical and emulated
> > > devices over event channels, this was done a performance improvement.  However
> > > its usage is fully undocumented, and the only reference implementation is in
> > > Linux.  It defeats the purpose of local APIC hardware virtualization, because
> > > when using it interrupts avoid the usage of the local APIC altogether.
> > 
> > So without sufficient APIC acceleration, isn't this arranging for degraded
> > performance then? IOW should the new default perhaps be dependent on the
> > degree of APIC acceleration?
> 
> In fact Linux already declines to use MSI → PIRQ routing if APIC
> acceleration is enabled.
> 
> static void __init xen_hvm_msi_init(void)
> {
>         if (!apic_is_disabled) {
>                 /*
>                  * If hardware supports (x2)APIC virtualization (as indicated
>                  * by hypervisor's leaf 4) then we don't need to use pirqs/
>                  * event channels for MSI handling and instead use regular
>                  * APIC processing
>                  */
>                 uint32_t eax = cpuid_eax(xen_cpuid_base() + 4);
> 
>                 if (((eax & XEN_HVM_CPUID_X2APIC_VIRT) && x2apic_mode) ||
>                     ((eax & XEN_HVM_CPUID_APIC_ACCESS_VIRT) && boot_cpu_has(X86_FEATURE_APIC)))
>                         return;
>         }
>         xen_setup_pci_msi();
> }
> 
> > > It has also been reported to not work properly with certain devices, at least
> > > when using some AMD GPUs Linux attempts to route interrupts over event
> > > channels, but Xen doesn't correctly detect such routing, which leads to the
> > > hypervisor complaining with:
> > > 
> > > (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> > > 
> > > When MSIs are attempted to be routed over event channels the entry delivery
> > > mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> > > inject the interrupt following the native MSI path, and the ExtINT delivery
> > > mode is not supported.
> > 
> > Shouldn't this be properly addressed nevertheless? The way it's described
> > it sounds as if MSI wouldn't work at all this way; I can't spot why the
> > issue would only be "with certain devices". Yet that in turn doesn't look
> > to be very likely - pass-through use cases, in particular SR-IOV ones,
> > would certainly have noticed.
> 
> I agree. The MSI to PIRQ routing thing is *awful*, especially the way
> that Xen/QEMU snoops on writes to the MSI table while the target is
> *masked*, and then Xen unmasks the MSI instead of the guest doing so.
> 
> But it does work, and there are three implementations of it on the
> hypervisor side now (Xen itself, QEMU and the Nitro hypervisor).

There's only one implementation in Xen, and it's split between the
hypervisor and QEMU.  IOW it's not like the QEMU and hypervisor sides
are independent.

It's also only used by Linux; I don't know of any other guest having
implemented support for it (thankfully, I should say).

> We
> should fix the bug which is being reported, but I don't see why it's
> necessary to completely disable the feature.

It's not just this bug: the feature is fully undocumented, and every
time there's an issue reported against interrupt delivery on HVM Linux
we need to reverse engineer how this is supposed to work.

Not to mention the MSI breakage it introduces by re-using the MSI
data and address fields for Xen specific purposes.
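
For context on the log line quoted earlier: on x86, bits 10:8 of the MSI data
register encode the delivery mode, and a value of 7 in that field means
ExtINT, which is what the "Unsupported MSI delivery mode 7" message reports.
A minimal decode sketch (the concrete data value below is made up purely for
illustration):

#include <stdint.h>
#include <stdio.h>

/* Bits 10:8 of the MSI data register hold the delivery mode. */
#define MSI_DATA_DELIVERY_MODE(data) (((data) >> 8) & 0x7)

int main(void)
{
    uint32_t msi_data = 0x0742;  /* made-up value: delivery mode 7 (ExtINT), vector 0x42 */

    printf("delivery mode: %u\n", MSI_DATA_DELIVERY_MODE(msi_data));
    return 0;
}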

Note that this commit is not removing the code, just disabling it by
default.  Users can still enable it using the hvm_pirq option in xl.cfg
(and toolstacks can use the domctl create flag to enable it).
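
E.g. something along these lines in the guest's xl.cfg (an illustrative
minimal config; only the hvm_pirq line is relevant here) turns the feature
back on:

type = "hvm"
name = "guest0"
memory = 2048
vcpus = 2
hvm_pirq = 1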

IMO if someone wants to make a case for having HVM PIRQs enabled by
default, we first need documentation about how it's supposed to work,
plus all the reported bugs fixed.  I have to admit I would probably be
reluctant to enable it by default even then.

Apart from the original issue, Xenia has also reported that with the
option enabled they see some unexpected scheduling imbalance, which is
gone when the option is disabled:

https://lore.kernel.org/xen-devel/6fe776cd-3fa6-421f-9d02-9350e85d5612@amd.com/

Regards, Roger.

Patch

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 2e234b450efb..ea8d41727d8e 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -2460,6 +2460,13 @@  The viridian option can be specified as a boolean. A value of true (1)
 is equivalent to the list [ "defaults" ], and a value of false (0) is
 equivalent to an empty list.
 
+=item B<hvm_pirq=BOOLEAN>
+
+Select whether the guest is allowed to route interrupts from devices (either
+emulated or passed through) over event channels.
+
+This option is disabled by default.
+
 =back
 
 =head3 Emulated VGA Graphics Device
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index 907aa0a3303a..f1652b1664f0 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -608,6 +608,13 @@ 
  * executable in order to not run it as the same user as libxl.
  */
 
+/*
+ * LIBXL_HAVE_HVM_PIRQ indicates the presence of the u.hvm.pirq field in
+ * libxl_domain_build_info that signals whether an HVM guest has access to
+ * the XENFEAT_hvm_pirqs feature.
+ */
+#define LIBXL_HAVE_HVM_PIRQ 1
+
 /*
  * libxl memory management
  *
diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index ce1d43110336..0008fac607e3 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -376,6 +376,7 @@  int libxl__domain_build_info_setdefault(libxl__gc *gc,
         libxl_defbool_setdefault(&b_info->u.hvm.usb,                false);
         libxl_defbool_setdefault(&b_info->u.hvm.vkb_device,         true);
         libxl_defbool_setdefault(&b_info->u.hvm.xen_platform_pci,   true);
+        libxl_defbool_setdefault(&b_info->u.hvm.pirq,               false);
 
         libxl_defbool_setdefault(&b_info->u.hvm.spice.enable, false);
         if (!libxl_defbool_val(b_info->u.hvm.spice.enable) &&
@@ -2375,10 +2376,12 @@  int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
 
     /*
      * When restoring (either from a save file or for a migration domain) set
-     * the MSR relaxed mode for compatibility with older Xen versions if the
-     * option is not set as part of the original configuration.
+     * the MSR relaxed mode and HVM PIRQs for compatibility with older Xen
+     * versions if the options are not set as part of the original
+     * configuration.
      */
     libxl_defbool_setdefault(&d_config->b_info.arch_x86.msr_relaxed, true);
+    libxl_defbool_setdefault(&d_config->b_info.u.hvm.pirq, true);
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
                             params, ao_how, aop_console_how);
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 7d8bd5d21667..899ad3096926 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -692,6 +692,7 @@  libxl_domain_build_info = Struct("domain_build_info",[
                                        ("rdm", libxl_rdm_reserve),
                                        ("rdm_mem_boundary_memkb", MemKB),
                                        ("mca_caps",         uint64),
+                                       ("pirq",             libxl_defbool),
                                        ])),
                  ("pv", Struct(None, [("kernel", string, {'deprecated_by': 'kernel'}),
                                       ("slack_memkb", MemKB),
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index d16573e72cd4..79c7c11e30b0 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -9,6 +9,8 @@  int libxl__arch_domain_prepare_config(libxl__gc *gc,
     switch(d_config->c_info.type) {
     case LIBXL_DOMAIN_TYPE_HVM:
         config->arch.emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
+        if (!libxl_defbool_val(d_config->b_info.u.hvm.pirq))
+            config->arch.emulation_flags &= ~XEN_X86_EMU_USE_PIRQ;
         break;
     case LIBXL_DOMAIN_TYPE_PVH:
         config->arch.emulation_flags = XEN_X86_EMU_LAPIC;
@@ -864,15 +866,18 @@  void libxl__arch_update_domain_config(libxl__gc *gc,
                                       const libxl_domain_config *src)
 {
     /*
-     * Force MSR relaxed to be set (either to true or false) so it's part of
-     * the domain configuration when saving or performing a live-migration.
+     * Force MSR relaxed and HVM pirq to be set (either to true or false) so
+     * they're part of the domain configuration when saving or performing a
+     * live-migration.
      *
-     * Doing so allows the recovery side to figure out whether the flag should
+     * Doing so allows the recovery side to figure out whether the flags should
      * be set to true in order to keep backwards compatibility with already
      * started domains.
      */
     libxl_defbool_setdefault(&dst->b_info.arch_x86.msr_relaxed,
                     libxl_defbool_val(src->b_info.arch_x86.msr_relaxed));
+    libxl_defbool_setdefault(&dst->b_info.u.hvm.pirq,
+                             libxl_defbool_val(src->b_info.u.hvm.pirq));
 }
 
 /*
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index d3ea350e07b9..9feb12ae2b16 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -159,7 +159,9 @@  static PyObject *pyxc_domain_create(XcObject *self,
 
 #if defined (__i386) || defined(__x86_64__)
     if ( config.flags & XEN_DOMCTL_CDF_hvm )
-        config.arch.emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
+        config.arch.emulation_flags = XEN_X86_EMU_ALL &
+                                      ~(XEN_X86_EMU_VPCI |
+                                        XEN_X86_EMU_USE_PIRQ);
 #elif defined (__arm__) || defined(__aarch64__)
     config.arch.gic_version = XEN_DOMCTL_CONFIG_GIC_NATIVE;
 #else
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index ed983200c3f8..9b358f11b88e 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -1801,6 +1801,7 @@  void parse_config_data(const char *config_source,
         xlu_cfg_get_defbool(config, "hpet", &b_info->u.hvm.hpet, 0);
         xlu_cfg_get_defbool(config, "vpt_align", &b_info->u.hvm.vpt_align, 0);
         xlu_cfg_get_defbool(config, "apic", &b_info->apic, 0);
+        xlu_cfg_get_defbool(config, "hvm_pirq", &b_info->u.hvm.pirq, 0);
 
         switch (xlu_cfg_get_list(config, "viridian",
                                  &viridian, &num_viridian, 1))
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8a31d18f6967..bda853e3c92b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -725,7 +725,9 @@  static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
              emflags != (X86_EMU_VPCI | X86_EMU_LAPIC | X86_EMU_IOAPIC) )
             return false;
         if ( !is_hardware_domain(d) &&
-             emflags != (X86_EMU_ALL & ~X86_EMU_VPCI) &&
+             /* HVM PIRQ feature is user-selectable. */
+             (emflags & ~X86_EMU_USE_PIRQ) !=
+             (X86_EMU_ALL & ~(X86_EMU_VPCI | X86_EMU_USE_PIRQ)) &&
              emflags != X86_EMU_LAPIC )
             return false;
     }