
VMX: use a single, global APIC access page

Message ID 1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com (mailing list archive)
State Superseded
Series VMX: use a single, global APIC access page

Commit Message

Jan Beulich Feb. 10, 2021, 4:48 p.m. UTC
The address of this page is used by the CPU only to recognize when to
instead access the virtual APIC page. No accesses would ever go to this
page. It only needs to be present in the (CPU) page tables so that
address translation will produce its address as the result for the
respective accesses.

By making this page global, we also eliminate the need to refcount it,
or to assign it to any domain in the first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Hooking p2m insertion onto arch_domain_creation_finished() isn't very
nice, but I couldn't find any better hook (nor a good place to
introduce a new one). In particular there appear to be no hvm_funcs hooks
being used on a domain-wide basis (except for init/destroy of course).
I did consider connecting this to the setting of HVM_PARAM_IDENT_PT, but
considered this no better, especially since the tool stack could be
smarter and avoid setting that param when not needed.

I did further consider not allocating any real page at all, but just
using the address of some unpopulated space (which would require
announcing this page as reserved to Dom0, so it wouldn't put any PCI
MMIO BARs there). But I thought this would be too controversial, because
of the possible risks associated with this.

Perhaps the change to p2m_get_iommu_flags() should be in a separate
patch.
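
For reference, the VMCS wiring this builds on (sketch only - the
APIC_ACCESS_ADDR write sits just past the end of the hunk quoted at the
bottom of this page, and the function name here is made up):

static void install_vlapic_mapping_sketch(struct vcpu *v)
{
    vmx_vmcs_enter(v);
    /* Per-vCPU page actually holding the emulated APIC registers. */
    __vmwrite(VIRTUAL_APIC_PAGE_ADDR,
              page_to_maddr(vcpu_vlapic(v)->regs_page));
    /* Single global page; its contents are never accessed. */
    __vmwrite(APIC_ACCESS_ADDR, mfn_to_maddr(apic_access_mfn));
    vmx_vmcs_exit(v);
}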

Comments

Andrew Cooper Feb. 10, 2021, 5 p.m. UTC | #1
On 10/02/2021 16:48, Jan Beulich wrote:
> The address of this page is used by the CPU only to recognize when to
> instead access the virtual APIC page instead. No accesses would ever go
> to this page. It only needs to be present in the (CPU) page tables so
> that address translation will produce its address as result for
> respective accesses.
>
> By making this page global, we also eliminate the need to refcount it,
> or to assign it to any domain in the first place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

How certain are you about this?

It's definitely not true on AMD's AVIC - writes very definitely end up
in the backing page if they miss the APIC registers.

This concern was why we didn't use a global page originally, IIRC.

~Andrew
Jan Beulich Feb. 10, 2021, 5:03 p.m. UTC | #2
On 10.02.2021 18:00, Andrew Cooper wrote:
> On 10/02/2021 16:48, Jan Beulich wrote:
>> The address of this page is used by the CPU only to recognize when to
>> instead access the virtual APIC page instead. No accesses would ever go
>> to this page. It only needs to be present in the (CPU) page tables so
>> that address translation will produce its address as result for
>> respective accesses.
>>
>> By making this page global, we also eliminate the need to refcount it,
>> or to assign it to any domain in the first place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> How certain are you about this?

The documentation (I'm inclined to say: unexpectedly) is very
clear about this; I don't think it had been this clear back at
the time. I'm hoping for Kevin to shout if he's aware of issues
here.

I've also gone through many of our emulation code paths (I
think I caught all relevant ones), and found them all having
suitable guards in place. (This process was why it took until the
evening to have a patch ready.)

Jan
Jan Beulich Feb. 10, 2021, 5:16 p.m. UTC | #3
On 10.02.2021 18:00, Andrew Cooper wrote:
> On 10/02/2021 16:48, Jan Beulich wrote:
>> The address of this page is used by the CPU only to recognize when to
>> instead access the virtual APIC page instead. No accesses would ever go
>> to this page. It only needs to be present in the (CPU) page tables so
>> that address translation will produce its address as result for
>> respective accesses.
>>
>> By making this page global, we also eliminate the need to refcount it,
>> or to assign it to any domain in the first place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> How certain are you about this?
> 
> It's definitely not true on AMD's AVIC - writes very definitely end up
> in the backing page if they miss the APIC registers.

Doesn't this require a per-vCPU page then anyway? With a per-domain
one things can't work correctly in the case you describe.

Jan
Roger Pau Monne Feb. 11, 2021, 8:45 a.m. UTC | #4
On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> I did further consider not allocating any real page at all, but just
> using the address of some unpopulated space (which would require
> announcing this page as reserved to Dom0, so it wouldn't put any PCI
> MMIO BARs there). But I thought this would be too controversial, because
> of the possible risks associated with this.

No, Xen is not capable of allocating a suitable unpopulated page IMO,
so let's not go down that route. Wasting one RAM page seems perfectly
fine to me.

> Perhaps the change to p2m_get_iommu_flags() should be in a separate
> patch.

Maybe; I'm still not fully convinced that's a change we want to do,
TBH.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1007,6 +1007,8 @@ int arch_domain_soft_reset(struct domain
>  
>  void arch_domain_creation_finished(struct domain *d)
>  {
> +    if ( is_hvm_domain(d) )
> +        hvm_domain_creation_finished(d);
>  }
>  
>  #define xen_vcpu_guest_context vcpu_guest_context
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -66,8 +66,7 @@ boolean_param("force-ept", opt_force_ept
>  static void vmx_ctxt_switch_from(struct vcpu *v);
>  static void vmx_ctxt_switch_to(struct vcpu *v);
>  
> -static int  vmx_alloc_vlapic_mapping(struct domain *d);
> -static void vmx_free_vlapic_mapping(struct domain *d);
> +static int alloc_vlapic_mapping(void);
>  static void vmx_install_vlapic_mapping(struct vcpu *v);
>  static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr,
>                                  unsigned int flags);
> @@ -78,6 +77,8 @@ static int vmx_msr_read_intercept(unsign
>  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
>  static void vmx_invlpg(struct vcpu *v, unsigned long linear);
>  
> +static mfn_t __read_mostly apic_access_mfn;
> +
>  /* Values for domain's ->arch.hvm_domain.pi_ops.flags. */
>  #define PI_CSW_FROM (1u << 0)
>  #define PI_CSW_TO   (1u << 1)
> @@ -401,7 +402,6 @@ static int vmx_domain_initialise(struct
>          .to   = vmx_ctxt_switch_to,
>          .tail = vmx_do_resume,
>      };
> -    int rc;
>  
>      d->arch.ctxt_switch = &csw;
>  
> @@ -411,21 +411,16 @@ static int vmx_domain_initialise(struct
>       */
>      d->arch.hvm.vmx.exec_sp = is_hardware_domain(d) || opt_ept_exec_sp;
>  
> -    if ( !has_vlapic(d) )
> -        return 0;
> -
> -    if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 )
> -        return rc;
> -
>      return 0;
>  }
>  
> -static void vmx_domain_relinquish_resources(struct domain *d)
> +static void domain_creation_finished(struct domain *d)
>  {
> -    if ( !has_vlapic(d) )
> -        return;
>  
> -    vmx_free_vlapic_mapping(d);
> +    if ( !mfn_eq(apic_access_mfn, _mfn(0)) &&
> +         set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE),
> +                            apic_access_mfn, PAGE_ORDER_4K) )
> +        domain_crash(d);
>  }
>  
>  static void vmx_init_ipt(struct vcpu *v)
> @@ -2407,7 +2402,7 @@ static struct hvm_function_table __initd
>      .cpu_up_prepare       = vmx_cpu_up_prepare,
>      .cpu_dead             = vmx_cpu_dead,
>      .domain_initialise    = vmx_domain_initialise,
> -    .domain_relinquish_resources = vmx_domain_relinquish_resources,
> +    .domain_creation_finished = domain_creation_finished,
>      .vcpu_initialise      = vmx_vcpu_initialise,
>      .vcpu_destroy         = vmx_vcpu_destroy,
>      .save_cpu_ctxt        = vmx_save_vmcs_ctxt,
> @@ -2653,7 +2648,7 @@ const struct hvm_function_table * __init
>  {
>      set_in_cr4(X86_CR4_VMXE);
>  
> -    if ( vmx_vmcs_init() )
> +    if ( vmx_vmcs_init() || alloc_vlapic_mapping() )
>      {
>          printk("VMX: failed to initialise.\n");
>          return NULL;
> @@ -3208,7 +3203,7 @@ gp_fault:
>      return X86EMUL_EXCEPTION;
>  }
>  
> -static int vmx_alloc_vlapic_mapping(struct domain *d)
> +static int __init alloc_vlapic_mapping(void)
>  {
>      struct page_info *pg;
>      mfn_t mfn;
> @@ -3216,53 +3211,28 @@ static int vmx_alloc_vlapic_mapping(stru
>      if ( !cpu_has_vmx_virtualize_apic_accesses )
>          return 0;
>  
> -    pg = alloc_domheap_page(d, MEMF_no_refcount);
> +    pg = alloc_domheap_page(NULL, 0);
>      if ( !pg )
>          return -ENOMEM;
>  
> -    if ( !get_page_and_type(pg, d, PGT_writable_page) )
> -    {
> -        /*
> -         * The domain can't possibly know about this page yet, so failure
> -         * here is a clear indication of something fishy going on.
> -         */
> -        domain_crash(d);
> -        return -ENODATA;
> -    }
> -
>      mfn = page_to_mfn(pg);
>      clear_domain_page(mfn);
> -    d->arch.hvm.vmx.apic_access_mfn = mfn;
> +    apic_access_mfn = mfn;
>  
> -    return set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE), mfn,
> -                              PAGE_ORDER_4K);
> -}
> -
> -static void vmx_free_vlapic_mapping(struct domain *d)
> -{
> -    mfn_t mfn = d->arch.hvm.vmx.apic_access_mfn;
> -
> -    d->arch.hvm.vmx.apic_access_mfn = _mfn(0);
> -    if ( !mfn_eq(mfn, _mfn(0)) )
> -    {
> -        struct page_info *pg = mfn_to_page(mfn);
> -
> -        put_page_alloc_ref(pg);
> -        put_page_and_type(pg);
> -    }
> +    return 0;
>  }
>  
>  static void vmx_install_vlapic_mapping(struct vcpu *v)
>  {
>      paddr_t virt_page_ma, apic_page_ma;
>  
> -    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
> +    if ( mfn_eq(apic_access_mfn, _mfn(0)) )

You should check if the domain has a vlapic I think, since now
apic_access_mfn is global and will be set regardless of whether the
domain has vlapic enabled or not.

Previously apic_access_mfn was only allocated if the domain had vlapic
enabled.
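
I.e. something along these lines (just a sketch, exact shape up to
you):

    /* Suggestion only: also bail for domains without an emulated LAPIC,
     * since the global apic_access_mfn is set regardless of has_vlapic(). */
    if ( !has_vlapic(v->domain) || mfn_eq(apic_access_mfn, _mfn(0)) )
        return;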

>          return;
>  
>      ASSERT(cpu_has_vmx_virtualize_apic_accesses);
>  
>      virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
> -    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
> +    apic_page_ma = mfn_to_maddr(apic_access_mfn);
>  
>      vmx_vmcs_enter(v);
>      __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);

I would consider setting up the vmcs and adding the page to the p2m in
the same function, and likely call it from vlapic_init. We could have
a domain_setup_apic in hvm_function_table that takes care of all this.
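
Something like this maybe (the domain_setup_apic hook doesn't exist
today - this just mirrors the hvm_domain_creation_finished() wrapper
pattern from the patch, assuming a new member in hvm_function_table):

static inline void hvm_domain_setup_apic(struct domain *d)
{
    if ( hvm_funcs.domain_setup_apic )
        alternative_vcall(hvm_funcs.domain_setup_apic, d);
}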

> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -106,6 +106,7 @@ struct hvm_function_table {
>       * Initialise/destroy HVM domain/vcpu resources
>       */
>      int  (*domain_initialise)(struct domain *d);
> +    void (*domain_creation_finished)(struct domain *d);
>      void (*domain_relinquish_resources)(struct domain *d);
>      void (*domain_destroy)(struct domain *d);
>      int  (*vcpu_initialise)(struct vcpu *v);
> @@ -390,6 +391,12 @@ static inline bool hvm_has_set_descripto
>      return hvm_funcs.set_descriptor_access_exiting;
>  }
>  
> +static inline void hvm_domain_creation_finished(struct domain *d)
> +{
> +    if ( hvm_funcs.domain_creation_finished )
> +        alternative_vcall(hvm_funcs.domain_creation_finished, d);
> +}
> +
>  static inline int
>  hvm_guest_x86_mode(struct vcpu *v)
>  {
> @@ -765,6 +772,11 @@ static inline void hvm_invlpg(const stru
>  {
>      ASSERT_UNREACHABLE();
>  }
> +
> +static inline void hvm_domain_creation_finished(struct domain *d)
> +{
> +    ASSERT_UNREACHABLE();
> +}
>  
>  /*
>   * Shadow code needs further cleanup to eliminate some HVM-only paths. For
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -58,7 +58,6 @@ struct ept_data {
>  #define _VMX_DOMAIN_PML_ENABLED    0
>  #define VMX_DOMAIN_PML_ENABLED     (1ul << _VMX_DOMAIN_PML_ENABLED)
>  struct vmx_domain {
> -    mfn_t apic_access_mfn;
>      /* VMX_DOMAIN_* */
>      unsigned int status;
>  
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
>          flags = IOMMUF_readable;
>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
>              flags |= IOMMUF_writable;
> +        /* VMX'es APIC access page is global and hence has no owner. */
> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
> +            flags = 0;

Is it fine to have this page accessible to devices if the page tables
are shared between the CPU and the IOMMU?

Is it possible for devices to write to it?

I still think both the CPU and the IOMMU page tables should be the
same unless there's a technical reason for them not to be, rather than
us just wanting to hide things from devices.

Thanks, Roger.
Jan Beulich Feb. 11, 2021, 10:36 a.m. UTC | #5
On 11.02.2021 09:45, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
>> I did further consider not allocating any real page at all, but just
>> using the address of some unpopulated space (which would require
>> announcing this page as reserved to Dom0, so it wouldn't put any PCI
>> MMIO BARs there). But I thought this would be too controversial, because
>> of the possible risks associated with this.
> 
> No, Xen is not capable of allocating a suitable unpopulated page IMO,
> so let's not go down that route. Wasting one RAM page seems perfectly
> fine to me.

Why would Xen not be able to, in principle? It may be difficult,
but there may also be pages we could declare we know we can use
for this purpose.

>> @@ -3216,53 +3211,28 @@ static int vmx_alloc_vlapic_mapping(stru
>>      if ( !cpu_has_vmx_virtualize_apic_accesses )
>>          return 0;
>>  
>> -    pg = alloc_domheap_page(d, MEMF_no_refcount);
>> +    pg = alloc_domheap_page(NULL, 0);
>>      if ( !pg )
>>          return -ENOMEM;
>>  
>> -    if ( !get_page_and_type(pg, d, PGT_writable_page) )
>> -    {
>> -        /*
>> -         * The domain can't possibly know about this page yet, so failure
>> -         * here is a clear indication of something fishy going on.
>> -         */
>> -        domain_crash(d);
>> -        return -ENODATA;
>> -    }
>> -
>>      mfn = page_to_mfn(pg);
>>      clear_domain_page(mfn);
>> -    d->arch.hvm.vmx.apic_access_mfn = mfn;
>> +    apic_access_mfn = mfn;
>>  
>> -    return set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE), mfn,
>> -                              PAGE_ORDER_4K);
>> -}
>> -
>> -static void vmx_free_vlapic_mapping(struct domain *d)
>> -{
>> -    mfn_t mfn = d->arch.hvm.vmx.apic_access_mfn;
>> -
>> -    d->arch.hvm.vmx.apic_access_mfn = _mfn(0);
>> -    if ( !mfn_eq(mfn, _mfn(0)) )
>> -    {
>> -        struct page_info *pg = mfn_to_page(mfn);
>> -
>> -        put_page_alloc_ref(pg);
>> -        put_page_and_type(pg);
>> -    }
>> +    return 0;
>>  }
>>  
>>  static void vmx_install_vlapic_mapping(struct vcpu *v)
>>  {
>>      paddr_t virt_page_ma, apic_page_ma;
>>  
>> -    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
>> +    if ( mfn_eq(apic_access_mfn, _mfn(0)) )
> 
> You should check if the domain has a vlapic I think, since now
> apic_access_mfn is global and will be set regardless of whether the
> domain has vlapic enabled or not.
> 
> Previously apic_access_mfn was only allocated if the domain had vlapic
> enabled.

Oh, indeed - thanks for spotting. And also in domain_creation_finished().

>>          return;
>>  
>>      ASSERT(cpu_has_vmx_virtualize_apic_accesses);
>>  
>>      virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
>> -    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
>> +    apic_page_ma = mfn_to_maddr(apic_access_mfn);
>>  
>>      vmx_vmcs_enter(v);
>>      __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);
> 
> I would consider setting up the vmcs and adding the page to the p2m in
> the same function, and likely call it from vlapic_init. We could have
> a domain_setup_apic in hvm_function_table that takes care of all this.

Well, I'd prefer to do this just once per domain without needing
to special-case this on e.g. vCPU 0.

>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
>>          flags = IOMMUF_readable;
>>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
>>              flags |= IOMMUF_writable;
>> +        /* VMX'es APIC access page is global and hence has no owner. */
>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
>> +            flags = 0;
> 
> Is it fine to have this page accessible to devices if the page tables
> are shared between the CPU and the IOMMU?

No, it's not, but what do you do? As said elsewhere, devices
gaining more access than is helpful is the price we pay for
being able to share page tables. But ...

> Is it possible for devices to write to it?

... thinking about it - they would be able to access it only
when interrupt remapping is off. Otherwise the entire range
0xFEExxxxx gets treated differently altogether by the IOMMU,
and is not subject to DMA remapping. IOW it shouldn't even
matter what flags we put in there (and I'd be far less
concerned about the no-intremap case). What this change then
matters for is only avoiding an unnecessary call to iommu_map()
(and, for small enough domains, the allocation of extra
intermediate page tables).

But let me really split this off of this patch.

Jan
Roger Pau Monne Feb. 11, 2021, 11:16 a.m. UTC | #6
On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
> On 11.02.2021 09:45, Roger Pau Monné wrote:
> > On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> >> I did further consider not allocating any real page at all, but just
> >> using the address of some unpopulated space (which would require
> >> announcing this page as reserved to Dom0, so it wouldn't put any PCI
> >> MMIO BARs there). But I thought this would be too controversial, because
> >> of the possible risks associated with this.
> > 
> > No, Xen is not capable of allocating a suitable unpopulated page IMO,
> > so let's not go down that route. Wasting one RAM page seems perfectly
> > fine to me.
> 
> Why would Xen not be able to, in principle? It may be difficult,
> but there may also be pages we could declare we know we can use
> for this purpose.

I was under the impression that there could always be bits in ACPI
dynamic tables that could report MMIO ranges that Xen wasn't aware of,
but those should already be marked as reserved in the memory map
anyway on well-behaved systems.

> >>          return;
> >>  
> >>      ASSERT(cpu_has_vmx_virtualize_apic_accesses);
> >>  
> >>      virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
> >> -    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
> >> +    apic_page_ma = mfn_to_maddr(apic_access_mfn);
> >>  
> >>      vmx_vmcs_enter(v);
> >>      __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);
> > 
> > I would consider setting up the vmcs and adding the page to the p2m in
> > the same function, and likely call it from vlapic_init. We could have
> > a domain_setup_apic in hvm_function_table that takes care of all this.
> 
> Well, I'd prefer to do this just once per domain without needing
> to special case this on e.g. vCPU 0.

It seems more natural to me to do this setup together with the rest of
the vlapic initialization, but I'm not going to insist; I also
understand your point about calling the function only once.

> >> --- a/xen/include/asm-x86/p2m.h
> >> +++ b/xen/include/asm-x86/p2m.h
> >> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
> >>          flags = IOMMUF_readable;
> >>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
> >>              flags |= IOMMUF_writable;
> >> +        /* VMX'es APIC access page is global and hence has no owner. */
> >> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
> >> +            flags = 0;
> > 
> > Is it fine to have this page accessible to devices if the page tables
> > are shared between the CPU and the IOMMU?
> 
> No, it's not, but what do you do? As said elsewhere, devices
> gaining more access than is helpful is the price we pay for
> being able to share page tables. But ...

I'm concerned about allowing devices to write to this shared page, as
it could be used as an unintended way to exchange information between
domains?

> > Is it possible for devices to write to it?
> 
> ... thinking about it - they would be able to access it only
> when interrupt remapping is off. Otherwise the entire range
> 0xFEExxxxx gets treated differently altogether by the IOMMU,

Now that I think of it, the range 0xFEExxxxx must always be specially
handled for device accesses, regardless of whether interrupt remapping
is enabled or not, or else they won't be capable of delivering MSI
messages?

So I assume that whatever gets mapped in the IOMMU pages tables at
0xFEExxxxx simply gets ignored?

Or else mapping the lapic access page when using shared page tables
would have prevented CPU#0 from receiving MSI messages.

Thanks, Roger.
Jan Beulich Feb. 11, 2021, 11:22 a.m. UTC | #7
On 11.02.2021 12:16, Roger Pau Monné wrote:
> On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
>> On 11.02.2021 09:45, Roger Pau Monné wrote:
>>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
>>>> --- a/xen/include/asm-x86/p2m.h
>>>> +++ b/xen/include/asm-x86/p2m.h
>>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
>>>>          flags = IOMMUF_readable;
>>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
>>>>              flags |= IOMMUF_writable;
>>>> +        /* VMX'es APIC access page is global and hence has no owner. */
>>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
>>>> +            flags = 0;
>>>
>>> Is it fine to have this page accessible to devices if the page tables
>>> are shared between the CPU and the IOMMU?
>>
>> No, it's not, but what do you do? As said elsewhere, devices
>> gaining more access than is helpful is the price we pay for
>> being able to share page tables. But ...
> 
> I'm concerned about allowing devices to write to this shared page, as
> could be used as an unintended way to exchange information between
> domains?

Well, such an abuse would be possible, but it wouldn't be part
of an ABI and hence could break at any time. Similarly I
wouldn't consider it an information leak if a guest abused
this.

>>> Is it possible for devices to write to it?
>>
>> ... thinking about it - they would be able to access it only
>> when interrupt remapping is off. Otherwise the entire range
>> 0xFEExxxxx gets treated differently altogether by the IOMMU,
> 
> Now that I think of it, the range 0xFEExxxxx must always be special
> handled for device accesses, regardless of whether interrupt remapping
> is enabled or not, or else they won't be capable of delivering MSI
> messages?

I don't think I know how exactly chipsets handle MSI in this
case, but yes, presumably these accesses need to be routed via a
different path even in that case.

> So I assume that whatever gets mapped in the IOMMU pages tables at
> 0xFEExxxxx simply gets ignored?

This would be the implication, aiui.

> Or else mapping the lapic access page when using shared page tables
> would have prevented CPU#0 from receiving MSI messages.

I guess I don't understand this part. In particular I see
nothing CPU#0 specific here.

Jan
Roger Pau Monne Feb. 11, 2021, 12:27 p.m. UTC | #8
On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
> On 11.02.2021 12:16, Roger Pau Monné wrote:
> > On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
> >> On 11.02.2021 09:45, Roger Pau Monné wrote:
> >>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> >>>> --- a/xen/include/asm-x86/p2m.h
> >>>> +++ b/xen/include/asm-x86/p2m.h
> >>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
> >>>>          flags = IOMMUF_readable;
> >>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
> >>>>              flags |= IOMMUF_writable;
> >>>> +        /* VMX'es APIC access page is global and hence has no owner. */
> >>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
> >>>> +            flags = 0;
> >>>
> >>> Is it fine to have this page accessible to devices if the page tables
> >>> are shared between the CPU and the IOMMU?
> >>
> >> No, it's not, but what do you do? As said elsewhere, devices
> >> gaining more access than is helpful is the price we pay for
> >> being able to share page tables. But ...
> > 
> > I'm concerned about allowing devices to write to this shared page, as
> > could be used as an unintended way to exchange information between
> > domains?
> 
> Well, such an abuse would be possible, but it wouldn't be part
> of an ABI and hence could break at any time. Similarly I
> wouldn't consider it an information leak if a guest abused
> this.

Hm, I'm kind of worried about having such a shared page accessible to
guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
accessible to devices in any way when using IOMMU shared page
tables?

> >>> Is it possible for devices to write to it?
> >>
> >> ... thinking about it - they would be able to access it only
> >> when interrupt remapping is off. Otherwise the entire range
> >> 0xFEExxxxx gets treated differently altogether by the IOMMU,
> > 
> > Now that I think of it, the range 0xFEExxxxx must always be special
> > handled for device accesses, regardless of whether interrupt remapping
> > is enabled or not, or else they won't be capable of delivering MSI
> > messages?
> 
> I don't think I know how exactly chipsets handle MSI in this
> case, but yes, presumably these accesses need to be routed a
> different path even in that case.
> 
> > So I assume that whatever gets mapped in the IOMMU pages tables at
> > 0xFEExxxxx simply gets ignored?
> 
> This would be the implication, aiui.
> 
> > Or else mapping the lapic access page when using shared page tables
> > would have prevented CPU#0 from receiving MSI messages.
> 
> I guess I don't understand this part. In particular I see
> nothing CPU#0 specific here.

Well, the default APIC address is 0xfee00000, which matches the MSI
doorbell for APIC ID 0. APIC ID 1 would instead use 0xfee01000 and
thus the APIC access page mapping won't shadow it anyway.
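
(For reference: in the compatibility format the MSI address carries the
destination APIC ID in bits 19:12, which is what makes 0xfee00000 the
doorbell for APIC ID 0 and 0xfee01000 the one for APIC ID 1. A
simplified helper, ignoring the RH/DM bits:)

static inline uint32_t msi_doorbell_address(uint8_t dest_apic_id)
{
    /* Bits 31:20 fixed at 0xfee, destination ID in bits 19:12. */
    return 0xfee00000u | ((uint32_t)dest_apic_id << 12);
}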

Thanks, Roger.
Andrew Cooper Feb. 11, 2021, 1:53 p.m. UTC | #9
On 11/02/2021 10:36, Jan Beulich wrote:
> On 11.02.2021 09:45, Roger Pau Monné wrote:
>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
>>> I did further consider not allocating any real page at all, but just
>>> using the address of some unpopulated space (which would require
>>> announcing this page as reserved to Dom0, so it wouldn't put any PCI
>>> MMIO BARs there). But I thought this would be too controversial, because
>>> of the possible risks associated with this.
>> No, Xen is not capable of allocating a suitable unpopulated page IMO,
>> so let's not go down that route. Wasting one RAM page seems perfectly
>> fine to me.
> Why would Xen not be able to, in principle? It may be difficult,
> but there may also be pages we could declare we know we can use
> for this purpose.

There are frames we could use for this purpose, but their locations are
platform specific (holes around TSEG).

I'm also not sure about the implications of their use as a DMA target.

>
>>>  static void vmx_install_vlapic_mapping(struct vcpu *v)
>>>  {
>>>      paddr_t virt_page_ma, apic_page_ma;
>>>  
>>> -    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
>>> +    if ( mfn_eq(apic_access_mfn, _mfn(0)) )
>> You should check if the domain has a vlapic I think, since now
>> apic_access_mfn is global and will be set regardless of whether the
>> domain has vlapic enabled or not.
>>
>> Previously apic_access_mfn was only allocated if the domain had vlapic
>> enabled.
> Oh, indeed - thanks for spotting. And also in domain_creation_finished().

Honestly - PVH without LAPIC was a short-sighted move.

It's a prohibited combination at the moment, from XSA-256, and I don't see
any value in lifting the restriction, considering that TPR acceleration
has been around since practically the first generation of HVM support.

~Andrew
Tian, Kevin March 1, 2021, 2:08 a.m. UTC | #10
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, February 11, 2021 1:04 AM
> 
> On 10.02.2021 18:00, Andrew Cooper wrote:
> > On 10/02/2021 16:48, Jan Beulich wrote:
> >> The address of this page is used by the CPU only to recognize when to
> >> instead access the virtual APIC page instead. No accesses would ever go
> >> to this page. It only needs to be present in the (CPU) page tables so
> >> that address translation will produce its address as result for
> >> respective accesses.
> >>
> >> By making this page global, we also eliminate the need to refcount it,
> >> or to assign it to any domain in the first place.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > How certain are you about this?
> 
> The documentation (I'm inclined to say: unexpectedly) is very
> clear about this; I don't think it had been this clear back at
> the time. I'm hoping for Kevin to shout if he's aware of issues
> here.
> 

No, I didn't see an issue here.

Thanks
Kevin
Tian, Kevin March 1, 2021, 2:18 a.m. UTC | #11
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: Thursday, February 11, 2021 8:27 PM
> 
> On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
> > On 11.02.2021 12:16, Roger Pau Monné wrote:
> > > On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
> > >> On 11.02.2021 09:45, Roger Pau Monné wrote:
> > >>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> > >>>> --- a/xen/include/asm-x86/p2m.h
> > >>>> +++ b/xen/include/asm-x86/p2m.h
> > >>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
> > >>>>          flags = IOMMUF_readable;
> > >>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
> > >>>>              flags |= IOMMUF_writable;
> > >>>> +        /* VMX'es APIC access page is global and hence has no owner.
> */
> > >>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
> > >>>> +            flags = 0;
> > >>>
> > >>> Is it fine to have this page accessible to devices if the page tables
> > >>> are shared between the CPU and the IOMMU?
> > >>
> > >> No, it's not, but what do you do? As said elsewhere, devices
> > >> gaining more access than is helpful is the price we pay for
> > >> being able to share page tables. But ...
> > >
> > > I'm concerned about allowing devices to write to this shared page, as
> > > could be used as an unintended way to exchange information between
> > > domains?
> >
> > Well, such an abuse would be possible, but it wouldn't be part
> > of an ABI and hence could break at any time. Similarly I
> > wouldn't consider it an information leak if a guest abused
> > this.
> 
> Hm, I'm kind of worried about having such shared page accessible to
> guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
> accessible to devices in any way when using IOMMU shared page
> tables?

0xFEExxxxx range is special. Requests to this range are not subject to
DMA remapping (even if a valid mapping for this range exists in the 
IOMMU page table). And this special treatment is true regardless of
whether interrupt remapping is enabled (which comes only after an 
interrupt message to this range is recognized).

Thanks
Kevin
Jan Beulich March 1, 2021, 8:15 a.m. UTC | #12
On 01.03.2021 03:18, Tian, Kevin wrote:
>> From: Roger Pau Monné <roger.pau@citrix.com>
>> Sent: Thursday, February 11, 2021 8:27 PM
>>
>> On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
>>> On 11.02.2021 12:16, Roger Pau Monné wrote:
>>>> On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
>>>>> On 11.02.2021 09:45, Roger Pau Monné wrote:
>>>>>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
>>>>>>> --- a/xen/include/asm-x86/p2m.h
>>>>>>> +++ b/xen/include/asm-x86/p2m.h
>>>>>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
>>>>>>>          flags = IOMMUF_readable;
>>>>>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
>>>>>>>              flags |= IOMMUF_writable;
>>>>>>> +        /* VMX'es APIC access page is global and hence has no owner.
>> */
>>>>>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
>>>>>>> +            flags = 0;
>>>>>>
>>>>>> Is it fine to have this page accessible to devices if the page tables
>>>>>> are shared between the CPU and the IOMMU?
>>>>>
>>>>> No, it's not, but what do you do? As said elsewhere, devices
>>>>> gaining more access than is helpful is the price we pay for
>>>>> being able to share page tables. But ...
>>>>
>>>> I'm concerned about allowing devices to write to this shared page, as
>>>> could be used as an unintended way to exchange information between
>>>> domains?
>>>
>>> Well, such an abuse would be possible, but it wouldn't be part
>>> of an ABI and hence could break at any time. Similarly I
>>> wouldn't consider it an information leak if a guest abused
>>> this.
>>
>> Hm, I'm kind of worried about having such shared page accessible to
>> guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
>> accessible to devices in any way when using IOMMU shared page
>> tables?
> 
> 0xFEExxxxx range is special. Requests to this range are not subject to
> DMA remapping (even if a valid mapping for this range exists in the 
> IOMMU page table). And this special treatment is true regardless of
> whether interrupt remapping is enabled (which comes only after an 
> interrupt message to this range is recognized).

For my/our education, could you outline what happens to device
accesses to that range when interrupt remapping is off? And
perhaps also what happens to accesses to this range that don't
match the pattern of an MSI initiation (dword write)? I don't
think I've been able to spot anything to this effect in the docs.

Jan
Tian, Kevin March 1, 2021, 8:30 a.m. UTC | #13
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 1, 2021 4:16 PM
> 
> On 01.03.2021 03:18, Tian, Kevin wrote:
> >> From: Roger Pau Monné <roger.pau@citrix.com>
> >> Sent: Thursday, February 11, 2021 8:27 PM
> >>
> >> On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
> >>> On 11.02.2021 12:16, Roger Pau Monné wrote:
> >>>> On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
> >>>>> On 11.02.2021 09:45, Roger Pau Monné wrote:
> >>>>>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> >>>>>>> --- a/xen/include/asm-x86/p2m.h
> >>>>>>> +++ b/xen/include/asm-x86/p2m.h
> >>>>>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
> >>>>>>>          flags = IOMMUF_readable;
> >>>>>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges,
> mfn_x(mfn)) )
> >>>>>>>              flags |= IOMMUF_writable;
> >>>>>>> +        /* VMX'es APIC access page is global and hence has no owner.
> >> */
> >>>>>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
> >>>>>>> +            flags = 0;
> >>>>>>
> >>>>>> Is it fine to have this page accessible to devices if the page tables
> >>>>>> are shared between the CPU and the IOMMU?
> >>>>>
> >>>>> No, it's not, but what do you do? As said elsewhere, devices
> >>>>> gaining more access than is helpful is the price we pay for
> >>>>> being able to share page tables. But ...
> >>>>
> >>>> I'm concerned about allowing devices to write to this shared page, as
> >>>> could be used as an unintended way to exchange information between
> >>>> domains?
> >>>
> >>> Well, such an abuse would be possible, but it wouldn't be part
> >>> of an ABI and hence could break at any time. Similarly I
> >>> wouldn't consider it an information leak if a guest abused
> >>> this.
> >>
> >> Hm, I'm kind of worried about having such shared page accessible to
> >> guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
> >> accessible to devices in any way when using IOMMU shared page
> >> tables?
> >
> > 0xFEExxxxx range is special. Requests to this range are not subject to
> > DMA remapping (even if a valid mapping for this range exists in the
> > IOMMU page table). And this special treatment is true regardless of
> > whether interrupt remapping is enabled (which comes only after an
> > interrupt message to this range is recognized).
> 
> For my/our education, could you outline what happens to device
> accesses to that range when interrupt remapping is off? And
> perhaps also what happens to accesses to this range that don't
> match the pattern of an MSI initiation (dword write)? I don't
> think I've been able to spot anything to this effect in the docs.
> 

In VT-d spec "3.14 Handling Requests to Interrupt Address Range"
--
On Intel® architecture platforms, physical address range 0xFEEx_xxxx is 
designated as the interrupt address range. Requests without PASID to 
this range are not subjected to DMA remapping (even if translation 
structures specify a mapping for this range).
--
The following types of requests to this range are illegal requests. 
They are blocked and reported as Interrupt Remapping faults.
• Read requests without PASID that are not ZLR.
• Atomics requests without PASID.
• Non-DWORD length write requests without PASID. 
--

Interrupt remapping decides how to interpret the format of the
recognized interrupt message and whether to go through IRTE,
as explained in "5.1.4 Interrupt-Remapping Hardware Operation":
--
An interrupt request is identified by hardware as a DWORD sized 
write request to interrupt address ranges 0xFEEx_xxxx.
• When interrupt-remapping is not enabled (IRES field Clear in Global 
Status Register), all interrupt requests are processed per the Compatibility 
interrupt request format described in Section 5.1.2.1.
• When interrupt-remapping is enabled (IRES field Set in Global Status 
Register), interrupt requests are processed as follows:
...
--

Thanks
Kevin
Jan Beulich March 1, 2021, 9:58 a.m. UTC | #14
On 01.03.2021 09:30, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 1, 2021 4:16 PM
>>
>> On 01.03.2021 03:18, Tian, Kevin wrote:
>>>> From: Roger Pau Monné <roger.pau@citrix.com>
>>>> Sent: Thursday, February 11, 2021 8:27 PM
>>>>
>>>> On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
>>>>> On 11.02.2021 12:16, Roger Pau Monné wrote:
>>>>>> On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
>>>>>>> On 11.02.2021 09:45, Roger Pau Monné wrote:
>>>>>>>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
>>>>>>>>> --- a/xen/include/asm-x86/p2m.h
>>>>>>>>> +++ b/xen/include/asm-x86/p2m.h
>>>>>>>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
>>>>>>>>>          flags = IOMMUF_readable;
>>>>>>>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges,
>> mfn_x(mfn)) )
>>>>>>>>>              flags |= IOMMUF_writable;
>>>>>>>>> +        /* VMX'es APIC access page is global and hence has no owner.
>>>> */
>>>>>>>>> +        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
>>>>>>>>> +            flags = 0;
>>>>>>>>
>>>>>>>> Is it fine to have this page accessible to devices if the page tables
>>>>>>>> are shared between the CPU and the IOMMU?
>>>>>>>
>>>>>>> No, it's not, but what do you do? As said elsewhere, devices
>>>>>>> gaining more access than is helpful is the price we pay for
>>>>>>> being able to share page tables. But ...
>>>>>>
>>>>>> I'm concerned about allowing devices to write to this shared page, as
>>>>>> could be used as an unintended way to exchange information between
>>>>>> domains?
>>>>>
>>>>> Well, such an abuse would be possible, but it wouldn't be part
>>>>> of an ABI and hence could break at any time. Similarly I
>>>>> wouldn't consider it an information leak if a guest abused
>>>>> this.
>>>>
>>>> Hm, I'm kind of worried about having such shared page accessible to
>>>> guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
>>>> accessible to devices in any way when using IOMMU shared page
>>>> tables?
>>>
>>> 0xFEExxxxx range is special. Requests to this range are not subject to
>>> DMA remapping (even if a valid mapping for this range exists in the
>>> IOMMU page table). And this special treatment is true regardless of
>>> whether interrupt remapping is enabled (which comes only after an
>>> interrupt message to this range is recognized).
>>
>> For my/our education, could you outline what happens to device
>> accesses to that range when interrupt remapping is off? And
>> perhaps also what happens to accesses to this range that don't
>> match the pattern of an MSI initiation (dword write)? I don't
>> think I've been able to spot anything to this effect in the docs.
>>
> 
> In VT-d spec "3.14 Handling Requests to Interrupt Address Range"
> --
> On Intel® architecture platforms, physical address range 0xFEEx_xxxx is 
> designated as the interrupt address range. Requests without PASID to 
> this range are not subjected to DMA remapping (even if translation 
> structures specify a mapping for this range).
> --
> The following types of requests to this range are illegal requests. 
> They are blocked and reported as Interrupt Remapping faults.
> • Read requests without PASID that are not ZLR.
> • Atomics requests without PASID.
> • Non-DWORD length write requests without PASID. 
> --

Ah, I see. That's (according to the change bars) a relatively recent
addition. So the above clarifies things for the !PASID case. Am I
interpreting

"Requests-with-PASID with input address in range 0xFEEx_xxxx are
 translated normally like any other request-with-PASID through
 DMA-remapping hardware. However, if such a request is processed
 using pass-through translation, it will be blocked as described
 in the paragraph below.

 Software must not program paging-structure entries to remap any
 address to the interrupt address range. Untranslated requests and
 translation requests that result in an address in the interrupt
 range will be blocked with condition code LGN.4 or SGN.8.
 Translated requests with an address in the interrupt address
 range are treated as Unsupported Request (UR)."

right in that _with_ PASID translation entries for this range would
still be used, so long as they translate to an area outside of the
FEExxxxx range? If so this would mean that with PASID (whenever we
get to enabling this mode of operation) we'd need to avoid sharing
page tables, and we'd need to suppress mirroring of EPT insertions
for this range in the IOMMU page tables. (All of this independent
of the particular choice of the APIC access page.)
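
(Purely to illustrate the range such a suppression would have to
cover - a hypothetical helper, not existing anywhere yet:)

static inline bool addr_in_interrupt_range(paddr_t addr)
{
    /* Interrupt address range 0xFEEx_xxxx per the VT-d spec. */
    return (addr & ~(paddr_t)0xfffff) == 0xfee00000;
}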

Jan
Tian, Kevin March 4, 2021, 7:51 a.m. UTC | #15
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 1, 2021 5:59 PM
> 
> On 01.03.2021 09:30, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 1, 2021 4:16 PM
> >>
> >> On 01.03.2021 03:18, Tian, Kevin wrote:
> >>>> From: Roger Pau Monné <roger.pau@citrix.com>
> >>>> Sent: Thursday, February 11, 2021 8:27 PM
> >>>>
> >>>> On Thu, Feb 11, 2021 at 12:22:41PM +0100, Jan Beulich wrote:
> >>>>> On 11.02.2021 12:16, Roger Pau Monné wrote:
> >>>>>> On Thu, Feb 11, 2021 at 11:36:59AM +0100, Jan Beulich wrote:
> >>>>>>> On 11.02.2021 09:45, Roger Pau Monné wrote:
> >>>>>>>> On Wed, Feb 10, 2021 at 05:48:26PM +0100, Jan Beulich wrote:
> >>>>>>>>> --- a/xen/include/asm-x86/p2m.h
> >>>>>>>>> +++ b/xen/include/asm-x86/p2m.h
> >>>>>>>>> @@ -935,6 +935,9 @@ static inline unsigned int p2m_get_iommu
> >>>>>>>>>          flags = IOMMUF_readable;
> >>>>>>>>>          if ( !rangeset_contains_singleton(mmio_ro_ranges,
> >> mfn_x(mfn)) )
> >>>>>>>>>              flags |= IOMMUF_writable;
> >>>>>>>>> +        /* VMX'es APIC access page is global and hence has no
> owner.
> >>>> */
> >>>>>>>>> +        if ( mfn_valid(mfn)
> && !page_get_owner(mfn_to_page(mfn)) )
> >>>>>>>>> +            flags = 0;
> >>>>>>>>
> >>>>>>>> Is it fine to have this page accessible to devices if the page tables
> >>>>>>>> are shared between the CPU and the IOMMU?
> >>>>>>>
> >>>>>>> No, it's not, but what do you do? As said elsewhere, devices
> >>>>>>> gaining more access than is helpful is the price we pay for
> >>>>>>> being able to share page tables. But ...
> >>>>>>
> >>>>>> I'm concerned about allowing devices to write to this shared page, as
> >>>>>> could be used as an unintended way to exchange information
> between
> >>>>>> domains?
> >>>>>
> >>>>> Well, such an abuse would be possible, but it wouldn't be part
> >>>>> of an ABI and hence could break at any time. Similarly I
> >>>>> wouldn't consider it an information leak if a guest abused
> >>>>> this.
> >>>>
> >>>> Hm, I'm kind of worried about having such shared page accessible to
> >>>> guests. Could Intel confirm whether pages in the 0xFEExxxxx range are
> >>>> accessible to devices in any way when using IOMMU shared page
> >>>> tables?
> >>>
> >>> 0xFEExxxxx range is special. Requests to this range are not subject to
> >>> DMA remapping (even if a valid mapping for this range exists in the
> >>> IOMMU page table). And this special treatment is true regardless of
> >>> whether interrupt remapping is enabled (which comes only after an
> >>> interrupt message to this range is recognized).
> >>
> >> For my/our education, could you outline what happens to device
> >> accesses to that range when interrupt remapping is off? And
> >> perhaps also what happens to accesses to this range that don't
> >> match the pattern of an MSI initiation (dword write)? I don't
> >> think I've been able to spot anything to this effect in the docs.
> >>
> >
> > In VT-d spec "3.14 Handling Requests to Interrupt Address Range"
> > --
> > On Intel® architecture platforms, physical address range 0xFEEx_xxxx is
> > designated as the interrupt address range. Requests without PASID to
> > this range are not subjected to DMA remapping (even if translation
> > structures specify a mapping for this range).
> > --
> > The following types of requests to this range are illegal requests.
> > They are blocked and reported as Interrupt Remapping faults.
> > • Read requests without PASID that are not ZLR.
> > • Atomics requests without PASID.
> > • Non-DWORD length write requests without PASID.
> > --
> 
> Ah, I see. That's (according to the change bars) a relatively recent
> addition. So the above clarifies things for the !PASID case. Am I
> interpreting
> 
> "Requests-with-PASID with input address in range 0xFEEx_xxxx are
>  translated normally like any other request-with-PASID through
>  DMA-remapping hardware. However, if such a request is processed
>  using pass-through translation, it will be blocked as described
>  in the paragraph below.
> 
>  Software must not program paging-structure entries to remap any
>  address to the interrupt address range. Untranslated requests and
>  translation requests that result in an address in the interrupt
>  range will be blocked with condition code LGN.4 or SGN.8.
>  Translated requests with an address in the interrupt address
>  range are treated as Unsupported Request (UR)."
> 
> right in that _with_ PASID translation entries for this range would
> still be used, so long as they translate to an area outside of the
> FEExxxxx range? If so this would mean that with PASID (whenever we

yes

> get to enabling this mode of operation) we'd need to avoid sharing
> page tables, and we'd need to suppress mirroring of EPT insertions
> for this range in the IOMMU page tables. (All of this independent
> of the particular choice of the APIC access page.)
> 

Or you can still share page tables, as long as no DMA will target
this range (which should be the normal case, and any inadvertent
or malicious attempt is blocked anyway).


Thanks
Kevin

Patch

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1007,6 +1007,8 @@  int arch_domain_soft_reset(struct domain
 
 void arch_domain_creation_finished(struct domain *d)
 {
+    if ( is_hvm_domain(d) )
+        hvm_domain_creation_finished(d);
 }
 
 #define xen_vcpu_guest_context vcpu_guest_context
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -66,8 +66,7 @@  boolean_param("force-ept", opt_force_ept
 static void vmx_ctxt_switch_from(struct vcpu *v);
 static void vmx_ctxt_switch_to(struct vcpu *v);
 
-static int  vmx_alloc_vlapic_mapping(struct domain *d);
-static void vmx_free_vlapic_mapping(struct domain *d);
+static int alloc_vlapic_mapping(void);
 static void vmx_install_vlapic_mapping(struct vcpu *v);
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr,
                                 unsigned int flags);
@@ -78,6 +77,8 @@  static int vmx_msr_read_intercept(unsign
 static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg(struct vcpu *v, unsigned long linear);
 
+static mfn_t __read_mostly apic_access_mfn;
+
 /* Values for domain's ->arch.hvm_domain.pi_ops.flags. */
 #define PI_CSW_FROM (1u << 0)
 #define PI_CSW_TO   (1u << 1)
@@ -401,7 +402,6 @@  static int vmx_domain_initialise(struct
         .to   = vmx_ctxt_switch_to,
         .tail = vmx_do_resume,
     };
-    int rc;
 
     d->arch.ctxt_switch = &csw;
 
@@ -411,21 +411,16 @@  static int vmx_domain_initialise(struct
      */
     d->arch.hvm.vmx.exec_sp = is_hardware_domain(d) || opt_ept_exec_sp;
 
-    if ( !has_vlapic(d) )
-        return 0;
-
-    if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 )
-        return rc;
-
     return 0;
 }
 
-static void vmx_domain_relinquish_resources(struct domain *d)
+static void domain_creation_finished(struct domain *d)
 {
-    if ( !has_vlapic(d) )
-        return;
 
-    vmx_free_vlapic_mapping(d);
+    if ( !mfn_eq(apic_access_mfn, _mfn(0)) &&
+         set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE),
+                            apic_access_mfn, PAGE_ORDER_4K) )
+        domain_crash(d);
 }
 
 static void vmx_init_ipt(struct vcpu *v)
@@ -2407,7 +2402,7 @@  static struct hvm_function_table __initd
     .cpu_up_prepare       = vmx_cpu_up_prepare,
     .cpu_dead             = vmx_cpu_dead,
     .domain_initialise    = vmx_domain_initialise,
-    .domain_relinquish_resources = vmx_domain_relinquish_resources,
+    .domain_creation_finished = domain_creation_finished,
     .vcpu_initialise      = vmx_vcpu_initialise,
     .vcpu_destroy         = vmx_vcpu_destroy,
     .save_cpu_ctxt        = vmx_save_vmcs_ctxt,
@@ -2653,7 +2648,7 @@  const struct hvm_function_table * __init
 {
     set_in_cr4(X86_CR4_VMXE);
 
-    if ( vmx_vmcs_init() )
+    if ( vmx_vmcs_init() || alloc_vlapic_mapping() )
     {
         printk("VMX: failed to initialise.\n");
         return NULL;
@@ -3208,7 +3203,7 @@  gp_fault:
     return X86EMUL_EXCEPTION;
 }
 
-static int vmx_alloc_vlapic_mapping(struct domain *d)
+static int __init alloc_vlapic_mapping(void)
 {
     struct page_info *pg;
     mfn_t mfn;
@@ -3216,53 +3211,28 @@  static int vmx_alloc_vlapic_mapping(stru
     if ( !cpu_has_vmx_virtualize_apic_accesses )
         return 0;
 
-    pg = alloc_domheap_page(d, MEMF_no_refcount);
+    pg = alloc_domheap_page(NULL, 0);
     if ( !pg )
         return -ENOMEM;
 
-    if ( !get_page_and_type(pg, d, PGT_writable_page) )
-    {
-        /*
-         * The domain can't possibly know about this page yet, so failure
-         * here is a clear indication of something fishy going on.
-         */
-        domain_crash(d);
-        return -ENODATA;
-    }
-
     mfn = page_to_mfn(pg);
     clear_domain_page(mfn);
-    d->arch.hvm.vmx.apic_access_mfn = mfn;
+    apic_access_mfn = mfn;
 
-    return set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE), mfn,
-                              PAGE_ORDER_4K);
-}
-
-static void vmx_free_vlapic_mapping(struct domain *d)
-{
-    mfn_t mfn = d->arch.hvm.vmx.apic_access_mfn;
-
-    d->arch.hvm.vmx.apic_access_mfn = _mfn(0);
-    if ( !mfn_eq(mfn, _mfn(0)) )
-    {
-        struct page_info *pg = mfn_to_page(mfn);
-
-        put_page_alloc_ref(pg);
-        put_page_and_type(pg);
-    }
+    return 0;
 }
 
 static void vmx_install_vlapic_mapping(struct vcpu *v)
 {
     paddr_t virt_page_ma, apic_page_ma;
 
-    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
+    if ( mfn_eq(apic_access_mfn, _mfn(0)) )
         return;
 
     ASSERT(cpu_has_vmx_virtualize_apic_accesses);
 
     virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
-    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
+    apic_page_ma = mfn_to_maddr(apic_access_mfn);
 
     vmx_vmcs_enter(v);
     __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -106,6 +106,7 @@  struct hvm_function_table {
      * Initialise/destroy HVM domain/vcpu resources
      */
     int  (*domain_initialise)(struct domain *d);
+    void (*domain_creation_finished)(struct domain *d);
     void (*domain_relinquish_resources)(struct domain *d);
     void (*domain_destroy)(struct domain *d);
     int  (*vcpu_initialise)(struct vcpu *v);
@@ -390,6 +391,12 @@  static inline bool hvm_has_set_descripto
     return hvm_funcs.set_descriptor_access_exiting;
 }
 
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    if ( hvm_funcs.domain_creation_finished )
+        alternative_vcall(hvm_funcs.domain_creation_finished, d);
+}
+
 static inline int
 hvm_guest_x86_mode(struct vcpu *v)
 {
@@ -765,6 +772,11 @@  static inline void hvm_invlpg(const stru
 {
     ASSERT_UNREACHABLE();
 }
+
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    ASSERT_UNREACHABLE();
+}
 
 /*
  * Shadow code needs further cleanup to eliminate some HVM-only paths. For
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -58,7 +58,6 @@  struct ept_data {
 #define _VMX_DOMAIN_PML_ENABLED    0
 #define VMX_DOMAIN_PML_ENABLED     (1ul << _VMX_DOMAIN_PML_ENABLED)
 struct vmx_domain {
-    mfn_t apic_access_mfn;
     /* VMX_DOMAIN_* */
     unsigned int status;
 
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -935,6 +935,9 @@  static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
             flags |= IOMMUF_writable;
+        /* VMX'es APIC access page is global and hence has no owner. */
+        if ( mfn_valid(mfn) && !page_get_owner(mfn_to_page(mfn)) )
+            flags = 0;
         break;
     default:
         flags = 0;