
x86 / iommu: set up a scratch page in the quarantine domain

Message ID 20191120120859.1846-1-pdurrant@amazon.com (mailing list archive)
State Superseded
Series: x86 / iommu: set up a scratch page in the quarantine domain

Commit Message

Paul Durrant Nov. 20, 2019, 12:08 p.m. UTC
This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read/wrote scratch page to serve as the source/
target for all DMA whilst a device is assigned to dom_io.

The reason for doing this is that some hardware may continue to re-try
DMA, despite FLR, in the event of an error. Having a scratch page mapped
will allow pending DMA to drain and thus quiesce such buggy hardware.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Wei Liu <wl@xen.org>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
---
 xen/drivers/passthrough/amd/iommu_map.c       | 57 +++++++++++++++
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |  9 +--
 xen/drivers/passthrough/iommu.c               | 25 ++++++-
 xen/drivers/passthrough/vtd/iommu.c           | 71 +++++++++++++++----
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |  2 +
 xen/include/xen/iommu.h                       |  1 +
 6 files changed, 143 insertions(+), 22 deletions(-)

Comments

Jan Beulich Nov. 20, 2019, 1:51 p.m. UTC | #1
On 20.11.2019 13:08, Paul Durrant wrote:
> This patch introduces a new iommu_op to facilitate a per-implementation
> quarantine set up, and then further code for x86 implementations
> (amd and vtd) to set up a read/wrote scratch page to serve as the source/
> target for all DMA whilst a device is assigned to dom_io.

A single page in the system won't do, I'm afraid. If one guest's
(prior) device is retrying reads with data containing secrets of that
guest, another guest's (prior) device could end up writing this data
to e.g. storage where after a guest restart it is then available to
the wrong guest.

Also nit: s/wrote/write/ .

> The reason for doing this is that some hardware may continue to re-try
> DMA, despite FLR, in the event of an error. Having a scratch page mapped
> will allow pending DMA to drain and thus quiesce such buggy hardware.

Without a "sink" page mapped, this would result in IOMMU faults aiui.
What's the problem with having these faults surface and get handled,
eventually leading to the device getting bus-mastering disabled? Is
it that devices continue DMAing even when bus-mastering is off? If
so, is it even safe to pass through any such device? In any event
the description needs to be extended here.

> Signed-off-by: Paul Durrant <pdurrant@amazon.com>

What about Arm? Can devices which Arm allows to assign to guests
also "babble" like this after de-assignment? If not, this should be
said in the description. If so, obviously that side would also want
fixing.

> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -560,6 +560,63 @@ int amd_iommu_reserve_domain_unity_map(struct domain *domain,
>      return rt;
>  }
>  
> +int amd_iommu_quarantine_init(struct domain *d)

__init

> +{
> +    struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level;
> +    struct amd_iommu_pte *table;
> +
> +    if ( hd->arch.root_table )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return 0;
> +    }
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +
> +    level = hd->arch.paging_mode;

With DomIO being PV in principle, this is going to be the
fixed value PV domains get set, merely depending on RAM size at
boot time (i.e. not accounting for memory hotplug). This could
be easily too little for HVM guests, which are free to extend
their GFN (and hence DFN) space. Therefore I think you need to
set the maximum possible level here.

> +    hd->arch.root_table = alloc_amd_iommu_pgtable();
> +    if ( !hd->arch.root_table )
> +        goto out;
> +
> +    table = __map_domain_page(hd->arch.root_table);
> +    while ( level )
> +    {
> +        struct page_info *pg;
> +        unsigned int i;
> +
> +        /*
> +         * The pgtable allocator is fine for the leaf page, as well as
> +         * page table pages.
> +         */
> +        pg = alloc_amd_iommu_pgtable();
> +        if ( !pg )
> +            break;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
> +        {
> +            struct amd_iommu_pte *pde = &table[i];
> +
> +            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
> +                                  true, true);

This would also benefit from a comment indicating that it's fine
for the leaf level as well, despite the "pde" in the name (and
its sibling set_iommu_pte_present() actually existing). Of course
you could as well extend the comment a few lines up.

What you do need to do though is pre-fill the leaf page ...

> +        }
> +
> +        unmap_domain_page(table);
> +        table = __map_domain_page(pg);
> +        level--;
> +    }

... here, such that possible left over secrets can't end up
getting written to e.g. persistent storage or over a network.
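
For illustration, a minimal (untested) sketch of such scrubbing inside the
allocation loop, assuming clear_domain_page() is usable at this point:

        /* Scrub the leaf page so stale data can't be read back via DMA. */
        if ( level == 1 )
            clear_domain_page(page_to_mfn(pg));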

> @@ -2683,9 +2671,68 @@ static void vtd_dump_p2m_table(struct domain *d)
>      vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
>  }
>  
> +static int intel_iommu_quarantine_init(struct domain *d)

__init again.

> +{
> +    struct domain_iommu *hd = dom_iommu(d);
> +    struct dma_pte *parent;
> +    unsigned int level = agaw_to_level(hd->arch.agaw);

Other than for AMD this is not a problem here, as all domains
(currently) get the same AGAW. I wonder though whether precautions
would be possible here against the "normal" domain setting getting
adjusted without recalling the need to come back here.

> +    int rc;
> +
> +    if ( hd->arch.pgd_maddr )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return 0;
> +    }
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +
> +    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
> +    if ( !hd->arch.pgd_maddr )
> +        goto out;
> +
> +    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.pgd_maddr);

Unnecessary cast; funnily enough you don't have one further down.

> +    while ( level )
> +    {
> +        uint64_t maddr;
> +        unsigned int offset;
> +
> +        /*
> +         * The pgtable allocator is fine for the leaf page, as well as
> +         * page table pages.
> +         */
> +        maddr = alloc_pgtable_maddr(1, hd->node);
> +        if ( !maddr )
> +            break;
> +
> +        for ( offset = 0; offset < PTE_NUM; offset++ )
> +        {
> +            struct dma_pte *pte = &parent[offset];
> +
> +            dma_set_pte_addr(*pte, maddr);
> +            dma_set_pte_readable(*pte);
> +            dma_set_pte_writable(*pte);
> +        }
> +        iommu_flush_cache_page(parent, 1);
> +
> +        unmap_vtd_domain_page(parent);
> +        parent = map_vtd_domain_page(maddr);
> +        level--;
> +    }

The leaf page wants scrubbing here as well.

Jan
Tian, Kevin Nov. 25, 2019, 8:21 a.m. UTC | #2
> From: Paul Durrant [mailto:pdurrant@amazon.com]
> Sent: Wednesday, November 20, 2019 8:09 PM
> 
> This patch introduces a new iommu_op to facilitate a per-implementation
> quarantine set up, and then further code for x86 implementations
> (amd and vtd) to set up a read/wrote scratch page to serve as the source/
> target for all DMA whilst a device is assigned to dom_io.
> 
> The reason for doing this is that some hardware may continue to re-try
> DMA, despite FLR, in the event of an error. Having a scratch page mapped
> will allow pending DMA to drain and thus quiesce such buggy hardware.

then there are no diagnostics at all since all faults are quiescent now...
why do we want to support such buggy hardware? Is it better to make
it a default-off option, since buggy hardware is supposed to be a niche case?

> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Wei Liu <wl@xen.org>
> Cc: "Roger Pau Monné" <roger.pau@citrix.com>
> ---
>  xen/drivers/passthrough/amd/iommu_map.c       | 57 +++++++++++++++
>  xen/drivers/passthrough/amd/pci_amd_iommu.c   |  9 +--
>  xen/drivers/passthrough/iommu.c               | 25 ++++++-
>  xen/drivers/passthrough/vtd/iommu.c           | 71 +++++++++++++++----
>  xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |  2 +
>  xen/include/xen/iommu.h                       |  1 +
>  6 files changed, 143 insertions(+), 22 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/amd/iommu_map.c
> b/xen/drivers/passthrough/amd/iommu_map.c
> index cd5c7de7c5..8440ccf1c1 100644
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -560,6 +560,63 @@ int
> amd_iommu_reserve_domain_unity_map(struct domain *domain,
>      return rt;
>  }
> 
> +int amd_iommu_quarantine_init(struct domain *d)
> +{
> +    struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level;
> +    struct amd_iommu_pte *table;
> +
> +    if ( hd->arch.root_table )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return 0;
> +    }
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +
> +    level = hd->arch.paging_mode;
> +
> +    hd->arch.root_table = alloc_amd_iommu_pgtable();
> +    if ( !hd->arch.root_table )
> +        goto out;
> +
> +    table = __map_domain_page(hd->arch.root_table);
> +    while ( level )
> +    {
> +        struct page_info *pg;
> +        unsigned int i;
> +
> +        /*
> +         * The pgtable allocator is fine for the leaf page, as well as
> +         * page table pages.
> +         */
> +        pg = alloc_amd_iommu_pgtable();
> +        if ( !pg )
> +            break;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
> +        {
> +            struct amd_iommu_pte *pde = &table[i];
> +
> +            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
> +                                  true, true);
> +        }
> +
> +        unmap_domain_page(table);
> +        table = __map_domain_page(pg);
> +        level--;
> +    }
> +    unmap_domain_page(table);
> +
> + out:
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    amd_iommu_flush_all_pages(d);
> +
> +    /* Pages leaked in failure case */
> +    return level ? -ENOMEM : 0;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 75a0f1b4ab..c7858b4e8f 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -95,10 +95,6 @@ static void amd_iommu_setup_domain_device(
>      u8 bus = pdev->bus;
>      const struct domain_iommu *hd = dom_iommu(domain);
> 
> -    /* dom_io is used as a sentinel for quarantined devices */
> -    if ( domain == dom_io )
> -        return;
> -
>      BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
>              !iommu->dev_table.buffer );
> 
> @@ -290,10 +286,6 @@ static void
> amd_iommu_disable_domain_device(const struct domain *domain,
>      int req_id;
>      u8 bus = pdev->bus;
> 
> -    /* dom_io is used as a sentinel for quarantined devices */
> -    if ( domain == dom_io )
> -        return;
> -
>      BUG_ON ( iommu->dev_table.buffer == NULL );
>      req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
>      table = iommu->dev_table.buffer;
> @@ -632,6 +624,7 @@ static void amd_dump_p2m_table(struct domain
> *d)
>  static const struct iommu_ops __initconstrel _iommu_ops = {
>      .init = amd_iommu_domain_init,
>      .hwdom_init = amd_iommu_hwdom_init,
> +    .quarantine_init = amd_iommu_quarantine_init,
>      .add_device = amd_iommu_add_device,
>      .remove_device = amd_iommu_remove_device,
>      .assign_device  = amd_iommu_assign_device,
> diff --git a/xen/drivers/passthrough/iommu.c
> b/xen/drivers/passthrough/iommu.c
> index 8cbe908fff..25283263d7 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -440,6 +440,28 @@ int iommu_iotlb_flush_all(struct domain *d,
> unsigned int flush_flags)
>      return rc;
>  }
> 
> +static int __init iommu_quarantine_init(void)
> +{
> +    const struct domain_iommu *hd = dom_iommu(dom_io);
> +    int rc;
> +
> +    dom_io->options |= XEN_DOMCTL_CDF_iommu;
> +
> +    rc = iommu_domain_init(dom_io, 0);
> +    if ( rc )
> +        return rc;
> +
> +    if ( !hd->platform_ops->quarantine_init )
> +        return 0;
> +
> +    rc = hd->platform_ops->quarantine_init(dom_io);
> +
> +    if ( !rc )
> +        printk("Quarantine initialized\n");
> +
> +    return rc;
> +}
> +
>  int __init iommu_setup(void)
>  {
>      int rc = -ENODEV;
> @@ -473,8 +495,7 @@ int __init iommu_setup(void)
>      }
>      else
>      {
> -        dom_io->options |= XEN_DOMCTL_CDF_iommu;
> -        if ( iommu_domain_init(dom_io, 0) )
> +        if ( iommu_quarantine_init() )
>              panic("Could not set up quarantine\n");
> 
>          printk(" - Dom0 mode: %s\n",
> diff --git a/xen/drivers/passthrough/vtd/iommu.c
> b/xen/drivers/passthrough/vtd/iommu.c
> index 25ad649c34..c20f2ca029 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1291,10 +1291,6 @@ int domain_context_mapping_one(
>      int agaw, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> -    /* dom_io is used as a sentinel for quarantined devices */
> -    if ( domain == dom_io )
> -        return 0;
> -
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
>      maddr = bus_to_context_maddr(iommu, bus);
> @@ -1541,10 +1537,6 @@ int domain_context_unmap_one(
>      int iommu_domid, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> -    /* dom_io is used as a sentinel for quarantined devices */
> -    if ( domain == dom_io )
> -        return 0;
> -
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
> 
> @@ -1677,10 +1669,6 @@ static int domain_context_unmap(struct
> domain *domain, u8 devfn,
>          goto out;
>      }
> 
> -    /* dom_io is used as a sentinel for quarantined devices */
> -    if ( domain == dom_io )
> -        goto out;
> -
>      /*
>       * if no other devices under the same iommu owned by this domain,
>       * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
> @@ -2683,9 +2671,68 @@ static void vtd_dump_p2m_table(struct domain
> *d)
>      vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
>  }
> 
> +static int intel_iommu_quarantine_init(struct domain *d)
> +{
> +    struct domain_iommu *hd = dom_iommu(d);
> +    struct dma_pte *parent;
> +    unsigned int level = agaw_to_level(hd->arch.agaw);
> +    int rc;
> +
> +    if ( hd->arch.pgd_maddr )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return 0;
> +    }
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +
> +    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
> +    if ( !hd->arch.pgd_maddr )
> +        goto out;
> +
> +    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.pgd_maddr);
> +    while ( level )
> +    {
> +        uint64_t maddr;
> +        unsigned int offset;
> +
> +        /*
> +         * The pgtable allocator is fine for the leaf page, as well as
> +         * page table pages.
> +         */
> +        maddr = alloc_pgtable_maddr(1, hd->node);
> +        if ( !maddr )
> +            break;
> +
> +        for ( offset = 0; offset < PTE_NUM; offset++ )
> +        {
> +            struct dma_pte *pte = &parent[offset];
> +
> +            dma_set_pte_addr(*pte, maddr);
> +            dma_set_pte_readable(*pte);
> +            dma_set_pte_writable(*pte);
> +        }
> +        iommu_flush_cache_page(parent, 1);
> +
> +        unmap_vtd_domain_page(parent);
> +        parent = map_vtd_domain_page(maddr);
> +        level--;
> +    }
> +    unmap_vtd_domain_page(parent);
> +
> + out:
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    rc = iommu_flush_iotlb_all(d);
> +
> +    /* Pages leaked in failure case */
> +    return level ? -ENOMEM : rc;
> +}
> +
>  const struct iommu_ops __initconstrel intel_iommu_ops = {
>      .init = intel_iommu_domain_init,
>      .hwdom_init = intel_iommu_hwdom_init,
> +    .quarantine_init = intel_iommu_quarantine_init,
>      .add_device = intel_iommu_add_device,
>      .enable_device = intel_iommu_enable_device,
>      .remove_device = intel_iommu_remove_device,
> diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> index 8ed9482791..39fb10f567 100644
> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> @@ -54,6 +54,8 @@ int amd_iommu_init_late(void);
>  int amd_iommu_update_ivrs_mapping_acpi(void);
>  int iov_adjust_irq_affinities(void);
> 
> +int amd_iommu_quarantine_init(struct domain *d);
> +
>  /* mapping functions */
>  int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
>                                      mfn_t mfn, unsigned int flags,
> diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
> index 974bd3ffe8..6977ddbb97 100644
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -211,6 +211,7 @@ typedef int iommu_grdm_t(xen_pfn_t start,
> xen_ulong_t nr, u32 id, void *ctxt);
>  struct iommu_ops {
>      int (*init)(struct domain *d);
>      void (*hwdom_init)(struct domain *d);
> +    int (*quarantine_init)(struct domain *d);
>      int (*add_device)(u8 devfn, device_t *dev);
>      int (*enable_device)(device_t *dev);
>      int (*remove_device)(u8 devfn, device_t *dev);
> --
> 2.20.1
Paul Durrant Nov. 27, 2019, 3:18 p.m. UTC | #3
> -----Original Message-----
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: 25 November 2019 08:22
> To: Durrant, Paul <pdurrant@amazon.com>; xen-devel@lists.xenproject.org
> Cc: Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>
> Subject: RE: [PATCH] x86 / iommu: set up a scratch page in the quarantine
> domain
> 
> > From: Paul Durrant [mailto:pdurrant@amazon.com]
> > Sent: Wednesday, November 20, 2019 8:09 PM
> >
> > This patch introduces a new iommu_op to facilitate a per-implementation
> > quarantine set up, and then further code for x86 implementations
> > (amd and vtd) to set up a read/wrote scratch page to serve as the
> source/
> > target for all DMA whilst a device is assigned to dom_io.
> >
> > The reason for doing this is that some hardware may continue to re-try
> > DMA, despite FLR, in the event of an error. Having a scratch page mapped
> > will allow pending DMA to drain and thus quiesce such buggy hardware.
> 
> then there are no diagnostics at all since all faults are quiescent now...
> why do we want to support such buggy hardware? Is it better to make
> it a default-off option, since buggy hardware is supposed to be a niche case?

I guess it could be a command line option... perhaps making the new 'iommu=quarantine' boolean into something more complex, but I'm not sure it's really worth it. Perhaps a compile time option would be better?
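
For concreteness, a sketch of what a separate, default-off knob could look
like, piggy-backing on the existing parse_iommu_param() / parse_boolean()
handling (all names below are illustrative only, nothing is settled):

    static bool __read_mostly iommu_quarantine_scratch; /* default off */

    /* ... in parse_iommu_param() ... */
    else if ( (val = parse_boolean("quarantine-scratch", s, ss)) >= 0 )
        iommu_quarantine_scratch = val;

    /* ... in iommu_quarantine_init() ... */
    if ( !iommu_quarantine_scratch || !hd->platform_ops->quarantine_init )
        return 0;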

  Paul
Jan Beulich Nov. 27, 2019, 3:26 p.m. UTC | #4
On 27.11.2019 16:18,  Durrant, Paul  wrote:
>> -----Original Message-----
>> From: Tian, Kevin <kevin.tian@intel.com>
>> Sent: 25 November 2019 08:22
>> To: Durrant, Paul <pdurrant@amazon.com>; xen-devel@lists.xenproject.org
>> Cc: Jan Beulich <jbeulich@suse.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
>> <roger.pau@citrix.com>
>> Subject: RE: [PATCH] x86 / iommu: set up a scratch page in the quarantine
>> domain
>>
>>> From: Paul Durrant [mailto:pdurrant@amazon.com]
>>> Sent: Wednesday, November 20, 2019 8:09 PM
>>>
>>> This patch introduces a new iommu_op to facilitate a per-implementation
>>> quarantine set up, and then further code for x86 implementations
>>> (amd and vtd) to set up a read/wrote scratch page to serve as the
>> source/
>>> target for all DMA whilst a device is assigned to dom_io.
>>>
>>> The reason for doing this is that some hardware may continue to re-try
>>> DMA, despite FLR, in the event of an error. Having a scratch page mapped
>>> will allow pending DMA to drain and thus quiesce such buggy hardware.
>>
>> then there is no diagnostics at all since all faults are quiescent now...
>> why do we want to support such buggy hardware? Is it better to make
>> it an default-off option since buggy is supposed to niche case?
> 
> I guess it could be a command line option... perhaps making the new
> 'iommu=quarantine' boolean into something more complex, but I'm not
> sure it's really worth it. Perhaps a compile time option would be
> better?

Yet another option: How about installing the scratch page mappings
only after a (handful of) IOMMU faults? But of course there was the
related earlier question of whether indeed our turning off of bus
mastering doesn't already help silencing the faults.

Jan
Paul Durrant Nov. 27, 2019, 3:31 p.m. UTC | #5
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
> Beulich
> Sent: 20 November 2019 13:52
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: xen-devel@lists.xenproject.org; Kevin Tian <kevin.tian@intel.com>;
> Roger Pau Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; Andrew
> Cooper <andrew.cooper3@citrix.com>
> Subject: Re: [Xen-devel] [PATCH] x86 / iommu: set up a scratch page in the
> quarantine domain
> 
> On 20.11.2019 13:08, Paul Durrant wrote:
> > This patch introduces a new iommu_op to facilitate a per-implementation
> > quarantine set up, and then further code for x86 implementations
> > (amd and vtd) to set up a read/wrote scratch page to serve as the
> source/
> > target for all DMA whilst a device is assigned to dom_io.
> 
> A single page in the system won't do, I'm afraid. If one guest's
> (prior) device is retrying reads with data containing secrets of that
> guest, another guest's (prior) device could end up writing this data
> to e.g. storage where after a guest restart it is then available to
> the wrong guest.
> 

True. I was unsure whether this was a concern in the scenarios we had to deal with but I'm informed it is, and in the general case it is too.

> Also nit: s/wrote/write/ .
> 

Yep. Will fix.

> > The reason for doing this is that some hardware may continue to re-try
> > DMA, despite FLR, in the event of an error. Having a scratch page mapped
> > will allow pending DMA to drain and thus quiesce such buggy hardware.
> 
> Without a "sink" page mapped, this would result in IOMMU faults aiui.
> What's the problem with having these faults surface and get handled,
> eventually leading to the device getting bus-mastering disabled? Is
> it that devices continue DMAing even when bus-mastering is off? If
> so, is it even safe to pass through any such device? In any event
> the description needs to be extended here.
> 

The devices in question ignore both FLR and BME, and some IOMMU faults are fatal. I believe, however, that write faults are not, and so I think a single read-only 'source' page will be sufficient.
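
In terms of this patch that would just mean dropping the write permission
when the entries are filled in, e.g. (sketch only, re-using the helpers
already used in the patch):

    /* AMD: entries readable but not writable (iw = false, ir = true) */
    set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
                          false, true);

    /* VT-d: likewise, only set the read bit */
    dma_set_pte_addr(*pte, maddr);
    dma_set_pte_readable(*pte);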

> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> 
> What about Arm? Can devices which Arm allows to assign to guests
> also "babble" like this after de-assignment? If not, this should be
> said in the description. If so, obviously that side would also want
> fixing.
> 
> > --- a/xen/drivers/passthrough/amd/iommu_map.c
> > +++ b/xen/drivers/passthrough/amd/iommu_map.c
> > @@ -560,6 +560,63 @@ int amd_iommu_reserve_domain_unity_map(struct
> domain *domain,
> >      return rt;
> >  }
> >
> > +int amd_iommu_quarantine_init(struct domain *d)
> 
> __init
> 

Ok.

> > +{
> > +    struct domain_iommu *hd = dom_iommu(d);
> > +    unsigned int level;
> > +    struct amd_iommu_pte *table;
> > +
> > +    if ( hd->arch.root_table )
> > +    {
> > +        ASSERT_UNREACHABLE();
> > +        return 0;
> > +    }
> > +
> > +    spin_lock(&hd->arch.mapping_lock);
> > +
> > +    level = hd->arch.paging_mode;
> 
> With DomIO being PV in principle, this is going to be the
> fixed value PV domains get set, merely depending on RAM size at
> boot time (i.e. not accounting for memory hotplug). This could
> be easily too little for HVM guests, which are free to extend
> their GFN (and hence DFN) space. Therefore I think you need to
> set the maximum possible level here.
>

Ok. I'd not considered memory hotplug. I'll use a static maximum value instead, as VT-d does in general.
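
E.g. something like the following in place of the 'level =
hd->arch.paging_mode;' assignment (a sketch; DEFAULT_DOMAIN_ADDRESS_WIDTH
and amd_iommu_get_paging_mode() are assumptions on my part rather than
guaranteed interfaces):

    unsigned long end_gfn =
        1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
    unsigned int level = amd_iommu_get_paging_mode(end_gfn);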

> > +    hd->arch.root_table = alloc_amd_iommu_pgtable();
> > +    if ( !hd->arch.root_table )
> > +        goto out;
> > +
> > +    table = __map_domain_page(hd->arch.root_table);
> > +    while ( level )
> > +    {
> > +        struct page_info *pg;
> > +        unsigned int i;
> > +
> > +        /*
> > +         * The pgtable allocator is fine for the leaf page, as well as
> > +         * page table pages.
> > +         */
> > +        pg = alloc_amd_iommu_pgtable();
> > +        if ( !pg )
> > +            break;
> > +
> > +        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
> > +        {
> > +            struct amd_iommu_pte *pde = &table[i];
> > +
> > +            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
> > +                                  true, true);
> 
> This would also benefit from a comment indicating that it's fine
> for the leaf level as well, despite the "pde" in the name (and
> its sibling set_iommu_pte_present() actually existing). Of course
> you could as well extend the comment a few lines up.

The AMD IOMMU conflates PDE and PTE all over the place but I'll add a comment here to that effect.

> 
> What you do need to do though is pre-fill the leaf page ...
> 
> > +        }
> > +
> > +        unmap_domain_page(table);
> > +        table = __map_domain_page(pg);
> > +        level--;
> > +    }
> 
> ... here, such that possible left over secrets can't end up
> getting written to e.g. persistent storage or over a network.
> 

That's actually one reason for using the pgtable allocator... It already does that.

> > @@ -2683,9 +2671,68 @@ static void vtd_dump_p2m_table(struct domain *d)
> >      vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
> >  }
> >
> > +static int intel_iommu_quarantine_init(struct domain *d)
> 
> __init again.
> 

Ok.

> > +{
> > +    struct domain_iommu *hd = dom_iommu(d);
> > +    struct dma_pte *parent;
> > +    unsigned int level = agaw_to_level(hd->arch.agaw);
> 
> Other than for AMD this is not a problem here, as all domains
> (currently) get the same AGAW. I wonder though whether precautions
> would be possible here against the "normal" domain setting getting
> adjusted without recalling the need to come back here.
> 

I could just go for a hardcoded width value here too.
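
I.e. something like (sketch; width_to_agaw() and
DEFAULT_DOMAIN_ADDRESS_WIDTH are assumed to be usable here):

    unsigned int level =
        agaw_to_level(width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH));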

> > +    int rc;
> > +
> > +    if ( hd->arch.pgd_maddr )
> > +    {
> > +        ASSERT_UNREACHABLE();
> > +        return 0;
> > +    }
> > +
> > +    spin_lock(&hd->arch.mapping_lock);
> > +
> > +    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
> > +    if ( !hd->arch.pgd_maddr )
> > +        goto out;
> > +
> > +    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.pgd_maddr);
> 
> Unnecessary cast; funnily enough you don't have one further down.
> 

Sorry. Cut'n'paste.

> > +    while ( level )
> > +    {
> > +        uint64_t maddr;
> > +        unsigned int offset;
> > +
> > +        /*
> > +         * The pgtable allocator is fine for the leaf page, as well as
> > +         * page table pages.
> > +         */
> > +        maddr = alloc_pgtable_maddr(1, hd->node);
> > +        if ( !maddr )
> > +            break;
> > +
> > +        for ( offset = 0; offset < PTE_NUM; offset++ )
> > +        {
> > +            struct dma_pte *pte = &parent[offset];
> > +
> > +            dma_set_pte_addr(*pte, maddr);
> > +            dma_set_pte_readable(*pte);
> > +            dma_set_pte_writable(*pte);
> > +        }
> > +        iommu_flush_cache_page(parent, 1);
> > +
> > +        unmap_vtd_domain_page(parent);
> > +        parent = map_vtd_domain_page(maddr);
> > +        level--;
> > +    }
> 
> The leaf page wants scrubbing here as well.
> 

Already done by alloc_pgtable_maddr(). I'll note in the comment in this hunk and the AMD one that the pages are zeroed.
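
E.g. extending the existing comment along these lines (sketch):

        /*
         * The pgtable allocator is fine for the leaf page, as well as
         * page table pages, and the resulting allocations are always
         * zeroed.
         */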

  Paul

> Jan
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
Paul Durrant Nov. 27, 2019, 3:32 p.m. UTC | #6
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 27 November 2019 15:26
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: Kevin Tian <kevin.tian@intel.com>; xen-devel@lists.xenproject.org;
> Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Wei Liu <wl@xen.org>
> Subject: Re: [Xen-devel] [PATCH] x86 / iommu: set up a scratch page in the
> quarantine domain
> 
> On 27.11.2019 16:18,  Durrant, Paul  wrote:
> >> -----Original Message-----
> >> From: Tian, Kevin <kevin.tian@intel.com>
> >> Sent: 25 November 2019 08:22
> >> To: Durrant, Paul <pdurrant@amazon.com>; xen-devel@lists.xenproject.org
> >> Cc: Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> >> <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> >> <roger.pau@citrix.com>
> >> Subject: RE: [PATCH] x86 / iommu: set up a scratch page in the
> quarantine
> >> domain
> >>
> >>> From: Paul Durrant [mailto:pdurrant@amazon.com]
> >>> Sent: Wednesday, November 20, 2019 8:09 PM
> >>>
> >>> This patch introduces a new iommu_op to facilitate a per-
> implementation
> >>> quarantine set up, and then further code for x86 implementations
> >>> (amd and vtd) to set up a read/wrote scratch page to serve as the
> >> source/
> >>> target for all DMA whilst a device is assigned to dom_io.
> >>>
> >>> The reason for doing this is that some hardware may continue to re-try
> >>> DMA, despite FLR, in the event of an error. Having a scratch page
> mapped
> >>> will allow pending DMA to drain and thus quiesce such buggy hardware.
> >>
> >> then there are no diagnostics at all since all faults are quiescent
> >> now...
> >> why do we want to support such buggy hardware? Is it better to make
> >> it a default-off option, since buggy hardware is supposed to be a niche case?
> >
> > I guess it could be a command line option... perhaps making the new
> > 'iommu=quarantine' boolean into something more complex, but I'm not
> > sure it's really worth it. Perhaps a compile time option would be
> > better?
> 
> Yet another option: How about installing the scratch page mappings
> only after a (handful of) IOMMU faults? But of course there was the
> related earlier question of whether indeed our turning off of bus
> mastering doesn't already help silencing the faults.

No. Unfortunately the h/w has zero tolerance for some faults.

  Paul

> 
> Jan

Patch

diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index cd5c7de7c5..8440ccf1c1 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -560,6 +560,63 @@  int amd_iommu_reserve_domain_unity_map(struct domain *domain,
     return rt;
 }
 
+int amd_iommu_quarantine_init(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level;
+    struct amd_iommu_pte *table;
+
+    if ( hd->arch.root_table )
+    {
+        ASSERT_UNREACHABLE();
+        return 0;
+    }
+
+    spin_lock(&hd->arch.mapping_lock);
+
+    level = hd->arch.paging_mode;
+
+    hd->arch.root_table = alloc_amd_iommu_pgtable();
+    if ( !hd->arch.root_table )
+        goto out;
+
+    table = __map_domain_page(hd->arch.root_table);
+    while ( level )
+    {
+        struct page_info *pg;
+        unsigned int i;
+
+        /*
+         * The pgtable allocator is fine for the leaf page, as well as
+         * page table pages.
+         */
+        pg = alloc_amd_iommu_pgtable();
+        if ( !pg )
+            break;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
+        {
+            struct amd_iommu_pte *pde = &table[i];
+
+            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
+                                  true, true);
+        }
+
+        unmap_domain_page(table);
+        table = __map_domain_page(pg);
+        level--;
+    }
+    unmap_domain_page(table);
+
+ out:
+    spin_unlock(&hd->arch.mapping_lock);
+
+    amd_iommu_flush_all_pages(d);
+
+    /* Pages leaked in failure case */
+    return level ? -ENOMEM : 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 75a0f1b4ab..c7858b4e8f 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -95,10 +95,6 @@  static void amd_iommu_setup_domain_device(
     u8 bus = pdev->bus;
     const struct domain_iommu *hd = dom_iommu(domain);
 
-    /* dom_io is used as a sentinel for quarantined devices */
-    if ( domain == dom_io )
-        return;
-
     BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
             !iommu->dev_table.buffer );
 
@@ -290,10 +286,6 @@  static void amd_iommu_disable_domain_device(const struct domain *domain,
     int req_id;
     u8 bus = pdev->bus;
 
-    /* dom_io is used as a sentinel for quarantined devices */
-    if ( domain == dom_io )
-        return;
-
     BUG_ON ( iommu->dev_table.buffer == NULL );
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
@@ -632,6 +624,7 @@  static void amd_dump_p2m_table(struct domain *d)
 static const struct iommu_ops __initconstrel _iommu_ops = {
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
+    .quarantine_init = amd_iommu_quarantine_init,
     .add_device = amd_iommu_add_device,
     .remove_device = amd_iommu_remove_device,
     .assign_device  = amd_iommu_assign_device,
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 8cbe908fff..25283263d7 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -440,6 +440,28 @@  int iommu_iotlb_flush_all(struct domain *d, unsigned int flush_flags)
     return rc;
 }
 
+static int __init iommu_quarantine_init(void)
+{
+    const struct domain_iommu *hd = dom_iommu(dom_io);
+    int rc;
+
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+
+    rc = iommu_domain_init(dom_io, 0);
+    if ( rc )
+        return rc;
+
+    if ( !hd->platform_ops->quarantine_init )
+        return 0;
+
+    rc = hd->platform_ops->quarantine_init(dom_io);
+
+    if ( !rc )
+        printk("Quarantine initialized\n");
+
+    return rc;
+}
+
 int __init iommu_setup(void)
 {
     int rc = -ENODEV;
@@ -473,8 +495,7 @@  int __init iommu_setup(void)
     }
     else
     {
-        dom_io->options |= XEN_DOMCTL_CDF_iommu;
-        if ( iommu_domain_init(dom_io, 0) )
+        if ( iommu_quarantine_init() )
             panic("Could not set up quarantine\n");
 
         printk(" - Dom0 mode: %s\n",
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 25ad649c34..c20f2ca029 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1291,10 +1291,6 @@  int domain_context_mapping_one(
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
-    /* dom_io is used as a sentinel for quarantined devices */
-    if ( domain == dom_io )
-        return 0;
-
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
@@ -1541,10 +1537,6 @@  int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    /* dom_io is used as a sentinel for quarantined devices */
-    if ( domain == dom_io )
-        return 0;
-
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
@@ -1677,10 +1669,6 @@  static int domain_context_unmap(struct domain *domain, u8 devfn,
         goto out;
     }
 
-    /* dom_io is used as a sentinel for quarantined devices */
-    if ( domain == dom_io )
-        goto out;
-
     /*
      * if no other devices under the same iommu owned by this domain,
      * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
@@ -2683,9 +2671,68 @@  static void vtd_dump_p2m_table(struct domain *d)
     vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
 }
 
+static int intel_iommu_quarantine_init(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct dma_pte *parent;
+    unsigned int level = agaw_to_level(hd->arch.agaw);
+    int rc;
+
+    if ( hd->arch.pgd_maddr )
+    {
+        ASSERT_UNREACHABLE();
+        return 0;
+    }
+
+    spin_lock(&hd->arch.mapping_lock);
+
+    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
+    if ( !hd->arch.pgd_maddr )
+        goto out;
+
+    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.pgd_maddr);
+    while ( level )
+    {
+        uint64_t maddr;
+        unsigned int offset;
+
+        /*
+         * The pgtable allocator is fine for the leaf page, as well as
+         * page table pages.
+         */
+        maddr = alloc_pgtable_maddr(1, hd->node);
+        if ( !maddr )
+            break;
+
+        for ( offset = 0; offset < PTE_NUM; offset++ )
+        {
+            struct dma_pte *pte = &parent[offset];
+
+            dma_set_pte_addr(*pte, maddr);
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        iommu_flush_cache_page(parent, 1);
+
+        unmap_vtd_domain_page(parent);
+        parent = map_vtd_domain_page(maddr);
+        level--;
+    }
+    unmap_vtd_domain_page(parent);
+
+ out:
+    spin_unlock(&hd->arch.mapping_lock);
+
+    rc = iommu_flush_iotlb_all(d);
+
+    /* Pages leaked in failure case */
+    return level ? -ENOMEM : rc;
+}
+
 const struct iommu_ops __initconstrel intel_iommu_ops = {
     .init = intel_iommu_domain_init,
     .hwdom_init = intel_iommu_hwdom_init,
+    .quarantine_init = intel_iommu_quarantine_init,
     .add_device = intel_iommu_add_device,
     .enable_device = intel_iommu_enable_device,
     .remove_device = intel_iommu_remove_device,
diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
index 8ed9482791..39fb10f567 100644
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -54,6 +54,8 @@  int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
+int amd_iommu_quarantine_init(struct domain *d);
+
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
                                     mfn_t mfn, unsigned int flags,
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 974bd3ffe8..6977ddbb97 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -211,6 +211,7 @@  typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
+    int (*quarantine_init)(struct domain *d);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);