[v2,00/11] iommufd: Add nesting infrastructure

Message ID 20230511143844.22693-1-yi.l.liu@intel.com (mailing list archive)

Message

Yi Liu May 11, 2023, 2:38 p.m. UTC
Nested translation is a hardware feature supported by many modern IOMMUs.
It uses two stages of address translation (stage-1 and stage-2) to reach
the physical address. The stage-1 translation table is owned by userspace
(e.g. by a guest OS), while the stage-2 table is owned by the kernel. Changes
to the stage-1 translation table should be followed by an IOTLB invalidation.

Take Intel VT-d as an example: the stage-1 translation table is the I/O page
table. As the diagram below shows, the guest I/O page table pointer, expressed
as a GPA (guest physical address), is passed to the host and used to perform
the stage-1 address translation. In addition, modifications to present
mappings in the guest I/O page table should be followed by an IOTLB
invalidation.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>
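
To make the two-stage flow above concrete, here is a minimal user-space
sketch in C. It is only a toy model under stated assumptions: the flat
16-page tables, the toy_* names and the one-entry-per-page IOTLB are all
hypothetical and do not reflect the VT-d page-table format or any kernel
API. It only illustrates why a change to a present stage-1 mapping needs an
IOTLB invalidation before it takes effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_NPAGES	16		/* toy address space: 16 pages per stage */
#define TOY_INVALID_PFN	UINT64_MAX

/* Hypothetical toy IOMMU: flat tables instead of real multi-level ones. */
struct toy_iommu {
	uint64_t s1[TOY_NPAGES];	/* stage-1: GIOVA pfn -> GPA pfn (guest-owned) */
	uint64_t s2[TOY_NPAGES];	/* stage-2: GPA pfn -> HPA pfn (kernel-owned) */
	uint64_t iotlb[TOY_NPAGES];	/* cached GIOVA pfn -> HPA pfn */
	bool iotlb_valid[TOY_NPAGES];
};

static void toy_init(struct toy_iommu *iommu)
{
	assert(iommu != NULL);
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		iommu->s1[i] = TOY_INVALID_PFN;
		iommu->s2[i] = TOY_INVALID_PFN;
		iommu->iotlb_valid[i] = false;
	}
}

/* Translate a GIOVA to an HPA; returns false on a translation fault. */
static bool toy_translate(struct toy_iommu *iommu, uint64_t giova,
			  uint64_t *hpa)
{
	uint64_t pfn = giova >> TOY_PAGE_SHIFT;
	uint64_t off = giova & ((1ULL << TOY_PAGE_SHIFT) - 1);

	assert(iommu != NULL && hpa != NULL);
	if (pfn >= TOY_NPAGES)
		return false;

	if (!iommu->iotlb_valid[pfn]) {
		/* IOTLB miss: walk stage-1, then stage-2. */
		uint64_t gpa_pfn = iommu->s1[pfn];
		uint64_t hpa_pfn;

		if (gpa_pfn >= TOY_NPAGES)	/* also catches TOY_INVALID_PFN */
			return false;
		hpa_pfn = iommu->s2[gpa_pfn];
		if (hpa_pfn == TOY_INVALID_PFN)
			return false;
		iommu->iotlb[pfn] = hpa_pfn;
		iommu->iotlb_valid[pfn] = true;
	}

	*hpa = (iommu->iotlb[pfn] << TOY_PAGE_SHIFT) | off;
	return true;
}

/* Drop the cached translation for the page containing @giova. */
static void toy_invalidate(struct toy_iommu *iommu, uint64_t giova)
{
	uint64_t pfn = giova >> TOY_PAGE_SHIFT;

	assert(iommu != NULL);
	if (pfn < TOY_NPAGES)
		iommu->iotlb_valid[pfn] = false;
}
```

In this model, if the guest rewrites s1[] for a page that was already
translated once, toy_translate() keeps returning the old HPA until
toy_invalidate() is called for that page.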

In IOMMUFD, all translation tables are tracked by a hw_pagetable (hwpt),
and each has an iommu_domain allocated from the iommu driver. So in this
series hw_pagetable and iommu_domain mean the same thing unless noted
otherwise. IOMMUFD already supports allocating a hw_pagetable that is linked
with an IOAS. However, nesting requires IOMMUFD to allow allocating a
hw_pagetable with driver-specific parameters, plus an interface to sync the
stage-1 IOTLB, since the user owns the stage-1 translation table.

This series is based on the iommu hw info reporting series [1]. It first
introduces a new iommu op for allocating domains with user data and an op
for syncing the stage-1 IOTLB. It then extends the IOMMUFD internal
infrastructure to accept user_data and a parent hwpt, and relays the data to
the iommu core to allocate the iommu_domain. After that, it extends the
IOMMU_HWPT_ALLOC ioctl to accept user data and a stage-2 hwpt ID when
allocating a hwpt. Along with it, the IOMMU_HWPT_INVALIDATE ioctl is added to
invalidate the stage-1 IOTLB, which is needed for user-managed hwpts.
Selftests are added as well to cover the new ioctls.
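
The allocation and invalidation rules described above can be sketched as a
small user-space model in C. This is not the iommufd uAPI: the toy_* types
and functions are hypothetical, and the specific checks (a parent must be
kernel-managed, user data is mandatory when a parent is given, invalidation
is rejected for kernel-managed hwpts) are assumptions made for illustration
rather than behavior confirmed by the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy model of hwpt allocation; not the iommufd uAPI. */
enum toy_hwpt_kind {
	TOY_HWPT_KERNEL_MANAGED,	/* page table populated from an IOAS */
	TOY_HWPT_USER_MANAGED,		/* stage-1 table owned by userspace */
};

struct toy_hwpt {
	enum toy_hwpt_kind kind;
	const struct toy_hwpt *parent;	/* stage-2 parent, NULL if none */
	unsigned int nr_invalidations;	/* stage-1 IOTLB syncs seen so far */
};

/*
 * Allocate a hwpt. Without a parent the result is kernel-managed. With a
 * parent, the parent must itself be kernel-managed (a stage-2 table) and
 * driver-specific user data is assumed to be mandatory.
 */
static int toy_hwpt_alloc(struct toy_hwpt *hwpt, const struct toy_hwpt *parent,
			  const void *user_data, size_t user_data_len)
{
	assert(hwpt != NULL);

	if (parent) {
		if (parent->kind != TOY_HWPT_KERNEL_MANAGED)
			return -EINVAL;
		if (!user_data || user_data_len == 0)
			return -EINVAL;
		hwpt->kind = TOY_HWPT_USER_MANAGED;
	} else {
		hwpt->kind = TOY_HWPT_KERNEL_MANAGED;
	}
	hwpt->parent = parent;
	hwpt->nr_invalidations = 0;
	return 0;
}

/* Stage-1 IOTLB sync: only meaningful for user-managed hwpts in this model. */
static int toy_hwpt_invalidate(struct toy_hwpt *hwpt)
{
	assert(hwpt != NULL);

	if (hwpt->kind != TOY_HWPT_USER_MANAGED)
		return -EOPNOTSUPP;
	hwpt->nr_invalidations++;
	return 0;
}
```

A caller would first allocate the stage-2 hwpt with no parent, then allocate
the stage-1 hwpt with that hwpt as parent plus its user data, and only the
latter accepts toy_hwpt_invalidate().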

Complete code can be found in [2]; the QEMU code can be found in [3].

Finally, this is team work together with Nicolin Chen and Lu Baolu. Thanks
to them for the help. ^_^ Looking forward to your feedback.

base-commit: cf905391237ded2331388e75adb5afbabeddc852

[1] https://lore.kernel.org/linux-iommu/20230511143024.19542-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%2Bnesting

Change log:

v2:
 - Add union iommu_domain_user_data to include all user data structures to avoid
   passing void * in kernel APIs.
 - Add iommu op to return user data length for user domain allocation
 - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
 - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
 - Convert cache_invalidate_user op to be int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1

v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Lu Baolu (2):
  iommu: Add new iommu op to create domains owned by userspace
  iommu: Add nested domain support

Nicolin Chen (5):
  iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
  iommufd/selftest: Add domain_alloc_user() support in iommu mock
  iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (4):
  iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  iommufd: Pass parent hwpt and user_data to
    iommufd_hw_pagetable_alloc()
  iommufd: IOMMU_HWPT_ALLOC allocation with user data
  iommufd: Add IOMMU_HWPT_INVALIDATE

 drivers/iommu/iommufd/device.c                |   2 +-
 drivers/iommu/iommufd/hw_pagetable.c          | 191 +++++++++++++++++-
 drivers/iommu/iommufd/iommufd_private.h       |  16 +-
 drivers/iommu/iommufd/iommufd_test.h          |  30 +++
 drivers/iommu/iommufd/main.c                  |   5 +-
 drivers/iommu/iommufd/selftest.c              | 119 ++++++++++-
 include/linux/iommu.h                         |  36 ++++
 include/uapi/linux/iommufd.h                  |  58 +++++-
 tools/testing/selftests/iommu/iommufd.c       | 126 +++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h |  70 +++++++
 10 files changed, 629 insertions(+), 24 deletions(-)

Comments

Tian, Kevin May 19, 2023, 9:56 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, May 11, 2023 10:39 PM
> 
> Lu Baolu (2):
>   iommu: Add new iommu op to create domains owned by userspace
>   iommu: Add nested domain support
> 
> Nicolin Chen (5):
>   iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
>   iommufd/selftest: Add domain_alloc_user() support in iommu mock
>   iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
>   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> 
> Yi Liu (4):
>   iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
>   iommufd: Pass parent hwpt and user_data to
>     iommufd_hw_pagetable_alloc()
>   iommufd: IOMMU_HWPT_ALLOC allocation with user data
>   iommufd: Add IOMMU_HWPT_INVALIDATE
> 

I didn't see any change in iommufd_hw_pagetable_attach() to handle
stage-1 hwpt differently.

Conceptually, whatever reserved regions exist on a device should be
directly reflected on the hwpt to which the device is attached.

So with nesting, presumably the reserved regions of the device have
been reported to userspace, and it's the user's responsibility to avoid
allocating IOVA from those reserved regions in the stage-1 hwpt.

It's not necessary to add reserved regions to the IOAS of the parent
hwpt, since the device doesn't access that address space after it's
attached to stage-1. The parent is used only for address translation
on the iommu side.

This series kind of ignores this fact, which is probably the reason why
you store an ioas pointer even in the stage-1 hwpt.

Thanks
Kevin
Jason Gunthorpe May 19, 2023, 11:49 a.m. UTC | #2
On Fri, May 19, 2023 at 09:56:04AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, May 11, 2023 10:39 PM
> > 
> > Lu Baolu (2):
> >   iommu: Add new iommu op to create domains owned by userspace
> >   iommu: Add nested domain support
> > 
> > Nicolin Chen (5):
> >   iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
> >   iommufd/selftest: Add domain_alloc_user() support in iommu mock
> >   iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
> >   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
> >   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> > 
> > Yi Liu (4):
> >   iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
> >   iommufd: Pass parent hwpt and user_data to
> >     iommufd_hw_pagetable_alloc()
> >   iommufd: IOMMU_HWPT_ALLOC allocation with user data
> >   iommufd: Add IOMMU_HWPT_INVALIDATE
> > 
> 
> I didn't see any change in iommufd_hw_pagetable_attach() to handle
> stage-1 hwpt differently.
> 
> In concept whatever reserved regions existing on a device should be
> directly reflected on the hwpt which the device is attached to.
> 
> So with nesting presumably the reserved regions of the device have
> been reported to the userspace and it's user's responsibility to avoid
> allocating IOVA from those reserved regions in stage-1 hwpt.

Presumably
 
> It's not necessarily to add reserved regions to the IOAS of the parent
> hwpt since the device doesn't access that address space after it's
> attached to stage-1. The parent is used only for address translation
> in the iommu side.

But if we don't put them in the IOAS of the parent there is no way for
userspace to learn what they are to forward to the VM ?

Since we expect the parent IOAS to be usable in an identity mode I
think they should be added, at least I can't see a reason not to add
them.

Which is definitely complicating some parts of this..

Jason
Tian, Kevin May 24, 2023, 3:48 a.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 19, 2023 7:50 PM
> 
> On Fri, May 19, 2023 at 09:56:04AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Thursday, May 11, 2023 10:39 PM
> > >
> > > Lu Baolu (2):
> > >   iommu: Add new iommu op to create domains owned by userspace
> > >   iommu: Add nested domain support
> > >
> > > Nicolin Chen (5):
> > >   iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
> > >   iommufd/selftest: Add domain_alloc_user() support in iommu mock
> > >   iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user
> data
> > >   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
> > >   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> > >
> > > Yi Liu (4):
> > >   iommufd/hw_pagetable: Use domain_alloc_user op for domain
> allocation
> > >   iommufd: Pass parent hwpt and user_data to
> > >     iommufd_hw_pagetable_alloc()
> > >   iommufd: IOMMU_HWPT_ALLOC allocation with user data
> > >   iommufd: Add IOMMU_HWPT_INVALIDATE
> > >
> >
> > I didn't see any change in iommufd_hw_pagetable_attach() to handle
> > stage-1 hwpt differently.
> >
> > In concept whatever reserved regions existing on a device should be
> > directly reflected on the hwpt which the device is attached to.
> >
> > So with nesting presumably the reserved regions of the device have
> > been reported to the userspace and it's user's responsibility to avoid
> > allocating IOVA from those reserved regions in stage-1 hwpt.
> 
> Presumably
> 
> > It's not necessarily to add reserved regions to the IOAS of the parent
> > hwpt since the device doesn't access that address space after it's
> > attached to stage-1. The parent is used only for address translation
> > in the iommu side.
> 
> But if we don't put them in the IOAS of the parent there is no way for
> userspace to learn what they are to forward to the VM ?

emmm I wonder whether that is the right interface to report
per-device reserved regions.

e.g. does it imply that all devices will be reported to the guest with
the exact same set of reserved regions merged in the parent IOAS?

it works but looks unclear in concept. By definition the list of
reserved regions on a device should be static/fixed instead of
being dynamic upon which IOAS this device is attached to and
how many other devices are sharing the same IOAS...

IOAS_IOVA_RANGES kind of follows what vfio type1 provides today

IMHO probably we should have DEVICE_IOVA_RANGES in the first
place instead of doing it via IOAS_IOVA_RANGES which is then
described as being dynamic upon the list of currently attached devices.
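
The difference between a per-device view and a merged per-IOAS view can be
sketched as follows. This is a toy user-space model in C: the toy_* names
and the example address ranges are hypothetical, and it is not how
IOAS_IOVA_RANGES is implemented. It only shows that the per-device answer
is fixed, while the IOAS answer changes with the set of attached devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_range {
	uint64_t start;
	uint64_t last;		/* inclusive */
};

/* Hypothetical device with a fixed list of reserved regions. */
struct toy_dev {
	const struct toy_range *resv;
	size_t nr_resv;
};

static bool toy_overlaps(struct toy_range a, struct toy_range b)
{
	return a.start <= b.last && b.start <= a.last;
}

/* Per-device view: static, depends only on the device itself. */
static bool toy_dev_iova_ok(const struct toy_dev *dev, struct toy_range iova)
{
	assert(dev != NULL);
	for (size_t i = 0; i < dev->nr_resv; i++)
		if (toy_overlaps(dev->resv[i], iova))
			return false;
	return true;
}

/* Per-IOAS view: the merged result over every currently attached device. */
static bool toy_ioas_iova_ok(const struct toy_dev *const *attached,
			     size_t nr_attached, struct toy_range iova)
{
	for (size_t i = 0; i < nr_attached; i++)
		if (!toy_dev_iova_ok(attached[i], iova))
			return false;
	return true;
}
```

With two devices that reserve different ranges, an IOVA that is fine for
device A alone stops being usable in the IOAS as soon as device B, which
reserves it, is attached, even though nothing about device A changed.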

> 
> Since we expect the parent IOAS to be usable in an identity mode I
> think they should be added, at least I can't see a reason not to add
> them.

this is a good point.

For SMMU this sounds like a must-have, as identity mode is configured
in the CD with nested translation always enabled. It is outside the host's
awareness, hence reserved regions must be added to the parent IOAS.

For VT-d, identity must be configured explicitly and the hardware
doesn't support stage-1 identity in nested mode. It essentially means
not using nested translation: the user just explicitly attaches
the associated RID or {RID, PASID} to the parent IOAS and then gets
reserved regions covered already.

With that, it makes more sense to make it a vendor-specific choice.
Probably we can have a flag bit when creating a nested hwpt to mark
that identity mode might be used in this nested configuration, so
that iommufd adds device reserved regions to the parent IOAS?

> 
> Which is definately complicating some parts of this..
> 
> Jason
Jason Gunthorpe June 6, 2023, 2:18 p.m. UTC | #4
On Wed, May 24, 2023 at 03:48:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, May 19, 2023 7:50 PM
> > 
> > On Fri, May 19, 2023 at 09:56:04AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Thursday, May 11, 2023 10:39 PM
> > > >
> > > > Lu Baolu (2):
> > > >   iommu: Add new iommu op to create domains owned by userspace
> > > >   iommu: Add nested domain support
> > > >
> > > > Nicolin Chen (5):
> > > >   iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
> > > >   iommufd/selftest: Add domain_alloc_user() support in iommu mock
> > > >   iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user
> > data
> > > >   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
> > > >   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> > > >
> > > > Yi Liu (4):
> > > >   iommufd/hw_pagetable: Use domain_alloc_user op for domain
> > allocation
> > > >   iommufd: Pass parent hwpt and user_data to
> > > >     iommufd_hw_pagetable_alloc()
> > > >   iommufd: IOMMU_HWPT_ALLOC allocation with user data
> > > >   iommufd: Add IOMMU_HWPT_INVALIDATE
> > > >
> > >
> > > I didn't see any change in iommufd_hw_pagetable_attach() to handle
> > > stage-1 hwpt differently.
> > >
> > > In concept whatever reserved regions existing on a device should be
> > > directly reflected on the hwpt which the device is attached to.
> > >
> > > So with nesting presumably the reserved regions of the device have
> > > been reported to the userspace and it's user's responsibility to avoid
> > > allocating IOVA from those reserved regions in stage-1 hwpt.
> > 
> > Presumably
> > 
> > > It's not necessarily to add reserved regions to the IOAS of the parent
> > > hwpt since the device doesn't access that address space after it's
> > > attached to stage-1. The parent is used only for address translation
> > > in the iommu side.
> > 
> > But if we don't put them in the IOAS of the parent there is no way for
> > userspace to learn what they are to forward to the VM ?
> 
> emmm I wonder whether that is the right interface to report
> per-device reserved regions.

The iommu driver needs to report different reserved regions for the S1
and S2 iommu_domains, and the IOAS should only get the reserved
regions for the S2.

Currently the API has no way to report per-domain reserved regions and
that is possibly OK for now. The S2 really doesn't have reserved
regions beyond the domain aperture.

So an ioctl to directly query the reserved regions for a dev_id makes
sense.

> > Since we expect the parent IOAS to be usable in an identity mode I
> > think they should be added, at least I can't see a reason not to add
> > them.
> 
> this is a good point.

But it mixes things

The S2 doesn't have reserved ranges restrictions, we always have some
model of a S1, even for identity mode, that would carry the reserved
ranges.

> With that it makes more sense to make it a vendor specific choice.

It isn't vendor specific, the ranges come from the domain that is
attached to the IOAS, and we simply don't import ranges for a S2
domain.

Jason
Tian, Kevin June 16, 2023, 2:43 a.m. UTC | #5
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, June 6, 2023 10:18 PM
> 
> > > > It's not necessarily to add reserved regions to the IOAS of the parent
> > > > hwpt since the device doesn't access that address space after it's
> > > > attached to stage-1. The parent is used only for address translation
> > > > in the iommu side.
> > >
> > > But if we don't put them in the IOAS of the parent there is no way for
> > > userspace to learn what they are to forward to the VM ?
> >
> > emmm I wonder whether that is the right interface to report
> > per-device reserved regions.
> 
> The iommu driver needs to report different reserved regions for the S1
> and S2 iommu_domains, 

I can see the difference between RID and RID+PASID, but I'm not sure whether
it's an actual requirement with regard to the attached domain.

e.g. if only talking about RID then the same set of reserved regions should
be reported for both S1 attach and S2 attach.

> and the IOAS should only get the reserved regions for the S2.
> 
> Currently the API has no way to report per-domain reserved regions and
> that is possibly OK for now. The S2 really doesn't have reserved
> regions beyond the domain aperture.
> 
> So an ioctl to directly query the reserved regions for a dev_id makes
> sense.

Or more specifically query the reserved regions for RID-based access.

Ideally for PASID there is no reserved region otherwise SVA won't work. 
Jason Gunthorpe June 19, 2023, 12:37 p.m. UTC | #6
On Fri, Jun 16, 2023 at 02:43:13AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, June 6, 2023 10:18 PM
> > 
> > > > > It's not necessarily to add reserved regions to the IOAS of the parent
> > > > > hwpt since the device doesn't access that address space after it's
> > > > > attached to stage-1. The parent is used only for address translation
> > > > > in the iommu side.
> > > >
> > > > But if we don't put them in the IOAS of the parent there is no way for
> > > > userspace to learn what they are to forward to the VM ?
> > >
> > > emmm I wonder whether that is the right interface to report
> > > per-device reserved regions.
> > 
> > The iommu driver needs to report different reserved regions for the S1
> > and S2 iommu_domains, 
> 
> I can see the difference between RID and RID+PASID, but not sure whether
> it's a actual requirement regarding to attached domain.

No, it isn't RID or RID+PASID here

The S2 has a different set of reserved regions than the S1 because
the S2's IOVA does not appear on the bus.

So the S2's reserved regions are entirely an artifact of how the IOMMU
HW itself works when nesting.

We can probably get by with some documented, slightly messy rules that
the reserved_regions only apply to directly RID-attached domains. S2
and PASID attachments always have no reserved spaces.

> When talking about RID-based nesting alone, ARM needs to add reserved
> regions to the parent IOAS as identity is a valid S1 mode in nesting.

No, definitely not. The S2 has no reserved regions because it is an
internal IOVA, and we should not abuse that.

Reflecting the requirements for an identity map is something all iommu
HW needs to handle, we should figure out how to do that properly.

> But for Intel RID nesting excludes identity (which becomes a direct
> attach to S2) so the reserved regions apply to S1 instead of the parent IOAS.

IIRC all the HW models will assign their S2's as a RID attached "S1"
during boot time to emulate "no translation"?

They all need to learn what the allowed identity mapping is so that the
VMM can construct a compatible guest address space, independently of
any IOAS restrictions.

Jason
Tian, Kevin June 20, 2023, 1:43 a.m. UTC | #7
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, June 19, 2023 8:37 PM
> 
> On Fri, Jun 16, 2023 at 02:43:13AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, June 6, 2023 10:18 PM
> > >
> > > > > > It's not necessarily to add reserved regions to the IOAS of the parent
> > > > > > hwpt since the device doesn't access that address space after it's
> > > > > > attached to stage-1. The parent is used only for address translation
> > > > > > in the iommu side.
> > > > >
> > > > > But if we don't put them in the IOAS of the parent there is no way for
> > > > > userspace to learn what they are to forward to the VM ?
> > > >
> > > > emmm I wonder whether that is the right interface to report
> > > > per-device reserved regions.
> > >
> > > The iommu driver needs to report different reserved regions for the S1
> > > and S2 iommu_domains,
> >
> > I can see the difference between RID and RID+PASID, but not sure whether
> > it's a actual requirement regarding to attached domain.
> 
> No, it isn't RID or RID+PASID here
> 
> The S2 has a different set of reserved regsions than the S1 because
> the S2's IOVA does not appear on the bus.
> 
> So the S2's reserved regions are entirely an artifact of how the IOMMU
> HW itself works when nesting.
> 
> We can probably get by with some documented slightly messy rules that
> the reserved_regions only applies to directly RID attached domains. S2
> and PASID attachments always have no reserved spaces.
> 
> > When talking about RID-based nesting alone, ARM needs to add reserved
> > regions to the parent IOAS as identity is a valid S1 mode in nesting.
> 
> No, definately not. The S2 has no reserved regions because it is an
> internal IOVA, and we should not abuse that.
> 
> Reflecting the requirements for an identity map is something all iommu
> HW needs to handle, we should figure out how to do that properly.

I wonder whether we have argued past each other.

This series adds reserved regions to S2. I challenged the necessity as
S2 is not directly accessed by the device.

Then you replied that doing so still made sense to support identity S1.

But now it looks like you also agree that reserved regions should not be
added to S2, except that supporting identity S1 needs more thought?

> 
> > But for Intel RID nesting excludes identity (which becomes a direct
> > attach to S2) so the reserved regions apply to S1 instead of the parent IOAS.
> 
> IIRC all the HW models will assign their S2's as a RID attached "S1"
> during boot time to emulate "no translation"?

I'm not sure what it means...

> 
> They all need to learn what the allowed identiy mapping is so that the
> VMM can construct a compatible guest address space, independently of
> any IOAS restrictions.
> 

Intel VT-d supports 4 configurations:
  - passthrough (i.e. identity mapped)
  - S1 only
  - S2 only
  - nested

'S2 only' is used when vIOMMU is configured in passthrough.

'nested' is used when vIOMMU is configured in 'S1 only'.

So in any case 'identity' is not a business of nesting in the VT-d context.
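
The mapping just described, from the vIOMMU configuration to the host
configuration, is small enough to write down. The sketch below is a toy in
C with hypothetical enum and function names; it encodes only the two cases
stated above and is not taken from the VT-d driver.

```c
#include <assert.h>

/* Hypothetical names for the guest-visible vIOMMU configuration. */
enum toy_guest_cfg {
	TOY_GUEST_PASSTHROUGH,
	TOY_GUEST_S1_ONLY,
};

/* Hypothetical names for the resulting host VT-d configuration. */
enum toy_host_cfg {
	TOY_HOST_S2_ONLY,	/* S2 attached directly, no nesting involved */
	TOY_HOST_NESTED,	/* guest S1 nested on top of the host S2 */
};

static enum toy_host_cfg toy_vtd_host_cfg(enum toy_guest_cfg guest)
{
	switch (guest) {
	case TOY_GUEST_PASSTHROUGH:
		return TOY_HOST_S2_ONLY;
	case TOY_GUEST_S1_ONLY:
		return TOY_HOST_NESTED;
	}
	assert(0 && "unknown guest configuration");
	return TOY_HOST_S2_ONLY;
}
```

Guest passthrough never selects TOY_HOST_NESTED in this table, which is the
sense in which identity stays outside nesting for VT-d.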

My understanding of ARM SMMU is that from host p.o.v. the CD is the
S1 in the nested configuration. 'identity' is one configuration in the CD
then it's in the business of nesting.

My preference was that ALLOC_HWPT lets the vIOMMU choose whether the
reserved regions of dev_id should be added to the IOAS of the parent
S2 hwpt.
Jason Gunthorpe June 20, 2023, 12:47 p.m. UTC | #8
On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> I wonder whether we have argued passed each other.
> 
> This series adds reserved regions to S2. I challenged the necessity as
> S2 is not directly accessed by the device.
> 
> Then you replied that doing so still made sense to support identity
> S1.

I think I said/meant if we attach the "s2" iommu domain as a direct
attach for identity - e.g. at boot time, then the IOAS must gain the
reserved regions. This is our normal protocol.

But when we use the "s2" iommu domain as an actual nested S2 then we
don't gain reserved regions.

> Intel VT-d supports 4 configurations:
>   - passthrough (i.e. identity mapped)
>   - S1 only
>   - S2 only
>   - nested
> 
> 'S2 only' is used when vIOMMU is configured in passthrough.

S2 only is modeled as attaching an S2 format iommu domain to the RID,
and when this is done the IOAS should gain the reserved regions
because it is no different behavior than attaching any other iommu
domain to a RID.

When the S2 is replaced with a S1 nest then the IOAS should lose
those reserved regions since it is no longer attached to a RID.
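
The gain/lose rule can be sketched as a tiny state machine. This is a toy
user-space model in C with hypothetical toy_* names; it tracks only how many
devices currently contribute reserved regions to the IOAS, not the regions
themselves, and it is not the iommufd attach path.

```c
#include <assert.h>
#include <stddef.h>

enum toy_attach {
	TOY_DETACHED,
	TOY_S2_DIRECT,	/* S2 domain attached straight to the RID */
	TOY_S1_NESTED,	/* S1 attached to the RID, S2 only a nesting parent */
};

/* Hypothetical IOAS: counts devices whose reserved regions it carries. */
struct toy_ioas {
	unsigned int nr_resv_contributors;
};

struct toy_dev {
	enum toy_attach state;
};

/*
 * Move @dev to attachment state @next. The IOAS behind the S2 carries the
 * device's reserved regions only while the S2 is directly attached to the
 * RID; a nested S1 attachment contributes nothing to the parent IOAS.
 */
static void toy_dev_set_attach(struct toy_ioas *ioas, struct toy_dev *dev,
			       enum toy_attach next)
{
	int had, want;

	assert(ioas != NULL && dev != NULL);
	had = dev->state == TOY_S2_DIRECT;
	want = next == TOY_S2_DIRECT;

	if (had && !want) {
		assert(ioas->nr_resv_contributors > 0);
		ioas->nr_resv_contributors--;
	} else if (!had && want) {
		ioas->nr_resv_contributors++;
	}
	dev->state = next;
}
```

Starting from TOY_DETACHED, a direct S2 attach raises the count to one,
replacing it with TOY_S1_NESTED drops it back to zero, and switching back to
the direct S2 attach raises it again.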

> My understanding of ARM SMMU is that from host p.o.v. the CD is the
> S1 in the nested configuration. 'identity' is one configuration in the CD
> then it's in the business of nesting.

I think it is the same. A CD doesn't come into the picture until the
guest installs a CD pointing STE. Until that time the S2 is being used
as identity.

It sounds like the same basic flow.

> My preference was that ALLOC_HWPT allows vIOMMU to opt whether
> reserved regions of dev_id should be added to the IOAS of the parent
> S2 hwpt.

Having an API to explicitly load reserved regions of a specific device
to an IOAS makes some sense to me.

Jason
Tian, Kevin June 21, 2023, 6:02 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, June 20, 2023 8:47 PM
> 
> On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> > I wonder whether we have argued passed each other.
> >
> > This series adds reserved regions to S2. I challenged the necessity as
> > S2 is not directly accessed by the device.
> >
> > Then you replied that doing so still made sense to support identity
> > S1.
> 
> I think I said/ment if we attach the "s2" iommu domain as a direct
> attach for identity - eg at boot time, then the IOAS must gain the
> reserved regions. This is our normal protocol.
> 
> But when we use the "s2" iommu domain as an actual nested S2 then we
> don't gain reserved regions.

Then we're aligned.

Yi/Nicolin, please update this series to not automatically add reserved
regions to S2 in the nesting configuration.

It also implies that the user cannot rely on IOAS_IOVA_RANGES to
learn reserved regions for arranging addresses in S1.

Then we also need a new ioctl to report reserved regions per dev_id.

> 
> > Intel VT-d supports 4 configurations:
> >   - passthrough (i.e. identity mapped)
> >   - S1 only
> >   - S2 only
> >   - nested
> >
> > 'S2 only' is used when vIOMMU is configured in passthrough.
> 
> S2 only is modeled as attaching an S2 format iommu domain to the RID,
> and when this is done the IOAS should gain the reserved regions
> because it is no different behavior than attaching any other iommu
> domain to a RID.
> 
> When the S2 is replaced with a S1 nest then the IOAS should loose
> those reserved regions since it is no longer attached to a RID.

yes

> 
> > My understanding of ARM SMMU is that from host p.o.v. the CD is the
> > S1 in the nested configuration. 'identity' is one configuration in the CD
> > then it's in the business of nesting.
> 
> I think it is the same. A CD doesn't come into the picture until the
> guest installs a CD pointing STE. Until that time the S2 is being used
> as identity.
> 
> It sounds like the same basic flow.

After a CD table is installed in an STE, I assume the SMMU still allows
configuring an individual CD entry as identity? E.g. while vSVA is enabled
on a device, the guest can continue to keep CD#0 as identity when the
default domain of the device is set as 'passthrough'. In this case the
IOAS still needs to gain reserved regions even though the S2 is not directly
attached from the host p.o.v.

> 
> > My preference was that ALLOC_HWPT allows vIOMMU to opt whether
> > reserved regions of dev_id should be added to the IOAS of the parent
> > S2 hwpt.
> 
> Having an API to explicitly load reserved regions of a specific device
> to an IOAS makes some sense to me.
> 
> Jason
Yi Liu June 21, 2023, 7:09 a.m. UTC | #10
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, June 21, 2023 2:02 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, June 20, 2023 8:47 PM
> >
> > On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> > > I wonder whether we have argued passed each other.
> > >
> > > This series adds reserved regions to S2. I challenged the necessity as
> > > S2 is not directly accessed by the device.
> > >
> > > Then you replied that doing so still made sense to support identity
> > > S1.
> >
> > I think I said/ment if we attach the "s2" iommu domain as a direct
> > attach for identity - eg at boot time, then the IOAS must gain the
> > reserved regions. This is our normal protocol.
> >
> > But when we use the "s2" iommu domain as an actual nested S2 then we
> > don't gain reserved regions.
> 
> Then we're aligned.
> 
> Yi/Nicolin, please update this series to not automatically add reserved
> regions to S2 in the nesting configuration.

Got it.

> It also implies that the user cannot rely on IOAS_IOVA_RANGES to
> learn reserved regions for arranging addresses in S1.
> 
> Then we also need a new ioctl to report reserved regions per dev_id.

Shall we add it now? I suppose yes.

> >
> > > Intel VT-d supports 4 configurations:
> > >   - passthrough (i.e. identity mapped)
> > >   - S1 only
> > >   - S2 only
> > >   - nested
> > >
> > > 'S2 only' is used when vIOMMU is configured in passthrough.
> >
> > S2 only is modeled as attaching an S2 format iommu domain to the RID,
> > and when this is done the IOAS should gain the reserved regions
> > because it is no different behavior than attaching any other iommu
> > domain to a RID.
> >
> > When the S2 is replaced with a S1 nest then the IOAS should loose
> > those reserved regions since it is no longer attached to a RID.
> 
> yes

Makes sense.

Regards,
Yi Liu

> 
> >
> > > My understanding of ARM SMMU is that from host p.o.v. the CD is the
> > > S1 in the nested configuration. 'identity' is one configuration in the CD
> > > then it's in the business of nesting.
> >
> > I think it is the same. A CD doesn't come into the picture until the
> > guest installs a CD pointing STE. Until that time the S2 is being used
> > as identity.
> >
> > It sounds like the same basic flow.
> 
> After a CD table is installed in a STE I assume the SMMU still allows to
> configure an individual CD entry as identity? e.g. while vSVA is enabled
> on a device the guest can continue to keep CD#0 as identity when the
> default domain of the device is set as 'passthrough'. In this case the
> IOAS still needs to gain reserved regions even though S2 is not directly
> attached from host p.o.v.
> 
> >
> > > My preference was that ALLOC_HWPT allows vIOMMU to opt whether
> > > reserved regions of dev_id should be added to the IOAS of the parent
> > > S2 hwpt.
> >
> > Having an API to explicitly load reserved regions of a specific device
> > to an IOAS makes some sense to me.
> >
> > Jason
Duan, Zhenzhong June 21, 2023, 8:29 a.m. UTC | #11
>-----Original Message-----
>From: Jason Gunthorpe <jgg@nvidia.com>
>Sent: Tuesday, June 20, 2023 8:47 PM
>Subject: Re: [PATCH v2 00/11] iommufd: Add nesting infrastructure
>
>On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
>> I wonder whether we have argued passed each other.
>>
>> This series adds reserved regions to S2. I challenged the necessity as
>> S2 is not directly accessed by the device.
>>
>> Then you replied that doing so still made sense to support identity
>> S1.
>
>I think I said/ment if we attach the "s2" iommu domain as a direct attach for
>identity - eg at boot time, then the IOAS must gain the reserved regions. This is
>our normal protocol.

There is code in the Intel iommu driver that fails the attach for a device
with RMRR. Do we plan to remove the check below for IOMMUFD sooner or later?

static int intel_iommu_attach_device(struct iommu_domain *domain,
                                     struct device *dev)
{
        struct device_domain_info *info = dev_iommu_priv_get(dev);
        int ret;

        if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
            device_is_rmrr_locked(dev)) {
                dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
                return -EPERM;
        }

Thanks
Zhenzhong
Jason Gunthorpe June 21, 2023, 12:04 p.m. UTC | #12
On Wed, Jun 21, 2023 at 06:02:21AM +0000, Tian, Kevin wrote:
> > > My understanding of ARM SMMU is that from host p.o.v. the CD is the
> > > S1 in the nested configuration. 'identity' is one configuration in the CD
> > > then it's in the business of nesting.
> > 
> > I think it is the same. A CD doesn't come into the picture until the
> > guest installs a CD pointing STE. Until that time the S2 is being used
> > as identity.
> > 
> > It sounds like the same basic flow.
> 
> After a CD table is installed in a STE I assume the SMMU still allows to
> configure an individual CD entry as identity? e.g. while vSVA is enabled
> on a device the guest can continue to keep CD#0 as identity when the
> default domain of the device is set as 'passthrough'. In this case the
> IOAS still needs to gain reserved regions even though S2 is not directly
> attached from host p.o.v.

In any nesting configuration the hypervisor cannot directly restrict
what IOVA the guest will use. The VM could make a normal nest and try
to use unusable IOVA. Identity is not really special.

The VMM should construct the guest memory map so that an identity
iommu_domain can meet the reserved requirements - it needs to do this
anyhow for the initial boot part. It should try to forward the
reserved regions to the guest via ACPI/etc.

Being able to explicitly load reserved regions into an IOAS seems like
a useful way to help construct this.

Jason
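
The memory-map constraint described above can be sketched as a simple overlap check: before placing guest RAM at a GPA range, the VMM verifies that the range does not intersect any reserved region, so that an identity iommu_domain can still satisfy the reserved requirements. This is only an illustration under stated assumptions. `struct demo_resv_range` and `demo_gpa_range_is_usable()` are hypothetical names, not part of iommufd or QEMU, and how the VMM obtains the reserved list is left open, as it is in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL VMM-side record of one reserved range, bounds inclusive. */
struct demo_resv_range {
	uint64_t start;
	uint64_t last;
};

/*
 * Return true if the guest range [start, last] overlaps none of the
 * reserved ranges, i.e. it is safe to use with an identity mapping.
 */
static bool demo_gpa_range_is_usable(const struct demo_resv_range *resv,
				     size_t nr, uint64_t start, uint64_t last)
{
	size_t i;

	assert(start <= last);
	for (i = 0; i < nr; i++) {
		/* Two inclusive ranges overlap iff each starts before the other ends. */
		if (start <= resv[i].last && resv[i].start <= last)
			return false;
	}
	return true;
}
```

A VMM could run such a check for every RAM block while building the guest memory map and punch a hole wherever it fails.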
Jason Gunthorpe June 21, 2023, 12:07 p.m. UTC | #13
On Wed, Jun 21, 2023 at 08:29:09AM +0000, Duan, Zhenzhong wrote:
> >-----Original Message-----
> >From: Jason Gunthorpe <jgg@nvidia.com>
> >Sent: Tuesday, June 20, 2023 8:47 PM
> >Subject: Re: [PATCH v2 00/11] iommufd: Add nesting infrastructure
> >
> >On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> >> I wonder whether we have argued past each other.
> >>
> >> This series adds reserved regions to S2. I challenged the necessity as
> >> S2 is not directly accessed by the device.
> >>
> >> Then you replied that doing so still made sense to support identity
> >> S1.
> >
> >I think I said/meant if we attach the "s2" iommu domain as a direct attach for
> >identity - eg at boot time, then the IOAS must gain the reserved regions. This is
> >our normal protocol.
> There is code in the Intel IOMMU driver that fails the attach for a device with
> RMRR. Do we plan to remove the check below for IOMMUFD sooner or later?
> 
> static int intel_iommu_attach_device(struct iommu_domain *domain,
>                                      struct device *dev)
> {
>         struct device_domain_info *info = dev_iommu_priv_get(dev);
>         int ret;
> 
>         if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
>             device_is_rmrr_locked(dev)) {
>                 dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
>                 return -EPERM;
>         }

Not really, systems with RMRR cannot support VFIO at all. Baolu sent a
series lifting this restriction up higher in the stack:

https://lore.kernel.org/all/20230607035145.343698-1-baolu.lu@linux.intel.com/

Jason
Nicolin Chen June 21, 2023, 5:13 p.m. UTC | #14
On Wed, Jun 21, 2023 at 06:02:21AM +0000, Tian, Kevin wrote:

> > On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> > > I wonder whether we have argued past each other.
> > >
> > > This series adds reserved regions to S2. I challenged the necessity as
> > > S2 is not directly accessed by the device.
> > >
> > > Then you replied that doing so still made sense to support identity
> > > S1.
> >
> > I think I said/meant if we attach the "s2" iommu domain as a direct
> > attach for identity - eg at boot time, then the IOAS must gain the
> > reserved regions. This is our normal protocol.
> >
> > But when we use the "s2" iommu domain as an actual nested S2 then we
> > don't gain reserved regions.
> 
> Then we're aligned.
> 
> Yi/Nicolin, please update this series to not automatically add reserved
> regions to S2 in the nesting configuration.

I'm a bit late for the conversation here. Yet, how about the
IOMMU_RESV_SW_MSI on ARM in the nesting configuration? We'd
still call iommufd_group_setup_msi() on the S2 HWPT, despite
attaching the device to a nested S1 HWPT right?

> It also implies that the user cannot rely on IOAS_IOVA_RANGES to
> learn reserved regions for arranging addresses in S1.
> 
> Then we also need a new ioctl to report reserved regions per dev_id.

So, in a nesting configuration, QEMU would poll a device's S2
MSI region (i.e. IOMMU_RESV_SW_MSI) to prevent conflict?

Thanks
Nic
Tian, Kevin June 26, 2023, 6:32 a.m. UTC | #15
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 21, 2023 8:05 PM
> 
> On Wed, Jun 21, 2023 at 06:02:21AM +0000, Tian, Kevin wrote:
> > > > My understanding of ARM SMMU is that from host p.o.v. the CD is the
> > > > S1 in the nested configuration. 'identity' is one configuration in the CD
> > > > then it's in the business of nesting.
> > >
> > > I think it is the same. A CD doesn't come into the picture until the
> > > guest installs a CD pointing STE. Until that time the S2 is being used
> > > as identity.
> > >
> > > It sounds like the same basic flow.
> >
> > After a CD table is installed in a STE I assume the SMMU still allows to
> > configure an individual CD entry as identity? e.g. while vSVA is enabled
> > on a device the guest can continue to keep CD#0 as identity when the
> > default domain of the device is set as 'passthrough'. In this case the
> > IOAS still needs to gain reserved regions even though S2 is not directly
> > attached from host p.o.v.
> 
> In any nesting configuration the hypervisor cannot directly restrict
> what IOVA the guest will use. The VM could make a normal nest and try
> to use unusable IOVA. Identity is not really special.

Sure. What I was talking about is the end result, e.g. after the user explicitly
requests to load reserved regions into an IOAS.

> 
> The VMM should construct the guest memory map so that an identity
> iommu_domain can meet the reserved requirements - it needs to do this
> anyhow for the initial boot part. It should try to forward the
> reserved regions to the guest via ACPI/etc.

Yes.

> 
> Being able to explicitly load reserved regions into an IOAS seems like
> a useful way to help construct this.
> 

And it's correct in concept because the IOAS is 'implicitly' accessed by
the device when the guest domain is identity in this case.
Tian, Kevin June 26, 2023, 6:42 a.m. UTC | #16
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, June 22, 2023 1:13 AM
> 
> On Wed, Jun 21, 2023 at 06:02:21AM +0000, Tian, Kevin wrote:
> 
> > > On Tue, Jun 20, 2023 at 01:43:42AM +0000, Tian, Kevin wrote:
> > > > I wonder whether we have argued past each other.
> > > >
> > > > This series adds reserved regions to S2. I challenged the necessity as
> > > > S2 is not directly accessed by the device.
> > > >
> > > > Then you replied that doing so still made sense to support identity
> > > > S1.
> > >
> > > I think I said/meant if we attach the "s2" iommu domain as a direct
> > > attach for identity - eg at boot time, then the IOAS must gain the
> > > reserved regions. This is our normal protocol.
> > >
> > > But when we use the "s2" iommu domain as an actual nested S2 then we
> > > don't gain reserved regions.
> >
> > Then we're aligned.
> >
> > Yi/Nicolin, please update this series to not automatically add reserved
> > regions to S2 in the nesting configuration.
> 
> I'm a bit late for the conversation here. Yet, how about the
> IOMMU_RESV_SW_MSI on ARM in the nesting configuration? We'd
> still call iommufd_group_setup_msi() on the S2 HWPT, despite
> attaching the device to a nested S1 HWPT right?

Yes, based on current design of ARM nesting.

But please special case it instead of pretending that all reserved regions
are added to IOAS which is wrong in concept based on the discussion.

> 
> > It also implies that the user cannot rely on IOAS_IOVA_RANGES to
> > learn reserved regions for arranging addresses in S1.
> >
> > Then we also need a new ioctl to report reserved regions per dev_id.
> 
> So, in a nesting configuration, QEMU would poll a device's S2
> MSI region (i.e. IOMMU_RESV_SW_MSI) to prevent conflict?
> 

Qemu needs to know all the reserved regions of the device and skip
them when arranging S1 layout.

I'm not sure whether the MSI region needs a special MSI type or
just a general RESV_DIRECT type for 1:1 mapping, though.
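
Skipping reserved regions when arranging the S1 layout can be pictured as a first-fit search over the IOVA space. The sketch below assumes the VMM already holds the device's reserved regions as a sorted, non-overlapping array. `struct demo_range` and `demo_pick_iova()` are hypothetical names used only for illustration; they are not an existing QEMU or iommufd interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL reserved range, bounds inclusive. */
struct demo_range {
	uint64_t start;
	uint64_t last;
};

/*
 * Find the lowest IOVA in [floor, limit] where a window of 'size' bytes
 * fits without touching any reserved range. 'resv' must be sorted by
 * start address and must not contain overlapping entries.
 * Returns true and stores the IOVA in *out on success.
 */
static bool demo_pick_iova(const struct demo_range *resv, size_t nr,
			   uint64_t floor, uint64_t limit, uint64_t size,
			   uint64_t *out)
{
	uint64_t cand = floor;
	size_t i;

	assert(size > 0 && floor <= limit);
	for (i = 0; i < nr; i++) {
		if (resv[i].last < cand)
			continue;	/* entirely below the candidate */
		/* The window fits in the gap before this reserved range. */
		if (resv[i].start > cand && resv[i].start - cand >= size)
			break;
		/* Otherwise move the candidate just past this range. */
		if (resv[i].last >= limit)
			return false;
		cand = resv[i].last + 1;
	}
	/* The window [cand, cand + size - 1] must end at or below limit. */
	if (limit - cand < size - 1)
		return false;
	*out = cand;
	return true;
}
```

A real allocator would also need alignment handling; that is omitted here to keep the skip logic visible.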
Jason Gunthorpe June 26, 2023, 1:05 p.m. UTC | #17
On Mon, Jun 26, 2023 at 06:42:58AM +0000, Tian, Kevin wrote:

> I'm not sure whether the MSI region needs a special MSI type or
> just a general RESV_DIRECT type for 1:1 mapping, though.

It probably always needs a special type :(

Jason
Nicolin Chen June 26, 2023, 5:28 p.m. UTC | #18
On Mon, Jun 26, 2023 at 06:42:58AM +0000, Tian, Kevin wrote:

> > > Yi/Nicolin, please update this series to not automatically add reserved
> > > regions to S2 in the nesting configuration.
> >
> > I'm a bit late for the conversation here. Yet, how about the
> > IOMMU_RESV_SW_MSI on ARM in the nesting configuration? We'd
> > still call iommufd_group_setup_msi() on the S2 HWPT, despite
> > attaching the device to a nested S1 HWPT right?
> 
> Yes, based on current design of ARM nesting.
> 
> But please special case it instead of pretending that all reserved regions
> are added to IOAS which is wrong in concept based on the discussion.

Ack. Yi made a version of the change that drops it completely, along
with the iommufd_group_setup_msi() call for a nested S1 HWPT.
So I thought there was a misalignment. I made another version
preserving the pathway for MSI on ARM, and perhaps we should
go with this one:
https://github.com/nicolinc/iommufd/commit/c63829a12d35f2d7a390f42821a079f8a294cff8

> > > It also implies that the user cannot rely on IOAS_IOVA_RANGES to
> > > learn reserved regions for arranging addresses in S1.
> > >
> > > Then we also need a new ioctl to report reserved regions per dev_id.
> >
> > So, in a nesting configuration, QEMU would poll a device's S2
> > MSI region (i.e. IOMMU_RESV_SW_MSI) to prevent conflict?
> >
> 
> Qemu needs to know all the reserved regions of the device and skip
> them when arranging S1 layout.

OK.

> I'm not sure whether the MSI region needs a special MSI type or
> just a general RESV_DIRECT type for 1:1 mapping, though.

I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
and IOMMU_RESV_SW_MSI? Or does it just mean we should report
the iommu_resv_type along with reserved regions in new ioctl?

Thanks
Nic
Tian, Kevin June 27, 2023, 6:02 a.m. UTC | #19
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, June 27, 2023 1:29 AM
> 
> > I'm not sure whether the MSI region needs a special MSI type or
> > just a general RESV_DIRECT type for 1:1 mapping, though.
> 
> I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
> and IOMMU_RESV_SW_MSI? Or does it just mean we should report
> the iommu_resv_type along with reserved regions in new ioctl?
> 

Currently those are iommu internal types. When defining the new
ioctl we need to think about what needs to be presented to the user.

Probably just a list of reserved regions plus a flag to mark which
one is SW_MSI? Except SW_MSI all other reserved region types
just need the user to reserve them w/o knowing more detail.
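
To make the "list of reserved regions plus a flag" idea concrete, here is a minimal sketch of what userspace might receive and how it could pick out the SW_MSI window. Everything here is an assumption for illustration: `DEMO_RESV_FLAG_SW_MSI`, `struct demo_resv_region` and `demo_find_sw_msi()` are hypothetical, and no such ioctl layout is defined by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL flag and layout; not an existing iommufd uAPI. */
#define DEMO_RESV_FLAG_SW_MSI (1u << 0)

struct demo_resv_region {
	uint64_t start;	/* first reserved IOVA */
	uint64_t last;	/* last reserved IOVA, inclusive */
	uint32_t flags;	/* DEMO_RESV_FLAG_* */
};

/*
 * Return the first region marked as SW_MSI, or NULL if there is none.
 * All other regions carry no type information: the user simply avoids them.
 */
static const struct demo_resv_region *
demo_find_sw_msi(const struct demo_resv_region *regions, size_t nr)
{
	size_t i;

	assert(regions != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (regions[i].flags & DEMO_RESV_FLAG_SW_MSI)
			return &regions[i];
	}
	return NULL;
}
```

With such a shape, the VMM would treat unflagged entries as plain holes and handle only the flagged entry specially.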
Jason Gunthorpe June 27, 2023, 4:01 p.m. UTC | #20
On Tue, Jun 27, 2023 at 06:02:13AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, June 27, 2023 1:29 AM
> > 
> > > I'm not sure whether the MSI region needs a special MSI type or
> > > just a general RESV_DIRECT type for 1:1 mapping, though.
> > 
> > I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
> > and IOMMU_RESV_SW_MSI? Or does it just mean we should report
> > the iommu_resv_type along with reserved regions in new ioctl?
> > 
> 
> Currently those are iommu internal types. When defining the new
> ioctl we need to think about what needs to be presented to the user.
> 
> Probably just a list of reserved regions plus a flag to mark which
> one is SW_MSI? Except SW_MSI all other reserved region types
> just need the user to reserve them w/o knowing more detail.

I think I prefer the idea we just import the reserved regions from a
devid and do not expose any of this detail to userspace.

Kernel can make only the SW_MSI a mandatory cut out when the S2 is
attached.

Jason
Tian, Kevin June 28, 2023, 2:47 a.m. UTC | #21
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 28, 2023 12:01 AM
> 
> On Tue, Jun 27, 2023 at 06:02:13AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, June 27, 2023 1:29 AM
> > >
> > > > I'm not sure whether the MSI region needs a special MSI type or
> > > > just a general RESV_DIRECT type for 1:1 mapping, though.
> > >
> > > I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
> > > and IOMMU_RESV_SW_MSI? Or does it just mean we should report
> > > the iommu_resv_type along with reserved regions in new ioctl?
> > >
> >
> > Currently those are iommu internal types. When defining the new
> > ioctl we need to think about what needs to be presented to the user.
> >
> > Probably just a list of reserved regions plus a flag to mark which
> > one is SW_MSI? Except SW_MSI all other reserved region types
> > just need the user to reserve them w/o knowing more detail.
> 
> I think I prefer the idea we just import the reserved regions from a
> devid and do not expose any of this detail to userspace.
> 
> Kernel can make only the SW_MSI a mandatory cut out when the S2 is
> attached.
> 

I'm confused.

The VMM needs to know reserved regions per dev_id and report them
to the guest.

And we have aligned on that reserved regions (except SW_MSI) should
not be automatically added to S2 in nesting case. Then the VMM cannot
rely on IOAS_IOVA_RANGES to identify the reserved regions.

So there needs to be a new interface for the user to discover reserved regions
per dev_id, within which the SW_MSI region should be marked out so
identity mapping can be installed properly for it in S1.

Did I misunderstand your point in previous discussion?
Jason Gunthorpe June 28, 2023, 12:36 p.m. UTC | #22
On Wed, Jun 28, 2023 at 02:47:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, June 28, 2023 12:01 AM
> > 
> > On Tue, Jun 27, 2023 at 06:02:13AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Tuesday, June 27, 2023 1:29 AM
> > > >
> > > > > I'm not sure whether the MSI region needs a special MSI type or
> > > > > just a general RESV_DIRECT type for 1:1 mapping, though.
> > > >
> > > > I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
> > > > and IOMMU_RESV_SW_MSI? Or does it just mean we should report
> > > > the iommu_resv_type along with reserved regions in new ioctl?
> > > >
> > >
> > > Currently those are iommu internal types. When defining the new
> > > ioctl we need to think about what needs to be presented to the user.
> > >
> > > Probably just a list of reserved regions plus a flag to mark which
> > > one is SW_MSI? Except SW_MSI all other reserved region types
> > > just need the user to reserve them w/o knowing more detail.
> > 
> > I think I prefer the idea we just import the reserved regions from a
> > devid and do not expose any of this detail to userspace.
> > 
> > Kernel can make only the SW_MSI a mandatory cut out when the S2 is
> > attached.
> > 
> 
> I'm confused.
> 
> The VMM needs to know reserved regions per dev_id and report them
> to the guest.
> 
> And we have aligned on that reserved regions (except SW_MSI) should
> not be automatically added to S2 in nesting case. Then the VMM cannot
> rely on IOAS_IOVA_RANGES to identify the reserved regions.

We also said we need a way to load the reserved regions to create an
identity compatible version of the HWPT.

So we have a model where the VMM will want to load in regions beyond
what the currently attached device needs.

> So there needs to be a new interface for the user to discover reserved regions
> per dev_id, within which the SW_MSI region should be marked out so
> identity mapping can be installed properly for it in S1.
> 
> Did I misunderstand your point in previous discussion?

This is another discussion. If the VMM needs this then we probably
need a new API to get it.

Jason
Tian, Kevin June 29, 2023, 2:16 a.m. UTC | #23
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 28, 2023 8:36 PM
> 
> On Wed, Jun 28, 2023 at 02:47:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, June 28, 2023 12:01 AM
> > >
> > > On Tue, Jun 27, 2023 at 06:02:13AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Tuesday, June 27, 2023 1:29 AM
> > > > >
> > > > > > I'm not sure whether the MSI region needs a special MSI type or
> > > > > > just a general RESV_DIRECT type for 1:1 mapping, though.
> > > > >
> > > > > I don't quite get this part. Isn't MSI having IOMMU_RESV_MSI
> > > > > and IOMMU_RESV_SW_MSI? Or does it just mean we should report
> > > > > the iommu_resv_type along with reserved regions in new ioctl?
> > > > >
> > > >
> > > > Currently those are iommu internal types. When defining the new
> > > > ioctl we need to think about what needs to be presented to the user.
> > > >
> > > > Probably just a list of reserved regions plus a flag to mark which
> > > > one is SW_MSI? Except SW_MSI all other reserved region types
> > > > just need the user to reserve them w/o knowing more detail.
> > >
> > > I think I prefer the idea we just import the reserved regions from a
> > > devid and do not expose any of this detail to userspace.
> > >
> > > Kernel can make only the SW_MSI a mandatory cut out when the S2 is
> > > attached.
> > >
> >
> > I'm confused.
> >
> > The VMM needs to know reserved regions per dev_id and report them
> > to the guest.
> >
> > And we have aligned on that reserved regions (except SW_MSI) should
> > not be automatically added to S2 in nesting case. Then the VMM cannot
> > rely on IOAS_IOVA_RANGES to identify the reserved regions.
> 
> We also said we need a way to load the reserved regions to create an
> identity compatible version of the HWPT
> 
> So we have a model where the VMM will want to load in regions beyond
> the currently attached device needs

No question on this.

> 
> > So there needs to be a new interface for the user to discover reserved regions
> > per dev_id, within which the SW_MSI region should be marked out so
> > identity mapping can be installed properly for it in S1.
> >
> > Did I misunderstand your point in previous discussion?
> 
> This is another discussion, if the vmm needs this then we probably
> need a new API to get it.
> 

Then it's clear.