Message ID | 20250124202115.349386-2-mjrosato@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | s390x/pci: relax I/O address translation requirement | expand |
On 24.01.25 21:21, Matthew Rosato wrote: > When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T > bit set, treat this as a request to perform direct mapping instead of > address translation. In order to facilitate this, pin the entirety of > guest memory into the host iommu. > > Pinning for the direct mapping case is handled via vfio and its memory > listener. Additionally, ram discard settings are inherited from vfio: > coordinated discards (e.g. virtio-mem) are allowed while uncoordinated > discards (e.g. virtio-balloon) are disabled. > > Subsequent guest DMA operations are all expected to be of the format > guest_phys+sdma, allowing them to be used as lookup into the host > iommu table. > > Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> > --- > hw/s390x/s390-pci-bus.c | 36 +++++++++++++++++++++++++++++++-- > hw/s390x/s390-pci-inst.c | 13 ++++++++++-- > hw/s390x/s390-pci-vfio.c | 15 ++++++++++---- > hw/s390x/s390-virtio-ccw.c | 5 +++++ > include/hw/s390x/s390-pci-bus.h | 4 ++++ > 5 files changed, 65 insertions(+), 8 deletions(-) > > diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c > index eead269cc2..8cf5aec1a2 100644 > --- a/hw/s390x/s390-pci-bus.c > +++ b/hw/s390x/s390-pci-bus.c > @@ -18,6 +18,8 @@ > #include "hw/s390x/s390-pci-inst.h" > #include "hw/s390x/s390-pci-kvm.h" > #include "hw/s390x/s390-pci-vfio.h" > +#include "hw/s390x/s390-virtio-ccw.h" > +#include "hw/boards.h" > #include "hw/pci/pci_bus.h" > #include "hw/qdev-properties.h" > #include "hw/pci/pci_bridge.h" > @@ -720,16 +722,44 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu) > TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr), > name, iommu->pal + 1); > iommu->enabled = true; > + iommu->direct_map = false; > memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr)); > g_free(name); > } > > +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu) > +{ > + MachineState *ms = MACHINE(qdev_get_machine()); > + S390CcwMachineState *s390ms = S390_CCW_MACHINE(ms); > + > + /* > + * For direct-mapping we must map the entire guest address space. Rather > + * than using an iommu, create a memory region alias that maps GPA X to > + * iova X + SDMA. VFIO will handle pinning via its memory listener. > + */ > + g_autofree char *name = g_strdup_printf("iommu-dm-s390-%04x", > + iommu->pbdev->uid); > + memory_region_init_alias(&iommu->dm_mr, OBJECT(&iommu->mr), name, ms->ram, > + 0, s390_get_memory_limit(s390ms)); Hm, the memory limit can exceed ms->ram. Would it be possible to use get_system_memory() here, such that whatever is mapped into physical address space (including virtio-mem devices etc) would simply be aliased with an offset? Or does that blow up elsewhere? target/i386/kvm/kvm.c seems to do that: memory_region_init_alias(&smram_as_mem, OBJECT(kvm_state), "mem-smram", get_system_memory(), 0, ~0ull); and target/i386/tcg/system/tcg-cpu.c
On 24/01/2025 21.21, Matthew Rosato wrote: > When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T > bit set, treat this as a request to perform direct mapping instead of > address translation. In order to facilitate this, pin the entirety of > guest memory into the host iommu. > > Pinning for the direct mapping case is handled via vfio and its memory > listener. Additionally, ram discard settings are inherited from vfio: > coordinated discards (e.g. virtio-mem) are allowed while uncoordinated > discards (e.g. virtio-balloon) are disabled. > > Subsequent guest DMA operations are all expected to be of the format > guest_phys+sdma, allowing them to be used as lookup into the host > iommu table. > > Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> > --- ... > diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c > index e386d75d58..4c108fa8c4 100644 > --- a/hw/s390x/s390-pci-inst.c > +++ b/hw/s390x/s390-pci-inst.c > @@ -16,6 +16,7 @@ > #include "exec/memory.h" > #include "qemu/error-report.h" > #include "system/hw_accel.h" > +#include "hw/boards.h" > #include "hw/pci/pci_device.h" > #include "hw/s390x/s390-pci-inst.h" > #include "hw/s390x/s390-pci-bus.h" > @@ -1008,17 +1009,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib, > } > > /* currently we only support designation type 1 with translation */ > - if (!(dt == ZPCI_IOTA_RTTO && t)) { > + if (t && !(dt == ZPCI_IOTA_RTTO)) { While you're at it, you could change that "!(dt == ZPCI_IOTA_RTTO)" into "dt != ZPCI_IOTA_RTTO". > error_report("unsupported ioat dt %d t %d", dt, t); > s390_program_interrupt(env, PGM_OPERAND, ra); > return -EINVAL; > + } else if (!t && !pbdev->rtr_allowed) { > + error_report("relaxed translation not allowed"); Not sure, but maybe better use qemu_log_mask(LOG_GUEST_ERROR, ...) instead? > + s390_program_interrupt(env, PGM_OPERAND, ra); > + return -EINVAL; > } Thomas
On Fri, 2025-01-24 at 15:21 -0500, Matthew Rosato wrote: > When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T > bit set, treat this as a request to perform direct mapping instead of > address translation. In order to facilitate this, pin the entirety of > guest memory into the host iommu. > > Pinning for the direct mapping case is handled via vfio and its memory > listener. Additionally, ram discard settings are inherited from vfio: > coordinated discards (e.g. virtio-mem) are allowed while uncoordinated > discards (e.g. virtio-balloon) are disabled. > > Subsequent guest DMA operations are all expected to be of the format > guest_phys+sdma, allowing them to be used as lookup into the host > iommu table. > > Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> > --- > hw/s390x/s390-pci-bus.c | 36 +++++++++++++++++++++++++++++++-- > hw/s390x/s390-pci-inst.c | 13 ++++++++++-- > hw/s390x/s390-pci-vfio.c | 15 ++++++++++---- > hw/s390x/s390-virtio-ccw.c | 5 +++++ > include/hw/s390x/s390-pci-bus.h | 4 ++++ > 5 files changed, 65 insertions(+), 8 deletions(-) > > diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c > index eead269cc2..8cf5aec1a2 100644 > --- a/hw/s390x/s390-pci-bus.c > +++ b/hw/s390x/s390-pci-bus.c > @@ -18,6 +18,8 @@ > #include "hw/s390x/s390-pci-inst.h" > #include "hw/s390x/s390-pci-kvm.h" > #include "hw/s390x/s390-pci-vfio.h" > +#include "hw/s390x/s390-virtio-ccw.h" > +#include "hw/boards.h" > #include "hw/pci/pci_bus.h" > #include "hw/qdev-properties.h" > #include "hw/pci/pci_bridge.h" > @@ -720,16 +722,44 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu) > TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr), > name, iommu->pal + 1); > iommu->enabled = true; > + iommu->direct_map = false; > memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr)); > g_free(name); > } > > +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu) > +{ > + MachineState *ms = MACHINE(qdev_get_machine()); > + S390CcwMachineState *s390ms = S390_CCW_MACHINE(ms); > + > + /* > + * For direct-mapping we must map the entire guest address space. Rather > + * than using an iommu, create a memory region alias that maps GPA X to > + * iova X + SDMA. VFIO will handle pinning via its memory listener. Nit: I'd capitalize IOVA to match GPA. > + */ > + g_autofree char *name = g_strdup_printf("iommu-dm-s390-%04x", > + iommu->pbdev->uid); > + memory_region_init_alias(&iommu->dm_mr, OBJECT(&iommu->mr), name, ms->ram, > + 0, s390_get_memory_limit(s390ms)); > + iommu->enabled = true; > + iommu->direct_map = true; > + memory_region_add_subregion(&iommu->mr, iommu->pbdev->zpci_fn.sdma, > + &iommu->dm_mr); > +} > + > void s390_pci_iommu_disable(S390PCIIOMMU *iommu) > { > iommu->enabled = false; > g_hash_table_remove_all(iommu->iotlb); > - memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr)); > - object_unparent(OBJECT(&iommu->iommu_mr)); > + if (iommu->direct_map) { > + memory_region_del_subregion(&iommu->mr, &iommu->dm_mr); > + iommu->direct_map = false; > + object_unparent(OBJECT(&iommu->dm_mr)); > + } else { > + memory_region_del_subregion(&iommu->mr, > + MEMORY_REGION(&iommu->iommu_mr)); > + object_unparent(OBJECT(&iommu->iommu_mr)); > + } > } > > static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn) > @@ -1488,6 +1518,8 @@ static const Property s390_pci_device_properties[] = { > DEFINE_PROP_BOOL("interpret", S390PCIBusDevice, interp, true), > DEFINE_PROP_BOOL("forwarding-assist", S390PCIBusDevice, forwarding_assist, > true), > + DEFINE_PROP_BOOL("relaxed-translation", S390PCIBusDevice, rtr_allowed, > + true), Question: Do we maybe want to default rtr_allowed to false for ISM devices? Performance wise it doesn't matter much since they keep their mappings fairly static and it would help us catch bugs in the handling of rtr_allowed == false devices, the ISM driver and increase security. > }; > > static const VMStateDescription s390_pci_device_vmstate = { > diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c > index e386d75d58..4c108fa8c4 100644 > --- a/hw/s390x/s390-pci-inst.c > +++ b/hw/s390x/s390-pci-inst.c > @@ -16,6 +16,7 @@ > #include "exec/memory.h" > #include "qemu/error-report.h" > #include "system/hw_accel.h" > +#include "hw/boards.h" > #include "hw/pci/pci_device.h" > #include "hw/s390x/s390-pci-inst.h" > #include "hw/s390x/s390-pci-bus.h" > @@ -1008,17 +1009,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib, > } > > /* currently we only support designation type 1 with translation */ > - if (!(dt == ZPCI_IOTA_RTTO && t)) { > + if (t && !(dt == ZPCI_IOTA_RTTO)) { What Thomas said ;-) > error_report("unsupported ioat dt %d t %d", dt, t); > s390_program_interrupt(env, PGM_OPERAND, ra); > return -EINVAL; > + } else if (!t && !pbdev->rtr_allowed) { > + error_report("relaxed translation not allowed"); > + s390_program_interrupt(env, PGM_OPERAND, ra); > + return -EINVAL; > } > > iommu->pba = pba; > iommu->pal = pal; > iommu->g_iota = g_iota; > > - s390_pci_iommu_enable(iommu); > + if (t) { > + s390_pci_iommu_enable(iommu); > + } else { > + s390_pci_iommu_dm_enable(iommu); > + } > > return 0; > } > diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c > index 7dbbc76823..dad525c81c 100644 > --- a/hw/s390x/s390-pci-vfio.c > +++ b/hw/s390x/s390-pci-vfio.c > @@ -133,11 +133,18 @@ static void s390_pci_read_base(S390PCIBusDevice *pbdev, > > /* > * If appropriate, reduce the size of the supported DMA aperture reported > - * to the guest based upon the vfio DMA limit. > + * to the guest based upon the vfio DMA limit. This is applicable for > + * devices that are guaranteed to not use relaxed translation. If the > + * device is capable of relaxed translation then we must advertise the > + * full aperture. In this case, if translation is used then we will > + * rely on the vfio DMA limit counting and use RPCIT CC1 / status 16 > + * to request the guest free DMA mappings when necessary. Not a native speaker but I think there is a "to" missing in the last sentence and I'd have used "as necessary". > */ > - vfio_size = pbdev->iommu->max_dma_limit << TARGET_PAGE_BITS; > - if (vfio_size > 0 && vfio_size < cap->end_dma - cap->start_dma + 1) { > - pbdev->zpci_fn.edma = cap->start_dma + vfio_size - 1; > + if (!pbdev->rtr_allowed) { > + vfio_size = pbdev->iommu->max_dma_limit << TARGET_PAGE_BITS; > + if (vfio_size > 0 && vfio_size < cap->end_dma - cap->start_dma + 1) { > + pbdev->zpci_fn.edma = cap->start_dma + vfio_size - 1; > + } > } > } > > diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c > index 3af613d4e9..c96cd0d4bb 100644 > --- a/hw/s390x/s390-virtio-ccw.c > +++ b/hw/s390x/s390-virtio-ccw.c > @@ -907,8 +907,13 @@ static void ccw_machine_9_2_instance_options(MachineState *machine) > > static void ccw_machine_9_2_class_options(MachineClass *mc) > { > + static GlobalProperty compat[] = { > + { TYPE_S390_PCI_DEVICE, "relaxed-translation", "off", }, > + }; > + > ccw_machine_10_0_class_options(mc); > compat_props_add(mc->compat_props, hw_compat_9_2, hw_compat_9_2_len); > + compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat)); > } > DEFINE_CCW_MACHINE(9, 2); > > diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h > index 2c43ea123f..27732247cf 100644 > --- a/include/hw/s390x/s390-pci-bus.h > +++ b/include/hw/s390x/s390-pci-bus.h > @@ -277,7 +277,9 @@ struct S390PCIIOMMU { > AddressSpace as; > MemoryRegion mr; > IOMMUMemoryRegion iommu_mr; > + MemoryRegion dm_mr; > bool enabled; > + bool direct_map; > uint64_t g_iota; > uint64_t pba; > uint64_t pal; > @@ -362,6 +364,7 @@ struct S390PCIBusDevice { > bool interp; > bool forwarding_assist; > bool aif; > + bool rtr_allowed; Nit: In the kernel in struct zpci_dev you used rtr_avail but "allowed" in the comment, just for gerppability I'd prefer the names to match. > QTAILQ_ENTRY(S390PCIBusDevice) link; > }; > > @@ -389,6 +392,7 @@ int pci_chsc_sei_nt2_have_event(void); > void s390_pci_sclp_configure(SCCB *sccb); > void s390_pci_sclp_deconfigure(SCCB *sccb); > void s390_pci_iommu_enable(S390PCIIOMMU *iommu); > +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu); Nit: I find "_dm_" a bit hard to map to "direct map". If you want two letters I'd go for "_pt_" for "_iommu_pass_through_" or maybe "_direct_map_". > void s390_pci_iommu_disable(S390PCIIOMMU *iommu); > void s390_pci_generate_error_event(uint16_t pec, uint32_t fh, uint32_t fid, > uint64_t faddr, uint32_t e);
>> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu) >> +{ >> + MachineState *ms = MACHINE(qdev_get_machine()); >> + S390CcwMachineState *s390ms = S390_CCW_MACHINE(ms); >> + >> + /* >> + * For direct-mapping we must map the entire guest address space. Rather >> + * than using an iommu, create a memory region alias that maps GPA X to >> + * iova X + SDMA. VFIO will handle pinning via its memory listener. >> + */ >> + g_autofree char *name = g_strdup_printf("iommu-dm-s390-%04x", >> + iommu->pbdev->uid); >> + memory_region_init_alias(&iommu->dm_mr, OBJECT(&iommu->mr), name, ms->ram, >> + 0, s390_get_memory_limit(s390ms)); > > Hm, the memory limit can exceed ms->ram. > > Would it be possible to use get_system_memory() here, such that whatever is mapped into physical address space (including virtio-mem devices etc) would simply be aliased with an offset? > > Or does that blow up elsewhere? Testing with memory_region_init_alias(&iommu->dm_mr, OBJECT(&iommu->mr), name, get_system_memory(), 0, s390_get_memory_limit(s390ms)); Looks good so far, will change for next version > > target/i386/kvm/kvm.c seems to do that: > > memory_region_init_alias(&smram_as_mem, OBJECT(kvm_state), "mem-smram", > get_system_memory(), 0, ~0ull); > > and target/i386/tcg/system/tcg-cpu.c >
>> #include "hw/s390x/s390-pci-bus.h" >> @@ -1008,17 +1009,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib, >> } >> /* currently we only support designation type 1 with translation */ >> - if (!(dt == ZPCI_IOTA_RTTO && t)) { >> + if (t && !(dt == ZPCI_IOTA_RTTO)) { > > While you're at it, you could change that "!(dt == ZPCI_IOTA_RTTO)" into > "dt != ZPCI_IOTA_RTTO". > ACK >> error_report("unsupported ioat dt %d t %d", dt, t); >> s390_program_interrupt(env, PGM_OPERAND, ra); >> return -EINVAL; >> + } else if (!t && !pbdev->rtr_allowed) { >> + error_report("relaxed translation not allowed"); > > Not sure, but maybe better use qemu_log_mask(LOG_GUEST_ERROR, ...) instead? > Hrm, this one I'm not so sure. If I force this path and we give back the operand exception to the guest, the guest kernel is going to panic as a result. We are very much in a "that wasn't ever supposed to happen" path, just like the unsupported dt above, because we already reported what we do (not) support via CLP. I just tried to force the path with default settings + the above change, and I don't see anything in the qemu log from qemu_log_mask(LOG_GUEST_ERROR, ...) but I do with error_report. Given the unlikely scenario that we reach this path and the potential ramifications, I think I'd rather leave it as error_report unless you've got a strong opinion on it! Thanks, Matt
>> >> static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn) >> @@ -1488,6 +1518,8 @@ static const Property s390_pci_device_properties[] = { >> DEFINE_PROP_BOOL("interpret", S390PCIBusDevice, interp, true), >> DEFINE_PROP_BOOL("forwarding-assist", S390PCIBusDevice, forwarding_assist, >> true), >> + DEFINE_PROP_BOOL("relaxed-translation", S390PCIBusDevice, rtr_allowed, >> + true), > > Question: Do we maybe want to default rtr_allowed to false for ISM > devices? Performance wise it doesn't matter much since they keep their > mappings fairly static and it would help us catch bugs in the handling > of rtr_allowed == false devices, the ISM driver and increase security. > I've been asking myself the same. Believe we've discussed it in the past (off-list) too. I'd be fine with that. I think it doesn't change the line above, but I can add a check against pft in s390_pcihost_plug() after clp info is read in, along with a small comment, that changes the setting to false for ISM devices. Note that there is one side effect I can think of based on the kernel implementation - if a guest has, say, NVMe and ISM passed through with iommu.passthrough=1 specified then they would always see the "Falling back to IOMMU_DOMAIN_DMA" message for the ISM device(s) because of rtr_allowed == false. >> /* >> * If appropriate, reduce the size of the supported DMA aperture reported >> - * to the guest based upon the vfio DMA limit. >> + * to the guest based upon the vfio DMA limit. This is applicable for >> + * devices that are guaranteed to not use relaxed translation. If the >> + * device is capable of relaxed translation then we must advertise the >> + * full aperture. In this case, if translation is used then we will >> + * rely on the vfio DMA limit counting and use RPCIT CC1 / status 16 >> + * to request the guest free DMA mappings when necessary. > > Not a native speaker but I think there is a "to" missing in the last > sentence and I'd have used "as necessary". 'request that the' would probably work too... But yeah, something is missing, I'll re-word. >> @@ -362,6 +364,7 @@ struct S390PCIBusDevice { >> bool interp; >> bool forwarding_assist; >> bool aif; >> + bool rtr_allowed; > > Nit: In the kernel in struct zpci_dev you used rtr_avail but "allowed" > in the comment, just for gerppability I'd prefer the names to match. I guess in my head, QEMU is the one allowing it and the guest kernel is checking whether or not it's available. But I have no strong opinion on the name; can rename the QEMU variable to rtr_avail > >> QTAILQ_ENTRY(S390PCIBusDevice) link; >> }; >> >> @@ -389,6 +392,7 @@ int pci_chsc_sei_nt2_have_event(void); >> void s390_pci_sclp_configure(SCCB *sccb); >> void s390_pci_sclp_deconfigure(SCCB *sccb); >> void s390_pci_iommu_enable(S390PCIIOMMU *iommu); >> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu); > > Nit: I find "_dm_" a bit hard to map to "direct map". If you want two > letters I'd go for "_pt_" for "_iommu_pass_through_" or maybe > "_direct_map_". OK
On 27/01/2025 21.45, Matthew Rosato wrote: > >>> #include "hw/s390x/s390-pci-bus.h" >>> @@ -1008,17 +1009,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib, >>> } >>> /* currently we only support designation type 1 with translation */ >>> - if (!(dt == ZPCI_IOTA_RTTO && t)) { >>> + if (t && !(dt == ZPCI_IOTA_RTTO)) { >> >> While you're at it, you could change that "!(dt == ZPCI_IOTA_RTTO)" into >> "dt != ZPCI_IOTA_RTTO". >> > > ACK > >>> error_report("unsupported ioat dt %d t %d", dt, t); >>> s390_program_interrupt(env, PGM_OPERAND, ra); >>> return -EINVAL; >>> + } else if (!t && !pbdev->rtr_allowed) { >>> + error_report("relaxed translation not allowed"); >> >> Not sure, but maybe better use qemu_log_mask(LOG_GUEST_ERROR, ...) instead? >> > > Hrm, this one I'm not so sure. If I force this path and we give back the operand exception to the guest, the guest kernel is going to panic as a result. We are very much in a "that wasn't ever supposed to happen" path, just like the unsupported dt above, because we already reported what we do (not) support via CLP. > > I just tried to force the path with default settings + the above change, and I don't see anything in the qemu log from qemu_log_mask(LOG_GUEST_ERROR, ...) but I do with error_report. The qemu_log_mask() stuff has to be enabled on the command line with "-d guest_errors". > Given the unlikely scenario that we reach this path and the potential ramifications, I think I'd rather leave it as error_report unless you've got a strong opinion on it! Ok, fine for me, too. Thomas
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c index eead269cc2..8cf5aec1a2 100644 --- a/hw/s390x/s390-pci-bus.c +++ b/hw/s390x/s390-pci-bus.c @@ -18,6 +18,8 @@ #include "hw/s390x/s390-pci-inst.h" #include "hw/s390x/s390-pci-kvm.h" #include "hw/s390x/s390-pci-vfio.h" +#include "hw/s390x/s390-virtio-ccw.h" +#include "hw/boards.h" #include "hw/pci/pci_bus.h" #include "hw/qdev-properties.h" #include "hw/pci/pci_bridge.h" @@ -720,16 +722,44 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu) TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr), name, iommu->pal + 1); iommu->enabled = true; + iommu->direct_map = false; memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr)); g_free(name); } +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu) +{ + MachineState *ms = MACHINE(qdev_get_machine()); + S390CcwMachineState *s390ms = S390_CCW_MACHINE(ms); + + /* + * For direct-mapping we must map the entire guest address space. Rather + * than using an iommu, create a memory region alias that maps GPA X to + * iova X + SDMA. VFIO will handle pinning via its memory listener. + */ + g_autofree char *name = g_strdup_printf("iommu-dm-s390-%04x", + iommu->pbdev->uid); + memory_region_init_alias(&iommu->dm_mr, OBJECT(&iommu->mr), name, ms->ram, + 0, s390_get_memory_limit(s390ms)); + iommu->enabled = true; + iommu->direct_map = true; + memory_region_add_subregion(&iommu->mr, iommu->pbdev->zpci_fn.sdma, + &iommu->dm_mr); +} + void s390_pci_iommu_disable(S390PCIIOMMU *iommu) { iommu->enabled = false; g_hash_table_remove_all(iommu->iotlb); - memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr)); - object_unparent(OBJECT(&iommu->iommu_mr)); + if (iommu->direct_map) { + memory_region_del_subregion(&iommu->mr, &iommu->dm_mr); + iommu->direct_map = false; + object_unparent(OBJECT(&iommu->dm_mr)); + } else { + memory_region_del_subregion(&iommu->mr, + MEMORY_REGION(&iommu->iommu_mr)); + object_unparent(OBJECT(&iommu->iommu_mr)); + } } static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn) @@ -1488,6 +1518,8 @@ static const Property s390_pci_device_properties[] = { DEFINE_PROP_BOOL("interpret", S390PCIBusDevice, interp, true), DEFINE_PROP_BOOL("forwarding-assist", S390PCIBusDevice, forwarding_assist, true), + DEFINE_PROP_BOOL("relaxed-translation", S390PCIBusDevice, rtr_allowed, + true), }; static const VMStateDescription s390_pci_device_vmstate = { diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c index e386d75d58..4c108fa8c4 100644 --- a/hw/s390x/s390-pci-inst.c +++ b/hw/s390x/s390-pci-inst.c @@ -16,6 +16,7 @@ #include "exec/memory.h" #include "qemu/error-report.h" #include "system/hw_accel.h" +#include "hw/boards.h" #include "hw/pci/pci_device.h" #include "hw/s390x/s390-pci-inst.h" #include "hw/s390x/s390-pci-bus.h" @@ -1008,17 +1009,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib, } /* currently we only support designation type 1 with translation */ - if (!(dt == ZPCI_IOTA_RTTO && t)) { + if (t && !(dt == ZPCI_IOTA_RTTO)) { error_report("unsupported ioat dt %d t %d", dt, t); s390_program_interrupt(env, PGM_OPERAND, ra); return -EINVAL; + } else if (!t && !pbdev->rtr_allowed) { + error_report("relaxed translation not allowed"); + s390_program_interrupt(env, PGM_OPERAND, ra); + return -EINVAL; } iommu->pba = pba; iommu->pal = pal; iommu->g_iota = g_iota; - s390_pci_iommu_enable(iommu); + if (t) { + s390_pci_iommu_enable(iommu); + } else { + s390_pci_iommu_dm_enable(iommu); + } return 0; } diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c index 7dbbc76823..dad525c81c 100644 --- a/hw/s390x/s390-pci-vfio.c +++ b/hw/s390x/s390-pci-vfio.c @@ -133,11 +133,18 @@ static void s390_pci_read_base(S390PCIBusDevice *pbdev, /* * If appropriate, reduce the size of the supported DMA aperture reported - * to the guest based upon the vfio DMA limit. + * to the guest based upon the vfio DMA limit. This is applicable for + * devices that are guaranteed to not use relaxed translation. If the + * device is capable of relaxed translation then we must advertise the + * full aperture. In this case, if translation is used then we will + * rely on the vfio DMA limit counting and use RPCIT CC1 / status 16 + * to request the guest free DMA mappings when necessary. */ - vfio_size = pbdev->iommu->max_dma_limit << TARGET_PAGE_BITS; - if (vfio_size > 0 && vfio_size < cap->end_dma - cap->start_dma + 1) { - pbdev->zpci_fn.edma = cap->start_dma + vfio_size - 1; + if (!pbdev->rtr_allowed) { + vfio_size = pbdev->iommu->max_dma_limit << TARGET_PAGE_BITS; + if (vfio_size > 0 && vfio_size < cap->end_dma - cap->start_dma + 1) { + pbdev->zpci_fn.edma = cap->start_dma + vfio_size - 1; + } } } diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c index 3af613d4e9..c96cd0d4bb 100644 --- a/hw/s390x/s390-virtio-ccw.c +++ b/hw/s390x/s390-virtio-ccw.c @@ -907,8 +907,13 @@ static void ccw_machine_9_2_instance_options(MachineState *machine) static void ccw_machine_9_2_class_options(MachineClass *mc) { + static GlobalProperty compat[] = { + { TYPE_S390_PCI_DEVICE, "relaxed-translation", "off", }, + }; + ccw_machine_10_0_class_options(mc); compat_props_add(mc->compat_props, hw_compat_9_2, hw_compat_9_2_len); + compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat)); } DEFINE_CCW_MACHINE(9, 2); diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h index 2c43ea123f..27732247cf 100644 --- a/include/hw/s390x/s390-pci-bus.h +++ b/include/hw/s390x/s390-pci-bus.h @@ -277,7 +277,9 @@ struct S390PCIIOMMU { AddressSpace as; MemoryRegion mr; IOMMUMemoryRegion iommu_mr; + MemoryRegion dm_mr; bool enabled; + bool direct_map; uint64_t g_iota; uint64_t pba; uint64_t pal; @@ -362,6 +364,7 @@ struct S390PCIBusDevice { bool interp; bool forwarding_assist; bool aif; + bool rtr_allowed; QTAILQ_ENTRY(S390PCIBusDevice) link; }; @@ -389,6 +392,7 @@ int pci_chsc_sei_nt2_have_event(void); void s390_pci_sclp_configure(SCCB *sccb); void s390_pci_sclp_deconfigure(SCCB *sccb); void s390_pci_iommu_enable(S390PCIIOMMU *iommu); +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu); void s390_pci_iommu_disable(S390PCIIOMMU *iommu); void s390_pci_generate_error_event(uint16_t pec, uint32_t fh, uint32_t fid, uint64_t faddr, uint32_t e);
When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T bit set, treat this as a request to perform direct mapping instead of address translation. In order to facilitate this, pin the entirety of guest memory into the host iommu. Pinning for the direct mapping case is handled via vfio and its memory listener. Additionally, ram discard settings are inherited from vfio: coordinated discards (e.g. virtio-mem) are allowed while uncoordinated discards (e.g. virtio-balloon) are disabled. Subsequent guest DMA operations are all expected to be of the format guest_phys+sdma, allowing them to be used as lookup into the host iommu table. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> --- hw/s390x/s390-pci-bus.c | 36 +++++++++++++++++++++++++++++++-- hw/s390x/s390-pci-inst.c | 13 ++++++++++-- hw/s390x/s390-pci-vfio.c | 15 ++++++++++---- hw/s390x/s390-virtio-ccw.c | 5 +++++ include/hw/s390x/s390-pci-bus.h | 4 ++++ 5 files changed, 65 insertions(+), 8 deletions(-)