mbox series

[0/8] s390x/pci: Fixing s390 vfio-pci ISM support

Message ID 1611089059-6468-1-git-send-email-mjrosato@linux.ibm.com (mailing list archive)
Headers show
Series s390x/pci: Fixing s390 vfio-pci ISM support | expand

Message

Matthew Rosato Jan. 19, 2021, 8:44 p.m. UTC
Today, ISM devices are completely disallowed for vfio-pci passthrough as
QEMU rejects the device due to an (inappropriate) MSI-X check.  Removing
this fence, however, reveals additional deficiencies in the s390x PCI
interception layer that prevent ISM devices from working correctly.
Namely, ISM block write operations have particular requirements in regards
to the alignment, size and order of writes performed that cannot be
guaranteed when breaking up write operations through the typical
vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
s390 PCI instructions are used, which is also not guaranteed when the I/O
is passed through the typical userspace channels.

This patchset provides a set of fixes related to enabling ISM device
passthrough and includes patches to enable use of a new vfio region that
will allow s390x PCI pass-through devices to perform s390 PCI instructions
in such a way that the same instruction issued on the guest is re-issued
on the host.

Associated kernel patchset:
https://lkml.org/lkml/2021/1/19/874

Changes from RFC -> v1:
- Refresh the header sync (built using Eric's 'update-linux-headers:
Include const.h' + manually removed pvrdma_ring.h again)
- Remove s390x/pci: fix pcistb length (already merged)
- Remove s390x/pci: Fix memory_region_access_valid call (already merged)
- Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
buffer pcistb_buf rather than allocating/freeing its own.
- New patch: track the PFT (PCI Function Type) separately from guest CLP
response data -- we tell the guest '0' for now due to limitations in
measurement block support, but we can still use the real value provided via
the vfio CLP capabilities to make decisions.
- Use the PFT (pci function type) to determine when to use the region
for PCISTB/PCILG (only for ISM), rather than using the relaxed alignment
bit.
- As a result, the pcistb_default is now updated to also handle the
possibility of relaxed alignment via 2 new functions, pcistb_validate_write
and pcistb_write, which serve as wrappers to the memory_region calls.
- New patch, which partially restores the MSI-X fence for passthrough
devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
strictly required for passthrough' but left separately for now as I felt it
needed a clear commit description of why we should still fence this case.

Matthew Rosato (8):
  linux-headers: update against 5.11-rc4
  s390x/pci: Keep track of the PCI Function type
  s390x/pci: MSI-X isn't strictly required for passthrough
  s390x/pci: Introduce the ZpciOps structure
  s390x/pci: Handle devices that support relaxed alignment
  s390x/pci: PCISTB via the vfio zPCI I/O region
  s390x/pci: PCILG via the vfio zPCI I/O region
  s390x/pci: Prevent ISM device passthrough on older host kernels

 hw/s390x/s390-pci-bus.c                            |  45 ++-
 hw/s390x/s390-pci-inst.c                           | 389 +++++++++++++++------
 hw/s390x/s390-pci-vfio.c                           | 152 ++++++++
 include/hw/s390x/s390-pci-bus.h                    |  29 ++
 include/hw/s390x/s390-pci-clp.h                    |   1 +
 include/hw/s390x/s390-pci-inst.h                   |   3 +
 include/hw/s390x/s390-pci-vfio.h                   |  23 ++
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h        |   2 +-
 include/standard-headers/drm/drm_fourcc.h          | 175 ++++++++-
 include/standard-headers/linux/ethtool.h           |   2 +-
 include/standard-headers/linux/fuse.h              |  30 +-
 include/standard-headers/linux/kernel.h            |   9 +-
 include/standard-headers/linux/pci_regs.h          |  16 +
 include/standard-headers/linux/vhost_types.h       |   9 +
 include/standard-headers/linux/virtio_gpu.h        |  82 +++++
 include/standard-headers/linux/virtio_ids.h        |  44 ++-
 linux-headers/asm-arm64/kvm.h                      |   3 -
 linux-headers/asm-generic/unistd.h                 |   6 +-
 linux-headers/asm-mips/unistd_n32.h                |   1 +
 linux-headers/asm-mips/unistd_n64.h                |   1 +
 linux-headers/asm-mips/unistd_o32.h                |   1 +
 linux-headers/asm-powerpc/unistd_32.h              |   1 +
 linux-headers/asm-powerpc/unistd_64.h              |   1 +
 linux-headers/asm-s390/unistd_32.h                 |   1 +
 linux-headers/asm-s390/unistd_64.h                 |   1 +
 linux-headers/asm-x86/kvm.h                        |   1 +
 linux-headers/asm-x86/unistd_32.h                  |   1 +
 linux-headers/asm-x86/unistd_64.h                  |   1 +
 linux-headers/asm-x86/unistd_x32.h                 |   1 +
 linux-headers/linux/kvm.h                          |  58 ++-
 linux-headers/linux/userfaultfd.h                  |   9 +
 linux-headers/linux/vfio.h                         |   5 +
 linux-headers/linux/vfio_zdev.h                    |  34 ++
 linux-headers/linux/vhost.h                        |   4 +
 34 files changed, 996 insertions(+), 145 deletions(-)

Comments

Pierre Morel Jan. 20, 2021, 9:12 a.m. UTC | #1
On 1/19/21 9:44 PM, Matthew Rosato wrote:
> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> QEMU rejects the device due to an (inappropriate) MSI-X check.  Removing
> this fence, however, reveals additional deficiencies in the s390x PCI
> interception layer that prevent ISM devices from working correctly.
> Namely, ISM block write operations have particular requirements in regards
> to the alignment, size and order of writes performed that cannot be
> guaranteed when breaking up write operations through the typical
> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
> s390 PCI instructions are used, which is also not guaranteed when the I/O
> is passed through the typical userspace channels.
> 
> This patchset provides a set of fixes related to enabling ISM device
> passthrough and includes patches to enable use of a new vfio region that
> will allow s390x PCI pass-through devices to perform s390 PCI instructions
> in such a way that the same instruction issued on the guest is re-issued
> on the host.
> 
> Associated kernel patchset:
> https://lkml.org/lkml/2021/1/19/874
> 
> Changes from RFC -> v1:
> - Refresh the header sync (built using Eric's 'update-linux-headers:
> Include const.h' + manually removed pvrdma_ring.h again)
> - Remove s390x/pci: fix pcistb length (already merged)
> - Remove s390x/pci: Fix memory_region_access_valid call (already merged)
> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
> buffer pcistb_buf rather than allocating/freeing its own.
> - New patch: track the PFT (PCI Function Type) separately from guest CLP
> response data -- we tell the guest '0' for now due to limitations in
> measurement block support, but we can still use the real value provided via
> the vfio CLP capabilities to make decisions.
> - Use the PFT (pci function type) to determine when to use the region
> for PCISTB/PCILG (only for ISM), rather than using the relaxed alignment
> bit.
> - As a result, the pcistb_default is now updated to also handle the
> possibility of relaxed alignment via 2 new functions, pcistb_validate_write
> and pcistb_write, which serve as wrappers to the memory_region calls.
> - New patch, which partially restores the MSI-X fence for passthrough
> devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
> strictly required for passthrough' but left separately for now as I felt it
> needed a clear commit description of why we should still fence this case.
> 
Hi,

The choice of using the new VFIO region is made on the ISM PCI function 
type (PFT), which makes the patch ISM specific, why don't we use here 
the MIO bit common to any zPCI function and present in kernel to make 
the choice?

Regards,
Pierre
Matthew Rosato Jan. 20, 2021, 2:03 p.m. UTC | #2
On 1/20/21 4:12 AM, Pierre Morel wrote:
> 
> 
> On 1/19/21 9:44 PM, Matthew Rosato wrote:
>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>> QEMU rejects the device due to an (inappropriate) MSI-X check.  Removing
>> this fence, however, reveals additional deficiencies in the s390x PCI
>> interception layer that prevent ISM devices from working correctly.
>> Namely, ISM block write operations have particular requirements in 
>> regards
>> to the alignment, size and order of writes performed that cannot be
>> guaranteed when breaking up write operations through the typical
>> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
>> s390 PCI instructions are used, which is also not guaranteed when the I/O
>> is passed through the typical userspace channels.
>>
>> This patchset provides a set of fixes related to enabling ISM device
>> passthrough and includes patches to enable use of a new vfio region that
>> will allow s390x PCI pass-through devices to perform s390 PCI 
>> instructions
>> in such a way that the same instruction issued on the guest is re-issued
>> on the host.
>>
>> Associated kernel patchset:
>> https://lkml.org/lkml/2021/1/19/874
>>
>> Changes from RFC -> v1:
>> - Refresh the header sync (built using Eric's 'update-linux-headers:
>> Include const.h' + manually removed pvrdma_ring.h again)
>> - Remove s390x/pci: fix pcistb length (already merged)
>> - Remove s390x/pci: Fix memory_region_access_valid call (already merged)
>> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
>> buffer pcistb_buf rather than allocating/freeing its own.
>> - New patch: track the PFT (PCI Function Type) separately from guest CLP
>> response data -- we tell the guest '0' for now due to limitations in
>> measurement block support, but we can still use the real value 
>> provided via
>> the vfio CLP capabilities to make decisions.
>> - Use the PFT (pci function type) to determine when to use the region
>> for PCISTB/PCILG (only for ISM), rather than using the relaxed alignment
>> bit.
>> - As a result, the pcistb_default is now updated to also handle the
>> possibility of relaxed alignment via 2 new functions, 
>> pcistb_validate_write
>> and pcistb_write, which serve as wrappers to the memory_region calls.
>> - New patch, which partially restores the MSI-X fence for passthrough
>> devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
>> strictly required for passthrough' but left separately for now as I 
>> felt it
>> needed a clear commit description of why we should still fence this case.
>>
> Hi,
> 
> The choice of using the new VFIO region is made on the ISM PCI function 
> type (PFT), which makes the patch ISM specific, why don't we use here 
> the MIO bit common to any zPCI function and present in kernel to make 
> the choice?
> 

As discussed during the RFC (and see my reply also to the kernel set), 
the use of this region only works for devices that do not rely on MSI-X 
interrupts.  If we did as you suggest, other device types like mlx would 
not receive MSI-X interrupts in the guest (And I did indeed try 
variations where I used the special VFIO region for all 
PCISTG/PCILG/PCISTB for various device types)

So the idea for now was to solve the specific problem at hand (getting 
ISM devices working).
Pierre Morel Jan. 20, 2021, 2:45 p.m. UTC | #3
On 1/20/21 3:03 PM, Matthew Rosato wrote:
> On 1/20/21 4:12 AM, Pierre Morel wrote:
>>
>>
>> On 1/19/21 9:44 PM, Matthew Rosato wrote:
>>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>>> QEMU rejects the device due to an (inappropriate) MSI-X check.  Removing
>>> this fence, however, reveals additional deficiencies in the s390x PCI
>>> interception layer that prevent ISM devices from working correctly.
>>> Namely, ISM block write operations have particular requirements in 
>>> regards
>>> to the alignment, size and order of writes performed that cannot be
>>> guaranteed when breaking up write operations through the typical
>>> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
>>> s390 PCI instructions are used, which is also not guaranteed when the 
>>> I/O
>>> is passed through the typical userspace channels.
>>>
>>> This patchset provides a set of fixes related to enabling ISM device
>>> passthrough and includes patches to enable use of a new vfio region that
>>> will allow s390x PCI pass-through devices to perform s390 PCI 
>>> instructions
>>> in such a way that the same instruction issued on the guest is re-issued
>>> on the host.
>>>
>>> Associated kernel patchset:
>>> https://lkml.org/lkml/2021/1/19/874
>>>
>>> Changes from RFC -> v1:
>>> - Refresh the header sync (built using Eric's 'update-linux-headers:
>>> Include const.h' + manually removed pvrdma_ring.h again)
>>> - Remove s390x/pci: fix pcistb length (already merged)
>>> - Remove s390x/pci: Fix memory_region_access_valid call (already merged)
>>> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
>>> buffer pcistb_buf rather than allocating/freeing its own.
>>> - New patch: track the PFT (PCI Function Type) separately from guest CLP
>>> response data -- we tell the guest '0' for now due to limitations in
>>> measurement block support, but we can still use the real value 
>>> provided via
>>> the vfio CLP capabilities to make decisions.
>>> - Use the PFT (pci function type) to determine when to use the region
>>> for PCISTB/PCILG (only for ISM), rather than using the relaxed alignment
>>> bit.
>>> - As a result, the pcistb_default is now updated to also handle the
>>> possibility of relaxed alignment via 2 new functions, 
>>> pcistb_validate_write
>>> and pcistb_write, which serve as wrappers to the memory_region calls.
>>> - New patch, which partially restores the MSI-X fence for passthrough
>>> devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
>>> strictly required for passthrough' but left separately for now as I 
>>> felt it
>>> needed a clear commit description of why we should still fence this 
>>> case.
>>>
>> Hi,
>>
>> The choice of using the new VFIO region is made on the ISM PCI 
>> function type (PFT), which makes the patch ISM specific, why don't we 
>> use here the MIO bit common to any zPCI function and present in kernel 
>> to make the choice?
>>
> 
> As discussed during the RFC (and see my reply also to the kernel set), 
> the use of this region only works for devices that do not rely on MSI-X 
> interrupts.  If we did as you suggest, other device types like mlx would 
> not receive MSI-X interrupts in the guest (And I did indeed try 
> variations where I used the special VFIO region for all 
> PCISTG/PCILG/PCISTB for various device types)
> 
> So the idea for now was to solve the specific problem at hand (getting 
> ISM devices working).
> 
> 

Sorry, if I missed or forgot some discussions, but I understood that we 
are using this region to handle PCISTB instructions when the device do 
not support MIO.
Don't we?

I do not understand the relation between MSI-X and MIO.
Can you please explain?

Thanks,
Pierre
Matthew Rosato Jan. 20, 2021, 3:59 p.m. UTC | #4
On 1/20/21 9:45 AM, Pierre Morel wrote:
> 
> 
> On 1/20/21 3:03 PM, Matthew Rosato wrote:
>> On 1/20/21 4:12 AM, Pierre Morel wrote:
>>>
>>>
>>> On 1/19/21 9:44 PM, Matthew Rosato wrote:
>>>> Today, ISM devices are completely disallowed for vfio-pci 
>>>> passthrough as
>>>> QEMU rejects the device due to an (inappropriate) MSI-X check.  
>>>> Removing
>>>> this fence, however, reveals additional deficiencies in the s390x PCI
>>>> interception layer that prevent ISM devices from working correctly.
>>>> Namely, ISM block write operations have particular requirements in 
>>>> regards
>>>> to the alignment, size and order of writes performed that cannot be
>>>> guaranteed when breaking up write operations through the typical
>>>> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
>>>> s390 PCI instructions are used, which is also not guaranteed when 
>>>> the I/O
>>>> is passed through the typical userspace channels.
>>>>
>>>> This patchset provides a set of fixes related to enabling ISM device
>>>> passthrough and includes patches to enable use of a new vfio region 
>>>> that
>>>> will allow s390x PCI pass-through devices to perform s390 PCI 
>>>> instructions
>>>> in such a way that the same instruction issued on the guest is 
>>>> re-issued
>>>> on the host.
>>>>
>>>> Associated kernel patchset:
>>>> https://lkml.org/lkml/2021/1/19/874
>>>>
>>>> Changes from RFC -> v1:
>>>> - Refresh the header sync (built using Eric's 'update-linux-headers:
>>>> Include const.h' + manually removed pvrdma_ring.h again)
>>>> - Remove s390x/pci: fix pcistb length (already merged)
>>>> - Remove s390x/pci: Fix memory_region_access_valid call (already 
>>>> merged)
>>>> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
>>>> buffer pcistb_buf rather than allocating/freeing its own.
>>>> - New patch: track the PFT (PCI Function Type) separately from guest 
>>>> CLP
>>>> response data -- we tell the guest '0' for now due to limitations in
>>>> measurement block support, but we can still use the real value 
>>>> provided via
>>>> the vfio CLP capabilities to make decisions.
>>>> - Use the PFT (pci function type) to determine when to use the region
>>>> for PCISTB/PCILG (only for ISM), rather than using the relaxed 
>>>> alignment
>>>> bit.
>>>> - As a result, the pcistb_default is now updated to also handle the
>>>> possibility of relaxed alignment via 2 new functions, 
>>>> pcistb_validate_write
>>>> and pcistb_write, which serve as wrappers to the memory_region calls.
>>>> - New patch, which partially restores the MSI-X fence for passthrough
>>>> devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
>>>> strictly required for passthrough' but left separately for now as I 
>>>> felt it
>>>> needed a clear commit description of why we should still fence this 
>>>> case.
>>>>
>>> Hi,
>>>
>>> The choice of using the new VFIO region is made on the ISM PCI 
>>> function type (PFT), which makes the patch ISM specific, why don't we 
>>> use here the MIO bit common to any zPCI function and present in 
>>> kernel to make the choice?
>>>
>>
>> As discussed during the RFC (and see my reply also to the kernel set), 
>> the use of this region only works for devices that do not rely on 
>> MSI-X interrupts.  If we did as you suggest, other device types like 
>> mlx would not receive MSI-X interrupts in the guest (And I did indeed 
>> try variations where I used the special VFIO region for all 
>> PCISTG/PCILG/PCISTB for various device types)
>>
>> So the idea for now was to solve the specific problem at hand (getting 
>> ISM devices working).
>>
>>
> 
> Sorry, if I missed or forgot some discussions, but I understood that we 
> are using this region to handle PCISTB instructions when the device do 
> not support MIO.
> Don't we?

Sure thing - It's probably good to refresh the issue/rationale anyway as 
we've had the holidays in between.

You are correct, a primary reason we need to resort to a separate VFIO 
region for PCISTB (and PCILG) instructions for ISM devices is that they 
do not support the MIO instruction set, yet the host kernel will 
translate everything coming through the PCI I/O layer to MIO 
instructions whenever that facility is available to the host (and not 
purposely disabled).  This issue is unique to vfio-pci/passthrough - in 
the host, the ISM driver directly invokes functions in s390 pci code to 
ensure that MIO instructions are not used.

But this is not the only reason.  There are additional reasons for using 
this VFIO region:
1) ISM devices also don't support PCISTG instructions to certain address 
spaces and PCISTB must be used regardless of operation length.  However 
the standard s390 PCI I/O path always uses PCISTG for anything <=8B. 
Trying to determine whether a given I/O is intended for an ISM device at 
that point in kernel code so as to use PCISTB instead of PCISTG is the 
same problem as attempting to decide whether to use MIO vs non-MIO 
instructions at that point.
2) It allows for much larger PCISTB operations (4K) than allowed via the 
memory regions (loop of 8B operations).
3) The above also has the added benefit of eliminating certain write 
pattern requirements that are unique to ISM that would be introduced if 
we split up the I/O into 8B chunks (if we can't write the whole PCISTB 
in one go, ISM requires data written in a certain order for some address 
spaces, or with certain bits on/off on the PCISTB instruction to signify 
the state of the larger operation)

> 
> I do not understand the relation between MSI-X and MIO.
> Can you please explain?
> 

There is not a relation between MSI-X and MIO really.  Rather, this is a 
case of the solution that is being offered here ONLY works for devices 
that use MSI -- and ISM is a device that only supports MSI.  If you try 
to use this new VFIO region to pass I/O for an MSI-X enabled device, the 
notifiers set up via vfio_msix_setup won't be triggered because we are 
writing to the new VFIO region, not the virtual bar regions that may 
have had notifiers setup as part of vfio_msix_setup.  This results in 
missing interrupts on MSI-X-enabled vfio-pci devices.

These notifiers aren't a factor when the device is using MSI.
Pierre Morel Jan. 20, 2021, 7:18 p.m. UTC | #5
On 1/20/21 4:59 PM, Matthew Rosato wrote:
> On 1/20/21 9:45 AM, Pierre Morel wrote:
>>
>>
>> On 1/20/21 3:03 PM, Matthew Rosato wrote:
>>> On 1/20/21 4:12 AM, Pierre Morel wrote:
>>>>
>>>>
>>>> On 1/19/21 9:44 PM, Matthew Rosato wrote:
>>>>> Today, ISM devices are completely disallowed for vfio-pci 
>>>>> passthrough as
>>>>> QEMU rejects the device due to an (inappropriate) MSI-X check. 
>>>>> Removing
>>>>> this fence, however, reveals additional deficiencies in the s390x PCI
>>>>> interception layer that prevent ISM devices from working correctly.
>>>>> Namely, ISM block write operations have particular requirements in 
>>>>> regards
>>>>> to the alignment, size and order of writes performed that cannot be
>>>>> guaranteed when breaking up write operations through the typical
>>>>> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
>>>>> s390 PCI instructions are used, which is also not guaranteed when 
>>>>> the I/O
>>>>> is passed through the typical userspace channels.
>>>>>
>>>>> This patchset provides a set of fixes related to enabling ISM device
>>>>> passthrough and includes patches to enable use of a new vfio region 
>>>>> that
>>>>> will allow s390x PCI pass-through devices to perform s390 PCI 
>>>>> instructions
>>>>> in such a way that the same instruction issued on the guest is 
>>>>> re-issued
>>>>> on the host.
>>>>>
>>>>> Associated kernel patchset:
>>>>> https://lkml.org/lkml/2021/1/19/874
>>>>>
>>>>> Changes from RFC -> v1:
>>>>> - Refresh the header sync (built using Eric's 'update-linux-headers:
>>>>> Include const.h' + manually removed pvrdma_ring.h again)
>>>>> - Remove s390x/pci: fix pcistb length (already merged)
>>>>> - Remove s390x/pci: Fix memory_region_access_valid call (already 
>>>>> merged)
>>>>> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
>>>>> buffer pcistb_buf rather than allocating/freeing its own.
>>>>> - New patch: track the PFT (PCI Function Type) separately from 
>>>>> guest CLP
>>>>> response data -- we tell the guest '0' for now due to limitations in
>>>>> measurement block support, but we can still use the real value 
>>>>> provided via
>>>>> the vfio CLP capabilities to make decisions.
>>>>> - Use the PFT (pci function type) to determine when to use the region
>>>>> for PCISTB/PCILG (only for ISM), rather than using the relaxed 
>>>>> alignment
>>>>> bit.
>>>>> - As a result, the pcistb_default is now updated to also handle the
>>>>> possibility of relaxed alignment via 2 new functions, 
>>>>> pcistb_validate_write
>>>>> and pcistb_write, which serve as wrappers to the memory_region calls.
>>>>> - New patch, which partially restores the MSI-X fence for passthrough
>>>>> devices...  Could potentially be squashed with 's390x/pci: MSI-X isn't
>>>>> strictly required for passthrough' but left separately for now as I 
>>>>> felt it
>>>>> needed a clear commit description of why we should still fence this 
>>>>> case.
>>>>>
>>>> Hi,
>>>>
>>>> The choice of using the new VFIO region is made on the ISM PCI 
>>>> function type (PFT), which makes the patch ISM specific, why don't 
>>>> we use here the MIO bit common to any zPCI function and present in 
>>>> kernel to make the choice?
>>>>
>>>
>>> As discussed during the RFC (and see my reply also to the kernel 
>>> set), the use of this region only works for devices that do not rely 
>>> on MSI-X interrupts.  If we did as you suggest, other device types 
>>> like mlx would not receive MSI-X interrupts in the guest (And I did 
>>> indeed try variations where I used the special VFIO region for all 
>>> PCISTG/PCILG/PCISTB for various device types)
>>>
>>> So the idea for now was to solve the specific problem at hand 
>>> (getting ISM devices working).
>>>
>>>
>>
>> Sorry, if I missed or forgot some discussions, but I understood that 
>> we are using this region to handle PCISTB instructions when the device 
>> do not support MIO.
>> Don't we?
> 
> Sure thing - It's probably good to refresh the issue/rationale anyway as 
> we've had the holidays in between.
> 
> You are correct, a primary reason we need to resort to a separate VFIO 
> region for PCISTB (and PCILG) instructions for ISM devices is that they 
> do not support the MIO instruction set, yet the host kernel will 
> translate everything coming through the PCI I/O layer to MIO 
> instructions whenever that facility is available to the host (and not 
> purposely disabled).  This issue is unique to vfio-pci/passthrough - in 
> the host, the ISM driver directly invokes functions in s390 pci code to 
> ensure that MIO instructions are not used.


QEMU intercepts and differentiates PCISTG and PCISTB.
The new hardware support both MIO and legacy PCISTB/PCISTG.

QEMU does not support MIO

My first interrogation is why should we translate legacy to MIO?

But OK, say we do need this for some obscure reason.

> 
> But this is not the only reason.  There are additional reasons for using 
> this VFIO region:
> 1) ISM devices also don't support PCISTG instructions to certain address 
> spaces and PCISTB must be used regardless of operation length.  However 
> the standard s390 PCI I/O path always uses PCISTG for anything <=8B. 
> Trying to determine whether a given I/O is intended for an ISM device at 
> that point in kernel code so as to use PCISTB instead of PCISTG is the 

OK, this is clear.

> same problem as attempting to decide whether to use MIO vs non-MIO 
> instructions at that point.

humm, this is not exactly the same problem for me, but OK to choose to 
handle it the same way.



> 2) It allows for much larger PCISTB operations (4K) than allowed via the 
> memory regions (loop of 8B operations).

OK

> 3) The above also has the added benefit of eliminating certain write 
> pattern requirements that are unique to ISM that would be introduced if 
> we split up the I/O into 8B chunks (if we can't write the whole PCISTB 
> in one go, ISM requires data written in a certain order for some address 
> spaces, or with certain bits on/off on the PCISTB instruction to signify 
> the state of the larger operation)

Yes, I suppose that the driver in the guest does it right and we need to 
do the same.


> 
>>
>> I do not understand the relation between MSI-X and MIO.
>> Can you please explain?
>>
> 
> There is not a relation between MSI-X and MIO really.  Rather, this is a 
> case of the solution that is being offered here ONLY works for devices 
> that use MSI -- and ISM is a device that only supports MSI.  If you try 
> to use this new VFIO region to pass I/O for an MSI-X enabled device, the 
> notifiers set up via vfio_msix_setup won't be triggered because we are 
> writing to the new VFIO region, not the virtual bar regions that may 
> have had notifiers setup as part of vfio_msix_setup.  This results in 
> missing interrupts on MSI-X-enabled vfio-pci devices.
> 
> These notifiers aren't a factor when the device is using MSI.

I find this strange but we do not need to discuss it.

> 

So we have:
devices supporting MIO and MSIX
devices not supporting MIO nor MSIX
devices not supporting the use of PCISTG to emulate PCISTB

The first two are two different things indicated by two different 
entries in the clp query PCI function response.

The last one, we do not have an indicator as if the relaxed alignment 
and length is set, PCISTB can not be emulated with PCISTG

What I mean with this is that considering the proposed implementation 
and considering:
MIO MSIX RELAX

0 0 1  -> must use the new region (ISM)
1 1 0  -> must use the standard VFIO region (MLX)

we can discuss other 6 possibilities

0 0 0 -> must use the new region
0 1 0 -> NOOP
0 1 1 -> NOOP
1 0 0 -> can use any region
1 0 1 -> can use any region
1 1 1 -> NOOP

In my opinion the test for using one region or another should be done on 
these indicator instead of using the PFT.
This may offer us more compatibility with other hardware we may not be 
aware of as today.


Regards,
Pierre
Matthew Rosato Jan. 20, 2021, 8:29 p.m. UTC | #6
On 1/20/21 2:18 PM, Pierre Morel wrote:
> 
> 
> On 1/20/21 4:59 PM, Matthew Rosato wrote:
>> On 1/20/21 9:45 AM, Pierre Morel wrote:
>>>
>>>
>>> On 1/20/21 3:03 PM, Matthew Rosato wrote:
>>>> On 1/20/21 4:12 AM, Pierre Morel wrote:
>>>>>
>>>>>
>>>>> On 1/19/21 9:44 PM, Matthew Rosato wrote:
>>>>>> Today, ISM devices are completely disallowed for vfio-pci 
>>>>>> passthrough as
>>>>>> QEMU rejects the device due to an (inappropriate) MSI-X check. 
>>>>>> Removing
>>>>>> this fence, however, reveals additional deficiencies in the s390x PCI
>>>>>> interception layer that prevent ISM devices from working correctly.
>>>>>> Namely, ISM block write operations have particular requirements in 
>>>>>> regards
>>>>>> to the alignment, size and order of writes performed that cannot be
>>>>>> guaranteed when breaking up write operations through the typical
>>>>>> vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
>>>>>> s390 PCI instructions are used, which is also not guaranteed when 
>>>>>> the I/O
>>>>>> is passed through the typical userspace channels.
>>>>>>
>>>>>> This patchset provides a set of fixes related to enabling ISM device
>>>>>> passthrough and includes patches to enable use of a new vfio 
>>>>>> region that
>>>>>> will allow s390x PCI pass-through devices to perform s390 PCI 
>>>>>> instructions
>>>>>> in such a way that the same instruction issued on the guest is 
>>>>>> re-issued
>>>>>> on the host.
>>>>>>
>>>>>> Associated kernel patchset:
>>>>>> https://lkml.org/lkml/2021/1/19/874
>>>>>>
>>>>>> Changes from RFC -> v1:
>>>>>> - Refresh the header sync (built using Eric's 'update-linux-headers:
>>>>>> Include const.h' + manually removed pvrdma_ring.h again)
>>>>>> - Remove s390x/pci: fix pcistb length (already merged)
>>>>>> - Remove s390x/pci: Fix memory_region_access_valid call (already 
>>>>>> merged)
>>>>>> - Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
>>>>>> buffer pcistb_buf rather than allocating/freeing its own.
>>>>>> - New patch: track the PFT (PCI Function Type) separately from 
>>>>>> guest CLP
>>>>>> response data -- we tell the guest '0' for now due to limitations in
>>>>>> measurement block support, but we can still use the real value 
>>>>>> provided via
>>>>>> the vfio CLP capabilities to make decisions.
>>>>>> - Use the PFT (pci function type) to determine when to use the region
>>>>>> for PCISTB/PCILG (only for ISM), rather than using the relaxed 
>>>>>> alignment
>>>>>> bit.
>>>>>> - As a result, the pcistb_default is now updated to also handle the
>>>>>> possibility of relaxed alignment via 2 new functions, 
>>>>>> pcistb_validate_write
>>>>>> and pcistb_write, which serve as wrappers to the memory_region calls.
>>>>>> - New patch, which partially restores the MSI-X fence for passthrough
>>>>>> devices...  Could potentially be squashed with 's390x/pci: MSI-X 
>>>>>> isn't
>>>>>> strictly required for passthrough' but left separately for now as 
>>>>>> I felt it
>>>>>> needed a clear commit description of why we should still fence 
>>>>>> this case.
>>>>>>
>>>>> Hi,
>>>>>
>>>>> The choice of using the new VFIO region is made on the ISM PCI 
>>>>> function type (PFT), which makes the patch ISM specific, why don't 
>>>>> we use here the MIO bit common to any zPCI function and present in 
>>>>> kernel to make the choice?
>>>>>
>>>>
>>>> As discussed during the RFC (and see my reply also to the kernel 
>>>> set), the use of this region only works for devices that do not rely 
>>>> on MSI-X interrupts.  If we did as you suggest, other device types 
>>>> like mlx would not receive MSI-X interrupts in the guest (And I did 
>>>> indeed try variations where I used the special VFIO region for all 
>>>> PCISTG/PCILG/PCISTB for various device types)
>>>>
>>>> So the idea for now was to solve the specific problem at hand 
>>>> (getting ISM devices working).
>>>>
>>>>
>>>
>>> Sorry, if I missed or forgot some discussions, but I understood that 
>>> we are using this region to handle PCISTB instructions when the 
>>> device do not support MIO.
>>> Don't we?
>>
>> Sure thing - It's probably good to refresh the issue/rationale anyway 
>> as we've had the holidays in between.
>>
>> You are correct, a primary reason we need to resort to a separate VFIO 
>> region for PCISTB (and PCILG) instructions for ISM devices is that 
>> they do not support the MIO instruction set, yet the host kernel will 
>> translate everything coming through the PCI I/O layer to MIO 
>> instructions whenever that facility is available to the host (and not 
>> purposely disabled).  This issue is unique to vfio-pci/passthrough - 
>> in the host, the ISM driver directly invokes functions in s390 pci 
>> code to ensure that MIO instructions are not used.
> 
> 
> QEMU intercepts and differentiates PCISTG and PCISTB.
> The new hardware support both MIO and legacy PCISTB/PCISTG.
> 
> QEMU does not support MIO
> 
> My first interrogation is why should we translate legacy to MIO?

This is existing behavior of the s390 kernel PCI layer, not something 
I'm introducing here.

So, as you say, QEMU does not support MIO, nor does it attempt to 
translate anything to MIO.  The thing is, to the host kernel, PCI I/O 
coming from QEMU through the standard vfio-pci codepath just looks like 
any other userspace PCI I/O coming in to the kernel.  So the 
memory_region operations against the vfio virtual address space bars in 
QEMU then turn into iowrite/ioread operations in vfio-pci in the kernel, 
which then subsequently end up in the s390 PCI kernel layer, and it's 
there that everything is turned into an MIO operation if MIO is 
available.  That's not limited to vfio-pci, that's all PCI I/O in the 
host (except for the ISM driver, which bypasses this behavior by 
invoking s390 PCI kernel interfaces directly).

In an early (internal) version of the kernel component to this I floated 
the idea of trying to determine whether or not MIO instructions could be 
used for a given I/O operation, but the ideas I floated had various 
flaws -- I'd invite Niklas to chime in with why this got squashed.

But anyway...  If we were to solve that somehow, this would still leave 
us with write operations capped at 8B and odd write pattern requirements 
for ISM passthrough.  Using a VFIO region to pass the operation directly 
to the host kernel via a pinned page to overcome those limitations was 
your idea actually :)

> 
> But OK, say we do need this for some obscure reason.
> 
>>
>> But this is not the only reason.  There are additional reasons for 
>> using this VFIO region:
>> 1) ISM devices also don't support PCISTG instructions to certain 
>> address spaces and PCISTB must be used regardless of operation 
>> length.  However the standard s390 PCI I/O path always uses PCISTG for 
>> anything <=8B. Trying to determine whether a given I/O is intended for 
>> an ISM device at that point in kernel code so as to use PCISTB instead 
>> of PCISTG is the 
> 
> OK, this is clear.
> 
>> same problem as attempting to decide whether to use MIO vs non-MIO 
>> instructions at that point.
> 
> humm, this is not exactly the same problem for me, but OK to choose to 
> handle it the same way.

The problem isn't the same BUT the information needed to solve the 
problem is the same (for a given memory operation, knowing what 
device/type it is intended for, which can therefore be used to determine 
also whether it supports MIO or not).

> 
> 
> 
>> 2) It allows for much larger PCISTB operations (4K) than allowed via 
>> the memory regions (loop of 8B operations).
> 
> OK
> 
>> 3) The above also has the added benefit of eliminating certain write 
>> pattern requirements that are unique to ISM that would be introduced 
>> if we split up the I/O into 8B chunks (if we can't write the whole 
>> PCISTB in one go, ISM requires data written in a certain order for 
>> some address spaces, or with certain bits on/off on the PCISTB 
>> instruction to signify the state of the larger operation)
> 
> Yes, I suppose that the driver in the guest does it right and we need to 
> do the same.
> 
> 
>>
>>>
>>> I do not understand the relation between MSI-X and MIO.
>>> Can you please explain?
>>>
>>
>> There is not a relation between MSI-X and MIO really.  Rather, this is 
>> a case of the solution that is being offered here ONLY works for 
>> devices that use MSI -- and ISM is a device that only supports MSI.  
>> If you try to use this new VFIO region to pass I/O for an MSI-X 
>> enabled device, the notifiers set up via vfio_msix_setup won't be 
>> triggered because we are writing to the new VFIO region, not the 
>> virtual bar regions that may have had notifiers setup as part of 
>> vfio_msix_setup.  This results in missing interrupts on MSI-X-enabled 
>> vfio-pci devices.
>>
>> These notifiers aren't a factor when the device is using MSI.
> 
> I find this strange but we do not need to discuss it.
> 
>>
> 
> So we have:
> devices supporting MIO and MSIX
> devices not supporting MIO nor MSIX
> devices not supporting the use of PCISTG to emulate PCISTB
> 
> The first two are two different things indicated by two different 
> entries in the clp query PCI function response.
> 
> The last one, we do not have an indicator as if the relaxed alignment 
> and length is set, PCISTB can not be emulated with PCISTG
> 
> What I mean with this is that considering the proposed implementation 
> and considering:
> MIO MSIX RELAX
> 
> 0 0 1  -> must use the new region (ISM)
> 1 1 0  -> must use the standard VFIO region (MLX)
> 
> we can discuss other 6 possibilities
> 
> 0 0 0 -> must use the new region
> 0 1 0 -> NOOP
> 0 1 1 -> NOOP
> 1 0 0 -> can use any region
> 1 0 1 -> can use any region
> 1 1 1 -> NOOP
> 
> In my opinion the test for using one region or another should be done on 
> these indicator instead of using the PFT. > This may offer us more compatibility with other hardware we may not be
> aware of as today.

This gets a little shaky, and goes both ways -- Using your list, a 
device that supports MIO, does not have MSI-X capability and doesn't 
support relaxed alignment (1 0 0 from above) can use any region -- but 
that may not always be true.  What if "other hardware we may not be 
aware of as today" includes future hardware that ONLY supports the MIO 
instruction set?  Then that device really can't use this region either.

But forgetting that possibility...  I think we can really simplify the 
above matrix down to a statement of "if device doesn't support MSI-X but 
DOES support non-MIO instructions, it can use the region."  I believe 
the latter half of that statement is implicit in the architecture today, 
so it's really then "if device doesn't support MSI-X, it can use the 
region".  There's just the caveat of, if the device is ISM, it changes 
from 'can use the region' to 'must use the region'.

So, I mean I can change the code to be more permissive in that way 
(allow any device that doesn't have MSI-X capability to at least attempt 
to use the region).  But the reality is that ISM specifically needs the 
region for successful pass through, so I don't see a reason to create a 
different bit for that vs just checking for the PFT in QEMU and using 
that value to decide whether or not region availability is a requirement 
for allowing the device to pass through.
Pierre Morel Jan. 21, 2021, 8:27 a.m. UTC | #7
On 1/20/21 9:29 PM, Matthew Rosato wrote:
> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>
>>
...snip...

>> So we have:
>> devices supporting MIO and MSIX
>> devices not supporting MIO nor MSIX
>> devices not supporting the use of PCISTG to emulate PCISTB
>>
>> The first two are two different things indicated by two different 
>> entries in the clp query PCI function response.
>>
>> The last one, we do not have an indicator as if the relaxed alignment 
>> and length is set, PCISTB can not be emulated with PCISTG


hum sorry, it seems I rewrote my sentence until it was wrong wrong!
I wanted to say we DO HAVE an indicator with the relaxed bit...

>>
>> What I mean with this is that considering the proposed implementation 
>> and considering:
>> MIO MSIX RELAX
>>
>> 0 0 1  -> must use the new region (ISM)
>> 1 1 0  -> must use the standard VFIO region (MLX)
>>
>> we can discuss other 6 possibilities
>>
>> 0 0 0 -> must use the new region
>> 0 1 0 -> NOOP
>> 0 1 1 -> NOOP
>> 1 0 0 -> can use any region
>> 1 0 1 -> can use any region
>> 1 1 1 -> NOOP
>>
>> In my opinion the test for using one region or another should be done 
>> on these indicator instead of using the PFT. > This may offer us more 
>> compatibility with other hardware we may not be
>> aware of as today.
> 
> This gets a little shaky, and goes both ways -- Using your list, a 
> device that supports MIO, does not have MSI-X capability and doesn't 
> support relaxed alignment (1 0 0 from above) can use any region -- but 
> that may not always be true.  What if "other hardware we may not be 
> aware of as today" includes future hardware that ONLY supports the MIO 
> instruction set?  Then that device really can't use this region either.


Right, but there is no bit in the CLP response for this case.
Until there is one, the system is supposed to handle legacy instructions

> 
> But forgetting that possibility...  I think we can really simplify the 
> above matrix down to a statement of "if device doesn't support MSI-X but 
> DOES support non-MIO instructions, it can use the region."  I believe 
> the latter half of that statement is implicit in the architecture today, 
> so it's really then "if device doesn't support MSI-X, it can use the 
> region".  There's just the caveat of, if the device is ISM, it changes 
> from 'can use the region' to 'must use the region'.


There can surely be simplifications.

> 
> So, I mean I can change the code to be more permissive in that way 
> (allow any device that doesn't have MSI-X capability to at least attempt 
> to use the region).  But the reality is that ISM specifically needs the 
> region for successful pass through, so I don't see a reason to create a 
> different bit for that vs just checking for the PFT in QEMU and using 
> that value to decide whether or not region availability is a requirement 
> for allowing the device to pass through.


There is no need for a new bit to know if a device support MIO or not, 
as I said before, there is already one in the CLP query PCI function 
response and it is already used in the kernel zPCI architecture.


It is not a big think to do and does not change the general architecture 
of the patch, only the detection of which device is impacted to make it 
generic instead of device dedicated.

Regards,
Pierre
Niklas Schnelle Jan. 21, 2021, 9:58 a.m. UTC | #8
On 1/21/21 9:27 AM, Pierre Morel wrote:
> 
> 
> On 1/20/21 9:29 PM, Matthew Rosato wrote:
>> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>>
>>>
> ...snip...
> 
>>
>> So, I mean I can change the code to be more permissive in that way (allow any device that doesn't have MSI-X capability to at least attempt to use the region).  But the reality is that ISM specifically needs the region for successful pass through, so I don't see a reason to create a different bit for that vs just checking for the PFT in QEMU and using that value to decide whether or not region availability is a requirement for allowing the device to pass through.
> 
> 
> There is no need for a new bit to know if a device support MIO or not, as I said before, there is already one in the CLP query PCI function response and it is already used in the kernel zPCI architecture.
> 
> 
> It is not a big think to do and does not change the general architecture of the patch, only the detection of which device is impacted to make it generic instead of device dedicated.
> 
> Regards,
> Pierre

Just wanted to say that we've had a very similar discussion with
Cornelia end of last year and came to the conclusion that explicitly
matching the PFT is likely the safest bet:
https://lkml.org/lkml/2020/12/22/479

One other point for me is that all these special requirements seem
to stem from the fact that ISM is not a physical PCIe device but
a special s390 only thing. That also implies that unlike with real
PCI devices we are not going to find an existing Linux driver
and infrastructure that just works. I'd bet if it's anywhere
as special as ISM we will also need a QEMU change either way
and if that change is only to change the PFT check then
great, won't have trouble getting that backported...
So the PFT check ensures we're definitely trying the standard path
first and only change it to a special case if we understand that
it is required as we do for ISM.

That said it's also important to note that we are not searching
for a be all end all solution here. The standard path's to-MIO
translation, even though it is working, is weird enough and also
not ideal for performance so we might look at that again
in separate work and yes that could change things for
this case too but that too will require QEMU and Kernel
changes either way.
Furthermore we will be looking into handling more PCI
without leaving SIE at which point we will also re-evaluate
if that works for ISM. Again that needs QEMU and likely Kernel
changes. I'm a great fan of general solutions that
don't unnecessarily exclude something, so I do understand your
concern but let's focus on what we know and want to fix now
instead of trying to solve future theoretical issues that
we have no strong reason to believe will ever appear.

If we stick with the PFT check I think we do need to
change some wording especially for the Kernel changes.


>
Pierre Morel Jan. 21, 2021, 12:30 p.m. UTC | #9
On 1/21/21 10:58 AM, Niklas Schnelle wrote:
> 
> 
> On 1/21/21 9:27 AM, Pierre Morel wrote:
>>
>>
>> On 1/20/21 9:29 PM, Matthew Rosato wrote:
>>> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>>>
>>>>
>> ...snip...
>>
>>>
>>> So, I mean I can change the code to be more permissive in that way (allow any device that doesn't have MSI-X capability to at least attempt to use the region).  But the reality is that ISM specifically needs the region for successful pass through, so I don't see a reason to create a different bit for that vs just checking for the PFT in QEMU and using that value to decide whether or not region availability is a requirement for allowing the device to pass through.
>>
>>
>> There is no need for a new bit to know if a device support MIO or not, as I said before, there is already one in the CLP query PCI function response and it is already used in the kernel zPCI architecture.
>>
>>
>> It is not a big think to do and does not change the general architecture of the patch, only the detection of which device is impacted to make it generic instead of device dedicated.
>>
>> Regards,
>> Pierre
> 
> Just wanted to say that we've had a very similar discussion with
> Cornelia end of last year and came to the conclusion that explicitly
> matching the PFT is likely the safest bet:
> https://lkml.org/lkml/2020/12/22/479

What I see there is a discussion on the relation between relaxed access 
and MIO without explaining to Connie that we have in the kernel the 
possibility to know if a device support MIO or not independently of it 
supports the relaxed access.

The all point here is about taking decisions for the right reasons.

We have the possibility to take the decision based on functionalities 
and not on a specific PCI function.


If we keep the PFT check, and we can do this of course, but is it a good 
solution if it appears we have other PFT with the same functionalities?

Please note that this is a minor code change, keeping track of the MIO 
support just as we keep track of the PFT and check on this instead of on 
the PFT.

It does not modify the general architecture of the patch series neither 
in kernel nor in QEMU at all.


Pierre
Niklas Schnelle Jan. 21, 2021, 1:37 p.m. UTC | #10
On 1/21/21 1:30 PM, Pierre Morel wrote:
> 
> 
> On 1/21/21 10:58 AM, Niklas Schnelle wrote:
>>
>>
>> On 1/21/21 9:27 AM, Pierre Morel wrote:
>>>
>>>
>>> On 1/20/21 9:29 PM, Matthew Rosato wrote:
>>>> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>>>>
>>>>>
>>> ...snip...
>>>
>>>>
>>>> So, I mean I can change the code to be more permissive in that way (allow any device that doesn't have MSI-X capability to at least attempt to use the region).  But the reality is that ISM specifically needs the region for successful pass through, so I don't see a reason to create a different bit for that vs just checking for the PFT in QEMU and using that value to decide whether or not region availability is a requirement for allowing the device to pass through.
>>>
>>>
>>> There is no need for a new bit to know if a device support MIO or not, as I said before, there is already one in the CLP query PCI function response and it is already used in the kernel zPCI architecture.
>>>
>>>
>>> It is not a big think to do and does not change the general architecture of the patch, only the detection of which device is impacted to make it generic instead of device dedicated.
>>>
>>> Regards,
>>> Pierre
>>
>> Just wanted to say that we've had a very similar discussion with
>> Cornelia end of last year and came to the conclusion that explicitly
>> matching the PFT is likely the safest bet:
>> https://lkml.org/lkml/2020/12/22/479
> 
> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
> 
> The all point here is about taking decisions for the right reasons.
> 
> We have the possibility to take the decision based on functionalities and not on a specific PCI function.

Yes but that goes both ways the functionality of the region has to match
that of the device and at least in it's current state the regions functionality
matches only ISM in a way that is so specific that it is very unlikely to match anything
else. For example it can't support a PCI device that requires non-MIO but
also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
Block, we had that in an earlier version and it's trivial but then we get the MSI-X
problem.

> 
> 
> If we keep the PFT check, and we can do this of course, but is it a good solution if it appears we have other PFT with the same functionalities?
> 
> Please note that this is a minor code change, keeping track of the MIO support just as we keep track of the PFT and check on this instead of on the PFT.

That is certainly true and I'm not strongly against matching on functionality
it just seems to me that it's too specific for that to make sense and
in that case I feel it's better to be clear about that and make it ISM
specific in name and functionality.

If we manage to find a fix for the MSI-X problem which I'd be really happy about
we can simply extend the regions functionality and reuse the same code
for a backwards compatibile ISM region and a more generic zPCI non-MIO
region that could even be used if the client (QEMU) uses non-MIO and
the device can do both as is the case for all current physical devices.

> 
> It does not modify the general architecture of the patch series neither in kernel nor in QEMU at all.
> 
> 
> Pierre
>
Matthew Rosato Jan. 21, 2021, 2:42 p.m. UTC | #11
On 1/21/21 3:27 AM, Pierre Morel wrote:
> 
> 
> On 1/20/21 9:29 PM, Matthew Rosato wrote:
>> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>>
>>>
> ...snip...
> 
>>> So we have:
>>> devices supporting MIO and MSIX
>>> devices not supporting MIO nor MSIX
>>> devices not supporting the use of PCISTG to emulate PCISTB
>>>
>>> The first two are two different things indicated by two different 
>>> entries in the clp query PCI function response.
>>>
>>> The last one, we do not have an indicator as if the relaxed alignment 
>>> and length is set, PCISTB can not be emulated with PCISTG
> 
> 
> hum sorry, it seems I rewrote my sentence until it was wrong wrong!
> I wanted to say we DO HAVE an indicator with the relaxed bit...

That's actually not quite true though...  The relaxed bit does not 
directly imply that PCISTB cannot be emulated with PCISTG.  Rather, it 
more generally implies that PCISTB could be used instead of PCISTG as 
the length and alignment requirements for PCISTB are waived.  As far as 
I can tell, disallowing a PCISTG altogether is a trait that is unique to 
ISM and I don't see anywhere that it's otherwise indicated in 
architecture...  And in fact, for a given ISM device, certain address 
spaces (command) WILL accept a series of PCISTG issued in a particular 
manner in place of a PCISTB; meanwhile other ISM address spaces (data) 
will not accept any PCISTG ever.  :(  The ISM driver just always uses 
PCISTB.

So that is why I suggested type==ISM must require use of the region 
whereas other types that implement MSI could request (but not require) 
use of the region.

But yes, any other theoretical piece of hardware that also does not 
support MIO would have a similar region requirement.  I'll have a look 
at the MIO bit you referenced below and will have to verify that it does 
exactly what we expect for an ISM device.  Assuming yes, I will consider 
the following sort of checking for the next version of QEMU...

if (!MIO) {
	if (MSIX) {
		// passthrough disallowed
	}
	else {
		// region required, disallow if unavailable
	}
}
else if(RELAX && !MSIX) {
	// use region if available
}

Sound good?

All said, this would result in another bit passed from the kernel as CLP 
info and some small changes to patch 6 of this set to determine when to 
call s390_pci_get_zpci_io_region(pbdev) -- But just about everything 
else is unaffected, so @all please feel free to provide other review 
comments for the series in the meanwhile.


> 
>>>
>>> What I mean with this is that considering the proposed implementation 
>>> and considering:
>>> MIO MSIX RELAX
>>>
>>> 0 0 1  -> must use the new region (ISM)
>>> 1 1 0  -> must use the standard VFIO region (MLX)
>>>
>>> we can discuss other 6 possibilities
>>>
>>> 0 0 0 -> must use the new region
>>> 0 1 0 -> NOOP
>>> 0 1 1 -> NOOP
>>> 1 0 0 -> can use any region
>>> 1 0 1 -> can use any region
>>> 1 1 1 -> NOOP
>>>
>>> In my opinion the test for using one region or another should be done 
>>> on these indicator instead of using the PFT. > This may offer us more 
>>> compatibility with other hardware we may not be
>>> aware of as today.
>>
>> This gets a little shaky, and goes both ways -- Using your list, a 
>> device that supports MIO, does not have MSI-X capability and doesn't 
>> support relaxed alignment (1 0 0 from above) can use any region -- but 
>> that may not always be true.  What if "other hardware we may not be 
>> aware of as today" includes future hardware that ONLY supports the MIO 
>> instruction set?  Then that device really can't use this region either.
> 
> 
> Right, but there is no bit in the CLP response for this case.
> Until there is one, the system is supposed to handle legacy instructions
> 
>>
>> But forgetting that possibility...  I think we can really simplify the 
>> above matrix down to a statement of "if device doesn't support MSI-X 
>> but DOES support non-MIO instructions, it can use the region."  I 
>> believe the latter half of that statement is implicit in the 
>> architecture today, so it's really then "if device doesn't support 
>> MSI-X, it can use the region".  There's just the caveat of, if the 
>> device is ISM, it changes from 'can use the region' to 'must use the 
>> region'.
> 
> 
> There can surely be simplifications.
> 
>>
>> So, I mean I can change the code to be more permissive in that way 
>> (allow any device that doesn't have MSI-X capability to at least 
>> attempt to use the region).  But the reality is that ISM specifically 
>> needs the region for successful pass through, so I don't see a reason 
>> to create a different bit for that vs just checking for the PFT in 
>> QEMU and using that value to decide whether or not region availability 
>> is a requirement for allowing the device to pass through.
> 
> 
> There is no need for a new bit to know if a device support MIO or not, 
> as I said before, there is already one in the CLP query PCI function 
> response and it is already used in the kernel zPCI architecture.
> 
> 
> It is not a big think to do and does not change the general architecture 
> of the patch, only the detection of which device is impacted to make it 
> generic instead of device dedicated.
> 
> Regards,
> Pierre
>
Pierre Morel Jan. 21, 2021, 2:46 p.m. UTC | #12
On 1/21/21 2:37 PM, Niklas Schnelle wrote:
> 
> 
> On 1/21/21 1:30 PM, Pierre Morel wrote:

>>>
>>> Just wanted to say that we've had a very similar discussion with
>>> Cornelia end of last year and came to the conclusion that explicitly
>>> matching the PFT is likely the safest bet:
>>> https://lkml.org/lkml/2020/12/22/479
>>
>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
>>
>> The all point here is about taking decisions for the right reasons.
>>
>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.
> 
> Yes but that goes both ways the functionality of the region has to match
> that of the device and at least in it's current state the regions functionality
> matches only ISM in a way that is so specific that it is very unlikely to match anything
> else. For example it can't support a PCI device that requires non-MIO but
> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
> problem.


What does that change if we take one or the other solution considering 
the checking of MIO/MSIX/relax versus PFT?


> 
>>
>>
>> If we keep the PFT check, and we can do this of course, but is it a good solution if it appears we have other PFT with the same functionalities?
>>
>> Please note that this is a minor code change, keeping track of the MIO support just as we keep track of the PFT and check on this instead of on the PFT.
> 
> That is certainly true and I'm not strongly against matching on functionality
> it just seems to me that it's too specific for that to make sense and
> in that case I feel it's better to be clear about that and make it ISM
> specific in name and functionality.
> 
> If we manage to find a fix for the MSI-X problem which I'd be really happy about
> we can simply extend the regions functionality and reuse the same code
> for a backwards compatibile ISM region and a more generic zPCI non-MIO
> region that could even be used if the client (QEMU) uses non-MIO and
> the device can do both as is the case for all current physical devices.


I think that the fix for this is in the kernel, in the MIO implementation.
But we better discuss this offline.


Since neither of us have the decision, why don't we let the maintainer 
discuss now that they have all the information.

Regards,
Pierre
Niklas Schnelle Jan. 21, 2021, 2:54 p.m. UTC | #13
On 1/21/21 3:46 PM, Pierre Morel wrote:
> 
> 
> On 1/21/21 2:37 PM, Niklas Schnelle wrote:
>>
>>
>> On 1/21/21 1:30 PM, Pierre Morel wrote:
> 
>>>>
>>>> Just wanted to say that we've had a very similar discussion with
>>>> Cornelia end of last year and came to the conclusion that explicitly
>>>> matching the PFT is likely the safest bet:
>>>> https://lkml.org/lkml/2020/12/22/479
>>>
>>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
>>>
>>> The all point here is about taking decisions for the right reasons.
>>>
>>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.
>>
>> Yes but that goes both ways the functionality of the region has to match
>> that of the device and at least in it's current state the regions functionality
>> matches only ISM in a way that is so specific that it is very unlikely to match anything
>> else. For example it can't support a PCI device that requires non-MIO but
>> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
>> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
>> problem.
> 
> 
> What does that change if we take one or the other solution considering the checking of MIO/MSIX/relax versus PFT?


If it's !MIO && !MSIX && relax_align I'm fine with that check but
then we should also add PCISTG to the region.
Pierre Morel Jan. 21, 2021, 3:45 p.m. UTC | #14
On 1/21/21 3:42 PM, Matthew Rosato wrote:
> On 1/21/21 3:27 AM, Pierre Morel wrote:
>>
>>
>> On 1/20/21 9:29 PM, Matthew Rosato wrote:
>>> On 1/20/21 2:18 PM, Pierre Morel wrote:
>>>>
>>>>
>> ...snip...
>>
>>>> So we have:
>>>> devices supporting MIO and MSIX
>>>> devices not supporting MIO nor MSIX
>>>> devices not supporting the use of PCISTG to emulate PCISTB
>>>>
>>>> The first two are two different things indicated by two different 
>>>> entries in the clp query PCI function response.
>>>>
>>>> The last one, we do not have an indicator as if the relaxed 
>>>> alignment and length is set, PCISTB can not be emulated with PCISTG
>>
>>
>> hum sorry, it seems I rewrote my sentence until it was wrong wrong!
>> I wanted to say we DO HAVE an indicator with the relaxed bit...
> 
> That's actually not quite true though...  The relaxed bit does not 
> directly imply that PCISTB cannot be emulated with PCISTG.


It does, PCISTG have stronger alignment enforcement as PCISTB with the 
relaxed bit set, as you say here under, so we agree.


>  Rather, it 
> more generally implies that PCISTB could be used instead of PCISTG as 
> the length and alignment requirements for PCISTB are waived.  As far as 
> I can tell, disallowing a PCISTG altogether is a trait that is unique to 
> ISM and I don't see anywhere that it's otherwise indicated in 
> architecture...  And in fact, for a given ISM device, certain address 
> spaces (command) WILL accept a series of PCISTG issued in a particular 
> manner in place of a PCISTB; meanwhile other ISM address spaces (data) 
> will not accept any PCISTG ever.  :(  The ISM driver just always uses 
> PCISTB.

Very interresting!

> 
> So that is why I suggested type==ISM must require use of the region 
> whereas other types that implement MSI could request (but not require) 
> use of the region.
> 
> But yes, any other theoretical piece of hardware that also does not 
> support MIO would have a similar region requirement.  I'll have a look 
> at the MIO bit you referenced below and will have to verify that it does 
> exactly what we expect for an ISM device.  Assuming yes, I will consider 
> the following sort of checking for the next version of QEMU...
> 
> if (!MIO) {
>      if (MSIX) {
>          // passthrough disallowed
>      }
>      else {
>          // region required, disallow if unavailable
>      }
> }
> else if(RELAX && !MSIX) {
>      // use region if available
> }
> 
> Sound good?

Yes sounds good to me.
I mean the idea, I will not try to simplify, I trust you on this.


Thanks,

Pierre
Cornelia Huck Jan. 21, 2021, 5:50 p.m. UTC | #15
On Thu, 21 Jan 2021 15:54:22 +0100
Niklas Schnelle <schnelle@linux.ibm.com> wrote:

> On 1/21/21 3:46 PM, Pierre Morel wrote:
> > 
> > 
> > On 1/21/21 2:37 PM, Niklas Schnelle wrote:  
> >>
> >>
> >> On 1/21/21 1:30 PM, Pierre Morel wrote:  
> >   
> >>>>
> >>>> Just wanted to say that we've had a very similar discussion with
> >>>> Cornelia end of last year and came to the conclusion that explicitly
> >>>> matching the PFT is likely the safest bet:
> >>>> https://lkml.org/lkml/2020/12/22/479  
> >>>
> >>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
> >>>
> >>> The all point here is about taking decisions for the right reasons.
> >>>
> >>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.  
> >>
> >> Yes but that goes both ways the functionality of the region has to match
> >> that of the device and at least in it's current state the regions functionality
> >> matches only ISM in a way that is so specific that it is very unlikely to match anything
> >> else. For example it can't support a PCI device that requires non-MIO but
> >> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
> >> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
> >> problem.  
> > 
> > 
> > What does that change if we take one or the other solution considering the checking of MIO/MSIX/relax versus PFT?  
> 
> 
> If it's !MIO && !MSIX && relax_align I'm fine with that check but
> then we should also add PCISTG to the region.
> 

Just to double check: that would today cover only ISM (which doesn't
use PCISTG), correct?

/me getting a bit lost in this discussion :)
Matthew Rosato Jan. 21, 2021, 6:06 p.m. UTC | #16
On 1/21/21 12:50 PM, Cornelia Huck wrote:
> On Thu, 21 Jan 2021 15:54:22 +0100
> Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> 
>> On 1/21/21 3:46 PM, Pierre Morel wrote:
>>>
>>>
>>> On 1/21/21 2:37 PM, Niklas Schnelle wrote:
>>>>
>>>>
>>>> On 1/21/21 1:30 PM, Pierre Morel wrote:
>>>    
>>>>>>
>>>>>> Just wanted to say that we've had a very similar discussion with
>>>>>> Cornelia end of last year and came to the conclusion that explicitly
>>>>>> matching the PFT is likely the safest bet:
>>>>>> https://lkml.org/lkml/2020/12/22/479
>>>>>
>>>>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
>>>>>
>>>>> The all point here is about taking decisions for the right reasons.
>>>>>
>>>>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.
>>>>
>>>> Yes but that goes both ways the functionality of the region has to match
>>>> that of the device and at least in it's current state the regions functionality
>>>> matches only ISM in a way that is so specific that it is very unlikely to match anything
>>>> else. For example it can't support a PCI device that requires non-MIO but
>>>> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
>>>> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
>>>> problem.
>>>
>>>
>>> What does that change if we take one or the other solution considering the checking of MIO/MSIX/relax versus PFT?
>>
>>
>> If it's !MIO && !MSIX && relax_align I'm fine with that check but
>> then we should also add PCISTG to the region.
>>
> 
> Just to double check: that would today cover only ISM (which doesn't
> use PCISTG), correct?
> 

Yes, correct -- So to summarize the proposal outlined is to use those 
features to determine whether a device should be using the region or not 
rather rather than strictly saying only PFT==ISM may use it.

Practically speaking, ISM is the only device that fits the bill today 
when you combine these things, and ISM does not use PCISTG -- so PCISTG 
support was simply omitted from the region (prior versions before coming 
upstream included it, was dropped since ISM doesn't use it).

What Niklas suggests here is that, if we are going to be broad in 
determining whether the region can be used for a given device vs saying 
'only PFT==ISM', then we can no longer assume that the device doesn't 
use PCISTG and as such is suggesting the region should also include 
PCISTG support as a means of future compatibility (to ensure non-MIO 
PCISTG is issued for the device).

> /me getting a bit lost in this discussion :)
>
Cornelia Huck Jan. 22, 2021, 4:46 p.m. UTC | #17
On Thu, 21 Jan 2021 13:06:24 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 1/21/21 12:50 PM, Cornelia Huck wrote:
> > On Thu, 21 Jan 2021 15:54:22 +0100
> > Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> >   
> >> On 1/21/21 3:46 PM, Pierre Morel wrote:  
> >>>
> >>>
> >>> On 1/21/21 2:37 PM, Niklas Schnelle wrote:  
> >>>>
> >>>>
> >>>> On 1/21/21 1:30 PM, Pierre Morel wrote:  
> >>>      
> >>>>>>
> >>>>>> Just wanted to say that we've had a very similar discussion with
> >>>>>> Cornelia end of last year and came to the conclusion that explicitly
> >>>>>> matching the PFT is likely the safest bet:
> >>>>>> https://lkml.org/lkml/2020/12/22/479  
> >>>>>
> >>>>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
> >>>>>
> >>>>> The all point here is about taking decisions for the right reasons.
> >>>>>
> >>>>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.  
> >>>>
> >>>> Yes but that goes both ways the functionality of the region has to match
> >>>> that of the device and at least in it's current state the regions functionality
> >>>> matches only ISM in a way that is so specific that it is very unlikely to match anything
> >>>> else. For example it can't support a PCI device that requires non-MIO but
> >>>> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
> >>>> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
> >>>> problem.  
> >>>
> >>>
> >>> What does that change if we take one or the other solution considering the checking of MIO/MSIX/relax versus PFT?  
> >>
> >>
> >> If it's !MIO && !MSIX && relax_align I'm fine with that check but
> >> then we should also add PCISTG to the region.
> >>  
> > 
> > Just to double check: that would today cover only ISM (which doesn't
> > use PCISTG), correct?
> >   
> 
> Yes, correct -- So to summarize the proposal outlined is to use those 
> features to determine whether a device should be using the region or not 
> rather rather than strictly saying only PFT==ISM may use it.
> 
> Practically speaking, ISM is the only device that fits the bill today 
> when you combine these things, and ISM does not use PCISTG -- so PCISTG 
> support was simply omitted from the region (prior versions before coming 
> upstream included it, was dropped since ISM doesn't use it).
> 
> What Niklas suggests here is that, if we are going to be broad in 
> determining whether the region can be used for a given device vs saying 
> 'only PFT==ISM', then we can no longer assume that the device doesn't 
> use PCISTG and as such is suggesting the region should also include 
> PCISTG support as a means of future compatibility (to ensure non-MIO 
> PCISTG is issued for the device).

Yes, if we go the "check for features" route, including PCISTG makes
sense.

However, I'm still not quite sure whether checking for ISM vs checking
for features makes more sense in the long run. Is ISM that one weird
device whose cousins you're not going to invite to the party, or is
there a possibility that there will be devices that could make use of
the region for performance etc.?
Matthew Rosato Jan. 25, 2021, 2:55 p.m. UTC | #18
On 1/22/21 11:46 AM, Cornelia Huck wrote:
> On Thu, 21 Jan 2021 13:06:24 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 1/21/21 12:50 PM, Cornelia Huck wrote:
>>> On Thu, 21 Jan 2021 15:54:22 +0100
>>> Niklas Schnelle <schnelle@linux.ibm.com> wrote:
>>>    
>>>> On 1/21/21 3:46 PM, Pierre Morel wrote:
>>>>>
>>>>>
>>>>> On 1/21/21 2:37 PM, Niklas Schnelle wrote:
>>>>>>
>>>>>>
>>>>>> On 1/21/21 1:30 PM, Pierre Morel wrote:
>>>>>       
>>>>>>>>
>>>>>>>> Just wanted to say that we've had a very similar discussion with
>>>>>>>> Cornelia end of last year and came to the conclusion that explicitly
>>>>>>>> matching the PFT is likely the safest bet:
>>>>>>>> https://lkml.org/lkml/2020/12/22/479
>>>>>>>
>>>>>>> What I see there is a discussion on the relation between relaxed access and MIO without explaining to Connie that we have in the kernel the possibility to know if a device support MIO or not independently of it supports the relaxed access.
>>>>>>>
>>>>>>> The all point here is about taking decisions for the right reasons.
>>>>>>>
>>>>>>> We have the possibility to take the decision based on functionalities and not on a specific PCI function.
>>>>>>
>>>>>> Yes but that goes both ways the functionality of the region has to match
>>>>>> that of the device and at least in it's current state the regions functionality
>>>>>> matches only ISM in a way that is so specific that it is very unlikely to match anything
>>>>>> else. For example it can't support a PCI device that requires non-MIO but
>>>>>> also MSI-X. In its current form it doesn't even support PCI Store only PCI Store
>>>>>> Block, we had that in an earlier version and it's trivial but then we get the MSI-X
>>>>>> problem.
>>>>>
>>>>>
>>>>> What does that change if we take one or the other solution considering the checking of MIO/MSIX/relax versus PFT?
>>>>
>>>>
>>>> If it's !MIO && !MSIX && relax_align I'm fine with that check but
>>>> then we should also add PCISTG to the region.
>>>>   
>>>
>>> Just to double check: that would today cover only ISM (which doesn't
>>> use PCISTG), correct?
>>>    
>>
>> Yes, correct -- So to summarize the proposal outlined is to use those
>> features to determine whether a device should be using the region or not
>> rather rather than strictly saying only PFT==ISM may use it.
>>
>> Practically speaking, ISM is the only device that fits the bill today
>> when you combine these things, and ISM does not use PCISTG -- so PCISTG
>> support was simply omitted from the region (prior versions before coming
>> upstream included it, was dropped since ISM doesn't use it).
>>
>> What Niklas suggests here is that, if we are going to be broad in
>> determining whether the region can be used for a given device vs saying
>> 'only PFT==ISM', then we can no longer assume that the device doesn't
>> use PCISTG and as such is suggesting the region should also include
>> PCISTG support as a means of future compatibility (to ensure non-MIO
>> PCISTG is issued for the device).
> 
> Yes, if we go the "check for features" route, including PCISTG makes
> sense.
> 
> However, I'm still not quite sure whether checking for ISM vs checking
> for features makes more sense in the long run. Is ISM that one weird
> device whose cousins you're not going to invite to the party, or is

...  Yes. :)

> there a possibility that there will be devices that could make use of
> the region for performance etc.?
> 

Also yes.  Well, potentially.  That's really what I wanted at the 
outset, but the fly in the ointment is that the region as I exploit it 
in QEMU today only works properly for devices without MSI-X, which 
frankly limits its general-purpose utility.  If we could lift that 
restriction (whether it be now or at a later point) then the region 
becomes quite beneficial for performance of any devices making frequent 
use of large PCISTB operations.