Message ID | 20250318082945.52019-1-roger.pau@citrix.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86/msi: always propagate MSI writes when not in active system mode | expand |
On 18.03.2025 09:29, Roger Pau Monne wrote: > --- a/xen/drivers/passthrough/amd/iommu_intr.c > +++ b/xen/drivers/passthrough/amd/iommu_intr.c > @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( > rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, > &msi_desc->remap_index, > msg, &data); > - if ( rc > 0 ) > + if ( rc >= 0 ) > { > for ( i = 1; i < nr; ++i ) > msi_desc[i].remap_index = msi_desc->remap_index + i; I understand that Marek's testing has made clear that this change is needed, yet I don't understand it. If we didn't allocate a new index, why would we need to update in-memory state, when memory is preserved across S3? (This lack of understanding on my part is why I didn't associate the last paragraph of the description with this extra change, when you first sent it in this shape on the original thread.) Jan
On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote: > On 18.03.2025 09:29, Roger Pau Monne wrote: > > --- a/xen/drivers/passthrough/amd/iommu_intr.c > > +++ b/xen/drivers/passthrough/amd/iommu_intr.c > > @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( > > rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, > > &msi_desc->remap_index, > > msg, &data); > > - if ( rc > 0 ) > > + if ( rc >= 0 ) > > { > > for ( i = 1; i < nr; ++i ) > > msi_desc[i].remap_index = msi_desc->remap_index + i; > > I understand that Marek's testing has made clear that this change is needed, > yet I don't understand it. If we didn't allocate a new index, why would we > need to update in-memory state, when memory is preserved across S3? Is this always the case for device memory? (iow: contents of the BARs and possibly the PCI config space?) > (This > lack of understanding on my part is why I didn't associate the last > paragraph of the description with this extra change, when you first sent it > in this shape on the original thread.) At least for the AMD IOMMU driver it seems to be expected. See how amd_iommu_resume() performs a pair of disable_iommu() and enable_iommu() calls, and in the enable_iommu() function there's a call to set_{msi,x2apic}_affinity() that's expected to (re)set the interrupts. Or at least that would be my understanding. This change reverts the behavior to what it used to be prior to 8e60d47cf011 for the suspend and resume paths. I'm afraid I don't have a sensible way to test changes in that area, so I cannot investigate much. Thanks, Roger.
On 18.03.2025 09:54, Roger Pau Monné wrote: > On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote: >> On 18.03.2025 09:29, Roger Pau Monne wrote: >>> --- a/xen/drivers/passthrough/amd/iommu_intr.c >>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c >>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( >>> rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, >>> &msi_desc->remap_index, >>> msg, &data); >>> - if ( rc > 0 ) >>> + if ( rc >= 0 ) >>> { >>> for ( i = 1; i < nr; ++i ) >>> msi_desc[i].remap_index = msi_desc->remap_index + i; >> >> I understand that Marek's testing has made clear that this change is needed, >> yet I don't understand it. If we didn't allocate a new index, why would we >> need to update in-memory state, when memory is preserved across S3? > > Is this always the case for device memory? (iow: contents of the BARs > and possibly the PCI config space?) Of course not. But msi_desc[] is in RAM. >> (This >> lack of understanding on my part is why I didn't associate the last >> paragraph of the description with this extra change, when you first sent it >> in this shape on the original thread.) > > At least for the AMD IOMMU driver it seems to be expected. See how > amd_iommu_resume() performs a pair of disable_iommu() and > enable_iommu() calls, and in the enable_iommu() function there's a > call to set_{msi,x2apic}_affinity() that's expected to (re)set the > interrupts. Or at least that would be my understanding. > > This change reverts the behavior to what it used to be prior to > 8e60d47cf011 for the suspend and resume paths. I'm afraid I don't > have a sensible way to test changes in that area, so I cannot > investigate much. So how did you end up considering this may have been the reason for the failure Marek was still seeing with the earlier form of the patch? I'm simply hesitant to ack something that I don't understand at all. Jan
On Tue, Mar 18, 2025 at 11:14:59AM +0100, Jan Beulich wrote: > On 18.03.2025 09:54, Roger Pau Monné wrote: > > On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote: > >> On 18.03.2025 09:29, Roger Pau Monne wrote: > >>> --- a/xen/drivers/passthrough/amd/iommu_intr.c > >>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c > >>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( > >>> rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, > >>> &msi_desc->remap_index, > >>> msg, &data); > >>> - if ( rc > 0 ) > >>> + if ( rc >= 0 ) > >>> { > >>> for ( i = 1; i < nr; ++i ) > >>> msi_desc[i].remap_index = msi_desc->remap_index + i; > >> > >> I understand that Marek's testing has made clear that this change is needed, > >> yet I don't understand it. If we didn't allocate a new index, why would we > >> need to update in-memory state, when memory is preserved across S3? > > > > Is this always the case for device memory? (iow: contents of the BARs > > and possibly the PCI config space?) > > Of course not. But msi_desc[] is in RAM. Sorry, I think I didn't understand your earlier question, and hence the reply I provided didn't make any sense to you. > >> (This > >> lack of understanding on my part is why I didn't associate the last > >> paragraph of the description with this extra change, when you first sent it > >> in this shape on the original thread.) > > > > At least for the AMD IOMMU driver it seems to be expected. See how > > amd_iommu_resume() performs a pair of disable_iommu() and > > enable_iommu() calls, and in the enable_iommu() function there's a > > call to set_{msi,x2apic}_affinity() that's expected to (re)set the > > interrupts. Or at least that would be my understanding. > > > > This change reverts the behavior to what it used to be prior to > > 8e60d47cf011 for the suspend and resume paths. I'm afraid I don't > > have a sensible way to test changes in that area, so I cannot > > investigate much. > > So how did you end up considering this may have been the reason for the > failure Marek was still seeing with the earlier form of the patch? I'm > simply hesitant to ack something that I don't understand at all. Oh, I think I know what you are missing, and it's because it's out of patch context. The adjusted chunk in amd_iommu_msi_msg_update_ire() does: if ( rc >= 0 ) { for ( i = 1; i < nr; ++i ) msi_desc[i].remap_index = msi_desc->remap_index + i; msg->data = data; } Note how it sets msg->data, as otherwise the field won't be properly set, and hence the caller propagating the contents of `msg` to the registers would be incorrect. The change forces msg->data to be correctly set when returning either 0 or 1, so that propagation to the hardware can be done in both cases. Previously the contents of msg->data where only correct when returning 1 on AMD. Hope this makes more sense, sorry for not understanding your question initially. Thanks, Roger.
On Tue, Mar 18, 2025 at 09:29:45AM +0100, Roger Pau Monne wrote: > Relax the limitation on MSI register writes, and only apply it when the > system is in active state. For example AMD IOMMU drivers rely on using > set_msi_affinity() to force an MSI register write on resume from > suspension. > > The original patch intention was to reduce the number of MSI register > writes when the system is in active state. Leave the other states to > always perform the writes, as it's safer given the existing code, and it's > expected to not make a difference performance wise. > > For such propagation to work even when the IRT index is not updated the MSI > message must be adjusted in all success cases for AMD IOMMU, not just when > the index has been newly allocated. > > Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> > Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed') > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> > --- > xen/arch/x86/msi.c | 9 +++++++++ > xen/drivers/passthrough/amd/iommu_intr.c | 2 +- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c > index 163ccf874720..8bb3bb18af61 100644 > --- a/xen/arch/x86/msi.c > +++ b/xen/arch/x86/msi.c > @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg, > { > entry->msg = *msg; > > + if ( unlikely(system_state != SYS_STATE_active) ) > + /* > + * Always propagate writes when not in the 'active' state. The > + * optimization to avoid the MSI address and data registers write is > + * only relevant for runtime state, and drivers on resume (at least) > + * rely on set_msi_affinity() to update the hardware state. > + */ > + force = true; > + > if ( iommu_intremap != iommu_intremap_off ) > { > int rc; > diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c > index 9abdc38053d7..08766122b421 100644 > --- a/xen/drivers/passthrough/amd/iommu_intr.c > +++ b/xen/drivers/passthrough/amd/iommu_intr.c > @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( > rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, > &msi_desc->remap_index, > msg, &data); > - if ( rc > 0 ) > + if ( rc >= 0 ) > { > for ( i = 1; i < nr; ++i ) > msi_desc[i].remap_index = msi_desc->remap_index + i; > -- > 2.48.1 >
On 18.03.2025 11:45, Roger Pau Monné wrote: > On Tue, Mar 18, 2025 at 11:14:59AM +0100, Jan Beulich wrote: >> On 18.03.2025 09:54, Roger Pau Monné wrote: >>> On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote: >>>> On 18.03.2025 09:29, Roger Pau Monne wrote: >>>>> --- a/xen/drivers/passthrough/amd/iommu_intr.c >>>>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c >>>>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( >>>>> rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, >>>>> &msi_desc->remap_index, >>>>> msg, &data); >>>>> - if ( rc > 0 ) >>>>> + if ( rc >= 0 ) >>>>> { >>>>> for ( i = 1; i < nr; ++i ) >>>>> msi_desc[i].remap_index = msi_desc->remap_index + i; >>>> >>>> I understand that Marek's testing has made clear that this change is needed, >>>> yet I don't understand it. If we didn't allocate a new index, why would we >>>> need to update in-memory state, when memory is preserved across S3? >>> >>> Is this always the case for device memory? (iow: contents of the BARs >>> and possibly the PCI config space?) >> >> Of course not. But msi_desc[] is in RAM. > > Sorry, I think I didn't understand your earlier question, and hence > the reply I provided didn't make any sense to you. > >>>> (This >>>> lack of understanding on my part is why I didn't associate the last >>>> paragraph of the description with this extra change, when you first sent it >>>> in this shape on the original thread.) >>> >>> At least for the AMD IOMMU driver it seems to be expected. See how >>> amd_iommu_resume() performs a pair of disable_iommu() and >>> enable_iommu() calls, and in the enable_iommu() function there's a >>> call to set_{msi,x2apic}_affinity() that's expected to (re)set the >>> interrupts. Or at least that would be my understanding. >>> >>> This change reverts the behavior to what it used to be prior to >>> 8e60d47cf011 for the suspend and resume paths. I'm afraid I don't >>> have a sensible way to test changes in that area, so I cannot >>> investigate much. >> >> So how did you end up considering this may have been the reason for the >> failure Marek was still seeing with the earlier form of the patch? I'm >> simply hesitant to ack something that I don't understand at all. > > Oh, I think I know what you are missing, and it's because it's out of > patch context. The adjusted chunk in amd_iommu_msi_msg_update_ire() > does: > > if ( rc >= 0 ) > { > for ( i = 1; i < nr; ++i ) > msi_desc[i].remap_index = msi_desc->remap_index + i; > msg->data = data; > } > > Note how it sets msg->data, as otherwise the field won't be properly > set, and hence the caller propagating the contents of `msg` to the > registers would be incorrect. > > The change forces msg->data to be correctly set when returning either > 0 or 1, so that propagation to the hardware can be done in both > cases. Previously the contents of msg->data where only correct when > returning 1 on AMD. Oh, I see. The loop is entirely benign in this case. I did look at the full function, but I didn't make the connection from the writing of msg->data. Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c index 163ccf874720..8bb3bb18af61 100644 --- a/xen/arch/x86/msi.c +++ b/xen/arch/x86/msi.c @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg, { entry->msg = *msg; + if ( unlikely(system_state != SYS_STATE_active) ) + /* + * Always propagate writes when not in the 'active' state. The + * optimization to avoid the MSI address and data registers write is + * only relevant for runtime state, and drivers on resume (at least) + * rely on set_msi_affinity() to update the hardware state. + */ + force = true; + if ( iommu_intremap != iommu_intremap_off ) { int rc; diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c index 9abdc38053d7..08766122b421 100644 --- a/xen/drivers/passthrough/amd/iommu_intr.c +++ b/xen/drivers/passthrough/amd/iommu_intr.c @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire( rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, &msi_desc->remap_index, msg, &data); - if ( rc > 0 ) + if ( rc >= 0 ) { for ( i = 1; i < nr; ++i ) msi_desc[i].remap_index = msi_desc->remap_index + i;
Relax the limitation on MSI register writes, and only apply it when the system is in active state. For example AMD IOMMU drivers rely on using set_msi_affinity() to force an MSI register write on resume from suspension. The original patch intention was to reduce the number of MSI register writes when the system is in active state. Leave the other states to always perform the writes, as it's safer given the existing code, and it's expected to not make a difference performance wise. For such propagation to work even when the IRT index is not updated the MSI message must be adjusted in all success cases for AMD IOMMU, not just when the index has been newly allocated. Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/msi.c | 9 +++++++++ xen/drivers/passthrough/amd/iommu_intr.c | 2 +- 2 files changed, 10 insertions(+), 1 deletion(-)