diff mbox series

x86/msi: always propagate MSI writes when not in active system mode

Message ID 20250318082945.52019-1-roger.pau@citrix.com (mailing list archive)
State New

Commit Message

Roger Pau Monne March 18, 2025, 8:29 a.m. UTC
Relax the limitation on MSI register writes, and only apply it when the
system is in active state.  For example, the AMD IOMMU driver relies on
set_msi_affinity() to force an MSI register write on resume from
suspend.

The intention of the original patch was to reduce the number of MSI
register writes when the system is in active state.  In all other states
always perform the writes, as that's safer given the existing code, and
it's not expected to make a difference performance-wise.

For such propagation to work even when the IRT index is not updated, the
MSI message must be adjusted in all success cases for AMD IOMMU, not just
when the index has been newly allocated.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: 8e60d47cf011 ('x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/msi.c                       | 9 +++++++++
 xen/drivers/passthrough/amd/iommu_intr.c | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Jan Beulich March 18, 2025, 8:36 a.m. UTC | #1
On 18.03.2025 09:29, Roger Pau Monne wrote:
> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>                                              &msi_desc->remap_index,
>                                              msg, &data);
> -    if ( rc > 0 )
> +    if ( rc >= 0 )
>      {
>          for ( i = 1; i < nr; ++i )
>              msi_desc[i].remap_index = msi_desc->remap_index + i;

I understand that Marek's testing has made clear that this change is needed,
yet I don't understand it. If we didn't allocate a new index, why would we
need to update in-memory state, when memory is preserved across S3? (This
lack of understanding on my part is why I didn't associate the last
paragraph of the description with this extra change, when you first sent it
in this shape on the original thread.)

Jan

Roger Pau Monne March 18, 2025, 8:54 a.m. UTC | #2
On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote:
> On 18.03.2025 09:29, Roger Pau Monne wrote:
> > --- a/xen/drivers/passthrough/amd/iommu_intr.c
> > +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> > @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
> >      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
> >                                              &msi_desc->remap_index,
> >                                              msg, &data);
> > -    if ( rc > 0 )
> > +    if ( rc >= 0 )
> >      {
> >          for ( i = 1; i < nr; ++i )
> >              msi_desc[i].remap_index = msi_desc->remap_index + i;
> 
> I understand that Marek's testing has made clear that this change is needed,
> yet I don't understand it. If we didn't allocate a new index, why would we
> need to update in-memory state, when memory is preserved across S3?

Is this always the case for device memory? (iow: contents of the BARs
and possibly the PCI config space?)

> (This
> lack of understanding on my part is why I didn't associate the last
> paragraph of the description with this extra change, when you first sent it
> in this shape on the original thread.)

At least for the AMD IOMMU driver it seems to be expected.  See how
amd_iommu_resume() performs a pair of disable_iommu() and
enable_iommu() calls, and in the enable_iommu() function there's a
call to set_{msi,x2apic}_affinity() that's expected to (re)set the
interrupts.  Or at least that would be my understanding.

This change reverts the behavior to what it used to be prior to
8e60d47cf011 for the suspend and resume paths.  I'm afraid I don't
have a sensible way to test changes in that area, so I cannot
investigate much.

Thanks, Roger.

Jan Beulich March 18, 2025, 10:14 a.m. UTC | #3
On 18.03.2025 09:54, Roger Pau Monné wrote:
> On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote:
>> On 18.03.2025 09:29, Roger Pau Monne wrote:
>>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
>>>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>>>                                              &msi_desc->remap_index,
>>>                                              msg, &data);
>>> -    if ( rc > 0 )
>>> +    if ( rc >= 0 )
>>>      {
>>>          for ( i = 1; i < nr; ++i )
>>>              msi_desc[i].remap_index = msi_desc->remap_index + i;
>>
>> I understand that Marek's testing has made clear that this change is needed,
>> yet I don't understand it. If we didn't allocate a new index, why would we
>> need to update in-memory state, when memory is preserved across S3?
> 
> Is this always the case for device memory? (iow: contents of the BARs
> and possibly the PCI config space?)

Of course not. But msi_desc[] is in RAM.

>> (This
>> lack of understanding on my part is why I didn't associate the last
>> paragraph of the description with this extra change, when you first sent it
>> in this shape on the original thread.)
> 
> At least for the AMD IOMMU driver it seems to be expected.  See how
> amd_iommu_resume() performs a pair of disable_iommu() and
> enable_iommu() calls, and in the enable_iommu() function there's a
> call to set_{msi,x2apic}_affinity() that's expected to (re)set the
> interrupts.  Or at least that would be my understanding.
> 
> This change reverts the behavior to what it used to be prior to
> 8e60d47cf011 for the suspend and resume paths.  I'm afraid I don't
> have a sensible way to test changes in that area, so I cannot
> investigate much.

So how did you end up considering this may have been the reason for the
failure Marek was still seeing with the earlier form of the patch? I'm
simply hesitant to ack something that I don't understand at all.

Jan

Roger Pau Monne March 18, 2025, 10:45 a.m. UTC | #4
On Tue, Mar 18, 2025 at 11:14:59AM +0100, Jan Beulich wrote:
> On 18.03.2025 09:54, Roger Pau Monné wrote:
> > On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote:
> >> On 18.03.2025 09:29, Roger Pau Monne wrote:
> >>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> >>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> >>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
> >>>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
> >>>                                              &msi_desc->remap_index,
> >>>                                              msg, &data);
> >>> -    if ( rc > 0 )
> >>> +    if ( rc >= 0 )
> >>>      {
> >>>          for ( i = 1; i < nr; ++i )
> >>>              msi_desc[i].remap_index = msi_desc->remap_index + i;
> >>
> >> I understand that Marek's testing has made clear that this change is needed,
> >> yet I don't understand it. If we didn't allocate a new index, why would we
> >> need to update in-memory state, when memory is preserved across S3?
> > 
> > Is this always the case for device memory? (iow: contents of the BARs
> > and possibly the PCI config space?)
> 
> Of course not. But msi_desc[] is in RAM.

Sorry, I think I didn't understand your earlier question, and hence
the reply I provided didn't make any sense to you.

> >> (This
> >> lack of understanding on my part is why I didn't associate the last
> >> paragraph of the description with this extra change, when you first sent it
> >> in this shape on the original thread.)
> > 
> > At least for the AMD IOMMU driver it seems to be expected.  See how
> > amd_iommu_resume() performs a pair of disable_iommu() and
> > enable_iommu() calls, and in the enable_iommu() function there's a
> > call to set_{msi,x2apic}_affinity() that's expected to (re)set the
> > interrupts.  Or at least that would be my understanding.
> > 
> > This change reverts the behavior to what it used to be prior to
> > 8e60d47cf011 for the suspend and resume paths.  I'm afraid I don't
> > have a sensible way to test changes in that area, so I cannot
> > investigate much.
> 
> So how did you end up considering this may have been the reason for the
> failure Marek was still seeing with the earlier form of the patch? I'm
> simply hesitant to ack something that I don't understand at all.

Oh, I think I know what you are missing, and it's because it's out of
patch context.  The adjusted chunk in amd_iommu_msi_msg_update_ire()
does:

    if ( rc >= 0 )
    {
        for ( i = 1; i < nr; ++i )
            msi_desc[i].remap_index = msi_desc->remap_index + i;
        msg->data = data;
    }

Note how it sets msg->data, as otherwise the field won't be properly
set, and hence the caller propagating the contents of `msg` to the
registers would be incorrect.

The change forces msg->data to be correctly set when returning either
0 or 1, so that propagation to the hardware can be done in both
cases.  Previously the contents of msg->data were only correct when
returning 1 on AMD.

Hope this makes more sense, sorry for not understanding your question
initially.
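
The point above can be illustrated with a small standalone sketch.  It is
not the Xen code: the structures are cut down to the fields used, every
model_* name is hypothetical, and it assumes (as a simplification) that the
remapped MSI data value is just the IRT index.  The 'fixed' flag switches
between the old `rc > 0` check and the new `rc >= 0` one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Xen structures; only the fields used here. */
struct msi_msg { uint32_t data; };
struct msi_desc { int remap_index; };

/*
 * Model of the helper that updates the interrupt remapping entry.
 * Returns 1 when a new IRT index had to be allocated and 0 when the
 * existing index was reused.  The value the device must be programmed
 * with is returned through *data in both cases.
 */
static int model_update_entry(int *remap_index, uint32_t *data)
{
    int rc = 0;

    if ( *remap_index < 0 )
    {
        *remap_index = 8;            /* arbitrary index for the sketch */
        rc = 1;
    }

    /* Assumption: remapped data simply carries the IRT index. */
    *data = (uint32_t)*remap_index;

    return rc;
}

/* Model of the adjusted chunk; 'fixed' selects rc >= 0 over rc > 0. */
static int model_update_ire(struct msi_desc *desc, unsigned int nr,
                            struct msi_msg *msg, int fixed)
{
    uint32_t data;
    unsigned int i;
    int rc;

    assert(nr >= 1);
    rc = model_update_entry(&desc->remap_index, &data);

    if ( fixed ? rc >= 0 : rc > 0 )
    {
        for ( i = 1; i < nr; ++i )
            desc[i].remap_index = desc->remap_index + i;
        /* Only here does the caller's message get the remapped value. */
        msg->data = data;
    }

    return rc;
}
```

With the old check, a second call on an already allocated entry returns 0
and leaves msg->data untouched, so a caller that goes on to write `msg` to
the device would program a stale value.  With the new check msg->data is
filled in for both return values.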

Thanks, Roger.

Marek Marczykowski-Górecki March 18, 2025, 11:23 a.m. UTC | #5
On Tue, Mar 18, 2025 at 09:29:45AM +0100, Roger Pau Monne wrote:
> Relax the limitation on MSI register writes, and only apply it when the
> system is in active state.  For example AMD IOMMU drivers rely on using
> set_msi_affinity() to force an MSI register write on resume from
> suspension.
> 
> The original patch intention was to reduce the number of MSI register
> writes when the system is in active state.  Leave the other states to
> always perform the writes, as it's safer given the existing code, and it's
> expected to not make a difference performance wise.
> 
> For such propagation to work even when the IRT index is not updated the MSI
> message must be adjusted in all success cases for AMD IOMMU, not just when
> the index has been newly allocated.
> 
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

> ---
>  xen/arch/x86/msi.c                       | 9 +++++++++
>  xen/drivers/passthrough/amd/iommu_intr.c | 2 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 163ccf874720..8bb3bb18af61 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
>  {
>      entry->msg = *msg;
>  
> +    if ( unlikely(system_state != SYS_STATE_active) )
> +        /*
> +         * Always propagate writes when not in the 'active' state.  The
> +         * optimization to avoid the MSI address and data registers write is
> +         * only relevant for runtime state, and drivers on resume (at least)
> +         * rely on set_msi_affinity() to update the hardware state.
> +         */
> +        force = true;
> +
>      if ( iommu_intremap != iommu_intremap_off )
>      {
>          int rc;
> diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
> index 9abdc38053d7..08766122b421 100644
> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>                                              &msi_desc->remap_index,
>                                              msg, &data);
> -    if ( rc > 0 )
> +    if ( rc >= 0 )
>      {
>          for ( i = 1; i < nr; ++i )
>              msi_desc[i].remap_index = msi_desc->remap_index + i;
> -- 
> 2.48.1
>

Jan Beulich March 18, 2025, 11:31 a.m. UTC | #6
On 18.03.2025 11:45, Roger Pau Monné wrote:
> On Tue, Mar 18, 2025 at 11:14:59AM +0100, Jan Beulich wrote:
>> On 18.03.2025 09:54, Roger Pau Monné wrote:
>>> On Tue, Mar 18, 2025 at 09:36:37AM +0100, Jan Beulich wrote:
>>>> On 18.03.2025 09:29, Roger Pau Monne wrote:
>>>>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>>>>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>>>>> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
>>>>>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>>>>>                                              &msi_desc->remap_index,
>>>>>                                              msg, &data);
>>>>> -    if ( rc > 0 )
>>>>> +    if ( rc >= 0 )
>>>>>      {
>>>>>          for ( i = 1; i < nr; ++i )
>>>>>              msi_desc[i].remap_index = msi_desc->remap_index + i;
>>>>
>>>> I understand that Marek's testing has made clear that this change is needed,
>>>> yet I don't understand it. If we didn't allocate a new index, why would we
>>>> need to update in-memory state, when memory is preserved across S3?
>>>
>>> Is this always the case for device memory? (iow: contents of the BARs
>>> and possibly the PCI config space?)
>>
>> Of course not. But msi_desc[] is in RAM.
> 
> Sorry, I think I didn't understand your earlier question, and hence
> the reply I provided didn't make any sense to you.
> 
>>>> (This
>>>> lack of understanding on my part is why I didn't associate the last
>>>> paragraph of the description with this extra change, when you first sent it
>>>> in this shape on the original thread.)
>>>
>>> At least for the AMD IOMMU driver it seems to be expected.  See how
>>> amd_iommu_resume() performs a pair of disable_iommu() and
>>> enable_iommu() calls, and in the enable_iommu() function there's a
>>> call to set_{msi,x2apic}_affinity() that's expected to (re)set the
>>> interrupts.  Or at least that would be my understanding.
>>>
>>> This change reverts the behavior to what it used to be prior to
>>> 8e60d47cf011 for the suspend and resume paths.  I'm afraid I don't
>>> have a sensible way to test changes in that area, so I cannot
>>> investigate much.
>>
>> So how did you end up considering this may have been the reason for the
>> failure Marek was still seeing with the earlier form of the patch? I'm
>> simply hesitant to ack something that I don't understand at all.
> 
> Oh, I think I know what you are missing, and it's because it's out of
> patch context.  The adjusted chunk in amd_iommu_msi_msg_update_ire()
> does:
> 
>     if ( rc >= 0 )
>     {
>         for ( i = 1; i < nr; ++i )
>             msi_desc[i].remap_index = msi_desc->remap_index + i;
>         msg->data = data;
>     }
> 
> Note how it sets msg->data, as otherwise the field won't be properly
> set, and hence the caller propagating the contents of `msg` to the
> registers would be incorrect.
> 
> The change forces msg->data to be correctly set when returning either
> 0 or 1, so that propagation to the hardware can be done in both
> cases.  Previously the contents of msg->data were only correct when
> returning 1 on AMD.

Oh, I see. The loop is entirely benign in this case. I did look at the
full function, but I didn't make the connection from the writing of
msg->data.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

Patch

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 163ccf874720..8bb3bb18af61 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -189,6 +189,15 @@  static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
 {
     entry->msg = *msg;
 
+    if ( unlikely(system_state != SYS_STATE_active) )
+        /*
+         * Always propagate writes when not in the 'active' state.  The
+         * optimization to avoid the MSI address and data registers write is
+         * only relevant for runtime state, and drivers on resume (at least)
+         * rely on set_msi_affinity() to update the hardware state.
+         */
+        force = true;
+
     if ( iommu_intremap != iommu_intremap_off )
     {
         int rc;
diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
index 9abdc38053d7..08766122b421 100644
--- a/xen/drivers/passthrough/amd/iommu_intr.c
+++ b/xen/drivers/passthrough/amd/iommu_intr.c
@@ -546,7 +546,7 @@  int cf_check amd_iommu_msi_msg_update_ire(
     rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                             &msi_desc->remap_index,
                                             msg, &data);
-    if ( rc > 0 )
+    if ( rc >= 0 )
     {
         for ( i = 1; i < nr; ++i )
             msi_desc[i].remap_index = msi_desc->remap_index + i;
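
The effect of the msi.c hunk on the resume path can be sketched with a
small standalone model.  This is an illustration under stated assumptions,
not the Xen implementation: all model_* names and the STATE_* values are
hypothetical, the IOMMU update result is passed in as `irt_rc` (0 for
"index unchanged", 1 for "index changed", negative for an error), and the
device is reduced to a single data register whose contents are lost across
suspend while the cached copy in RAM survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical system states; only "active" vs "not active" matters. */
enum model_sys_state { STATE_BOOT, STATE_ACTIVE, STATE_RESUME };

struct model_dev {
    uint32_t reg_data;     /* device register, lost across suspend */
    uint32_t cached_data;  /* in-RAM copy, preserved across suspend */
    unsigned int writes;   /* number of register writes performed */
};

static enum model_sys_state model_state = STATE_ACTIVE;

/*
 * Model of the write path: the register write is skipped when the IRT
 * index did not change, unless the write is forced.  Outside of the
 * active state the write is always forced.
 */
static int model_write_msi_msg(struct model_dev *dev, uint32_t data,
                               int irt_rc, bool force)
{
    dev->cached_data = data;

    if ( model_state != STATE_ACTIVE )
        force = true;

    /* Error, or nothing changed and no forced write requested. */
    if ( irt_rc < 0 || (irt_rc == 0 && !force) )
        return irt_rc;

    dev->reg_data = data;
    dev->writes++;

    return 0;
}
```

In this model, a call made with `irt_rc == 0` while the state is active
leaves the register alone, which is the intended runtime optimization.  The
same call made during resume writes the register, which restores device
state that was lost even though nothing changed in RAM.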