
[v15,RESEND,22/23] PCI: starfive: Offload the NVMe timeout workaround to host drivers.

Message ID 20240227103522.80915-23-minda.chen@starfivetech.com (mailing list archive)
State Superseded
Series Refactoring Microchip PCIe driver and add StarFive PCIe

Checks

Context Check Description
conchuod/vmtest-fixes-PR fail merge-conflict

Commit Message

Minda Chen Feb. 27, 2024, 10:35 a.m. UTC
From: Kevin Xie <kevin.xie@starfivetech.com>

The StarFive JH7110 hardware can't always keep two inbound posted writes,
such as an MSI message and an NVMe completion, in order. If the NVMe
completion is written to memory later than the MSI arrives, the NVMe IRQ
handler will miss it.

As a workaround, wait a short while before calling the generic IRQ
handler here.

Verified with an NVMe SSD, a USB SSD and an R8169 NIC.
The performance is stable and even higher with this patch applied.

Signed-off-by: Kevin Xie <kevin.xie@starfivetech.com>
Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
---
 drivers/pci/controller/plda/pcie-plda-host.c | 12 ++++++++++++
 drivers/pci/controller/plda/pcie-plda.h      |  1 +
 drivers/pci/controller/plda/pcie-starfive.c  |  1 +
 3 files changed, 14 insertions(+)

Comments

Lorenzo Pieralisi Feb. 29, 2024, 3:08 p.m. UTC | #1
On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
> From: Kevin Xie <kevin.xie@starfivetech.com>
> 
> As the Starfive JH7110 hardware can't keep two inbound post write in
> order all the time, such as MSI messages and NVMe completions. If the
> NVMe completion update later than the MSI, an NVMe IRQ handle will miss.

Please explain what the problem is and what "NVMe completions" means
given that you are talking about posted writes.

If you have a link to an erratum write-up it would certainly help.

This looks completely broken to me, if the controller can't guarantee
PCIe transactions ordering it is toast, there is not even a point
considering mainline merging.

> As a workaround, we will wait a while before going to the generic
> handle here.
> 
> Verified with NVMe SSD, USB SSD, R8169 NIC.
> The performance are stable and even higher after this patch.

I assume this is a joke even though it does not make me laugh.

Thanks,
Lorenzo

> 
> Signed-off-by: Kevin Xie <kevin.xie@starfivetech.com>
> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> ---
>  drivers/pci/controller/plda/pcie-plda-host.c | 12 ++++++++++++
>  drivers/pci/controller/plda/pcie-plda.h      |  1 +
>  drivers/pci/controller/plda/pcie-starfive.c  |  1 +
>  3 files changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/controller/plda/pcie-plda-host.c b/drivers/pci/controller/plda/pcie-plda-host.c
> index a18923d7cea6..9e077ddf45c0 100644
> --- a/drivers/pci/controller/plda/pcie-plda-host.c
> +++ b/drivers/pci/controller/plda/pcie-plda-host.c
> @@ -13,6 +13,7 @@
>  #include <linux/msi.h>
>  #include <linux/pci_regs.h>
>  #include <linux/pci-ecam.h>
> +#include <linux/delay.h>
>  
>  #include "pcie-plda.h"
>  
> @@ -44,6 +45,17 @@ static void plda_handle_msi(struct irq_desc *desc)
>  			       bridge_base_addr + ISTATUS_LOCAL);
>  		status = readl_relaxed(bridge_base_addr + ISTATUS_MSI);
>  		for_each_set_bit(bit, &status, msi->num_vectors) {
> +			/*
> +			 * As the Starfive JH7110 hardware can't keep two
> +			 * inbound post write in order all the time, such as
> +			 * MSI messages and NVMe completions.
> +			 * If the NVMe completion update later than the MSI,
> +			 * an NVMe IRQ handle will miss.
> +			 * As a workaround, we will wait a while before
> +			 * going to the generic handle here.
> +			 */
> +			if (port->msi_quirk_delay_us)
> +				udelay(port->msi_quirk_delay_us);
>  			ret = generic_handle_domain_irq(msi->dev_domain, bit);
>  			if (ret)
>  				dev_err_ratelimited(dev, "bad MSI IRQ %d\n",
> diff --git a/drivers/pci/controller/plda/pcie-plda.h b/drivers/pci/controller/plda/pcie-plda.h
> index 04e385758a2f..feccf285dfe8 100644
> --- a/drivers/pci/controller/plda/pcie-plda.h
> +++ b/drivers/pci/controller/plda/pcie-plda.h
> @@ -186,6 +186,7 @@ struct plda_pcie_rp {
>  	int msi_irq;
>  	int intx_irq;
>  	int num_events;
> +	u16 msi_quirk_delay_us;
>  };
>  
>  struct plda_event {
> diff --git a/drivers/pci/controller/plda/pcie-starfive.c b/drivers/pci/controller/plda/pcie-starfive.c
> index 9bb9f0e29565..5cfc30572b7f 100644
> --- a/drivers/pci/controller/plda/pcie-starfive.c
> +++ b/drivers/pci/controller/plda/pcie-starfive.c
> @@ -391,6 +391,7 @@ static int starfive_pcie_probe(struct platform_device *pdev)
>  
>  	plda->host_ops = &sf_host_ops;
>  	plda->num_events = PLDA_MAX_EVENT_NUM;
> +	plda->msi_quirk_delay_us = 1;
>  	/* mask doorbell event */
>  	plda->events_bitmap = GENMASK(PLDA_INT_EVENT_NUM - 1, 0)
>  			     & ~BIT(PLDA_AXI_DOORBELL)
> -- 
> 2.17.1
> 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Palmer Dabbelt March 4, 2024, 6:08 p.m. UTC | #2
On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
> On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
>> From: Kevin Xie <kevin.xie@starfivetech.com>
>>
>> As the Starfive JH7110 hardware can't keep two inbound post write in
>> order all the time, such as MSI messages and NVMe completions. If the
>> NVMe completion update later than the MSI, an NVMe IRQ handle will miss.
>
> Please explain what the problem is and what "NVMe completions" means
> given that you are talking about posted writes.
>
> If you have a link to an erratum write-up it would certainly help.

I think we really need to see that errata document.  Our formal memory 
model doesn't account for device interactions so it's possible there's 
just some arch fence we can make stronger in order to get things ordered 
again -- we've had similar problems with some other RISC-V chips, and 
while it ends up being slow at least it's correct.

> This looks completely broken to me, if the controller can't guarantee
> PCIe transactions ordering it is toast, there is not even a point
> considering mainline merging.

I wouldn't be at all surprised if that's the case.  Without some 
concrete ISA mechanisms here we're sort of just stuck hoping the SOC 
vendors do the right thing, which is always terrifying.

I'm not really a PCIe person so this is all a bit vague, but IIRC we had 
a bunch of possible PCIe ordering violations in the SiFive memory system 
back when I was there and we never really got a scheme for making sure 
things were correct.

So I think we really do need to see that errata document to know what's 
possible here.  Folks have been able to come up with clever solutions to 
these problems before, maybe we'll get lucky again.

>> As a workaround, we will wait a while before going to the generic
>> handle here.
>>
>> Verified with NVMe SSD, USB SSD, R8169 NIC.
>> The performance are stable and even higher after this patch.
>
> I assume this is a joke even though it does not make me laugh.

So you're new to RISC-V, then?  It gets way worse than this ;)

> Thanks,
> Lorenzo
>
>>
>> Signed-off-by: Kevin Xie <kevin.xie@starfivetech.com>
>> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
>> ---
>>  drivers/pci/controller/plda/pcie-plda-host.c | 12 ++++++++++++
>>  drivers/pci/controller/plda/pcie-plda.h      |  1 +
>>  drivers/pci/controller/plda/pcie-starfive.c  |  1 +
>>  3 files changed, 14 insertions(+)
>>
>> diff --git a/drivers/pci/controller/plda/pcie-plda-host.c b/drivers/pci/controller/plda/pcie-plda-host.c
>> index a18923d7cea6..9e077ddf45c0 100644
>> --- a/drivers/pci/controller/plda/pcie-plda-host.c
>> +++ b/drivers/pci/controller/plda/pcie-plda-host.c
>> @@ -13,6 +13,7 @@
>>  #include <linux/msi.h>
>>  #include <linux/pci_regs.h>
>>  #include <linux/pci-ecam.h>
>> +#include <linux/delay.h>
>>
>>  #include "pcie-plda.h"
>>
>> @@ -44,6 +45,17 @@ static void plda_handle_msi(struct irq_desc *desc)
>>  			       bridge_base_addr + ISTATUS_LOCAL);
>>  		status = readl_relaxed(bridge_base_addr + ISTATUS_MSI);
>>  		for_each_set_bit(bit, &status, msi->num_vectors) {
>> +			/*
>> +			 * As the Starfive JH7110 hardware can't keep two
>> +			 * inbound post write in order all the time, such as
>> +			 * MSI messages and NVMe completions.
>> +			 * If the NVMe completion update later than the MSI,
>> +			 * an NVMe IRQ handle will miss.
>> +			 * As a workaround, we will wait a while before
>> +			 * going to the generic handle here.
>> +			 */
>> +			if (port->msi_quirk_delay_us)
>> +				udelay(port->msi_quirk_delay_us);
>>  			ret = generic_handle_domain_irq(msi->dev_domain, bit);
>>  			if (ret)
>>  				dev_err_ratelimited(dev, "bad MSI IRQ %d\n",
>> diff --git a/drivers/pci/controller/plda/pcie-plda.h b/drivers/pci/controller/plda/pcie-plda.h
>> index 04e385758a2f..feccf285dfe8 100644
>> --- a/drivers/pci/controller/plda/pcie-plda.h
>> +++ b/drivers/pci/controller/plda/pcie-plda.h
>> @@ -186,6 +186,7 @@ struct plda_pcie_rp {
>>  	int msi_irq;
>>  	int intx_irq;
>>  	int num_events;
>> +	u16 msi_quirk_delay_us;
>>  };
>>
>>  struct plda_event {
>> diff --git a/drivers/pci/controller/plda/pcie-starfive.c b/drivers/pci/controller/plda/pcie-starfive.c
>> index 9bb9f0e29565..5cfc30572b7f 100644
>> --- a/drivers/pci/controller/plda/pcie-starfive.c
>> +++ b/drivers/pci/controller/plda/pcie-starfive.c
>> @@ -391,6 +391,7 @@ static int starfive_pcie_probe(struct platform_device *pdev)
>>
>>  	plda->host_ops = &sf_host_ops;
>>  	plda->num_events = PLDA_MAX_EVENT_NUM;
>> +	plda->msi_quirk_delay_us = 1;
>>  	/* mask doorbell event */
>>  	plda->events_bitmap = GENMASK(PLDA_INT_EVENT_NUM - 1, 0)
>>  			     & ~BIT(PLDA_AXI_DOORBELL)
>> --
>> 2.17.1
>>
>>
>> _______________________________________________
>> linux-riscv mailing list
>> linux-riscv@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-riscv
Lorenzo Pieralisi March 5, 2024, 3:56 p.m. UTC | #3
On Mon, Mar 04, 2024 at 10:08:06AM -0800, Palmer Dabbelt wrote:
> On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
> > On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
> > > From: Kevin Xie <kevin.xie@starfivetech.com>
> > > 
> > > As the Starfive JH7110 hardware can't keep two inbound post write in
> > > order all the time, such as MSI messages and NVMe completions. If the
> > > NVMe completion update later than the MSI, an NVMe IRQ handle will miss.
> > 
> > Please explain what the problem is and what "NVMe completions" means
> > given that you are talking about posted writes.
> > 
> > If you have a link to an erratum write-up it would certainly help.
> 
> I think we really need to see that errata document.  Our formal memory model
> doesn't account for device interactions so it's possible there's just some
> arch fence we can make stronger in order to get things ordered again --
> we've had similar problems with some other RISC-V chips, and while it ends
> up being slow at least it's correct.
> 
> > This looks completely broken to me, if the controller can't guarantee
> > PCIe transactions ordering it is toast, there is not even a point
> > considering mainline merging.
> 
> I wouldn't be at all surprised if that's the case.  Without some concrete
> ISA mechanisms here we're sort of just stuck hoping the SOC vendors do the
> right thing, which is always terrifying.
> 
> I'm not really a PCIe person so this is all a bit vague, but IIRC we had a
> bunch of possible PCIe ordering violations in the SiFive memory system back
> when I was there and we never really got a scheme for making sure things
> were correct.
> 
> So I think we really do need to see that errata document to know what's
> possible here.  Folks have been able to come up with clever solutions to
> these problems before, maybe we'll get lucky again.
> 
> > > As a workaround, we will wait a while before going to the generic
> > > handle here.
> > > 
> > > Verified with NVMe SSD, USB SSD, R8169 NIC.
> > > The performance are stable and even higher after this patch.
> > 
> > I assume this is a joke even though it does not make me laugh.
> 
> So you're new to RISC-V, then?  It gets way worse than this ;)

To me this is just a PCI controller driver, arch does not matter.

What annoyed me is that we really can't state that this patch improves
performance, sorry, the patch itself is not acceptable, let's try
not to rub it in :)

Please post an erratum write-up and we shall see what can be done.

Thanks,
Lorenzo

> > Thanks,
> > Lorenzo
> > 
> > > 
> > > Signed-off-by: Kevin Xie <kevin.xie@starfivetech.com>
> > > Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> > > ---
> > >  drivers/pci/controller/plda/pcie-plda-host.c | 12 ++++++++++++
> > >  drivers/pci/controller/plda/pcie-plda.h      |  1 +
> > >  drivers/pci/controller/plda/pcie-starfive.c  |  1 +
> > >  3 files changed, 14 insertions(+)
> > > 
> > > diff --git a/drivers/pci/controller/plda/pcie-plda-host.c b/drivers/pci/controller/plda/pcie-plda-host.c
> > > index a18923d7cea6..9e077ddf45c0 100644
> > > --- a/drivers/pci/controller/plda/pcie-plda-host.c
> > > +++ b/drivers/pci/controller/plda/pcie-plda-host.c
> > > @@ -13,6 +13,7 @@
> > >  #include <linux/msi.h>
> > >  #include <linux/pci_regs.h>
> > >  #include <linux/pci-ecam.h>
> > > +#include <linux/delay.h>
> > > 
> > >  #include "pcie-plda.h"
> > > 
> > > @@ -44,6 +45,17 @@ static void plda_handle_msi(struct irq_desc *desc)
> > >  			       bridge_base_addr + ISTATUS_LOCAL);
> > >  		status = readl_relaxed(bridge_base_addr + ISTATUS_MSI);
> > >  		for_each_set_bit(bit, &status, msi->num_vectors) {
> > > +			/*
> > > +			 * As the Starfive JH7110 hardware can't keep two
> > > +			 * inbound post write in order all the time, such as
> > > +			 * MSI messages and NVMe completions.
> > > +			 * If the NVMe completion update later than the MSI,
> > > +			 * an NVMe IRQ handle will miss.
> > > +			 * As a workaround, we will wait a while before
> > > +			 * going to the generic handle here.
> > > +			 */
> > > +			if (port->msi_quirk_delay_us)
> > > +				udelay(port->msi_quirk_delay_us);
> > >  			ret = generic_handle_domain_irq(msi->dev_domain, bit);
> > >  			if (ret)
> > >  				dev_err_ratelimited(dev, "bad MSI IRQ %d\n",
> > > diff --git a/drivers/pci/controller/plda/pcie-plda.h b/drivers/pci/controller/plda/pcie-plda.h
> > > index 04e385758a2f..feccf285dfe8 100644
> > > --- a/drivers/pci/controller/plda/pcie-plda.h
> > > +++ b/drivers/pci/controller/plda/pcie-plda.h
> > > @@ -186,6 +186,7 @@ struct plda_pcie_rp {
> > >  	int msi_irq;
> > >  	int intx_irq;
> > >  	int num_events;
> > > +	u16 msi_quirk_delay_us;
> > >  };
> > > 
> > >  struct plda_event {
> > > diff --git a/drivers/pci/controller/plda/pcie-starfive.c b/drivers/pci/controller/plda/pcie-starfive.c
> > > index 9bb9f0e29565..5cfc30572b7f 100644
> > > --- a/drivers/pci/controller/plda/pcie-starfive.c
> > > +++ b/drivers/pci/controller/plda/pcie-starfive.c
> > > @@ -391,6 +391,7 @@ static int starfive_pcie_probe(struct platform_device *pdev)
> > > 
> > >  	plda->host_ops = &sf_host_ops;
> > >  	plda->num_events = PLDA_MAX_EVENT_NUM;
> > > +	plda->msi_quirk_delay_us = 1;
> > >  	/* mask doorbell event */
> > >  	plda->events_bitmap = GENMASK(PLDA_INT_EVENT_NUM - 1, 0)
> > >  			     & ~BIT(PLDA_AXI_DOORBELL)
> > > --
> > > 2.17.1
> > > 
> > > 
> > > _______________________________________________
> > > linux-riscv mailing list
> > > linux-riscv@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/linux-riscv
Kevin Xie March 14, 2024, 2:18 a.m. UTC | #4
> Re: [PATCH v15,RESEND 22/23] PCI: starfive: Offload the NVMe timeout
> workaround to host drivers.
> 
> On Mon, Mar 04, 2024 at 10:08:06AM -0800, Palmer Dabbelt wrote:
> > On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
> > > On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
> > > > From: Kevin Xie <kevin.xie@starfivetech.com>
> > > >
> > > > As the Starfive JH7110 hardware can't keep two inbound post write
> > > > in order all the time, such as MSI messages and NVMe completions.
> > > > If the NVMe completion update later than the MSI, an NVMe IRQ handle
> will miss.
> > >
> > > Please explain what the problem is and what "NVMe completions" means
> > > given that you are talking about posted writes.

Sorry, we drew too broad a conclusion here.
It is not that any two inbound posted requests can lose ordering on the JH7110 SoC;
the only case we have found is NVMe completions versus MSI interrupts.
To be more precise, it is the pending status in the nvme_completion struct as seen
by the nvme_irq handler in nvme/host/pci.c.

We have posted the original workaround patch before:
https://lore.kernel.org/lkml/CAJM55Z9HtBSyCq7rDEDFdw644pOWCKJfPqhmi3SD1x6p3g2SLQ@mail.gmail.com/
We carry it in our GitHub branch and it has worked fine for a long time.
We look forward to better advice from someone familiar with the NVMe driver.

> > >
> > > If you have a link to an erratum write-up it would certainly help.
> >

This is not a confirmed hardware issue with an existing erratum yet, and we are
doing more investigation into it these days.

In the next version we will drop this workaround patch, and then submit a separate
patch (with an erratum) for the fix after further debugging of the issue.

> > I think we really need to see that errata document.  Our formal memory
> > model doesn't account for device interactions so it's possible there's
> > just some arch fence we can make stronger in order to get things
> > ordered again -- we've had similar problems with some other RISC-V
> > chips, and while it ends up being slow at least it's correct.
> >
> > > This looks completely broken to me, if the controller can't
> > > guarantee PCIe transactions ordering it is toast, there is not even
> > > a point considering mainline merging.
> >
> > I wouldn't be at all surprised if that's the case.  Without some
> > concrete ISA mechanisms here we're sort of just stuck hoping the SOC
> > vendors do the right thing, which is always terrifying.
> >
> > I'm not really a PCIe person so this is all a bit vague, but IIRC we
> > had a bunch of possible PCIe ordering violations in the SiFive memory
> > system back when I was there and we never really got a scheme for
> > making sure things were correct.
> >
> > So I think we really do need to see that errata document to know
> > what's possible here.  Folks have been able to come up with clever
> > solutions to these problems before, maybe we'll get lucky again.
> >
> > > > As a workaround, we will wait a while before going to the generic
> > > > handle here.
> > > >
> > > > Verified with NVMe SSD, USB SSD, R8169 NIC.
> > > > The performance are stable and even higher after this patch.
> > >
> > > I assume this is a joke even though it does not make me laugh.
> >
> > So you're new to RISC-V, then?  It gets way worse than this ;)
> 
> To me this is just a PCI controller driver, arch does not matter.
> 
> What annoyed me is that we really can't state that this patch improves
> performance, sorry, the patch itself is not acceptable, let's try not to rub it in :)
> 

I'm sorry, the description here is confusing too.
The performance was compared against the version without the fix.
It is reasonable that we get stable performance when the NVMe SSD works
without any timeouts, but that doesn't mean the patch improves performance
on any other platform.

Thank you for your comments, Lorenzo & Palmer.
Kevin.

> Please post an erratum write-up and we shall see what can be done.
> 
> Thanks,
> Lorenzo
> 
> > > Thanks,
> > > Lorenzo
> > >
> > > >
> > > > Signed-off-by: Kevin Xie <kevin.xie@starfivetech.com>
> > > > Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> > > > ---
> > > >  drivers/pci/controller/plda/pcie-plda-host.c | 12 ++++++++++++
> > > >  drivers/pci/controller/plda/pcie-plda.h      |  1 +
> > > >  drivers/pci/controller/plda/pcie-starfive.c  |  1 +
> > > >  3 files changed, 14 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/controller/plda/pcie-plda-host.c
> > > > b/drivers/pci/controller/plda/pcie-plda-host.c
> > > > index a18923d7cea6..9e077ddf45c0 100644
> > > > --- a/drivers/pci/controller/plda/pcie-plda-host.c
> > > > +++ b/drivers/pci/controller/plda/pcie-plda-host.c
> > > > @@ -13,6 +13,7 @@
> > > >  #include <linux/msi.h>
> > > >  #include <linux/pci_regs.h>
> > > >  #include <linux/pci-ecam.h>
> > > > +#include <linux/delay.h>
> > > >
> > > >  #include "pcie-plda.h"
> > > >
> > > > @@ -44,6 +45,17 @@ static void plda_handle_msi(struct irq_desc *desc)
> > > >  			       bridge_base_addr + ISTATUS_LOCAL);
> > > >  		status = readl_relaxed(bridge_base_addr + ISTATUS_MSI);
> > > >  		for_each_set_bit(bit, &status, msi->num_vectors) {
> > > > +			/*
> > > > +			 * As the Starfive JH7110 hardware can't keep two
> > > > +			 * inbound post write in order all the time, such as
> > > > +			 * MSI messages and NVMe completions.
> > > > +			 * If the NVMe completion update later than the MSI,
> > > > +			 * an NVMe IRQ handle will miss.
> > > > +			 * As a workaround, we will wait a while before
> > > > +			 * going to the generic handle here.
> > > > +			 */
> > > > +			if (port->msi_quirk_delay_us)
> > > > +				udelay(port->msi_quirk_delay_us);
> > > >  			ret = generic_handle_domain_irq(msi->dev_domain, bit);
> > > >  			if (ret)
> > > >  				dev_err_ratelimited(dev, "bad MSI IRQ %d\n", diff --git
> > > > a/drivers/pci/controller/plda/pcie-plda.h
> > > > b/drivers/pci/controller/plda/pcie-plda.h
> > > > index 04e385758a2f..feccf285dfe8 100644
> > > > --- a/drivers/pci/controller/plda/pcie-plda.h
> > > > +++ b/drivers/pci/controller/plda/pcie-plda.h
> > > > @@ -186,6 +186,7 @@ struct plda_pcie_rp {
> > > >  	int msi_irq;
> > > >  	int intx_irq;
> > > >  	int num_events;
> > > > +	u16 msi_quirk_delay_us;
> > > >  };
> > > >
> > > >  struct plda_event {
> > > > diff --git a/drivers/pci/controller/plda/pcie-starfive.c
> > > > b/drivers/pci/controller/plda/pcie-starfive.c
> > > > index 9bb9f0e29565..5cfc30572b7f 100644
> > > > --- a/drivers/pci/controller/plda/pcie-starfive.c
> > > > +++ b/drivers/pci/controller/plda/pcie-starfive.c
> > > > @@ -391,6 +391,7 @@ static int starfive_pcie_probe(struct
> > > > platform_device *pdev)
> > > >
> > > >  	plda->host_ops = &sf_host_ops;
> > > >  	plda->num_events = PLDA_MAX_EVENT_NUM;
> > > > +	plda->msi_quirk_delay_us = 1;
> > > >  	/* mask doorbell event */
> > > >  	plda->events_bitmap = GENMASK(PLDA_INT_EVENT_NUM - 1, 0)
> > > >  			     & ~BIT(PLDA_AXI_DOORBELL)
> > > > --
> > > > 2.17.1
> > > >
> > > >
> > > > _______________________________________________
> > > > linux-riscv mailing list
> > > > linux-riscv@lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/linux-riscv
Keith Busch March 14, 2024, 2:51 a.m. UTC | #5
On Thu, Mar 14, 2024 at 02:18:38AM +0000, Kevin Xie wrote:
> > Re: [PATCH v15,RESEND 22/23] PCI: starfive: Offload the NVMe timeout
> > workaround to host drivers.
> > 
> > On Mon, Mar 04, 2024 at 10:08:06AM -0800, Palmer Dabbelt wrote:
> > > On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
> > > > On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
> > > > > From: Kevin Xie <kevin.xie@starfivetech.com>
> > > > >
> > > > > As the Starfive JH7110 hardware can't keep two inbound post write
> > > > > in order all the time, such as MSI messages and NVMe completions.
> > > > > If the NVMe completion update later than the MSI, an NVMe IRQ handle
> > will miss.
> > > >
> > > > Please explain what the problem is and what "NVMe completions" means
> > > > given that you are talking about posted writes.
> 
> Sorry, we made a casual conclusion here.
> Not any two of inbound post requests can`t be kept in order in JH7110 SoC, 
> the only one case we found is NVMe completions with MSI interrupts.
> To be more precise, they are the pending status in nvme_completion struct and
> nvme_irq handler in nvme/host/pci.c.
> 
> We have shown the original workaround patch before:
> https://lore.kernel.org/lkml/CAJM55Z9HtBSyCq7rDEDFdw644pOWCKJfPqhmi3SD1x6p3g2SLQ@mail.gmail.com/
> We put it in our github branch and works fine for a long time.
> Looking forward to better advices from someone familiar with NVMe drivers.

So this platform treats strictly ordered writes the same as if relaxed
ordering was enabled? I am not sure if we could reasonably work around
such behavior. An arbitrary delay is likely too long for most cases, and
too short for the worst case.

I suppose we could quirk a non-posted transaction in the interrupt
handler to force flush pending memory updates, but that will noticeably
harm your nvme performance. Maybe if you constrain such behavior to the
spurious IRQ_NONE condition, then it might be okay? I don't know.
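
For illustration only, a minimal sketch of this "non-posted read, constrained to the spurious case" idea in a hypothetical endpoint driver's MSI handler; the my_* names are placeholders and this is not code from nvme/host/pci.c:

#include <linux/interrupt.h>
#include <linux/pci.h>

struct my_dev {
	struct pci_dev *pdev;
	/* ... completion ring state ... */
};

/* Placeholder: returns true and consumes work if new completions are visible. */
bool my_dev_poll_completions(struct my_dev *md);

static irqreturn_t my_msi_handler(int irq, void *data)
{
	struct my_dev *md = data;
	u16 vendor;

	if (my_dev_poll_completions(md))
		return IRQ_HANDLED;

	/*
	 * Looks spurious so far: issue a non-posted config read.  Its
	 * completion may not pass the device's earlier posted writes
	 * (PCIe ordering), so on a controller that honours the rules the
	 * DMA'd completion entry must be visible on the re-check below.
	 */
	pci_read_config_word(md->pdev, PCI_VENDOR_ID, &vendor);

	return my_dev_poll_completions(md) ? IRQ_HANDLED : IRQ_NONE;
}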
Keith Busch March 14, 2024, 3:39 a.m. UTC | #6
On Wed, Mar 13, 2024 at 08:51:29PM -0600, Keith Busch wrote:
> I suppose we could quirk a non-posted transaction in the interrupt
> handler to force flush pending memory updates, but that will noticeably
> harm your nvme performance. Maybe if you constrain such behavior to the
> spurious IRQ_NONE condition, then it might be okay? I don't know.

Hm, that may not be good enough: if nvme completions can be reordered
with their msi's, then I assume data may reorder with their completion.
Your application will inevitably see stale and corrupted data, so it
sounds like you need some kind of barrier per completion. Ouch!
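
For reference, the usual shape of a per-completion barrier, sketched against a generic DMA completion ring rather than the NVMe CQ (the descriptor layout and the consume_cqe() helper are illustrative); as noted above, this only helps if the device's posted writes reach memory in order in the first place:

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

/* Illustrative descriptor: the device DMAs the payload first, then sets 'done'. */
struct my_cqe {
	u32 payload;
	u16 status;
	u16 done;		/* written last by the device */
};

void consume_cqe(struct my_cqe *cqe);	/* placeholder consumer */

static void my_drain_ring(struct my_cqe *ring, u32 *head, u32 size)
{
	while (READ_ONCE(ring[*head].done)) {
		/*
		 * Order the payload reads after the 'done' flag read so we
		 * never consume a descriptor whose body has not reached
		 * memory yet.
		 */
		dma_rmb();
		consume_cqe(&ring[*head]);
		*head = (*head + 1) % size;
	}
}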
Bo Gan March 20, 2024, 7:12 a.m. UTC | #7
On 3/13/24 7:51 PM, Keith Busch wrote:
> On Thu, Mar 14, 2024 at 02:18:38AM +0000, Kevin Xie wrote:
>>> Re: [PATCH v15,RESEND 22/23] PCI: starfive: Offload the NVMe timeout
>>> workaround to host drivers.
>>>
>>> On Mon, Mar 04, 2024 at 10:08:06AM -0800, Palmer Dabbelt wrote:
>>>> On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
>>>>> On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
>>>>>> From: Kevin Xie <kevin.xie@starfivetech.com>
>>>>>>
>>>>>> As the Starfive JH7110 hardware can't keep two inbound post write
>>>>>> in order all the time, such as MSI messages and NVMe completions.
>>>>>> If the NVMe completion update later than the MSI, an NVMe IRQ handle
>>> will miss.
>>>>>
>>>>> Please explain what the problem is and what "NVMe completions" means
>>>>> given that you are talking about posted writes.

Echoing Keith here. Why are you treating NVMe completions + MSI as a special case?
What's special about this combination other than two posted writes? I own JH7110
visionfive 2 boards myself, and if I'm not mistaken, there are two identical PCIe
controllers in JH7110. The first one connects the onboard USB controller of vf2,
which also enables MSI interrupts. How come this exact problem is not affecting the
USB controller? The commit message from Minda strongly suggests it does, and the
R8169 NIC as well. So why would you suggest the problem is confined to NVMe?

Bo

>>
>> Sorry, we made a casual conclusion here.
>> Not any two of inbound post requests can`t be kept in order in JH7110 SoC,
>> the only one case we found is NVMe completions with MSI interrupts.
>> To be more precise, they are the pending status in nvme_completion struct and
>> nvme_irq handler in nvme/host/pci.c.
>>
>> We have shown the original workaround patch before:
>> https://lore.kernel.org/lkml/CAJM55Z9HtBSyCq7rDEDFdw644pOWCKJfPqhmi3SD1x6p3g2SLQ@mail.gmail.com/
>> We put it in our github branch and works fine for a long time.
>> Looking forward to better advices from someone familiar with NVMe drivers.
> 
> So this platform treats strictly ordered writes the same as if relaxed
> ordering was enabled? I am not sure if we could reasonably work around
> such behavior. An arbitrary delay is likely too long for most cases, and
> too short for the worst case.
> 
> I suppose we could quirk a non-posted transaction in the interrupt
> handler to force flush pending memory updates, but that will noticeably
> harm your nvme performance. Maybe if you constrain such behavior to the
> spurious IRQ_NONE condition, then it might be okay? I don't know.
> 

I've also copied Keith's latest reply below, and I have the same doubt.

> Hm, that may not be good enough: if nvme completions can be reordered
> with their msi's, then I assume data may reorder with their completion.
> Your application will inevitably see stale and corrupted data, so it
> sounds like you need some kind of barrier per completion. Ouch!
Kevin Xie March 20, 2024, 8:42 a.m. UTC | #8
> On 3/13/24 7:51 PM, Keith Busch wrote:
> > On Thu, Mar 14, 2024 at 02:18:38AM +0000, Kevin Xie wrote:
> >>> Re: [PATCH v15,RESEND 22/23] PCI: starfive: Offload the NVMe timeout
> >>> workaround to host drivers.
> >>>
> >>> On Mon, Mar 04, 2024 at 10:08:06AM -0800, Palmer Dabbelt wrote:
> >>>> On Thu, 29 Feb 2024 07:08:43 PST (-0800), lpieralisi@kernel.org wrote:
> >>>>> On Tue, Feb 27, 2024 at 06:35:21PM +0800, Minda Chen wrote:
> >>>>>> From: Kevin Xie <kevin.xie@starfivetech.com>
> >>>>>>
> >>>>>> As the Starfive JH7110 hardware can't keep two inbound post write
> >>>>>> in order all the time, such as MSI messages and NVMe completions.
> >>>>>> If the NVMe completion update later than the MSI, an NVMe IRQ
> >>>>>> handle
> >>> will miss.
> >>>>>
> >>>>> Please explain what the problem is and what "NVMe completions"
> >>>>> means given that you are talking about posted writes.
> 
> Echoing Keith here. Why are you treating NVMe completions + MSI as a special
> case?
> What's special about this combination other than two posted writes? I own
> JH7110 visionfive 2 boards myself, and if I'm not mistaken, there are two
> identical PCIe controllers in JH7110. The first one connects the onboard USB
> controller of vf2, which also enables MSI interrupts. How come this exact
> problem not affecting the USB controller? The commit message from Minda
> strongly suggests it does, and also for R8169 NIC. Thus, why would you suggest
> the problem is confined to NVMe?
> 
> Bo
> 

Hi Bo,
Yes, we have two PCIe controllers in the JH7110 SoC, and the USB hub and NIC over
PCIe work fine whether or not we apply this patch.

Let me explain the origin of this patch in detail:
As described in the title, we first fixed the timeout issue with a workaround in the
NVMe driver, which would affect all other platforms. Thus, we wanted to offload the
workaround from the NVMe driver to our PCIe controller platform driver.
After finishing the offload patch, we wanted to check whether it harms other
PCIe devices, which is why the description mentions testing with an R8169 NIC.

The MSI and the NVMe completion are two inbound posted requests from the NVMe
device to the JH7110.
We made the mistake of generalizing them as "two inbound post write",
because we had been investigating the issue from the angle of inbound ordering.

Actually, the symptom of this problem is:
"The JH7110 has a small probability of not seeing the updated NVMe completion
pending status in the NVMe MSI handler, and it becomes visible after a 1 to 3 us delay."
Thus, it may be related to ordering, cache coherency or other reasons.
This issue is still under investigation.
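
Purely as a sketch of that observation, a bounded re-check at the endpoint-driver level (rather than a fixed delay in the controller's chained handler) could look roughly like this; ring_has_completion() is a placeholder helper, not an existing function:

#include <linux/types.h>
#include <linux/delay.h>

bool ring_has_completion(void *ring);	/* placeholder: re-reads the DMA'd status */

/* Poll for a few microseconds, matching the observed ~1-3 us window. */
static bool cqe_became_visible(void *ring)
{
	int us;

	for (us = 0; us < 4; us++) {
		if (ring_has_completion(ring))
			return true;
		udelay(1);
	}
	return ring_has_completion(ring);
}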
 
> >>
> >> Sorry, we made a casual conclusion here.
> >> Not any two of inbound post requests can`t be kept in order in JH7110
> >> SoC, the only one case we found is NVMe completions with MSI interrupts.
> >> To be more precise, they are the pending status in nvme_completion
> >> struct and nvme_irq handler in nvme/host/pci.c.
> >>
> >> We have shown the original workaround patch before:
> >>
> https://lore.kernel.org/lkml/CAJM55Z9HtBSyCq7rDEDFdw644pOWCKJfPqhmi3
> S
> >> D1x6p3g2SLQ@mail.gmail.com/ We put it in our github branch and works
> >> fine for a long time.
> >> Looking forward to better advices from someone familiar with NVMe
> drivers.
> >
> > So this platform treats strictly ordered writes the same as if relaxed
> > ordering was enabled? I am not sure if we could reasonably work around
> > such behavior. An arbitrary delay is likely too long for most cases,
> > and too short for the worst case.
> >
> > I suppose we could quirk a non-posted transaction in the interrupt
> > handler to force flush pending memory updates, but that will
> > noticeably harm your nvme performance. Maybe if you constrain such
> > behavior to the spurious IRQ_NONE condition, then it might be okay? I don't
> know.
> >
> 
> Also copied Keith's latest reply below, and I also have the same doubt.
> 

Hi Keith, sorry for the late reply.
We have tried adding a dummy non-posted request (a config read) in the handler,
but it doesn't help.
We also tried adding an mb() before checking the NVMe completion,
and that doesn't help either.

> > Hm, that may not be good enough: if nvme completions can be reordered
> > with their msi's, then I assume data may reorder with their completion.
> > Your application will inevitably see stale and corrupted data, so it
> > sounds like you need some kind of barrier per completion. Ouch!

If we do not apply the patch, we might get timeout warnings and waste some
time, but the problem seems less serious than you described.
After applying the workaround, we can use the NVMe SSD normally for tasks
such as boot-up, hibernation and saving data.

Patch

diff --git a/drivers/pci/controller/plda/pcie-plda-host.c b/drivers/pci/controller/plda/pcie-plda-host.c
index a18923d7cea6..9e077ddf45c0 100644
--- a/drivers/pci/controller/plda/pcie-plda-host.c
+++ b/drivers/pci/controller/plda/pcie-plda-host.c
@@ -13,6 +13,7 @@ 
 #include <linux/msi.h>
 #include <linux/pci_regs.h>
 #include <linux/pci-ecam.h>
+#include <linux/delay.h>
 
 #include "pcie-plda.h"
 
@@ -44,6 +45,17 @@  static void plda_handle_msi(struct irq_desc *desc)
 			       bridge_base_addr + ISTATUS_LOCAL);
 		status = readl_relaxed(bridge_base_addr + ISTATUS_MSI);
 		for_each_set_bit(bit, &status, msi->num_vectors) {
+			/*
+			 * The StarFive JH7110 hardware can't always keep
+			 * two inbound posted writes, such as an MSI message
+			 * and an NVMe completion, in order.
+			 * If the NVMe completion is written later than the
+			 * MSI arrives, the NVMe IRQ handler misses it.
+			 * As a workaround, wait a short while before
+			 * calling the generic IRQ handler.
+			 */
+			if (port->msi_quirk_delay_us)
+				udelay(port->msi_quirk_delay_us);
 			ret = generic_handle_domain_irq(msi->dev_domain, bit);
 			if (ret)
 				dev_err_ratelimited(dev, "bad MSI IRQ %d\n",
diff --git a/drivers/pci/controller/plda/pcie-plda.h b/drivers/pci/controller/plda/pcie-plda.h
index 04e385758a2f..feccf285dfe8 100644
--- a/drivers/pci/controller/plda/pcie-plda.h
+++ b/drivers/pci/controller/plda/pcie-plda.h
@@ -186,6 +186,7 @@  struct plda_pcie_rp {
 	int msi_irq;
 	int intx_irq;
 	int num_events;
+	u16 msi_quirk_delay_us;
 };
 
 struct plda_event {
diff --git a/drivers/pci/controller/plda/pcie-starfive.c b/drivers/pci/controller/plda/pcie-starfive.c
index 9bb9f0e29565..5cfc30572b7f 100644
--- a/drivers/pci/controller/plda/pcie-starfive.c
+++ b/drivers/pci/controller/plda/pcie-starfive.c
@@ -391,6 +391,7 @@  static int starfive_pcie_probe(struct platform_device *pdev)
 
 	plda->host_ops = &sf_host_ops;
 	plda->num_events = PLDA_MAX_EVENT_NUM;
+	plda->msi_quirk_delay_us = 1;
 	/* mask doorbell event */
 	plda->events_bitmap = GENMASK(PLDA_INT_EVENT_NUM - 1, 0)
 			     & ~BIT(PLDA_AXI_DOORBELL)