Message ID | CAMaF-rOG9gNf3g8rOXiKMq3TXrfJf5dFwN6q6uQqvMruUm4VQg@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
On Thursday, 01 September 2011 at 16:50 -0500, Jon Mason wrote:

> I believe modifying the MRRS values is what is causing the issues.
> Can you try the attached patch and verify that it also resolves the
> issue?
>

It's midnight here, I'll try this in ~7 hours

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 01, 2011 at 04:50:45PM -0500, Jon Mason wrote:
> On Thu, Sep 1, 2011 at 3:44 PM, <scameron@beardog.cce.hp.com> wrote:
> > On Thu, Sep 01, 2011 at 01:09:30PM -0700, Jesse Barnes wrote:
> >> On Thu, 1 Sep 2011 15:03:49 -0500
> >> scameron@beardog.cce.hp.com wrote:
> >>
> >> > On Thu, Sep 01, 2011 at 12:59:38PM -0700, Jesse Barnes wrote:
> >> > > On Thu, 01 Sep 2011 11:50:38 -0700
> >> > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> >> > >
> >> > > > On Thu, 2011-09-01 at 10:58 -0700, Roland Dreier wrote:
> >> > > > > > OK I found the bad commit, I got lucky... I lost some files but my
> >> > > > > > machine was able to complete the bisection. CC involved people
> >> > > > > >
> >> > > > > > # bad: [b03e7495a862b028294f59fc87286d6d78ee7fa1] PCI: Set PCI-E Max Payload Size on fabric
> >> > > > > >
> >> > > > > Hi Eric,
> >> > > > >
> >> > > > > I guess it would be useful to see "lspci -vv" output with a "good" kernel
> >> > > > > and with that bad patch applied. Most likely we should see some difference
> >> > > > > somewhere in the MaxPayload fields in the PCI Express capability of
> >> > > > > some device.
> >> > > > >
> >> > > > > Either the RAID controller or something else lies, and puts a value
> >> > > > > in the DevCap that it can't actually support, or else the patch is
> >> > > > > buggy and puts something out of range in a DevCtl somewhere.
> >> > > >
> >> > > >
> >> > > > While we investigate, I think the problems produced by the patch (data
> >> > > > corruption) are serious enough to warrant reverting it, please Jesse.
> >> > >
> >> > > Hm I haven't been paying attention to the compromise thread; how should
> >> > > I share these changes? Is master.kernel.org down indefinitely? Is
> >> > > there a new server at kernel.org I can use?
> >> >
> >> > I can't answer that question, but I would like a copy of your revert
> >> > patch(es) to test (as a simple patch --reverse of the original commit on the 3.1-rc4
> >> > tree didn't go in cleanly).
> >>
> >> Attached is the series. Applies on top of my for-linus branch.
> >
> > Thanks. I tried them out vs. 3.1-rc4, and they applied cleanly and
> > make things work on my BL460g7.
>
> I believe modifying the MRRS values is what is causing the issues.
> Can you try the attached patch and verify that it also resolves the
> issue?

Ok, just tried it.

The mrrs_removal patch does also appear to resolve the issue.

Thanks.

-- steve
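Roland's suggestion in the quoted text is to compare the MaxPayload fields that `lspci -vv` prints for DevCap and DevCtl. As a rough sketch of what those fields encode, the following decodes them from raw register values. The bit positions follow the PCI Express capability layout (they correspond to `PCI_EXP_DEVCAP_PAYLOAD`, `PCI_EXP_DEVCTL_PAYLOAD` and `PCI_EXP_DEVCTL_READRQ` in Linux's `pci_regs.h`); the function names and the final check are hypothetical helpers written for this illustration, not code from the thread or the kernel.

```c
#include <stdint.h>

/* Field masks within the PCI Express Device Capabilities and
 * Device Control registers. */
#define DEVCAP_MPSS_MASK 0x0007u /* bits 2:0, Max_Payload_Size Supported */
#define DEVCTL_MPS_MASK  0x00e0u /* bits 7:5, Max_Payload_Size */
#define DEVCTL_MRRS_MASK 0x7000u /* bits 14:12, Max_Read_Request_Size */

/* Each field is a 3-bit code; the size in bytes is 128 << code. */
unsigned devcap_mpss_bytes(uint32_t devcap)
{
    return 128u << (devcap & DEVCAP_MPSS_MASK);
}

unsigned devctl_mps_bytes(uint16_t devctl)
{
    return 128u << ((devctl & DEVCTL_MPS_MASK) >> 5);
}

unsigned devctl_mrrs_bytes(uint16_t devctl)
{
    return 128u << ((devctl & DEVCTL_MRRS_MASK) >> 12);
}

/* Hypothetical check: nonzero if the MPS programmed in DevCtl is larger
 * than the MPSS the device advertises in DevCap. */
int mps_exceeds_capability(uint32_t devcap, uint16_t devctl)
{
    return devctl_mps_bytes(devctl) > devcap_mpss_bytes(devcap);
}
```

For example, a DevCtl value of `0x2810` decodes to an MPS of 128 bytes and an MRRS of 512 bytes. A check like `mps_exceeds_capability()` would only catch the "out of range in a DevCtl" case Roland mentions; it cannot detect a device whose DevCap advertises more than it really supports.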
On Thursday, 01 September 2011 at 17:16 -0500, scameron@beardog.cce.hp.com wrote:
> On Thu, Sep 01, 2011 at 04:50:45PM -0500, Jon Mason wrote:
> > On Thu, Sep 1, 2011 at 3:44 PM, <scameron@beardog.cce.hp.com> wrote:
> > > On Thu, Sep 01, 2011 at 01:09:30PM -0700, Jesse Barnes wrote:
> > >> On Thu, 1 Sep 2011 15:03:49 -0500
> > >> scameron@beardog.cce.hp.com wrote:
> > >>
> > >> > On Thu, Sep 01, 2011 at 12:59:38PM -0700, Jesse Barnes wrote:
> > >> > > On Thu, 01 Sep 2011 11:50:38 -0700
> > >> > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > >> > >
> > >> > > > On Thu, 2011-09-01 at 10:58 -0700, Roland Dreier wrote:
> > >> > > > > > OK I found the bad commit, I got lucky... I lost some files but my
> > >> > > > > > machine was able to complete the bisection. CC involved people
> > >> > > > > >
> > >> > > > > > # bad: [b03e7495a862b028294f59fc87286d6d78ee7fa1] PCI: Set PCI-E Max Payload Size on fabric
> > >> > > > > >
> > >> > > > > Hi Eric,
> > >> > > > >
> > >> > > > > I guess it would be useful to see "lspci -vv" output with a "good" kernel
> > >> > > > > and with that bad patch applied. Most likely we should see some difference
> > >> > > > > somewhere in the MaxPayload fields in the PCI Express capability of
> > >> > > > > some device.
> > >> > > > >
> > >> > > > > Either the RAID controller or something else lies, and puts a value
> > >> > > > > in the DevCap that it can't actually support, or else the patch is
> > >> > > > > buggy and puts something out of range in a DevCtl somewhere.
> > >> > > >
> > >> > > >
> > >> > > > While we investigate, I think the problems produced by the patch (data
> > >> > > > corruption) are serious enough to warrant reverting it, please Jesse.
> > >> > >
> > >> > > Hm I haven't been paying attention to the compromise thread; how should
> > >> > > I share these changes? Is master.kernel.org down indefinitely? Is
> > >> > > there a new server at kernel.org I can use?
> > >> >
> > >> > I can't answer that question, but I would like a copy of your revert
> > >> > patch(es) to test (as a simple patch --reverse of the original commit on the 3.1-rc4
> > >> > tree didn't go in cleanly).
> > >>
> > >> Attached is the series. Applies on top of my for-linus branch.
> > >
> > > Thanks. I tried them out vs. 3.1-rc4, and they applied cleanly and
> > > make things work on my BL460g7.
> >
> > I believe modifying the MRRS values is what is causing the issues.
> > Can you try the attached patch and verify that it also resolves the
> > issue?
>
> Ok, just tried it.
>
> The mrrs_removal patch does also appear to resolve the issue.
>

I cannot say that right now, as it appears the last "bad" kernel
destroyed my distro enough that I cannot test this patch without a full
reinstall. (/root partition is busted, even after several fsck -f -y)

[ 42.501569] EXT3-fs error (device cciss/c0d0p1): ext3_free_inode: bit already cleared for inode 424649
[ 42.501721] Aborting journal on device cciss/c0d0p1.
[ 42.516101] Remounting filesystem read-only
[ 42.529563] EXT3-fs error (device cciss/c0d0p1) in ext3_delete_inode: IO failure

I'll have to do this reinstall when I am at the office, in a couple of
hours.
On Thursday, 01 September 2011 at 16:50 -0500, Jon Mason wrote:

> I believe modifying the MRRS values is what is causing the issues.
> Can you try the attached patch and verify that it also resolves the
> issue?

I tested this patch and can confirm this solves the corruption problem.

But my disk is _much_ slower than before

    # hdparm -t /dev/sda1

Before:

    Timing buffered disk reads: 254 MB in 3.02 seconds = 84.16 MB/sec

After:

    Timing buffered disk reads: 120 MB in 3.04 seconds = 39.42 MB/sec
On Friday, 02 September 2011 at 11:39 +0200, Eric Dumazet wrote:
> On Thursday, 01 September 2011 at 16:50 -0500, Jon Mason wrote:
>
> > I believe modifying the MRRS values is what is causing the issues.
> > Can you try the attached patch and verify that it also resolves the
> > issue?
>
> I tested this patch and can confirm this solves the corruption problem.
>
> But my disk is _much_ slower than before
>
> # hdparm -t /dev/sda1
>
> Before :
>
> Timing buffered disk reads: 254 MB in 3.02 seconds = 84.16 MB/sec
>
> After :
>
> Timing buffered disk reads: 120 MB in 3.04 seconds = 39.42 MB/sec

Hmm, this speed regression is probably old: the 84 MB/s was with the
standard Debian 6.0.2 kernel (2.6.32-5-amd64)
On Fri, Sep 02, 2011 at 12:08:53PM +0200, Eric Dumazet wrote:
> On Friday, 02 September 2011 at 11:39 +0200, Eric Dumazet wrote:
> > On Thursday, 01 September 2011 at 16:50 -0500, Jon Mason wrote:
> >
> > > I believe modifying the MRRS values is what is causing the issues.
> > > Can you try the attached patch and verify that it also resolves the
> > > issue?
> >
> > I tested this patch and can confirm this solves the corruption problem.
> >
> > But my disk is _much_ slower than before
> >
> > # hdparm -t /dev/sda1
> >
> > Before :
> >
> > Timing buffered disk reads: 254 MB in 3.02 seconds = 84.16 MB/sec
> >
> > After :
> >
> > Timing buffered disk reads: 120 MB in 3.04 seconds = 39.42 MB/sec
>
> Hmm, this speed regression is probably old : the 84MB/s was with the
> standard debian 6.0.2 kernel (2.6.32-5-amd64)
>

This regression might be due to these two patches:

d0be5ec8693944c2e2fc0de70fda9dbc1b93bd7d
[SCSI] hpsa: do readl after writel in main i/o path to ensure commands don't get lost.
Apparently we've been doin it rong for a decade, but
only lately do we run into problems.

and

fec62c368b9c8b05d5124ca6c3b8336b537f26f3
[SCSI] hpsa: do not attempt to read from a write-only register
Most smartarrays tolerate it, but a few new ones don't.
Without this change some newer Smart Arrays will lock up
and i/o will grind to a halt.

with the second patch being a correction to the first.

It seems like the readl after the writel should not be needed, and
wasn't needed for a very long time, but there is some very hard to
trigger and not yet well understood problem in which very occasionally
a command would get lost: the driver thinks a command is out, but
controller firmware thinks all commands are completed, a circumstance
which tends to make things grind to a halt. Those two patches avoid
that problem.

-- steve
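The message above says the root cause of the lost commands is not well understood, so the following is not an explanation of the hpsa problem. It is only a toy model of one general property of PCI that makes a read after a write meaningful: a write to a device register may be posted (buffered on the way to the device), and a later read from the same device cannot complete until earlier posted writes have been delivered. The structure, the function names and the one-slot buffer are all assumptions made for this sketch; none of it is driver code.

```c
/* Toy controller with a one-slot posted-write buffer (hypothetical model). */
struct fake_ctrl {
    unsigned posted_cmd;   /* command still sitting in the write buffer */
    int      posted_valid; /* 1 while posted_cmd has not reached the device */
    unsigned seen;         /* number of commands the device has received */
};

/* Posted write: returns at once, the device has not necessarily seen it. */
void fake_writel(struct fake_ctrl *c, unsigned cmd)
{
    if (c->posted_valid)
        c->seen++;         /* an older posted write drains first, in order */
    c->posted_cmd = cmd;
    c->posted_valid = 1;
}

/* Read: completes only after earlier posted writes are delivered. */
unsigned fake_readl(struct fake_ctrl *c)
{
    if (c->posted_valid) {
        c->seen++;
        c->posted_valid = 0;
    }
    return c->seen;
}

/* Submit a command, optionally reading back to force delivery. */
void submit(struct fake_ctrl *c, unsigned cmd, int read_back)
{
    fake_writel(c, cmd);
    if (read_back)
        (void)fake_readl(c);
}
```

In this model, `submit(&c, 1, 0)` returns with the command still buffered, while `submit(&c, 1, 1)` returns only after the device has received it. The extra read is a full round trip to the device on every submission, which is consistent with Steve's suggestion that the two patches might cost some throughput.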
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..d896c5e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1394,41 +1394,6 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
 		dev_err(&dev->dev, "Failed attempting to set the MPS\n");
 }
 
-static void pcie_write_mrrs(struct pci_dev *dev, int mps)
-{
-	int rc, mrrs;
-
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
-
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value. However, it cannot be configured larger
-		 * than the MPS the device or the bus can support. This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
-
-
-	/* MRRS is a R/W register. Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
-	 * If the MRRS value provided is not acceptable (e.g., too large),
-	 * shrink the value until it is acceptable to the HW.
-	 */
-	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
-		rc = pcie_set_readrq(dev, mrrs);
-		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
-
-		mrrs /= 2;
-	}
-}
-
 static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 {
 	int mps = 128 << *(u8 *)data;
@@ -1440,7 +1405,6 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	pcie_write_mps(dev, mps);
-	pcie_write_mrrs(dev, mps);
 
 	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
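To make the removed `pcie_write_mrrs()` easier to reason about, here is a small user-space sketch that copies its selection rule and its write-then-halve loop onto a toy device. Everything about the device is an assumption for illustration: `struct fake_dev`, `fake_get_readrq()`, `fake_set_readrq()` and the `mrrs_limit` field are hypothetical stand-ins for the hardware and for `pcie_get_readrq()`/`pcie_set_readrq()`, and the enum stands in for `pcie_bus_config`.

```c
/* Hypothetical stand-in for the kernel's pcie_bus_config modes. */
enum bus_config { BUS_SAFE, BUS_PERFORMANCE };

/* Toy device model (an assumption, not kernel code). */
struct fake_dev {
    int mpss_bytes; /* largest payload size the device supports */
    int mrrs_bytes; /* MRRS currently programmed in the device */
    int mrrs_limit; /* largest MRRS the toy device will accept */
};

int fake_get_readrq(const struct fake_dev *d)
{
    return d->mrrs_bytes;
}

/* Values above the limit are ignored, so a read-back reveals rejection. */
int fake_set_readrq(struct fake_dev *d, int rq)
{
    if (rq <= d->mrrs_limit)
        d->mrrs_bytes = rq;
    return 0;
}

/* Same selection rule as the removed function. */
int choose_mrrs(enum bus_config cfg, int mps, const struct fake_dev *d)
{
    if (cfg == BUS_PERFORMANCE)
        return mps < d->mpss_bytes ? mps : d->mpss_bytes;
    return mps; /* "safe" case: every device gets the same size */
}

/* Same loop shape as the removed function: write, halve, compare again. */
int write_mrrs(struct fake_dev *d, enum bus_config cfg, int mps)
{
    int mrrs = choose_mrrs(cfg, mps, d);

    while (mrrs != fake_get_readrq(d) && mrrs >= 128) {
        fake_set_readrq(d, mrrs);
        mrrs /= 2;
    }
    return fake_get_readrq(d);
}
```

In this toy model, a device that already reports the chosen value is left alone. Otherwise the loop keeps going after a successful write, because the halved `mrrs` no longer equals what the device reports, so a device starting at 512 bytes with a target of 256 ends up at 128. That is an observation about the model and the loop's control flow only; the thread does not establish which aspect of the MRRS changes caused the reported corruption.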