Message ID | 20180130084121.18653-1-sr@denx.de (mailing list archive)
---|---
State | New, archived
Delegated to | Bjorn Helgaas
On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> Hotplugging of some PCIe devices on our platform sometimes leads to a
> bounce of link-up and link-down events, resulting in problems in the
> corresponding PCI drivers.
>
> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
> ...
> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up

It would be good to find out why this happens in the first place.
Perhaps there is some environmental interference or something causing
this?

> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> pci 0000:02:00.0: reg 0x10: [io 0x8000-0x8007]
> ...
> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> ahci 0000:02:00.0: PME# disabled
> ata3: SATA link down (SStatus 0 SControl 300)
> ata5: SATA link down (SStatus 0 SControl 300)
> ata4: SATA link down (SStatus 0 SControl 300)
> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130

I think the AHCI driver should be fixed to cope with this.

> ata6: SATA link down (SStatus 0 SControl 300)
> Modules linked in:
> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
> Workqueue: pciehp-1 pciehp_power_thread
> ...
>
> This patch now adds the 'pciehp_debounce_time' module parameter, which
> can be used to drop all events for the specified time (in milliseconds)
> after a link-up event occurred. A value of ~100ms works fine in my tests
> to debounce all the link-up / link-down events.

This sounds a bit "hackish". I would rather make sure we can handle
situations like this properly without passing additional parameters.
Hi Mika,

sorry for the late reply.

On 30.01.2018 11:28, Mika Westerberg wrote:
> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>> bounce of link-up and link-down events, resulting in problems in the
>> corresponding PCI drivers.
>>
>> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
>> ...
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>
> It would be good to find out why this happens in the first place.
> Perhaps there is some environmental interference or something causing
> this?

I'm seeing these link bounces in the following environments:

a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
   AHCI Controller (Marvell chip)
b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
   via PCIe to a PCIe switch

In both cases, this link bouncing happens infrequently, approx. once out
of 5 - 10 tries.

Out of curiosity, has nobody else ever experienced such "link bouncing"
with PCIe cards / devices getting hot-plugged?

>> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
>> pci 0000:02:00.0: reg 0x10: [io 0x8000-0x8007]
>> ...
>> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
>> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
>> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
>> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>> ahci 0000:02:00.0: PME# disabled
>> ata3: SATA link down (SStatus 0 SControl 300)
>> ata5: SATA link down (SStatus 0 SControl 300)
>> ata4: SATA link down (SStatus 0 SControl 300)
>> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
>
> I think the AHCI driver should be fixed to cope with this.

Yes, this can be discussed. But still the root cause should be fixed,
IMHO. Either in our environment (HW issue?) or by adding this
de-bouncing feature.

>> ata6: SATA link down (SStatus 0 SControl 300)
>> Modules linked in:
>> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
>> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
>> Workqueue: pciehp-1 pciehp_power_thread
>> ...
>>
>> This patch now adds the 'pciehp_debounce_time' module parameter, which
>> can be used to drop all events for the specified time (in milliseconds)
>> after a link-up event occurred. A value of ~100ms works fine in my tests
>> to debounce all the link-up / link-down events.
>
> This sounds a bit "hackish". I would rather make sure we can handle
> situations like this properly without passing additional parameters.

I'm open to other / better ideas on how to solve this situation we
are seeing on our systems.

Thanks,
Stefan
On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> > On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >> Hotplugging of some PCIe devices on our platform sometimes leads to a
> >> bounce of link-up and link-down events, resulting in problems in the
> >> corresponding PCI drivers.
> >>
> >> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
> >> ...
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>
> I'm open to other / better ideas on how to solve this situation we
> are seeing on our systems.

If a Link Up event is received and there is already a Link Up / Link Down
pair in the queue, the Link Down event can be dequeued and the newly
received Link Up event need not be queued.

Same if a Link Down event is received and there is already a Link Down /
Link Up pair in the queue.

Thanks,
Lukas
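[Editor's sketch: a tiny standalone C illustration of the coalescing rule Lukas describes, applied to the Up/Down/Up bounce from the log above. The queue layout, names and output are invented for this sketch and are not pciehp code.]

/*
 * Standalone illustration (not pciehp code) of the coalescing rule:
 * if a Link Up arrives while the queue already ends with a
 * Link Up / Link Down pair, drop the queued Link Down and do not
 * queue the new Link Up -- and the same with Up/Down swapped.
 */
#include <stdio.h>

enum hp_event { LINK_UP, LINK_DOWN };

#define QMAX 16
static enum hp_event queue[QMAX];
static int qlen;

static void queue_link_event(enum hp_event ev)
{
	/* Do the last two queued events form a bounce that the new event cancels? */
	if (qlen >= 2 && queue[qlen - 2] == ev && queue[qlen - 1] != ev) {
		qlen--;		/* dequeue the bounced opposite event */
		return;		/* and skip queueing the new event */
	}
	if (qlen < QMAX)
		queue[qlen++] = ev;
}

int main(void)
{
	/* The bouncing sequence from the log: Up, Down, Up */
	queue_link_event(LINK_UP);
	queue_link_event(LINK_DOWN);
	queue_link_event(LINK_UP);

	/* Only a single Link Up is left to be processed */
	printf("events queued: %d (first is %s)\n", qlen,
	       queue[0] == LINK_UP ? "Link Up" : "Link Down");
	return 0;
}

The check is symmetric, so a Down/Up/Down bounce collapses the same way, leaving only the first Link Down queued.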
On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> Hi Mika,
>
> sorry for the late reply.
>
> On 30.01.2018 11:28, Mika Westerberg wrote:
> > On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >> Hotplugging of some PCIe devices on our platform sometimes leads to a
> >> bounce of link-up and link-down events, resulting in problems in the
> >> corresponding PCI drivers.
> >>
> >> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
> >> ...
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >
> > It would be good to find out why this happens in the first place.
> > Perhaps there is some environmental interference or something causing
> > this?
>
> I'm seeing these link bounces in the following environments:
>
> a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
>    AHCI Controller (Marvell chip)
> b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
>    via PCIe to a PCIe switch
>
> In both cases, this link bouncing happens infrequently, approx. once out
> of 5 - 10 tries.
>
> Out of curiosity, has nobody else ever experienced such "link bouncing"
> with PCIe cards / devices getting hot-plugged?

I've seen it with some Thunderbolt devices from time to time.

I think it is entirely possible in the real world that the link goes
down briefly, for example because of some external interference, so we
should make sure we can handle that properly.

> >> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> >> pci 0000:02:00.0: reg 0x10: [io 0x8000-0x8007]
> >> ...
> >> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
> >> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
> >> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
> >> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> ahci 0000:02:00.0: PME# disabled
> >> ata3: SATA link down (SStatus 0 SControl 300)
> >> ata5: SATA link down (SStatus 0 SControl 300)
> >> ata4: SATA link down (SStatus 0 SControl 300)
> >> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
> >
> > I think the AHCI driver should be fixed to cope with this.
>
> Yes, this can be discussed. But still the root cause should be fixed,
> IMHO. Either in our environment (HW issue?) or by adding this
> de-bouncing feature.
>
> >> ata6: SATA link down (SStatus 0 SControl 300)
> >> Modules linked in:
> >> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
> >> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
> >> Workqueue: pciehp-1 pciehp_power_thread
> >> ...
> >>
> >> This patch now adds the 'pciehp_debounce_time' module parameter, which
> >> can be used to drop all events for the specified time (in milliseconds)
> >> after a link-up event occurred. A value of ~100ms works fine in my tests
> >> to debounce all the link-up / link-down events.
> >
> > This sounds a bit "hackish". I would rather make sure we can handle
> > situations like this properly without passing additional parameters.
>
> I'm open to other / better ideas on how to solve this situation we
> are seeing on our systems.

Well, I would start by fixing drivers that can't cope with surprise link
down (e.g. a disappearing PCI device, or suddenly reading 0xffffffff
from a register).

BTW, have you checked whether presence detect actually toggles similarly
or is it only triggered when the link is fully up? Currently we
prioritize link up/down higher than presence detect, but it may be that
we should do the opposite.
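[Editor's sketch: a minimal example of the surprise-removal defence Mika alludes to. The helper name is made up for this illustration, but an MMIO read of a removed device returning all ones, and pci_device_is_present(), are existing kernel behaviour/APIs.]

/*
 * Minimal sketch (not from any in-tree driver) of how an MMIO-based
 * driver can defend against surprise link down: an MMIO read of a
 * removed device returns all ones, so treat 0xffffffff as "device
 * probably gone" and confirm via config space before giving up.
 */
#include <linux/pci.h>
#include <linux/io.h>

static bool my_dev_vanished(struct pci_dev *pdev, void __iomem *mmio)
{
	u32 val = readl(mmio);		/* any device register works for this check */

	if (val != 0xffffffff)
		return false;		/* device is still responding */

	/*
	 * 0xffffffff can be a legal register value; double-check by
	 * reading the vendor ID through config space.
	 */
	return !pci_device_is_present(pdev);
}

A driver such as ahci/libata would hook this kind of check into its error and detach paths; the sketch only shows the detection idea.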
On 02.02.2018 14:47, Lukas Wunner wrote:
> On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
>>> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>>>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>>>> bounce of link-up and link-down events, resulting in problems in the
>>>> corresponding PCI drivers.
>>>>
>>>> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
>>>> ...
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>
>> I'm open to other / better ideas on how to solve this situation we
>> are seeing on our systems.
>
> If a Link Up event is received and there is already a Link Up / Link Down
> pair in the queue, the Link Down event can be dequeued and the newly
> received Link Up event need not be queued.
>
> Same if a Link Down event is received and there is already a Link Down /
> Link Up pair in the queue.

Makes sense. But I'm more often seeing this sequence here while
hot-plugging the PCIe card:

[ 41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[ 41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
[ 41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
[ 41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[ 41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
...

So a link-down follows the link-up directly (~30ms here). Sometimes a
double link-up is also seen, but this one is more frequent in my test
cases.

Thanks,
Stefan
On 02.02.2018 14:56, Mika Westerberg wrote:
> On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
>> Hi Mika,
>>
>> sorry for the late reply.
>>
>> On 30.01.2018 11:28, Mika Westerberg wrote:
>>> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>>>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>>>> bounce of link-up and link-down events, resulting in problems in the
>>>> corresponding PCI drivers.
>>>>
>>>> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
>>>> ...
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>
>>> It would be good to find out why this happens in the first place.
>>> Perhaps there is some environmental interference or something causing
>>> this?
>>
>> I'm seeing these link bounces in the following environments:
>>
>> a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
>>    AHCI Controller (Marvell chip)
>> b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
>>    via PCIe to a PCIe switch
>>
>> In both cases, this link bouncing happens infrequently, approx. once out
>> of 5 - 10 tries.
>>
>> Out of curiosity, has nobody else ever experienced such "link bouncing"
>> with PCIe cards / devices getting hot-plugged?
>
> I've seen it with some Thunderbolt devices from time to time.
>
> I think it is entirely possible in the real world that the link goes
> down briefly, for example because of some external interference, so we
> should make sure we can handle that properly.
>
>>>> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
>>>> pci 0000:02:00.0: reg 0x10: [io 0x8000-0x8007]
>>>> ...
>>>> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
>>>> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
>>>> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
>>>> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> ahci 0000:02:00.0: PME# disabled
>>>> ata3: SATA link down (SStatus 0 SControl 300)
>>>> ata5: SATA link down (SStatus 0 SControl 300)
>>>> ata4: SATA link down (SStatus 0 SControl 300)
>>>> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
>>>
>>> I think the AHCI driver should be fixed to cope with this.
>>
>> Yes, this can be discussed. But still the root cause should be fixed,
>> IMHO. Either in our environment (HW issue?) or by adding this
>> de-bouncing feature.
>>
>>>> ata6: SATA link down (SStatus 0 SControl 300)
>>>> Modules linked in:
>>>> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
>>>> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
>>>> Workqueue: pciehp-1 pciehp_power_thread
>>>> ...
>>>>
>>>> This patch now adds the 'pciehp_debounce_time' module parameter, which
>>>> can be used to drop all events for the specified time (in milliseconds)
>>>> after a link-up event occurred. A value of ~100ms works fine in my tests
>>>> to debounce all the link-up / link-down events.
>>>
>>> This sounds a bit "hackish". I would rather make sure we can handle
>>> situations like this properly without passing additional parameters.
>>
>> I'm open to other / better ideas on how to solve this situation we
>> are seeing on our systems.
>
> Well, I would start by fixing drivers that can't cope with surprise link
> down (e.g. a disappearing PCI device, or suddenly reading 0xffffffff
> from a register).

I've already sent a patch regarding a libata problem while unplugging
an AHCI controller:

https://www.spinics.net/lists/linux-ide/msg55038.html

> BTW, have you checked whether presence detect actually toggles similarly
> or is it only triggered when the link is fully up? Currently we
> prioritize link up/down higher than presence detect, but it may be that
> we should do the opposite.

As seen in the log sent in my previous mail, presence detect also
toggles. Here again:

[ 41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[ 41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
[ 41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
[ 41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[ 41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[ 41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
...

Though I'm wondering why we see "Card present" three times but "Card
not present" only once. I would expect to see two "Card present"
messages here. This does not seem to be balanced correctly.

Thanks,
Stefan
On Fri, Feb 02, 2018 at 03:50:55PM +0100, Stefan Roese wrote:
> I've already sent a patch regarding a libata problem while unplugging
> an AHCI controller:
>
> https://www.spinics.net/lists/linux-ide/msg55038.html

Great :)

> > BTW, have you checked whether presence detect actually toggles similarly
> > or is it only triggered when the link is fully up? Currently we
> > prioritize link up/down higher than presence detect, but it may be that
> > we should do the opposite.
>
> As seen in the log sent in my previous mail, presence detect also
> toggles. Here again:
>
> [ 41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> [ 41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
> [ 41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> ...

Indeed, it seems to follow link status changes closely. So changing the
"priority" here would not help.
On Fri, Feb 02, 2018 at 03:44:21PM +0100, Stefan Roese wrote:
> On 02.02.2018 14:47, Lukas Wunner wrote:
> > On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> >>> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >>>> Hotplugging of some PCIe devices on our platform sometimes leads to a
> >>>> bounce of link-up and link-down events, resulting in problems in the
> >>>> corresponding PCI drivers.
> >>>>
> >>>> Here is an example of such a hotplug event bounce for an AHCI PCIe card:
> >>>> ...
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>
> >> I'm open to other / better ideas on how to solve this situation we
> >> are seeing on our systems.

This is definitely a real problem that should be fixed somehow. But I
don't like the idea of a new module parameter because it's not very
user-friendly. It would be very difficult for a user to identify the
problem, discover the parameter, and figure out what debounce time to
use.

> > If a Link Up event is received and there is already a Link Up / Link Down
> > pair in the queue, the Link Down event can be dequeued and the newly
> > received Link Up event need not be queued.
> >
> > Same if a Link Down event is received and there is already a Link Down /
> > Link Up pair in the queue.
>
> Makes sense. But I'm more often seeing this sequence here while
> hot-plugging the PCIe card:
>
> [ 41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> [ 41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
> [ 41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> ...
>
> So a link-down follows the link-up directly (~30ms here). Sometimes a
> double link-up is also seen, but this one is more frequent in my test
> cases.

Unfortunately I don't have any easy ideas to offer. I do think the
pciehp interrupt handling is baroque and I suspect that if we could
simplify and rationalize it, some of these issues would take care of
themselves.

Bjorn
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 06109d40c4ac..a9ff87150e82 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -43,6 +43,7 @@
 extern bool pciehp_poll_mode;
 extern int pciehp_poll_time;
 extern bool pciehp_debug;
+extern int pciehp_debounce_time;
 
 #define dbg(format, arg...) \
 	do { \
@@ -78,6 +79,8 @@ struct slot {
 	struct mutex lock;
 	struct mutex hotplug_lock;
 	struct workqueue_struct *wq;
+	unsigned long linkup_start;	/* jiffies */
+	int linkup_debounce_active;	/* linkup-debounce is active */
 };
 
 struct event_info {
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 35d84845d5af..5a97f2550cba 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -45,6 +45,7 @@ bool pciehp_debug;
 bool pciehp_poll_mode;
 int pciehp_poll_time;
 static bool pciehp_force;
+int pciehp_debounce_time;
 
 /*
  * not really modular, but the easiest way to keep compat with existing
@@ -54,10 +55,13 @@ module_param(pciehp_debug, bool, 0644);
 module_param(pciehp_poll_mode, bool, 0644);
 module_param(pciehp_poll_time, int, 0644);
 module_param(pciehp_force, bool, 0644);
+module_param(pciehp_debounce_time, int, 0644);
 MODULE_PARM_DESC(pciehp_debug, "Debugging mode enabled or not");
 MODULE_PARM_DESC(pciehp_poll_mode, "Using polling mechanism for hot-plug events or not");
 MODULE_PARM_DESC(pciehp_poll_time, "Polling mechanism frequency, in seconds");
 MODULE_PARM_DESC(pciehp_force, "Force pciehp, even if OSHP is missing");
+MODULE_PARM_DESC(pciehp_debounce_time,
+		 "PCIe hotplug debounce time in milliseconds");
 
 #define PCIE_MODULE_NAME "pciehp"
 
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index 83f3d4af3677..03d966c21c41 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -40,6 +40,7 @@ static void interrupt_event_handler(struct work_struct *work);
 void pciehp_queue_interrupt_event(struct slot *p_slot, u32 event_type)
 {
 	struct event_info *info;
+	bool drop_event = false;
 
 	info = kmalloc(sizeof(*info), GFP_ATOMIC);
 	if (!info) {
@@ -47,10 +48,39 @@ void pciehp_queue_interrupt_event(struct slot *p_slot, u32 event_type)
 		return;
 	}
 
-	INIT_WORK(&info->work, interrupt_event_handler);
-	info->event_type = event_type;
-	info->p_slot = p_slot;
-	queue_work(p_slot->wq, &info->work);
+	/* Clear linkup-debounce flag if time exceeds linkup-debounce timeout */
+	if (time_after(jiffies, p_slot->linkup_start +
+		       msecs_to_jiffies(pciehp_debounce_time)))
+		p_slot->linkup_debounce_active = 0;
+
+	/* Check if this event starts a new linkup-debounce period */
+	if (pciehp_debounce_time && (event_type == INT_LINK_UP) &&
+	    !p_slot->linkup_debounce_active) {
+		p_slot->linkup_start = jiffies;
+		p_slot->linkup_debounce_active = 1;
+		ctrl_info(p_slot->ctrl,
+			  "Slot(%s): Linkup-debounce active for %dms\n",
+			  slot_name(p_slot), pciehp_debounce_time);
+	} else {
+		/*
+		 * Drop this event if it occurs inside the debounce period
+		 * after the linkup event
+		 */
+		if (p_slot->linkup_debounce_active)
+			drop_event = true;
+	}
+
+	if (drop_event) {
+		ctrl_info(p_slot->ctrl,
+			  "Slot(%s): Event %x dropped (dt=%dms)!\n",
+			  slot_name(p_slot), event_type,
+			  jiffies_to_msecs(jiffies - p_slot->linkup_start));
+	} else {
+		INIT_WORK(&info->work, interrupt_event_handler);
+		info->event_type = event_type;
+		info->p_slot = p_slot;
+		queue_work(p_slot->wq, &info->work);
+	}
 }
 
 /* The following routines constitute the bulk of the
Hotplugging of some PCIe devices on our platform sometimes leads to a
bounce of link-up and link-down events, resulting in problems in the
corresponding PCI drivers.

Here is an example of such a hotplug event bounce for an AHCI PCIe card:
...
pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
pci 0000:02:00.0: reg 0x10: [io 0x8000-0x8007]
...
ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
ahci 0000:02:00.0: PME# disabled
ata3: SATA link down (SStatus 0 SControl 300)
ata5: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
ata6: SATA link down (SStatus 0 SControl 300)
Modules linked in:
CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
Workqueue: pciehp-1 pciehp_power_thread
...

This patch now adds the 'pciehp_debounce_time' module parameter, which
can be used to drop all events for the specified time (in milliseconds)
after a link-up event occurred. A value of ~100ms works fine in my tests
to debounce all the link-up / link-down events.

If this parameter is not set (the default), the current behavior is
unchanged and all events are handled.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/hotplug/pciehp.h      |  3 +++
 drivers/pci/hotplug/pciehp_core.c |  4 ++++
 drivers/pci/hotplug/pciehp_ctrl.c | 38 ++++++++++++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 4 deletions(-)
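[Editor's note on usage: pciehp is not built as a loadable module (see the "not really modular" comment in pciehp_core.c above), so if this patch were applied the proposed parameter would be passed on the kernel command line, for example with the ~100ms value from the tests above:]

    pciehp.pciehp_debounce_time=100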