Message ID | 1413927152.4202.195.camel@ul30vt.home (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Alex Williamson wrote:
> On Tue, 2014-10-21 at 15:06 -0600, Alex Williamson wrote:
>> Hi Andreas,
>>
>> On Fri, 2014-10-17 at 03:04 +0200, Andreas Hartmann wrote:
>>> Hello Alex,
>>>
>>> Alex Williamson wrote:
>>>> Hi Andreas,
>>> [...]
>>>> Sorry for the breakage. Is it possible to run lspci on the device in a
>>>> loop from the host and capture whether we're failing to restore some of
>>>> the VC bits to their previous state?
>>>
>>>> Does the problem also occur if you
>>>> unbind from host driver,
>>>
>>> The machine is booted w/ blacklisted ath9k. Then, the device is bound to
>>> vfio:
>>>
>>> echo "168c 0030" > /sys/bus/pci/drivers/vfio-pci/new_id
>>> echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
>>> echo 0000:03:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
>>>
>>> afterwards the VM is started -> hang.
>>>
>>> W/o starting the VM, I can bind it to vfio and unbind it from vfio w/o
>>> any problem.
>>>
>>>> echo 1 > reset in pci-sysfs,
>>>
>>> echo 1 > /sys/bus/pci/devices/0000:03:00.0/reset works w/o any problem
>>> while bound to vfio. Even after unbinding from vfio and rebinding to
>>> vfio again ... .
>>>
>>>> and re-bind to the
>>>
>>> Do you mean loading ath9k in host system after unbinding from vfio? If
>>> yes: Works w/o any problem. It's even possible to reset it or do a
>>> ifconfig wlan0 up, ifconfig wlan0 down, rmmod ath9k, bind it to vfio
>>> again and reset it, ....
>>>
>>> Looks like the hang is only triggered by qemu-system-x86_64 on startup
>>> of the VM.
>
> Also, this might be because QEMU since 1.7 will favor doing a bus reset
> for a device over PM reset while the sysfs reset interface will only do
> a bus reset if there are no other methods available and there are no
> other devices on the bus. Can you reproduce the hang using the sysfs
> reset interface without QEMU if you modify the kernel like this:
>
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
>  	if (rc != -ENOTTY)
>  		goto done;
>
> -	rc = pci_pm_reset(dev, probe);
> +	rc = pci_dev_reset_slot_function(dev, probe);
>  	if (rc != -ENOTTY)
>  		goto done;
>
> -	rc = pci_dev_reset_slot_function(dev, probe);
> +	rc = pci_parent_bus_reset(dev, probe);
>  	if (rc != -ENOTTY)
>  		goto done;
>
> -	rc = pci_parent_bus_reset(dev, probe);
> +	rc = pci_pm_reset(dev, probe);
>  done:
>  	return rc;
>  }

This way it's crashing with echo 1 > reset, too.

Regards,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

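To make the effect of the reordering easier to see, here is a minimal shell sketch of the fall-through logic in the patch above: each reset method is tried in turn and the first one the device supports wins. It models only the three methods visible in the diff. The function `pick_reset_method`, the short names `pm`, `slot` and `bus`, and the assumed capabilities of the device are all hypothetical, and the extra condition that a bus reset requires no other devices on the bus is not modeled.

```shell
#!/bin/sh
# Hypothetical model of the fall-through in __pci_dev_reset(): try each
# method in order and stop at the first one the device supports. A method
# that is not supported corresponds to the kernel function returning
# -ENOTTY, which lets the next method be tried.
#
# $1    space-separated list of methods the device supports (assumed)
# $2..  methods in the order they are tried
pick_reset_method() {
    supported=" $1 "
    shift
    for method in "$@"; do
        case "$supported" in
            *" $method "*)
                echo "$method"
                return 0
                ;;
        esac
    done
    echo "none"    # every method returned -ENOTTY
    return 1
}

# Assumed capabilities: PM reset and parent bus reset, no slot reset.
pick_reset_method "pm bus" pm slot bus    # stock order, prints "pm"
pick_reset_method "pm bus" slot bus pm    # patched order, prints "bus"
```

Under these assumptions the stock order never reaches the bus reset because the PM reset succeeds first, which would explain why the sysfs reset only hangs once the order is changed.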
On Wed, 2014-10-22 at 18:22 +0200, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
> >  	if (rc != -ENOTTY)
> >  		goto done;
> >
> > -	rc = pci_pm_reset(dev, probe);
> > +	rc = pci_dev_reset_slot_function(dev, probe);
> >  	if (rc != -ENOTTY)
> >  		goto done;
> >
> > -	rc = pci_dev_reset_slot_function(dev, probe);
> > +	rc = pci_parent_bus_reset(dev, probe);
> >  	if (rc != -ENOTTY)
> >  		goto done;
> >
> > -	rc = pci_parent_bus_reset(dev, probe);
> > +	rc = pci_pm_reset(dev, probe);
> >  done:
> >  	return rc;
> >  }
>
> This way it's crashing with echo 1 > reset, too.

Ok, so it's somehow related to doing a bus reset with virtual channel
save/restore while PM reset with VC save/restore works ok as apparently
does bus reset without VC save/restore. Let's try to do a manual bus
reset so we can look at the post reset state of the device before the
kernel tries to restore it.

First bind the target device 03:00.0 to pci-stub or vfio-pci so that we
know it's not being used.

Next capture lspci -xxxx -s 3:00.0 so we have the starting state.

Then we'll do a bus reset using setpci:
# setpci -s 00:05.0 3e.w=40:40
<if you script this, wait at least 2ms here>
# setpci -s 00:05.0 3e.w=00:40
<wait 1 second here>

Now re-capture lspci -xxxx -s 3:00.0

The interesting lines for your device are 140: and 150:, so if you want
to avoid sending massive emails you can just send those for the before
and after. You'll need to reboot the system before you do anything else
with this device since it's now in an uninitialized state.

Based on what the lspci output reports (or whether you experience a hang
simply from this), we may want to try writing additional bits with
setpci to mimic the VC restore behavior.

Thanks,
Alex

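The manual procedure above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `run`, `masked_write` and `manual_bus_reset` are hypothetical, the default addresses are the ones used in this thread, and the script prints the commands instead of executing them unless `DRY_RUN=0` is set. Register 0x3e of a bridge is its Bridge Control register, and bit 0x40 there is Secondary Bus Reset; setpci's `value:mask` form changes only the bits set in the mask. Be aware that later in this thread the reporter's machine hung while running these same commands.

```shell
#!/bin/sh
# Sketch of the manual secondary bus reset described above.
# By default the commands are only printed; set DRY_RUN=0 to run them
# (as root, on a device that is bound to pci-stub or vfio-pci).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# What a setpci VALUE:MASK write does to a register:
#   new = (old & ~mask) | (value & mask)
# $1 old, $2 value, $3 mask (all hex, without 0x prefix)
masked_write() {
    printf '%04x\n' $(( (0x$1 & ~0x$3) | (0x$2 & 0x$3) ))
}

# $1 bridge above the device, $2 device to inspect
manual_bus_reset() {
    bridge=${1:-00:05.0}
    device=${2:-3:00.0}
    run lspci -xxxx -s "$device"          # state before the reset
    run setpci -s "$bridge" 3e.w=40:40    # set Secondary Bus Reset
    run sleep 0.01                        # >= 2ms; fractional sleep needs GNU sleep
    run setpci -s "$bridge" 3e.w=00:40    # clear Secondary Bus Reset
    run sleep 1
    run lspci -xxxx -s "$device"          # state after the reset
}

manual_bus_reset 00:05.0 3:00.0
```

To keep the mail small, the two lspci captures could be filtered down to the lines starting with `140:` and `150:` before comparing them.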
Alex Williamson wrote:
> On Wed, 2014-10-22 at 18:22 +0200, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
>>>  	if (rc != -ENOTTY)
>>>  		goto done;
>>>
>>> -	rc = pci_pm_reset(dev, probe);
>>> +	rc = pci_dev_reset_slot_function(dev, probe);
>>>  	if (rc != -ENOTTY)
>>>  		goto done;
>>>
>>> -	rc = pci_dev_reset_slot_function(dev, probe);
>>> +	rc = pci_parent_bus_reset(dev, probe);
>>>  	if (rc != -ENOTTY)
>>>  		goto done;
>>>
>>> -	rc = pci_parent_bus_reset(dev, probe);
>>> +	rc = pci_pm_reset(dev, probe);
>>>  done:
>>>  	return rc;
>>>  }
>>
>> This way it's crashing with echo 1 > reset, too.
>
> Ok, so it's somehow related to doing a bus reset with virtual channel
> save/restore while PM reset with VC save/restore works ok as apparently
> does bus reset without VC save/restore. Let's try to do a manual bus
> reset so we can look at the post reset state of the device before the
> kernel tries to restore it.
>
> First bind the target device 03:00.0 to pci-stub or vfio-pci so that we
> know it's not being used.
>
> Next capture lspci -xxxx -s 3:00.0 so we have the starting state.
>
> Then we'll do a bus reset using setpci:
> # setpci -s 00:05.0 3e.w=40:40
> <if you script this, wait at least 2ms here>
> # setpci -s 00:05.0 3e.w=00:40
> <wait 1 second here>
>
> Now re-capture lspci -xxxx -s 3:00.0

The machine is booted w/ vfio bound to 3:00.0 as usual (now for testing
linux 3.14)

lspci -xxxx -s 3:00.0
setpci -s 00:05.0 3e.w=40:40
usleep 10
setpci -s 00:05.0 3e.w=00:40
sleep 1
lspci -xxxx -s 3:00.0

I didn't get the second lspci because the machine already was hanging.
The first output is attached completely.

Hope this helps,
thanks,
regards,
Andreas

On Thu, 2014-10-23 at 18:00 +0200, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Wed, 2014-10-22 at 18:22 +0200, Andreas Hartmann wrote:
> >> Alex Williamson wrote:
> >>> --- a/drivers/pci/pci.c
> >>> +++ b/drivers/pci/pci.c
> >>> @@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
> >>>  	if (rc != -ENOTTY)
> >>>  		goto done;
> >>>
> >>> -	rc = pci_pm_reset(dev, probe);
> >>> +	rc = pci_dev_reset_slot_function(dev, probe);
> >>>  	if (rc != -ENOTTY)
> >>>  		goto done;
> >>>
> >>> -	rc = pci_dev_reset_slot_function(dev, probe);
> >>> +	rc = pci_parent_bus_reset(dev, probe);
> >>>  	if (rc != -ENOTTY)
> >>>  		goto done;
> >>>
> >>> -	rc = pci_parent_bus_reset(dev, probe);
> >>> +	rc = pci_pm_reset(dev, probe);
> >>>  done:
> >>>  	return rc;
> >>>  }
> >>
> >> This way it's crashing with echo 1 > reset, too.
> >
> > Ok, so it's somehow related to doing a bus reset with virtual channel
> > save/restore while PM reset with VC save/restore works ok as apparently
> > does bus reset without VC save/restore. Let's try to do a manual bus
> > reset so we can look at the post reset state of the device before the
> > kernel tries to restore it.
> >
> > First bind the target device 03:00.0 to pci-stub or vfio-pci so that we
> > know it's not being used.
> >
> > Next capture lspci -xxxx -s 3:00.0 so we have the starting state.
> >
> > Then we'll do a bus reset using setpci:
> > # setpci -s 00:05.0 3e.w=40:40
> > <if you script this, wait at least 2ms here>
> > # setpci -s 00:05.0 3e.w=00:40
> > <wait 1 second here>
> >
> > Now re-capture lspci -xxxx -s 3:00.0
>
> The machine is booted w/ vfio bound to 3:00.0 as usual (now for testing
> linux 3.14)
>
> lspci -xxxx -s 3:00.0
> setpci -s 00:05.0 3e.w=40:40
> usleep 10
> setpci -s 00:05.0 3e.w=00:40
> sleep 1
> lspci -xxxx -s 3:00.0
>
> I didn't get the second lspci because the machine already was hanging.
> The first output is attached completely.

Hmm, that doesn't make much sense. You had found that if you disabled
the VC save/restore then QEMU works. That should have still been using
secondary bus reset as we're trying to do here, so I don't understand
why we can't do a manual secondary bus reset now.

If you use Bjorn's previous patch to disable VC save/restore and my
patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
entry for the device also still cause a hang?

Can you provide a link to the specific model for this card? Thanks,

Alex

Alex Williamson wrote:
> On Thu, 2014-10-23 at 18:00 +0200, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Wed, 2014-10-22 at 18:22 +0200, Andreas Hartmann wrote:
>>>> Alex Williamson wrote:
>>>>> --- a/drivers/pci/pci.c
>>>>> +++ b/drivers/pci/pci.c
>>>>> @@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
>>>>>  	if (rc != -ENOTTY)
>>>>>  		goto done;
>>>>>
>>>>> -	rc = pci_pm_reset(dev, probe);
>>>>> +	rc = pci_dev_reset_slot_function(dev, probe);
>>>>>  	if (rc != -ENOTTY)
>>>>>  		goto done;
>>>>>
>>>>> -	rc = pci_dev_reset_slot_function(dev, probe);
>>>>> +	rc = pci_parent_bus_reset(dev, probe);
>>>>>  	if (rc != -ENOTTY)
>>>>>  		goto done;
>>>>>
>>>>> -	rc = pci_parent_bus_reset(dev, probe);
>>>>> +	rc = pci_pm_reset(dev, probe);
>>>>>  done:
>>>>>  	return rc;
>>>>>  }
>>>>
>>>> This way it's crashing with echo 1 > reset, too.
>>>
>>> Ok, so it's somehow related to doing a bus reset with virtual channel
>>> save/restore while PM reset with VC save/restore works ok as apparently
>>> does bus reset without VC save/restore. Let's try to do a manual bus
>>> reset so we can look at the post reset state of the device before the
>>> kernel tries to restore it.
>>>
>>> First bind the target device 03:00.0 to pci-stub or vfio-pci so that we
>>> know it's not being used.
>>>
>>> Next capture lspci -xxxx -s 3:00.0 so we have the starting state.
>>>
>>> Then we'll do a bus reset using setpci:
>>> # setpci -s 00:05.0 3e.w=40:40
>>> <if you script this, wait at least 2ms here>
>>> # setpci -s 00:05.0 3e.w=00:40
>>> <wait 1 second here>
>>>
>>> Now re-capture lspci -xxxx -s 3:00.0
>>
>> The machine is booted w/ vfio bound to 3:00.0 as usual (now for testing
>> linux 3.14)
>>
>> lspci -xxxx -s 3:00.0
>> setpci -s 00:05.0 3e.w=40:40
>> usleep 10
>> setpci -s 00:05.0 3e.w=00:40
>> sleep 1
>> lspci -xxxx -s 3:00.0
>>
>> I didn't get the second lspci because the machine already was hanging.
>> The first output is attached completely.
>
> Hmm, that doesn't make much sense. You had found that if you disabled
> the VC save/restore then QEMU works. That should have still been using
> secondary bus reset as we're trying to do here, so I don't understand
> why we can't do a manual secondary bus reset now.
>
> If you use Bjorn's previous patch to disable VC save/restore and my
> patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
> entry for the device also still cause a hang?

I will test it.

> Can you provide a link to the specific model for this card? Thanks,

http://www.tp-link.com.de/support/download/?model=TL-WDN4800&version=V1

Regards,
Andreas

Alex Williamson wrote:
[...]
> If you use Bjorn's previous patch to disable VC save/restore and my
> patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
> entry for the device also still cause a hang?

Yes - it's hanging too (w/ vfio bound to the device - didn't test other
possibilities).

Regards,
Andreas

On Thu, 2014-10-23 at 19:33 +0200, Andreas Hartmann wrote:
> Alex Williamson wrote:
> [...]
> > If you use Bjorn's previous patch to disable VC save/restore and my
> > patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
> > entry for the device also still cause a hang?
>
> Yes - it's hanging too (w/ vfio bound to the device - didn't test other
> possibilities).

Does it happen regardless of the slot the card is plugged into? Thanks,

Alex

Alex Williamson wrote:
> On Thu, 2014-10-23 at 19:33 +0200, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>> [...]
>>> If you use Bjorn's previous patch to disable VC save/restore and my
>>> patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
>>> entry for the device also still cause a hang?
>>
>> Yes - it's hanging too (w/ vfio bound to the device - didn't test other
>> possibilities).
>
> Does it happen regardless of the slot the card is plugged into? Thanks,

Can't say - there is only one usable small PCIe slot. The other slot is
blocked by the graphics card - and the third slot, which should be there
according to the documentation, doesn't exist in reality :-(.

Regards,
Andreas

Alex Williamson wrote:
> On Thu, 2014-10-23 at 19:33 +0200, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>> [...]
>>> If you use Bjorn's previous patch to disable VC save/restore and my
>>> patch to reorder the reset mechanisms, does echo 1 > reset for the sysfs
>>> entry for the device also still cause a hang?
>>
>> Yes - it's hanging too (w/ vfio bound to the device - didn't test other
>> possibilities).
>
> Does it happen regardless of the slot the card is plugged into? Thanks,

As I already wrote, it's not possible to plug the device into another
port.

But besides that, let me stress some "findings" I made over the past few
weeks since I've known about this problem. Maybe it gives you an idea
about what's going on:

- I did all of the tests in text mode on the console. Normally, there is
  a blinking cursor. When doing the echo 1 > reset, the shell doesn't
  come back and the blinking of the cursor immediately gets slower.
  Getting slower means: it takes more time until it is on / off again.
  This way, it "blinks" no more than 2 more times until it's finally
  dead. It looks as if the machine suddenly had extremely high load
  (there are 8 cores!) - but this seems not to be true, because the CPU
  fan stays silent - the rpm doesn't change at all.

- Most of the time when I'm doing tests which fail, I have problems with
  USB after the hang (it's the Etron device). Problem means: initrd
  isn't able to communicate with the device (but BIOS and grub2 didn't
  have any problem, because the keyboard, which is connected via USB 3,
  worked fine). At this point, it is necessary to disconnect the mains
  completely and wait half a minute until the problem disappears.
  Seldom, I had this problem even at the BIOS stage, too: the keyboard
  couldn't be seen even by the BIOS any more.

- Sometimes (really seldom - it has now happened about 3 times), it gets
  extremely hard to return to normal operation after that hang. This
  means: For a few weeks, I've been running kernel 3.12.28-3-desktop out
  of the box (= as provided by openSUSE). Sometimes now, I got
  (apparently) the same problems (= PCIe passthrough hangs the complete
  machine) w/ 3.12.28 as I'm having with stock >= 3.14 after testing.
  It's even useless then to reconnect the mains (I experienced this 2
  times in series after one hang yesterday). At this point, I have to
  run kernel 3.10.x (which runs pretty fine as usual) and only after
  that, 3.12 works again as expected (as happened once yesterday during
  tests w/ USB 3 devices disabled via BIOS).

- I think there is a relationship between how long the hang is active
  and the subsequent problems coming up. If the hang is ended
  immediately (within about 1s) w/ the reset button, it is possible that
  there is no USB problem after reboot and the machine works completely
  fine with 3.12.x again.

Conclusion (from my point of view):
The broken reset seems to do something really _extremely ugly_ w/ the
hardware, which has the potential to break the hardware "lastingly", or
the subsequent software isn't able at all to correctly reconfigure the
system again - even after reconnecting the mains. Fortunately I have an
old kernel version (3.10.x), which seems to be able to "repair" the
hardware again.

But I have to emphasize that the situation is really highly questionable
and I'm meanwhile afraid of finally breaking my board, which is working
really _extremely_ stable besides that.

Out of interest:
Bjorn's patch disables VC save/restore support - and the machine works
fine again. Why is it needed at all if it seems to work perfectly w/o
it? What's the additional benefit? Or in other words: What have I been
missing until today :-) ? What would be better? What more could I do?

Thanks,
kind regards,
Andreas

--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3308,15 +3308,15 @@ static int __pci_dev_reset(struct pci_dev *dev, int prob
 	if (rc != -ENOTTY)
 		goto done;
 
-	rc = pci_pm_reset(dev, probe);
+	rc = pci_dev_reset_slot_function(dev, probe);
 	if (rc != -ENOTTY)
 		goto done;
 
-	rc = pci_dev_reset_slot_function(dev, probe);
+	rc = pci_parent_bus_reset(dev, probe);
 	if (rc != -ENOTTY)
 		goto done;
 
-	rc = pci_parent_bus_reset(dev, probe);
+	rc = pci_pm_reset(dev, probe);
 done:
 	return rc;
 }