Message ID | d9ca2742-7f85-afe3-2de6-6637eb56d9b3@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
On Wed, 14 Feb 2018 13:34:38 +0200, cantabile wrote: > The firmware running on the device sometimes survives a reboot > (firmware_running returns 1). When this happens the driver never calls > request_firmware, which means the kernel's firmware handling code > doesn't know this firmware should be cached before hibernating. Upon > resuming from several hours of hibernation, the firmware is no longer > running on the device, so the driver calls request_firmware. Since the > firmware was never cached, it needs to be loaded from disk, and this is > when the system freezes, somewhere in the xfs driver. Fix this by always > requesting the firmware, whether it's already running on the device or not. > > Signed-off-by: John Smith <cantabile.desu@gmail.com> Thanks for tracking this down, but this seems like the wrong direction. What's your hard drive? Is it some complex configuration which prevents the block device from coming online after resume? If it's really because of some peculiarities of XFS the fix should go there, no driver will be able to load FW on resume... > --- a/drivers/net/wireless/mediatek/mt7601u/mcu.c > +++ b/drivers/net/wireless/mediatek/mt7601u/mcu.c > @@ -420,13 +420,15 @@ > mt7601u_wr(dev, MT_USB_DMA_CFG, (MT_USB_DMA_CFG_RX_BULK_EN | > MT_USB_DMA_CFG_TX_BULK_EN)); > > - if (firmware_running(dev)) > - return 0; > - > ret = request_firmware(&fw, MT7601U_FIRMWARE, dev->dev); > if (ret) > return ret; > > + if (firmware_running(dev)) { > + release_firmware(fw); > + return 0; > + } > + > if (!fw || !fw->data || fw->size < sizeof(*hdr)) > goto err_inv_fw; >
On 15/02/18 02:45, Jakub Kicinski wrote: > On Wed, 14 Feb 2018 13:34:38 +0200, cantabile wrote: >> The firmware running on the device sometimes survives a reboot >> (firmware_running returns 1). When this happens the driver never calls >> request_firmware, which means the kernel's firmware handling code >> doesn't know this firmware should be cached before hibernating. Upon >> resuming from several hours of hibernation, the firmware is no longer >> running on the device, so the driver calls request_firmware. Since the >> firmware was never cached, it needs to be loaded from disk, and this is >> when the system freezes, somewhere in the xfs driver. Fix this by always >> requesting the firmware, whether it's already running on the device or not. >> >> Signed-off-by: John Smith <cantabile.desu@gmail.com> > > Thanks for tracking this down, but this seems like the wrong > direction. > > What's your hard drive? Is it some complex configuration which > prevents the block device from coming online after resume? > > If it's really because of some peculiarities of XFS the fix should > go there, no driver will be able to load FW on resume... > My hard drive is a SATA SSD, with a regular MS-DOS partition table, with three primary partitions: one ext4 mounted at /boot, one xfs mounted at /, one xfs mounted at /home. No encryption, no software RAID, no logical volumes, etc. The kernel has a firmware caching mechanism to make sure that no driver needs to load firmware from disk during the resume process. Because that's unreliable. See this bit of the documentation: [1] This is how it works: when the driver's probe callback calls request_firmware, that function adds the firmware file's name in a devres in your 'struct device'. When you suspend the system, the kernel calls fw_pm_notify [2], which goes through all the 'struct device's and looks for firmware file names previously added by request_firmware, and loads into memory all the firmware files it finds. When you resume the system, all calls to request_firmware ought to find the firmware files already loaded in memory. After resuming, the kernel calls fw_pm_notify again to release the cached firmware files (with some delay). This caching mechanism currently doesn't always work for your driver because your driver doesn't always call request_firmware from the probe callback. Because the firmware running on the device sometimes survives a reboot. You can get some very useful messages if you compile firmware_class.c with -DDEBUG [3]. This is how I figured out the problem. [1] https://www.kernel.org/doc/html/v4.13/driver-api/firmware/request_firmware.html#considerations-for-suspend-and-resume [2] https://github.com/torvalds/linux/blob/master/drivers/base/firmware_class.c#L1804 [3] https://www.kernel.org/doc/local/pr_debug.txt
On Thu, 15 Feb 2018 13:38:00 +0200, cantabile wrote: > On 15/02/18 02:45, Jakub Kicinski wrote: > > On Wed, 14 Feb 2018 13:34:38 +0200, cantabile wrote: > >> The firmware running on the device sometimes survives a reboot > >> (firmware_running returns 1). When this happens the driver never calls > >> request_firmware, which means the kernel's firmware handling code > >> doesn't know this firmware should be cached before hibernating. Upon > >> resuming from several hours of hibernation, the firmware is no longer > >> running on the device, so the driver calls request_firmware. Since the > >> firmware was never cached, it needs to be loaded from disk, and this is > >> when the system freezes, somewhere in the xfs driver. Fix this by always > >> requesting the firmware, whether it's already running on the device or not. > >> > >> Signed-off-by: John Smith <cantabile.desu@gmail.com> > > > > Thanks for tracking this down, but this seems like the wrong > > direction. > > > > What's your hard drive? Is it some complex configuration which > > prevents the block device from coming online after resume? > > > > If it's really because of some peculiarities of XFS the fix should > > go there, no driver will be able to load FW on resume... > > > > My hard drive is a SATA SSD, with a regular MS-DOS partition table, with > three primary partitions: one ext4 mounted at /boot, one xfs mounted at > /, one xfs mounted at /home. No encryption, no software RAID, no logical > volumes, etc. > > The kernel has a firmware caching mechanism to make sure that no driver > needs to load firmware from disk during the resume process. Because > that's unreliable. See this bit of the documentation: [1] > > This is how it works: when the driver's probe callback calls > request_firmware, that function adds the firmware file's name in a > devres in your 'struct device'. When you suspend the system, the kernel > calls fw_pm_notify [2], which goes through all the 'struct device's and > looks for firmware file names previously added by request_firmware, and > loads into memory all the firmware files it finds. When you resume the > system, all calls to request_firmware ought to find the firmware files > already loaded in memory. After resuming, the kernel calls fw_pm_notify > again to release the cached firmware files (with some delay). > > This caching mechanism currently doesn't always work for your driver > because your driver doesn't always call request_firmware from the probe > callback. Because the firmware running on the device sometimes survives > a reboot. Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() and just call that in case driver sees FW is already loaded? That should inform the fw subsystem that we want the image around in case of hibernation, but there is no need to load it immediately? > You can get some very useful messages if you compile firmware_class.c > with -DDEBUG [3]. This is how I figured out the problem. > > [1] > https://www.kernel.org/doc/html/v4.13/driver-api/firmware/request_firmware.html#considerations-for-suspend-and-resume > [2] > https://github.com/torvalds/linux/blob/master/drivers/base/firmware_class.c#L1804 > [3] https://www.kernel.org/doc/local/pr_debug.txt
On 15/02/18 23:47, Jakub Kicinski wrote: > On Thu, 15 Feb 2018 13:38:00 +0200, cantabile wrote: >> On 15/02/18 02:45, Jakub Kicinski wrote: >>> On Wed, 14 Feb 2018 13:34:38 +0200, cantabile wrote: >>>> The firmware running on the device sometimes survives a reboot >>>> (firmware_running returns 1). When this happens the driver never calls >>>> request_firmware, which means the kernel's firmware handling code >>>> doesn't know this firmware should be cached before hibernating. Upon >>>> resuming from several hours of hibernation, the firmware is no longer >>>> running on the device, so the driver calls request_firmware. Since the >>>> firmware was never cached, it needs to be loaded from disk, and this is >>>> when the system freezes, somewhere in the xfs driver. Fix this by always >>>> requesting the firmware, whether it's already running on the device or not. >>>> >>>> Signed-off-by: John Smith <cantabile.desu@gmail.com> >>> >>> Thanks for tracking this down, but this seems like the wrong >>> direction. >>> >>> What's your hard drive? Is it some complex configuration which >>> prevents the block device from coming online after resume? >>> >>> If it's really because of some peculiarities of XFS the fix should >>> go there, no driver will be able to load FW on resume... >>> >> >> My hard drive is a SATA SSD, with a regular MS-DOS partition table, with >> three primary partitions: one ext4 mounted at /boot, one xfs mounted at >> /, one xfs mounted at /home. No encryption, no software RAID, no logical >> volumes, etc. >> >> The kernel has a firmware caching mechanism to make sure that no driver >> needs to load firmware from disk during the resume process. Because >> that's unreliable. See this bit of the documentation: [1] >> >> This is how it works: when the driver's probe callback calls >> request_firmware, that function adds the firmware file's name in a >> devres in your 'struct device'. When you suspend the system, the kernel >> calls fw_pm_notify [2], which goes through all the 'struct device's and >> looks for firmware file names previously added by request_firmware, and >> loads into memory all the firmware files it finds. When you resume the >> system, all calls to request_firmware ought to find the firmware files >> already loaded in memory. After resuming, the kernel calls fw_pm_notify >> again to release the cached firmware files (with some delay). >> >> This caching mechanism currently doesn't always work for your driver >> because your driver doesn't always call request_firmware from the probe >> callback. Because the firmware running on the device sometimes survives >> a reboot. > > Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() > and just call that in case driver sees FW is already loaded? That > should inform the fw subsystem that we want the image around in case of > hibernation, but there is no need to load it immediately? > No, I don't believe it's cleaner to expose a private function that you don't even really need. Remember that calling request_firmware every time your driver's probe and resume functions are called is normal. It's the expected behaviour.
On Sat, 17 Feb 2018 13:23:29 +0200, cantabile wrote: > > Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() > > and just call that in case driver sees FW is already loaded? That > > should inform the fw subsystem that we want the image around in case of > > hibernation, but there is no need to load it immediately? > > No, I don't believe it's cleaner to expose a private function that you > don't even really need. Remember that calling request_firmware every > time your driver's probe and resume functions are called is normal. It's > the expected behaviour. I'm asking you the extend functionality of a subsystem to be able to cleanly communicate the intent. Not export internal functions. Requesting firmware you don't need and risking failing probe even if FW is already pre-loaded is not correct. Reordering you suggest is brittle and makes little logical sense unless someone guesses your use case. Please at least try to do as advised. Otherwise: Nacked-by: Jakub Kicinski <kubakici@wp.pl>
On 19/02/18 07:55, Jakub Kicinski wrote: > On Sat, 17 Feb 2018 13:23:29 +0200, cantabile wrote: >>> Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() >>> and just call that in case driver sees FW is already loaded? That >>> should inform the fw subsystem that we want the image around in case of >>> hibernation, but there is no need to load it immediately? >> >> No, I don't believe it's cleaner to expose a private function that you >> don't even really need. Remember that calling request_firmware every >> time your driver's probe and resume functions are called is normal. It's >> the expected behaviour. > > I'm asking you the extend functionality of a subsystem to be able to > cleanly communicate the intent. Not export internal functions. > > Requesting firmware you don't need and risking failing probe even if FW > is already pre-loaded is not correct. Reordering you suggest is > brittle and makes little logical sense unless someone guesses your use > case. > > Please at least try to do as advised. Otherwise: > > Nacked-by: Jakub Kicinski <kubakici@wp.pl> > You're right about the reordering not making sense to someone unfamiliar with the problem. I can fix that with a comment. I can change the patch so that request_firmware will only make the probe function fail if the firmware is not already running. If that's not satisfactory, I will try to do what you suggested. (The lack of comment from mcgrof@kernel.org doesn't look promising, but maybe I'm just impatient.)
On Mon, Feb 19, 2018 at 05:01:27PM +0200, cantabile wrote: > On 19/02/18 07:55, Jakub Kicinski wrote: > > On Sat, 17 Feb 2018 13:23:29 +0200, cantabile wrote: > > > > Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() > > > > and just call that in case driver sees FW is already loaded? That > > > > should inform the fw subsystem that we want the image around in case of > > > > hibernation, but there is no need to load it immediately? > > > > > > No, I don't believe it's cleaner to expose a private function that you > > > don't even really need. Remember that calling request_firmware every > > > time your driver's probe and resume functions are called is normal. It's > > > the expected behaviour. > > > > I'm asking you the extend functionality of a subsystem to be able to > > cleanly communicate the intent. Not export internal functions. > > > > Requesting firmware you don't need and risking failing probe even if FW > > is already pre-loaded is not correct. Reordering you suggest is > > brittle and makes little logical sense unless someone guesses your use > > case. > > > > Please at least try to do as advised. Otherwise: > > > > Nacked-by: Jakub Kicinski <kubakici@wp.pl> > > > > You're right about the reordering not making sense to someone unfamiliar > with the problem. I can fix that with a comment. > > I can change the patch so that request_firmware will only make the probe > function fail if the firmware is not already running. Note that using request_firmware() on probe typically is also not an outstanding idea given it delays boot. Not because looking for the firmware takes time, but instead because processing firmware typically does on the device. For instance cxgb4 is an example device where processing firmware takes a long time. Delays on probe may mean the "feel good" immediate desktop coming up is delyed. Specially if its networking... I see no reason why to process firmware on probe. If one can use a workqueue to process verifying if it needs firmware and loading later, that's more advisable. Now, that's all a side topic. I will for now agree that it seems pointless to request for firmware always even if you don't need to, and all you want is to just cache the firmware on suspend. So I welcome a patch but the justification for it really needs to be documented very well, and the documentation extended as such. In fact maybe rename the function to something more sensible. Another use case for the firmware cache (which we need to add the documentation) is that for hibernation we suspend all devices first, get a snapshot, and then resume devices so we can then write the snapshot to disk. On that resume step I don't think devices have access to the hard drive for firmware, so cache is all we have. This may need some confirmation but I suspect this is the case. Drivers needing firmware on resume for hibernation may need to cache their firmware. I want to understand the case where the firmware is *not* available on resume? Why did that happen? I seem to have read that on a fresh reboot the firmware was not needed, and so on probe request_firmware() was not called? Why would firmware not be required on a reboot? Luis
On Sun, 25 Feb 2018 17:54:25 +0000, Luis R. Rodriguez wrote: > On Mon, Feb 19, 2018 at 05:01:27PM +0200, cantabile wrote: > > On 19/02/18 07:55, Jakub Kicinski wrote: > > > On Sat, 17 Feb 2018 13:23:29 +0200, cantabile wrote: > > > > > Thanks for the info. Would it be cleaner to EXPORT fw_add_devm_name() > > > > > and just call that in case driver sees FW is already loaded? That > > > > > should inform the fw subsystem that we want the image around in case of > > > > > hibernation, but there is no need to load it immediately? > > > > > > > > No, I don't believe it's cleaner to expose a private function that you > > > > don't even really need. Remember that calling request_firmware every > > > > time your driver's probe and resume functions are called is normal. It's > > > > the expected behaviour. > > > > > > I'm asking you the extend functionality of a subsystem to be able to > > > cleanly communicate the intent. Not export internal functions. > > > > > > Requesting firmware you don't need and risking failing probe even if FW > > > is already pre-loaded is not correct. Reordering you suggest is > > > brittle and makes little logical sense unless someone guesses your use > > > case. > > > > > > Please at least try to do as advised. Otherwise: > > > > > > Nacked-by: Jakub Kicinski <kubakici@wp.pl> > > > > > > > You're right about the reordering not making sense to someone unfamiliar > > with the problem. I can fix that with a comment. > > > > I can change the patch so that request_firmware will only make the probe > > function fail if the firmware is not already running. > > Note that using request_firmware() on probe typically is also not an > outstanding idea given it delays boot. Not because looking for the firmware > takes time, but instead because processing firmware typically does on > the device. For instance cxgb4 is an example device where processing > firmware takes a long time. > > Delays on probe may mean the "feel good" immediate desktop coming up is delyed. > > Specially if its networking... I see no reason why to process firmware on probe. > > If one can use a workqueue to process verifying if it needs firmware and loading > later, that's more advisable. Quite true, more advanced the FW the longer FW load takes :( Although I would be cautious not to cause issues for network/NFS boot... Perhaps it can wait for such workqueue to finish? > Now, that's all a side topic. > > I will for now agree that it seems pointless to request for firmware always > even if you don't need to, and all you want is to just cache the firmware > on suspend. So I welcome a patch but the justification for it really needs to > be documented very well, and the documentation extended as such. In fact > maybe rename the function to something more sensible. > > Another use case for the firmware cache (which we need to add the documentation) > is that for hibernation we suspend all devices first, get a snapshot, and then > resume devices so we can then write the snapshot to disk. On that resume step > I don't think devices have access to the hard drive for firmware, so cache is > all we have. This may need some confirmation but I suspect this is the case. > Drivers needing firmware on resume for hibernation may need to cache their > firmware. > > I want to understand the case where the firmware is *not* available on resume? > Why did that happen? I seem to have read that on a fresh reboot the firmware > was not needed, and so on probe request_firmware() was not called? Why would > firmware not be required on a reboot? Yes, that is a good question.. John, do you have a theory? My initial thought was that the UEFI/BIOS loads it during pre-boot, but this is a USB card, so it's a bit unlikely that UEFI will have a driver for it... Does this happen when rebooting maybe?
On 27/02/18 04:28, Jakub Kicinski wrote: > On Sun, 25 Feb 2018 17:54:25 +0000, Luis R. Rodriguez wrote: >> I want to understand the case where the firmware is *not* available on resume? >> Why did that happen? I seem to have read that on a fresh reboot the firmware >> was not needed, and so on probe request_firmware() was not called? Why would >> firmware not be required on a reboot? > > Yes, that is a good question.. John, do you have a theory? My initial > thought was that the UEFI/BIOS loads it during pre-boot, but this is a > USB card, so it's a bit unlikely that UEFI will have a driver for it... > Does this happen when rebooting maybe? > Yes, it happens when rebooting: 1) Plug in the dongle. Message about firmware appears in dmesg: mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 mt7601u 2-3:1.0: Firmware Version: 0.1.00 Build: 7640 Build time: 201302052146____ mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d 2) `systemctl reboot`. Message about firmware does not appear in dmesg: mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d The dongle is nevertheless perfectly functional after rebooting. I have no idea why it works like this. This is an older laptop, no UEFI.
On Tue, Feb 27, 2018 at 02:25:55PM +0200, cantabile wrote: > On 27/02/18 04:28, Jakub Kicinski wrote: > > On Sun, 25 Feb 2018 17:54:25 +0000, Luis R. Rodriguez wrote: > > > > I want to understand the case where the firmware is *not* available on resume? > > > Why did that happen? I seem to have read that on a fresh reboot the firmware > > > was not needed, and so on probe request_firmware() was not called? Why would > > > firmware not be required on a reboot? > > > > Yes, that is a good question.. John, do you have a theory? My initial > > thought was that the UEFI/BIOS loads it during pre-boot, but this is a > > USB card, so it's a bit unlikely that UEFI will have a driver for it... > > Does this happen when rebooting maybe? > > > > Yes, it happens when rebooting: > 1) Plug in the dongle. Message about firmware appears in dmesg: > > mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 > mt7601u 2-3:1.0: Firmware Version: 0.1.00 Build: 7640 Build time: > 201302052146____ > mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d > > 2) `systemctl reboot`. Message about firmware does not appear in dmesg: > > mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 > mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d > > The dongle is nevertheless perfectly functional after rebooting. > > I have no idea why it works like this. > > This is an older laptop, no UEFI. OK, this just confirms that firmware is not needed on reboot sometimes, but it does not explain *why*. What driver and code lines are involved so I can go read? Is the vendor involved or not really? I could imagine a situation where on reboot we just reset a device but since poweroff is never fully issued the firmware is not lost. So let me ask, why not do a full reset / shutdown of the device when the interface goes down? If there is no gain from the behaviour observed, might as well make the device work as much others rather than adding an API for a one-off situation where there is no gain from it behaving in a unique way. Luis
On Tue, 27 Feb 2018 16:54:51 +0000, Luis R. Rodriguez wrote: > On Tue, Feb 27, 2018 at 02:25:55PM +0200, cantabile wrote: > > On 27/02/18 04:28, Jakub Kicinski wrote: > > > On Sun, 25 Feb 2018 17:54:25 +0000, Luis R. Rodriguez wrote: > > > > > > I want to understand the case where the firmware is *not* available on resume? > > > > Why did that happen? I seem to have read that on a fresh reboot the firmware > > > > was not needed, and so on probe request_firmware() was not called? Why would > > > > firmware not be required on a reboot? > > > > > > Yes, that is a good question.. John, do you have a theory? My initial > > > thought was that the UEFI/BIOS loads it during pre-boot, but this is a > > > USB card, so it's a bit unlikely that UEFI will have a driver for it... > > > Does this happen when rebooting maybe? > > > > Yes, it happens when rebooting: > > 1) Plug in the dongle. Message about firmware appears in dmesg: > > > > mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 > > mt7601u 2-3:1.0: Firmware Version: 0.1.00 Build: 7640 Build time: > > 201302052146____ > > mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d > > > > 2) `systemctl reboot`. Message about firmware does not appear in dmesg: > > > > mt7601u 2-3:1.0: ASIC revision: 76010001 MAC revision: 76010500 > > mt7601u 2-3:1.0: Warning: unsupported EEPROM version 0d > > > > The dongle is nevertheless perfectly functional after rebooting. > > > > I have no idea why it works like this. > > > > This is an older laptop, no UEFI. > > OK, this just confirms that firmware is not needed on reboot sometimes, > but it does not explain *why*. What driver and code lines are involved > so I can go read? mt7601u_load_firmware() is probably the place to look at. > Is the vendor involved or not really? Not really, we got the vendor to submit the FW to linux-firmware, that's about it. > I could imagine a situation where on reboot we just reset a device but > since poweroff is never fully issued the firmware is not lost. So let me > ask, why not do a full reset / shutdown of the device when the interface > goes down? > > If there is no gain from the behaviour observed, might as well make > the device work as much others rather than adding an API for a one-off > situation where there is no gain from it behaving in a unique way. IDK what the reset procedure is :S
--- a/drivers/net/wireless/mediatek/mt7601u/mcu.c +++ b/drivers/net/wireless/mediatek/mt7601u/mcu.c @@ -420,13 +420,15 @@ mt7601u_wr(dev, MT_USB_DMA_CFG, (MT_USB_DMA_CFG_RX_BULK_EN | MT_USB_DMA_CFG_TX_BULK_EN)); - if (firmware_running(dev)) - return 0; - ret = request_firmware(&fw, MT7601U_FIRMWARE, dev->dev); if (ret) return ret; + if (firmware_running(dev)) { + release_firmware(fw); + return 0; + } + if (!fw || !fw->data || fw->size < sizeof(*hdr)) goto err_inv_fw;
The firmware running on the device sometimes survives a reboot (firmware_running returns 1). When this happens the driver never calls request_firmware, which means the kernel's firmware handling code doesn't know this firmware should be cached before hibernating. Upon resuming from several hours of hibernation, the firmware is no longer running on the device, so the driver calls request_firmware. Since the firmware was never cached, it needs to be loaded from disk, and this is when the system freezes, somewhere in the xfs driver. Fix this by always requesting the firmware, whether it's already running on the device or not. Signed-off-by: John Smith <cantabile.desu@gmail.com> ---