Message ID | 20200526145815.6415-1-mcgrof@kernel.org (mailing list archive) |
---|---|
Series | kernel: taint when the driver firmware crashes |
On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:
> To those new on CC -- this is intended to be a simple generic interface
> to the kernel to annotate when the firmware has crashed leaving the
> driver or system in a questionable state, in the worst case requiring
> full system reboot. This series is first addressing only a few
> networking patches, however, I already have an idea of where such
> firmware crashes happen across the tree. The goal with this series then
> is to first introduce the simple framework, and only if that moves
> forward will I continue to chug on with the rest of the drivers /
> subsystems.
>
> This is *not* a networking specific problem only.
>
> This v3 augments the last series by introducing the uevent for panic
> events, one of them is during tainting. The uevent mechanism is
> independent from any of this firmware taint mechanism. I've also
> addressed Jessica Yu's feedback. Given I've extended the patches a bit
> with other minor cleanup which checkpatch.pl complains over, and since
> this infrastructure is still being discussed, I've trimmed the patch
> series size to only cover drivers for which I've received an Acked-by
> from the respective driver maintainer, or where we have bug reports to
> support such dire situations on the driver such as ath10k.
>
> During the last v2 it was discussed that we should instead use devlink
> for this work, however the initial RFC patches produced by Jakub
> Kicinski [0] shows how devlink is networking specific, and the intent
> behind this series is to produce simple helpers which can be used by *any*
> device driver, for any subsystem, not just networking. Subsystem
> specific infrastructure to help address firmware crashes may still make
> sense, however that does not mean we *don't* need something even more
> generic regardless of the subsystem the issue happens on. Since uevents
> for taints are exposed, we now expose these through uapi as well, and
> that was something which eventually had to happen given that the current
> scheme of relying on sensible character representations for each taint
> will not scale beyond the alphabet.

Nacked-by: Jakub Kicinski <kuba@kernel.org>
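For readers unfamiliar with the taint scheme the cover letter refers to: the kernel exposes its taint state as a decimal bitmask in /proc/sys/kernel/tainted, and each bit also has a one-letter representation in oops output. The following is a minimal userspace sketch, not part of the series, of decoding that mask. The letter table is an assumption covering bits 0 through 17 as documented for mainline kernels of this period; the helper names are illustrative, and any bit beyond the table (such as a new firmware-crash taint, whose bit number and letter are not given in this thread) is reported as '?'.

```python
from pathlib import Path

# Assumed letters for taint bits 0..17 (mainline kernels around v5.7).
# Bit 0 prints as 'P' when set; this sketch ignores the unset-bit letters.
TAINT_LETTERS = "PFSRMBUDAWCIOELKXT"


def decode_taint(mask):
    """Return a list of (bit, letter) pairs for every bit set in mask.

    Bits that fall outside the assumed table are reported with '?'.
    """
    flags = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            letter = TAINT_LETTERS[bit] if bit < len(TAINT_LETTERS) else "?"
            flags.append((bit, letter))
        bit += 1
    return flags


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask from procfs (Linux only)."""
    return int(Path(path).read_text().strip())
```

On a Linux system, `decode_taint(read_taint())` lists the taints currently set. The letter table has 18 entries, which illustrates the cover letter's point that one letter per taint cannot grow indefinitely.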
On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> Nacked-by: Jakub Kicinski <kuba@kernel.org>

Care to elaborate?

  Luis
On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > Nacked-by: Jakub Kicinski <kuba@kernel.org>
>
> Care to elaborate?

I elaborated in the previous thread and told you I will nack this,
but sure let's go over this again.

For the third time saying the devlink is networking specific is not
true. It was created as a netlink configuration channel for devices
when there is no networking reference that could be used. It can be
compiled in or out much like sysfs.

And as I've shown you devlink already has the uAPI for what you're
trying to achieve.

Regardless of your opinions about wider interfaces, networking drivers
should implement devlink, and not have to sprinkle magic taint calls.
On Tue, May 26, 2020 at 04:30:31PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> > On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > > Nacked-by: Jakub Kicinski <kuba@kernel.org>
> >
> > Care to elaborate?
>
> I elaborated in the previous thread

No you didn't.

> and told you I will nack this,

That's all you said.

> but sure let's go over this again.
>
> For the third time saying the devlink is networking specific is not
> true. It was created as a netlink configuration channel for devices
> when there is no networking reference that could be used. It can be
> compiled in or out much like sysfs.

Perhaps I didn't get your email but this clarification was in no way
shape or form present in your reply on that thread.

> And as I've shown you devlink already has the uAPI for what you're
> trying to achieve.

I read your patch, and granted, I will accept I was under the incorrect
assumption that this can only be used by networking devices, however
the devlink approach achieves getting userspace the ability with
iproute2 devlink util to query a device health, on to which we can peg
firmware health. But *this* patch series is not about health status and
letting users query it, it's about a *critical* situation which has come up
with firmware requiring me to reboot my system, and the lack of *any*
infrastructure in the kernel today to inform userspace about it.

So say we use netlink to report a critical health situation, how are we
informing userspace with your patch series about requiring a reboot?

  Luis
On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> I read your patch, and granted, I will accept I was under the incorrect
> assumption that this can only be used by networking devices, however
> the devlink approach achieves getting userspace the ability with
> iproute2 devlink util to query a device health, on to which we can peg
> firmware health. But *this* patch series is not about health status and
> letting users query it, it's about a *critical* situation which has come up
> with firmware requiring me to reboot my system, and the lack of *any*
> infrastructure in the kernel today to inform userspace about it.
>
> So say we use netlink to report a critical health situation, how are we
> informing userspace with your patch series about requiring a reboot?

One of the main features of netlink is the pub/sub model of notifications.

Whatever you imagine listening to your uevent can listen to
devlink-health notifications via devlink.

In fact I've shown this off in the RFC patches I sent to you, see
the devlink mon health command being used.
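To illustrate the pub/sub model mentioned above from the uevent side, here is a minimal userspace sketch of subscribing to kernel uevents over a netlink socket and splitting one message into its fields. It is not taken from the series or from the devlink RFC. The helper names are illustrative, the protocol number and multicast group are assumed from <linux/netlink.h> and common usage, and the thread does not say which variables a taint uevent would carry, so the parser simply returns whatever KEY=VALUE pairs arrive. Devlink health notifications travel over generic netlink instead and are not shown here.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # assumed value, from <linux/netlink.h>
UEVENT_GROUP_KERNEL = 1      # multicast group for kernel-originated uevents


def open_uevent_socket():
    """Subscribe to kernel uevents (Linux only; may need privileges)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    # Address is (pid, groups); pid 0 lets the kernel pick a port id.
    sock.bind((0, UEVENT_GROUP_KERNEL))
    return sock


def parse_uevent(datagram):
    """Split a kernel uevent into its header and a dict of variables.

    Kernel uevents are NUL-separated: an "ACTION@DEVPATH" header
    followed by KEY=VALUE strings.
    """
    fields = [f for f in datagram.split(b"\0") if f]
    if not fields:
        return "", {}
    header = fields[0].decode(errors="replace")
    env = {}
    for field in fields[1:]:
        key, sep, value = field.decode(errors="replace").partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            env[key] = value
    return header, env
```

A listener would loop on `parse_uevent(sock.recv(65536))` and act on the variables it cares about, for example by handing off to a watchdog.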
On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> > I read your patch, and granted, I will accept I was under the incorrect
> > assumption that this can only be used by networking devices, however
> > the devlink approach achieves getting userspace the ability with
> > iproute2 devlink util to query a device health, on to which we can peg
> > firmware health. But *this* patch series is not about health status and
> > letting users query it, it's about a *critical* situation which has come up
> > with firmware requiring me to reboot my system, and the lack of *any*
> > infrastructure in the kernel today to inform userspace about it.
> >
> > So say we use netlink to report a critical health situation, how are we
> > informing userspace with your patch series about requiring a reboot?
>
> One of the main features of netlink is the pub/sub model of notifications.
>
> Whatever you imagine listening to your uevent can listen to
> devlink-health notifications via devlink.
>
> In fact I've shown this off in the RFC patches I sent to you, see
> the devlink mon health command being used.

Yes, but I looked at iproute2 devlink and it seems I made an incorrect
assumption that this can only be used for a network device rather than
a struct device.

I'll take a second look.

  Luis
On 05/28/2020 07:27 AM, Luis Chamberlain wrote:
> On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
>> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
>>> I read your patch, and granted, I will accept I was under the incorrect
>>> assumption that this can only be used by networking devices, however
>>> the devlink approach achieves getting userspace the ability with
>>> iproute2 devlink util to query a device health, on to which we can peg
>>> firmware health. But *this* patch series is not about health status and
>>> letting users query it, it's about a *critical* situation which has come up
>>> with firmware requiring me to reboot my system, and the lack of *any*
>>> infrastructure in the kernel today to inform userspace about it.
>>>
>>> So say we use netlink to report a critical health situation, how are we
>>> informing userspace with your patch series about requiring a reboot?
>>
>> One of the main features of netlink is the pub/sub model of notifications.
>>
>> Whatever you imagine listening to your uevent can listen to
>> devlink-health notifications via devlink.
>>
>> In fact I've shown this off in the RFC patches I sent to you, see
>> the devlink mon health command being used.
>
> Yes, but I looked at iproute2 devlink and it seems I made an incorrect
> assumption that this can only be used for a network device rather than
> a struct device.
>
> I'll take a second look.

Hello Jakub,

I'm thinking about something similar to what Luis is proposing, but in my
case I'd like to report just when the driver knows the hardware is gone
and cannot be recovered, like when this is reported:

[ 2548.851832] WARNING: CPU: 3 PID: 98 at backports-4.19.98-1/net/mac80211/util.c:2040 ieee80211_reconfig+0x98/0xb64 [mac80211]
[ 2548.856020] Hardware became unavailable during restart.

I'd like to be able to tie this into a watchdog program to allow an
automatic reboot of the system soon after this event is seen, for instance.

Could you post your devlink RFC patches somewhere public?

Thanks,
Ben
On Thu, May 28, 2020 at 08:04:50AM -0700, Ben Greear wrote:
> Could you post your devlink RFC patches somewhere public?

This cover letter provided a URL to these.

  Luis