Message ID | 20220708185608.676474-2-thierry.reding@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | NVIDIA Tegra changes for v5.20-rc1 |
On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote: > git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc ... > ---------------------------------------------------------------- > soc/tegra: Changes for v5.20-rc1 > > The bulk of these changes is the new CBB driver which is used to provide > (a lot of) information about SErrors when things go wrong, instead of > the kernel just crashing or hanging. > > In addition more SoC information is exposed to sysfs and various minor > issues are fixed. > Hi Thierry, I fear I'm going to skip this for the current merge window. It looks like the CBB driver you add here would fit into the existing drivers/edac/ subsystem, or at the minimum should have been reviewed by the corresponding maintainers (added to Cc) to decide whether it goes there or not. I had not previously seen this driver, but I'll let them have a look first. For the other patches, I found two more problems: > Bitan Biswas (1): > soc/tegra: fuse: Expose Tegra production status Please don't just add random attributes in the soc device infrastructure. This one has a completely generic name but a SoC specific meaning, and it lacks a description in Documentation/ABI. Not sure what the right ABI is here, but this is something that needs to be discussed more broadly when you send a new version. I see there are already some custom attributes in the same device, we should probably not have added those either, but I suppose we are stuck with those, so please add the missing documentation. > YueHaibing (1): > soc/tegra: fuse: Add missing DMADEVICES dependency This one fixes the warning the wrong way: we don't 'select' random drivers from other subsystems, and selecting the entire subsystem makes it worse. Just drop the 'select' here and enable the drivers in the defconfig. Arnd
On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote: > On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc > ... > > ---------------------------------------------------------------- > > soc/tegra: Changes for v5.20-rc1 > > > > The bulk of these changes is the new CBB driver which is used to provide > > (a lot of) information about SErrors when things go wrong, instead of > > the kernel just crashing or hanging. > > > > In addition more SoC information is exposed to sysfs and various minor > > issues are fixed. > > > > Hi Thierry, > > I fear I'm going to skip this for the current merge window. It looks like > the CBB driver you add here would fit into the existing drivers/edac/ > subsystem, or at the minimum should have been reviewed by the > corresponding maintainers (added to Cc) to decide whether it goes > there or not. > > I had not previously seen this driver, but I'll let them have a look first. EDAC looks like it's used primarily for memory controllers, which this is not. But then I also see explicit references to non-memory-controller references in the infrastructure, so perhaps this does fit in there. The CBB driver is primarily a means to provide additional information about runtime errors, so it's not directly a means of discovering the errors (they would be detected anyway and cause a crash) and I don't think we have a means of correcting any of these errors. I'll ask Sumit to work with the EDAC maintainers on this. > For the other patches, I found two more problems: > > > Bitan Biswas (1): > > soc/tegra: fuse: Expose Tegra production status > > Please don't just add random attributes in the soc device infrastructure. > This one has a completely generic name but a SoC specific > meaning, and it lacks a description in Documentation/ABI. 
> Not sure what the right ABI is here, but this is something that needs > to be discussed more broadly when you send a new version. I wasn't aware that the SoC device infrastructure was restricted to only standardized attributes. Looks like there are a few other outliers that add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. Do we have some other place where this kind of thing can be exposed? Or do we just need to come up with some better way of namespacing these? Perhaps it would also be sufficient if all of these were better documented so that people know what to look for on their platform of interest. > I see there are already some custom attributes in the same device, > we should probably not have added those either, but I suppose > we are stuck with those, so please add the missing documentation. Yeah, that's a good point. These should definitely be documented properly. > > > YueHaibing (1): > > soc/tegra: fuse: Add missing DMADEVICES dependency > > This one fixes the warning the wrong way: we don't 'select' random > drivers from other subsystems, and selecting the entire > subsystem makes it worse. Just drop the 'select' here and > enable the drivers in the defconfig. This doesn't actually select the DMADEVICES property. It adds a dependency on DMADEVICES and if that is met it will select TEGRA20_APB_DMA. Thierry
On Wed, Jul 13, 2022 at 12:58 PM Thierry Reding <thierry.reding@gmail.com> wrote: > On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote: > > On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote: > > > > I fear I'm going to skip this for the current merge window. It looks like > > the CBB driver you add here would fit into the existing drivers/edac/ > > subsystem, or at the minimum should have been reviewed by the > > corresponding maintainers (added to Cc) to decide whether it goes > > there or not. > > > > I had not previously seen this driver, but I'll let them have a look first. > > EDAC looks like it's used primarily for memory controllers, which this > is not. But then I also see explicit references to non-memory-controller > references in the infrastructure, so perhaps this does fit in there. The > CBB driver is primarily a means to provide additional information about > runtime errors, so it's not directly a means of discovering the errors > (they would be detected anyway and cause a crash) and I don't think we > have a means of correcting any of these errors. I think this is just a reflection of what other hardware can do: most machines only detect memory errors, but the EDAC subsystem can work with any type in principle. There are also a lot of conditions elsewhere that can be detected but not corrected. > I'll ask Sumit to work with the EDAC maintainers on this. Thanks > > For the other patches, I found two more problems: > > > > > Bitan Biswas (1): > > > soc/tegra: fuse: Expose Tegra production status > > > > Please don't just add random attributes in the soc device infrastructure. > > This one has a completely generic name but a SoC specific > > meaning, and it lacks a description in Documentation/ABI. > > Not sure what the right ABI is here, but this is something that needs > > to be discussed more broadly when you send a new version. 
> > I wasn't aware that the SoC device infrastructure was restricted to only > standardized attributes. Looks like there are a few other outliers that > add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. > > Do we have some other place where this kind of thing can be exposed? Or > do we just need to come up with some better way of namespacing these? > Perhaps it would also be sufficient if all of these were better > documented so that people know what to look for on their platform of > interest. It's not a 100% strict rule, I've just tried to limit it as much as possible, and sometimes missed drivers doing it anyway. My main goal here is to make things consistent between SoC families, so if one piece of information is provided by a number of them, I'd rather have a standard attribute, or a common way of encoding this in the existing attributes than to have too many custom attributes with similar names. > > > YueHaibing (1): > > > soc/tegra: fuse: Add missing DMADEVICES dependency > > > > This one fixes the warning the wrong way: we don't 'select' random > > drivers from other subsystems, and selecting the entire > > subsystem makes it worse. Just drop the 'select' here and > > enable the drivers in the defconfig. > > This doesn't actually select the DMADEVICES property. It adds a > dependency on DMADEVICES and if that is met it will select > TEGRA20_APB_DMA. My mistake. However, I still think it's wrong to select TEGRA20_APB_DMA here, unless there is a build-time dependency that prevents it from being compiled otherwise. The dmaengine subsystem is meant to abstract the relation between the drivers using DMA and those providing the feature, the same way we abstract all the other subsystems. The fuse driver may only be used on machines that use TEGRA20_APB_DMA, but neither the driver code nor Kconfig should care about that. Arnd
On 13/07/2022 13:14, Arnd Bergmann wrote: ... >>> For the other patches, I found two more problems: >>> >>>> Bitan Biswas (1): >>>> soc/tegra: fuse: Expose Tegra production status >>> >>> Please don't just add random attributes in the soc device infrastructure. >>> This one has a completely generic name but a SoC specific >>> meaning, and it lacks a description in Documentation/ABI. >>> Not sure what the right ABI is here, but this is something that needs >>> to be discussed more broadly when you send a new version. >> >> I wasn't aware that the SoC device infrastructure was restricted to only >> standardized attributes. Looks like there are a few other outliers that >> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. >> >> Do we have some other place where this kind of thing can be exposed? Or >> do we just need to come up with some better way of namespacing these? >> Perhaps it would also be sufficient if all of these were better >> documented so that people know what to look for on their platform of >> interest. > > It's not a 100% strict rule, I've just tried to limit it as much as possible, > and sometimes missed drivers doing it anyway. My main goal here is > to make things consistent between SoC families, so if one piece of > information is provided by a number of them, I'd rather have a standard > attribute, or a common way of encoding this in the existing attributes > than to have too many custom attributes with similar names. Makes sense. Any recommendations for this specific attribute? I could imagine other vendors may have engineering devices and production versions. This is slightly different from the silicon version. Cheers Jon
On Wed, Jul 13, 2022 at 2:19 PM Jon Hunter <jonathanh@nvidia.com> wrote: > On 13/07/2022 13:14, Arnd Bergmann wrote: > >>> For the other patches, I found two more problems: > >>> > >>>> Bitan Biswas (1): > >>>> soc/tegra: fuse: Expose Tegra production status > >>> > >>> Please don't just add random attributes in the soc device infrastructure. > >>> This one has a completely generic name but a SoC specific > >>> meaning, and it lacks a description in Documentation/ABI. > >>> Not sure what the right ABI is here, but this is something that needs > >>> to be discussed more broadly when you send a new version. > >> > >> I wasn't aware that the SoC device infrastructure was restricted to only > >> standardized attributes. Looks like there are a few other outliers that > >> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. > >> > >> Do we have some other place where this kind of thing can be exposed? Or > >> do we just need to come up with some better way of namespacing these? > >> Perhaps it would also be sufficient if all of these were better > >> documented so that people know what to look for on their platform of > >> interest. > > > > It's not a 100% strict rule, I've just tried to limit it as much as possible, > > and sometimes missed drivers doing it anyway. My main goal here is > > to make things consistent between SoC families, so if one piece of > > information is provided by a number of them, I'd rather have a standard > > attribute, or a common way of encoding this in the existing attributes > > than to have too many custom attributes with similar names. > > > Makes sense. Any recommendations for this specific attribute? I could > imagine other vendors may have engineering devices and production > versions. This is slightly different from the silicon version. Not sure, I haven't seen this one referenced elsewhere so far. What is the actual information this encodes in your case? 
Is this fused down in a way that production devices lose access to certain features that could be security critical but are useful for development? Arnd
On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote: > On Wed, Jul 13, 2022 at 12:58 PM Thierry Reding > <thierry.reding@gmail.com> wrote: > > On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote: > > > On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > I fear I'm going to skip this for the current merge window. It looks like > > > the CBB driver you add here would fit into the existing drivers/edac/ > > > subsystem, or at the minimum should have been reviewed by the > > > corresponding maintainers (added to Cc) to decide whether it goes > > > there or not. > > > > > > I had not previously seen this driver, but I'll let them have a look first. > > > > EDAC looks like it's used primarily for memory controllers, which this > > is not. But then I also see explicit references to non-memory-controller > > references in the infrastructure, so perhaps this does fit in there. The > > CBB driver is primarily a means to provide additional information about > > runtime errors, so it's not directly a means of discovering the errors > > (they would be detected anyway and cause a crash) and I don't think we > > have a means of correcting any of these errors. > > I think this is just a reflection of what other hardware can do: > most machines only detect memory errors, but the EDAC subsystem > can work with any type in principle. There are also a lot of > conditions elsewhere that can be detected but not corrected. > > > I'll ask Sumit to work with the EDAC maintainers on this. > > Thanks > > > > For the other patches, I found two more problems: > > > > > > > Bitan Biswas (1): > > > > soc/tegra: fuse: Expose Tegra production status > > > > > > Please don't just add random attributes in the soc device infrastructure. > > > This one has a completely generic name but a SoC specific > > > meaning, and it lacks a description in Documentation/ABI. 
> > > Not sure what the right ABI is here, but this is something that needs > > > to be discussed more broadly when you send a new version. > > > > I wasn't aware that the SoC device infrastructure was restricted to only > > standardized attributes. Looks like there are a few other outliers that > > add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. > > > > Do we have some other place where this kind of thing can be exposed? Or > > do we just need to come up with some better way of namespacing these? > > Perhaps it would also be sufficient if all of these were better > > documented so that people know what to look for on their platform of > > interest. > > It's not a 100% strict rule, I've just tried to limit it as much as possible, > and sometimes missed drivers doing it anyway. My main goal here is > to make things consistent between SoC families, so if one piece of > information is provided by a number of them, I'd rather have a standard > attribute, or a common way of encoding this in the existing attributes > than to have too many custom attributes with similar names. The major/minor attributes that we have on Tegra SoCs should be easy to standardize. It seems like those could be fairly common. The other one that we have is the "platform" one, which I suppose is not as easy to standardize. I don't recall the exact details, but I think we're mostly interested in whether or not the platform is simulation or silicon. The exact simulation value is not something that userspace scripts will look at, as far as I recall. Jon, correct me if I'm wrong. Perhaps this can be deprecated in favour of a more standardized property that can more easily be implemented on other SoCs. The production mode is something that is read from a fuse and we expose those via the nvmem subsystem already. 
Currently nvmem exposes only a binary attribute in sysfs that userspace would need to parse and ideally we'd have something a little easier to work with, but perhaps nvmem can be enhanced to expose individual cells as separate attributes in some standard format. We also have some other values in the fuses that we want to make available to userspace (IDs and that sort of thing), so it's good that you noticed this now before we would've added even more. > > > > YueHaibing (1): > > > > soc/tegra: fuse: Add missing DMADEVICES dependency > > > > > > This one fixes the warning the wrong way: we don't 'select' random > > > drivers from other subsystems, and selecting the entire > > > subsystem makes it worse. Just drop the 'select' here and > > > enable the drivers in the defconfig. > > > > This doesn't actually select the DMADEVICES property. It adds a > > dependency on DMADEVICES and if that is met it will select > > TEGRA20_APB_DMA. > > My mistake. However, I still think it's wrong to select > TEGRA20_APB_DMA here, unless there is a build-time > dependency that prevents it from being compiled otherwise. > > The dmaengine subsystem is meant to abstract the relation > between the drivers using DMA and those providing the feature, > the same way we abstract all the other subsystems. The > fuse driver may only be used on machines that use > TEGRA20_APB_DMA, but neither the driver code nor > Kconfig should care about that. This dependency has existed for quite a while and my recollection is that we wanted to make this very explicit because the lack of the TEGRA20_APB_DMA driver makes the FUSE driver completely useless on Tegra20 and that in turn has a very negative impact on the rest of the system, so we deemed a default configuration change insufficient. Perhaps a better way to solve this would be to make TEGRA20_APB_DMA default to "y" if ARCH_TEGRA_2x_SOC. And then perhaps make the FUSE driver depend on DMADEVICES. 
That still wouldn't ensure that we get SOC_TEGRA_FUSE enabled automatically all the time, but perhaps it'd document the dependency a bit more explicitly. Thierry
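The remark above that nvmem currently exposes only a binary attribute which userspace would need to parse can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the byte offset, the bit position, the little-endian 32-bit word layout, and the helper name `fuse_bit` are hypothetical and are not taken from this thread or from the Tegra fuse driver.

```python
import struct

# Hypothetical values for illustration only. The real offset and bit
# position of the production-mode fuse are not given in this thread.
PRODUCTION_MODE_OFFSET = 0x00  # assumed byte offset of the fuse word
PRODUCTION_MODE_BIT = 0        # assumed bit index within that word


def fuse_bit(blob: bytes, offset: int, bit: int) -> bool:
    """Return True if `bit` is set in the 32-bit word at byte `offset`.

    The word is assumed to be stored little-endian in the blob.
    """
    if offset < 0 or offset + 4 > len(blob):
        raise ValueError("offset outside of the nvmem blob")
    if not 0 <= bit < 32:
        raise ValueError("bit index must be in the range 0..31")
    (word,) = struct.unpack_from("<I", blob, offset)
    return bool((word >> bit) & 1)


def read_blob(path: str) -> bytes:
    """Read the whole binary nvmem attribute from a caller-supplied path."""
    with open(path, "rb") as f:
        return f.read()
```

A script would call `read_blob()` on the nvmem binary attribute of its platform and then `fuse_bit(blob, PRODUCTION_MODE_OFFSET, PRODUCTION_MODE_BIT)`. Having to hard-code such offsets in every script is the kind of friction that exposing individual cells as separate attributes would remove.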
On 13/07/2022 21:22, Thierry Reding wrote: ... >>>>> Bitan Biswas (1): >>>>> soc/tegra: fuse: Expose Tegra production status >>>> >>>> Please don't just add random attributes in the soc device infrastructure. >>>> This one has a completely generic name but a SoC specific >>>> meaning, and it lacks a description in Documentation/ABI. >>>> Not sure what the right ABI is here, but this is something that needs >>>> to be discussed more broadly when you send a new version. >>> >>> I wasn't aware that the SoC device infrastructure was restricted to only >>> standardized attributes. Looks like there are a few other outliers that >>> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. >>> >>> Do we have some other place where this kind of thing can be exposed? Or >>> do we just need to come up with some better way of namespacing these? >>> Perhaps it would also be sufficient if all of these were better >>> documented so that people know what to look for on their platform of >>> interest. >> >> It's not a 100% strict rule, I've just tried to limit it as much as possible, >> and sometimes missed drivers doing it anyway. My main goal here is >> to make things consistent between SoC families, so if one piece of >> information is provided by a number of them, I'd rather have a standard >> attribute, or a common way of encoding this in the existing attributes >> than to have too many custom attributes with similar names. > > The major/minor attributes that we have on Tegra SoCs should be easy to > standardize. It seems like those could be fairly common. The other one > that we have is the "platform" one, which I suppose is not as easy to > standardize. I don't recall the exact details, but I think we're mostly > interested in whether or not the platform is simulation or silicon. The > exact simulation value is not something that userspace scripts will look > at, as far as I recall. > > Jon, correct me if I'm wrong. 
There are a few different simulation types and I have seen some userspace code convert the value and display the actual type. However, in reality I am not sure how much this is used, but yes at least identifying that this is silicon is used widely from what I have seen. Jon
On 13/07/2022 13:36, Arnd Bergmann wrote: > On Wed, Jul 13, 2022 at 2:19 PM Jon Hunter <jonathanh@nvidia.com> wrote: >> On 13/07/2022 13:14, Arnd Bergmann wrote: >>>>> For the other patches, I found two more problems: >>>>> >>>>>> Bitan Biswas (1): >>>>>> soc/tegra: fuse: Expose Tegra production status >>>>> >>>>> Please don't just add random attributes in the soc device infrastructure. >>>>> This one has a completely generic name but a SoC specific >>>>> meaning, and it lacks a description in Documentation/ABI. >>>>> Not sure what the right ABI is here, but this is something that needs >>>>> to be discussed more broadly when you send a new version. >>>> >>>> I wasn't aware that the SoC device infrastructure was restricted to only >>>> standardized attributes. Looks like there are a few other outliers that >>>> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2. >>>> >>>> Do we have some other place where this kind of thing can be exposed? Or >>>> do we just need to come up with some better way of namespacing these? >>>> Perhaps it would also be sufficient if all of these were better >>>> documented so that people know what to look for on their platform of >>>> interest. >>> >>> It's not a 100% strict rule, I've just tried to limit it as much as possible, >>> and sometimes missed drivers doing it anyway. My main goal here is >>> to make things consistent between SoC families, so if one piece of >>> information is provided by a number of them, I'd rather have a standard >>> attribute, or a common way of encoding this in the existing attributes >>> than to have too many custom attributes with similar names. >> >> >> Makes sense. Any recommendations for this specific attribute? I could >> imagine other vendors may have engineering devices and production >> versions. This is slightly different from the silicon version. > > Not sure, I haven't seen this one referenced elsewhere so far. > > What is the actual information this encodes in your case? 
Is this fused > down in a way that production devices lose access to certain features > that could be security critical but are useful for development? Yes I believe it is precisely that. Exact details I am not clear on, but I see a lot of references to this throughout our userspace and testing code. Jon
On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> I think this is just a reflection of what other hardware can do:
> most machines only detect memory errors, but the EDAC subsystem
> can work with any type in principle. There are also a lot of
> conditions elsewhere that can be detected but not corrected.

Just a couple of thoughts from looking at this:

So the EDAC thing reports *hardware* errors by using the RAS
capabilities built into an IP block. So it started with memory
controllers but it is getting extended to other blocks. AMD are looking
at how to integrate GPU hw errors reporting into it, for example.

Looking at that CBB thing, it looks like it is supposed to report not
so much hardware errors but operational errors. Some of the hw errors
reported by RAS hw are also operation-related but not the majority.

Then, EDAC has this counters exposed in:

$ grep -r . /sys/devices/system/edac/
/sys/devices/system/edac/power/runtime_active_time:0
/sys/devices/system/edac/power/runtime_status:unsupported
/sys/devices/system/edac/power/runtime_suspended_time:0
/sys/devices/system/edac/power/control:auto
/sys/devices/system/edac/pci/edac_pci_log_pe:1
/sys/devices/system/edac/pci/pci0/pe_count:0
/sys/devices/system/edac/pci/pci0/npe_count:0
/sys/devices/system/edac/pci/pci_parity_count:0
/sys/devices/system/edac/pci/pci_nonparity_count:0
/sys/devices/system/edac/pci/edac_pci_log_npe:1
/sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
/sys/devices/system/edac/pci/check_pci_errors:0
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
...

with the respective hierarchy: memory controllers, PCI errors, etc.

So the main question is, does it make sense for you to fit this into the
EDAC hierarchy and what would even be the advantage of making it part of
EDAC?

HTH.
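As a rough userspace counterpart to the `grep -r .` listing above, the following sketch collects only the `*_count` entries from an EDAC-style directory tree. The helper name `read_counters` and the choice to skip unreadable or non-numeric files are assumptions made for this example; the directory layout is taken from the listing, not from any EDAC documentation.

```python
from pathlib import Path


def read_counters(root) -> dict:
    """Map each `*_count` file below `root` to its integer value.

    Keys are paths relative to `root`, written with forward slashes.
    Files that cannot be read or do not hold an integer are skipped.
    """
    root = Path(root)
    counters = {}
    for path in sorted(root.rglob("*_count")):
        if not path.is_file():
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue
        counters[path.relative_to(root).as_posix()] = value
    return counters


if __name__ == "__main__":
    # On a machine without EDAC support this simply prints nothing.
    for name, value in read_counters("/sys/devices/system/edac").items():
        print(f"{name}: {value}")
```

A per-block counter hierarchy like this is what a driver would gain by joining EDAC, which is why the question of whether CBB errors fit that model matters.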
On Wed, Jul 13, 2022 at 10:22 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> >
> > It's not a 100% strict rule, I've just tried to limit it as much as possible,
> > and sometimes missed drivers doing it anyway. My main goal here is
> > to make things consistent between SoC families, so if one piece of
> > information is provided by a number of them, I'd rather have a standard
> > attribute, or a common way of encoding this in the existing attributes
> > than to have too many custom attributes with similar names.
>
> The major/minor attributes that we have on Tegra SoCs should be easy to
> standardize. It seems like those could be fairly common.

I think these can just be folded into one of the other attributes,
probably either revision or soc_id depending on what they actually
refer to. These properties are intentionally free-text fields that you
can match using wildcards with the soc_device_match() function.

If I read this part right, the information is already available in the
soc_id field, so we don't even need to change anything here.

> The other one
> that we have is the "platform" one, which I suppose is not as easy to
> standardize. I don't recall the exact details, but I think we're mostly
> interested in whether or not the platform is simulation or silicon. The
> exact simulation value is not something that userspace scripts will look
> at, as far as I recall.

This also looks like it's part of the chip_id.

> > > > > YueHaibing (1):
> > > > >       soc/tegra: fuse: Add missing DMADEVICES dependency
> > > >
> > > > This one fixes the warning the wrong way: we don't 'select' random
> > > > drivers from other subsystems, and selecting the entire
> > > > subsystem makes it worse. Just drop the 'select' here and
> > > > enable the drivers in the defconfig.
> > >
> > > This doesn't actually select the DMADEVICES property. It adds a
> > > dependency on DMADEVICES and if that is met it will select
> > > TEGRA20_APB_DMA.
> >
> > My mistake. However, I still think it's wrong to select
> > TEGRA20_APB_DMA here, unless there is a build-time
> > dependency that prevents it from being compiled otherwise.
> >
> > The dmaengine subsystem is meant to abstract the relation
> > between the drivers using DMA and those providing the feature,
> > the same way we abstract all the other subsystems. The
> > fuse driver may only be used on machines that use
> > TEGRA20_APB_DMA, but neither the driver code nor
> > Kconfig should care about that.
>
> This dependency has existed for quite a while and my recollection is
> that we wanted to make this very explicit because the lack of the
> TEGRA20_APB_DMA driver makes the FUSE driver completely useless on
> Tegra20 and that in turn has a very negative impact on the rest of the
> system, so we deemed a default configuration change insufficient.
>
> Perhaps a better way to solve this would be to make TEGRA20_APB_DMA
> default to "y" if ARCH_TEGRA_2x_SOC. And then perhaps make the FUSE
> driver depend on DMADEVICES. That still wouldn't ensure that we get
> SOC_TEGRA_FUSE enabled automatically all the time, but perhaps it'd
> document the dependency a bit more explicitly.

Ok, this sounds good to me.

     Arnd
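The point above about the soc attributes being free-text fields that can be matched with wildcards can be sketched in userspace terms. This is only an analogy to what soc_device_match() does inside the kernel: it uses Python's `fnmatchcase`, whose glob syntax may differ in detail from the kernel's matcher, and the attribute values and the helper name `soc_matches` are made up for illustration.

```python
from fnmatch import fnmatchcase


def soc_matches(attrs: dict, pattern: dict) -> bool:
    """Return True if every field named in `pattern` glob-matches `attrs`.

    Fields absent from `pattern` are ignored. A field that `pattern`
    names but `attrs` lacks counts as a mismatch.
    """
    return all(
        key in attrs and fnmatchcase(attrs[key], glob)
        for key, glob in pattern.items()
    )


# Made-up attribute values, not read from any real device.
example_soc = {"family": "Tegra", "soc_id": "33", "revision": "A02"}
```

With these values, `soc_matches(example_soc, {"family": "Tegra", "revision": "A0*"})` is true while `soc_matches(example_soc, {"soc_id": "2*"})` is false. Folding major/minor information into an existing field keeps this kind of matching usable without any new attribute.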
Hi Arnd, Boris, Thank you for your inputs. >> I think this is just a reflection of what other hardware can do: >> most machines only detect memory errors, but the EDAC subsystem >> can work with any type in principle. There are also a lot of >> conditions elsewhere that can be detected but not corrected. > > Just a couple of thoughts from looking at this: > > So the EDAC thing reports *hardware* errors by using the RAS > capabilities built into an IP block. So it started with memory > controllers but it is getting extended to other blocks. AMD are looking > at how to integrate GPU hw errors reporting into it, for example. > > Looking at that CBB thing, it looks like it is supposed to report not > so much hardware errors but operational errors. Some of the hw errors > reported by RAS hw are also operation-related but not the majority. > CBB driver reports errors due to bad MMIO accesses within software. The vast majority of the CBB errors tend to be programming errors in setting up address windows leading to decode errors. > Then, EDAC has this counters exposed in: > > $ grep -r . /sys/devices/system/edac/ > /sys/devices/system/edac/power/runtime_active_time:0 > /sys/devices/system/edac/power/runtime_status:unsupported > /sys/devices/system/edac/power/runtime_suspended_time:0 > /sys/devices/system/edac/power/control:auto > /sys/devices/system/edac/pci/edac_pci_log_pe:1 > /sys/devices/system/edac/pci/pci0/pe_count:0 > /sys/devices/system/edac/pci/pci0/npe_count:0 > /sys/devices/system/edac/pci/pci_parity_count:0 > /sys/devices/system/edac/pci/pci_nonparity_count:0 > /sys/devices/system/edac/pci/edac_pci_log_npe:1 > /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0 > /sys/devices/system/edac/pci/check_pci_errors:0 > /sys/devices/system/edac/mc/power/runtime_active_time:0 > /sys/devices/system/edac/mc/power/runtime_status:unsupported > ... > > with the respective hierarchy: memory controllers, PCI errors, etc. 
> > So the main question is, does it make sense for you to fit this into the > EDAC hierarchy and what would even be the advantage of making it part of > EDAC? > I also think this doesn't seem to fit with the errors reported by EDAC which are mainly hardware errors as Boris explained. Please share your thoughts and if we can merge the patches as it is. > HTH. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jul 15, 2022 at 01:36:16PM +0530, Sumit Gupta wrote:
> Hi Arnd, Boris,
>
> Thank you for your inputs.
>
> > > I think this is just a reflection of what other hardware can do:
> > > most machines only detect memory errors, but the EDAC subsystem
> > > can work with any type in principle. There are also a lot of
> > > conditions elsewhere that can be detected but not corrected.
> >
> > Just a couple of thoughts from looking at this:
> >
> > So the EDAC thing reports *hardware* errors by using the RAS
> > capabilities built into an IP block. So it started with memory
> > controllers but it is getting extended to other blocks. AMD are looking
> > at how to integrate GPU hw errors reporting into it, for example.
> >
> > Looking at that CBB thing, it looks like it is supposed to report not
> > so much hardware errors but operational errors. Some of the hw errors
> > reported by RAS hw are also operation-related but not the majority.
> >
>
> CBB driver reports errors due to bad MMIO accesses within software.
> The vast majority of the CBB errors tend to be programming errors in setting
> up address windows leading to decode errors.
>
> > Then, EDAC has this counters exposed in:
> >
> > $ grep -r . /sys/devices/system/edac/
> > /sys/devices/system/edac/power/runtime_active_time:0
> > /sys/devices/system/edac/power/runtime_status:unsupported
> > /sys/devices/system/edac/power/runtime_suspended_time:0
> > /sys/devices/system/edac/power/control:auto
> > /sys/devices/system/edac/pci/edac_pci_log_pe:1
> > /sys/devices/system/edac/pci/pci0/pe_count:0
> > /sys/devices/system/edac/pci/pci0/npe_count:0
> > /sys/devices/system/edac/pci/pci_parity_count:0
> > /sys/devices/system/edac/pci/pci_nonparity_count:0
> > /sys/devices/system/edac/pci/edac_pci_log_npe:1
> > /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> > /sys/devices/system/edac/pci/check_pci_errors:0
> > /sys/devices/system/edac/mc/power/runtime_active_time:0
> > /sys/devices/system/edac/mc/power/runtime_status:unsupported
> > ...
> >
> > with the respective hierarchy: memory controllers, PCI errors, etc.
> >
> > So the main question is, does it make sense for you to fit this into the
> > EDAC hierarchy and what would even be the advantage of making it part of
> > EDAC?
> >
>
> I also think this doesn't seem to fit with the errors reported by EDAC which
> are mainly hardware errors as Boris explained.
> Please share your thoughts and if we can merge the patches as it is.

Arnd,

any more thoughts on this? Looks like there is no consensus on where
this should go. If it's okay for this to go in via ARM SoC after all,
I could prepare another pull request including only the CBB changes
along with some of the reference count fixes. I could possibly also
rework the DMADEVICES dependency patch as discussed, or we could defer
it if it's too risky at this point.

Thierry
> On Fri, Jul 15, 2022 at 01:36:16PM +0530, Sumit Gupta wrote:
>> Hi Arnd, Boris,
>>
>> Thank you for your inputs.
>>
>>>> I think this is just a reflection of what other hardware can do:
>>>> most machines only detect memory errors, but the EDAC subsystem
>>>> can work with any type in principle. There are also a lot of
>>>> conditions elsewhere that can be detected but not corrected.
>>> Just a couple of thoughts from looking at this:
>>>
>>> So the EDAC thing reports *hardware* errors by using the RAS
>>> capabilities built into an IP block. So it started with memory
>>> controllers but it is getting extended to other blocks. AMD are looking
>>> at how to integrate GPU hw errors reporting into it, for example.
>>>
>>> Looking at that CBB thing, it looks like it is supposed to report not
>>> so much hardware errors but operational errors. Some of the hw errors
>>> reported by RAS hw are also operation-related but not the majority.
>>>
>> CBB driver reports errors due to bad MMIO accesses within software.
>> The vast majority of the CBB errors tend to be programming errors in setting
>> up address windows leading to decode errors.
>>
>>> Then, EDAC has this counters exposed in:
>>>
>>> $ grep -r . /sys/devices/system/edac/
>>> /sys/devices/system/edac/power/runtime_active_time:0
>>> /sys/devices/system/edac/power/runtime_status:unsupported
>>> /sys/devices/system/edac/power/runtime_suspended_time:0
>>> /sys/devices/system/edac/power/control:auto
>>> /sys/devices/system/edac/pci/edac_pci_log_pe:1
>>> /sys/devices/system/edac/pci/pci0/pe_count:0
>>> /sys/devices/system/edac/pci/pci0/npe_count:0
>>> /sys/devices/system/edac/pci/pci_parity_count:0
>>> /sys/devices/system/edac/pci/pci_nonparity_count:0
>>> /sys/devices/system/edac/pci/edac_pci_log_npe:1
>>> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
>>> /sys/devices/system/edac/pci/check_pci_errors:0
>>> /sys/devices/system/edac/mc/power/runtime_active_time:0
>>> /sys/devices/system/edac/mc/power/runtime_status:unsupported
>>> ...
>>>
>>> with the respective hierarchy: memory controllers, PCI errors, etc.
>>>
>>> So the main question is, does it make sense for you to fit this into the
>>> EDAC hierarchy and what would even be the advantage of making it part of
>>> EDAC?
>>>
>> I also think this doesn't seem to fit with the errors reported by EDAC which
>> are mainly hardware errors as Boris explained.
>> Please share your thoughts and if we can merge the patches as it is.
> Arnd,
>
> any more thoughts on this? Looks like there is no consensus on where
> this should go. If it's okay for this to go in via ARM SoC after all,
> I could prepare another pull request including only the CBB changes
> along with some of the reference count fixes. I could possibly also
> rework the DMADEVICES dependency patch as discussed, or we could defer
> it if it's too risky at this point.
>
> Thierry

Hi Arnd, Thierry,

Gentle ping.
If we are OK with the reasoning, can we please queue the patch series
for 6.1?

Thank you,
Sumit
On Thu, Jul 14, 2022 at 03:31:07PM +0200, Borislav Petkov wrote:
> On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> > I think this is just a reflection of what other hardware can do:
> > most machines only detect memory errors, but the EDAC subsystem
> > can work with any type in principle. There are also a lot of
> > conditions elsewhere that can be detected but not corrected.
>
> Just a couple of thoughts from looking at this:
>
> So the EDAC thing reports *hardware* errors by using the RAS
> capabilities built into an IP block. So it started with memory
> controllers but it is getting extended to other blocks. AMD are looking
> at how to integrate GPU hw errors reporting into it, for example.
>
> Looking at that CBB thing, it looks like it is supposed to report not
> so much hardware errors but operational errors. Some of the hw errors
> reported by RAS hw are also operation-related but not the majority.
>
> Then, EDAC has this counters exposed in:
>
> $ grep -r . /sys/devices/system/edac/
> /sys/devices/system/edac/power/runtime_active_time:0
> /sys/devices/system/edac/power/runtime_status:unsupported
> /sys/devices/system/edac/power/runtime_suspended_time:0
> /sys/devices/system/edac/power/control:auto
> /sys/devices/system/edac/pci/edac_pci_log_pe:1
> /sys/devices/system/edac/pci/pci0/pe_count:0
> /sys/devices/system/edac/pci/pci0/npe_count:0
> /sys/devices/system/edac/pci/pci_parity_count:0
> /sys/devices/system/edac/pci/pci_nonparity_count:0
> /sys/devices/system/edac/pci/edac_pci_log_npe:1
> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> /sys/devices/system/edac/pci/check_pci_errors:0
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> ...
>
> with the respective hierarchy: memory controllers, PCI errors, etc.
>
> So the main question is, does it make sense for you to fit this into the
> EDAC hierarchy and what would even be the advantage of making it part of
> EDAC?

Closing the loop on this: we've decided to keep this in drivers/soc for
now, with the option of re-evaluating when we encounter similar
functionality on other hardware.

I'm also going to hijack the thread because something else came up
recently that fits the audience here and it's up the same alley: on
Tegra234 a mechanism, called FSI (Functional Safety Island), exists to
report failures to an external MCU that's monitoring the system. Special
hardware exists in the SoC that can send these errors to the MCU via
different transports, and the idea is to report software-detected
failures from kernel drivers such as I2C or PCI via this mechanism, so
appropriate action can be taken.

So essentially we're looking at adding some new API, preferably
something generic, to these bus drivers along with "provider" drivers
that get notified of these reports so that they can be forwarded to the
FSI (and then the MCU).

This again doesn't seem to be a great fit for EDAC as it is today, but I
can also not find anything better looking around the kernel. So I'm
wondering if this is something that others have encountered and might
have solved already and I just haven't found it, or if this is something
that would be worth creating a new subsystem for. Or perhaps this could
be integrated into EDAC somehow? I'm a bit reluctant to add yet another
custom infrastructure for this, given that it's functionality that
likely exists in other SoCs as well.

Any thoughts on this?

Thierry
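
For illustration only, the reporter/"provider" split described above could look roughly like the following userspace C sketch. Nothing here is an existing kernel interface: struct err_report, struct err_provider, err_provider_register() and err_report_submit() are hypothetical names made up for this example, and the FSI stand-in merely counts and prints reports instead of talking to hardware. A real implementation would presumably also need locking, unregistration and a defined calling context for the callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical severity levels; not taken from any existing kernel API. */
enum err_severity { ERR_SEV_INFO, ERR_SEV_WARN, ERR_SEV_FATAL };

/* One software-detected failure, as a bus driver might describe it. */
struct err_report {
	const char *source;		/* reporting driver, e.g. "i2c" or "pci" */
	unsigned int code;		/* driver-specific error code */
	enum err_severity severity;
};

/* A "provider" is notified of every report and forwards it somewhere. */
struct err_provider {
	const char *name;
	int (*notify)(struct err_provider *p, const struct err_report *r);
	void *priv;			/* provider-private state */
	struct err_provider *next;	/* singly linked list of providers */
};

static struct err_provider *providers;

/* Add a provider to the head of the list (no locking in this sketch). */
void err_provider_register(struct err_provider *p)
{
	assert(p != NULL && p->notify != NULL);
	p->next = providers;
	providers = p;
}

/*
 * Called by reporters such as bus drivers. Returns the number of
 * providers whose notify callback returned 0 for this report.
 */
int err_report_submit(const struct err_report *r)
{
	int delivered = 0;

	for (struct err_provider *p = providers; p != NULL; p = p->next)
		if (p->notify(p, r) == 0)
			delivered++;

	return delivered;
}

/*
 * Stand-in for an FSI provider: it increments the counter that priv
 * points to and prints the report, where a real driver would forward
 * it to the external MCU.
 */
int fsi_stub_notify(struct err_provider *p, const struct err_report *r)
{
	unsigned int *forwarded = p->priv;

	(*forwarded)++;
	printf("%s: forwarding %s error %u (severity %d)\n",
	       p->name, r->source, r->code, (int)r->severity);
	return 0;
}
```

In this sketch an FSI driver would register one err_provider at probe time with fsi_stub_notify as its callback, and an I2C or PCI driver would call err_report_submit() from its error path without knowing which providers, if any, are listening.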