Message ID | YW5OmSBM4mO1lDHs@Dennis-MBP.local (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | PCI MCFG consolidation and APEI resource filterin | expand |
On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > The commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance > level") fixes the issue that the ACPI/APEI can not access the PCI MCFG > address on x86 platform, but this issue can also happen on other > architectures, for instance, we got below error message on arm64 platform: > ... > APEI: Can not request [mem 0x50100000-0x50100003] for APEI EINJ Trigger registers > ... > > This patch will try to handle this case in a more common way instead of the > original 'arch' specific solution, which will be beneficial to all the > APEI-dependent platforms after that. > > Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> > Reported-by: kernel test robot <lkp@intel.com> The purpose of this patch is not to fix a problem reported by the kernel test robot, so remove this tag. I know the robot found a problem with a previous version of this patch, but we treat that the same as a code review comment. We normally don't explicitly credit reviewers unless it was something major, and then it would go in the commit log, not a "Reported-by" tag. It makes sense to credit the kernel test robot for things found in Linus' tree, but it's a little too aggressive about suggesting the tag for problems with unmerged changes. > Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> This tag can only be added when Lorenzo explicitly supplies it himself. I do not see that on the mailing list, so please remove this tag as well. After Lorenzo supplies it, you can include it in future postings as long as you don't make significant changes to the patch. Bjorn
[+cc Huang in case I'm misinterpreting EINJ resource requests] On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > The commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance > level") fixes the issue that the ACPI/APEI can not access the PCI MCFG > address on x86 platform, but this issue can also happen on other > architectures, for instance, we got below error message on arm64 platform: > ... > APEI: Can not request [mem 0x50100000-0x50100003] for APEI EINJ Trigger registers I'm guessing this address is something in the MCFG area? I wish we could also include the matching MCFG info For x86, we put something like this in the dmesg log, but I guess pci_mcfg.c doesn't do this: PCI: MMCONFIG for domain 0000 [bus 00-7f] at [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) > This patch will try to handle this case in a more common way instead of the > original 'arch' specific solution, which will be beneficial to all the > APEI-dependent platforms after that. This actually doesn't say anything about what the patch does or how it works. It says "handles this case in a more common way" but with no details. The EINJ table contains "injection instructions" that can read or write "register regions" described by generic address structures (see ACPI v6.3, sec 18.6.2 and 18.6.3), and __einj_error_trigger() requests those register regions with request_mem_region() or request_region() before executing the injections instructions. IIUC, this patch basically says "if this region is part of the MCFG area, we don't need to reserve it." That leads to the questions of why we need to reserve *any* of the areas and why it's safe to simply skip reserving regions that are part of the MCFG area. > Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> > Reported-by: kernel test robot <lkp@intel.com> > Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: James Morse <james.morse@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Rafael. J. Wysocki <rafael@kernel.org> > Cc: Tony Luck <tony.luck@intel.com> > Cc: Tomasz Nowicki <tn@semihalf.com> > --- > arch/x86/pci/mmconfig-shared.c | 28 -------------------------- > drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- > 2 files changed, 30 insertions(+), 43 deletions(-) > > diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c > index 0b961fe6..12f7d96 100644 > --- a/arch/x86/pci/mmconfig-shared.c > +++ b/arch/x86/pci/mmconfig-shared.c > @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) > return 0; > } > > -#ifdef CONFIG_ACPI_APEI > -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > - void *data), void *data); > - > -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, > - void *data), void *data) > -{ > - struct pci_mmcfg_region *cfg; > - int rc; > - > - if (list_empty(&pci_mmcfg_list)) > - return 0; > - > - list_for_each_entry(cfg, &pci_mmcfg_list, list) { > - rc = func(cfg->res.start, resource_size(&cfg->res), data); > - if (rc) > - return rc; > - } > - > - return 0; > -} > -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) > -#else > -#define set_apei_filter() > -#endif > - > static void __init __pci_mmcfg_init(int early) > { > pci_mmcfg_reject_broken(early); > @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) > else > acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); > __pci_mmcfg_init(1); > - > - set_apei_filter(); > } > } > > diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c > index c7fdb12..daae75a 100644 > --- a/drivers/acpi/apei/apei-base.c > +++ b/drivers/acpi/apei/apei-base.c > @@ -21,6 +21,7 @@ > #include <linux/kernel.h> > #include <linux/module.h> > #include <linux/init.h> > +#include <linux/pci.h> > #include <linux/acpi.h> > #include <linux/slab.h> > #include <linux/io.h> > @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) > return acpi_nvs_for_each_region(apei_get_res_callback, resources); > } > > -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > - void *data), void *data); > -static int apei_get_arch_resources(struct apei_resources *resources) > +#ifdef CONFIG_PCI > +extern struct list_head pci_mmcfg_list; > +static int apei_filter_mcfg_addr(struct apei_resources *res, > + struct apei_resources *mcfg_res) > +{ > + int rc = 0; > + struct pci_mmcfg_region *cfg; > + > + if (list_empty(&pci_mmcfg_list)) > + return 0; > + > + apei_resources_init(mcfg_res); > + list_for_each_entry(cfg, &pci_mmcfg_list, list) { > + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); > + if (rc) > + return rc; > + } > > + /* filter the mcfg resource from current APEI's */ > + return apei_resources_sub(res, mcfg_res); > +} > +#else > +static inline int apei_filter_mcfg_addr(struct apei_resources *res, > + struct apei_resources *mcfg_res) > { > - return arch_apei_filter_addr(apei_get_res_callback, resources); > + return 0; > } > +#endif > > /* > * IO memory/port resource management mechanism is used to check > @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, > if (rc) > goto nvs_res_fini; > > - if (arch_apei_filter_addr) { > - apei_resources_init(&arch_res); > - rc = apei_get_arch_resources(&arch_res); > - if (rc) > - goto arch_res_fini; > - rc = apei_resources_sub(resources, &arch_res); > - if (rc) > - goto arch_res_fini; > - } > + rc = apei_filter_mcfg_addr(resources, &arch_res); > + if (rc) > + goto arch_res_fini; > > rc = -EINVAL; > list_for_each_entry(res, &resources->iomem, list) { > @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, > release_mem_region(res->start, res->end - res->start); > } > arch_res_fini: > - if (arch_apei_filter_addr) > - apei_resources_fini(&arch_res); > + apei_resources_fini(&arch_res); > nvs_res_fini: > apei_resources_fini(&nvs_resources); > return rc; > -- > 1.8.3.1 >
On 19/10/2021 23:04, Bjorn Helgaas wrote: > On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: >> The commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance >> level") fixes the issue that the ACPI/APEI can not access the PCI MCFG >> address on x86 platform, but this issue can also happen on other >> architectures, for instance, we got below error message on arm64 platform: >> ... >> APEI: Can not request [mem 0x50100000-0x50100003] for APEI EINJ Trigger registers >> ... >> >> This patch will try to handle this case in a more common way instead of the >> original 'arch' specific solution, which will be beneficial to all the >> APEI-dependent platforms after that. >> >> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> >> Reported-by: kernel test robot <lkp@intel.com> > > The purpose of this patch is not to fix a problem reported by the > kernel test robot, so remove this tag. Yes, will do > > I know the robot found a problem with a previous version of this > patch, but we treat that the same as a code review comment. We > normally don't explicitly credit reviewers unless it was something > major, and then it would go in the commit log, not a "Reported-by" > tag. > > It makes sense to credit the kernel test robot for things found in > Linus' tree, but it's a little too aggressive about suggesting the tag > for problems with unmerged changes. > >> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > > This tag can only be added when Lorenzo explicitly supplies it > himself. I do not see that on the mailing list, so please remove this > tag as well. After Lorenzo supplies it, you can include it in future > postings as long as you don't make significant changes to the patch. En, Lorenzo does have comments on the patch#2 and I also update that patch according to his feedback, so why the tag is here. OK, I'll add this tag if Lorenzo can supply it explicitly before I send the next version. > > Bjorn >
On 20/10/2021 03:23, Bjorn Helgaas wrote: > [+cc Huang in case I'm misinterpreting EINJ resource requests] > > On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: >> The commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance >> level") fixes the issue that the ACPI/APEI can not access the PCI MCFG >> address on x86 platform, but this issue can also happen on other >> architectures, for instance, we got below error message on arm64 platform: >> ... >> APEI: Can not request [mem 0x50100000-0x50100003] for APEI EINJ Trigger registers > > I'm guessing this address is something in the MCFG area? Right, correct! The address 0x50100000 is within the MCFG area. > I wish we > could also include the matching MCFG info For x86, we put something > like this in the dmesg log, but I guess pci_mcfg.c doesn't do this: > > PCI: MMCONFIG for domain 0000 [bus 00-7f] at [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) > I can add the similar dmesg log in pci_mcfg.c to make it's consistent between different arches >> This patch will try to handle this case in a more common way instead of the >> original 'arch' specific solution, which will be beneficial to all the >> APEI-dependent platforms after that. > > This actually doesn't say anything about what the patch does or how it > works. It says "handles this case in a more common way" but with no > details. Good suggestion, I'll give more details about that... > The EINJ table contains "injection instructions" that can read or > write "register regions" described by generic address structures (see > ACPI v6.3, sec 18.6.2 and 18.6.3), and __einj_error_trigger() requests > those register regions with request_mem_region() or request_region() > before executing the injections instructions. > > IIUC, this patch basically says "if this region is part of the MCFG > area, we don't need to reserve it." That leads to the questions of why > we need to reserve *any* of the areas AFAIK, the MCFG area is reserved since the ECAM module will provide a generic Kernel Programming Interfaces(KPI), e.g, pci_generic_config_read(...), so all the drivers are allowed to access the pci config space only by those KPIs in a consistent and safe way, direct raw access will break the rule. Correct me if I am missing sth. > and why it's safe to simply skip > reserving regions that are part of the MCFG area. Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error injection tolerance level") before to address this issue, the entire commit log as below: Some BIOSes utilize PCI MMCFG space read/write opertion to trigger specific errors. EINJ will report errors as below when hitting such cases: APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers It is because on x86 platform ACPI based PCI MMCFG logic has reserved all MMCFG spaces so that EINJ can't reserve it again. We already trust the ACPI/APEI code when using the EINJ interface so it is not a big leap to also trust it to access the right MMCFG addresses. Skip address checking to allow the access. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Except that the above explanation, IMO the EINJ is only a RAS debug framework, in this code path, sometimes we need to acesss the address within the MCFG space directly to trigger kind of HW error, which behavior does not like the normal device driver's, in this case some possible unsafe operations(bypass the ecam ops) can be mitigated because the touched device will generate some HW errors and the RAS handling part will preempt its corresponding drivers to fix/log the HW error, that's my understanding about that. > >> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> >> Reported-by: kernel test robot <lkp@intel.com> >> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: James Morse <james.morse@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Rafael. J. Wysocki <rafael@kernel.org> >> Cc: Tony Luck <tony.luck@intel.com> >> Cc: Tomasz Nowicki <tn@semihalf.com> >> --- >> arch/x86/pci/mmconfig-shared.c | 28 -------------------------- >> drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- >> 2 files changed, 30 insertions(+), 43 deletions(-) >> >> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c >> index 0b961fe6..12f7d96 100644 >> --- a/arch/x86/pci/mmconfig-shared.c >> +++ b/arch/x86/pci/mmconfig-shared.c >> @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) >> return 0; >> } >> >> -#ifdef CONFIG_ACPI_APEI >> -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >> - void *data), void *data); >> - >> -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, >> - void *data), void *data) >> -{ >> - struct pci_mmcfg_region *cfg; >> - int rc; >> - >> - if (list_empty(&pci_mmcfg_list)) >> - return 0; >> - >> - list_for_each_entry(cfg, &pci_mmcfg_list, list) { >> - rc = func(cfg->res.start, resource_size(&cfg->res), data); >> - if (rc) >> - return rc; >> - } >> - >> - return 0; >> -} >> -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) >> -#else >> -#define set_apei_filter() >> -#endif >> - >> static void __init __pci_mmcfg_init(int early) >> { >> pci_mmcfg_reject_broken(early); >> @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) >> else >> acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); >> __pci_mmcfg_init(1); >> - >> - set_apei_filter(); >> } >> } >> >> diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c >> index c7fdb12..daae75a 100644 >> --- a/drivers/acpi/apei/apei-base.c >> +++ b/drivers/acpi/apei/apei-base.c >> @@ -21,6 +21,7 @@ >> #include <linux/kernel.h> >> #include <linux/module.h> >> #include <linux/init.h> >> +#include <linux/pci.h> >> #include <linux/acpi.h> >> #include <linux/slab.h> >> #include <linux/io.h> >> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) >> return acpi_nvs_for_each_region(apei_get_res_callback, resources); >> } >> >> -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >> - void *data), void *data); >> -static int apei_get_arch_resources(struct apei_resources *resources) >> +#ifdef CONFIG_PCI >> +extern struct list_head pci_mmcfg_list; >> +static int apei_filter_mcfg_addr(struct apei_resources *res, >> + struct apei_resources *mcfg_res) >> +{ >> + int rc = 0; >> + struct pci_mmcfg_region *cfg; >> + >> + if (list_empty(&pci_mmcfg_list)) >> + return 0; >> + >> + apei_resources_init(mcfg_res); >> + list_for_each_entry(cfg, &pci_mmcfg_list, list) { >> + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); >> + if (rc) >> + return rc; >> + } >> >> + /* filter the mcfg resource from current APEI's */ >> + return apei_resources_sub(res, mcfg_res); >> +} >> +#else >> +static inline int apei_filter_mcfg_addr(struct apei_resources *res, >> + struct apei_resources *mcfg_res) >> { >> - return arch_apei_filter_addr(apei_get_res_callback, resources); >> + return 0; >> } >> +#endif >> >> /* >> * IO memory/port resource management mechanism is used to check >> @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, >> if (rc) >> goto nvs_res_fini; >> >> - if (arch_apei_filter_addr) { >> - apei_resources_init(&arch_res); >> - rc = apei_get_arch_resources(&arch_res); >> - if (rc) >> - goto arch_res_fini; >> - rc = apei_resources_sub(resources, &arch_res); >> - if (rc) >> - goto arch_res_fini; >> - } >> + rc = apei_filter_mcfg_addr(resources, &arch_res); >> + if (rc) >> + goto arch_res_fini; >> >> rc = -EINVAL; >> list_for_each_entry(res, &resources->iomem, list) { >> @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, >> release_mem_region(res->start, res->end - res->start); >> } >> arch_res_fini: >> - if (arch_apei_filter_addr) >> - apei_resources_fini(&arch_res); >> + apei_resources_fini(&arch_res); >> nvs_res_fini: >> apei_resources_fini(&nvs_resources); >> return rc; >> -- >> 1.8.3.1 >>
[+cc Gong, author of d91525eb8ee6] On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: > On 20/10/2021 03:23, Bjorn Helgaas wrote: > > On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > > I wish we could also include the matching MCFG info For x86, we > > put something like this in the dmesg log, but I guess pci_mcfg.c > > doesn't do this: > > > > PCI: MMCONFIG for domain 0000 [bus 00-7f] at [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) > > I can add the similar dmesg log in pci_mcfg.c to make it's > consistent between different arches I think that might be nice. It should probably be a separate patch since it's not really related to the others. > >> This patch will try to handle this case in a more common way instead of the > >> original 'arch' specific solution, which will be beneficial to all the > >> APEI-dependent platforms after that. > > > > This actually doesn't say anything about what the patch does or how it > > works. It says "handles this case in a more common way" but with no > > details. > > Good suggestion, I'll give more details about that... > > The EINJ table contains "injection instructions" that can read or > > write "register regions" described by generic address structures (see > > ACPI v6.3, sec 18.6.2 and 18.6.3), and __einj_error_trigger() requests > > those register regions with request_mem_region() or request_region() > > before executing the injections instructions. > > > > IIUC, this patch basically says "if this region is part of the MCFG > > area, we don't need to reserve it." That leads to the questions of why > > we need to reserve *any* of the areas > > AFAIK, the MCFG area is reserved since the ECAM module will provide > a generic Kernel Programming Interfaces(KPI), e.g, > pci_generic_config_read(...), so all the drivers are allowed to > access the pci config space only by those KPIs in a consistent and > safe way, direct raw access will break the rule. Correct me if I am > missing sth. > > > and why it's safe to simply skip reserving regions that are part > > of the MCFG area. > > Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error > injection tolerance level") before to address this issue, the entire > commit log as below: > > Some BIOSes utilize PCI MMCFG space read/write opertion to trigger > specific errors. EINJ will report errors as below when hitting such > cases: > > APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers > > It is because on x86 platform ACPI based PCI MMCFG logic has > reserved all MMCFG spaces so that EINJ can't reserve it again. > We already trust the ACPI/APEI code when using the EINJ interface > so it is not a big leap to also trust it to access the right > MMCFG addresses. Skip address checking to allow the access. I'm not really convinced by that justification because I don't think the issue here is *trust*. If all we care about is trust, and we trust the ACPI/APEI code, why do we need to reserve anything at all when executing EINJ actions? I think the resource reservation issue is about coordinating multiple users of the address space. A driver reserves the MMIO address space of a device it controls so no other driver can reserve it at the same time and cause conflicts. I'm not really convinced by this mutual exclusion argument either, because I haven't yet seen a situation where we say "EINJ needs a resource that's already in use by somebody else, so we can't use EINJ." When conflicts arise, the response is always "we'll just stop reserving this conflicting resource but use it anyway." I think the only real value in apei_resources_request() is a little bit of documentation in /proc/iomem. For ERST and EINJ, even that only lasts for the tiny period when we're actually executing an action. So convince me there's a reason why we shouldn't just remove apei_resources_request() completely :) > Except that the above explanation, IMO the EINJ is only a RAS debug > framework, in this code path, sometimes we need to acesss the > address within the MCFG space directly to trigger kind of HW error, > which behavior does not like the normal device driver's, in this > case some possible unsafe operations (bypass the ecam ops) can be > mitigated because the touched device will generate some HW errors > and the RAS handling part will preempt its corresponding drivers to > fix/log the HW error, that's my understanding about that. > >> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> > >> Reported-by: kernel test robot <lkp@intel.com> > >> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > >> Cc: Catalin Marinas <catalin.marinas@arm.com> > >> Cc: James Morse <james.morse@arm.com> > >> Cc: Will Deacon <will@kernel.org> > >> Cc: Rafael. J. Wysocki <rafael@kernel.org> > >> Cc: Tony Luck <tony.luck@intel.com> > >> Cc: Tomasz Nowicki <tn@semihalf.com> > >> --- > >> arch/x86/pci/mmconfig-shared.c | 28 -------------------------- > >> drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- > >> 2 files changed, 30 insertions(+), 43 deletions(-) > >> > >> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c > >> index 0b961fe6..12f7d96 100644 > >> --- a/arch/x86/pci/mmconfig-shared.c > >> +++ b/arch/x86/pci/mmconfig-shared.c > >> @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) > >> return 0; > >> } > >> > >> -#ifdef CONFIG_ACPI_APEI > >> -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > >> - void *data), void *data); > >> - > >> -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, > >> - void *data), void *data) > >> -{ > >> - struct pci_mmcfg_region *cfg; > >> - int rc; > >> - > >> - if (list_empty(&pci_mmcfg_list)) > >> - return 0; > >> - > >> - list_for_each_entry(cfg, &pci_mmcfg_list, list) { > >> - rc = func(cfg->res.start, resource_size(&cfg->res), data); > >> - if (rc) > >> - return rc; > >> - } > >> - > >> - return 0; > >> -} > >> -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) > >> -#else > >> -#define set_apei_filter() > >> -#endif > >> - > >> static void __init __pci_mmcfg_init(int early) > >> { > >> pci_mmcfg_reject_broken(early); > >> @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) > >> else > >> acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); > >> __pci_mmcfg_init(1); > >> - > >> - set_apei_filter(); > >> } > >> } > >> > >> diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c > >> index c7fdb12..daae75a 100644 > >> --- a/drivers/acpi/apei/apei-base.c > >> +++ b/drivers/acpi/apei/apei-base.c > >> @@ -21,6 +21,7 @@ > >> #include <linux/kernel.h> > >> #include <linux/module.h> > >> #include <linux/init.h> > >> +#include <linux/pci.h> > >> #include <linux/acpi.h> > >> #include <linux/slab.h> > >> #include <linux/io.h> > >> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) > >> return acpi_nvs_for_each_region(apei_get_res_callback, resources); > >> } > >> > >> -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > >> - void *data), void *data); > >> -static int apei_get_arch_resources(struct apei_resources *resources) > >> +#ifdef CONFIG_PCI > >> +extern struct list_head pci_mmcfg_list; > >> +static int apei_filter_mcfg_addr(struct apei_resources *res, > >> + struct apei_resources *mcfg_res) > >> +{ > >> + int rc = 0; > >> + struct pci_mmcfg_region *cfg; > >> + > >> + if (list_empty(&pci_mmcfg_list)) > >> + return 0; > >> + > >> + apei_resources_init(mcfg_res); > >> + list_for_each_entry(cfg, &pci_mmcfg_list, list) { > >> + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); > >> + if (rc) > >> + return rc; > >> + } > >> > >> + /* filter the mcfg resource from current APEI's */ > >> + return apei_resources_sub(res, mcfg_res); > >> +} > >> +#else > >> +static inline int apei_filter_mcfg_addr(struct apei_resources *res, > >> + struct apei_resources *mcfg_res) > >> { > >> - return arch_apei_filter_addr(apei_get_res_callback, resources); > >> + return 0; > >> } > >> +#endif > >> > >> /* > >> * IO memory/port resource management mechanism is used to check > >> @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, > >> if (rc) > >> goto nvs_res_fini; > >> > >> - if (arch_apei_filter_addr) { > >> - apei_resources_init(&arch_res); > >> - rc = apei_get_arch_resources(&arch_res); > >> - if (rc) > >> - goto arch_res_fini; > >> - rc = apei_resources_sub(resources, &arch_res); > >> - if (rc) > >> - goto arch_res_fini; > >> - } > >> + rc = apei_filter_mcfg_addr(resources, &arch_res); > >> + if (rc) > >> + goto arch_res_fini; > >> > >> rc = -EINVAL; > >> list_for_each_entry(res, &resources->iomem, list) { > >> @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, > >> release_mem_region(res->start, res->end - res->start); > >> } > >> arch_res_fini: > >> - if (arch_apei_filter_addr) > >> - apei_resources_fini(&arch_res); > >> + apei_resources_fini(&arch_res); > >> nvs_res_fini: > >> apei_resources_fini(&nvs_resources); > >> return rc; > >> -- > >> 1.8.3.1 > >>
On 21/10/2021 02:50, Bjorn Helgaas wrote: > [+cc Gong, author of d91525eb8ee6] > > On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: >> On 20/10/2021 03:23, Bjorn Helgaas wrote: >>> On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > >>> I wish we could also include the matching MCFG info For x86, we >>> put something like this in the dmesg log, but I guess pci_mcfg.c >>> doesn't do this: >>> >>> PCI: MMCONFIG for domain 0000 [bus 00-7f] at [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) >> >> I can add the similar dmesg log in pci_mcfg.c to make it's >> consistent between different arches > > I think that might be nice. It should probably be a separate patch > since it's not really related to the others. Yes, will do. > >>>> This patch will try to handle this case in a more common way instead of the >>>> original 'arch' specific solution, which will be beneficial to all the >>>> APEI-dependent platforms after that. >>> >>> This actually doesn't say anything about what the patch does or how it >>> works. It says "handles this case in a more common way" but with no >>> details. >> >> Good suggestion, I'll give more details about that... >>> The EINJ table contains "injection instructions" that can read or >>> write "register regions" described by generic address structures (see >>> ACPI v6.3, sec 18.6.2 and 18.6.3), and __einj_error_trigger() requests >>> those register regions with request_mem_region() or request_region() >>> before executing the injections instructions. >>> >>> IIUC, this patch basically says "if this region is part of the MCFG >>> area, we don't need to reserve it." That leads to the questions of why >>> we need to reserve *any* of the areas >> >> AFAIK, the MCFG area is reserved since the ECAM module will provide >> a generic Kernel Programming Interfaces(KPI), e.g, >> pci_generic_config_read(...), so all the drivers are allowed to >> access the pci config space only by those KPIs in a consistent and >> safe way, direct raw access will break the rule. Correct me if I am >> missing sth. >> >>> and why it's safe to simply skip reserving regions that are part >>> of the MCFG area. >> >> Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error >> injection tolerance level") before to address this issue, the entire >> commit log as below: >> >> Some BIOSes utilize PCI MMCFG space read/write opertion to trigger >> specific errors. EINJ will report errors as below when hitting such >> cases: >> >> APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers >> >> It is because on x86 platform ACPI based PCI MMCFG logic has >> reserved all MMCFG spaces so that EINJ can't reserve it again. >> We already trust the ACPI/APEI code when using the EINJ interface >> so it is not a big leap to also trust it to access the right >> MMCFG addresses. Skip address checking to allow the access. > > I'm not really convinced by that justification because I don't think > the issue here is *trust*. If all we care about is trust, and we > trust the ACPI/APEI code, why do we need to reserve anything at all > when executing EINJ actions? > > I think the resource reservation issue is about coordinating multiple > users of the address space. A driver reserves the MMIO address space > of a device it controls so no other driver can reserve it at the same > time and cause conflicts. > > I'm not really convinced by this mutual exclusion argument either, > because I haven't yet seen a situation where we say "EINJ needs a > resource that's already in use by somebody else, so we can't use > EINJ." When conflicts arise, the response is always "we'll just > stop reserving this conflicting resource but use it anyway." > > I think the only real value in apei_resources_request() is a little > bit of documentation in /proc/iomem. For ERST and EINJ, even that > only lasts for the tiny period when we're actually executing an > action. > > So convince me there's a reason why we shouldn't just remove > apei_resources_request() completely :) > I have to confess that currently I have no strong evidence/reason to convince you that it's absolute safe to remove apei_resources_request(), probably in some conditions it *does* require to follow the mutual exclusion usage model. The ECAM/MCFG maybe a special case not like other normal device driver, since all its MCFG space has been reserved during the initialization. Anyway, it's another topic and good point well worth discussing in the future. From the patch set itself, I don't think it's a nice idea to make a dramatic change regarding the apei_resources_request() part, I suggest to keep the original rationale untouched and based on that to fix the real issue at hand in a more generic way. Any thought? If no objection, I plan to finalize the next version patch according to the previous feedbacks and send it out for reviewing soon... Thanks, Xuesong >> Except that the above explanation, IMO the EINJ is only a RAS debug >> framework, in this code path, sometimes we need to acesss the >> address within the MCFG space directly to trigger kind of HW error, >> which behavior does not like the normal device driver's, in this >> case some possible unsafe operations (bypass the ecam ops) can be >> mitigated because the touched device will generate some HW errors >> and the RAS handling part will preempt its corresponding drivers to >> fix/log the HW error, that's my understanding about that. > >>>> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> >>>> Reported-by: kernel test robot <lkp@intel.com> >>>> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> >>>> Cc: Catalin Marinas <catalin.marinas@arm.com> >>>> Cc: James Morse <james.morse@arm.com> >>>> Cc: Will Deacon <will@kernel.org> >>>> Cc: Rafael. J. Wysocki <rafael@kernel.org> >>>> Cc: Tony Luck <tony.luck@intel.com> >>>> Cc: Tomasz Nowicki <tn@semihalf.com> >>>> --- >>>> arch/x86/pci/mmconfig-shared.c | 28 -------------------------- >>>> drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- >>>> 2 files changed, 30 insertions(+), 43 deletions(-) >>>> >>>> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c >>>> index 0b961fe6..12f7d96 100644 >>>> --- a/arch/x86/pci/mmconfig-shared.c >>>> +++ b/arch/x86/pci/mmconfig-shared.c >>>> @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) >>>> return 0; >>>> } >>>> >>>> -#ifdef CONFIG_ACPI_APEI >>>> -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >>>> - void *data), void *data); >>>> - >>>> -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, >>>> - void *data), void *data) >>>> -{ >>>> - struct pci_mmcfg_region *cfg; >>>> - int rc; >>>> - >>>> - if (list_empty(&pci_mmcfg_list)) >>>> - return 0; >>>> - >>>> - list_for_each_entry(cfg, &pci_mmcfg_list, list) { >>>> - rc = func(cfg->res.start, resource_size(&cfg->res), data); >>>> - if (rc) >>>> - return rc; >>>> - } >>>> - >>>> - return 0; >>>> -} >>>> -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) >>>> -#else >>>> -#define set_apei_filter() >>>> -#endif >>>> - >>>> static void __init __pci_mmcfg_init(int early) >>>> { >>>> pci_mmcfg_reject_broken(early); >>>> @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) >>>> else >>>> acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); >>>> __pci_mmcfg_init(1); >>>> - >>>> - set_apei_filter(); >>>> } >>>> } >>>> >>>> diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c >>>> index c7fdb12..daae75a 100644 >>>> --- a/drivers/acpi/apei/apei-base.c >>>> +++ b/drivers/acpi/apei/apei-base.c >>>> @@ -21,6 +21,7 @@ >>>> #include <linux/kernel.h> >>>> #include <linux/module.h> >>>> #include <linux/init.h> >>>> +#include <linux/pci.h> >>>> #include <linux/acpi.h> >>>> #include <linux/slab.h> >>>> #include <linux/io.h> >>>> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) >>>> return acpi_nvs_for_each_region(apei_get_res_callback, resources); >>>> } >>>> >>>> -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >>>> - void *data), void *data); >>>> -static int apei_get_arch_resources(struct apei_resources *resources) >>>> +#ifdef CONFIG_PCI >>>> +extern struct list_head pci_mmcfg_list; >>>> +static int apei_filter_mcfg_addr(struct apei_resources *res, >>>> + struct apei_resources *mcfg_res) >>>> +{ >>>> + int rc = 0; >>>> + struct pci_mmcfg_region *cfg; >>>> + >>>> + if (list_empty(&pci_mmcfg_list)) >>>> + return 0; >>>> + >>>> + apei_resources_init(mcfg_res); >>>> + list_for_each_entry(cfg, &pci_mmcfg_list, list) { >>>> + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); >>>> + if (rc) >>>> + return rc; >>>> + } >>>> >>>> + /* filter the mcfg resource from current APEI's */ >>>> + return apei_resources_sub(res, mcfg_res); >>>> +} >>>> +#else >>>> +static inline int apei_filter_mcfg_addr(struct apei_resources *res, >>>> + struct apei_resources *mcfg_res) >>>> { >>>> - return arch_apei_filter_addr(apei_get_res_callback, resources); >>>> + return 0; >>>> } >>>> +#endif >>>> >>>> /* >>>> * IO memory/port resource management mechanism is used to check >>>> @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, >>>> if (rc) >>>> goto nvs_res_fini; >>>> >>>> - if (arch_apei_filter_addr) { >>>> - apei_resources_init(&arch_res); >>>> - rc = apei_get_arch_resources(&arch_res); >>>> - if (rc) >>>> - goto arch_res_fini; >>>> - rc = apei_resources_sub(resources, &arch_res); >>>> - if (rc) >>>> - goto arch_res_fini; >>>> - } >>>> + rc = apei_filter_mcfg_addr(resources, &arch_res); >>>> + if (rc) >>>> + goto arch_res_fini; >>>> >>>> rc = -EINVAL; >>>> list_for_each_entry(res, &resources->iomem, list) { >>>> @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, >>>> release_mem_region(res->start, res->end - res->start); >>>> } >>>> arch_res_fini: >>>> - if (arch_apei_filter_addr) >>>> - apei_resources_fini(&arch_res); >>>> + apei_resources_fini(&arch_res); >>>> nvs_res_fini: >>>> apei_resources_fini(&nvs_resources); >>>> return rc; >>>> -- >>>> 1.8.3.1 >>>>
On Thu, Oct 21, 2021 at 11:46:40PM +0800, Xuesong Chen wrote: > On 21/10/2021 02:50, Bjorn Helgaas wrote: > > On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: > >> On 20/10/2021 03:23, Bjorn Helgaas wrote: > >>> On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > >>>> This patch will try to handle this case in a more common way > >>>> instead of the original 'arch' specific solution, which will be > >>>> beneficial to all the APEI-dependent platforms after that. > >>> > >>> This actually doesn't say anything about what the patch does or > >>> how it works. It says "handles this case in a more common way" > >>> but with no details. > >> > >> Good suggestion, I'll give more details about that... > >> > >>> The EINJ table contains "injection instructions" that can read > >>> or write "register regions" described by generic address > >>> structures (see ACPI v6.3, sec 18.6.2 and 18.6.3), and > >>> __einj_error_trigger() requests those register regions with > >>> request_mem_region() or request_region() before executing the > >>> injections instructions. > >>> > >>> IIUC, this patch basically says "if this region is part of the > >>> MCFG area, we don't need to reserve it." That leads to the > >>> questions of why we need to reserve *any* of the areas > >> > >> AFAIK, the MCFG area is reserved since the ECAM module will > >> provide a generic Kernel Programming Interfaces(KPI), e.g, > >> pci_generic_config_read(...), so all the drivers are allowed to > >> access the pci config space only by those KPIs in a consistent > >> and safe way, direct raw access will break the rule. Correct me > >> if I am missing sth. > >> > >>> and why it's safe to simply skip reserving regions that are part > >>> of the MCFG area. > >> > >> Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error > >> injection tolerance level") before to address this issue, the > >> entire commit log as below: > >> > >> Some BIOSes utilize PCI MMCFG space read/write opertion to trigger > >> specific errors. EINJ will report errors as below when hitting such > >> cases: > >> > >> APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers > >> > >> It is because on x86 platform ACPI based PCI MMCFG logic has > >> reserved all MMCFG spaces so that EINJ can't reserve it again. > >> We already trust the ACPI/APEI code when using the EINJ interface > >> so it is not a big leap to also trust it to access the right > >> MMCFG addresses. Skip address checking to allow the access. > > > > I'm not really convinced by that justification because I don't > > think the issue here is *trust*. If all we care about is trust, > > and we trust the ACPI/APEI code, why do we need to reserve > > anything at all when executing EINJ actions? > > > > I think the resource reservation issue is about coordinating > > multiple users of the address space. A driver reserves the MMIO > > address space of a device it controls so no other driver can > > reserve it at the same time and cause conflicts. > > > > I'm not really convinced by this mutual exclusion argument either, > > because I haven't yet seen a situation where we say "EINJ needs a > > resource that's already in use by somebody else, so we can't use > > EINJ." When conflicts arise, the response is always "we'll just > > stop reserving this conflicting resource but use it anyway." > > > > I think the only real value in apei_resources_request() is a > > little bit of documentation in /proc/iomem. For ERST and EINJ, > > even that only lasts for the tiny period when we're actually > > executing an action. > > > > So convince me there's a reason why we shouldn't just remove > > apei_resources_request() completely :) > > I have to confess that currently I have no strong evidence/reason to > convince you that it's absolute safe to remove > apei_resources_request(), probably in some conditions it *does* > require to follow the mutual exclusion usage model. The ECAM/MCFG > maybe a special case not like other normal device driver, since all > its MCFG space has been reserved during the initialization. Anyway, > it's another topic and good point well worth discussing in the > future. This is missing the point. It's not the MCFG reservation during initialization that would make this safe. What would make it safe is the fact that ECAM does not require mutual exclusion. When the hardware implements ECAM correctly, PCI config accesses do not require locking because a config access requires a single MMIO load or store. Many non-ECAM config accessors *do* require locking because they use several register accesses, e.g., the 0xCF8/0xCFC address/data pairs used by pci_conf1_read(). If EINJ actions used these, we would have to enforce mutual exclusion between EINJ config accesses and those done by other drivers. Some ARM64 platforms do not implement ECAM correctly, e.g., tegra194_map_bus() programs an outbound ATU and xgene_pcie_map_bus() sets an RTDID register before the MMIO load/store. Platforms like this *do* require mutual exclusion between an EINJ config access and other config accesses. These platforms are supported via quirks in pci_mcfg.c, so they will have resources in the pci_mcfg_list, and if we just ignore all the MCFG resources in apei_resources_request(), there will be nothing to prevent ordinary driver config accesses from being corrupted by EINJ accesses. I think in general, is probably *is* safe to remove MCFG resources from the APEI reservations, but it would be better if we had some way to prevent EINJ from using MCFG on platforms like tegra194 and xgene. > From the patch set itself, I don't think it's a nice idea to make a > dramatic change regarding the apei_resources_request() part, I > suggest to keep the original rationale untouched and based on that > to fix the real issue at hand in a more generic way. There *was* no original rationale. The whole point of this conversation is to figure out what the real rationale is. > >> Except that the above explanation, IMO the EINJ is only a RAS > >> debug framework, in this code path, sometimes we need to acesss > >> the address within the MCFG space directly to trigger kind of HW > >> error, which behavior does not like the normal device driver's, > >> in this case some possible unsafe operations (bypass the ecam > >> ops) can be mitigated because the touched device will generate > >> some HW errors and the RAS handling part will preempt its > >> corresponding drivers to fix/log the HW error, that's my > >> understanding about that. > > > >>>> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> > >>>> Reported-by: kernel test robot <lkp@intel.com> > >>>> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > >>>> Cc: Catalin Marinas <catalin.marinas@arm.com> > >>>> Cc: James Morse <james.morse@arm.com> > >>>> Cc: Will Deacon <will@kernel.org> > >>>> Cc: Rafael. J. Wysocki <rafael@kernel.org> > >>>> Cc: Tony Luck <tony.luck@intel.com> > >>>> Cc: Tomasz Nowicki <tn@semihalf.com> > >>>> --- > >>>> arch/x86/pci/mmconfig-shared.c | 28 -------------------------- > >>>> drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- > >>>> 2 files changed, 30 insertions(+), 43 deletions(-) > >>>> > >>>> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c > >>>> index 0b961fe6..12f7d96 100644 > >>>> --- a/arch/x86/pci/mmconfig-shared.c > >>>> +++ b/arch/x86/pci/mmconfig-shared.c > >>>> @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) > >>>> return 0; > >>>> } > >>>> > >>>> -#ifdef CONFIG_ACPI_APEI > >>>> -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > >>>> - void *data), void *data); > >>>> - > >>>> -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, > >>>> - void *data), void *data) > >>>> -{ > >>>> - struct pci_mmcfg_region *cfg; > >>>> - int rc; > >>>> - > >>>> - if (list_empty(&pci_mmcfg_list)) > >>>> - return 0; > >>>> - > >>>> - list_for_each_entry(cfg, &pci_mmcfg_list, list) { > >>>> - rc = func(cfg->res.start, resource_size(&cfg->res), data); > >>>> - if (rc) > >>>> - return rc; > >>>> - } > >>>> - > >>>> - return 0; > >>>> -} > >>>> -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) > >>>> -#else > >>>> -#define set_apei_filter() > >>>> -#endif > >>>> - > >>>> static void __init __pci_mmcfg_init(int early) > >>>> { > >>>> pci_mmcfg_reject_broken(early); > >>>> @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) > >>>> else > >>>> acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); > >>>> __pci_mmcfg_init(1); > >>>> - > >>>> - set_apei_filter(); > >>>> } > >>>> } > >>>> > >>>> diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c > >>>> index c7fdb12..daae75a 100644 > >>>> --- a/drivers/acpi/apei/apei-base.c > >>>> +++ b/drivers/acpi/apei/apei-base.c > >>>> @@ -21,6 +21,7 @@ > >>>> #include <linux/kernel.h> > >>>> #include <linux/module.h> > >>>> #include <linux/init.h> > >>>> +#include <linux/pci.h> > >>>> #include <linux/acpi.h> > >>>> #include <linux/slab.h> > >>>> #include <linux/io.h> > >>>> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) > >>>> return acpi_nvs_for_each_region(apei_get_res_callback, resources); > >>>> } > >>>> > >>>> -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, > >>>> - void *data), void *data); > >>>> -static int apei_get_arch_resources(struct apei_resources *resources) > >>>> +#ifdef CONFIG_PCI > >>>> +extern struct list_head pci_mmcfg_list; > >>>> +static int apei_filter_mcfg_addr(struct apei_resources *res, > >>>> + struct apei_resources *mcfg_res) > >>>> +{ > >>>> + int rc = 0; > >>>> + struct pci_mmcfg_region *cfg; > >>>> + > >>>> + if (list_empty(&pci_mmcfg_list)) > >>>> + return 0; > >>>> + > >>>> + apei_resources_init(mcfg_res); > >>>> + list_for_each_entry(cfg, &pci_mmcfg_list, list) { > >>>> + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); > >>>> + if (rc) > >>>> + return rc; > >>>> + } > >>>> > >>>> + /* filter the mcfg resource from current APEI's */ > >>>> + return apei_resources_sub(res, mcfg_res); > >>>> +} > >>>> +#else > >>>> +static inline int apei_filter_mcfg_addr(struct apei_resources *res, > >>>> + struct apei_resources *mcfg_res) > >>>> { > >>>> - return arch_apei_filter_addr(apei_get_res_callback, resources); > >>>> + return 0; > >>>> } > >>>> +#endif > >>>> > >>>> /* > >>>> * IO memory/port resource management mechanism is used to check > >>>> @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, > >>>> if (rc) > >>>> goto nvs_res_fini; > >>>> > >>>> - if (arch_apei_filter_addr) { > >>>> - apei_resources_init(&arch_res); > >>>> - rc = apei_get_arch_resources(&arch_res); > >>>> - if (rc) > >>>> - goto arch_res_fini; > >>>> - rc = apei_resources_sub(resources, &arch_res); > >>>> - if (rc) > >>>> - goto arch_res_fini; > >>>> - } > >>>> + rc = apei_filter_mcfg_addr(resources, &arch_res); > >>>> + if (rc) > >>>> + goto arch_res_fini; > >>>> > >>>> rc = -EINVAL; > >>>> list_for_each_entry(res, &resources->iomem, list) { > >>>> @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, > >>>> release_mem_region(res->start, res->end - res->start); > >>>> } > >>>> arch_res_fini: > >>>> - if (arch_apei_filter_addr) > >>>> - apei_resources_fini(&arch_res); > >>>> + apei_resources_fini(&arch_res); > >>>> nvs_res_fini: > >>>> apei_resources_fini(&nvs_resources); > >>>> return rc; > >>>> -- > >>>> 1.8.3.1 > >>>>
On 22/10/2021 00:57, Bjorn Helgaas wrote: > On Thu, Oct 21, 2021 at 11:46:40PM +0800, Xuesong Chen wrote: >> On 21/10/2021 02:50, Bjorn Helgaas wrote: >>> On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: >>>> On 20/10/2021 03:23, Bjorn Helgaas wrote: >>>>> On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > >>>>>> This patch will try to handle this case in a more common way >>>>>> instead of the original 'arch' specific solution, which will be >>>>>> beneficial to all the APEI-dependent platforms after that. >>>>> >>>>> This actually doesn't say anything about what the patch does or >>>>> how it works. It says "handles this case in a more common way" >>>>> but with no details. >>>> >>>> Good suggestion, I'll give more details about that... >>>> >>>>> The EINJ table contains "injection instructions" that can read >>>>> or write "register regions" described by generic address >>>>> structures (see ACPI v6.3, sec 18.6.2 and 18.6.3), and >>>>> __einj_error_trigger() requests those register regions with >>>>> request_mem_region() or request_region() before executing the >>>>> injections instructions. >>>>> >>>>> IIUC, this patch basically says "if this region is part of the >>>>> MCFG area, we don't need to reserve it." That leads to the >>>>> questions of why we need to reserve *any* of the areas >>>> >>>> AFAIK, the MCFG area is reserved since the ECAM module will >>>> provide a generic Kernel Programming Interfaces(KPI), e.g, >>>> pci_generic_config_read(...), so all the drivers are allowed to >>>> access the pci config space only by those KPIs in a consistent >>>> and safe way, direct raw access will break the rule. Correct me >>>> if I am missing sth. >>>> >>>>> and why it's safe to simply skip reserving regions that are part >>>>> of the MCFG area. >>>> >>>> Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error >>>> injection tolerance level") before to address this issue, the >>>> entire commit log as below: >>>> >>>> Some BIOSes utilize PCI MMCFG space read/write opertion to trigger >>>> specific errors. EINJ will report errors as below when hitting such >>>> cases: >>>> >>>> APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers >>>> >>>> It is because on x86 platform ACPI based PCI MMCFG logic has >>>> reserved all MMCFG spaces so that EINJ can't reserve it again. >>>> We already trust the ACPI/APEI code when using the EINJ interface >>>> so it is not a big leap to also trust it to access the right >>>> MMCFG addresses. Skip address checking to allow the access. >>> >>> I'm not really convinced by that justification because I don't >>> think the issue here is *trust*. If all we care about is trust, >>> and we trust the ACPI/APEI code, why do we need to reserve >>> anything at all when executing EINJ actions? >>> >>> I think the resource reservation issue is about coordinating >>> multiple users of the address space. A driver reserves the MMIO >>> address space of a device it controls so no other driver can >>> reserve it at the same time and cause conflicts. >>> >>> I'm not really convinced by this mutual exclusion argument either, >>> because I haven't yet seen a situation where we say "EINJ needs a >>> resource that's already in use by somebody else, so we can't use >>> EINJ." When conflicts arise, the response is always "we'll just >>> stop reserving this conflicting resource but use it anyway." >>> >>> I think the only real value in apei_resources_request() is a >>> little bit of documentation in /proc/iomem. For ERST and EINJ, >>> even that only lasts for the tiny period when we're actually >>> executing an action. >>> >>> So convince me there's a reason why we shouldn't just remove >>> apei_resources_request() completely :) >> >> I have to confess that currently I have no strong evidence/reason to >> convince you that it's absolute safe to remove >> apei_resources_request(), probably in some conditions it *does* >> require to follow the mutual exclusion usage model. The ECAM/MCFG >> maybe a special case not like other normal device driver, since all >> its MCFG space has been reserved during the initialization. Anyway, >> it's another topic and good point well worth discussing in the >> future. > > This is missing the point. It's not the MCFG reservation during > initialization that would make this safe. What would make it safe is > the fact that ECAM does not require mutual exclusion. > > When the hardware implements ECAM correctly, PCI config accesses do > not require locking because a config access requires a single MMIO > load or store. > I don't quite understand here, we're talking about apei_resources_request() which is a mechanism to void resource conflict,"request_mem_region() tells the kernel that your driver is going to use this range of I/O addresses, which will prevent other drivers to make any overlapping call to the same region through request_mem_region()", but according to the context of 'a single MMIO load or store', are you talking about something like the mutex lock primitive? > Many non-ECAM config accessors *do* require locking because they use > several register accesses, e.g., the 0xCF8/0xCFC address/data pairs > used by pci_conf1_read(). If EINJ actions used these, we would have > to enforce mutual exclusion between EINJ config accesses and those > done by other drivers. I take a look at the pci_conf1_read() function, there's only a pair of raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(), if that's the mutual exclusion you mentioned, seems it's not related to the apei_resources_request() we're talking about... > > Some ARM64 platforms do not implement ECAM correctly, e.g., > tegra194_map_bus() programs an outbound ATU and xgene_pcie_map_bus() > sets an RTDID register before the MMIO load/store. Platforms like > this *do* require mutual exclusion between an EINJ config access and > other config accesses. What's the mutual exclusion for those quirk functions(tegra194 and xgene)? *mutual* is not applied for single side. I can see neither locking nor request_mem_region() in those bus map functions. > > These platforms are supported via quirks in pci_mcfg.c, so they will > have resources in the pci_mcfg_list, and if we just ignore all the > MCFG resources in apei_resources_request(), there will be nothing to > prevent ordinary driver config accesses from being corrupted by EINJ > accesses. > > I think in general, is probably *is* safe to remove MCFG resources > from the APEI reservations, but it would be better if we had some way > to prevent EINJ from using MCFG on platforms like tegra194 and xgene. Just as I mentioned, since there's no mutual exclusion applied for the tegra194 and xgene(correct me if I am wrong), putting their MCFG resources into the APEI reservation(so the apei_resources_request() applied) does nothing Thanks, Xuesong > >> From the patch set itself, I don't think it's a nice idea to make a >> dramatic change regarding the apei_resources_request() part, I >> suggest to keep the original rationale untouched and based on that >> to fix the real issue at hand in a more generic way. > > There *was* no original rationale. The whole point of this > conversation is to figure out what the real rationale is. > >>>> Except that the above explanation, IMO the EINJ is only a RAS >>>> debug framework, in this code path, sometimes we need to acesss >>>> the address within the MCFG space directly to trigger kind of HW >>>> error, which behavior does not like the normal device driver's, >>>> in this case some possible unsafe operations (bypass the ecam >>>> ops) can be mitigated because the touched device will generate >>>> some HW errors and the RAS handling part will preempt its >>>> corresponding drivers to fix/log the HW error, that's my >>>> understanding about that. >>> >>>>>> Signed-off-by: Xuesong Chen <xuesong.chen@linux.alibaba.com> >>>>>> Reported-by: kernel test robot <lkp@intel.com> >>>>>> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> >>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com> >>>>>> Cc: James Morse <james.morse@arm.com> >>>>>> Cc: Will Deacon <will@kernel.org> >>>>>> Cc: Rafael. J. Wysocki <rafael@kernel.org> >>>>>> Cc: Tony Luck <tony.luck@intel.com> >>>>>> Cc: Tomasz Nowicki <tn@semihalf.com> >>>>>> --- >>>>>> arch/x86/pci/mmconfig-shared.c | 28 -------------------------- >>>>>> drivers/acpi/apei/apei-base.c | 45 ++++++++++++++++++++++++++++-------------- >>>>>> 2 files changed, 30 insertions(+), 43 deletions(-) >>>>>> >>>>>> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c >>>>>> index 0b961fe6..12f7d96 100644 >>>>>> --- a/arch/x86/pci/mmconfig-shared.c >>>>>> +++ b/arch/x86/pci/mmconfig-shared.c >>>>>> @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) >>>>>> return 0; >>>>>> } >>>>>> >>>>>> -#ifdef CONFIG_ACPI_APEI >>>>>> -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >>>>>> - void *data), void *data); >>>>>> - >>>>>> -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, >>>>>> - void *data), void *data) >>>>>> -{ >>>>>> - struct pci_mmcfg_region *cfg; >>>>>> - int rc; >>>>>> - >>>>>> - if (list_empty(&pci_mmcfg_list)) >>>>>> - return 0; >>>>>> - >>>>>> - list_for_each_entry(cfg, &pci_mmcfg_list, list) { >>>>>> - rc = func(cfg->res.start, resource_size(&cfg->res), data); >>>>>> - if (rc) >>>>>> - return rc; >>>>>> - } >>>>>> - >>>>>> - return 0; >>>>>> -} >>>>>> -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) >>>>>> -#else >>>>>> -#define set_apei_filter() >>>>>> -#endif >>>>>> - >>>>>> static void __init __pci_mmcfg_init(int early) >>>>>> { >>>>>> pci_mmcfg_reject_broken(early); >>>>>> @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) >>>>>> else >>>>>> acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); >>>>>> __pci_mmcfg_init(1); >>>>>> - >>>>>> - set_apei_filter(); >>>>>> } >>>>>> } >>>>>> >>>>>> diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c >>>>>> index c7fdb12..daae75a 100644 >>>>>> --- a/drivers/acpi/apei/apei-base.c >>>>>> +++ b/drivers/acpi/apei/apei-base.c >>>>>> @@ -21,6 +21,7 @@ >>>>>> #include <linux/kernel.h> >>>>>> #include <linux/module.h> >>>>>> #include <linux/init.h> >>>>>> +#include <linux/pci.h> >>>>>> #include <linux/acpi.h> >>>>>> #include <linux/slab.h> >>>>>> #include <linux/io.h> >>>>>> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) >>>>>> return acpi_nvs_for_each_region(apei_get_res_callback, resources); >>>>>> } >>>>>> >>>>>> -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, >>>>>> - void *data), void *data); >>>>>> -static int apei_get_arch_resources(struct apei_resources *resources) >>>>>> +#ifdef CONFIG_PCI >>>>>> +extern struct list_head pci_mmcfg_list; >>>>>> +static int apei_filter_mcfg_addr(struct apei_resources *res, >>>>>> + struct apei_resources *mcfg_res) >>>>>> +{ >>>>>> + int rc = 0; >>>>>> + struct pci_mmcfg_region *cfg; >>>>>> + >>>>>> + if (list_empty(&pci_mmcfg_list)) >>>>>> + return 0; >>>>>> + >>>>>> + apei_resources_init(mcfg_res); >>>>>> + list_for_each_entry(cfg, &pci_mmcfg_list, list) { >>>>>> + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); >>>>>> + if (rc) >>>>>> + return rc; >>>>>> + } >>>>>> >>>>>> + /* filter the mcfg resource from current APEI's */ >>>>>> + return apei_resources_sub(res, mcfg_res); >>>>>> +} >>>>>> +#else >>>>>> +static inline int apei_filter_mcfg_addr(struct apei_resources *res, >>>>>> + struct apei_resources *mcfg_res) >>>>>> { >>>>>> - return arch_apei_filter_addr(apei_get_res_callback, resources); >>>>>> + return 0; >>>>>> } >>>>>> +#endif >>>>>> >>>>>> /* >>>>>> * IO memory/port resource management mechanism is used to check >>>>>> @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, >>>>>> if (rc) >>>>>> goto nvs_res_fini; >>>>>> >>>>>> - if (arch_apei_filter_addr) { >>>>>> - apei_resources_init(&arch_res); >>>>>> - rc = apei_get_arch_resources(&arch_res); >>>>>> - if (rc) >>>>>> - goto arch_res_fini; >>>>>> - rc = apei_resources_sub(resources, &arch_res); >>>>>> - if (rc) >>>>>> - goto arch_res_fini; >>>>>> - } >>>>>> + rc = apei_filter_mcfg_addr(resources, &arch_res); >>>>>> + if (rc) >>>>>> + goto arch_res_fini; >>>>>> >>>>>> rc = -EINVAL; >>>>>> list_for_each_entry(res, &resources->iomem, list) { >>>>>> @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, >>>>>> release_mem_region(res->start, res->end - res->start); >>>>>> } >>>>>> arch_res_fini: >>>>>> - if (arch_apei_filter_addr) >>>>>> - apei_resources_fini(&arch_res); >>>>>> + apei_resources_fini(&arch_res); >>>>>> nvs_res_fini: >>>>>> apei_resources_fini(&nvs_resources); >>>>>> return rc; >>>>>> -- >>>>>> 1.8.3.1 >>>>>>
On Fri, Oct 22, 2021 at 05:52:15PM +0800, Xuesong Chen wrote: > On 22/10/2021 00:57, Bjorn Helgaas wrote: > > On Thu, Oct 21, 2021 at 11:46:40PM +0800, Xuesong Chen wrote: > >> On 21/10/2021 02:50, Bjorn Helgaas wrote: > >>> On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: > >>>> On 20/10/2021 03:23, Bjorn Helgaas wrote: > >>>>> On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: > > > >>>>>> This patch will try to handle this case in a more common way > >>>>>> instead of the original 'arch' specific solution, which will be > >>>>>> beneficial to all the APEI-dependent platforms after that. > >>>>> > >>>>> This actually doesn't say anything about what the patch does or > >>>>> how it works. It says "handles this case in a more common way" > >>>>> but with no details. > >>>> > >>>> Good suggestion, I'll give more details about that... > >>>> > >>>>> The EINJ table contains "injection instructions" that can read > >>>>> or write "register regions" described by generic address > >>>>> structures (see ACPI v6.3, sec 18.6.2 and 18.6.3), and > >>>>> __einj_error_trigger() requests those register regions with > >>>>> request_mem_region() or request_region() before executing the > >>>>> injections instructions. > >>>>> > >>>>> IIUC, this patch basically says "if this region is part of the > >>>>> MCFG area, we don't need to reserve it." That leads to the > >>>>> questions of why we need to reserve *any* of the areas > >>>> > >>>> AFAIK, the MCFG area is reserved since the ECAM module will > >>>> provide a generic Kernel Programming Interfaces(KPI), e.g, > >>>> pci_generic_config_read(...), so all the drivers are allowed to > >>>> access the pci config space only by those KPIs in a consistent > >>>> and safe way, direct raw access will break the rule. Correct me > >>>> if I am missing sth. > >>>> > >>>>> and why it's safe to simply skip reserving regions that are part > >>>>> of the MCFG area. > >>>> > >>>> Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error > >>>> injection tolerance level") before to address this issue, the > >>>> entire commit log as below: > >>>> > >>>> Some BIOSes utilize PCI MMCFG space read/write opertion to trigger > >>>> specific errors. EINJ will report errors as below when hitting such > >>>> cases: > >>>> > >>>> APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers > >>>> > >>>> It is because on x86 platform ACPI based PCI MMCFG logic has > >>>> reserved all MMCFG spaces so that EINJ can't reserve it again. > >>>> We already trust the ACPI/APEI code when using the EINJ interface > >>>> so it is not a big leap to also trust it to access the right > >>>> MMCFG addresses. Skip address checking to allow the access. > >>> > >>> I'm not really convinced by that justification because I don't > >>> think the issue here is *trust*. If all we care about is trust, > >>> and we trust the ACPI/APEI code, why do we need to reserve > >>> anything at all when executing EINJ actions? > >>> > >>> I think the resource reservation issue is about coordinating > >>> multiple users of the address space. A driver reserves the MMIO > >>> address space of a device it controls so no other driver can > >>> reserve it at the same time and cause conflicts. > >>> > >>> I'm not really convinced by this mutual exclusion argument either, > >>> because I haven't yet seen a situation where we say "EINJ needs a > >>> resource that's already in use by somebody else, so we can't use > >>> EINJ." When conflicts arise, the response is always "we'll just > >>> stop reserving this conflicting resource but use it anyway." > >>> > >>> I think the only real value in apei_resources_request() is a > >>> little bit of documentation in /proc/iomem. For ERST and EINJ, > >>> even that only lasts for the tiny period when we're actually > >>> executing an action. > >>> > >>> So convince me there's a reason why we shouldn't just remove > >>> apei_resources_request() completely :) > >> > >> I have to confess that currently I have no strong evidence/reason to > >> convince you that it's absolute safe to remove > >> apei_resources_request(), probably in some conditions it *does* > >> require to follow the mutual exclusion usage model. The ECAM/MCFG > >> maybe a special case not like other normal device driver, since all > >> its MCFG space has been reserved during the initialization. Anyway, > >> it's another topic and good point well worth discussing in the > >> future. > > > > This is missing the point. It's not the MCFG reservation during > > initialization that would make this safe. What would make it safe is > > the fact that ECAM does not require mutual exclusion. > > > > When the hardware implements ECAM correctly, PCI config accesses do > > not require locking because a config access requires a single MMIO > > load or store. > > I don't quite understand here, we're talking about > apei_resources_request() which is a mechanism to void resource > conflict,"request_mem_region() tells the kernel that your driver is > going to use this range of I/O addresses, which will prevent other > drivers to make any overlapping call to the same region through > request_mem_region()", but according to the context of 'a single > MMIO load or store', are you talking about something like the mutex > lock primitive? My point was that when ECAM is implemented correctly, a CPU does a single MMIO load to do a PCI config read and a single MMIO store to do a PCI config write. In that case there no need for any locking, so there's no need for APEI to reserve those resources. This is what d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance level") does. That code change makes sense, but the commit log does not -- it has nothing to do with trusting the ACPI/APEI code; it's just that no matter what the EINJ actions do with the MCFG regions, they cannot interfere with other drivers. > > Many non-ECAM config accessors *do* require locking because they use > > several register accesses, e.g., the 0xCF8/0xCFC address/data pairs > > used by pci_conf1_read(). If EINJ actions used these, we would have > > to enforce mutual exclusion between EINJ config accesses and those > > done by other drivers. > > I take a look at the pci_conf1_read() function, there's only a pair of > raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(), if that's the > mutual exclusion you mentioned, seems it's not related to the > apei_resources_request() we're talking about... This was an example of a case where EINJ mutual exclusion *would* be required. I do not expect EINJ actions to use the 0xCF8/0xCFC registers because there is no mechanism to coordinate that with the OS use of the same registers. > > Some ARM64 platforms do not implement ECAM correctly, e.g., > > tegra194_map_bus() programs an outbound ATU and xgene_pcie_map_bus() > > sets an RTDID register before the MMIO load/store. Platforms like > > this *do* require mutual exclusion between an EINJ config access and > > other config accesses. > > What's the mutual exclusion for those quirk functions (tegra194 and > xgene)? *mutual* is not applied for single side. I can see neither > locking nor request_mem_region() in those bus map functions. These currently depend on the pci_lock. See PCI_OP_READ() in drivers/pci/access.c. EINJ actions cannot acquire the pci_lock, so EINJ actions cannot safely use ECAM space on those platforms. > > These platforms are supported via quirks in pci_mcfg.c, so they will > > have resources in the pci_mcfg_list, and if we just ignore all the > > MCFG resources in apei_resources_request(), there will be nothing to > > prevent ordinary driver config accesses from being corrupted by EINJ > > accesses. > > > > I think in general, is probably *is* safe to remove MCFG resources > > from the APEI reservations, but it would be better if we had some way > > to prevent EINJ from using MCFG on platforms like tegra194 and xgene. > > Just as I mentioned, since there's no mutual exclusion applied for > the tegra194 and xgene (correct me if I am wrong), putting their MCFG > resources into the APEI reservation (so the apei_resources_request() > applied) does nothing I think apei_resources_request() should continue to reserve MCFG areas on tegra194 and xgene, but it does not need to reserve them on other ARM64 platforms. Bjorn
On 26/10/2021 07:37, Bjorn Helgaas wrote: > On Fri, Oct 22, 2021 at 05:52:15PM +0800, Xuesong Chen wrote: >> On 22/10/2021 00:57, Bjorn Helgaas wrote: >>> On Thu, Oct 21, 2021 at 11:46:40PM +0800, Xuesong Chen wrote: >>>> On 21/10/2021 02:50, Bjorn Helgaas wrote: >>>>> On Wed, Oct 20, 2021 at 11:16:38AM +0800, Xuesong Chen wrote: >>>>>> On 20/10/2021 03:23, Bjorn Helgaas wrote: >>>>>>> On Tue, Oct 19, 2021 at 12:50:33PM +0800, Xuesong Chen wrote: >>> >>>>>>>> This patch will try to handle this case in a more common way >>>>>>>> instead of the original 'arch' specific solution, which will be >>>>>>>> beneficial to all the APEI-dependent platforms after that. >>>>>>> >>>>>>> This actually doesn't say anything about what the patch does or >>>>>>> how it works. It says "handles this case in a more common way" >>>>>>> but with no details. >>>>>> >>>>>> Good suggestion, I'll give more details about that... >>>>>> >>>>>>> The EINJ table contains "injection instructions" that can read >>>>>>> or write "register regions" described by generic address >>>>>>> structures (see ACPI v6.3, sec 18.6.2 and 18.6.3), and >>>>>>> __einj_error_trigger() requests those register regions with >>>>>>> request_mem_region() or request_region() before executing the >>>>>>> injections instructions. >>>>>>> >>>>>>> IIUC, this patch basically says "if this region is part of the >>>>>>> MCFG area, we don't need to reserve it." That leads to the >>>>>>> questions of why we need to reserve *any* of the areas >>>>>> >>>>>> AFAIK, the MCFG area is reserved since the ECAM module will >>>>>> provide a generic Kernel Programming Interfaces(KPI), e.g, >>>>>> pci_generic_config_read(...), so all the drivers are allowed to >>>>>> access the pci config space only by those KPIs in a consistent >>>>>> and safe way, direct raw access will break the rule. Correct me >>>>>> if I am missing sth. >>>>>> >>>>>>> and why it's safe to simply skip reserving regions that are part >>>>>>> of the MCFG area. >>>>>> >>>>>> Actual there is a commit d91525eb8ee6("ACPI, EINJ: Enhance error >>>>>> injection tolerance level") before to address this issue, the >>>>>> entire commit log as below: >>>>>> >>>>>> Some BIOSes utilize PCI MMCFG space read/write opertion to trigger >>>>>> specific errors. EINJ will report errors as below when hitting such >>>>>> cases: >>>>>> >>>>>> APEI: Can not request [mem 0x83f990a0-0x83f990a3] for APEI EINJ Trigger registers >>>>>> >>>>>> It is because on x86 platform ACPI based PCI MMCFG logic has >>>>>> reserved all MMCFG spaces so that EINJ can't reserve it again. >>>>>> We already trust the ACPI/APEI code when using the EINJ interface >>>>>> so it is not a big leap to also trust it to access the right >>>>>> MMCFG addresses. Skip address checking to allow the access. >>>>> >>>>> I'm not really convinced by that justification because I don't >>>>> think the issue here is *trust*. If all we care about is trust, >>>>> and we trust the ACPI/APEI code, why do we need to reserve >>>>> anything at all when executing EINJ actions? >>>>> >>>>> I think the resource reservation issue is about coordinating >>>>> multiple users of the address space. A driver reserves the MMIO >>>>> address space of a device it controls so no other driver can >>>>> reserve it at the same time and cause conflicts. >>>>> >>>>> I'm not really convinced by this mutual exclusion argument either, >>>>> because I haven't yet seen a situation where we say "EINJ needs a >>>>> resource that's already in use by somebody else, so we can't use >>>>> EINJ." When conflicts arise, the response is always "we'll just >>>>> stop reserving this conflicting resource but use it anyway." >>>>> >>>>> I think the only real value in apei_resources_request() is a >>>>> little bit of documentation in /proc/iomem. For ERST and EINJ, >>>>> even that only lasts for the tiny period when we're actually >>>>> executing an action. >>>>> >>>>> So convince me there's a reason why we shouldn't just remove >>>>> apei_resources_request() completely :) >>>> >>>> I have to confess that currently I have no strong evidence/reason to >>>> convince you that it's absolute safe to remove >>>> apei_resources_request(), probably in some conditions it *does* >>>> require to follow the mutual exclusion usage model. The ECAM/MCFG >>>> maybe a special case not like other normal device driver, since all >>>> its MCFG space has been reserved during the initialization. Anyway, >>>> it's another topic and good point well worth discussing in the >>>> future. >>> >>> This is missing the point. It's not the MCFG reservation during >>> initialization that would make this safe. What would make it safe is >>> the fact that ECAM does not require mutual exclusion. >>> >>> When the hardware implements ECAM correctly, PCI config accesses do >>> not require locking because a config access requires a single MMIO >>> load or store. >> >> I don't quite understand here, we're talking about >> apei_resources_request() which is a mechanism to void resource >> conflict,"request_mem_region() tells the kernel that your driver is >> going to use this range of I/O addresses, which will prevent other >> drivers to make any overlapping call to the same region through >> request_mem_region()", but according to the context of 'a single >> MMIO load or store', are you talking about something like the mutex >> lock primitive? > > My point was that when ECAM is implemented correctly, a CPU does a > single MMIO load to do a PCI config read and a single MMIO store to do > a PCI config write. In that case there no need for any locking, so > there's no need for APEI to reserve those resources. Ah, got it. That means the PCI ECAM has a implicit mutual exclusion with EINJ if the hardware implemention is correct, so we can remove the MCFG from the APEI's safely. > > This is what d91525eb8ee6 ("ACPI, EINJ: Enhance error injection > tolerance level") does. That code change makes sense, but the commit > log does not -- it has nothing to do with trusting the ACPI/APEI code; > it's just that no matter what the EINJ actions do with the MCFG > regions, they cannot interfere with other drivers. > >>> Many non-ECAM config accessors *do* require locking because they use >>> several register accesses, e.g., the 0xCF8/0xCFC address/data pairs >>> used by pci_conf1_read(). If EINJ actions used these, we would have >>> to enforce mutual exclusion between EINJ config accesses and those >>> done by other drivers. >> >> I take a look at the pci_conf1_read() function, there's only a pair of >> raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(), if that's the >> mutual exclusion you mentioned, seems it's not related to the >> apei_resources_request() we're talking about... > > This was an example of a case where EINJ mutual exclusion *would* be > required. I do not expect EINJ actions to use the 0xCF8/0xCFC > registers because there is no mechanism to coordinate that with the OS > use of the same registers. > >>> Some ARM64 platforms do not implement ECAM correctly, e.g., >>> tegra194_map_bus() programs an outbound ATU and xgene_pcie_map_bus() >>> sets an RTDID register before the MMIO load/store. Platforms like >>> this *do* require mutual exclusion between an EINJ config access and >>> other config accesses. >> >> What's the mutual exclusion for those quirk functions (tegra194 and >> xgene)? *mutual* is not applied for single side. I can see neither >> locking nor request_mem_region() in those bus map functions. > > These currently depend on the pci_lock. See PCI_OP_READ() in > drivers/pci/access.c. > > EINJ actions cannot acquire the pci_lock, so EINJ actions cannot > safely use ECAM space on those platforms> >>> These platforms are supported via quirks in pci_mcfg.c, so they will >>> have resources in the pci_mcfg_list, and if we just ignore all the >>> MCFG resources in apei_resources_request(), there will be nothing to >>> prevent ordinary driver config accesses from being corrupted by EINJ >>> accesses. >>> >>> I think in general, is probably *is* safe to remove MCFG resources >>> from the APEI reservations, but it would be better if we had some way >>> to prevent EINJ from using MCFG on platforms like tegra194 and xgene. >> >> Just as I mentioned, since there's no mutual exclusion applied for >> the tegra194 and xgene (correct me if I am wrong), putting their MCFG >> resources into the APEI reservation (so the apei_resources_request() >> applied) does nothing > > I think apei_resources_request() should continue to reserve MCFG areas > on tegra194 and xgene, but it does not need to reserve them on other > ARM64 platforms. As a summary: we need to reserve the MCFG areas on those platforms with a quirk ECAM implementation since there's no lockless method to access the configuration space, on other platforms we don't need to reserve the MCFG resources (so can remove it safely). So we need to add another patch to handle the case of tegra194 and xgene... I will try to figure it out. Thanks, Xuesong > > Bjorn >
On Tue, Oct 26, 2021 at 05:16:47PM +0800, Xuesong Chen wrote: > On 26/10/2021 07:37, Bjorn Helgaas wrote: > > My point was that when ECAM is implemented correctly, a CPU does a > > single MMIO load to do a PCI config read and a single MMIO store to do > > a PCI config write. In that case there no need for any locking, so > > there's no need for APEI to reserve those resources. > > Ah, got it. That means the PCI ECAM has a implicit mutual exclusion with EINJ > if the hardware implemention is correct, so we can remove the MCFG from > the APEI's safely. Well, not quite. ECAM doesn't *need* mutual exclusion. Single loads and stores are atomic by definition. > > I think apei_resources_request() should continue to reserve MCFG areas > > on tegra194 and xgene, but it does not need to reserve them on other > > ARM64 platforms. > > As a summary: we need to reserve the MCFG areas on those platforms with a > quirk ECAM implementation since there's no lockless method to access the > configuration space, on other platforms we don't need to reserve the MCFG > resources (so can remove it safely). > > So we need to add another patch to handle the case of tegra194 and xgene... > I will try to figure it out. I looked through these again and found another problem case (thunder). Here are my notes from my research. Normal ECAM users require no device-specific support. The platform supplies an MCFG table, the generic code works, no mutual exclusion is required, and APEI doesn't need to reserve the MCFG areas. The problem cases are platforms that supply an MCFG table but require some device-specific workarounds. We can identify these because they have quirks in pci-mcfg.c. Here are the existing quirks and the pci_ecam_ops structs they supply: AL_ECAM al_pcie_ops # OK QCOM_ECAM32 pci_32b_ops # OK HISI_QUAD_DOM hisi_pcie_ops # OK THUNDER_PEM_QUIRK thunder_pem_ecam_ops # problem THUNDER_PEM_QUIRK thunder_pem_ecam_ops # problem THUNDER_ECAM_QUIRK pci_thunder_ecam_ops # OK tegra tegra194_pcie_ops # problem XGENE_V1_ECAM_MCFG xgene_v1_pcie_ecam_ops # problem XGENE_V2_ECAM_MCFG xgene_v2_pcie_ecam_ops # problem ALTRA_ECAM_QUIRK pci_32b_read_ops # OK The ones marked "OK" have .map_bus(), .read(), and .write() methods that need no mutual exclusion because they boil down to just a single MMIO load or store. These are fine and there shouldn't be a problem if an EINJ action accesses the ECAM space. The others do require mutual exclusion: - thunder_pem_ecam_ops: thunder_pem_config_read() calls thunder_pem_bridge_read(), which does a writeq() to PEM_CFG_RD followed by a readq(). The writeq() and readq() must be atomic to avoid corruption. - tegra194_pcie_ops: tegra194_map_bus() programs the ATU. This and the subsequent ECAM read/write must be atomic. - xgene_v1_pcie_ecam_ops and xgene_v2_pcie_ecam_ops: xgene_pcie_map_bus() sets the RTID. This and the subsequent ECAM read/write must be atomic. I had to look at all these ops individually to find them, so I don't see an easy way to identify these problem cases at run-time. I personally would not have an issue with having APEI try to reserve the MCFG regions for any platform that has an MCFG quirk. That would prevent the al, qcom, hisi, thunder-ecam, and altra drivers from using EINJ even though it would probably be safe for them. But we already know those platforms are not really ACPI-compliant, so ... Bjorn
On 27/10/2021 04:47, Bjorn Helgaas wrote: > On Tue, Oct 26, 2021 at 05:16:47PM +0800, Xuesong Chen wrote: >> On 26/10/2021 07:37, Bjorn Helgaas wrote: > >>> My point was that when ECAM is implemented correctly, a CPU does a >>> single MMIO load to do a PCI config read and a single MMIO store to do >>> a PCI config write. In that case there no need for any locking, so >>> there's no need for APEI to reserve those resources. >> >> Ah, got it. That means the PCI ECAM has a implicit mutual exclusion with EINJ >> if the hardware implemention is correct, so we can remove the MCFG from >> the APEI's safely. > > Well, not quite. ECAM doesn't *need* mutual exclusion. Single loads > and stores are atomic by definition. > OK, because ECAM config access has intrinsic atomic primitive, so no need to reserve those ECAM config space resources for APEI, that's why commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance level") fix make sense... >>> I think apei_resources_request() should continue to reserve MCFG areas >>> on tegra194 and xgene, but it does not need to reserve them on other >>> ARM64 platforms. >> >> As a summary: we need to reserve the MCFG areas on those platforms with a >> quirk ECAM implementation since there's no lockless method to access the >> configuration space, on other platforms we don't need to reserve the MCFG >> resources (so can remove it safely). >> >> So we need to add another patch to handle the case of tegra194 and xgene... >> I will try to figure it out. > > I looked through these again and found another problem case (thunder). > Here are my notes from my research. > > Normal ECAM users require no device-specific support. The platform > supplies an MCFG table, the generic code works, no mutual exclusion is > required, and APEI doesn't need to reserve the MCFG areas. > > The problem cases are platforms that supply an MCFG table but require > some device-specific workarounds. We can identify these because they > have quirks in pci-mcfg.c. Here are the existing quirks and the > pci_ecam_ops structs they supply: > > AL_ECAM al_pcie_ops # OK > QCOM_ECAM32 pci_32b_ops # OK > HISI_QUAD_DOM hisi_pcie_ops # OK > THUNDER_PEM_QUIRK thunder_pem_ecam_ops # problem > THUNDER_PEM_QUIRK thunder_pem_ecam_ops # problem > THUNDER_ECAM_QUIRK pci_thunder_ecam_ops # OK > tegra tegra194_pcie_ops # problem > XGENE_V1_ECAM_MCFG xgene_v1_pcie_ecam_ops # problem > XGENE_V2_ECAM_MCFG xgene_v2_pcie_ecam_ops # problem > ALTRA_ECAM_QUIRK pci_32b_read_ops # OK > > The ones marked "OK" have .map_bus(), .read(), and .write() methods > that need no mutual exclusion because they boil down to just a single > MMIO load or store. These are fine and there shouldn't be a problem > if an EINJ action accesses the ECAM space. > > The others do require mutual exclusion: > > - thunder_pem_ecam_ops: thunder_pem_config_read() calls > thunder_pem_bridge_read(), which does a writeq() to PEM_CFG_RD > followed by a readq(). The writeq() and readq() must be atomic to > avoid corruption. > > - tegra194_pcie_ops: tegra194_map_bus() programs the ATU. This and > the subsequent ECAM read/write must be atomic. > > - xgene_v1_pcie_ecam_ops and xgene_v2_pcie_ecam_ops: > xgene_pcie_map_bus() sets the RTID. This and the subsequent ECAM > read/write must be atomic. > > I had to look at all these ops individually to find them, so I don't > see an easy way to identify these problem cases at run-time. > > I personally would not have an issue with having APEI try to reserve > the MCFG regions for any platform that has an MCFG quirk. That would > prevent the al, qcom, hisi, thunder-ecam, and altra drivers from using > EINJ even though it would probably be safe for them. But we already > know those platforms are not really ACPI-compliant, so ... OK, understood. Since those platforms are not really ACPI-compliant, so we can unify all the quirks together. Let me send an inital solution about this for your review and see if there's room for further improvement... Thanks, Xuesong > > Bjorn >
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c index 0b961fe6..12f7d96 100644 --- a/arch/x86/pci/mmconfig-shared.c +++ b/arch/x86/pci/mmconfig-shared.c @@ -605,32 +605,6 @@ static int __init pci_parse_mcfg(struct acpi_table_header *header) return 0; } -#ifdef CONFIG_ACPI_APEI -extern int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, - void *data), void *data); - -static int pci_mmcfg_for_each_region(int (*func)(__u64 start, __u64 size, - void *data), void *data) -{ - struct pci_mmcfg_region *cfg; - int rc; - - if (list_empty(&pci_mmcfg_list)) - return 0; - - list_for_each_entry(cfg, &pci_mmcfg_list, list) { - rc = func(cfg->res.start, resource_size(&cfg->res), data); - if (rc) - return rc; - } - - return 0; -} -#define set_apei_filter() (arch_apei_filter_addr = pci_mmcfg_for_each_region) -#else -#define set_apei_filter() -#endif - static void __init __pci_mmcfg_init(int early) { pci_mmcfg_reject_broken(early); @@ -665,8 +639,6 @@ void __init pci_mmcfg_early_init(void) else acpi_table_parse(ACPI_SIG_MCFG, pci_parse_mcfg); __pci_mmcfg_init(1); - - set_apei_filter(); } } diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c index c7fdb12..daae75a 100644 --- a/drivers/acpi/apei/apei-base.c +++ b/drivers/acpi/apei/apei-base.c @@ -21,6 +21,7 @@ #include <linux/kernel.h> #include <linux/module.h> #include <linux/init.h> +#include <linux/pci.h> #include <linux/acpi.h> #include <linux/slab.h> #include <linux/io.h> @@ -448,13 +449,34 @@ static int apei_get_nvs_resources(struct apei_resources *resources) return acpi_nvs_for_each_region(apei_get_res_callback, resources); } -int (*arch_apei_filter_addr)(int (*func)(__u64 start, __u64 size, - void *data), void *data); -static int apei_get_arch_resources(struct apei_resources *resources) +#ifdef CONFIG_PCI +extern struct list_head pci_mmcfg_list; +static int apei_filter_mcfg_addr(struct apei_resources *res, + struct apei_resources *mcfg_res) +{ + int rc = 0; + struct pci_mmcfg_region *cfg; + + if (list_empty(&pci_mmcfg_list)) + return 0; + + apei_resources_init(mcfg_res); + list_for_each_entry(cfg, &pci_mmcfg_list, list) { + rc = apei_res_add(&mcfg_res->iomem, cfg->res.start, resource_size(&cfg->res)); + if (rc) + return rc; + } + /* filter the mcfg resource from current APEI's */ + return apei_resources_sub(res, mcfg_res); +} +#else +static inline int apei_filter_mcfg_addr(struct apei_resources *res, + struct apei_resources *mcfg_res) { - return arch_apei_filter_addr(apei_get_res_callback, resources); + return 0; } +#endif /* * IO memory/port resource management mechanism is used to check @@ -486,15 +508,9 @@ int apei_resources_request(struct apei_resources *resources, if (rc) goto nvs_res_fini; - if (arch_apei_filter_addr) { - apei_resources_init(&arch_res); - rc = apei_get_arch_resources(&arch_res); - if (rc) - goto arch_res_fini; - rc = apei_resources_sub(resources, &arch_res); - if (rc) - goto arch_res_fini; - } + rc = apei_filter_mcfg_addr(resources, &arch_res); + if (rc) + goto arch_res_fini; rc = -EINVAL; list_for_each_entry(res, &resources->iomem, list) { @@ -544,8 +560,7 @@ int apei_resources_request(struct apei_resources *resources, release_mem_region(res->start, res->end - res->start); } arch_res_fini: - if (arch_apei_filter_addr) - apei_resources_fini(&arch_res); + apei_resources_fini(&arch_res); nvs_res_fini: apei_resources_fini(&nvs_resources); return rc;