diff mbox series

[RFC,1/1] PCI/ACPI: Make acpi_pci_root_validate_resources() reject IOMEM resources which start at address 0

Message ID 20210615102555.6035-2-hdegoede@redhat.com (mailing list archive)
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series PCI/ACPI: Make acpi_pci_root_validate_resources() reject IOMEM resources which start at address 0 | expand

Commit Message

Hans de Goede June 15, 2021, 10:25 a.m. UTC
On some Lenovo laptops the base-addrsss of some PCI devices is left
at 0 at boot:

[    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
[    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
[    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
[    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]

There is a _CRS method for these devices, which simply returns the
configured 0 address. This is causing the PCI core to not assign
memory to these PCI devices and is causing these errors:

[    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
[    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
[    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
[    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]

This happens specifically for the designware I2C PCI devices on these
laptops, causing I2C-HID attached touchpads/touchscreens to not work.

Booting with nocrs on these devices results in the kernel itself
assigning memory to these devices, fixing things:

[    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
[    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]

At least the following models are known to be affected by this (but there
might be more):

Lenovo IdeaPad 3 15IIL05 81WE
Lenovo IdeaPad 5 14IIL05 81YH

Add an extra check for the base-address being 0 to
acpi_pci_root_validate_resources() and reject IOMEM resources where this
is the case, to fix this issue.

BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
Note we could instead add the known to be affected models to the
pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
systems are affected and a more generic fix seems better in general.
---
 drivers/acpi/pci_root.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Rafael J. Wysocki June 15, 2021, 10:59 a.m. UTC | #1
On Tue, Jun 15, 2021 at 12:26 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> On some Lenovo laptops the base-addrsss of some PCI devices is left
> at 0 at boot:
>
> [    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
> [    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
> [    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
> [    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
>
> There is a _CRS method for these devices, which simply returns the
> configured 0 address. This is causing the PCI core to not assign
> memory to these PCI devices and is causing these errors:
>
> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
>
> This happens specifically for the designware I2C PCI devices on these
> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
>
> Booting with nocrs on these devices results in the kernel itself
> assigning memory to these devices, fixing things:
>
> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
>
> At least the following models are known to be affected by this (but there
> might be more):
>
> Lenovo IdeaPad 3 15IIL05 81WE
> Lenovo IdeaPad 5 14IIL05 81YH
>
> Add an extra check for the base-address being 0 to
> acpi_pci_root_validate_resources() and reject IOMEM resources where this
> is the case, to fix this issue.
>
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> ---
> Note we could instead add the known to be affected models to the
> pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
> systems are affected and a more generic fix seems better in general.

Also, a memory resource starting at 0 is not usable in Linux anyway
AFAICS, at least on x86.

> ---
>  drivers/acpi/pci_root.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index dcd593766a64..6cd2ca551005 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -686,6 +686,13 @@ static void acpi_pci_root_validate_resources(struct device *dev,
>                 if (!(res1->flags & type))
>                         goto next;
>
> +               if ((type & IORESOURCE_MEM) && res1->start == 0) {
> +                       dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
> +                                res1);

Makes sense to me (small nit: I wouldn't break the like above).

> +                       free = true;
> +                       goto next;
> +               }
> +
>                 /* Exclude non-addressable range or non-addressable portion */
>                 end = min(res1->end, root->end);
>                 if (end <= res1->start) {
> --
Hans de Goede June 15, 2021, 11:33 a.m. UTC | #2
Hi,

On 6/15/21 12:59 PM, Rafael J. Wysocki wrote:
> On Tue, Jun 15, 2021 at 12:26 PM Hans de Goede <hdegoede@redhat.com> wrote:
>>
>> On some Lenovo laptops the base-addrsss of some PCI devices is left
>> at 0 at boot:
>>
>> [    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
>> [    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
>> [    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
>> [    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
>>
>> There is a _CRS method for these devices, which simply returns the
>> configured 0 address. This is causing the PCI core to not assign
>> memory to these PCI devices and is causing these errors:
>>
>> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
>>
>> This happens specifically for the designware I2C PCI devices on these
>> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
>>
>> Booting with nocrs on these devices results in the kernel itself
>> assigning memory to these devices, fixing things:
>>
>> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
>> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
>>
>> At least the following models are known to be affected by this (but there
>> might be more):
>>
>> Lenovo IdeaPad 3 15IIL05 81WE
>> Lenovo IdeaPad 5 14IIL05 81YH
>>
>> Add an extra check for the base-address being 0 to
>> acpi_pci_root_validate_resources() and reject IOMEM resources where this
>> is the case, to fix this issue.
>>
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
>> BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
>> ---
>> Note we could instead add the known to be affected models to the
>> pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
>> systems are affected and a more generic fix seems better in general.
> 
> Also, a memory resource starting at 0 is not usable in Linux anyway
> AFAICS, at least on x86.

Right, but I was wondering about other arches. Maybe wrap the new check in
"#if IS_ENABLED(CONFIG_X86)" ?

Regards,

Hans



> 
>> ---
>>  drivers/acpi/pci_root.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
>> index dcd593766a64..6cd2ca551005 100644
>> --- a/drivers/acpi/pci_root.c
>> +++ b/drivers/acpi/pci_root.c
>> @@ -686,6 +686,13 @@ static void acpi_pci_root_validate_resources(struct device *dev,
>>                 if (!(res1->flags & type))
>>                         goto next;
>>
>> +               if ((type & IORESOURCE_MEM) && res1->start == 0) {
>> +                       dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
>> +                                res1);
> 
> Makes sense to me (small nit: I wouldn't break the like above).
> 
>> +                       free = true;
>> +                       goto next;
>> +               }
>> +
>>                 /* Exclude non-addressable range or non-addressable portion */
>>                 end = min(res1->end, root->end);
>>                 if (end <= res1->start) {
>> --
>
Rafael J. Wysocki June 15, 2021, 11:52 a.m. UTC | #3
On Tue, Jun 15, 2021 at 1:33 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi,
>
> On 6/15/21 12:59 PM, Rafael J. Wysocki wrote:
> > On Tue, Jun 15, 2021 at 12:26 PM Hans de Goede <hdegoede@redhat.com> wrote:
> >>
> >> On some Lenovo laptops the base-addrsss of some PCI devices is left
> >> at 0 at boot:
> >>
> >> [    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
> >> [    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
> >> [    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
> >> [    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
> >>
> >> There is a _CRS method for these devices, which simply returns the
> >> configured 0 address. This is causing the PCI core to not assign
> >> memory to these PCI devices and is causing these errors:
> >>
> >> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> >> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> >> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
> >> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
> >>
> >> This happens specifically for the designware I2C PCI devices on these
> >> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
> >>
> >> Booting with nocrs on these devices results in the kernel itself
> >> assigning memory to these devices, fixing things:
> >>
> >> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
> >> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
> >>
> >> At least the following models are known to be affected by this (but there
> >> might be more):
> >>
> >> Lenovo IdeaPad 3 15IIL05 81WE
> >> Lenovo IdeaPad 5 14IIL05 81YH
> >>
> >> Add an extra check for the base-address being 0 to
> >> acpi_pci_root_validate_resources() and reject IOMEM resources where this
> >> is the case, to fix this issue.
> >>
> >> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
> >> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> >> BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
> >> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> >> ---
> >> Note we could instead add the known to be affected models to the
> >> pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
> >> systems are affected and a more generic fix seems better in general.
> >
> > Also, a memory resource starting at 0 is not usable in Linux anyway
> > AFAICS, at least on x86.
>
> Right, but I was wondering about other arches. Maybe wrap the new check in
> "#if IS_ENABLED(CONFIG_X86)" ?

That would be prudent, because all of the known affected machines are x86.

> >> ---
> >>  drivers/acpi/pci_root.c | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> >> index dcd593766a64..6cd2ca551005 100644
> >> --- a/drivers/acpi/pci_root.c
> >> +++ b/drivers/acpi/pci_root.c
> >> @@ -686,6 +686,13 @@ static void acpi_pci_root_validate_resources(struct device *dev,
> >>                 if (!(res1->flags & type))
> >>                         goto next;
> >>
> >> +               if ((type & IORESOURCE_MEM) && res1->start == 0) {
> >> +                       dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
> >> +                                res1);
> >
> > Makes sense to me (small nit: I wouldn't break the like above).
> >
> >> +                       free = true;
> >> +                       goto next;
> >> +               }
> >> +
> >>                 /* Exclude non-addressable range or non-addressable portion */
> >>                 end = min(res1->end, root->end);
> >>                 if (end <= res1->start) {
> >> --
> >
>
Bjorn Helgaas June 15, 2021, 8:23 p.m. UTC | #4
On Tue, Jun 15, 2021 at 12:25:55PM +0200, Hans de Goede wrote:
> On some Lenovo laptops the base-addrsss of some PCI devices is left
> at 0 at boot:
> 
> [    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
> [    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
> [    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
> [    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]

s/base-addrsss/base-address/
Timestamps aren't relevant here and can be trimmed.

It's not really an error if BIOS leaves a PCI BAR unprogrammed.

> There is a _CRS method for these devices, which simply returns the
> configured 0 address. This is causing the PCI core to not assign
> memory to these PCI devices and is causing these errors:
> 
> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]

I'm confused.  Did you say there's a _CRS method for these *PCI*
devices (0000:00:15.0, 0000:00:15.1)?

I suspect you mean the *host bridge* has a _CRS method, since I think
acpi_pci_root_validate_resources() is looking at contents of the host
bridge _CRS.

On x86, it would likely be a BIOS defect is a host bridge _CRS had a
memory window starting at 0, but you didn't show what _CRS contained.

The dmesg at https://bugzilla.redhat.com/show_bug.cgi?id=1871793 shows
this, which looks totally normal and should be unaffected by the patch
since there's no memory window starting at 0:

  pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
  pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
  pci_bus 0000:00: root bus resource [mem 0x6d400000-0xbfffffff window]

The ones at https://bugzilla.redhat.com/show_bug.cgi?id=1868899 and
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
are similar, so I can't quite connect the dots here.

> This happens specifically for the designware I2C PCI devices on these
> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
> 
> Booting with nocrs on these devices results in the kernel itself
> assigning memory to these devices, fixing things:

"pci=nocrs" to help people repro this or try the same workaround
elsewhere.

> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
> 
> At least the following models are known to be affected by this (but there
> might be more):
> 
> Lenovo IdeaPad 3 15IIL05 81WE
> Lenovo IdeaPad 5 14IIL05 81YH
> 
> Add an extra check for the base-address being 0 to
> acpi_pci_root_validate_resources() and reject IOMEM resources where this
> is the case, to fix this issue.
> 
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> ---
> Note we could instead add the known to be affected models to the
> pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
> systems are affected and a more generic fix seems better in general.

Definitely good to avoid pci_crs_quirks[] because throwing away _CRS
completely leads us into uncharted waters, especially for multi-host
bridge systems that support hotplug.

> ---
>  drivers/acpi/pci_root.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index dcd593766a64..6cd2ca551005 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -686,6 +686,13 @@ static void acpi_pci_root_validate_resources(struct device *dev,
>  		if (!(res1->flags & type))
>  			goto next;
>  
> +		if ((type & IORESOURCE_MEM) && res1->start == 0) {
> +			dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
> +				 res1);
> +			free = true;
> +			goto next;
> +		}
> +
>  		/* Exclude non-addressable range or non-addressable portion */
>  		end = min(res1->end, root->end);
>  		if (end <= res1->start) {
> -- 
> 2.31.1
>
Hans de Goede June 16, 2021, 6:43 p.m. UTC | #5
Hi,

On 6/15/21 10:23 PM, Bjorn Helgaas wrote:
> On Tue, Jun 15, 2021 at 12:25:55PM +0200, Hans de Goede wrote:
>> On some Lenovo laptops the base-addrsss of some PCI devices is left
>> at 0 at boot:
>>
>> [    0.283145] pci 0000:00:15.0: [8086:34e8] type 00 class 0x0c8000
>> [    0.283217] pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
>> [    0.285117] pci 0000:00:15.1: [8086:34e9] type 00 class 0x0c8000
>> [    0.285189] pci 0000:00:15.1: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
> 
> s/base-addrsss/base-address/
> Timestamps aren't relevant here and can be trimmed.

Ack.

> It's not really an error if BIOS leaves a PCI BAR unprogrammed.
> 
>> There is a _CRS method for these devices, which simply returns the
>> configured 0 address. This is causing the PCI core to not assign
>> memory to these PCI devices and is causing these errors:
>>
>> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
> 
> I'm confused.  Did you say there's a _CRS method for these *PCI*
> devices (0000:00:15.0, 0000:00:15.1)?

That is what I intended to say, but TBH I don't fully grasp the
problem at hand yet and I have not checked this in an actual acpidump.

I would need to see if someone attached an acpidump for one of the
affected devices to some bug and then I can check...

> I suspect you mean the *host bridge* has a _CRS method, since I think
> acpi_pci_root_validate_resources() is looking at contents of the host
> bridge _CRS.

You would expect that from the name, but when I was checking for the
handling of the pci=nocrs kernel-commandline option the only
place dealing with this which I could find are:

arch/x86/pci/common.c, which sets does "pci_probe |= PCI_ROOT_NO_CRS"
where pci_probe is a set of global flags

and:

arch/x86/pci/acpi.c, which sets a static bool pci_use_crs; at
the global level based on this, and the only place where
pci_use_crs is checked is here:

static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
{
        struct acpi_device *device = ci->bridge;
        int busnum = ci->root->secondary.start;
        struct resource_entry *entry, *tmp;
        int status;

        status = acpi_pci_probe_root_resources(ci);
        if (pci_use_crs) {
                resource_list_for_each_entry_safe(entry, tmp, &ci->resources)
                        if (resource_is_pcicfg_ioport(entry->res))
                                resource_list_destroy_entry(entry);
                return status;
        }

And the patch which we are discussing here influences the
resource-list returned by acpi_pci_probe_root_resources(ci).

I was a bit surprised about this to, because as you say it looks like
this code is only about the root windows, but I could not find any
other code dealing with pci=nocrs and multiple users have confirmed
that using pci=nocrs helps.

So I ended up writing this patch; and then just asking users to
test and see what happens, but the patch might very well be
completely wrong, esp. since it indeed seems this only affects
root resources...

> On x86, it would likely be a BIOS defect is a host bridge _CRS had a
> memory window starting at 0, but you didn't show what _CRS contained.
> 
> The dmesg at https://bugzilla.redhat.com/show_bug.cgi?id=1871793 shows
> this, which looks totally normal and should be unaffected by the patch
> since there's no memory window starting at 0:
> 
>   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
>   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
>   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
>   pci_bus 0000:00: root bus resource [mem 0x6d400000-0xbfffffff window]
> 
> The ones at https://bugzilla.redhat.com/show_bug.cgi?id=1868899 and
> https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
> are similar, so I can't quite connect the dots here.

I've 2 dmesgs from runs both with and without pci=nocrs, the one
with a clean kernel commandline (no special options) yields:

[    0.312333] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    0.312335] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.312336] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.312337] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
[    0.312338] pci_bus 0000:00: root bus resource [bus 00-fe]

Where as the one with pci=nocrs on the kernel commandline gives:

[    0.271766] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    0.271767] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
[    0.271768] pci_bus 0000:00: root bus resource [bus 00-fe]

Hmm, so assuming that you are right that pci=nocrs only influences
the root resources (and I believe you are), and given that the problem is
that we are getting these errors:

[    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
[    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
[    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
[    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
[    0.655342] pci 0000:00:1f.5: BAR 0: no space for [mem size 0x00001000]

Instead of getting this:

[    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
[    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]

So now I believe that my initial theory for this is probably completely wrong; and
I wonder if the issue is that the _CRS returned root IOMEM window is big enough
to exactly hold the BIOS assigned mappings, but it does not have any free space
allowing the kernel to assign space for the 0000:00:15.0 and 0000:00:15.1
devices ?

Assuming that that theory is right, how could we work around this problem?
Or at least do a quick debug patch to confirm that indeed the window is "full" ?

>> This happens specifically for the designware I2C PCI devices on these
>> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
>>
>> Booting with nocrs on these devices results in the kernel itself
>> assigning memory to these devices, fixing things:
> 
> "pci=nocrs" to help people repro this or try the same workaround
> elsewhere.

Not sure what you are trying to say here.

> 
>> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
>> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
>>
>> At least the following models are known to be affected by this (but there
>> might be more):
>>
>> Lenovo IdeaPad 3 15IIL05 81WE
>> Lenovo IdeaPad 5 14IIL05 81YH
>>
>> Add an extra check for the base-address being 0 to
>> acpi_pci_root_validate_resources() and reject IOMEM resources where this
>> is the case, to fix this issue.
>>
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
>> BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279
>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
>> ---
>> Note we could instead add the known to be affected models to the
>> pci_crs_quirks table in arch/x86/pci/acpi.c, but it is likely that more
>> systems are affected and a more generic fix seems better in general.
> 
> Definitely good to avoid pci_crs_quirks[] because throwing away _CRS
> completely leads us into uncharted waters, especially for multi-host
> bridge systems that support hotplug.

ack.

Regards,

Hans



> 
>> ---
>>  drivers/acpi/pci_root.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
>> index dcd593766a64..6cd2ca551005 100644
>> --- a/drivers/acpi/pci_root.c
>> +++ b/drivers/acpi/pci_root.c
>> @@ -686,6 +686,13 @@ static void acpi_pci_root_validate_resources(struct device *dev,
>>  		if (!(res1->flags & type))
>>  			goto next;
>>  
>> +		if ((type & IORESOURCE_MEM) && res1->start == 0) {
>> +			dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
>> +				 res1);
>> +			free = true;
>> +			goto next;
>> +		}
>> +
>>  		/* Exclude non-addressable range or non-addressable portion */
>>  		end = min(res1->end, root->end);
>>  		if (end <= res1->start) {
>> -- 
>> 2.31.1
>>
>
Bjorn Helgaas June 16, 2021, 10:57 p.m. UTC | #6
On Wed, Jun 16, 2021 at 08:43:12PM +0200, Hans de Goede wrote:
> On 6/15/21 10:23 PM, Bjorn Helgaas wrote:
> > On Tue, Jun 15, 2021 at 12:25:55PM +0200, Hans de Goede wrote:

> I've 2 dmesgs from runs both with and without pci=nocrs, the one
> with a clean kernel commandline (no special options) yields:
> 
> [    0.312333] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> [    0.312335] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> [    0.312336] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> [    0.312337] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
> [    0.312338] pci_bus 0000:00: root bus resource [bus 00-fe]
> 
> Where as the one with pci=nocrs on the kernel commandline gives:
> 
> [    0.271766] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> [    0.271767] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
> [    0.271768] pci_bus 0000:00: root bus resource [bus 00-fe]
> 
> Hmm, so assuming that you are right that pci=nocrs only influences
> the root resources (and I believe you are), and given that the problem is
> that we are getting these errors:
> 
> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
> [    0.655342] pci 0000:00:1f.5: BAR 0: no space for [mem size 0x00001000]
> 
> Instead of getting this:
> 
> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
> 
> So now I believe that my initial theory for this is probably completely wrong; and
> I wonder if the issue is that the _CRS returned root IOMEM window is big enough
> to exactly hold the BIOS assigned mappings, but it does not have any free space
> allowing the kernel to assign space for the 0000:00:15.0 and 0000:00:15.1
> devices ?
>
> Assuming that that theory is right, how could we work around this problem?
> Or at least do a quick debug patch to confirm that indeed the window is "full" ?

I'd be pretty surprised if the host bridge window actually full --
[mem 0x65400000-0xbfffffff] is a pretty big range and these devices
only need 4K each.

But maybe we aren't smart enough when trying to allocate space.
Places like __pci_bus_size_bridges() and __pci_assign_resource() are
full of assumptions about what PCI BARs can go where, depending on
64bit-ness, prefetchability, etc.

Maybe instrumenting those allocation paths would give some insight.
Possibly we should go ahead and merge some permanent pci_dbg() stuff
there too.

I do note that the working "pci=nocrs" case puts these devices above
4GB.  _CRS only told you about host bridge windows *below* 4GB, and
Linux will never assign space that's outside the windows (except in
the "pci=nocrs" case, of course).

> >> This happens specifically for the designware I2C PCI devices on these
> >> laptops, causing I2C-HID attached touchpads/touchscreens to not work.
> >>
> >> Booting with nocrs on these devices results in the kernel itself
> >> assigning memory to these devices, fixing things:
> > 
> > "pci=nocrs" to help people repro this or try the same workaround
> > elsewhere.
> 
> Not sure what you are trying to say here.

Sorry, I just meant that instead of "Booting with nocrs ...", I'd like
the commit log to say "Booting with 'pci=nocrs' ..." so that
non-expert users reading it will have a bit of a head start on how to
try this themselves.

Bjorn
Hans de Goede June 21, 2021, 11:49 a.m. UTC | #7
Hi Bjorn,

On 6/17/21 12:57 AM, Bjorn Helgaas wrote:
> On Wed, Jun 16, 2021 at 08:43:12PM +0200, Hans de Goede wrote:
>> On 6/15/21 10:23 PM, Bjorn Helgaas wrote:
>>> On Tue, Jun 15, 2021 at 12:25:55PM +0200, Hans de Goede wrote:
> 
>> I've 2 dmesgs from runs both with and without pci=nocrs, the one
>> with a clean kernel commandline (no special options) yields:
>>
>> [    0.312333] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
>> [    0.312335] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
>> [    0.312336] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
>> [    0.312337] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
>> [    0.312338] pci_bus 0000:00: root bus resource [bus 00-fe]
>>
>> Where as the one with pci=nocrs on the kernel commandline gives:
>>
>> [    0.271766] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
>> [    0.271767] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
>> [    0.271768] pci_bus 0000:00: root bus resource [bus 00-fe]
>>
>> Hmm, so assuming that you are right that pci=nocrs only influences
>> the root resources (and I believe you are), and given that the problem is
>> that we are getting these errors:
>>
>> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
>> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
>> [    0.655342] pci 0000:00:1f.5: BAR 0: no space for [mem size 0x00001000]
>>
>> Instead of getting this:
>>
>> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
>> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
>>
>> So now I believe that my initial theory for this is probably completely wrong; and
>> I wonder if the issue is that the _CRS returned root IOMEM window is big enough
>> to exactly hold the BIOS assigned mappings, but it does not have any free space
>> allowing the kernel to assign space for the 0000:00:15.0 and 0000:00:15.1
>> devices ?
>>
>> Assuming that that theory is right, how could we work around this problem?
>> Or at least do a quick debug patch to confirm that indeed the window is "full" ?
> 
> I'd be pretty surprised if the host bridge window actually full --
> [mem 0x65400000-0xbfffffff] is a pretty big range and these devices
> only need 4K each.

Yeah, I just checked and the highest used IOMEM window ends at
0x811300ff leaving plenty of space after it for the 2 new windows.

> But maybe we aren't smart enough when trying to allocate space.
> Places like __pci_bus_size_bridges() and __pci_assign_resource() are
> full of assumptions about what PCI BARs can go where, depending on
> 64bit-ness, prefetchability, etc.
> 
> Maybe instrumenting those allocation paths would give some insight.
> Possibly we should go ahead and merge some permanent pci_dbg() stuff
> there too.

I agree that this seems to be the most likely issue. I've build
a Fedora kernel pkg with some extra debugging added for the reporter
to be test.  Since I'm reliant on a debug cycle where I provide
a kernel and then the reporter comes back with a new debug,
and then rince-repeat it might be a while before I get back to you
on this.  Hopefully when I do get back I will have figured out
what the problem is.

(and in case it was not clear yet the RFC patch which started this
discussion clearly is bogus and can be considered dropped).

> I do note that the working "pci=nocrs" case puts these devices above
> 4GB.  _CRS only told you about host bridge windows *below* 4GB, and
> Linux will never assign space that's outside the windows (except in
> the "pci=nocrs" case, of course).

That is not unexpected, pci_bus_alloc_resource() will first try the
pci_high region which starts at 0x100000000. But it is definitely
an interesting data point.

Regards,

Hans
Bjorn Helgaas June 21, 2021, 12:37 p.m. UTC | #8
On Mon, Jun 21, 2021 at 01:49:04PM +0200, Hans de Goede wrote:
> Hi Bjorn,
> 
> On 6/17/21 12:57 AM, Bjorn Helgaas wrote:
> > On Wed, Jun 16, 2021 at 08:43:12PM +0200, Hans de Goede wrote:
> >> On 6/15/21 10:23 PM, Bjorn Helgaas wrote:
> >>> On Tue, Jun 15, 2021 at 12:25:55PM +0200, Hans de Goede wrote:
> > 
> >> I've 2 dmesgs from runs both with and without pci=nocrs, the one
> >> with a clean kernel commandline (no special options) yields:
> >>
> >> [    0.312333] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> >> [    0.312335] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> >> [    0.312336] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> >> [    0.312337] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
> >> [    0.312338] pci_bus 0000:00: root bus resource [bus 00-fe]
> >>
> >> Where as the one with pci=nocrs on the kernel commandline gives:
> >>
> >> [    0.271766] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> >> [    0.271767] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
> >> [    0.271768] pci_bus 0000:00: root bus resource [bus 00-fe]
> >>
> >> Hmm, so assuming that you are right that pci=nocrs only influences
> >> the root resources (and I believe you are), and given that the problem is
> >> that we are getting these errors:
> >>
> >> [    0.655335] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> >> [    0.655337] pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> >> [    0.655339] pci 0000:00:15.1: BAR 0: no space for [mem size 0x00001000 64bit]
> >> [    0.655340] pci 0000:00:15.1: BAR 0: failed to assign [mem size 0x00001000 64bit]
> >> [    0.655342] pci 0000:00:1f.5: BAR 0: no space for [mem size 0x00001000]
> >>
> >> Instead of getting this:
> >>
> >> [    0.355716] pci 0000:00:15.0: BAR 0: assigned [mem 0x29c000000-0x29c000fff 64bit]
> >> [    0.355783] pci 0000:00:15.1: BAR 0: assigned [mem 0x29c001000-0x29c001fff 64bit]
> >>
> >> So now I believe that my initial theory for this is probably completely wrong; and
> >> I wonder if the issue is that the _CRS returned root IOMEM window is big enough
> >> to exactly hold the BIOS assigned mappings, but it does not have any free space
> >> allowing the kernel to assign space for the 0000:00:15.0 and 0000:00:15.1
> >> devices ?
> >>
> >> Assuming that that theory is right, how could we work around this problem?
> >> Or at least do a quick debug patch to confirm that indeed the window is "full" ?
> > 
> > I'd be pretty surprised if the host bridge window actually full --
> > [mem 0x65400000-0xbfffffff] is a pretty big range and these devices
> > only need 4K each.
> 
> Yeah, I just checked and the highest used IOMEM window ends at
> 0x811300ff leaving plenty of space after it for the 2 new windows.
> 
> > But maybe we aren't smart enough when trying to allocate space.
> > Places like __pci_bus_size_bridges() and __pci_assign_resource()
> > are full of assumptions about what PCI BARs can go where,
> > depending on 64bit-ness, prefetchability, etc.
> > 
> > Maybe instrumenting those allocation paths would give some
> > insight.  Possibly we should go ahead and merge some permanent
> > pci_dbg() stuff there too.
> 
> I agree that this seems to be the most likely issue. I've build a
> Fedora kernel pkg with some extra debugging added for the reporter
> to be test.  Since I'm reliant on a debug cycle where I provide a
> kernel and then the reporter comes back with a new debug, and then
> rince-repeat it might be a while before I get back to you on this.
> Hopefully when I do get back I will have figured out what the
> problem is.

Thanks for the update.  It's a real hassle to debug issues in this
code, especially when it's a long repro cycle like this.  I always
wish we could feed the initial config (which I think we should already
know from existing dmesg logging) into some kind of qemu test fixture
so we could reproduce things locally.

Bjorn
Hans de Goede July 1, 2021, 2:21 p.m. UTC | #9
Hi Bjorn,

On 6/21/21 1:49 PM, Hans de Goede wrote:

<snip>

>> But maybe we aren't smart enough when trying to allocate space.
>> Places like __pci_bus_size_bridges() and __pci_assign_resource() are
>> full of assumptions about what PCI BARs can go where, depending on
>> 64bit-ness, prefetchability, etc.
>>
>> Maybe instrumenting those allocation paths would give some insight.
>> Possibly we should go ahead and merge some permanent pci_dbg() stuff
>> there too.
> 
> I agree that this seems to be the most likely issue. I've build
> a Fedora kernel pkg with some extra debugging added for the reporter
> to be test.  Since I'm reliant on a debug cycle where I provide
> a kernel and then the reporter comes back with a new debug,
> and then rince-repeat it might be a while before I get back to you
> on this.  Hopefully when I do get back I will have figured out
> what the problem is.
> 
> (and in case it was not clear yet the RFC patch which started this
> discussion clearly is bogus and can be considered dropped).

Ok, I believe I've figured out what the root cause of things
failing is here.

I've added a couple of pr_info-s to kernel/resource.c __find_resource()
like this:

                pr_info("__find_resource() trying 0x%llx - 0x%llx\n", tmp.start, tmp.end);
                if (tmp.end < tmp.start)
                        goto next;

                resource_clip(&tmp, constraint->min, constraint->max);
                arch_remove_reservations(&tmp);
                pr_info("__find_resource() after clip + arch-reserv 0x%llx - 0x%llx\n", tmp.start, tmp.end);

And the following (relevant) messages get logged by these 2 pr_info-s
when trying to allocate a resource for the unassigned IOMEM BAR
of the I2C controller on the troublesome laptops:

[    1.262423] resource: allocate_resource root [mem 0x65400000-0xbfffffff window] new [mem size 0x00001000 64bit] size 0x1000 min 0x65400000 max 0xbfffffff align 0x1000
[    1.262426] resource: __find_resource() trying 0x65400000 - 0x6fffffff
[    1.262428] resource: __find_resource() after clip + arch-reserv 0xd0000000 - 0x6fffffff
...
[    1.262543] resource: __find_resource() trying 0x8122d000 - 0x8122efff
[    1.262544] resource: __find_resource() after clip + arch-reserv 0xd0000000 - 0x8122efff
etc.

Notice that the regions start at 0xd0000000 rather then their original start
address after these 2 calls:

                resource_clip(&tmp, constraint->min, constraint->max);
                arch_remove_reservations(&tmp);

I believe that this is caused by the arch_remove_reservations call,
which applies e820 reservations and the e820 table has the following:

[    0.000000] BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved

So the e820 table is basically reserving the entire main PCI bus iomem
window, which runs from 0x65400000-0xbfffffff :

[    0.609979] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    0.609983] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.609986] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.609988] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
[    0.609991] pci_bus 0000:00: root bus resource [bus 00-fe]

One way of fixing this would be to check CRS provided PCI root bridge windows
vs e820 reservations and remove (carve out) the PCI root bridge iomem windows
from any e820 reservations. Probably with some warning about the firmware
being buggy.   I'm not sure if this won't cause issues else where though.

So before I spend time on trying to code something like this, any other
ideas / suggestions how to fix this ?

Regards,

Hans
diff mbox series

Patch

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index dcd593766a64..6cd2ca551005 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -686,6 +686,13 @@  static void acpi_pci_root_validate_resources(struct device *dev,
 		if (!(res1->flags & type))
 			goto next;
 
+		if ((type & IORESOURCE_MEM) && res1->start == 0) {
+			dev_info(dev, "host bridge window %pR (ignored, start address not set)\n",
+				 res1);
+			free = true;
+			goto next;
+		}
+
 		/* Exclude non-addressable range or non-addressable portion */
 		end = min(res1->end, root->end);
 		if (end <= res1->start) {