Message ID | 161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | PCI: Allow drivers to claim exclusive access to config regions | expand |
On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > configuration cycles. It assumes one initiator at a time is > reading/writing the data registers. That sounds like a horrible protocol for a multi-processor system. Where is it described and who can we go complain to for creating such a mess? > If userspace reads from the response > data payload it may steal data that a kernel driver was expecting to > read. If userspace writes to the request payload it may corrupt the > request a driver was trying to send. Fun! So you want to keep root in userspace from doing this? I thought we already do that today? > Introduce pci_{request,release}_config_region() for a driver to exclude > the possibility of userspace induced corruption while accessing the DOE > mailbox. Likely there are other configuration state assumptions that a > driver may want to assert are under its exclusive control, so this > capability is not limited to any specific configuration range. As you do not have a user for these functions, it's hard to see how they would be used. We also really can't add new apis with no in-tree users, so do you have a patch series that requires this functionality somewhere? > Since writes are targeted and are already prepared for failure the > entire request is failed. The same can not be done for reads as the > device completely disappears from lspci output if any configuration > register in the request is exclusive. Instead skip the actual > configuration cycle on a per-access basis and return all f's as if the > read had failed. returning all ff is a huge hint to many drivers that the device is gone, not that it just failed. So what happens to code that thinks that and then tears stuff down as if the device has been removed? Trying to protect drivers from userspace here feels odd, what userspace tools are trying to access these devices while they are under "exclusive" control from the kernel? lspci not running as root should not be doing anything crazy, but if you want to run it as root, shouldn't you be allowed to access it properly? What hardware has this problem that we need to claim exclusive ownership over that differs from the old hardware we used to have that would do crazy things when reading from from userspace? We had this problem a long time ago and lived with it, what changed now? thanks, greg k-h
On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote: > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > configuration cycles. It assumes one initiator at a time is > > reading/writing the data registers. > > That sounds like a horrible protocol for a multi-processor system. > Where is it described and who can we go complain to for creating such a > mess? Indeed. Dan, is there a way to stilk kill this protocol off before it leaks into the wild?
On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > configuration cycles. It assumes one initiator at a time is > > reading/writing the data registers. > > That sounds like a horrible protocol for a multi-processor system. > Where is it described and who can we go complain to for creating such a > mess? It appears it was added to the PCIE specification in March of last year, I don't attend those meetings. I only learned about it since the CXL specification adopted it. > > > If userspace reads from the response > > data payload it may steal data that a kernel driver was expecting to > > read. If userspace writes to the request payload it may corrupt the > > request a driver was trying to send. > > Fun! So you want to keep root in userspace from doing this? I thought > we already do that today? The only limitation I found was temporary locking via pci_cfg_access_lock(), and some limitations on max config offset, not permanent access disable. > > > Introduce pci_{request,release}_config_region() for a driver to exclude > > the possibility of userspace induced corruption while accessing the DOE > > mailbox. Likely there are other configuration state assumptions that a > > driver may want to assert are under its exclusive control, so this > > capability is not limited to any specific configuration range. > > As you do not have a user for these functions, it's hard to see how they > would be used. We also really can't add new apis with no in-tree users, > so do you have a patch series that requires this functionality > somewhere? Whoops, I buried the lead here. This is in reaction to / support of Jonathan's efforts to use this mailbox to retrieve a blob of memory characteristics data from CXL devices called the CDAT [1]. That blob is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached memory. So while I was reviewing his patches it occurred to me that the b0rked nature of this interface relative to pci-sysfs needed to be addressed. This should wait to go in with his series. [1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@huawei.com/ > > > Since writes are targeted and are already prepared for failure the > > entire request is failed. The same can not be done for reads as the > > device completely disappears from lspci output if any configuration > > register in the request is exclusive. Instead skip the actual > > configuration cycle on a per-access basis and return all f's as if the > > read had failed. > > returning all ff is a huge hint to many drivers that the device is gone, > not that it just failed. So what happens to code that thinks that and > then tears stuff down as if the device has been removed? This is limited to the pci_user_* family of accessors, kernel drivers should be unaffected. The protection for kernel drivers colliding is the same as request_mem_region() coordination. Userspace drivers will of course be horribly confused, but those should not be binding to devices that are claimed by a kernel driver in the first place. > Trying to protect drivers from userspace here feels odd, what userspace > tools are trying to access these devices while they are under > "exclusive" control from the kernel? lspci not running as root should > not be doing anything crazy, but if you want to run it as root, > shouldn't you be allowed to access it properly? The main concern is unwanted userspace reads. An inopportune "hexdump /sys/bus/pci/devices/$device/config" will end up reading the DOE payload register and advancing the device state machine surprising the kernel iterator that might be reading the payload. If root really wants to read that portion of config space it can also unload the driver similar to the policy for /dev/mem colliding with exclusive device-mmio... although that raises the question how would root know. At least for exclusive /dev/mem /proc/iomem can show who is claiming that resource. If userspace needs to submit DOE requests then that should probably be a proper generic driver to submit requests, not raw pci config access. > What hardware has this problem that we need to claim exclusive ownership > over that differs from the old hardware we used to have that would do > crazy things when reading from from userspace? We had this problem a > long time ago and lived with it, what changed now? All I can infer from the comments in drivers/pci/access.c is "bad things happens on some devices if you allow reads past a certain offset", but no concern for reads for offsets less than pdev->cfg_size. I think what's changed is that this is the first time Linux has had to worry about an awkward polled I/O data transfer protocol over config cycles. To make matters worse there appears to be a proliferation of protocols being layered on top of DOE. In addition to CXL Table Access for CDAT retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and Data Encryption (IDE) [4], and Component Measurement and Authentication (CMA) [5]. I've not read those, but I worry security_locked_down() may want to prevent even root userspace mucking with "security" interfaces. So that *might* be a reason to ensure exclusive kernel access beyond the basic sanity of the kernel being able to have uninterrupted request / response sessions with this mailbox [2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf [3]: https://www.computeexpresslink.org/download-the-specification [4]: https://members.pcisig.com/wg/PCI-SIG/document/15149 [5]: https://members.pcisig.com/wg/PCI-SIG/document/14236
On Thu, Mar 25, 2021 at 1:29 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote: > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > > configuration cycles. It assumes one initiator at a time is > > > reading/writing the data registers. > > > > That sounds like a horrible protocol for a multi-processor system. > > Where is it described and who can we go complain to for creating such a > > mess? > > Indeed. Dan, is there a way to stilk kill this protocol off before it > leaks into the wild? Unfortunately I think that opportunity was more than a year ago, and there's been a proliferation of derivative protocols building on it since.
On Thu, Mar 25, 2021 at 10:55:01AM -0700, Dan Williams wrote: > On Thu, Mar 25, 2021 at 1:29 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote: > > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > > > configuration cycles. It assumes one initiator at a time is > > > > reading/writing the data registers. > > > > > > That sounds like a horrible protocol for a multi-processor system. > > > Where is it described and who can we go complain to for creating such a > > > mess? > > > > Indeed. Dan, is there a way to stilk kill this protocol off before it > > leaks into the wild? > > Unfortunately I think that opportunity was more than a year ago, and > there's been a proliferation of derivative protocols building on it > since. Doesn't mean it can't be changed, right? Are there any actual devices that require this? thanks, greg k-h
On Thu, Mar 25, 2021 at 10:43:41AM -0700, Dan Williams wrote: > On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > > configuration cycles. It assumes one initiator at a time is > > > reading/writing the data registers. > > > > That sounds like a horrible protocol for a multi-processor system. > > Where is it described and who can we go complain to for creating such a > > mess? > > It appears it was added to the PCIE specification in March of last > year, I don't attend those meetings. I only learned about it since the > CXL specification adopted it. Given that the kernel community can not participate in the PCI spec process, we rely on the member companies that do work there to at least create sane things. Who do I need to go complain to about this happening, again? > > > If userspace reads from the response > > > data payload it may steal data that a kernel driver was expecting to > > > read. If userspace writes to the request payload it may corrupt the > > > request a driver was trying to send. > > > > Fun! So you want to keep root in userspace from doing this? I thought > > we already do that today? > > The only limitation I found was temporary locking via > pci_cfg_access_lock(), and some limitations on max config offset, not > permanent access disable. Temp locking should be all that is needed, right? If not then something really is wrong with the protocol design. > > > Introduce pci_{request,release}_config_region() for a driver to exclude > > > the possibility of userspace induced corruption while accessing the DOE > > > mailbox. Likely there are other configuration state assumptions that a > > > driver may want to assert are under its exclusive control, so this > > > capability is not limited to any specific configuration range. > > > > As you do not have a user for these functions, it's hard to see how they > > would be used. We also really can't add new apis with no in-tree users, > > so do you have a patch series that requires this functionality > > somewhere? > > Whoops, I buried the lead here. This is in reaction to / support of > Jonathan's efforts to use this mailbox to retrieve a blob of memory > characteristics data from CXL devices called the CDAT [1]. That blob > is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached > memory. So while I was reviewing his patches it occurred to me that > the b0rked nature of this interface relative to pci-sysfs needed to be > addressed. This should wait to go in with his series. > > [1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@huawei.com/ Interesting. Messy. Ick. > > > Since writes are targeted and are already prepared for failure the > > > entire request is failed. The same can not be done for reads as the > > > device completely disappears from lspci output if any configuration > > > register in the request is exclusive. Instead skip the actual > > > configuration cycle on a per-access basis and return all f's as if the > > > read had failed. > > > > returning all ff is a huge hint to many drivers that the device is gone, > > not that it just failed. So what happens to code that thinks that and > > then tears stuff down as if the device has been removed? > > This is limited to the pci_user_* family of accessors, kernel drivers > should be unaffected. The protection for kernel drivers colliding is > the same as request_mem_region() coordination. So you want to lock out user accesses entirely if a kernel driver is bound to the device? I missed that here. If so, that's a big change that people are not going to like. > Userspace drivers will of course be horribly confused, but those > should not be binding to devices that are claimed by a kernel driver > in the first place. I'm not worried about "userspace drivers", I'm worried about tools like 'lspci' and others that like reading PCI device state from userspace at tiems. > > Trying to protect drivers from userspace here feels odd, what userspace > > tools are trying to access these devices while they are under > > "exclusive" control from the kernel? lspci not running as root should > > not be doing anything crazy, but if you want to run it as root, > > shouldn't you be allowed to access it properly? > > The main concern is unwanted userspace reads. An inopportune "hexdump > /sys/bus/pci/devices/$device/config" will end up reading the DOE > payload register and advancing the device state machine surprising the > kernel iterator that might be reading the payload. > > If root really wants to read that portion of config space it can also > unload the driver similar to the policy for /dev/mem colliding with > exclusive device-mmio... although that raises the question how would > root know. At least for exclusive /dev/mem /proc/iomem can show who is > claiming that resource. Yes, root does not know, so this feels really broken from a spec point-of-view. > If userspace needs to submit DOE requests then that should probably be > a proper generic driver to submit requests, not raw pci config access. True, but again as this is exposed through the "normal" config space, that is not possible. Again, broken design :( > > What hardware has this problem that we need to claim exclusive ownership > > over that differs from the old hardware we used to have that would do > > crazy things when reading from from userspace? We had this problem a > > long time ago and lived with it, what changed now? > > All I can infer from the comments in drivers/pci/access.c is "bad > things happens on some devices if you allow reads past a certain > offset", but no concern for reads for offsets less than > pdev->cfg_size. I think what's changed is that this is the first time > Linux has had to worry about an awkward polled I/O data transfer > protocol over config cycles. I recall many old PCI devices that didn't like their memory being read at all, and could cause bus lockups, which is why the userspace tools would print scary warnings if you really wanted to do this. Feels like we are back to that mess again. > To make matters worse there appears to be a proliferation of protocols > being layered on top of DOE. In addition to CXL Table Access for CDAT > retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and > Data Encryption (IDE) [4], and Component Measurement and > Authentication (CMA) [5]. > > I've not read those, but I worry security_locked_down() may want to > prevent even root userspace mucking with "security" interfaces. So > that *might* be a reason to ensure exclusive kernel access beyond the > basic sanity of the kernel being able to have uninterrupted request / > response sessions with this mailbox > > [2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf > [3]: https://www.computeexpresslink.org/download-the-specification > [4]: https://members.pcisig.com/wg/PCI-SIG/document/15149 > [5]: https://members.pcisig.com/wg/PCI-SIG/document/14236 I have no access to these proposed specs, sorry. If things are being layered on top of a broken design, well, they get what they asked for. How are the "other" operating systems going to handle this mess? How did they "sign off" on this crazy thing? thanks, greg k-h
[+cc Christoph] On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > configuration cycles. It assumes one initiator at a time is > reading/writing the data registers. If userspace reads from the response > data payload it may steal data that a kernel driver was expecting to > read. If userspace writes to the request payload it may corrupt the > request a driver was trying to send. IIUC the problem we're talking about is that userspace config access, e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE. I attached what I think are the relevant bits from the spec. It looks to me like config *reads* should not be a problem: A read of Write Data Mailbox always returns 0 and looks innocuous. A userspace read of Read Data Mailbox may return a DW of the data object, but it doesn't advance the cursor, so it shouldn't interfere with a kernel read. A write to Write Data Mailbox could obviously corrupt an object being written to the device. A config write to Read Data Mailbox *does* advance the cursor, so that would definitely interfere with a kernel user. So I think we're really talking about an issue with "setpci" and I don't expect "lspci" to be a problem. "setpci" is a valuable tool, and the fact that it can hose your system is not really news. I don't know how hard we should work to protect against that. Bjorn From PCIe r6.0 v0.7 draft (sec 7.9.24): DOE Control Register DOE Go – A write of 1b to this bit indicates to the DOE instance that it can start consuming the data object transferred through the DOE Write Data Mailbox Register. Behavior is undefined if the DOE Go bit is Set before the entire data object has been written to the DOE Write Data Mailbox Register. Behavior is undefined if the DOE Go bit is written with 1b when the DOE Busy bit is Set. Reads from this bit must always return 0b. DOE Write Data Mailbox Register DOE Write Data Mailbox – The DOE instance receives data objects via writes to this register. A successfully completed write to this register adds one DW to the incoming data object. Setting the DOE Go bit in the DOE Control Register indicates to the DOE Instance that the final DW of the data object has been written to this register. Reads of this register must return all 0’s. DOE Read Data Mailbox Register DOE Read Data Mailbox – If the Data Object Ready bit is Set, a read of this register returns the current DW of the data object. A write of any value to this register indicates a successful transfer of the current data object DW, and the DOE instance must return the next DW in the data object upon the next read of this register as long as the Data Object Ready bit remains Set. It is permitted for multiple data objects to be read from this register back-to-back. When this scenario occurs the Data Object Ready bit will remain Set until the final DW is read. A write of any value to this register when the Data Object Ready bit is Clear must have no effect. The value read from this register when Data Object Ready is Clear must be 0000 0000h. > Introduce pci_{request,release}_config_region() for a driver to exclude > the possibility of userspace induced corruption while accessing the DOE > mailbox. Likely there are other configuration state assumptions that a > driver may want to assert are under its exclusive control, so this > capability is not limited to any specific configuration range. > > Since writes are targeted and are already prepared for failure the > entire request is failed. The same can not be done for reads as the > device completely disappears from lspci output if any configuration > register in the request is exclusive. Instead skip the actual > configuration cycle on a per-access basis and return all f's as if the > read had failed. > > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/pci/access.c | 5 +++-- > drivers/pci/pci-sysfs.c | 3 +++ > drivers/pci/probe.c | 5 +++++ > include/linux/ioport.h | 2 ++ > include/linux/pci.h | 16 ++++++++++++++++ > kernel/resource.c | 24 +++++++++++------------- > 6 files changed, 40 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/access.c b/drivers/pci/access.c > index 46935695cfb9..a6b3cdfbd505 100644 > --- a/drivers/pci/access.c > +++ b/drivers/pci/access.c > @@ -225,8 +225,9 @@ int pci_user_read_config_##size \ > raw_spin_lock_irq(&pci_lock); \ > if (unlikely(dev->block_cfg_access)) \ > pci_wait_cfg(dev); \ > - ret = dev->bus->ops->read(dev->bus, dev->devfn, \ > - pos, sizeof(type), &data); \ > + if (!resource_is_exclusive(&dev->config_resource, pos, sizeof(type))) \ > + ret = dev->bus->ops->read(dev->bus, dev->devfn, \ > + pos, sizeof(type), &data); \ > raw_spin_unlock_irq(&pci_lock); \ > *val = (type)data; \ > return pcibios_err_to_errno(ret); \ > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > index f8afd54ca3e1..4ce2831ffadc 100644 > --- a/drivers/pci/pci-sysfs.c > +++ b/drivers/pci/pci-sysfs.c > @@ -753,6 +753,9 @@ static ssize_t pci_write_config(struct file *filp, struct kobject *kobj, > u8 *data = (u8 *) buf; > int ret; > > + if (resource_is_exclusive(&dev->config_resource, off, count)) > + return -EBUSY; > + > ret = security_locked_down(LOCKDOWN_PCI_ACCESS); > if (ret) > return ret; > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index 953f15abc850..8fa3a6f38b53 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -2240,6 +2240,11 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus) > INIT_LIST_HEAD(&dev->bus_list); > dev->dev.type = &pci_dev_type; > dev->bus = pci_bus_get(bus); > + dev->config_resource = (struct resource) { > + .name = "PCI Config", > + .start = 0, > + .end = -1, > + }; > > return dev; > } > diff --git a/include/linux/ioport.h b/include/linux/ioport.h > index 55de385c839c..e1487a892fe7 100644 > --- a/include/linux/ioport.h > +++ b/include/linux/ioport.h > @@ -308,6 +308,8 @@ extern void __devm_release_region(struct device *dev, struct resource *parent, > resource_size_t start, resource_size_t n); > extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size); > extern bool iomem_is_exclusive(u64 addr); > +extern bool resource_is_exclusive(struct resource *resource, u64 addr, > + resource_size_t size); > > extern int > walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 86c799c97b77..d3d78e0df2e7 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -401,6 +401,7 @@ struct pci_dev { > */ > unsigned int irq; > struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */ > + struct resource config_resource; /* driver exclusive config register ranges */ > > bool match_driver; /* Skip attaching driver */ > > @@ -1330,6 +1331,21 @@ int pci_request_selected_regions(struct pci_dev *, int, const char *); > int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *); > void pci_release_selected_regions(struct pci_dev *, int); > > +static inline __must_check struct resource * > +pci_request_config_region(struct pci_dev *pdev, unsigned int where, > + unsigned int len, const char *name) > +{ > + return __request_region(&pdev->config_resource, where, len, name, > + IORESOURCE_EXCLUSIVE); > +} > + > +static inline void pci_release_config_region(struct pci_dev *pdev, > + unsigned int where, > + unsigned int len) > +{ > + __release_region(&pdev->config_resource, where, len); > +} > + > /* drivers/pci/bus.c */ > void pci_add_resource(struct list_head *resources, struct resource *res); > void pci_add_resource_offset(struct list_head *resources, struct resource *res, > diff --git a/kernel/resource.c b/kernel/resource.c > index 627e61b0c124..ab1aed06e8b0 100644 > --- a/kernel/resource.c > +++ b/kernel/resource.c > @@ -1706,27 +1706,16 @@ static int strict_iomem_checks; > #endif > > /* > - * check if an address is reserved in the iomem resource tree > + * check if an address is reserved in the @resource tree > * returns true if reserved, false if not reserved. > */ > -bool iomem_is_exclusive(u64 addr) > +bool resource_is_exclusive(struct resource *p, u64 addr, resource_size_t size) > { > - struct resource *p = &iomem_resource; > bool err = false; > loff_t l; > - int size = PAGE_SIZE; > - > - if (!strict_iomem_checks) > - return false; > - > - addr = addr & PAGE_MASK; > > read_lock(&resource_lock); > for (p = p->child; p ; p = r_next(NULL, p, &l)) { > - /* > - * We can probably skip the resources without > - * IORESOURCE_IO attribute? > - */ > if (p->start >= addr + size) > break; > if (p->end < addr) > @@ -1749,6 +1738,15 @@ bool iomem_is_exclusive(u64 addr) > return err; > } > > +bool iomem_is_exclusive(u64 addr) > +{ > + if (!strict_iomem_checks) > + return false; > + > + return resource_is_exclusive(&iomem_resource, addr & PAGE_MASK, > + PAGE_SIZE); > +} > + > struct resource_entry *resource_list_create_entry(struct resource *res, > size_t extra_size) > { >
On Fri, Mar 26, 2021 at 11:12:47AM -0500, Bjorn Helgaas wrote: > [+cc Christoph] > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > configuration cycles. It assumes one initiator at a time is > > reading/writing the data registers. If userspace reads from the response > > data payload it may steal data that a kernel driver was expecting to > > read. If userspace writes to the request payload it may corrupt the > > request a driver was trying to send. > > IIUC the problem we're talking about is that userspace config access, > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE. > I attached what I think are the relevant bits from the spec. > > It looks to me like config *reads* should not be a problem: A read of > Write Data Mailbox always returns 0 and looks innocuous. A userspace > read of Read Data Mailbox may return a DW of the data object, but it > doesn't advance the cursor, so it shouldn't interfere with a kernel > read. > > A write to Write Data Mailbox could obviously corrupt an object being > written to the device. A config write to Read Data Mailbox *does* > advance the cursor, so that would definitely interfere with a kernel > user. > > So I think we're really talking about an issue with "setpci" and I > don't expect "lspci" to be a problem. "setpci" is a valuable tool, > and the fact that it can hose your system is not really news. I don't > know how hard we should work to protect against that. Thanks for looking this up and letting us know. So this should be fine, reads are ok, it's not as crazy of a protocol design as Dan alluded to, so the kernel should be ok. No need to add additional "protection" here at all, if you run setpci from userspace, you get what you asked for :) thanks, greg k-h
On Fri, Mar 26, 2021 at 9:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+cc Christoph] > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > configuration cycles. It assumes one initiator at a time is > > reading/writing the data registers. If userspace reads from the response > > data payload it may steal data that a kernel driver was expecting to > > read. If userspace writes to the request payload it may corrupt the > > request a driver was trying to send. > > IIUC the problem we're talking about is that userspace config access, > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE. > I attached what I think are the relevant bits from the spec. > > It looks to me like config *reads* should not be a problem: A read of > Write Data Mailbox always returns 0 and looks innocuous. A userspace > read of Read Data Mailbox may return a DW of the data object, but it > doesn't advance the cursor, so it shouldn't interfere with a kernel > read. > > A write to Write Data Mailbox could obviously corrupt an object being > written to the device. A config write to Read Data Mailbox *does* > advance the cursor, so that would definitely interfere with a kernel > user. > > So I think we're really talking about an issue with "setpci" and I > don't expect "lspci" to be a problem. "setpci" is a valuable tool, > and the fact that it can hose your system is not really news. I don't > know how hard we should work to protect against that. True, the threat is smaller than I was reading, I apologize for that noise. Temporary blocking over kernel DOE cycles seems sufficient for now.
On Mon, 29 Mar 2021 09:46:15 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > On Fri, Mar 26, 2021 at 9:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > [+cc Christoph] > > > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > > configuration cycles. It assumes one initiator at a time is > > > reading/writing the data registers. If userspace reads from the response > > > data payload it may steal data that a kernel driver was expecting to > > > read. If userspace writes to the request payload it may corrupt the > > > request a driver was trying to send. > > > > IIUC the problem we're talking about is that userspace config access, > > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE. > > I attached what I think are the relevant bits from the spec. > > > > It looks to me like config *reads* should not be a problem: A read of > > Write Data Mailbox always returns 0 and looks innocuous. A userspace > > read of Read Data Mailbox may return a DW of the data object, but it > > doesn't advance the cursor, so it shouldn't interfere with a kernel > > read. > > > > A write to Write Data Mailbox could obviously corrupt an object being > > written to the device. A config write to Read Data Mailbox *does* > > advance the cursor, so that would definitely interfere with a kernel > > user. > > > > So I think we're really talking about an issue with "setpci" and I > > don't expect "lspci" to be a problem. "setpci" is a valuable tool, > > and the fact that it can hose your system is not really news. I don't > > know how hard we should work to protect against that. > > True, the threat is smaller than I was reading, I apologize for that > noise. Temporary blocking over kernel DOE cycles seems sufficient for > now. Was on vacation. Glad I read the whole thread before replying. The key point as Bjorn identified is that reads don't hurt and as noted setpci can break many things anyway so I think we are safe on this. There are more 'exciting' questions to be addressed about firmware vs OS ownership of particular DOE mailboxes but that's a whole separate bit of fun for the future. Jonathan
On Sat, Mar 27, 2021 at 3:47 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Fri, Mar 26, 2021 at 11:12:47AM -0500, Bjorn Helgaas wrote: > > [+cc Christoph] > > > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > > configuration cycles. It assumes one initiator at a time is > > > reading/writing the data registers. If userspace reads from the response > > > data payload it may steal data that a kernel driver was expecting to > > > read. If userspace writes to the request payload it may corrupt the > > > request a driver was trying to send. > > > > IIUC the problem we're talking about is that userspace config access, > > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE. > > I attached what I think are the relevant bits from the spec. > > > > It looks to me like config *reads* should not be a problem: A read of > > Write Data Mailbox always returns 0 and looks innocuous. A userspace > > read of Read Data Mailbox may return a DW of the data object, but it > > doesn't advance the cursor, so it shouldn't interfere with a kernel > > read. > > > > A write to Write Data Mailbox could obviously corrupt an object being > > written to the device. A config write to Read Data Mailbox *does* > > advance the cursor, so that would definitely interfere with a kernel > > user. > > > > So I think we're really talking about an issue with "setpci" and I > > don't expect "lspci" to be a problem. "setpci" is a valuable tool, > > and the fact that it can hose your system is not really news. I don't > > know how hard we should work to protect against that. > > Thanks for looking this up and letting us know. > > So this should be fine, reads are ok, it's not as crazy of a protocol > design as Dan alluded to, so the kernel should be ok. No need to add > additional "protection" here at all, if you run setpci from userspace, > you get what you asked for :) > Circling back to this after thinking of the implications and looking at the review of the DOE code, this situation is different than your typical "userspace gets to keep the pieces if it does a configuration write". If we assume well behaved non-malicious userpace, it has no reason to muck with critical config registers. If userspace changes a BAR value and the system fails, yup, that's its own fault. The DOE mailbox is different. There are legitimate reasons why non-broken userspace would want to read some DOE payloads while the kernel is retrieving its payloads. It also simplifies the kernel implementation if it does need to worry about other agents interrupting its transfers. My mistake was making this restriction apply to reads, but I'm not on the same page that blocking writes is fruitless just because userspace can do other damage with config writes. So either the kernel DOE driver needs to use pci_cfg_access_lock() and revalidate the state of the DOE after acquiring that lock, or it needs to claim the interface completely and provide a driver for userspace to submit requests that can be scheduled in the kernel's DOE queue.
diff --git a/drivers/pci/access.c b/drivers/pci/access.c index 46935695cfb9..a6b3cdfbd505 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -225,8 +225,9 @@ int pci_user_read_config_##size \ raw_spin_lock_irq(&pci_lock); \ if (unlikely(dev->block_cfg_access)) \ pci_wait_cfg(dev); \ - ret = dev->bus->ops->read(dev->bus, dev->devfn, \ - pos, sizeof(type), &data); \ + if (!resource_is_exclusive(&dev->config_resource, pos, sizeof(type))) \ + ret = dev->bus->ops->read(dev->bus, dev->devfn, \ + pos, sizeof(type), &data); \ raw_spin_unlock_irq(&pci_lock); \ *val = (type)data; \ return pcibios_err_to_errno(ret); \ diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index f8afd54ca3e1..4ce2831ffadc 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -753,6 +753,9 @@ static ssize_t pci_write_config(struct file *filp, struct kobject *kobj, u8 *data = (u8 *) buf; int ret; + if (resource_is_exclusive(&dev->config_resource, off, count)) + return -EBUSY; + ret = security_locked_down(LOCKDOWN_PCI_ACCESS); if (ret) return ret; diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 953f15abc850..8fa3a6f38b53 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -2240,6 +2240,11 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus) INIT_LIST_HEAD(&dev->bus_list); dev->dev.type = &pci_dev_type; dev->bus = pci_bus_get(bus); + dev->config_resource = (struct resource) { + .name = "PCI Config", + .start = 0, + .end = -1, + }; return dev; } diff --git a/include/linux/ioport.h b/include/linux/ioport.h index 55de385c839c..e1487a892fe7 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -308,6 +308,8 @@ extern void __devm_release_region(struct device *dev, struct resource *parent, resource_size_t start, resource_size_t n); extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size); extern bool iomem_is_exclusive(u64 addr); +extern bool resource_is_exclusive(struct resource *resource, u64 addr, + resource_size_t size); extern int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, diff --git a/include/linux/pci.h b/include/linux/pci.h index 86c799c97b77..d3d78e0df2e7 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -401,6 +401,7 @@ struct pci_dev { */ unsigned int irq; struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */ + struct resource config_resource; /* driver exclusive config register ranges */ bool match_driver; /* Skip attaching driver */ @@ -1330,6 +1331,21 @@ int pci_request_selected_regions(struct pci_dev *, int, const char *); int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *); void pci_release_selected_regions(struct pci_dev *, int); +static inline __must_check struct resource * +pci_request_config_region(struct pci_dev *pdev, unsigned int where, + unsigned int len, const char *name) +{ + return __request_region(&pdev->config_resource, where, len, name, + IORESOURCE_EXCLUSIVE); +} + +static inline void pci_release_config_region(struct pci_dev *pdev, + unsigned int where, + unsigned int len) +{ + __release_region(&pdev->config_resource, where, len); +} + /* drivers/pci/bus.c */ void pci_add_resource(struct list_head *resources, struct resource *res); void pci_add_resource_offset(struct list_head *resources, struct resource *res, diff --git a/kernel/resource.c b/kernel/resource.c index 627e61b0c124..ab1aed06e8b0 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1706,27 +1706,16 @@ static int strict_iomem_checks; #endif /* - * check if an address is reserved in the iomem resource tree + * check if an address is reserved in the @resource tree * returns true if reserved, false if not reserved. */ -bool iomem_is_exclusive(u64 addr) +bool resource_is_exclusive(struct resource *p, u64 addr, resource_size_t size) { - struct resource *p = &iomem_resource; bool err = false; loff_t l; - int size = PAGE_SIZE; - - if (!strict_iomem_checks) - return false; - - addr = addr & PAGE_MASK; read_lock(&resource_lock); for (p = p->child; p ; p = r_next(NULL, p, &l)) { - /* - * We can probably skip the resources without - * IORESOURCE_IO attribute? - */ if (p->start >= addr + size) break; if (p->end < addr) @@ -1749,6 +1738,15 @@ bool iomem_is_exclusive(u64 addr) return err; } +bool iomem_is_exclusive(u64 addr) +{ + if (!strict_iomem_checks) + return false; + + return resource_is_exclusive(&iomem_resource, addr & PAGE_MASK, + PAGE_SIZE); +} + struct resource_entry *resource_list_create_entry(struct resource *res, size_t extra_size) {
The PCIE Data Object Exchange (DOE) mailbox is a protocol run over configuration cycles. It assumes one initiator at a time is reading/writing the data registers. If userspace reads from the response data payload it may steal data that a kernel driver was expecting to read. If userspace writes to the request payload it may corrupt the request a driver was trying to send. Introduce pci_{request,release}_config_region() for a driver to exclude the possibility of userspace induced corruption while accessing the DOE mailbox. Likely there are other configuration state assumptions that a driver may want to assert are under its exclusive control, so this capability is not limited to any specific configuration range. Since writes are targeted and are already prepared for failure the entire request is failed. The same can not be done for reads as the device completely disappears from lspci output if any configuration register in the request is exclusive. Instead skip the actual configuration cycle on a per-access basis and return all f's as if the read had failed. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/pci/access.c | 5 +++-- drivers/pci/pci-sysfs.c | 3 +++ drivers/pci/probe.c | 5 +++++ include/linux/ioport.h | 2 ++ include/linux/pci.h | 16 ++++++++++++++++ kernel/resource.c | 24 +++++++++++------------- 6 files changed, 40 insertions(+), 15 deletions(-)