diff mbox series

PCI: Allow drivers to claim exclusive access to config regions

Message ID 161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Allow drivers to claim exclusive access to config regions | expand

Commit Message

Dan Williams March 25, 2021, 1:23 a.m. UTC
The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
configuration cycles. It assumes one initiator at a time is
reading/writing the data registers. If userspace reads from the response
data payload it may steal data that a kernel driver was expecting to
read. If userspace writes to the request payload it may corrupt the
request a driver was trying to send.

Introduce pci_{request,release}_config_region() for a driver to exclude
the possibility of userspace induced corruption while accessing the DOE
mailbox. Likely there are other configuration state assumptions that a
driver may want to assert are under its exclusive control, so this
capability is not limited to any specific configuration range.

Since writes are targeted and are already prepared for failure the
entire request is failed. The same can not be done for reads as the
device completely disappears from lspci output if any configuration
register in the request is exclusive. Instead skip the actual
configuration cycle on a per-access basis and return all f's as if the
read had failed.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/pci/access.c    |    5 +++--
 drivers/pci/pci-sysfs.c |    3 +++
 drivers/pci/probe.c     |    5 +++++
 include/linux/ioport.h  |    2 ++
 include/linux/pci.h     |   16 ++++++++++++++++
 kernel/resource.c       |   24 +++++++++++-------------
 6 files changed, 40 insertions(+), 15 deletions(-)

Comments

Greg KH March 25, 2021, 6:54 a.m. UTC | #1
On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> configuration cycles. It assumes one initiator at a time is
> reading/writing the data registers.

That sounds like a horrible protocol for a multi-processor system.
Where is it described and who can we go complain to for creating such a
mess?

> If userspace reads from the response
> data payload it may steal data that a kernel driver was expecting to
> read. If userspace writes to the request payload it may corrupt the
> request a driver was trying to send.

Fun!  So you want to keep root in userspace from doing this?  I thought
we already do that today?

> Introduce pci_{request,release}_config_region() for a driver to exclude
> the possibility of userspace induced corruption while accessing the DOE
> mailbox. Likely there are other configuration state assumptions that a
> driver may want to assert are under its exclusive control, so this
> capability is not limited to any specific configuration range.

As you do not have a user for these functions, it's hard to see how they
would be used.  We also really can't add new apis with no in-tree users,
so do you have a patch series that requires this functionality
somewhere?

> Since writes are targeted and are already prepared for failure the
> entire request is failed. The same can not be done for reads as the
> device completely disappears from lspci output if any configuration
> register in the request is exclusive. Instead skip the actual
> configuration cycle on a per-access basis and return all f's as if the
> read had failed.

returning all ff is a huge hint to many drivers that the device is gone,
not that it just failed.  So what happens to code that thinks that and
then tears stuff down as if the device has been removed?

Trying to protect drivers from userspace here feels odd, what userspace
tools are trying to access these devices while they are under
"exclusive" control from the kernel?  lspci not running as root should
not be doing anything crazy, but if you want to run it as root,
shouldn't you be allowed to access it properly?

What hardware has this problem that we need to claim exclusive ownership
over that differs from the old hardware we used to have that would do
crazy things when reading from from userspace?  We had this problem a
long time ago and lived with it, what changed now?

thanks,

greg k-h
Christoph Hellwig March 25, 2021, 8:29 a.m. UTC | #2
On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > configuration cycles. It assumes one initiator at a time is
> > reading/writing the data registers.
> 
> That sounds like a horrible protocol for a multi-processor system.
> Where is it described and who can we go complain to for creating such a
> mess?

Indeed.  Dan, is there a way to stilk kill this protocol off before it
leaks into the wild?
Dan Williams March 25, 2021, 5:43 p.m. UTC | #3
On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > configuration cycles. It assumes one initiator at a time is
> > reading/writing the data registers.
>
> That sounds like a horrible protocol for a multi-processor system.
> Where is it described and who can we go complain to for creating such a
> mess?

It appears it was added to the PCIE specification in March of last
year, I don't attend those meetings. I only learned about it since the
CXL specification adopted it.

>
> > If userspace reads from the response
> > data payload it may steal data that a kernel driver was expecting to
> > read. If userspace writes to the request payload it may corrupt the
> > request a driver was trying to send.
>
> Fun!  So you want to keep root in userspace from doing this?  I thought
> we already do that today?

The only limitation I found was temporary locking via
pci_cfg_access_lock(), and some limitations on max config offset, not
permanent access disable.

>
> > Introduce pci_{request,release}_config_region() for a driver to exclude
> > the possibility of userspace induced corruption while accessing the DOE
> > mailbox. Likely there are other configuration state assumptions that a
> > driver may want to assert are under its exclusive control, so this
> > capability is not limited to any specific configuration range.
>
> As you do not have a user for these functions, it's hard to see how they
> would be used.  We also really can't add new apis with no in-tree users,
> so do you have a patch series that requires this functionality
> somewhere?

Whoops, I buried the lead here. This is in reaction to / support of
Jonathan's efforts to use this mailbox to retrieve a blob of memory
characteristics data from CXL devices called the CDAT [1]. That blob
is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached
memory. So while I was reviewing his patches it occurred to me that
the b0rked nature of this interface relative to pci-sysfs needed to be
addressed. This should wait to go in with his series.

[1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@huawei.com/

>
> > Since writes are targeted and are already prepared for failure the
> > entire request is failed. The same can not be done for reads as the
> > device completely disappears from lspci output if any configuration
> > register in the request is exclusive. Instead skip the actual
> > configuration cycle on a per-access basis and return all f's as if the
> > read had failed.
>
> returning all ff is a huge hint to many drivers that the device is gone,
> not that it just failed.  So what happens to code that thinks that and
> then tears stuff down as if the device has been removed?

This is limited to the pci_user_* family of accessors, kernel drivers
should be unaffected. The protection for kernel drivers colliding is
the same as request_mem_region() coordination.

Userspace drivers will of course be horribly confused, but those
should not be binding to devices that are claimed by a kernel driver
in the first place.

> Trying to protect drivers from userspace here feels odd, what userspace
> tools are trying to access these devices while they are under
> "exclusive" control from the kernel?  lspci not running as root should
> not be doing anything crazy, but if you want to run it as root,
> shouldn't you be allowed to access it properly?

The main concern is unwanted userspace reads. An inopportune "hexdump
/sys/bus/pci/devices/$device/config" will end up reading the DOE
payload register and advancing the device state machine surprising the
kernel iterator that might be reading the payload.

If root really wants to read that portion of config space it can also
unload the driver similar to the policy for /dev/mem colliding with
exclusive device-mmio... although that raises the question how would
root know. At least for exclusive /dev/mem /proc/iomem can show who is
claiming that resource.

If userspace needs to submit DOE requests then that should probably be
a proper generic driver to submit requests, not raw pci config access.

> What hardware has this problem that we need to claim exclusive ownership
> over that differs from the old hardware we used to have that would do
> crazy things when reading from from userspace?  We had this problem a
> long time ago and lived with it, what changed now?

All I can infer from the comments in drivers/pci/access.c is "bad
things happens on some devices if you allow reads past a certain
offset", but no concern for reads for offsets less than
pdev->cfg_size. I think what's changed is that this is the first time
Linux has had to worry about an awkward polled I/O data transfer
protocol over config cycles.

To make matters worse there appears to be a proliferation of protocols
being layered on top of DOE. In addition to CXL Table Access for CDAT
retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and
Data Encryption (IDE) [4], and Component Measurement and
Authentication (CMA) [5].

I've not read those, but I worry security_locked_down() may want to
prevent even root userspace mucking with "security" interfaces. So
that *might* be a reason to ensure exclusive kernel access beyond the
basic sanity of the kernel being able to have uninterrupted request /
response sessions with this mailbox

[2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
[3]: https://www.computeexpresslink.org/download-the-specification
[4]: https://members.pcisig.com/wg/PCI-SIG/document/15149
[5]: https://members.pcisig.com/wg/PCI-SIG/document/14236
Dan Williams March 25, 2021, 5:55 p.m. UTC | #4
On Thu, Mar 25, 2021 at 1:29 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote:
> > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > > configuration cycles. It assumes one initiator at a time is
> > > reading/writing the data registers.
> >
> > That sounds like a horrible protocol for a multi-processor system.
> > Where is it described and who can we go complain to for creating such a
> > mess?
>
> Indeed.  Dan, is there a way to stilk kill this protocol off before it
> leaks into the wild?

Unfortunately I think that opportunity was more than a year ago, and
there's been a proliferation of derivative protocols building on it
since.
Greg KH March 26, 2021, 9:18 a.m. UTC | #5
On Thu, Mar 25, 2021 at 10:55:01AM -0700, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 1:29 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Mar 25, 2021 at 07:54:59AM +0100, Greg Kroah-Hartman wrote:
> > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > > > configuration cycles. It assumes one initiator at a time is
> > > > reading/writing the data registers.
> > >
> > > That sounds like a horrible protocol for a multi-processor system.
> > > Where is it described and who can we go complain to for creating such a
> > > mess?
> >
> > Indeed.  Dan, is there a way to stilk kill this protocol off before it
> > leaks into the wild?
> 
> Unfortunately I think that opportunity was more than a year ago, and
> there's been a proliferation of derivative protocols building on it
> since.

Doesn't mean it can't be changed, right?

Are there any actual devices that require this?

thanks,

greg k-h
Greg KH March 26, 2021, 9:28 a.m. UTC | #6
On Thu, Mar 25, 2021 at 10:43:41AM -0700, Dan Williams wrote:
> On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > > configuration cycles. It assumes one initiator at a time is
> > > reading/writing the data registers.
> >
> > That sounds like a horrible protocol for a multi-processor system.
> > Where is it described and who can we go complain to for creating such a
> > mess?
> 
> It appears it was added to the PCIE specification in March of last
> year, I don't attend those meetings. I only learned about it since the
> CXL specification adopted it.

Given that the kernel community can not participate in the PCI spec
process, we rely on the member companies that do work there to at least
create sane things.  Who do I need to go complain to about this
happening, again?

> > > If userspace reads from the response
> > > data payload it may steal data that a kernel driver was expecting to
> > > read. If userspace writes to the request payload it may corrupt the
> > > request a driver was trying to send.
> >
> > Fun!  So you want to keep root in userspace from doing this?  I thought
> > we already do that today?
> 
> The only limitation I found was temporary locking via
> pci_cfg_access_lock(), and some limitations on max config offset, not
> permanent access disable.

Temp locking should be all that is needed, right?  If not then something
really is wrong with the protocol design.

> > > Introduce pci_{request,release}_config_region() for a driver to exclude
> > > the possibility of userspace induced corruption while accessing the DOE
> > > mailbox. Likely there are other configuration state assumptions that a
> > > driver may want to assert are under its exclusive control, so this
> > > capability is not limited to any specific configuration range.
> >
> > As you do not have a user for these functions, it's hard to see how they
> > would be used.  We also really can't add new apis with no in-tree users,
> > so do you have a patch series that requires this functionality
> > somewhere?
> 
> Whoops, I buried the lead here. This is in reaction to / support of
> Jonathan's efforts to use this mailbox to retrieve a blob of memory
> characteristics data from CXL devices called the CDAT [1]. That blob
> is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached
> memory. So while I was reviewing his patches it occurred to me that
> the b0rked nature of this interface relative to pci-sysfs needed to be
> addressed. This should wait to go in with his series.
> 
> [1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@huawei.com/

Interesting.  Messy.  Ick.

> > > Since writes are targeted and are already prepared for failure the
> > > entire request is failed. The same can not be done for reads as the
> > > device completely disappears from lspci output if any configuration
> > > register in the request is exclusive. Instead skip the actual
> > > configuration cycle on a per-access basis and return all f's as if the
> > > read had failed.
> >
> > returning all ff is a huge hint to many drivers that the device is gone,
> > not that it just failed.  So what happens to code that thinks that and
> > then tears stuff down as if the device has been removed?
> 
> This is limited to the pci_user_* family of accessors, kernel drivers
> should be unaffected. The protection for kernel drivers colliding is
> the same as request_mem_region() coordination.

So you want to lock out user accesses entirely if a kernel driver is
bound to the device?  I missed that here.  If so, that's a big change
that people are not going to like.

> Userspace drivers will of course be horribly confused, but those
> should not be binding to devices that are claimed by a kernel driver
> in the first place.

I'm not worried about "userspace drivers", I'm worried about tools like
'lspci' and others that like reading PCI device state from userspace at
tiems.

> > Trying to protect drivers from userspace here feels odd, what userspace
> > tools are trying to access these devices while they are under
> > "exclusive" control from the kernel?  lspci not running as root should
> > not be doing anything crazy, but if you want to run it as root,
> > shouldn't you be allowed to access it properly?
> 
> The main concern is unwanted userspace reads. An inopportune "hexdump
> /sys/bus/pci/devices/$device/config" will end up reading the DOE
> payload register and advancing the device state machine surprising the
> kernel iterator that might be reading the payload.
> 
> If root really wants to read that portion of config space it can also
> unload the driver similar to the policy for /dev/mem colliding with
> exclusive device-mmio... although that raises the question how would
> root know. At least for exclusive /dev/mem /proc/iomem can show who is
> claiming that resource.

Yes, root does not know, so this feels really broken from a spec
point-of-view.

> If userspace needs to submit DOE requests then that should probably be
> a proper generic driver to submit requests, not raw pci config access.

True, but again as this is exposed through the "normal" config space,
that is not possible.  Again, broken design :(

> > What hardware has this problem that we need to claim exclusive ownership
> > over that differs from the old hardware we used to have that would do
> > crazy things when reading from from userspace?  We had this problem a
> > long time ago and lived with it, what changed now?
> 
> All I can infer from the comments in drivers/pci/access.c is "bad
> things happens on some devices if you allow reads past a certain
> offset", but no concern for reads for offsets less than
> pdev->cfg_size. I think what's changed is that this is the first time
> Linux has had to worry about an awkward polled I/O data transfer
> protocol over config cycles.

I recall many old PCI devices that didn't like their memory being read
at all, and could cause bus lockups, which is why the userspace tools
would print scary warnings if you really wanted to do this.

Feels like we are back to that mess again.

> To make matters worse there appears to be a proliferation of protocols
> being layered on top of DOE. In addition to CXL Table Access for CDAT
> retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and
> Data Encryption (IDE) [4], and Component Measurement and
> Authentication (CMA) [5].
> 
> I've not read those, but I worry security_locked_down() may want to
> prevent even root userspace mucking with "security" interfaces. So
> that *might* be a reason to ensure exclusive kernel access beyond the
> basic sanity of the kernel being able to have uninterrupted request /
> response sessions with this mailbox
> 
> [2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> [3]: https://www.computeexpresslink.org/download-the-specification
> [4]: https://members.pcisig.com/wg/PCI-SIG/document/15149
> [5]: https://members.pcisig.com/wg/PCI-SIG/document/14236

I have no access to these proposed specs, sorry.

If things are being layered on top of a broken design, well, they get
what they asked for.

How are the "other" operating systems going to handle this mess?  How
did they "sign off" on this crazy thing?

thanks,

greg k-h
Bjorn Helgaas March 26, 2021, 4:12 p.m. UTC | #7
[+cc Christoph]

On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> configuration cycles. It assumes one initiator at a time is
> reading/writing the data registers. If userspace reads from the response
> data payload it may steal data that a kernel driver was expecting to
> read. If userspace writes to the request payload it may corrupt the
> request a driver was trying to send.

IIUC the problem we're talking about is that userspace config access,
e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE.
I attached what I think are the relevant bits from the spec.

It looks to me like config *reads* should not be a problem: A read of
Write Data Mailbox always returns 0 and looks innocuous.  A userspace
read of Read Data Mailbox may return a DW of the data object, but it
doesn't advance the cursor, so it shouldn't interfere with a kernel
read.  

A write to Write Data Mailbox could obviously corrupt an object being
written to the device.  A config write to Read Data Mailbox *does*
advance the cursor, so that would definitely interfere with a kernel
user.  

So I think we're really talking about an issue with "setpci" and I
don't expect "lspci" to be a problem.  "setpci" is a valuable tool,
and the fact that it can hose your system is not really news.  I don't
know how hard we should work to protect against that.

Bjorn


From PCIe r6.0 v0.7 draft (sec 7.9.24):

  DOE Control Register

    DOE Go – A write of 1b to this bit indicates to the DOE instance
    that it can start consuming the data object transferred through the
    DOE Write Data Mailbox Register.

    Behavior is undefined if the DOE Go bit is Set before the entire
    data object has been written to the DOE Write Data Mailbox Register.

    Behavior is undefined if the DOE Go bit is written with 1b when the
    DOE Busy bit is Set.

    Reads from this bit must always return 0b.

  DOE Write Data Mailbox Register

    DOE Write Data Mailbox – The DOE instance receives data objects
    via writes to this register.

    A successfully completed write to this register adds one DW to the
    incoming data object.

    Setting the DOE Go bit in the DOE Control Register indicates to
    the DOE Instance that the final DW of the data object has been
    written to this register.

    Reads of this register must return all 0’s.

  DOE Read Data Mailbox Register

    DOE Read Data Mailbox – If the Data Object Ready bit is Set, a
    read of this register returns the current DW of the data object.

    A write of any value to this register indicates a successful
    transfer of the current data object DW, and the DOE instance must
    return the next DW in the data object upon the next read of this
    register as long as the Data Object Ready bit remains Set.

    It is permitted for multiple data objects to be read from this
    register back-to-back. When this scenario occurs the Data Object
    Ready bit will remain Set until the final DW is read.

    A write of any value to this register when the Data Object Ready
    bit is Clear must have no effect.

    The value read from this register when Data Object Ready is Clear
    must be 0000 0000h.

> Introduce pci_{request,release}_config_region() for a driver to exclude
> the possibility of userspace induced corruption while accessing the DOE
> mailbox. Likely there are other configuration state assumptions that a
> driver may want to assert are under its exclusive control, so this
> capability is not limited to any specific configuration range.
> 
> Since writes are targeted and are already prepared for failure the
> entire request is failed. The same can not be done for reads as the
> device completely disappears from lspci output if any configuration
> register in the request is exclusive. Instead skip the actual
> configuration cycle on a per-access basis and return all f's as if the
> read had failed.
> 
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/pci/access.c    |    5 +++--
>  drivers/pci/pci-sysfs.c |    3 +++
>  drivers/pci/probe.c     |    5 +++++
>  include/linux/ioport.h  |    2 ++
>  include/linux/pci.h     |   16 ++++++++++++++++
>  kernel/resource.c       |   24 +++++++++++-------------
>  6 files changed, 40 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/access.c b/drivers/pci/access.c
> index 46935695cfb9..a6b3cdfbd505 100644
> --- a/drivers/pci/access.c
> +++ b/drivers/pci/access.c
> @@ -225,8 +225,9 @@ int pci_user_read_config_##size						\
>  	raw_spin_lock_irq(&pci_lock);				\
>  	if (unlikely(dev->block_cfg_access))				\
>  		pci_wait_cfg(dev);					\
> -	ret = dev->bus->ops->read(dev->bus, dev->devfn,			\
> -					pos, sizeof(type), &data);	\
> +	if (!resource_is_exclusive(&dev->config_resource, pos, sizeof(type))) \
> +		ret = dev->bus->ops->read(dev->bus, dev->devfn,		\
> +					  pos, sizeof(type), &data);	\
>  	raw_spin_unlock_irq(&pci_lock);				\
>  	*val = (type)data;						\
>  	return pcibios_err_to_errno(ret);				\
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index f8afd54ca3e1..4ce2831ffadc 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -753,6 +753,9 @@ static ssize_t pci_write_config(struct file *filp, struct kobject *kobj,
>  	u8 *data = (u8 *) buf;
>  	int ret;
>  
> +	if (resource_is_exclusive(&dev->config_resource, off, count))
> +		return -EBUSY;
> +
>  	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
>  	if (ret)
>  		return ret;
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 953f15abc850..8fa3a6f38b53 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2240,6 +2240,11 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
>  	INIT_LIST_HEAD(&dev->bus_list);
>  	dev->dev.type = &pci_dev_type;
>  	dev->bus = pci_bus_get(bus);
> +	dev->config_resource = (struct resource) {
> +		.name = "PCI Config",
> +		.start = 0,
> +		.end = -1,
> +	};
>  
>  	return dev;
>  }
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 55de385c839c..e1487a892fe7 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -308,6 +308,8 @@ extern void __devm_release_region(struct device *dev, struct resource *parent,
>  				  resource_size_t start, resource_size_t n);
>  extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size);
>  extern bool iomem_is_exclusive(u64 addr);
> +extern bool resource_is_exclusive(struct resource *resource, u64 addr,
> +				  resource_size_t size);
>  
>  extern int
>  walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 86c799c97b77..d3d78e0df2e7 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -401,6 +401,7 @@ struct pci_dev {
>  	 */
>  	unsigned int	irq;
>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
> +	struct resource config_resource;	/* driver exclusive config register ranges */
>  
>  	bool		match_driver;		/* Skip attaching driver */
>  
> @@ -1330,6 +1331,21 @@ int pci_request_selected_regions(struct pci_dev *, int, const char *);
>  int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *);
>  void pci_release_selected_regions(struct pci_dev *, int);
>  
> +static inline __must_check struct resource *
> +pci_request_config_region(struct pci_dev *pdev, unsigned int where,
> +			  unsigned int len, const char *name)
> +{
> +	return __request_region(&pdev->config_resource, where, len, name,
> +				IORESOURCE_EXCLUSIVE);
> +}
> +
> +static inline void pci_release_config_region(struct pci_dev *pdev,
> +					     unsigned int where,
> +					     unsigned int len)
> +{
> +	__release_region(&pdev->config_resource, where, len);
> +}
> +
>  /* drivers/pci/bus.c */
>  void pci_add_resource(struct list_head *resources, struct resource *res);
>  void pci_add_resource_offset(struct list_head *resources, struct resource *res,
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 627e61b0c124..ab1aed06e8b0 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1706,27 +1706,16 @@ static int strict_iomem_checks;
>  #endif
>  
>  /*
> - * check if an address is reserved in the iomem resource tree
> + * check if an address is reserved in the @resource tree
>   * returns true if reserved, false if not reserved.
>   */
> -bool iomem_is_exclusive(u64 addr)
> +bool resource_is_exclusive(struct resource *p, u64 addr, resource_size_t size)
>  {
> -	struct resource *p = &iomem_resource;
>  	bool err = false;
>  	loff_t l;
> -	int size = PAGE_SIZE;
> -
> -	if (!strict_iomem_checks)
> -		return false;
> -
> -	addr = addr & PAGE_MASK;
>  
>  	read_lock(&resource_lock);
>  	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
> -		/*
> -		 * We can probably skip the resources without
> -		 * IORESOURCE_IO attribute?
> -		 */
>  		if (p->start >= addr + size)
>  			break;
>  		if (p->end < addr)
> @@ -1749,6 +1738,15 @@ bool iomem_is_exclusive(u64 addr)
>  	return err;
>  }
>  
> +bool iomem_is_exclusive(u64 addr)
> +{
> +	if (!strict_iomem_checks)
> +		return false;
> +
> +	return resource_is_exclusive(&iomem_resource, addr & PAGE_MASK,
> +				     PAGE_SIZE);
> +}
> +
>  struct resource_entry *resource_list_create_entry(struct resource *res,
>  						  size_t extra_size)
>  {
>
Greg KH March 27, 2021, 10:46 a.m. UTC | #8
On Fri, Mar 26, 2021 at 11:12:47AM -0500, Bjorn Helgaas wrote:
> [+cc Christoph]
> 
> On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > configuration cycles. It assumes one initiator at a time is
> > reading/writing the data registers. If userspace reads from the response
> > data payload it may steal data that a kernel driver was expecting to
> > read. If userspace writes to the request payload it may corrupt the
> > request a driver was trying to send.
> 
> IIUC the problem we're talking about is that userspace config access,
> e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE.
> I attached what I think are the relevant bits from the spec.
> 
> It looks to me like config *reads* should not be a problem: A read of
> Write Data Mailbox always returns 0 and looks innocuous.  A userspace
> read of Read Data Mailbox may return a DW of the data object, but it
> doesn't advance the cursor, so it shouldn't interfere with a kernel
> read.  
> 
> A write to Write Data Mailbox could obviously corrupt an object being
> written to the device.  A config write to Read Data Mailbox *does*
> advance the cursor, so that would definitely interfere with a kernel
> user.  
> 
> So I think we're really talking about an issue with "setpci" and I
> don't expect "lspci" to be a problem.  "setpci" is a valuable tool,
> and the fact that it can hose your system is not really news.  I don't
> know how hard we should work to protect against that.

Thanks for looking this up and letting us know.

So this should be fine, reads are ok, it's not as crazy of a protocol
design as Dan alluded to, so the kernel should be ok.  No need to add
additional "protection" here at all, if you run setpci from userspace,
you get what you asked for :)

thanks,

greg k-h
Dan Williams March 29, 2021, 4:46 p.m. UTC | #9
On Fri, Mar 26, 2021 at 9:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Christoph]
>
> On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > configuration cycles. It assumes one initiator at a time is
> > reading/writing the data registers. If userspace reads from the response
> > data payload it may steal data that a kernel driver was expecting to
> > read. If userspace writes to the request payload it may corrupt the
> > request a driver was trying to send.
>
> IIUC the problem we're talking about is that userspace config access,
> e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE.
> I attached what I think are the relevant bits from the spec.
>
> It looks to me like config *reads* should not be a problem: A read of
> Write Data Mailbox always returns 0 and looks innocuous.  A userspace
> read of Read Data Mailbox may return a DW of the data object, but it
> doesn't advance the cursor, so it shouldn't interfere with a kernel
> read.
>
> A write to Write Data Mailbox could obviously corrupt an object being
> written to the device.  A config write to Read Data Mailbox *does*
> advance the cursor, so that would definitely interfere with a kernel
> user.
>
> So I think we're really talking about an issue with "setpci" and I
> don't expect "lspci" to be a problem.  "setpci" is a valuable tool,
> and the fact that it can hose your system is not really news.  I don't
> know how hard we should work to protect against that.

True, the threat is smaller than I was reading, I apologize for that
noise. Temporary blocking over kernel DOE cycles seems sufficient for
now.
Jonathan Cameron March 30, 2021, 12:20 p.m. UTC | #10
On Mon, 29 Mar 2021 09:46:15 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, Mar 26, 2021 at 9:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Christoph]
> >
> > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:  
> > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > > configuration cycles. It assumes one initiator at a time is
> > > reading/writing the data registers. If userspace reads from the response
> > > data payload it may steal data that a kernel driver was expecting to
> > > read. If userspace writes to the request payload it may corrupt the
> > > request a driver was trying to send.  
> >
> > IIUC the problem we're talking about is that userspace config access,
> > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE.
> > I attached what I think are the relevant bits from the spec.
> >
> > It looks to me like config *reads* should not be a problem: A read of
> > Write Data Mailbox always returns 0 and looks innocuous.  A userspace
> > read of Read Data Mailbox may return a DW of the data object, but it
> > doesn't advance the cursor, so it shouldn't interfere with a kernel
> > read.
> >
> > A write to Write Data Mailbox could obviously corrupt an object being
> > written to the device.  A config write to Read Data Mailbox *does*
> > advance the cursor, so that would definitely interfere with a kernel
> > user.
> >
> > So I think we're really talking about an issue with "setpci" and I
> > don't expect "lspci" to be a problem.  "setpci" is a valuable tool,
> > and the fact that it can hose your system is not really news.  I don't
> > know how hard we should work to protect against that.  
> 
> True, the threat is smaller than I was reading, I apologize for that
> noise. Temporary blocking over kernel DOE cycles seems sufficient for
> now.

Was on vacation. Glad I read the whole thread before replying.  The
key point as Bjorn identified is that reads don't hurt and as noted
setpci can break many things anyway so I think we are safe on this.

There are more 'exciting' questions to be addressed about firmware
vs OS ownership of particular DOE mailboxes but that's a whole
separate bit of fun for the future.

Jonathan
Dan Williams May 13, 2021, 9:28 p.m. UTC | #11
On Sat, Mar 27, 2021 at 3:47 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Mar 26, 2021 at 11:12:47AM -0500, Bjorn Helgaas wrote:
> > [+cc Christoph]
> >
> > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > > configuration cycles. It assumes one initiator at a time is
> > > reading/writing the data registers. If userspace reads from the response
> > > data payload it may steal data that a kernel driver was expecting to
> > > read. If userspace writes to the request payload it may corrupt the
> > > request a driver was trying to send.
> >
> > IIUC the problem we're talking about is that userspace config access,
> > e.g., via "lspci" or "setpci" may interfere with kernel usage of DOE.
> > I attached what I think are the relevant bits from the spec.
> >
> > It looks to me like config *reads* should not be a problem: A read of
> > Write Data Mailbox always returns 0 and looks innocuous.  A userspace
> > read of Read Data Mailbox may return a DW of the data object, but it
> > doesn't advance the cursor, so it shouldn't interfere with a kernel
> > read.
> >
> > A write to Write Data Mailbox could obviously corrupt an object being
> > written to the device.  A config write to Read Data Mailbox *does*
> > advance the cursor, so that would definitely interfere with a kernel
> > user.
> >
> > So I think we're really talking about an issue with "setpci" and I
> > don't expect "lspci" to be a problem.  "setpci" is a valuable tool,
> > and the fact that it can hose your system is not really news.  I don't
> > know how hard we should work to protect against that.
>
> Thanks for looking this up and letting us know.
>
> So this should be fine, reads are ok, it's not as crazy of a protocol
> design as Dan alluded to, so the kernel should be ok.  No need to add
> additional "protection" here at all, if you run setpci from userspace,
> you get what you asked for :)
>

Circling back to this after thinking of the implications and looking
at the review of the DOE code, this situation is different than your
typical "userspace gets to keep the pieces if it does a configuration
write". If we assume well behaved non-malicious userpace, it has no
reason to muck with critical config registers. If userspace changes a
BAR value and the system fails, yup, that's its own fault. The DOE
mailbox is different. There are legitimate reasons why non-broken
userspace would want to read some DOE payloads while the kernel is
retrieving its payloads. It also simplifies the kernel implementation
if it does need to worry about other agents interrupting its
transfers. My mistake was making this restriction apply to reads, but
I'm not on the same page that blocking writes is fruitless just
because userspace can do other damage with config writes.

So either the kernel DOE driver needs to use pci_cfg_access_lock() and
revalidate the state of the DOE after acquiring that lock, or it needs
to claim the interface completely and provide a driver for userspace
to submit requests that can be scheduled in the kernel's DOE queue.
diff mbox series

Patch

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 46935695cfb9..a6b3cdfbd505 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -225,8 +225,9 @@  int pci_user_read_config_##size						\
 	raw_spin_lock_irq(&pci_lock);				\
 	if (unlikely(dev->block_cfg_access))				\
 		pci_wait_cfg(dev);					\
-	ret = dev->bus->ops->read(dev->bus, dev->devfn,			\
-					pos, sizeof(type), &data);	\
+	if (!resource_is_exclusive(&dev->config_resource, pos, sizeof(type))) \
+		ret = dev->bus->ops->read(dev->bus, dev->devfn,		\
+					  pos, sizeof(type), &data);	\
 	raw_spin_unlock_irq(&pci_lock);				\
 	*val = (type)data;						\
 	return pcibios_err_to_errno(ret);				\
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index f8afd54ca3e1..4ce2831ffadc 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -753,6 +753,9 @@  static ssize_t pci_write_config(struct file *filp, struct kobject *kobj,
 	u8 *data = (u8 *) buf;
 	int ret;
 
+	if (resource_is_exclusive(&dev->config_resource, off, count))
+		return -EBUSY;
+
 	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
 	if (ret)
 		return ret;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 953f15abc850..8fa3a6f38b53 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2240,6 +2240,11 @@  struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 	INIT_LIST_HEAD(&dev->bus_list);
 	dev->dev.type = &pci_dev_type;
 	dev->bus = pci_bus_get(bus);
+	dev->config_resource = (struct resource) {
+		.name = "PCI Config",
+		.start = 0,
+		.end = -1,
+	};
 
 	return dev;
 }
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 55de385c839c..e1487a892fe7 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -308,6 +308,8 @@  extern void __devm_release_region(struct device *dev, struct resource *parent,
 				  resource_size_t start, resource_size_t n);
 extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size);
 extern bool iomem_is_exclusive(u64 addr);
+extern bool resource_is_exclusive(struct resource *resource, u64 addr,
+				  resource_size_t size);
 
 extern int
 walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..d3d78e0df2e7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -401,6 +401,7 @@  struct pci_dev {
 	 */
 	unsigned int	irq;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
+	struct resource config_resource;	/* driver exclusive config register ranges */
 
 	bool		match_driver;		/* Skip attaching driver */
 
@@ -1330,6 +1331,21 @@  int pci_request_selected_regions(struct pci_dev *, int, const char *);
 int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *);
 void pci_release_selected_regions(struct pci_dev *, int);
 
+static inline __must_check struct resource *
+pci_request_config_region(struct pci_dev *pdev, unsigned int where,
+			  unsigned int len, const char *name)
+{
+	return __request_region(&pdev->config_resource, where, len, name,
+				IORESOURCE_EXCLUSIVE);
+}
+
+static inline void pci_release_config_region(struct pci_dev *pdev,
+					     unsigned int where,
+					     unsigned int len)
+{
+	__release_region(&pdev->config_resource, where, len);
+}
+
 /* drivers/pci/bus.c */
 void pci_add_resource(struct list_head *resources, struct resource *res);
 void pci_add_resource_offset(struct list_head *resources, struct resource *res,
diff --git a/kernel/resource.c b/kernel/resource.c
index 627e61b0c124..ab1aed06e8b0 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1706,27 +1706,16 @@  static int strict_iomem_checks;
 #endif
 
 /*
- * check if an address is reserved in the iomem resource tree
+ * check if an address is reserved in the @resource tree
  * returns true if reserved, false if not reserved.
  */
-bool iomem_is_exclusive(u64 addr)
+bool resource_is_exclusive(struct resource *p, u64 addr, resource_size_t size)
 {
-	struct resource *p = &iomem_resource;
 	bool err = false;
 	loff_t l;
-	int size = PAGE_SIZE;
-
-	if (!strict_iomem_checks)
-		return false;
-
-	addr = addr & PAGE_MASK;
 
 	read_lock(&resource_lock);
 	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
-		/*
-		 * We can probably skip the resources without
-		 * IORESOURCE_IO attribute?
-		 */
 		if (p->start >= addr + size)
 			break;
 		if (p->end < addr)
@@ -1749,6 +1738,15 @@  bool iomem_is_exclusive(u64 addr)
 	return err;
 }
 
+bool iomem_is_exclusive(u64 addr)
+{
+	if (!strict_iomem_checks)
+		return false;
+
+	return resource_is_exclusive(&iomem_resource, addr & PAGE_MASK,
+				     PAGE_SIZE);
+}
+
 struct resource_entry *resource_list_create_entry(struct resource *res,
 						  size_t extra_size)
 {