
[v9,7/9] PCI/bwctrl: Add API to set PCIe Link Speed

Message ID 20241018144755.7875-8-ilpo.jarvinen@linux.intel.com (mailing list archive)
State Accepted
Delegated to: Bjorn Helgaas
Series PCI: Add PCIe bandwidth controller

Commit Message

Ilpo Järvinen Oct. 18, 2024, 2:47 p.m. UTC
Currently, PCIe Link Speeds are adjusted by custom code rather than in
a common function provided in PCI core. PCIe bandwidth controller
(bwctrl) introduces an in-kernel API to set PCIe Link Speed.

Convert the Target Speed quirk to use the new API. The Target Speed quirk
runs very early, when bwctrl is not yet probed for a Port, and can also
run later, when bwctrl is already set up for the Port. This requires the
per-port mutex (set_speed_mutex) to be taken only if the bwctrl setup
is already complete.

The new API is also intended to be used in an upcoming commit that adds
a thermal cooling device to throttle PCIe bandwidth when thermal
thresholds are reached.

The PCIe bandwidth control procedure is as follows. The highest speed
supported by both the Port and the PCIe device that is not higher than the
requested speed is selected and written into the Target Link Speed field of
the Link Control 2 Register. Then the bandwidth controller retrains the
PCIe Link.

Bandwidth Notifications enable cur_bus_speed in struct pci_bus to keep
track of PCIe Link Speed changes. While Bandwidth Notifications should
also be generated when the bandwidth controller alters the PCIe Link
Speed, a few platforms do not deliver the LBMS interrupt after Link
Training as expected. Thus, after changing the Link Speed, the bandwidth
controller makes an additional read of the Link Status Register to ensure
cur_bus_speed is consistent with the new PCIe Link Speed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/pci.h         |  20 +++++
 drivers/pci/pcie/bwctrl.c | 178 ++++++++++++++++++++++++++++++++++++--
 drivers/pci/quirks.c      |  17 +---
 include/linux/pci.h       |  10 +++
 4 files changed, 206 insertions(+), 19 deletions(-)

Comments

Lukas Wunner Nov. 12, 2024, 3:47 p.m. UTC | #1
On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> +EXPORT_SYMBOL_GPL(pcie_set_target_speed);

My apologies for another belated comment on this series.
This patch is now a688ab21eb72 on pci/bwctrl:

I note that pcie_set_target_speed() is not called by a modular user
(CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
isn't really necessary right now.  I don't know if it was added
intentionally because some modular user is expected to show up
in the near future.


> @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  	if (!data)
>  		return -ENOMEM;
>  
> +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
>  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
>  					pcie_bwnotif_irq_thread,
>  					IRQF_SHARED | IRQF_ONESHOT,

We generally try to avoid devm_*() functions in port service drivers
because if we later on move them into the PCI core (which is the plan),
we'll have to unroll them.  Not the end of the world that they're used
here, just not ideal.

Thanks,

Lukas
Ilpo Järvinen Nov. 12, 2024, 4:01 p.m. UTC | #2
On Tue, 12 Nov 2024, Lukas Wunner wrote:

> On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);
> 
> My apologies for another belated comment on this series.
> This patch is now a688ab21eb72 on pci/bwctrl:
> 
> I note that pcie_set_target_speed() is not called my a modular user
> (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> isn't really necessary right now.  I don't know if it was added
> intentionally because some modular user is expected to show up
> in the near future.

It's probably a thinko to add it at all, but then there has been talk
about other users interested in the API too, so it's not far-fetched that
we could see a user. No idea about timelines though.

There are some AMD GPU drivers tweaking the TLS field on their own, but
they also touch some HW-specific registers (although, IIRC, they only
touch the Endpoint's TLS). I was thinking of converting them, but I'm
unsure whether that would yield a straightforward, working conversion
(without the ability to test with the HW). But TBH, it's not my
highest-priority item.

> > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> >  	if (!data)
> >  		return -ENOMEM;
> >  
> > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> >  					pcie_bwnotif_irq_thread,
> >  					IRQF_SHARED | IRQF_ONESHOT,
> 
> We generally try to avoid devm_*() functions in port service drivers
> because if we later on move them into the PCI core (which is the plan),
> we'll have to unroll them.  Not the end of the world that they're used
> here, just not ideal.

I think Jonathan disagrees with you on that:

https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/

:-)
Bjorn Helgaas Nov. 12, 2024, 8:43 p.m. UTC | #3
On Tue, Nov 12, 2024 at 04:47:48PM +0100, Lukas Wunner wrote:
> On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);
> 
> My apologies for another belated comment on this series.
> This patch is now a688ab21eb72 on pci/bwctrl:
> 
> I note that pcie_set_target_speed() is not called my a modular user
> (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> isn't really necessary right now.  I don't know if it was added
> intentionally because some modular user is expected to show up
> in the near future.

Dropped the export for now, we can add it back if/when needed.

Bjorn
Jonathan Cameron Nov. 18, 2024, 1:03 p.m. UTC | #4
On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Tue, 12 Nov 2024, Lukas Wunner wrote:
> 
> > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:  
> > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);  
> > 
> > My apologies for another belated comment on this series.
> > This patch is now a688ab21eb72 on pci/bwctrl:
> > 
> > I note that pcie_set_target_speed() is not called my a modular user
> > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > isn't really necessary right now.  I don't know if it was added
> > intentionally because some modular user is expected to show up
> > in the near future.  
> 
> Its probably a thinko to add it at all but then there have been talk about 
> other users interested in the API too so it's not far fetched we could see 
> a user. No idea about timelines though.
> 
> There are some AMD GPU drivers tweaking the TLS field on their own but 
> they also touch some HW specific registers (although, IIRC, they only 
> touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> that yields something very straightforward and ends up producing a working 
> conversion or not (without ability to test with the HW). But TBH, not on 
> my highest priority item.
> 
> > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > >  	if (!data)
> > >  		return -ENOMEM;
> > >  
> > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > >  					pcie_bwnotif_irq_thread,
> > >  					IRQF_SHARED | IRQF_ONESHOT,  
> > 
> > We generally try to avoid devm_*() functions in port service drivers
> > because if we later on move them into the PCI core (which is the plan),
> > we'll have to unroll them.  Not the end of the world that they're used
> > here, just not ideal.  
> 
> I think Jonathan disagrees with you on that:
> 
> https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/

Indeed - you beat me to it ;)

There is no practical way to move most of the port driver code into the PCI
core, and definitely not the interrupts. It is a shame though, as I'd much
prefer if we could do so.  At LPC, other issues such as power management
were called out as being very hard to handle, but to me the interrupt bit
is a single, relatively easy to understand blocker.

I've been very slow on getting back to this, but my current plan would be:

1) Split up the bits of the portdrv subdrivers that are actually core code
   (things like the AER counters etc.). The current mix in the same files
   makes it hard to reason about lifetimes etc.

2) Squash all the portdrv sub-drivers into simple library-style calls, so
   no pretend devices, with everything registered directly.  That cleans up
   a lot of the layering and still provides reusable code if it makes
   sense to have multiple drivers for ports or to reuse this code for
   something else. Note that along the way I'll check that we can build
   portdrv as a module - I don't plan to actually do that, but it shows
   the layering / abstractions all work if it is possible.  That will
   probably make use of devm_ where appropriate, as it simplifies a lot
   of paths.

3) Add the MSI-X stuff to support 'new' child drivers, but only where
   dynamic MSI-X is supported.

4) Register new child devices where the layering does make sense, i.e.
   where we are actually binding drivers that can be unbound etc.
   Only for cases where dynamic MSI-X is available.

I hope to get back to this in a week or so.

Jonathan

> 
> :-)
>
Ilpo Järvinen Nov. 18, 2024, 1:17 p.m. UTC | #5
On Mon, 18 Nov 2024, Jonathan Cameron wrote:

> On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> 
> > On Tue, 12 Nov 2024, Lukas Wunner wrote:
> > 
> > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:  
> > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);  
> > > 
> > > My apologies for another belated comment on this series.
> > > This patch is now a688ab21eb72 on pci/bwctrl:
> > > 
> > > I note that pcie_set_target_speed() is not called my a modular user
> > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > > isn't really necessary right now.  I don't know if it was added
> > > intentionally because some modular user is expected to show up
> > > in the near future.  
> > 
> > Its probably a thinko to add it at all but then there have been talk about 
> > other users interested in the API too so it's not far fetched we could see 
> > a user. No idea about timelines though.
> > 
> > There are some AMD GPU drivers tweaking the TLS field on their own but 
> > they also touch some HW specific registers (although, IIRC, they only 
> > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> > that yields something very straightforward and ends up producing a working 
> > conversion or not (without ability to test with the HW). But TBH, not on 
> > my highest priority item.
> > 
> > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > > >  	if (!data)
> > > >  		return -ENOMEM;
> > > >  
> > > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > > >  					pcie_bwnotif_irq_thread,
> > > >  					IRQF_SHARED | IRQF_ONESHOT,  
> > > 
> > > We generally try to avoid devm_*() functions in port service drivers
> > > because if we later on move them into the PCI core (which is the plan),
> > > we'll have to unroll them.  Not the end of the world that they're used
> > > here, just not ideal.  
> > 
> > I think Jonathan disagrees with you on that:
> > 
> > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/
> 
> Indeed - you beat me to it ;)
> 
> There is no practical way to move most of the port driver code into the PCI
> core and definitely not interrupts. It is a shame though as I'd much prefer
> if we could do so.  At LPC other issues some as power management were called
> out as being very hard to handle, but to me the interrupt bit is a single
> relatively easy to understand blocker.
> 
> I've been very slow on getting back to this, but my current plan would
> 
> 1) Split up the bits of portdrv subdrivers that are actually core code
>    (things like the AER counters etc) The current mix in the same files
>    makes it hard to reason about lifetimes etc.
> 
> 2) Squash all the portdrv sub drivers into simple library style calls so
>    no pretend devices, everything registered directly.  That cleans up
>    a lot of the layering and still provides reusable code if it makes
>    sense to have multiple drivers for ports or to reuse this code for
>    something else. Note that along the way I'll check we can build the
>    portdrv as a module - I don't plan to actually do that, but it shows
>    the layering / abstractions all work if it is possible.  That will
>    probably make use of devm_ where appropriate as it simplifies a lot
>    of paths.

I'm sorry to be a bit of a spoilsport here, but quirks make calls to the
ASPM code and now also to the bwctrl set Link Speed API, so I'm not
entirely sure you can actually succeed in that module test.

(It's just something that again indicates both would belong in the PCI
core, but sadly that direction is out of the question.)
Jonathan Cameron Nov. 20, 2024, 3:36 p.m. UTC | #6
On Mon, 18 Nov 2024 15:17:53 +0200 (EET)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Mon, 18 Nov 2024, Jonathan Cameron wrote:
> 
> > On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> >   
> > > On Tue, 12 Nov 2024, Lukas Wunner wrote:
> > >   
> > > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:    
> > > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);    
> > > > 
> > > > My apologies for another belated comment on this series.
> > > > This patch is now a688ab21eb72 on pci/bwctrl:
> > > > 
> > > > I note that pcie_set_target_speed() is not called my a modular user
> > > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > > > isn't really necessary right now.  I don't know if it was added
> > > > intentionally because some modular user is expected to show up
> > > > in the near future.    
> > > 
> > > Its probably a thinko to add it at all but then there have been talk about 
> > > other users interested in the API too so it's not far fetched we could see 
> > > a user. No idea about timelines though.
> > > 
> > > There are some AMD GPU drivers tweaking the TLS field on their own but 
> > > they also touch some HW specific registers (although, IIRC, they only 
> > > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> > > that yields something very straightforward and ends up producing a working 
> > > conversion or not (without ability to test with the HW). But TBH, not on 
> > > my highest priority item.
> > >   
> > > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > > > >  	if (!data)
> > > > >  		return -ENOMEM;
> > > > >  
> > > > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > > > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > > > >  					pcie_bwnotif_irq_thread,
> > > > >  					IRQF_SHARED | IRQF_ONESHOT,    
> > > > 
> > > > We generally try to avoid devm_*() functions in port service drivers
> > > > because if we later on move them into the PCI core (which is the plan),
> > > > we'll have to unroll them.  Not the end of the world that they're used
> > > > here, just not ideal.    
> > > 
> > > I think Jonathan disagrees with you on that:
> > > 
> > > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/  
> > 
> > Indeed - you beat me to it ;)
> > 
> > There is no practical way to move most of the port driver code into the PCI
> > core and definitely not interrupts. It is a shame though as I'd much prefer
> > if we could do so.  At LPC other issues some as power management were called
> > out as being very hard to handle, but to me the interrupt bit is a single
> > relatively easy to understand blocker.
> > 
> > I've been very slow on getting back to this, but my current plan would
> > 
> > 1) Split up the bits of portdrv subdrivers that are actually core code
> >    (things like the AER counters etc) The current mix in the same files
> >    makes it hard to reason about lifetimes etc.
> > 
> > 2) Squash all the portdrv sub drivers into simple library style calls so
> >    no pretend devices, everything registered directly.  That cleans up
> >    a lot of the layering and still provides reusable code if it makes
> >    sense to have multiple drivers for ports or to reuse this code for
> >    something else. Note that along the way I'll check we can build the
> >    portdrv as a module - I don't plan to actually do that, but it shows
> >    the layering / abstractions all work if it is possible.  That will
> >    probably make use of devm_ where appropriate as it simplifies a lot
> >    of paths.  
> 
> I'm sorry to be a bit of a spoilsport here but quirks make calls to ASPM 
> code and now also to bwctrl set Link Speed API so I'm not entire sure if 
> you can actual succeed in that module test.
> 
> (It's just something that again indicates both would belong to PCI core
> but sadly that direction is out of options).
> 
It may involve some bodges, but it is still a path to checking that the
layer splits at least make 'some' sense, and that the resulting
library-style code is suitable for reuse.  Possibly with an exception
for a few parts.

Thanks for the pointers to where the pitfalls lie!

Jonathan

Patch

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 33ed324d1953..c8ea672c1892 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -331,6 +331,17 @@  void pci_disable_bridge_window(struct pci_dev *dev);
 struct pci_bus *pci_bus_get(struct pci_bus *bus);
 void pci_bus_put(struct pci_bus *bus);
 
+#define PCIE_LNKCAP_SLS2SPEED(lnkcap)					\
+({									\
+	((lnkcap) == PCI_EXP_LNKCAP_SLS_64_0GB ? PCIE_SPEED_64_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_32_0GB ? PCIE_SPEED_32_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_16_0GB ? PCIE_SPEED_16_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_8_0GB ? PCIE_SPEED_8_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_5_0GB ? PCIE_SPEED_5_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_2_5GB ? PCIE_SPEED_2_5GT :	\
+	 PCI_SPEED_UNKNOWN);						\
+})
+
 /* PCIe link information from Link Capabilities 2 */
 #define PCIE_LNKCAP2_SLS2SPEED(lnkcap2) \
 	((lnkcap2) & PCI_EXP_LNKCAP2_SLS_64_0GB ? PCIE_SPEED_64_0GT : \
@@ -341,6 +352,15 @@  void pci_bus_put(struct pci_bus *bus);
 	 (lnkcap2) & PCI_EXP_LNKCAP2_SLS_2_5GB ? PCIE_SPEED_2_5GT : \
 	 PCI_SPEED_UNKNOWN)
 
+#define PCIE_LNKCTL2_TLS2SPEED(lnkctl2) \
+	((lnkctl2) == PCI_EXP_LNKCTL2_TLS_64_0GT ? PCIE_SPEED_64_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_32_0GT ? PCIE_SPEED_32_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_16_0GT ? PCIE_SPEED_16_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_8_0GT ? PCIE_SPEED_8_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_5_0GT ? PCIE_SPEED_5_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_2_5GT ? PCIE_SPEED_2_5GT : \
+	 PCI_SPEED_UNKNOWN)
+
 /* PCIe speed to Mb/s reduced by encoding overhead */
 #define PCIE_SPEED2MBS_ENC(speed) \
 	((speed) == PCIE_SPEED_64_0GT ? 64000*1/1 : \
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index b31a31453872..8a2bd1e887e2 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -7,6 +7,11 @@ 
  * Copyright (C) 2019 Dell Inc
  * Copyright (C) 2023-2024 Intel Corporation
  *
+ * The PCIe bandwidth controller provides a way to alter PCIe Link Speeds
+ * and notify the operating system when the Link Width or Speed changes. The
+ * notification capability is required for all Root Ports and Downstream
+ * Ports supporting Link Width wider than x1 and/or multiple Link Speeds.
+ *
  * This service port driver hooks into the Bandwidth Notification interrupt
  * watching for changes or links becoming degraded in operation. It updates
  * the cached Current Link Speed that is exposed to user space through sysfs.
@@ -15,9 +20,12 @@ 
 #define dev_fmt(fmt) "bwctrl: " fmt
 
 #include <linux/atomic.h>
+#include <linux/bitops.h>
+#include <linux/bits.h>
 #include <linux/cleanup.h>
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/mutex.h>
 #include <linux/pci.h>
 #include <linux/rwsem.h>
 #include <linux/slab.h>
@@ -28,14 +36,167 @@ 
 
 /**
  * struct pcie_bwctrl_data - PCIe bandwidth controller
+ * @set_speed_mutex:	Serializes link speed changes
  * @lbms_count:		Count for LBMS (since last reset)
  */
 struct pcie_bwctrl_data {
+	struct mutex set_speed_mutex;
 	atomic_t lbms_count;
 };
 
-/* Prevents port removal during LBMS count accessors */
+/*
+ * Prevent port removal during LBMS count accessors and Link Speed changes.
+ *
+ * These have to be differentiated because pcie_bwctrl_change_speed() calls
+ * pcie_retrain_link() which uses LBMS count reset accessor on success
+ * (using just one rwsem triggers "possible recursive locking detected"
+ * warning).
+ */
 static DECLARE_RWSEM(pcie_bwctrl_lbms_rwsem);
+static DECLARE_RWSEM(pcie_bwctrl_setspeed_rwsem);
+
+static bool pcie_valid_speed(enum pci_bus_speed speed)
+{
+	return (speed >= PCIE_SPEED_2_5GT) && (speed <= PCIE_SPEED_64_0GT);
+}
+
+static u16 pci_bus_speed2lnkctl2(enum pci_bus_speed speed)
+{
+	static const u8 speed_conv[] = {
+		[PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
+		[PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
+		[PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
+		[PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
+		[PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
+		[PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
+	};
+
+	if (WARN_ON_ONCE(!pcie_valid_speed(speed)))
+		return 0;
+
+	return speed_conv[speed];
+}
+
+static inline u16 pcie_supported_speeds2target_speed(u8 supported_speeds)
+{
+	return __fls(supported_speeds);
+}
+
+/**
+ * pcie_bwctrl_select_speed - Select Target Link Speed
+ * @port:	PCIe Port
+ * @speed_req:	requested PCIe Link Speed
+ *
+ * Select Target Link Speed by taking into account the Supported Link Speeds of
+ * both the Root Port and the Endpoint.
+ *
+ * Return: Target Link Speed (1=2.5GT/s, 2=5GT/s, 3=8GT/s, etc.)
+ */
+static u16 pcie_bwctrl_select_speed(struct pci_dev *port, enum pci_bus_speed speed_req)
+{
+	struct pci_bus *bus = port->subordinate;
+	u8 desired_speeds, supported_speeds;
+	struct pci_dev *dev;
+
+	desired_speeds = GENMASK(pci_bus_speed2lnkctl2(speed_req),
+				 __fls(PCI_EXP_LNKCAP2_SLS_2_5GB));
+
+	supported_speeds = port->supported_speeds;
+	if (bus) {
+		down_read(&pci_bus_sem);
+		dev = list_first_entry_or_null(&bus->devices, struct pci_dev, bus_list);
+		if (dev)
+			supported_speeds &= dev->supported_speeds;
+		up_read(&pci_bus_sem);
+	}
+	if (!supported_speeds)
+		return PCI_EXP_LNKCAP2_SLS_2_5GB;
+
+	return pcie_supported_speeds2target_speed(supported_speeds & desired_speeds);
+}
+
+static int pcie_bwctrl_change_speed(struct pci_dev *port, u16 target_speed, bool use_lt)
+{
+	int ret;
+
+	ret = pcie_capability_clear_and_set_word(port, PCI_EXP_LNKCTL2,
+						 PCI_EXP_LNKCTL2_TLS, target_speed);
+	if (ret != PCIBIOS_SUCCESSFUL)
+		return pcibios_err_to_errno(ret);
+
+	ret = pcie_retrain_link(port, use_lt);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Ensure link speed updates also with platforms that have problems
+	 * with notifications.
+	 */
+	if (port->subordinate)
+		pcie_update_link_speed(port->subordinate);
+
+	return 0;
+}
+
+/**
+ * pcie_set_target_speed - Set downstream Link Speed for PCIe Port
+ * @port:	PCIe Port
+ * @speed_req:	requested PCIe Link Speed
+ * @use_lt:	Wait for the LT or DLLLA bit to detect the end of link training
+ *
+ * Attempts to set PCIe Port Link Speed to @speed_req. @speed_req may be
+ * adjusted downwards to the best speed supported by both the Port and PCIe
+ * Device underneath it.
+ *
+ * Return:
+ * * 0		- on success
+ * * -EINVAL	- @speed_req is not a PCIe Link Speed
+ * * -ENODEV	- @port is not controllable
+ * * -ETIMEDOUT	- changing Link Speed took too long
+ * * -EAGAIN	- Link Speed was changed but @speed_req was not achieved
+ */
+int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
+			  bool use_lt)
+{
+	struct pci_bus *bus = port->subordinate;
+	u16 target_speed;
+	int ret;
+
+	if (WARN_ON_ONCE(!pcie_valid_speed(speed_req)))
+		return -EINVAL;
+
+	if (bus && bus->cur_bus_speed == speed_req)
+		return 0;
+
+	target_speed = pcie_bwctrl_select_speed(port, speed_req);
+
+	scoped_guard(rwsem_read, &pcie_bwctrl_setspeed_rwsem) {
+		struct pcie_bwctrl_data *data = port->link_bwctrl;
+
+		/*
+		 * port->link_bwctrl is NULL during initial scan when called
+		 * e.g. from the Target Speed quirk.
+		 */
+		if (data)
+			mutex_lock(&data->set_speed_mutex);
+
+		ret = pcie_bwctrl_change_speed(port, target_speed, use_lt);
+
+		if (data)
+			mutex_unlock(&data->set_speed_mutex);
+	}
+
+	/*
+	 * Despite setting higher speed into the Target Link Speed, empty
+	 * bus won't train to 5GT+ speeds.
+	 */
+	if (!ret && bus && bus->cur_bus_speed != speed_req &&
+	    !list_empty(&bus->devices))
+		ret = -EAGAIN;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcie_set_target_speed);
 
 static void pcie_bwnotif_enable(struct pcie_device *srv)
 {
@@ -135,6 +296,7 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 	if (!data)
 		return -ENOMEM;
 
+	devm_mutex_init(&srv->device, &data->set_speed_mutex);
 	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
 					pcie_bwnotif_irq_thread,
 					IRQF_SHARED | IRQF_ONESHOT,
@@ -142,9 +304,11 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 	if (ret)
 		return ret;
 
-	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
-		port->link_bwctrl = no_free_ptr(data);
-		pcie_bwnotif_enable(srv);
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
+			port->link_bwctrl = no_free_ptr(data);
+			pcie_bwnotif_enable(srv);
+		}
 	}
 
 	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
@@ -155,8 +319,10 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 static void pcie_bwnotif_remove(struct pcie_device *srv)
 {
 	pcie_bwnotif_disable(srv->port);
-	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
-		srv->port->link_bwctrl = NULL;
+
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+			srv->port->link_bwctrl = NULL;
 }
 
 static int pcie_bwnotif_suspend(struct pcie_device *srv)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e6d502dca939..dcf1c86a5488 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -113,16 +113,11 @@  int pcie_failed_link_retrain(struct pci_dev *dev)
 
 		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
 
-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
-
-		ret = pcie_retrain_link(dev, false);
+		ret = pcie_set_target_speed(dev, PCIE_SPEED_2_5GT, false);
 		if (ret) {
 			pci_info(dev, "retraining failed\n");
-			pcie_capability_write_word(dev, PCI_EXP_LNKCTL2,
-						   oldlnkctl2);
-			pcie_retrain_link(dev, true);
+			pcie_set_target_speed(dev, PCIE_LNKCTL2_TLS2SPEED(oldlnkctl2),
+					      true);
 			return ret;
 		}
 
@@ -136,11 +131,7 @@  int pcie_failed_link_retrain(struct pci_dev *dev)
 
 		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
 		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
-
-		ret = pcie_retrain_link(dev, false);
+		ret = pcie_set_target_speed(dev, PCIE_LNKCAP_SLS2SPEED(lnkcap), false);
 		if (ret) {
 			pci_info(dev, "retraining failed\n");
 			return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 5f9de226be13..b5ce9513b06f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1798,9 +1798,19 @@  static inline int pci_irqd_intx_xlate(struct irq_domain *d,
 #ifdef CONFIG_PCIEPORTBUS
 extern bool pcie_ports_disabled;
 extern bool pcie_ports_native;
+
+int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
+			  bool use_lt);
 #else
 #define pcie_ports_disabled	true
 #define pcie_ports_native	false
+
+static inline int pcie_set_target_speed(struct pci_dev *port,
+					enum pci_bus_speed speed_req,
+					bool use_lt)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */