Message ID | 20241009095223.7093-7-ilpo.jarvinen@linux.intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | PCI: Add PCIe bandwidth controller | expand |
On Wed, 9 Oct 2024 12:52:21 +0300 Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > Currently, PCIe Link Speeds are adjusted by custom code rather than in > a common function provided in PCI core. PCIe bandwidth controller > (bwctrl) introduces an in-kernel API to set PCIe Link Speed. > > Convert Target Speed quirk to use the new API. The Target Speed quirk > runs very early when bwctrl is not yet probed for a Port and can also > run later when bwctrl is already setup for the Port, which requires the > per port mutex (set_speed_mutex) to be only taken if the bwctrl setup > is already complete. > > The new API is also intended to be used in an upcoming commit that adds > a thermal cooling device to throttle PCIe bandwidth when thermal > thresholds are reached. > > The PCIe bandwidth control procedure is as follows. The highest speed > supported by the Port and the PCIe device which is not higher than the > requested speed is selected and written into the Target Link Speed in > the Link Control 2 Register. Then bandwidth controller retrains the > PCIe Link. > > Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus > to keep track PCIe Link Speed changes. While Bandwidth Notifications > should also be generated when bandwidth controller alters the PCIe Link > Speed, a few platforms do not deliver LMBS interrupt after Link > Training as expected. Thus, after changing the Link Speed, bandwidth > controller makes additional read for the Link Status Register to ensure > cur_bus_speed is consistent with the new PCIe Link Speed. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Trivial stuff inline. The mutex_destroy discussion is a just a consistency thing given that call is rarely bothered with but here it might help with debug. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Jonathan > --- > drivers/pci/pci.h | 20 +++++ > drivers/pci/pcie/bwctrl.c | 161 +++++++++++++++++++++++++++++++++++++- > drivers/pci/quirks.c | 17 +--- > include/linux/pci.h | 10 +++ > 4 files changed, 193 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c > index 1b11b5da79d4..1d3680ea8e06 100644 > --- a/drivers/pci/pcie/bwctrl.c > +++ b/drivers/pci/pcie/bwctrl.c > @@ -7,6 +7,11 @@ > static void pcie_bwnotif_enable(struct pcie_device *srv) > { > struct pcie_bwctrl_data *data = get_service_data(srv); > @@ -135,6 +288,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > if (!data) > return -ENOMEM; > > + mutex_init(&data->set_speed_mutex); > set_service_data(srv, data); > > ret = request_threaded_irq(srv->irq, NULL, pcie_bwnotif_irq_thread, > @@ -142,8 +296,10 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > if (ret) > return ret; > > - port->link_bwctrl = no_free_ptr(data); > - pcie_bwnotif_enable(srv); > + scoped_guard(rwsem_write, &pcie_bwctrl_remove_rwsem) { Calling it remove_rwsem and using it to protect against not yet present seems odd. Maybe rename, pcie_bwctrl_bound_rswem or something like that? > + port->link_bwctrl = no_free_ptr(data); > + pcie_bwnotif_enable(srv); > + } > > pci_dbg(port, "enabled with IRQ %d\n", srv->irq); > > @@ -159,6 +315,7 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) > srv->port->link_bwctrl = NULL; > > free_irq(srv->irq, srv); > + mutex_destroy(&data->set_speed_mutex); Probably not worth doing. Also you don't do error handling for this above. Ideal is use devm_ for data and then devm_mutex_init() > kfree(data); > } >
On Thu, 17 Oct 2024, Jonathan Cameron wrote: > On Wed, 9 Oct 2024 12:52:21 +0300 > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > > > Currently, PCIe Link Speeds are adjusted by custom code rather than in > > a common function provided in PCI core. PCIe bandwidth controller > > (bwctrl) introduces an in-kernel API to set PCIe Link Speed. > > > > Convert Target Speed quirk to use the new API. The Target Speed quirk > > runs very early when bwctrl is not yet probed for a Port and can also > > run later when bwctrl is already setup for the Port, which requires the > > per port mutex (set_speed_mutex) to be only taken if the bwctrl setup > > is already complete. > > > > The new API is also intended to be used in an upcoming commit that adds > > a thermal cooling device to throttle PCIe bandwidth when thermal > > thresholds are reached. > > > > The PCIe bandwidth control procedure is as follows. The highest speed > > supported by the Port and the PCIe device which is not higher than the > > requested speed is selected and written into the Target Link Speed in > > the Link Control 2 Register. Then bandwidth controller retrains the > > PCIe Link. > > > > Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus > > to keep track PCIe Link Speed changes. While Bandwidth Notifications > > should also be generated when bandwidth controller alters the PCIe Link > > Speed, a few platforms do not deliver LMBS interrupt after Link > > Training as expected. Thus, after changing the Link Speed, bandwidth > > controller makes additional read for the Link Status Register to ensure > > cur_bus_speed is consistent with the new PCIe Link Speed. > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> > > Trivial stuff inline. The mutex_destroy discussion is a just a consistency > thing given that call is rarely bothered with but here it might help with > debug. > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Jonathan > > > --- > > drivers/pci/pci.h | 20 +++++ > > drivers/pci/pcie/bwctrl.c | 161 +++++++++++++++++++++++++++++++++++++- > > drivers/pci/quirks.c | 17 +--- > > include/linux/pci.h | 10 +++ > > 4 files changed, 193 insertions(+), 15 deletions(-) > > > > diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c > > index 1b11b5da79d4..1d3680ea8e06 100644 > > --- a/drivers/pci/pcie/bwctrl.c > > +++ b/drivers/pci/pcie/bwctrl.c > > @@ -7,6 +7,11 @@ > > > static void pcie_bwnotif_enable(struct pcie_device *srv) > > { > > struct pcie_bwctrl_data *data = get_service_data(srv); > > @@ -135,6 +288,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > > if (!data) > > return -ENOMEM; > > > > + mutex_init(&data->set_speed_mutex); > > set_service_data(srv, data); > > > > ret = request_threaded_irq(srv->irq, NULL, pcie_bwnotif_irq_thread, > > @@ -142,8 +296,10 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > > if (ret) > > return ret; > > > > - port->link_bwctrl = no_free_ptr(data); > > - pcie_bwnotif_enable(srv); > > + scoped_guard(rwsem_write, &pcie_bwctrl_remove_rwsem) { > > Calling it remove_rwsem and using it to protect against not yet > present seems odd. Maybe rename, pcie_bwctrl_bound_rswem or something like that? Okay but there's one related issue I need to address: I came across recursive locking splat from pcie_bwctrl_remove_rwsem (seems I had lockdep off for some odd reason in my config so it didn't trigger during my testing until now). The splat is because of: pcie_set_target_speed() scoped_guard(rwsem_read, &pcie_bwctrl_remove_rwsem) -> pcie_bwctrl_change_speed -> pcie_retrain_link() -> pcie_reset_lbms_count() guard(rwsem_read)(&pcie_bwctrl_remove_rwsem); I wouldn't want to special case pcie_retrain_link() just for BW control's sake, I plan to solve that by splitting pcie_bwctrl_remove_rwsem to have one for LBMS API and another for set speed API so split it into: - pcie_bwctrl_setspeed_rwsem - pcie_bwctrl_lbms_rwsem It requires taking both in .probe and .remove which is not as elegant I'd want but definitely seems more elegant compared with differentiating pcie_retrain_link(). If you, for some reason would instead prefer I differentiated pcie_retrain_link() for cases when holding pcie_bwctrl_remove_rwsem already (or pcie_bwctrl_bound_rwsem as suggested by you), please let me know. > > + port->link_bwctrl = no_free_ptr(data); > > + pcie_bwnotif_enable(srv); > > + } > > > > pci_dbg(port, "enabled with IRQ %d\n", srv->irq); > > > > @@ -159,6 +315,7 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) > > srv->port->link_bwctrl = NULL; > > > > free_irq(srv->irq, srv); > > + mutex_destroy(&data->set_speed_mutex); > > Probably not worth doing. Also you don't do error handling for this above. I'm not sure what error handling you're referring to? > Ideal is use devm_ for data and then devm_mutex_init() I've now converted alloc, irq, and mutex to devm. I had them at one point but then Lukas said it's the wrong direction consider what lies in future but since the future seeming has now changed I've no problem of making our lives easier with devm. :-) I also got rid of the service data use for your convinience.
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 33ed324d1953..c8ea672c1892 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -331,6 +331,17 @@ void pci_disable_bridge_window(struct pci_dev *dev); struct pci_bus *pci_bus_get(struct pci_bus *bus); void pci_bus_put(struct pci_bus *bus); +#define PCIE_LNKCAP_SLS2SPEED(lnkcap) \ +({ \ + ((lnkcap) == PCI_EXP_LNKCAP_SLS_64_0GB ? PCIE_SPEED_64_0GT : \ + (lnkcap) == PCI_EXP_LNKCAP_SLS_32_0GB ? PCIE_SPEED_32_0GT : \ + (lnkcap) == PCI_EXP_LNKCAP_SLS_16_0GB ? PCIE_SPEED_16_0GT : \ + (lnkcap) == PCI_EXP_LNKCAP_SLS_8_0GB ? PCIE_SPEED_8_0GT : \ + (lnkcap) == PCI_EXP_LNKCAP_SLS_5_0GB ? PCIE_SPEED_5_0GT : \ + (lnkcap) == PCI_EXP_LNKCAP_SLS_2_5GB ? PCIE_SPEED_2_5GT : \ + PCI_SPEED_UNKNOWN); \ +}) + /* PCIe link information from Link Capabilities 2 */ #define PCIE_LNKCAP2_SLS2SPEED(lnkcap2) \ ((lnkcap2) & PCI_EXP_LNKCAP2_SLS_64_0GB ? PCIE_SPEED_64_0GT : \ @@ -341,6 +352,15 @@ void pci_bus_put(struct pci_bus *bus); (lnkcap2) & PCI_EXP_LNKCAP2_SLS_2_5GB ? PCIE_SPEED_2_5GT : \ PCI_SPEED_UNKNOWN) +#define PCIE_LNKCTL2_TLS2SPEED(lnkctl2) \ + ((lnkctl2) == PCI_EXP_LNKCTL2_TLS_64_0GT ? PCIE_SPEED_64_0GT : \ + (lnkctl2) == PCI_EXP_LNKCTL2_TLS_32_0GT ? PCIE_SPEED_32_0GT : \ + (lnkctl2) == PCI_EXP_LNKCTL2_TLS_16_0GT ? PCIE_SPEED_16_0GT : \ + (lnkctl2) == PCI_EXP_LNKCTL2_TLS_8_0GT ? PCIE_SPEED_8_0GT : \ + (lnkctl2) == PCI_EXP_LNKCTL2_TLS_5_0GT ? PCIE_SPEED_5_0GT : \ + (lnkctl2) == PCI_EXP_LNKCTL2_TLS_2_5GT ? PCIE_SPEED_2_5GT : \ + PCI_SPEED_UNKNOWN) + /* PCIe speed to Mb/s reduced by encoding overhead */ #define PCIE_SPEED2MBS_ENC(speed) \ ((speed) == PCIE_SPEED_64_0GT ? 64000*1/1 : \ diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c index 1b11b5da79d4..1d3680ea8e06 100644 --- a/drivers/pci/pcie/bwctrl.c +++ b/drivers/pci/pcie/bwctrl.c @@ -7,6 +7,11 @@ * Copyright (C) 2019 Dell Inc * Copyright (C) 2023-2024 Intel Corporation * + * The PCIe bandwidth controller provides a way to alter PCIe Link Speeds + * and notify the operating system when the Link Width or Speed changes. The + * notification capability is required for all Root Ports and Downstream + * Ports supporting Link Width wider than x1 and/or multiple Link Speeds. + * * This service port driver hooks into the Bandwidth Notification interrupt * watching for changes or links becoming degraded in operation. It updates * the cached Current Link Speed that is exposed to user space through sysfs. @@ -15,9 +20,12 @@ #define dev_fmt(fmt) "bwctrl: " fmt #include <linux/atomic.h> +#include <linux/bitops.h> +#include <linux/bits.h> #include <linux/cleanup.h> #include <linux/errno.h> #include <linux/interrupt.h> +#include <linux/mutex.h> #include <linux/pci.h> #include <linux/rwsem.h> #include <linux/slab.h> @@ -28,15 +36,160 @@ /** * struct pcie_bwctrl_data - PCIe bandwidth controller + * @set_speed_mutex: Serializes link speed changes * @lbms_count: Count for LBMS (since last reset) */ struct pcie_bwctrl_data { + struct mutex set_speed_mutex; atomic_t lbms_count; }; /* Prevents port removal during link speed changes and LBMS count accessors */ static DECLARE_RWSEM(pcie_bwctrl_remove_rwsem); +static bool pcie_valid_speed(enum pci_bus_speed speed) +{ + return (speed >= PCIE_SPEED_2_5GT) && (speed <= PCIE_SPEED_64_0GT); +} + +static u16 pci_bus_speed2lnkctl2(enum pci_bus_speed speed) +{ + static const u8 speed_conv[] = { + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT, + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT, + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT, + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT, + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT, + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT, + }; + + if (WARN_ON_ONCE(!pcie_valid_speed(speed))) + return 0; + + return speed_conv[speed]; +} + +static inline u16 pcie_supported_speeds2target_speed(u8 supported_speeds) +{ + return __fls(supported_speeds); +} + +/** + * pcie_bwctrl_select_speed - Select Target Link Speed + * @port: PCIe Port + * @speed_req: requested PCIe Link Speed + * + * Select Target Link Speed by take into account Supported Link Speeds of + * both the Root Port and the Endpoint. + * + * Return: Target Link Speed (1=2.5GT/s, 2=5GT/s, 3=8GT/s, etc.) + */ +static u16 pcie_bwctrl_select_speed(struct pci_dev *port, enum pci_bus_speed speed_req) +{ + struct pci_bus *bus = port->subordinate; + u8 desired_speeds, supported_speeds; + struct pci_dev *dev; + + desired_speeds = GENMASK(pci_bus_speed2lnkctl2(speed_req), + __fls(PCI_EXP_LNKCAP2_SLS_2_5GB)); + + supported_speeds = port->supported_speeds; + if (bus) { + down_read(&pci_bus_sem); + dev = list_first_entry_or_null(&bus->devices, struct pci_dev, bus_list); + if (dev) + supported_speeds &= dev->supported_speeds; + up_read(&pci_bus_sem); + } + if (!supported_speeds) + return PCI_EXP_LNKCAP2_SLS_2_5GB; + + return pcie_supported_speeds2target_speed(supported_speeds & desired_speeds); +} + +static int pcie_bwctrl_change_speed(struct pci_dev *port, u16 target_speed, bool use_lt) +{ + int ret; + + ret = pcie_capability_clear_and_set_word(port, PCI_EXP_LNKCTL2, + PCI_EXP_LNKCTL2_TLS, target_speed); + if (ret != PCIBIOS_SUCCESSFUL) + return pcibios_err_to_errno(ret); + + ret = pcie_retrain_link(port, use_lt); + if (ret < 0) + return ret; + + /* + * Ensure link speed updates also with platforms that have problems + * with notifications. + */ + if (port->subordinate) + pcie_update_link_speed(port->subordinate); + + return 0; +} + +/** + * pcie_set_target_speed - Set downstream Link Speed for PCIe Port + * @port: PCIe Port + * @speed_req: requested PCIe Link Speed + * @use_lt: Wait for the LT or DLLLA bit to detect the end of link training + * + * Attempts to set PCIe Port Link Speed to @speed_req. @speed_req may be + * adjusted downwards to the best speed supported by both the Port and PCIe + * Device underneath it. + * + * Return: + * * 0 - on success + * * -EINVAL - @speed_req is not a PCIe Link Speed + * * -ENODEV - @port is not controllable + * * -ETIMEDOUT - changing Link Speed took too long + * * -EAGAIN - Link Speed was changed but @speed_req was not achieved + */ +int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req, + bool use_lt) +{ + struct pci_bus *bus = port->subordinate; + u16 target_speed; + int ret; + + if (WARN_ON_ONCE(!pcie_valid_speed(speed_req))) + return -EINVAL; + + if (bus && bus->cur_bus_speed == speed_req) + return 0; + + target_speed = pcie_bwctrl_select_speed(port, speed_req); + + scoped_guard(rwsem_read, &pcie_bwctrl_remove_rwsem) { + struct pcie_bwctrl_data *data = port->link_bwctrl; + + /* + * port->link_bwctrl is NULL during initial scan when called + * e.g. from the Target Speed quirk. + */ + if (data) + mutex_lock(&data->set_speed_mutex); + + ret = pcie_bwctrl_change_speed(port, target_speed, use_lt); + + if (data) + mutex_unlock(&data->set_speed_mutex); + } + + /* + * Despite setting higher speed into the Target Link Speed, empty + * bus won't train to 5GT+ speeds. + */ + if (!ret && bus && bus->cur_bus_speed != speed_req && + !list_empty(&bus->devices)) + ret = -EAGAIN; + + return ret; +} +EXPORT_SYMBOL_GPL(pcie_set_target_speed); + static void pcie_bwnotif_enable(struct pcie_device *srv) { struct pcie_bwctrl_data *data = get_service_data(srv); @@ -135,6 +288,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) if (!data) return -ENOMEM; + mutex_init(&data->set_speed_mutex); set_service_data(srv, data); ret = request_threaded_irq(srv->irq, NULL, pcie_bwnotif_irq_thread, @@ -142,8 +296,10 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) if (ret) return ret; - port->link_bwctrl = no_free_ptr(data); - pcie_bwnotif_enable(srv); + scoped_guard(rwsem_write, &pcie_bwctrl_remove_rwsem) { + port->link_bwctrl = no_free_ptr(data); + pcie_bwnotif_enable(srv); + } pci_dbg(port, "enabled with IRQ %d\n", srv->irq); @@ -159,6 +315,7 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) srv->port->link_bwctrl = NULL; free_irq(srv->irq, srv); + mutex_destroy(&data->set_speed_mutex); kfree(data); } diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index e6d502dca939..dcf1c86a5488 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -113,16 +113,11 @@ int pcie_failed_link_retrain(struct pci_dev *dev) pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n"); - lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS; - lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT; - pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2); - - ret = pcie_retrain_link(dev, false); + ret = pcie_set_target_speed(dev, PCIE_SPEED_2_5GT, false); if (ret) { pci_info(dev, "retraining failed\n"); - pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, - oldlnkctl2); - pcie_retrain_link(dev, true); + pcie_set_target_speed(dev, PCIE_LNKCTL2_TLS2SPEED(oldlnkctl2), + true); return ret; } @@ -136,11 +131,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev) pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n"); pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap); - lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS; - lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS; - pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2); - - ret = pcie_retrain_link(dev, false); + ret = pcie_set_target_speed(dev, PCIE_LNKCAP_SLS2SPEED(lnkcap), false); if (ret) { pci_info(dev, "retraining failed\n"); return ret; diff --git a/include/linux/pci.h b/include/linux/pci.h index 5f9de226be13..b5ce9513b06f 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1798,9 +1798,19 @@ static inline int pci_irqd_intx_xlate(struct irq_domain *d, #ifdef CONFIG_PCIEPORTBUS extern bool pcie_ports_disabled; extern bool pcie_ports_native; + +int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req, + bool use_lt); #else #define pcie_ports_disabled true #define pcie_ports_native false + +static inline int pcie_set_target_speed(struct pci_dev *port, + enum pci_bus_speed speed_req, + bool use_lt) +{ + return -EOPNOTSUPP; +} #endif #define PCIE_LINK_STATE_L0S (BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
Currently, PCIe Link Speeds are adjusted by custom code rather than in a common function provided in PCI core. PCIe bandwidth controller (bwctrl) introduces an in-kernel API to set PCIe Link Speed. Convert Target Speed quirk to use the new API. The Target Speed quirk runs very early when bwctrl is not yet probed for a Port and can also run later when bwctrl is already setup for the Port, which requires the per port mutex (set_speed_mutex) to be only taken if the bwctrl setup is already complete. The new API is also intended to be used in an upcoming commit that adds a thermal cooling device to throttle PCIe bandwidth when thermal thresholds are reached. The PCIe bandwidth control procedure is as follows. The highest speed supported by the Port and the PCIe device which is not higher than the requested speed is selected and written into the Target Link Speed in the Link Control 2 Register. Then bandwidth controller retrains the PCIe Link. Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus to keep track PCIe Link Speed changes. While Bandwidth Notifications should also be generated when bandwidth controller alters the PCIe Link Speed, a few platforms do not deliver LMBS interrupt after Link Training as expected. Thus, after changing the Link Speed, bandwidth controller makes additional read for the Link Status Register to ensure cur_bus_speed is consistent with the new PCIe Link Speed. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> --- drivers/pci/pci.h | 20 +++++ drivers/pci/pcie/bwctrl.c | 161 +++++++++++++++++++++++++++++++++++++- drivers/pci/quirks.c | 17 +--- include/linux/pci.h | 10 +++ 4 files changed, 193 insertions(+), 15 deletions(-)