Message ID | 20110624035452.GA12806@myri.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: > There is a sizable performance boost for having the largest possible > maximum payload size on each PCI-E device. However, the maximum payload > size must be uniform on a given PCI-E fabric, and each device, bridge, > and root port can have a different max size. To find and configure the > optimal MPS settings, one must walk the fabric and determine the largest > MPS available on all the devices on the given fabric. Sorry, I haven't had a chance to try it on my setup here, will do asap. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: > There is a sizable performance boost for having the largest possible > maximum payload size on each PCI-E device. However, the maximum payload > size must be uniform on a given PCI-E fabric, and each device, bridge, > and root port can have a different max size. To find and configure the > optimal MPS settings, one must walk the fabric and determine the largest > MPS available on all the devices on the given fabric. Ok, so a few comments: - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly. If you want common code, then do a __pci_walk_bus() that takes a "self" argument (call it include_self ?) and two small wrappers for pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't needed, just use the __ variant directly) - Thinking more, I don't see the point of that pci_walk_bus_self() ... It will cause a bridge device to be called twice. If all you want is to make sure the toplevel one is called, just do that manually before you call pci_walk_bus_self() - Does pci_walk_bus() provide any guarantee in term of API as to the order in which it walks the devices ? IE. Parent first, then children ? That's how it's implemented today but are we certain that will remain ? It's not documented as such.... Your "forcemax" case as it is implemented will work only as long as this ordering is respected. - I would like a way to specify the MPS of the host bridge, it may not be the max of the RC P2P bridge (it -generally is- but I'd like a way to override that easily). - I think we need to handle MRRS at the same time. We need MRRS to be clamped by MPS, don't we ? In addition to being clamped top-down. - For MRRS we -must- have the arch provide the max of the host bridge - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that. I want it to be a function so I can make it per-platform or per host bridge even if I have to. Might be a good spot to add an argument for the bridge to return a platform maximum to clamp everything else as well. - Finally, should we make this "forcemax" the default behaviour or are we afraid we'll break some nvidia gunk ? (I doubt it will break, that would imply the two cards come up with small & different MPS in practice). Do we want to keep an API for drivers who want to clamp things between two devices who want to do peer to peer ? Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 23, 2011 at 11:49 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: >> There is a sizable performance boost for having the largest possible >> maximum payload size on each PCI-E device. However, the maximum payload >> size must be uniform on a given PCI-E fabric, and each device, bridge, >> and root port can have a different max size. To find and configure the >> optimal MPS settings, one must walk the fabric and determine the largest >> MPS available on all the devices on the given fabric. > > Ok, so a few comments: > > - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly. > If you want common code, then do a __pci_walk_bus() that takes a "self" > argument (call it include_self ?) and two small wrappers for > pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't > needed, just use the __ variant directly) > > - Thinking more, I don't see the point of that pci_walk_bus_self() ... > It will cause a bridge device to be called twice. If all you want is to > make sure the toplevel one is called, just do that manually before you > call pci_walk_bus_self() If they are nested, then there could be many layers of PCIE switches that need to be configured with the MPS. I'll look into this a bit more, but it is necessary. > - Does pci_walk_bus() provide any guarantee in term of API as to the > order in which it walks the devices ? IE. Parent first, then children ? I see no guarantee, but thought it would make you happy to see the patch doing it similar to the way the patch you sent out was doing things. > That's how it's implemented today but are we certain that will remain ? > It's not documented as such.... Your "forcemax" case as it is > implemented will work only as long as this ordering is respected. That is an excellent point. Would a comment above the function warning that changing it will break things, or should I create a simpler function, similar to what I had before, that would guarantee it? > - I would like a way to specify the MPS of the host bridge, it may > not be the max of the RC P2P bridge (it -generally is- but I'd like a > way to override that easily). I could do another boot arg for this. > - I think we need to handle MRRS at the same time. We need MRRS to be > clamped by MPS, don't we ? In addition to being clamped top-down. You are 100% correct. I was working on a follow-on patch to do it, but I'll just put it into here. > - For MRRS we -must- have the arch provide the max of the host bridge I'm not sure if this is a must, unless there is some arch requirement. IIRC, MRRS is more for bus fairness. In theory, we would want it to be uniform across all of the system to prevent bus hogging by a device with a larger MRRS. Of course, doing so will have a negative impact on performance. Perhaps we should have 2 settings: 1) performance - set the MPS to the parent MPS and the MRRS to the device MPSS 2) safety - set the MPS uniformly to the smallest on the fabric and the MRRS to the same value Both of these could be overridden by arch specified value. > - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that. > I want it to be a function so I can make it per-platform or per host > bridge even if I have to. Might be a good spot to add an argument for > the bridge to return a platform maximum to clamp everything else as > well. > > - Finally, should we make this "forcemax" the default behaviour or are > we afraid we'll break some nvidia gunk ? (I doubt it will break, that > would imply the two cards come up with small & different MPS in I'd be fine with that, as long as I can have a warning message telling them to use the more secure way of doing it if they have issues. Sound reasonable? > practice). Do we want to keep an API for drivers who want to clamp > things between two devices who want to do peer to peer ? I think that is a larger issue that I was wanting to address in this patch. I can do this as a follow on patch if there is anyone out there that is interested in having a peer-peer framework. Thanks so much for reviewing the patch! Thanks, Jon > Cheers, > Ben. > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 24, 2011 at 2:14 PM, jordan hargrave <jharg93@gmail.com> wrote: > Jon, > > I finally got around to testing this. On my system, I'm seeing this set > all devices to 128 during system boot (they are originally set to 256). > Hotplug works, obviously... but with my original code it would set the > inserted device to 256, not 128. Thanks for testing it. Sounds like I might have a bug in the code. Can you confirm that the hotplug slots are the only thing connected to the root ports? > > On Thu, Jun 23, 2011 at 10:54 PM, Jon Mason <mason@myri.com> wrote: >> >> There is a sizable performance boost for having the largest possible >> maximum payload size on each PCI-E device. However, the maximum payload >> size must be uniform on a given PCI-E fabric, and each device, bridge, >> and root port can have a different max size. To find and configure the >> optimal MPS settings, one must walk the fabric and determine the largest >> MPS available on all the devices on the given fabric. >> >> However, on some architectures (notably PPC) it is known that the >> downstream communication will never be larger than 128B. With this >> known, the MPS can be configured to be the largest value supported by >> the PCI-E switches and root port. In most cases this means that the >> devices on the fabric are not limited to the lowest common denominator. >> Use with caution, if this is not 100% true then there is the possibility >> of a bus error. >> >> To choose between the two available options, two kernel boot args have >> been added to the PCI calls. "mpsmin" will provide the former behavior, >> while "mpsmax" will perform the latter behavior. By default, the former >> behavior is used. >> >> Along with the potential performance improvement, this works around an >> issue with buggy BIOS revisions found on various whitebox motherboards >> which do not configure MPS beyond one level below the root port. >> >> There is one issue with the "mpsmin" design. For PCI-E hotplug enabled >> slots not connected directly to a PCI-E root port, there can be problems >> when hotplugging devices. This is due to the possibility of hotplugging >> a device into the fabric with a smaller MPS that the devices currently >> running have configured. Modifying the MPS on the running devices could >> cause a fatal bus error due to an incoming frame being larger than the >> newly configured MPS. To work around this, the MPS for the entire >> fabric must be set to the minimum size. Any devices hotplugged into >> this fabric will have the minimum MPS set. If the PCI hotplug slot is >> directly connected to the root port and there are not other devices on >> the fabric (which seems to be the most common case), then this is not an >> issue and MPS discovery will occur as normal. >> >> Signed-off-by: Jon Mason <mason@myri.com> >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2011-06-24 at 17:42 -0500, Jon Mason wrote: > On Thu, Jun 23, 2011 at 11:49 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: > >> There is a sizable performance boost for having the largest possible > >> maximum payload size on each PCI-E device. However, the maximum payload > >> size must be uniform on a given PCI-E fabric, and each device, bridge, > >> and root port can have a different max size. To find and configure the > >> optimal MPS settings, one must walk the fabric and determine the largest > >> MPS available on all the devices on the given fabric. > > > > Ok, so a few comments: > > > > - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly. > > If you want common code, then do a __pci_walk_bus() that takes a "self" > > argument (call it include_self ?) and two small wrappers for > > pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't > > needed, just use the __ variant directly) > > > > - Thinking more, I don't see the point of that pci_walk_bus_self() ... > > It will cause a bridge device to be called twice. If all you want is to > > make sure the toplevel one is called, just do that manually before you > > call pci_walk_bus_self() > > If they are nested, then there could be many layers of PCIE switches > that need to be configured with the MPS. I'll look into this a bit > more, but it is necessary. No, my point is that without your change it should already be calling the callback for intermediary bridges/switches. Or am I missing something ? The only "missing" case is the top level from what I can tell from reading the code. > > - Does pci_walk_bus() provide any guarantee in term of API as to the > > order in which it walks the devices ? IE. Parent first, then children ? > > I see no guarantee, but thought it would make you happy to see the > patch doing it similar to the way the patch you sent out was doing > things. > > > That's how it's implemented today but are we certain that will remain ? > > It's not documented as such.... Your "forcemax" case as it is > > implemented will work only as long as this ordering is respected. > > That is an excellent point. Would a comment above the function > warning that changing it will break things, or should I create a > simpler function, similar to what I had before, that would guarantee > it? Either way, what does Jesse prefers ? > > - I would like a way to specify the MPS of the host bridge, it may > > not be the max of the RC P2P bridge (it -generally is- but I'd like a > > way to override that easily). > > I could do another boot arg for this. Please no... not boot args. Those suck. Besides things can be different per host bridge, especially on !x86 platforms. Best is pcibios_* helpers (that take the bus as an argument) which the arch can populate for these things. > > - I think we need to handle MRRS at the same time. We need MRRS to be > > clamped by MPS, don't we ? In addition to being clamped top-down. > > You are 100% correct. I was working on a follow-on patch to do it, > but I'll just put it into here. Ok. > > - For MRRS we -must- have the arch provide the max of the host bridge > > I'm not sure if this is a must, unless there is some arch requirement. Well, RC doesn't expose it's capabilities. So you basically don't know what your PHB MRSS limit is unless the arch code tells you. On x86, you probably just rely on whatever the BIOS sets up, but a lot of other archs out there need to be able to setup the PCIe hierarchy from scratch. That's why my patch has this pcie_settings structure that is attached to a host bridge which contains the arch provided MPS and MRSS for that bridge. > IIRC, MRRS is more for bus fairness. In theory, we would want it to > be uniform across all of the system to prevent bus hogging by a device > with a larger MRRS. It's also about correctness. I've seen cases of devices not working properly due to MRSS violations. > Of course, doing so will have a negative impact > on performance. Perhaps we should have 2 settings: > 1) performance - set the MPS to the parent MPS and the MRRS to the device MPSS > 2) safety - set the MPS uniformly to the smallest on the fabric and > the MRRS to the same value > Both of these could be overridden by arch specified value. > > > - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that. > > I want it to be a function so I can make it per-platform or per host > > bridge even if I have to. Might be a good spot to add an argument for > > the bridge to return a platform maximum to clamp everything else as > > well. > > > > - Finally, should we make this "forcemax" the default behaviour or are > > we afraid we'll break some nvidia gunk ? (I doubt it will break, that > > would imply the two cards come up with small & different MPS in > > I'd be fine with that, as long as I can have a warning message telling > them to use the more secure way of doing it if they have issues. > Sound reasonable? > > > practice). Do we want to keep an API for drivers who want to clamp > > things between two devices who want to do peer to peer ? > > I think that is a larger issue that I was wanting to address in this > patch. I can do this as a follow on patch if there is anyone out > there that is interested in having a peer-peer framework. > > Thanks so much for reviewing the patch! Cheers, Ben. > Thanks, > Jon > > > Cheers, > > Ben. > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 27 Jun 2011 15:54:30 -0500 jordan hargrave <jharg93@gmail.com> wrote: > Jon, > > I finally got around to testing this. On my system, I'm seeing this set > all devices to 128 during system boot (they are originally set to 256). > Hotplug works, obviously... but with my original code it would set the > inserted device to 256, not 128. I'll keep following this, but Jon, it sounds like you don't have a final patch ready just yet? Thanks,
On Tue, Jun 28, 2011 at 11:24 AM, Jesse Barnes <jbarnes@virtuousgeek.org> wrote: > On Mon, 27 Jun 2011 15:54:30 -0500 > jordan hargrave <jharg93@gmail.com> wrote: > >> Jon, >> >> I finally got around to testing this. On my system, I'm seeing this set >> all devices to 128 during system boot (they are originally set to 256). >> Hotplug works, obviously... but with my original code it would set the >> inserted device to 256, not 128. > > I'll keep following this, but Jon, it sounds like you don't have a > final patch ready just yet? My apologies. While I am actively working on a patch to satisfy benh's comments, I do not have one ready to be pushed. Thanks, Jon > > Thanks, > -- > Jesse Barnes, Intel Open Source Technology Center > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jun 25, 2011 at 08:54:17AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2011-06-24 at 17:42 -0500, Jon Mason wrote: > > On Thu, Jun 23, 2011 at 11:49 PM, Benjamin Herrenschmidt > > <benh@kernel.crashing.org> wrote: > > > On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: > > >> There is a sizable performance boost for having the largest possible > > >> maximum payload size on each PCI-E device. However, the maximum payload > > >> size must be uniform on a given PCI-E fabric, and each device, bridge, > > >> and root port can have a different max size. To find and configure the > > >> optimal MPS settings, one must walk the fabric and determine the largest > > >> MPS available on all the devices on the given fabric. > > > > > > Ok, so a few comments: > > > > > > - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly. > > > If you want common code, then do a __pci_walk_bus() that takes a "self" > > > argument (call it include_self ?) and two small wrappers for > > > pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't > > > needed, just use the __ variant directly) > > > > > > - Thinking more, I don't see the point of that pci_walk_bus_self() ... > > > It will cause a bridge device to be called twice. If all you want is to > > > make sure the toplevel one is called, just do that manually before you > > > call pci_walk_bus_self() > > > > If they are nested, then there could be many layers of PCIE switches > > that need to be configured with the MPS. I'll look into this a bit > > more, but it is necessary. > > No, my point is that without your change it should already be calling > the callback for intermediary bridges/switches. Or am I missing > something ? The only "missing" case is the top level from what I can > tell from reading the code. > > > > - Does pci_walk_bus() provide any guarantee in term of API as to the > > > order in which it walks the devices ? IE. Parent first, then children ? > > > > I see no guarantee, but thought it would make you happy to see the > > patch doing it similar to the way the patch you sent out was doing > > things. > > > > > That's how it's implemented today but are we certain that will remain ? > > > It's not documented as such.... Your "forcemax" case as it is > > > implemented will work only as long as this ordering is respected. > > > > That is an excellent point. Would a comment above the function > > warning that changing it will break things, or should I create a > > simpler function, similar to what I had before, that would guarantee > > it? > > Either way, what does Jesse prefers ? I added a comment above the function to document it. So if it is changed, hopefully all callers would be audited and the comment would be seen. > > > - I would like a way to specify the MPS of the host bridge, it may > > > not be the max of the RC P2P bridge (it -generally is- but I'd like a > > > way to override that easily). > > > > I could do another boot arg for this. > > Please no... not boot args. Those suck. Besides things can be different > per host bridge, especially on !x86 platforms. Best is pcibios_* helpers > (that take the bus as an argument) which the arch can populate for these > things. Ok, no boot arg. It's possible to have any arch code specify the pcie_mpss prior to the buses being walk, which would work around this issue. This might be a bit hacky. If it ends up being too difficult, we can do something similar to what you suggest above. > > > > - I think we need to handle MRRS at the same time. We need MRRS to be > > > clamped by MPS, don't we ? In addition to being clamped top-down. > > > > You are 100% correct. I was working on a follow-on patch to do it, > > but I'll just put it into here. > > Ok. This is implemented in the patch I just sent out. Please look it over. Thanks, Jon > > > > - For MRRS we -must- have the arch provide the max of the host bridge > > > > I'm not sure if this is a must, unless there is some arch requirement. > > Well, RC doesn't expose it's capabilities. So you basically don't know > what your PHB MRSS limit is unless the arch code tells you. On x86, you > probably just rely on whatever the BIOS sets up, but a lot of other > archs out there need to be able to setup the PCIe hierarchy from > scratch. That's why my patch has this pcie_settings structure that is > attached to a host bridge which contains the arch provided MPS and MRSS > for that bridge. > > > IIRC, MRRS is more for bus fairness. In theory, we would want it to > > be uniform across all of the system to prevent bus hogging by a device > > with a larger MRRS. > > It's also about correctness. I've seen cases of devices not working > properly due to MRSS violations. > > > Of course, doing so will have a negative impact > > on performance. Perhaps we should have 2 settings: > > 1) performance - set the MPS to the parent MPS and the MRRS to the device MPSS > > 2) safety - set the MPS uniformly to the smallest on the fabric and > > the MRRS to the same value > > Both of these could be overridden by arch specified value. > > > > > - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that. > > > I want it to be a function so I can make it per-platform or per host > > > bridge even if I have to. Might be a good spot to add an argument for > > > the bridge to return a platform maximum to clamp everything else as > > > well. > > > > > > - Finally, should we make this "forcemax" the default behaviour or are > > > we afraid we'll break some nvidia gunk ? (I doubt it will break, that > > > would imply the two cards come up with small & different MPS in > > > > I'd be fine with that, as long as I can have a warning message telling > > them to use the more secure way of doing it if they have issues. > > Sound reasonable? > > > > > practice). Do we want to keep an API for drivers who want to clamp > > > things between two devices who want to do peer to peer ? > > > > I think that is a larger issue that I was wanting to address in this > > patch. I can do this as a follow on patch if there is anyone out > > there that is interested in having a peer-peer framework. > > > > Thanks so much for reviewing the patch! > > Cheers, > Ben. > > > Thanks, > > Jon > > > > > Cheers, > > > Ben. > > > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-29 at 00:28 -0500, Jon Mason wrote: > There is a sizable performance boost for having the largest possible > maximum payload size on each PCI-E device. However, the maximum payload > size must be uniform on a given PCI-E fabric, and each device, bridge, > and root port can have a different max size. To find and configure the > optimal MPS settings, one must walk the fabric and determine the largest > MPS available on all the devices on the given fabric. > > However, on some architectures (notably PPC) it is known that the > downstream communication will never be larger than the MRRS. ^^^^ MPS ? > With this > known, the MPS can be configured to be the largest value supported by > the PCI-E switches and root port. I would phrase it differently, but that's just me :-) Something along the lines of: << It is important to ensure that PCIe payloads sends by a device are never bigger than the MPS setting of all devices on the way to the destination. This can be achieved two ways: - A very conservative way which is to use the smallest common denominator of the entire tree below a root complex for every device below that root complex. This means for example that having a 128 bytes MPS USB controller on one leg of a switch will dramatically reduce performances of a video card or 10GE adapter on another leg of that same switch. It also means that any hierarchy supporting hotplug slots (including expresscard or thunderbolt I suppose, dbl check that) will have to be entirely clamped to 128 bytes since we cannot predict what will be plugged into those slots, and we cannot change the MPS on a "live" system. - A more optimal way, which is possible if we agree on a couple of constraints: * If we can assume that the toplevel host bridge will never generate packets larger than the smallest TLP (or if it can be controlled independently from its MPS at least) * and if we are not going to support direct PCIe <-> PCIe transfers between devices without some additional code to specifically deal with that case Then we can use an approach that basically ignores downstream requests and focuses exclusively on upstream requests. In that case, all we need to care about is that a device MPS is no larger than its parent MPS, which allows us to keep all switches/bridges to the max MPS supported by their parent and eventually the PHB. In this case, your USB controller would no longer "starve" your 10GE ethernet and your hotplug slots won't affect your global MPS. Additionally, the hotplugged devices themselves can be configured to a larger MPS up to the value configured in the hotplug bridge. >> Additionally, I would remove mention of MRRS completely from this patch, we can deal with it (provided we decide something is to be done, tho I suppose we should at -least- fix locking on access to the register), in a separate patch or set of patches. Now, about the patch itself :-) ... > Signed-off-by: Jon Mason <mason@myri.com> > --- > drivers/pci/hotplug/pcihp_slot.c | 58 ++++------------ > drivers/pci/pci.c | 66 ++++++++++++++++++ > drivers/pci/probe.c | 138 ++++++++++++++++++++++++++++++++++++++ > include/linux/pci.h | 7 ++- > 4 files changed, 224 insertions(+), 45 deletions(-) > > diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c > index 749fdf0..ee4ff14 100644 > --- a/drivers/pci/hotplug/pcihp_slot.c > +++ b/drivers/pci/hotplug/pcihp_slot.c > @@ -26,6 +26,19 @@ > #include <linux/pci.h> > #include <linux/pci_hotplug.h> > > +extern void pcie_set_maxpayloadsize(struct pci_bus *bus); Put that in a header file. The name is a bit confusing too, I would call it something like pcie_bus_fixup_mps() or something like that. As it is, it sounds like just an accessor to tweak the value on that bus. > +static void pcie_determine_mps(struct pci_dev *dev) > +{ > + struct pci_bus *bus = dev->bus; > + > + while (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT && > + !pci_is_root_bus(bus)) > + bus = bus->parent; > + > + pcie_set_maxpayloadsize(bus); > +} So wait a minute, let me make sure I understand what you are doing here :-) You call this from pci_configure_slot(), which we don't seem to be using on pseries so I'm not 100% familiar with when that ends up being called, is this when a device is plugged ? This looks like it will run the MPS fixup pass over the entire hierarchy under which the slot lives right ? (ie going back up the tree), which sounds dangerous if/when this is called after boot and devices might already be active, or am I missing a piece of the puzzle ? > static struct hpp_type0 pci_default_type0 = { > .revision = 1, > .cache_line_size = 8, > @@ -158,47 +171,6 @@ static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp) > */ > } > > -/* Program PCIE MaxPayload setting on device: ensure parent maxpayload <= device */ It is interesting to see how busted that comment was in retrospect ;-) > -static int pci_set_payload(struct pci_dev *dev) > -{ > - int pos, ppos; > - u16 pctl, psz; > - u16 dctl, dsz, dcap, dmax; > - struct pci_dev *parent; > - > - parent = dev->bus->self; > - pos = pci_find_capability(dev, PCI_CAP_ID_EXP); > - if (!pos) > - return 0; > - > - /* Read Device MaxPayload capability and setting */ > - pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &dctl); > - pci_read_config_word(dev, pos + PCI_EXP_DEVCAP, &dcap); > - dsz = (dctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; > - dmax = (dcap & PCI_EXP_DEVCAP_PAYLOAD); > - > - /* Read Parent MaxPayload setting */ > - ppos = pci_find_capability(parent, PCI_CAP_ID_EXP); > - if (!ppos) > - return 0; > - pci_read_config_word(parent, ppos + PCI_EXP_DEVCTL, &pctl); > - psz = (pctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; > - > - /* If parent payload > device max payload -> error > - * If parent payload > device payload -> set speed > - * If parent payload <= device payload -> do nothing > - */ And this one too, it would be useful to know the rationale of whoever came up with that code in the first place ... oh well, let's not dwelve on it now... > - if (psz > dmax) > - return -1; > - else if (psz > dsz) { > - dev_info(&dev->dev, "Setting MaxPayload to %d\n", 128 << psz); > - pci_write_config_word(dev, pos + PCI_EXP_DEVCTL, > - (dctl & ~PCI_EXP_DEVCTL_PAYLOAD) + > - (psz << 5)); > - } > - return 0; > -} > - > void pci_configure_slot(struct pci_dev *dev) > { > struct pci_dev *cdev; > @@ -210,9 +182,7 @@ void pci_configure_slot(struct pci_dev *dev) > (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI))) > return; > > - ret = pci_set_payload(dev); > - if (ret) > - dev_warn(&dev->dev, "could not set device max payload\n"); > + pcie_determine_mps(dev); > > memset(&hpp, 0, sizeof(hpp)); > ret = pci_get_hp_params(dev, &hpp); > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 56098b3..bc25728 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -77,6 +77,8 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE; > unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE; > unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE; > > +bool pcie_mps_forcemax = true; > + So I would really still like this to be an inline function taking the bus as an argument :-) But I won't fight for it... however, the name still really sucks. I think we might waste a few bits and instead call it something like pcie_mps_setting_method or something like that with an enum of methods don't you think ? Or maybe pcie_mps_conservative ? (I would say anal ? :-) > /* > * The default CLS is used if arch didn't set CLS explicitly and not > * all pci devices agree on the same value. Arch can override either > @@ -3218,6 +3220,66 @@ out: > EXPORT_SYMBOL(pcie_set_readrq); > > /** > + * pcie_set_mps - set PCI Express maximum payload size > + * @dev: PCI device to query > + * > + * Returns maximum payload size in bytes > + * or appropriate error value. > + */ > +int pcie_get_mps(struct pci_dev *dev) > +{ > + int ret, cap; > + u16 ctl; > + > + cap = pci_pcie_cap(dev); > + if (!cap) > + return -EINVAL; > + > + ret = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); > + if (!ret) > + ret = 128 << ((ctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5); > + > + return ret; > +} > + > +/** > + * pcie_set_mps - set PCI Express maximum payload size > + * @dev: PCI device to query > + * @rq: maximum payload size in bytes > + * valid values are 128, 256, 512, 1024, 2048, 4096 > + * > + * If possible sets maximum payload size > + */ > +int pcie_set_mps(struct pci_dev *dev, int mps) > +{ > + int cap, err = -EINVAL; > + u16 ctl, v; > + > + if (mps < 128 || mps > 4096 || !is_power_of_2(mps)) > + goto out; > + > + v = (ffs(mps) - 8) << 5; > + if (v > dev->pcie_mpss) > + goto out; > + > + cap = pci_pcie_cap(dev); > + if (!cap) > + goto out; > + > + err = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); > + if (err) > + goto out; > + > + if ((ctl & PCI_EXP_DEVCTL_PAYLOAD) != v) { > + ctl &= ~PCI_EXP_DEVCTL_PAYLOAD; > + ctl |= v; > + err = pci_write_config_word(dev, cap + PCI_EXP_DEVCTL, ctl); > + } > +out: > + return err; > +} > + > +/** > * pci_select_bars - Make BAR mask from the type of resource > * @dev: the PCI device for which BAR mask is made > * @flags: resource type mask to be selected > @@ -3498,6 +3560,10 @@ static int __init pci_setup(char *str) > pci_hotplug_io_size = memparse(str + 9, &str); > } else if (!strncmp(str, "hpmemsize=", 10)) { > pci_hotplug_mem_size = memparse(str + 10, &str); > + } else if (!strncmp(str, "mpsmin", 6)) { > + pcie_mps_forcemax = false; > + } else if (!strncmp(str, "mpsmax", 6)) { > + pcie_mps_forcemax = true; > } else { I'm even less fan of these, but let's Jesse make a call this time around :-) > printk(KERN_ERR "PCI: Unknown option `%s'\n", > str); > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index 48849ff..d32c62e 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -860,6 +860,8 @@ void set_pcie_port_type(struct pci_dev *pdev) > pdev->pcie_cap = pos; > pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, ®16); > pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; > + pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, ®16); > + pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD; > } > > void set_pcie_hotplug_bridge(struct pci_dev *pdev) > @@ -1327,6 +1329,137 @@ int pci_scan_slot(struct pci_bus *bus, int devfn) > return nr; > } > > +static int pcie_find_smpss(struct pci_dev *dev, void *data) > +{ > + u8 *smpss = data; > + > + if (!pci_is_pcie(dev)) > + return 0; > + > + /* For PCIE hotplug enabled slots not connected directly to a > + * PCI-E root port, there can be problems when hotplugging > + * devices. This is due to the possibility of hotplugging a > + * device into the fabric with a smaller MPS that the devices > + * currently running have configured. Modifying the MPS on the > + * running devices could cause a fatal bus error due to an > + * incoming frame being larger than the newly configured MPS. > + * To work around this, the MPS for the entire fabric must be > + * set to the minimum size. Any devices hotplugged into this > + * fabric will have the minimum MPS set. If the PCI hotplug > + * slot is directly connected to the root port and there are not > + * other devices on the fabric (which seems to be the most > + * common case), then this is not an issue and MPS discovery > + * will occur as normal. > + */ > + if (dev->is_hotplug_bridge && (!list_is_singular(&dev->bus->devices) || > + dev->bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT)) > + *smpss = 0; Not sure whether the list_is_singular test means anything here... What are you trying to do ? Ie, the hotplug bridge could be the sole child of it's parent but still have cousins via some other bridge higher up the tree. > + if (*smpss > dev->pcie_mpss) > + *smpss = dev->pcie_mpss; > + > + return 0; > +} > + > +static int pcie_write_mps(struct pci_dev *dev, void *data) > +{ > + int rc, mrrs, mps, dev_mpss; > + > + if (!pci_is_pcie(dev)) > + return 0; > + > + mps = 128 << *(u8 *)data; > + dev_mpss = 128 << dev->pcie_mpss; > + > + if (pcie_mps_forcemax) { > + if (dev->bus->self) { > + dev_dbg(&dev->bus->dev, "Bus MPSS %d\n", > + 128 << dev->bus->self->pcie_mpss); > + > + /* For "MPS Force Max", the assumption is made that > + * downstream communication will never be larger than > + * the MRRS. So, the MPS only needs to be configured > + * for the upstream communication. This being the case, > + * walk from the top down and set the MPS of the child > + * to that of the parent bus. > + */ > + mps = 128 << dev->bus->self->pcie_mpss; > + if (mps > dev_mpss) > + dev_warn(&dev->dev, "MPS configured higher than" > + " maximum supported by the device. If" > + " a bus issue occurs, try running with" > + " pci=mpsmin.\n"); > + } > + > + /* For Max performance, the MRRS must be set to the largest > + * supported value. However, it cannot be configured larger > + * than the MPS the device or the bus can support. This assumes > + * that the largest MRRS available on the device cannot be > + * smaller than the device MPSS. > + */ > + mrrs = mps < dev_mpss ? mps : dev_mpss; > + > + dev->pcie_mpss = ffs(mps) - 8; > + } else > + /* In the "safe" case, configure the MRRS for fairness on the > + * bus by making all devices have the same size > + */ > + mrrs = mps; > + > + dev_info(&dev->dev, "Setting MRRS to %d\n", mrrs); > + rc = pcie_set_readrq(dev, mrrs); > + if (rc) > + dev_err(&dev->dev, "Failed attempting to set the MRRS\n"); > + > + if (mrrs != pcie_get_readrq(dev)) { > + dev_err(&dev->dev, "MRRS is not what was written. Falling back" > + " to the minimum value.\n"); > + > + rc = pcie_set_readrq(dev, 128); > + if (rc) > + dev_err(&dev->dev, > + "Failed attempting to set the MRRS\n"); > + } > + > + dev_info(&dev->dev, "Setting MPS to %d\n", mps); > + rc = pcie_set_mps(dev, mps); > + if (rc) > + dev_err(&dev->dev, "Failed attempting to set the MPS\n"); > + > + dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", > + pcie_get_mps(dev), dev_mpss, pcie_get_readrq(dev)); > + > + return 0; > +} > + > +/* pcie_set_maxpayloadsize requires that pci_walk_bus work in a top-down, > + * parents then children fashion. If this changes, then this code will not > + * work as designed. > + */ > +void pcie_set_maxpayloadsize(struct pci_bus *bus) > +{ See my earlier comment about the name of that function. > + u8 smpss; > + > + if (!bus->self) > + return; So if I call this function on my root bus which has no ->self it won't do anything. In fact you call it in scan_child_busses, so you'll basically going to scan over and over again, re-adjusting as you go. Should we instead do it once after the whole hierarchy has been scanned ? (IE. in pci_scan_bus_parented ?) That means the call would have to be done directly by arch like powerpc who don't call this but call pci_create_bus() and pci_scan_child_bus() directly but I'm fine with that. > + if (!pci_is_pcie(bus->self)) > + return; > + > + if (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT) > + return; > + > + smpss = bus->self->pcie_mpss; > + > + if (!pcie_mps_forcemax) { > + pcie_find_smpss(bus->self, &smpss); > + pci_walk_bus(bus, pcie_find_smpss, &smpss); > + } > + > + pcie_write_mps(bus->self, &smpss); > + pci_walk_bus(bus, pcie_write_mps, &smpss); > +} > + > unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) > { > unsigned int devfn, pass, max = bus->secondary; > @@ -1359,6 +1492,11 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) > max = pci_scan_bridge(bus, dev, max, pass); > } > > + /* After the bus has been walked and all devices discovered, determine > + * the MPS of the fabric and set it. > + */ > + pcie_set_maxpayloadsize(bus); > + > /* > * We've scanned the bus and so we know all about what's on > * the other side of any bridges that may be on this bus plus > diff --git a/include/linux/pci.h b/include/linux/pci.h > index c446b5c..9b35724 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -251,7 +251,8 @@ struct pci_dev { > u8 revision; /* PCI revision, low byte of class word */ > u8 hdr_type; /* PCI header type (`multi' flag masked out) */ > u8 pcie_cap; /* PCI-E capability offset */ > - u8 pcie_type; /* PCI-E device/port type */ > + u8 pcie_type:4; /* PCI-E device/port type */ > + u8 pcie_mpss:3; /* PCI-E Max Payload Size Supported */ > u8 rom_base_reg; /* which config register controls the ROM */ > u8 pin; /* which interrupt pin this device uses */ > > @@ -617,6 +618,8 @@ struct pci_driver { > /* these external functions are only available when PCI support is enabled */ > #ifdef CONFIG_PCI > > +extern bool pcie_mps_forcemax; > + > extern struct bus_type pci_bus_type; > > /* Do NOT directly access these two variables, unless you are arch specific pci > @@ -796,6 +799,8 @@ int pcix_get_mmrbc(struct pci_dev *dev); > int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); > int pcie_get_readrq(struct pci_dev *dev); > int pcie_set_readrq(struct pci_dev *dev, int rq); > +int pcie_get_mps(struct pci_dev *dev); > +int pcie_set_mps(struct pci_dev *dev, int mps); > int __pci_reset_function(struct pci_dev *dev); > int pci_reset_function(struct pci_dev *dev); > void pci_update_resource(struct pci_dev *dev, int resno); Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-29 at 00:36 -0500, Jon Mason wrote: > > > > - I would like a way to specify the MPS of the host bridge, it may > > > > not be the max of the RC P2P bridge (it -generally is- but I'd like a > > > > way to override that easily). > > > > > > I could do another boot arg for this. > > > > Please no... not boot args. Those suck. Besides things can be different > > per host bridge, especially on !x86 platforms. Best is pcibios_* helpers > > (that take the bus as an argument) which the arch can populate for these > > things. > > Ok, no boot arg. Well, I suspect the x86 crowd will probably still want one :-) IE. The bootarg is going to come handy to diagnose problems if/when this patch causes some. It just think we should come up with a reasonable default (which should be either the "relaxed" way imho or arch set because I want powerpc to default to that) and a boot arg to override the arch. Forget about the helper I just wrote about, it can be global, I think your idea of me using the RC bridge to "fixup" the values there is good (as long as there's a quirk I can use to do so between you reading it from the device and later using it to fixup the devices). > It's possible to have any arch code specify the > pcie_mpss prior to the buses being walk, which would work around this > issue. Correct, we could do that. > This might be a bit hacky. If it ends up being too difficult, > we can do something similar to what you suggest above. No I think having the arch manipulate the RC from a quirk if it wants to is fine. The main question is which of the "conservative" vs. "relaxed" mode shall be the default. > > > > - I think we need to handle MRRS at the same time. We need MRRS to be > > > > clamped by MPS, don't we ? In addition to being clamped top-down. > > > > > > You are 100% correct. I was working on a follow-on patch to do it, > > > but I'll just put it into here. > > > > Ok. > > This is implemented in the patch I just sent out. Please look it > over. Well, our IRC discussion seem to have gone the other way :-) It looks like we are better off leaving MRRS alone with this patch and tackling it with a separate one. I tend personally to prefer making sure MRRS <= MPS as a sane "default" at boot time, and let the driver do whatever it wants (or even expose it via sysfs). But we shall make it a separate patch at least for the sake of bisectability in case something in our reasoning is wrong. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 29, 2011 at 1:07 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2011-06-29 at 00:28 -0500, Jon Mason wrote: >> There is a sizable performance boost for having the largest possible >> maximum payload size on each PCI-E device. However, the maximum payload >> size must be uniform on a given PCI-E fabric, and each device, bridge, >> and root port can have a different max size. To find and configure the >> optimal MPS settings, one must walk the fabric and determine the largest >> MPS available on all the devices on the given fabric. >> >> However, on some architectures (notably PPC) it is known that the >> downstream communication will never be larger than the MRRS. > ^^^^ MPS ? I was under the impression that PPC would never perform a DMA read larger than cache line (which I am assuming is less than 128B). Assuming that other arches can do larger DMA reads than their own cache lines, shouldn't the downstream size be bounded by something (if the device has a smaller MPS than the parent bridges)? If so, I believe setting the MRRS should bound the TLP to the small enough size. If this is not true, then I believe there is a potential hole. Without the above case, the MRRS need not be touched in this patch. Let me know if you want to rip it out. RE locking in MRRS, we could use the device lock there (similar to how it is being used in pci_walk_bus). > >> With this >> known, the MPS can be configured to be the largest value supported by >> the PCI-E switches and root port. > > I would phrase it differently, but that's just me :-) Something along > the lines of: > > << > It is important to ensure that PCIe payloads sends by a device are never > bigger than the MPS setting of all devices on the way to the > destination. > > This can be achieved two ways: > > - A very conservative way which is to use the smallest common > denominator of the entire tree below a root complex for every device > below that root complex. > > This means for example that having a 128 bytes MPS USB controller on one > leg of a switch will dramatically reduce performances of a video card or > 10GE adapter on another leg of that same switch. > > It also means that any hierarchy supporting hotplug slots (including > expresscard or thunderbolt I suppose, dbl check that) will have to be > entirely clamped to 128 bytes since we cannot predict what will be > plugged into those slots, and we cannot change the MPS on a "live" > system. > > - A more optimal way, which is possible if we agree on a couple of > constraints: > > * If we can assume that the toplevel host bridge will never generate > packets larger than the smallest TLP (or if it can be controlled > independently from its MPS at least) > * and if we are not going to support direct PCIe <-> PCIe transfers > between devices without some additional code to specifically deal with > that case > > Then we can use an approach that basically ignores downstream requests > and focuses exclusively on upstream requests. In that case, all we need > to care about is that a device MPS is no larger than its parent MPS, > which allows us to keep all switches/bridges to the max MPS supported by > their parent and eventually the PHB. > > In this case, your USB controller would no longer "starve" your 10GE > ethernet and your hotplug slots won't affect your global MPS. > Additionally, the hotplugged devices themselves can be configured to a > larger MPS up to the value configured in the hotplug bridge. Cool, I'll make liberal use of this in the next patch :) >>> > > Additionally, I would remove mention of MRRS completely from this patch, > we can deal with it (provided we decide something is to be done, tho I > suppose we should at -least- fix locking on access to the register), in > a separate patch or set of patches. > > Now, about the patch itself :-) ... > >> Signed-off-by: Jon Mason <mason@myri.com> >> --- >> drivers/pci/hotplug/pcihp_slot.c | 58 ++++------------ >> drivers/pci/pci.c | 66 ++++++++++++++++++ >> drivers/pci/probe.c | 138 ++++++++++++++++++++++++++++++++++++++ >> include/linux/pci.h | 7 ++- >> 4 files changed, 224 insertions(+), 45 deletions(-) >> >> diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c >> index 749fdf0..ee4ff14 100644 >> --- a/drivers/pci/hotplug/pcihp_slot.c >> +++ b/drivers/pci/hotplug/pcihp_slot.c >> @@ -26,6 +26,19 @@ >> #include <linux/pci.h> >> #include <linux/pci_hotplug.h> >> >> +extern void pcie_set_maxpayloadsize(struct pci_bus *bus); > > Put that in a header file. The name is a bit confusing too, I would call > it something like pcie_bus_fixup_mps() or something like that. As it is, > it sounds like just an accessor to tweak the value on that bus. fixup implies there is something busted, which may or may not be the case. How about pcie_bus_configure_mps? >> +static void pcie_determine_mps(struct pci_dev *dev) >> +{ >> + struct pci_bus *bus = dev->bus; >> + >> + while (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT && >> + !pci_is_root_bus(bus)) >> + bus = bus->parent; >> + >> + pcie_set_maxpayloadsize(bus); >> +} > > So wait a minute, let me make sure I understand what you are doing here :-) > > You call this from pci_configure_slot(), which we don't seem to be using > on pseries so I'm not 100% familiar with when that ends up being called, > is this when a device is plugged ? This is the use case of the original MPS fixup patch. I believe you are correct. > This looks like it will run the MPS fixup pass over the entire hierarchy > under which the slot lives right ? (ie going back up the tree), which > sounds dangerous if/when this is called after boot and devices might > already be active, or am I missing a piece of the puzzle ? True. It would be better to simply do it from the parent bus of the device. >> static struct hpp_type0 pci_default_type0 = { >> .revision = 1, >> .cache_line_size = 8, >> @@ -158,47 +171,6 @@ static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp) >> */ >> } >> >> -/* Program PCIE MaxPayload setting on device: ensure parent maxpayload <= device */ > > It is interesting to see how busted that comment was in retrospect ;-) > >> -static int pci_set_payload(struct pci_dev *dev) >> -{ >> - int pos, ppos; >> - u16 pctl, psz; >> - u16 dctl, dsz, dcap, dmax; >> - struct pci_dev *parent; >> - >> - parent = dev->bus->self; >> - pos = pci_find_capability(dev, PCI_CAP_ID_EXP); >> - if (!pos) >> - return 0; >> - >> - /* Read Device MaxPayload capability and setting */ >> - pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &dctl); >> - pci_read_config_word(dev, pos + PCI_EXP_DEVCAP, &dcap); >> - dsz = (dctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; >> - dmax = (dcap & PCI_EXP_DEVCAP_PAYLOAD); >> - >> - /* Read Parent MaxPayload setting */ >> - ppos = pci_find_capability(parent, PCI_CAP_ID_EXP); >> - if (!ppos) >> - return 0; >> - pci_read_config_word(parent, ppos + PCI_EXP_DEVCTL, &pctl); >> - psz = (pctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; >> - >> - /* If parent payload > device max payload -> error >> - * If parent payload > device payload -> set speed >> - * If parent payload <= device payload -> do nothing >> - */ > > And this one too, it would be useful to know the rationale of whoever > came up with that code in the first place ... oh well, let's not dwelve > on it now... I believe Jordan (who has been Cc'ed on these e-mails) was the originator of this code. Perhaps he can shed some light. > >> - if (psz > dmax) >> - return -1; >> - else if (psz > dsz) { >> - dev_info(&dev->dev, "Setting MaxPayload to %d\n", 128 << psz); >> - pci_write_config_word(dev, pos + PCI_EXP_DEVCTL, >> - (dctl & ~PCI_EXP_DEVCTL_PAYLOAD) + >> - (psz << 5)); >> - } >> - return 0; >> -} >> - >> void pci_configure_slot(struct pci_dev *dev) >> { >> struct pci_dev *cdev; >> @@ -210,9 +182,7 @@ void pci_configure_slot(struct pci_dev *dev) >> (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI))) >> return; >> >> - ret = pci_set_payload(dev); >> - if (ret) >> - dev_warn(&dev->dev, "could not set device max payload\n"); >> + pcie_determine_mps(dev); >> >> memset(&hpp, 0, sizeof(hpp)); >> ret = pci_get_hp_params(dev, &hpp); >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index 56098b3..bc25728 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -77,6 +77,8 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE; >> unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE; >> unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE; >> >> +bool pcie_mps_forcemax = true; >> + > > So I would really still like this to be an inline function taking the > bus as an argument :-) But I won't fight for it... however, the name > still really sucks. My original name was pcie_mps_reckless. I still like it better ;-) > I think we might waste a few bits and instead call it something like > pcie_mps_setting_method or something like that with an enum of methods > don't you think ? > > Or maybe pcie_mps_conservative ? (I would say anal ? :-) pcie_mps_cya? We could call it pcie_mps_configuration, with the types being "performance" and "safety". If you want to try and make this trinary (performance, safety, peer-peer), then an enum makes sense. Since we don't care about peer-peer, then we can opt-out of that one and make it boolean. > >> /* >> * The default CLS is used if arch didn't set CLS explicitly and not >> * all pci devices agree on the same value. Arch can override either >> @@ -3218,6 +3220,66 @@ out: >> EXPORT_SYMBOL(pcie_set_readrq); >> >> /** >> + * pcie_set_mps - set PCI Express maximum payload size >> + * @dev: PCI device to query >> + * >> + * Returns maximum payload size in bytes >> + * or appropriate error value. >> + */ >> +int pcie_get_mps(struct pci_dev *dev) >> +{ >> + int ret, cap; >> + u16 ctl; >> + >> + cap = pci_pcie_cap(dev); >> + if (!cap) >> + return -EINVAL; >> + >> + ret = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); >> + if (!ret) >> + ret = 128 << ((ctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5); >> + >> + return ret; >> +} >> + >> +/** >> + * pcie_set_mps - set PCI Express maximum payload size >> + * @dev: PCI device to query >> + * @rq: maximum payload size in bytes >> + * valid values are 128, 256, 512, 1024, 2048, 4096 >> + * >> + * If possible sets maximum payload size >> + */ >> +int pcie_set_mps(struct pci_dev *dev, int mps) >> +{ >> + int cap, err = -EINVAL; >> + u16 ctl, v; >> + >> + if (mps < 128 || mps > 4096 || !is_power_of_2(mps)) >> + goto out; >> + >> + v = (ffs(mps) - 8) << 5; >> + if (v > dev->pcie_mpss) >> + goto out; >> + >> + cap = pci_pcie_cap(dev); >> + if (!cap) >> + goto out; >> + >> + err = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); >> + if (err) >> + goto out; >> + >> + if ((ctl & PCI_EXP_DEVCTL_PAYLOAD) != v) { >> + ctl &= ~PCI_EXP_DEVCTL_PAYLOAD; >> + ctl |= v; >> + err = pci_write_config_word(dev, cap + PCI_EXP_DEVCTL, ctl); >> + } >> +out: >> + return err; >> +} >> + >> +/** >> * pci_select_bars - Make BAR mask from the type of resource >> * @dev: the PCI device for which BAR mask is made >> * @flags: resource type mask to be selected >> @@ -3498,6 +3560,10 @@ static int __init pci_setup(char *str) >> pci_hotplug_io_size = memparse(str + 9, &str); >> } else if (!strncmp(str, "hpmemsize=", 10)) { >> pci_hotplug_mem_size = memparse(str + 10, &str); >> + } else if (!strncmp(str, "mpsmin", 6)) { >> + pcie_mps_forcemax = false; >> + } else if (!strncmp(str, "mpsmax", 6)) { >> + pcie_mps_forcemax = true; >> } else { > > I'm even less fan of these, but let's Jesse make a call this time around :-) Fair enough. Jeese, suggestions? > >> printk(KERN_ERR "PCI: Unknown option `%s'\n", >> str); >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c >> index 48849ff..d32c62e 100644 >> --- a/drivers/pci/probe.c >> +++ b/drivers/pci/probe.c >> @@ -860,6 +860,8 @@ void set_pcie_port_type(struct pci_dev *pdev) >> pdev->pcie_cap = pos; >> pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, ®16); >> pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; >> + pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, ®16); >> + pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD; >> } >> >> void set_pcie_hotplug_bridge(struct pci_dev *pdev) >> @@ -1327,6 +1329,137 @@ int pci_scan_slot(struct pci_bus *bus, int devfn) >> return nr; >> } >> >> +static int pcie_find_smpss(struct pci_dev *dev, void *data) >> +{ >> + u8 *smpss = data; >> + >> + if (!pci_is_pcie(dev)) >> + return 0; >> + >> + /* For PCIE hotplug enabled slots not connected directly to a >> + * PCI-E root port, there can be problems when hotplugging >> + * devices. This is due to the possibility of hotplugging a >> + * device into the fabric with a smaller MPS that the devices >> + * currently running have configured. Modifying the MPS on the >> + * running devices could cause a fatal bus error due to an >> + * incoming frame being larger than the newly configured MPS. >> + * To work around this, the MPS for the entire fabric must be >> + * set to the minimum size. Any devices hotplugged into this >> + * fabric will have the minimum MPS set. If the PCI hotplug >> + * slot is directly connected to the root port and there are not >> + * other devices on the fabric (which seems to be the most >> + * common case), then this is not an issue and MPS discovery >> + * will occur as normal. >> + */ >> + if (dev->is_hotplug_bridge && (!list_is_singular(&dev->bus->devices) || >> + dev->bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT)) >> + *smpss = 0; > > Not sure whether the list_is_singular test means anything here... What are you > trying to do ? Ie, the hotplug bridge could be the sole child of it's parent > but still have cousins via some other bridge higher up the tree. I am trying to allow a slot to be nested more than one layer, yet still be the only device on the bus and thus hotpluggable with a real MPS setting. If you think that it will never be nested, then I can make the check just be if not directly under the root port. > >> + if (*smpss > dev->pcie_mpss) >> + *smpss = dev->pcie_mpss; >> + >> + return 0; >> +} >> + >> +static int pcie_write_mps(struct pci_dev *dev, void *data) >> +{ >> + int rc, mrrs, mps, dev_mpss; >> + >> + if (!pci_is_pcie(dev)) >> + return 0; >> + >> + mps = 128 << *(u8 *)data; >> + dev_mpss = 128 << dev->pcie_mpss; >> + >> + if (pcie_mps_forcemax) { >> + if (dev->bus->self) { >> + dev_dbg(&dev->bus->dev, "Bus MPSS %d\n", >> + 128 << dev->bus->self->pcie_mpss); >> + >> + /* For "MPS Force Max", the assumption is made that >> + * downstream communication will never be larger than >> + * the MRRS. So, the MPS only needs to be configured >> + * for the upstream communication. This being the case, >> + * walk from the top down and set the MPS of the child >> + * to that of the parent bus. >> + */ >> + mps = 128 << dev->bus->self->pcie_mpss; >> + if (mps > dev_mpss) >> + dev_warn(&dev->dev, "MPS configured higher than" >> + " maximum supported by the device. If" >> + " a bus issue occurs, try running with" >> + " pci=mpsmin.\n"); >> + } >> + >> + /* For Max performance, the MRRS must be set to the largest >> + * supported value. However, it cannot be configured larger >> + * than the MPS the device or the bus can support. This assumes >> + * that the largest MRRS available on the device cannot be >> + * smaller than the device MPSS. >> + */ >> + mrrs = mps < dev_mpss ? mps : dev_mpss; >> + >> + dev->pcie_mpss = ffs(mps) - 8; >> + } else >> + /* In the "safe" case, configure the MRRS for fairness on the >> + * bus by making all devices have the same size >> + */ >> + mrrs = mps; >> + >> + dev_info(&dev->dev, "Setting MRRS to %d\n", mrrs); >> + rc = pcie_set_readrq(dev, mrrs); >> + if (rc) >> + dev_err(&dev->dev, "Failed attempting to set the MRRS\n"); >> + >> + if (mrrs != pcie_get_readrq(dev)) { >> + dev_err(&dev->dev, "MRRS is not what was written. Falling back" >> + " to the minimum value.\n"); >> + >> + rc = pcie_set_readrq(dev, 128); >> + if (rc) >> + dev_err(&dev->dev, >> + "Failed attempting to set the MRRS\n"); >> + } >> + >> + dev_info(&dev->dev, "Setting MPS to %d\n", mps); >> + rc = pcie_set_mps(dev, mps); >> + if (rc) >> + dev_err(&dev->dev, "Failed attempting to set the MPS\n"); >> + >> + dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", >> + pcie_get_mps(dev), dev_mpss, pcie_get_readrq(dev)); >> + >> + return 0; >> +} >> + >> +/* pcie_set_maxpayloadsize requires that pci_walk_bus work in a top-down, >> + * parents then children fashion. If this changes, then this code will not >> + * work as designed. >> + */ >> +void pcie_set_maxpayloadsize(struct pci_bus *bus) >> +{ > > See my earlier comment about the name of that function. > >> + u8 smpss; >> + >> + if (!bus->self) >> + return; > > So if I call this function on my root bus which has no ->self it won't > do anything. In fact you call it in scan_child_busses, so you'll basically > going to scan over and over again, re-adjusting as you go. No, the other checks below only have it scanned if it is a pcie root port. > Should we instead do it once after the whole hierarchy has been scanned ? > (IE. in pci_scan_bus_parented ?) > > That means the call would have to be done directly by arch like powerpc who > don't call this but call pci_create_bus() and pci_scan_child_bus() directly > but I'm fine with that. I was trying to make it be called in all archs, thus making the fixup generic. It probably should be in the PHB code in ppc and the ACPI code in x86/ia64. If we are move it there, then we should just conform to your parameter pass in suggestion too. > >> + if (!pci_is_pcie(bus->self)) >> + return; >> + >> + if (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT) >> + return; >> + >> + smpss = bus->self->pcie_mpss; >> + >> + if (!pcie_mps_forcemax) { >> + pcie_find_smpss(bus->self, &smpss); >> + pci_walk_bus(bus, pcie_find_smpss, &smpss); >> + } >> + >> + pcie_write_mps(bus->self, &smpss); >> + pci_walk_bus(bus, pcie_write_mps, &smpss); >> +} >> + >> unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) >> { >> unsigned int devfn, pass, max = bus->secondary; >> @@ -1359,6 +1492,11 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) >> max = pci_scan_bridge(bus, dev, max, pass); >> } >> >> + /* After the bus has been walked and all devices discovered, determine >> + * the MPS of the fabric and set it. >> + */ >> + pcie_set_maxpayloadsize(bus); >> + >> /* >> * We've scanned the bus and so we know all about what's on >> * the other side of any bridges that may be on this bus plus >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index c446b5c..9b35724 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -251,7 +251,8 @@ struct pci_dev { >> u8 revision; /* PCI revision, low byte of class word */ >> u8 hdr_type; /* PCI header type (`multi' flag masked out) */ >> u8 pcie_cap; /* PCI-E capability offset */ >> - u8 pcie_type; /* PCI-E device/port type */ >> + u8 pcie_type:4; /* PCI-E device/port type */ >> + u8 pcie_mpss:3; /* PCI-E Max Payload Size Supported */ >> u8 rom_base_reg; /* which config register controls the ROM */ >> u8 pin; /* which interrupt pin this device uses */ >> >> @@ -617,6 +618,8 @@ struct pci_driver { >> /* these external functions are only available when PCI support is enabled */ >> #ifdef CONFIG_PCI >> >> +extern bool pcie_mps_forcemax; >> + >> extern struct bus_type pci_bus_type; >> >> /* Do NOT directly access these two variables, unless you are arch specific pci >> @@ -796,6 +799,8 @@ int pcix_get_mmrbc(struct pci_dev *dev); >> int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); >> int pcie_get_readrq(struct pci_dev *dev); >> int pcie_set_readrq(struct pci_dev *dev, int rq); >> +int pcie_get_mps(struct pci_dev *dev); >> +int pcie_set_mps(struct pci_dev *dev, int mps); >> int __pci_reset_function(struct pci_dev *dev); >> int pci_reset_function(struct pci_dev *dev); >> void pci_update_resource(struct pci_dev *dev, int resno); > > Cheers, > Ben. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 1e2ad92..1daeb22 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -256,8 +256,8 @@ void pci_enable_bridges(struct pci_bus *bus) * other than 0, we break out. * */ -void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), - void *userdata) +void pci_walk_bus_self(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), + void *userdata, bool self) { struct pci_dev *dev; struct pci_bus *bus; @@ -276,6 +276,13 @@ void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), bus = bus->self->bus; continue; } + + if (self) { + retval = cb(bus->self, userdata); + if (retval) + break; + } + dev = list_entry(next, struct pci_dev, bus_list); if (dev->subordinate) { /* this is a pci-pci bridge, do its devices next */ @@ -293,6 +300,12 @@ void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), } up_read(&pci_bus_sem); } + +void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), + void *userdata) +{ + pci_walk_bus_self(top, cb, userdata, false); +} EXPORT_SYMBOL_GPL(pci_walk_bus); EXPORT_SYMBOL(pci_bus_alloc_resource); diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c index 749fdf0..ee4ff14 100644 --- a/drivers/pci/hotplug/pcihp_slot.c +++ b/drivers/pci/hotplug/pcihp_slot.c @@ -26,6 +26,19 @@ #include <linux/pci.h> #include <linux/pci_hotplug.h> +extern void pcie_set_maxpayloadsize(struct pci_bus *bus); + +static void pcie_determine_mps(struct pci_dev *dev) +{ + struct pci_bus *bus = dev->bus; + + while (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT && + !pci_is_root_bus(bus)) + bus = bus->parent; + + pcie_set_maxpayloadsize(bus); +} + static struct hpp_type0 pci_default_type0 = { .revision = 1, .cache_line_size = 8, @@ -158,47 +171,6 @@ static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp) */ } -/* Program PCIE MaxPayload setting on device: ensure parent maxpayload <= device */ -static int pci_set_payload(struct pci_dev *dev) -{ - int pos, ppos; - u16 pctl, psz; - u16 dctl, dsz, dcap, dmax; - struct pci_dev *parent; - - parent = dev->bus->self; - pos = pci_find_capability(dev, PCI_CAP_ID_EXP); - if (!pos) - return 0; - - /* Read Device MaxPayload capability and setting */ - pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &dctl); - pci_read_config_word(dev, pos + PCI_EXP_DEVCAP, &dcap); - dsz = (dctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; - dmax = (dcap & PCI_EXP_DEVCAP_PAYLOAD); - - /* Read Parent MaxPayload setting */ - ppos = pci_find_capability(parent, PCI_CAP_ID_EXP); - if (!ppos) - return 0; - pci_read_config_word(parent, ppos + PCI_EXP_DEVCTL, &pctl); - psz = (pctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; - - /* If parent payload > device max payload -> error - * If parent payload > device payload -> set speed - * If parent payload <= device payload -> do nothing - */ - if (psz > dmax) - return -1; - else if (psz > dsz) { - dev_info(&dev->dev, "Setting MaxPayload to %d\n", 128 << psz); - pci_write_config_word(dev, pos + PCI_EXP_DEVCTL, - (dctl & ~PCI_EXP_DEVCTL_PAYLOAD) + - (psz << 5)); - } - return 0; -} - void pci_configure_slot(struct pci_dev *dev) { struct pci_dev *cdev; @@ -210,9 +182,7 @@ void pci_configure_slot(struct pci_dev *dev) (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI))) return; - ret = pci_set_payload(dev); - if (ret) - dev_warn(&dev->dev, "could not set device max payload\n"); + pcie_determine_mps(dev); memset(&hpp, 0, sizeof(hpp)); ret = pci_get_hp_params(dev, &hpp); diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 56098b3..bcc5d31 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -77,6 +77,8 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE; unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE; unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE; +bool pcie_mps_forcemax = false; + /* * The default CLS is used if arch didn't set CLS explicitly and not * all pci devices agree on the same value. Arch can override either @@ -3498,6 +3500,10 @@ static int __init pci_setup(char *str) pci_hotplug_io_size = memparse(str + 9, &str); } else if (!strncmp(str, "hpmemsize=", 10)) { pci_hotplug_mem_size = memparse(str + 10, &str); + } else if (!strncmp(str, "mpsmin", 6)) { + pcie_mps_forcemax = false; + } else if (!strncmp(str, "mpsmax", 6)) { + pcie_mps_forcemax = true; } else { printk(KERN_ERR "PCI: Unknown option `%s'\n", str); diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 48849ff..f6bbab1 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -860,6 +860,8 @@ void set_pcie_port_type(struct pci_dev *pdev) pdev->pcie_cap = pos; pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, ®16); pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; + pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, ®16); + pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD; } void set_pcie_hotplug_bridge(struct pci_dev *pdev) @@ -1327,6 +1329,101 @@ int pci_scan_slot(struct pci_bus *bus, int devfn) return nr; } +static int pcie_find_smpss(struct pci_dev *dev, void *data) +{ + u8 *smpss = data; + + if (!pci_is_pcie(dev)) + return 0; + + /* For PCIE hotplug enabled slots not connected directly to a + * PCI-E root port, there can be problems when hotplugging + * devices. This is due to the possibility of hotplugging a + * device into the fabric with a smaller MPS that the devices + * currently running have configured. Modifying the MPS on the + * running devices could cause a fatal bus error due to an + * incoming frame being larger than the newly configured MPS. + * To work around this, the MPS for the entire fabric must be + * set to the minimum size. Any devices hotpulled into this + * fabric will have the minimum MPS set. If the PCI hotplug + * slot is directly connected to the root port and there are not + * other devices on the fabric (which seems to be the most + * common case), then this is not an issue and MPS discovery + * will occur as normal. + */ + if (dev->is_hotplug_bridge && dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT) + *smpss = 0; + + if (*smpss > dev->pcie_mpss) + *smpss = dev->pcie_mpss; + + return 0; +} + +static int pcie_write_mps(struct pci_dev *dev, void *data) +{ + u16 dctl, mps; + u8 smpss = *(u8 *)data; + + if (!pci_is_pcie(dev)) + return 0; + + if (pcie_mps_forcemax && dev->bus->self) { + dev_dbg(&dev->bus->dev, "Bus MPSS %x\n", + dev->bus->self->pcie_mpss); + /* For "MPS Force Max", the assumption is made that downstream + * communication will never be larger than 128 bytes. So, the + * MPS only needs to be configured for the upstream + * communication. This being the case, walk from the top down + * and set the MPS of the child to that of the parent bus. + */ + smpss = dev->bus->self->pcie_mpss; + if (dev->pcie_mpss < smpss) + dev_warn(&dev->dev, "MPS configured higher than Max " + "supported. This can cause a bus issue.\n"); + } + + pci_read_config_word(dev, dev->pcie_cap + PCI_EXP_DEVCTL, &dctl); + mps = (dctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; + + dev_dbg(&dev->dev, "Dev MPS %x MPSS %x\n", mps, dev->pcie_mpss); + + if (smpss != mps) { + dev_info(&dev->dev, "Setting MaxPayload to %d\n", 128 << smpss); + + pci_write_config_word(dev, dev->pcie_cap + PCI_EXP_DEVCTL, + (dctl & ~PCI_EXP_DEVCTL_PAYLOAD) | + (smpss << 5)); + } + + dev->pcie_mpss = smpss; + return 0; +} + +extern void pci_walk_bus_self(struct pci_bus *top, int (*cb)(struct pci_dev *, void *), + void *userdata, bool self); + +void pcie_set_maxpayloadsize(struct pci_bus *bus) +{ + u8 smpss; + + if (!bus->self) + return; + + if (!pci_is_pcie(bus->self)) + return; + + if (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT) + return; + + smpss = bus->self->pcie_mpss; + + if (!pcie_mps_forcemax) + pci_walk_bus_self(bus, pcie_find_smpss, &smpss, true); + + pci_walk_bus_self(bus, pcie_write_mps, &smpss, true); +} + unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) { unsigned int devfn, pass, max = bus->secondary; @@ -1359,6 +1456,11 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) max = pci_scan_bridge(bus, dev, max, pass); } + /* After the bus has been walked and all devices discovered, determine + * the MPS of the fabric and set it. + */ + pcie_set_maxpayloadsize(bus); + /* * We've scanned the bus and so we know all about what's on * the other side of any bridges that may be on this bus plus diff --git a/include/linux/pci.h b/include/linux/pci.h index c446b5c..2a60e635 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -251,7 +251,8 @@ struct pci_dev { u8 revision; /* PCI revision, low byte of class word */ u8 hdr_type; /* PCI header type (`multi' flag masked out) */ u8 pcie_cap; /* PCI-E capability offset */ - u8 pcie_type; /* PCI-E device/port type */ + u8 pcie_type:4; /* PCI-E device/port type */ + u8 pcie_mpss:3; /* PCI-E Max Payload Size Supported */ u8 rom_base_reg; /* which config register controls the ROM */ u8 pin; /* which interrupt pin this device uses */ @@ -617,6 +618,8 @@ struct pci_driver { /* these external functions are only available when PCI support is enabled */ #ifdef CONFIG_PCI +extern bool pcie_mps_forcemax; + extern struct bus_type pci_bus_type; /* Do NOT directly access these two variables, unless you are arch specific pci
There is a sizable performance boost for having the largest possible maximum payload size on each PCI-E device. However, the maximum payload size must be uniform on a given PCI-E fabric, and each device, bridge, and root port can have a different max size. To find and configure the optimal MPS settings, one must walk the fabric and determine the largest MPS available on all the devices on the given fabric. However, on some architectures (notably PPC) it is known that the downstream communication will never be larger than 128B. With this known, the MPS can be configured to be the largest value supported by the PCI-E switches and root port. In most cases this means that the devices on the fabric are not limited to the lowest common denominator. Use with caution, if this is not 100% true then there is the possibility of a bus error. To choose between the two available options, two kernel boot args have been added to the PCI calls. "mpsmin" will provide the former behavior, while "mpsmax" will perform the latter behavior. By default, the former behavior is used. Along with the potential performance improvement, this works around an issue with buggy BIOS revisions found on various whitebox motherboards which do not configure MPS beyond one level below the root port. There is one issue with the "mpsmin" design. For PCI-E hotplug enabled slots not connected directly to a PCI-E root port, there can be problems when hotplugging devices. This is due to the possibility of hotplugging a device into the fabric with a smaller MPS that the devices currently running have configured. Modifying the MPS on the running devices could cause a fatal bus error due to an incoming frame being larger than the newly configured MPS. To work around this, the MPS for the entire fabric must be set to the minimum size. Any devices hotplugged into this fabric will have the minimum MPS set. If the PCI hotplug slot is directly connected to the root port and there are not other devices on the fabric (which seems to be the most common case), then this is not an issue and MPS discovery will occur as normal. Signed-off-by: Jon Mason <mason@myri.com> --- drivers/pci/bus.c | 17 ++++++- drivers/pci/hotplug/pcihp_slot.c | 58 +++++---------------- drivers/pci/pci.c | 6 ++ drivers/pci/probe.c | 102 ++++++++++++++++++++++++++++++++++++++ include/linux/pci.h | 5 ++- 5 files changed, 141 insertions(+), 47 deletions(-)