Message ID | 20170302232104.10136-3-andi@firstfloor.org (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Thu, 2 Mar 2017, Andi Kleen wrote: > +struct pci_ops pci_mmconfig_ops = { > + .read = pci_mmconfig_read, > + .write = pci_mmconfig_write, > +}; > + > +/* Force all config accesses to go through mmconfig. */ > +int pci_bus_force_mmconfig(struct pci_bus *bus) > +{ > + if (!raw_pci_ext_ops) > + return -1; We have error defines. That aside, the weak version of this returns 0, i.e. success, but here you return fail. Consistency is overrated, right? > + bus->ops = &pci_mmconfig_ops; What guarantees that raw_pci_ext_ops == pci_mmcfg? Nothing as far as I can tell. So that function name is nonsensical. It does not force anything. And the way how this function is used is a horrible hack. It's called from a random driver at some random point in time. The proper solution is to identify the bus at the point where the bus is discovered and switch it to mmconfig if possible. Thanks, tglx
> And the way how this function is used is a horrible hack. It's called from > a random driver at some random point in time. > > The proper solution is to identify the bus at the point where the bus is > discovered and switch it to mmconfig if possible. But how would you know that it is safe? AFAIK the only one who knows "this is an internal SOC device" is the driver. Or are you saying it should be enabled unconditionally for everything? Or define some quirk table just for this purpose? -Andi
On Tue, 14 Mar 2017, Andi Kleen wrote: > > And the way how this function is used is a horrible hack. It's called from > > a random driver at some random point in time. > > > > The proper solution is to identify the bus at the point where the bus is > > discovered and switch it to mmconfig if possible. > > But how would you know that it is safe? > AFAIK the only one who knows "this is an internal SOC device" is the driver. The driver handles a specific device, but you apply that tweak to the bus of which the device hangs off. > Or are you saying it should be enabled unconditionally for everything? No. > Or define some quirk table just for this purpose? Nope. It's about identifying the bus. The bus which contains the uncore devices: The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned for the processor socket. The bus number is derived from the max bus range setting and the processor socket number. So there should be a way to detect that and it would be appreciated if you could talk to your hardware folks what's the programmatic way to figure it out. Maybe there is information in ACPI as well. Thanks, tglx
> > Or define some quirk table just for this purpose? > > Nope. It's about identifying the bus. PCI just has no good way to identify busses. > > The bus which contains the uncore devices: > > The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned > for the processor socket. The bus number is derived from the max bus range > setting and the processor socket number. I have had bad experiences with hard coding topology like this. These things tend to be very configurable by BIOS. > > So there should be a way to detect that and it would be appreciated if you > could talk to your hardware folks what's the programmatic way to figure it > out. There's likely some hidden register somewhere. But then it would be check model number look for hidden register configure these busses and it will likely change often. Frankly I fail to see how such a thing is better than just using the device ID. Perhaps the driver is not the right place for it, but the ID seems to be. The other option is to simply make it unconditional. That would be even simpler, but it is a bit more risky. Hmm, I suppose may be worth trying to find out what Windows uses. If they get away with MMCONFIG everywhere we should be too. Or the third option would be to move the ops pointer into the PCI device, so it's a per device setting. If people are ok with that I can look into it. > Maybe there is information in ACPI as well. I don't think ACPI has any concept of "internal SOC busses" -Andi
On Tue, 14 Mar 2017, Andi Kleen wrote: > The other option is to simply make it unconditional. That would > be even simpler, but it is a bit more risky. Hmm, I suppose may > be worth trying to find out what Windows uses. If they get away > with MMCONFIG everywhere we should be too. From the PCIe spec: The PCI 3.0 compatible Configuration Space can be accessed using either the mechanism defined in the PCI Local Bus Specification or the PCI Express Enhanced Configuration Access Mechanism (ECAM) described later in this section. Accesses made using either access mechanism are equivalent. The PCI Express Extended Configuration Space can only be accessed by using the ECAM. The 1.0 spec has a similar wording (s/3.0/2.3). Definitely the best way to go. Thanks, tglx
On Tue, Mar 14, 2017 at 06:56:26PM +0100, Thomas Gleixner wrote: > On Tue, 14 Mar 2017, Andi Kleen wrote: > > The other option is to simply make it unconditional. That would > > be even simpler, but it is a bit more risky. Hmm, I suppose may > > be worth trying to find out what Windows uses. If they get away > > with MMCONFIG everywhere we should be too. > > From the PCIe spec: > > The PCI 3.0 compatible Configuration Space can be accessed using either > the mechanism defined in the PCI Local Bus Specification or the PCI > Express Enhanced Configuration Access Mechanism (ECAM) described later in > this section. Accesses made using either access mechanism are > equivalent. The PCI Express Extended Configuration Space can only be > accessed by using the ECAM. > > The 1.0 spec has a similar wording (s/3.0/2.3). > > Definitely the best way to go. I agree that it should be fairly safe to do ECAM/MMCONFIG without locking. Can we handle the decision part by adding a "lockless" bit to struct pci_ops? Old ops don't mention that bit, so it will be initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops can set it and we can skip the locking. Bjorn
> I agree that it should be fairly safe to do ECAM/MMCONFIG without > locking. Can we handle the decision part by adding a "lockless" bit > to struct pci_ops? Old ops don't mention that bit, so it will be > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops > can set it and we can skip the locking. That's what my other patch already did. -Andi
On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote: > > I agree that it should be fairly safe to do ECAM/MMCONFIG without > > locking. Can we handle the decision part by adding a "lockless" bit > > to struct pci_ops? Old ops don't mention that bit, so it will be > > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops > > can set it and we can skip the locking. > > That's what my other patch already did. Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops. What I was wondering, but didn't explain very well, was whether instead of setting that bit at run-time in pci_mmcfg_arch_init(), we could set it statically in the pci_ops definition, e.g., static struct pci_ops ecam_ops = { .lockless = 1, .read = ecam_read, .write = ecam_write, }; I think it would be easier to read if the lockless-ness were declared right next to the accessors that need it (or don't need it). But it is a little confusing with all the different paths, at least on x86, so maybe it wouldn't be quite that simple. Bjorn
On Tue, 14 Mar 2017, Bjorn Helgaas wrote: > On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote: > > > I agree that it should be fairly safe to do ECAM/MMCONFIG without > > > locking. Can we handle the decision part by adding a "lockless" bit > > > to struct pci_ops? Old ops don't mention that bit, so it will be > > > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops > > > can set it and we can skip the locking. > > > > That's what my other patch already did. > > Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops. > > What I was wondering, but didn't explain very well, was whether > instead of setting that bit at run-time in pci_mmcfg_arch_init(), we > could set it statically in the pci_ops definition, e.g., > > static struct pci_ops ecam_ops = { > .lockless = 1, > .read = ecam_read, > .write = ecam_write, > }; > > I think it would be easier to read if the lockless-ness were declared > right next to the accessors that need it (or don't need it). > > But it is a little confusing with all the different paths, at least on > x86, so maybe it wouldn't be quite that simple. The pci_ops in x86 are a complete mess. We have struct pci_ops pci_root_ops = { .read = pci_read, .write = pci_write, }; That's the default and the r/w functions look like this: pci_read/write() { if (domain == 0 && reg && raw_pci_ops) return raw_pci_ops->read/write(); if (raw_pci_ext_ops) return raw_pci_ext_ops->read/write(); return -EINVAL; } raw_pci_ops and raw_pci_ext_ops are setup through an impenetrable maze of functions. Some of them overwrite pci_root_ops to something entirely different. pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops argument for any bus segment no matter which type it is. The locking aspect is interesting as well. The type0/1 functions are having their own internal locking. Oh, well. What we really want is to differentiate bus segments. That means a PCIe segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we can do what you suggested above, i.e. marking the ecam/mmconfig ops as lockless. Sure that's more work than just whacking a sloppy quirk into the code, but the right thing to do. Thanks, tglx
On Wed, Mar 15, 2017 at 11:00:22AM +0100, Thomas Gleixner wrote: > On Tue, 14 Mar 2017, Bjorn Helgaas wrote: > > On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote: > > > > I agree that it should be fairly safe to do ECAM/MMCONFIG without > > > > locking. Can we handle the decision part by adding a "lockless" bit > > > > to struct pci_ops? Old ops don't mention that bit, so it will be > > > > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops > > > > can set it and we can skip the locking. > > > > > > That's what my other patch already did. > > > > Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops. > > > > What I was wondering, but didn't explain very well, was whether > > instead of setting that bit at run-time in pci_mmcfg_arch_init(), we > > could set it statically in the pci_ops definition, e.g., > > > > static struct pci_ops ecam_ops = { > > .lockless = 1, > > .read = ecam_read, > > .write = ecam_write, > > }; > > > > I think it would be easier to read if the lockless-ness were declared > > right next to the accessors that need it (or don't need it). > > > > But it is a little confusing with all the different paths, at least on > > x86, so maybe it wouldn't be quite that simple. > > The pci_ops in x86 are a complete mess. That's certainly a pithy summary :) > pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops > argument for any bus segment no matter which type it is. > > The locking aspect is interesting as well. The type0/1 functions are having > their own internal locking. Oh, well. > > What we really want is to differentiate bus segments. That means a PCIe > segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we > can do what you suggested above, i.e. marking the ecam/mmconfig ops as > lockless. If we were starting from scratch, I think we would probably put the locking inside the device-specific config accessors at the lowest level. Then it would be directly at the place where it's obvious what's needed, and it would be easy to do no locking, per-host bridge locking, or system-wide locking. Right now we have many places that implicitly depend on pci_lock but there's no direct connection. We could conceivably migrate to that, but it would be a fair amount of work. Bjorn
> pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops > argument for any bus segment no matter which type it is. mmconfig is only initialized after PCI is initialized (an ordering problem with ACPI). So it would require updating existing busses with likely interesting race conditions. There are also other ordering problems in the PCI layer, that is one of the reason early and raw PCI accesses even exist. > > The locking aspect is interesting as well. The type0/1 functions are having > their own internal locking. Oh, well. Right it could set lockless too. The internal locking is still needed because there are other users too. > What we really want is to differentiate bus segments. That means a PCIe > segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we > can do what you suggested above, i.e. marking the ecam/mmconfig ops as > lockless. There's no need to separate PCIe and PCI. mmconfig has nothing to do with that. > Sure that's more work than just whacking a sloppy quirk into the code, but > the right thing to do. Before proposing grandiose plans it would be better if you acquired some basic understanding of the constraints this code is operating under first. -Andi
On Wed, 15 Mar 2017, Andi Kleen wrote: > > pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops > > argument for any bus segment no matter which type it is. > > mmconfig is only initialized after PCI is initialized (an ordering > problem with ACPI). Wrong. It can be initialized before that and it actually is on most of my machines. Unfortunately its not guaranteed. > So it would require updating existing busses with likely interesting race > conditions. More racy than switching it from a random driver after the PCI bus has been completely initialized and is already in use? Surely not. > There are also other ordering problems in the PCI layer, > that is one of the reason early and raw PCI accesses even exist. Early accesses are a different class for PCI accesses _before_ pci_arch_init() or acpi_init() has been invoked. That's handled by the early accessors which are hardcoded to use PCI type 1 configuration access via CF8/CFC. These are completely seperate and not in any way related to this. So lets look how this works: pci_arch_init() Setup of raw_pci_ops and raw_pci_ext_ops This sets mmconfig, when the information is available already. acpi_init() Parses the ACPI tables and sets up the PCI root. Sets mmconfig when not yet set. pci_subsys_init() Final x86 pci init calls, which might affect pci ops. So ideally we would switch to ECAM before acpi_init(), but - mmconfig might not yet be available - x86_init.pci.init() which is called from pci_subsys_init() can modify pci_root_ops or raw_pci_ops Though that's a non issue simple because after x86_init.pci.init() still nothing operates on PCI devices and it's safe and simple to replace the pci_root_ops read and write pointers with ECAM based variants. > > The locking aspect is interesting as well. The type0/1 functions are having > > their own internal locking. Oh, well. > Right it could set lockless too. The internal locking is still needed > because there are other users too. Looking at the x86 pci ops variants, there is only the ce4100 one, which relies on the external locking in the generic pci code. That's reasonable easy to fix and once that is done the whole conditional locking in the generic PCI accessors can be avoided. The locking can simply be compiled out. > > What we really want is to differentiate bus segments. That means a PCIe > > segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we > > can do what you suggested above, i.e. marking the ecam/mmconfig ops as > > lockless. > > There's no need to separate PCIe and PCI. mmconfig has nothing to do > with that. What? If the system does not have a PCIe compliant root complex/host bridge, then you cannot use mmconfig at all. So yes, there needs to be a decision made. Sure, we don't have to treat PCI busses behind a PCIe to PCI(-X) bridge differently as that handled by the host bridge and the PCIe/PCI(-X) bridge. There might be dragons lurking, but those can be handled with a date cutoff or a small set of quirks. > > Sure that's more work than just whacking a sloppy quirk into the code, but > > the right thing to do. > > Before proposing grandiose plans it would be better if you acquired some > basic understanding of the constraints this code is operating under > first. Contrary to you I studied the code and the spec before making uneducated claims and accusations. And contrary to you I care about the correctness and the maintainability of the code. Your works for me and know everything better attitude is the main reason for the mess which exists today. Your thought termination cliche, that others do not understand what they are talking about has been proven wrong over and over. I did not claim that it's simple and I merily talked about the ideal solution while I was well aware that there are dependencies and corner cases. It took me a only couple of hours to analyze all possible corner cases which reconfigure pci_root_ops or raw_pci_*ops to find a spot where this can be done in a sane way. Patches come with a seperate mail. They get rid of the global pci_lock in the generic accessors completely and avoid the extra pointer indirection and do not even get near a driver. It might look like a grandiose plan to you, but that might be due to a gross overestimation of the complexity of that code or the lack of basic engineering principles. Thanks, tglx
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c index dd30b7e08bc2..bb56533290aa 100644 --- a/arch/x86/pci/mmconfig-shared.c +++ b/arch/x86/pci/mmconfig-shared.c @@ -816,3 +816,31 @@ int pci_mmconfig_delete(u16 seg, u8 start, u8 end) return -ENOENT; } + +static int pci_mmconfig_read(struct pci_bus *bus, unsigned int devfn, + int where, int size, u32 *value) +{ + return raw_pci_ext_ops->read(pci_domain_nr(bus), bus->number, + devfn, where, size, value); +} + +static int pci_mmconfig_write(struct pci_bus *bus, unsigned int devfn, + int where, int size, u32 value) +{ + return raw_pci_ext_ops->write(pci_domain_nr(bus), bus->number, + devfn, where, size, value); +} + +struct pci_ops pci_mmconfig_ops = { + .read = pci_mmconfig_read, + .write = pci_mmconfig_write, +}; + +/* Force all config accesses to go through mmconfig. */ +int pci_bus_force_mmconfig(struct pci_bus *bus) +{ + if (!raw_pci_ext_ops) + return -1; + bus->ops = &pci_mmconfig_ops; + return 0; +}