[3/4] x86, pci: Add interface to force mmconfig

Message ID	20170302232104.10136-3-andi@firstfloor.org (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> From: Andi Kleen <andi@firstfloor.org> To: bhelgaas@google.com Cc: x86@kernel.org, linux-pci@vger.kernel.org, eranian@google.com, peterz@infradead.org, linux-kernel@vger.kernel.org, Andi Kleen <ak@linux.intel.com> Subject: [PATCH 3/4] x86, pci: Add interface to force mmconfig Date: Thu, 2 Mar 2017 15:21:03 -0800 Message-Id: <20170302232104.10136-3-andi@firstfloor.org> In-Reply-To: <20170302232104.10136-1-andi@firstfloor.org> References: <20170302232104.10136-1-andi@firstfloor.org> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Andi Kleen March 2, 2017, 11:21 p.m. UTC

From: Andi Kleen <ak@linux.intel.com>

This fills in the pci_bus_force_mmconfig interface that was
added earlier for x86 to allow drivers to optimize config
space accesses. The implementation is straight forward
and uses the existing mmconfig access functions, just forcing
mmconfig access.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/pci/mmconfig-shared.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Thomas Gleixner March 14, 2017, 1:55 p.m. UTC | #1

On Thu, 2 Mar 2017, Andi Kleen wrote:
> +struct pci_ops pci_mmconfig_ops = {
> +	.read = pci_mmconfig_read,
> +	.write = pci_mmconfig_write,
> +};
> +
> +/* Force all config accesses to go through mmconfig. */
> +int pci_bus_force_mmconfig(struct pci_bus *bus)
> +{
> +	if (!raw_pci_ext_ops)
> +		return -1;

We have error defines. That aside, the weak version of this returns 0,
i.e. success, but here you return fail. Consistency is overrated, right?

> +	bus->ops = &pci_mmconfig_ops;

What guarantees that raw_pci_ext_ops == pci_mmcfg? Nothing as far as I can
tell. So that function name is nonsensical. It does not force anything.

And the way how this function is used is a horrible hack. It's called from
a random driver at some random point in time.

The proper solution is to identify the bus at the point where the bus is
discovered and switch it to mmconfig if possible.

Thanks,

	tglx

Andi Kleen March 14, 2017, 3:41 p.m. UTC | #2

> And the way how this function is used is a horrible hack. It's called from
> a random driver at some random point in time.
> 
> The proper solution is to identify the bus at the point where the bus is
> discovered and switch it to mmconfig if possible.

But how would you know that it is safe?
AFAIK the only one who knows "this is an internal SOC device" is the driver.

Or are you saying it should be enabled unconditionally for everything?
Or define some quirk table just for this purpose?

-Andi

Thomas Gleixner March 14, 2017, 4:40 p.m. UTC | #3

On Tue, 14 Mar 2017, Andi Kleen wrote:

> > And the way how this function is used is a horrible hack. It's called from
> > a random driver at some random point in time.
> > 
> > The proper solution is to identify the bus at the point where the bus is
> > discovered and switch it to mmconfig if possible.
> 
> But how would you know that it is safe?
> AFAIK the only one who knows "this is an internal SOC device" is the driver.

The driver handles a specific device, but you apply that tweak to the bus
of which the device hangs off.

> Or are you saying it should be enabled unconditionally for everything?

No.

> Or define some quirk table just for this purpose?

Nope. It's about identifying the bus.

The bus which contains the uncore devices:

 The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned
 for the processor socket. The bus number is derived from the max bus range
 setting and the processor socket number.

So there should be a way to detect that and it would be appreciated if you
could talk to your hardware folks what's the programmatic way to figure it
out. Maybe there is information in ACPI as well.

Thanks,

	tglx

Andi Kleen March 14, 2017, 5:02 p.m. UTC | #4

> > Or define some quirk table just for this purpose?
> 
> Nope. It's about identifying the bus.

PCI just has no good way to identify busses.

> 
> The bus which contains the uncore devices:
> 
>  The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned
>  for the processor socket. The bus number is derived from the max bus range
>  setting and the processor socket number.

I have had bad experiences with hard coding topology like this. These
things tend to be very configurable by BIOS.

> 
> So there should be a way to detect that and it would be appreciated if you
> could talk to your hardware folks what's the programmatic way to figure it
> out. 

There's likely some hidden register somewhere. But then it would be

check model number
look for hidden register
configure these busses

and it will likely change often.

Frankly I fail to see how such a thing is better than just using
the device ID. Perhaps the driver is not the right place for it,
but the ID seems to be.

The other option is to simply make it unconditional. That would
be even simpler, but it is a bit more risky. Hmm, I suppose may
be worth trying to find out what Windows uses. If they get away
with MMCONFIG everywhere we should be too.

Or the third option would be to move the ops pointer into the 
PCI device, so it's a per device setting. If people are ok with that
I can look into it.

> Maybe there is information in ACPI as well.

I don't think ACPI has any concept of "internal SOC busses"

-Andi

Thomas Gleixner March 14, 2017, 5:56 p.m. UTC | #5

On Tue, 14 Mar 2017, Andi Kleen wrote:
> The other option is to simply make it unconditional. That would
> be even simpler, but it is a bit more risky. Hmm, I suppose may
> be worth trying to find out what Windows uses. If they get away
> with MMCONFIG everywhere we should be too.

From the PCIe spec:

 The PCI 3.0 compatible Configuration Space can be accessed using either
 the mechanism defined in the PCI Local Bus Specification or the PCI
 Express Enhanced Configuration Access Mechanism (ECAM) described later in
 this section. Accesses made using either access mechanism are
 equivalent. The PCI Express Extended Configuration Space can only be
 accessed by using the ECAM.

The 1.0 spec has a similar wording (s/3.0/2.3).

Definitely the best way to go.

Thanks,

	tglx

Bjorn Helgaas March 14, 2017, 7:47 p.m. UTC | #6

On Tue, Mar 14, 2017 at 06:56:26PM +0100, Thomas Gleixner wrote:
> On Tue, 14 Mar 2017, Andi Kleen wrote:
> > The other option is to simply make it unconditional. That would
> > be even simpler, but it is a bit more risky. Hmm, I suppose may
> > be worth trying to find out what Windows uses. If they get away
> > with MMCONFIG everywhere we should be too.
> 
> From the PCIe spec:
> 
>  The PCI 3.0 compatible Configuration Space can be accessed using either
>  the mechanism defined in the PCI Local Bus Specification or the PCI
>  Express Enhanced Configuration Access Mechanism (ECAM) described later in
>  this section. Accesses made using either access mechanism are
>  equivalent. The PCI Express Extended Configuration Space can only be
>  accessed by using the ECAM.
> 
> The 1.0 spec has a similar wording (s/3.0/2.3).
> 
> Definitely the best way to go.

I agree that it should be fairly safe to do ECAM/MMCONFIG without
locking.  Can we handle the decision part by adding a "lockless" bit
to struct pci_ops?  Old ops don't mention that bit, so it will be
initialized to zero and we'll do locking as today.  ECAM/MMCONFIG ops
can set it and we can skip the locking.

Bjorn

Andi Kleen March 15, 2017, 2:24 a.m. UTC | #7

> I agree that it should be fairly safe to do ECAM/MMCONFIG without
> locking.  Can we handle the decision part by adding a "lockless" bit
> to struct pci_ops?  Old ops don't mention that bit, so it will be
> initialized to zero and we'll do locking as today.  ECAM/MMCONFIG ops
> can set it and we can skip the locking.

That's what my other patch already did. 

-Andi

Bjorn Helgaas March 15, 2017, 2:55 a.m. UTC | #8

On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > locking.  Can we handle the decision part by adding a "lockless" bit
> > to struct pci_ops?  Old ops don't mention that bit, so it will be
> > initialized to zero and we'll do locking as today.  ECAM/MMCONFIG ops
> > can set it and we can skip the locking.
> 
> That's what my other patch already did. 

Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.

What I was wondering, but didn't explain very well, was whether
instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
could set it statically in the pci_ops definition, e.g.,

  static struct pci_ops ecam_ops = {
    .lockless = 1,
    .read = ecam_read,
    .write = ecam_write,
  };

I think it would be easier to read if the lockless-ness were declared
right next to the accessors that need it (or don't need it).

But it is a little confusing with all the different paths, at least on
x86, so maybe it wouldn't be quite that simple.

Bjorn

Thomas Gleixner March 15, 2017, 10 a.m. UTC | #9

On Tue, 14 Mar 2017, Bjorn Helgaas wrote:
> On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > > locking.  Can we handle the decision part by adding a "lockless" bit
> > > to struct pci_ops?  Old ops don't mention that bit, so it will be
> > > initialized to zero and we'll do locking as today.  ECAM/MMCONFIG ops
> > > can set it and we can skip the locking.
> > 
> > That's what my other patch already did. 
> 
> Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.
> 
> What I was wondering, but didn't explain very well, was whether
> instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
> could set it statically in the pci_ops definition, e.g.,
> 
>   static struct pci_ops ecam_ops = {
>     .lockless = 1,
>     .read = ecam_read,
>     .write = ecam_write,
>   };
> 
> I think it would be easier to read if the lockless-ness were declared
> right next to the accessors that need it (or don't need it).
> 
> But it is a little confusing with all the different paths, at least on
> x86, so maybe it wouldn't be quite that simple.

The pci_ops in x86 are a complete mess. We have

struct pci_ops pci_root_ops = {
        .read = pci_read,
	.write = pci_write,
};

That's the default and the r/w functions look like this:

pci_read/write()
{
	if (domain == 0 && reg && raw_pci_ops)
 		return raw_pci_ops->read/write();
	if (raw_pci_ext_ops)
 		return raw_pci_ext_ops->read/write();
	return -EINVAL;
}

raw_pci_ops and raw_pci_ext_ops are setup through an impenetrable maze of
functions. Some of them overwrite pci_root_ops to something entirely
different.

pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
argument for any bus segment no matter which type it is.

The locking aspect is interesting as well. The type0/1 functions are having
their own internal locking. Oh, well.

What we really want is to differentiate bus segments. That means a PCIe
segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
can do what you suggested above, i.e. marking the ecam/mmconfig ops as
lockless.

Sure that's more work than just whacking a sloppy quirk into the code, but
the right thing to do.

Thanks,

	tglx

Bjorn Helgaas March 15, 2017, 2:09 p.m. UTC | #10

On Wed, Mar 15, 2017 at 11:00:22AM +0100, Thomas Gleixner wrote:
> On Tue, 14 Mar 2017, Bjorn Helgaas wrote:
> > On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > > > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > > > locking.  Can we handle the decision part by adding a "lockless" bit
> > > > to struct pci_ops?  Old ops don't mention that bit, so it will be
> > > > initialized to zero and we'll do locking as today.  ECAM/MMCONFIG ops
> > > > can set it and we can skip the locking.
> > > 
> > > That's what my other patch already did. 
> > 
> > Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.
> > 
> > What I was wondering, but didn't explain very well, was whether
> > instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
> > could set it statically in the pci_ops definition, e.g.,
> > 
> >   static struct pci_ops ecam_ops = {
> >     .lockless = 1,
> >     .read = ecam_read,
> >     .write = ecam_write,
> >   };
> > 
> > I think it would be easier to read if the lockless-ness were declared
> > right next to the accessors that need it (or don't need it).
> > 
> > But it is a little confusing with all the different paths, at least on
> > x86, so maybe it wouldn't be quite that simple.
> 
> The pci_ops in x86 are a complete mess.

That's certainly a pithy summary :)

> pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> argument for any bus segment no matter which type it is.
> 
> The locking aspect is interesting as well. The type0/1 functions are having
> their own internal locking. Oh, well.
> 
> What we really want is to differentiate bus segments. That means a PCIe
> segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> lockless.

If we were starting from scratch, I think we would probably put the
locking inside the device-specific config accessors at the lowest
level.  Then it would be directly at the place where it's obvious
what's needed, and it would be easy to do no locking, per-host bridge
locking, or system-wide locking.  Right now we have many places that
implicitly depend on pci_lock but there's no direct connection.

We could conceivably migrate to that, but it would be a fair amount of
work.

Bjorn

Andi Kleen March 16, 2017, 12:02 a.m. UTC | #11

> pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> argument for any bus segment no matter which type it is.

mmconfig is only initialized after PCI is initialized
(an ordering problem with ACPI). So it would require
updating existing busses with likely interesting race
conditions.

There are also other ordering problems in the PCI layer,
that is one of the reason early and raw PCI accesses even exist.

> 
> The locking aspect is interesting as well. The type0/1 functions are having
> their own internal locking. Oh, well.

Right it could set lockless too. The internal locking is still needed
because there are other users too.

> What we really want is to differentiate bus segments. That means a PCIe
> segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> lockless.

There's no need to separate PCIe and PCI. mmconfig has nothing to do
with that.

> Sure that's more work than just whacking a sloppy quirk into the code, but
> the right thing to do.

Before proposing grandiose plans it would be better if you
acquired some basic understanding of the constraints this
code is operating under first.

-Andi

Thomas Gleixner March 16, 2017, 10:45 p.m. UTC | #12

On Wed, 15 Mar 2017, Andi Kleen wrote:
> > pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> > argument for any bus segment no matter which type it is.
> 
> mmconfig is only initialized after PCI is initialized (an ordering
> problem with ACPI).

Wrong. It can be initialized before that and it actually is on most of my
machines. Unfortunately its not guaranteed.

> So it would require updating existing busses with likely interesting race
> conditions.

More racy than switching it from a random driver after the PCI bus has been
completely initialized and is already in use? Surely not.

> There are also other ordering problems in the PCI layer,
> that is one of the reason early and raw PCI accesses even exist.

Early accesses are a different class for PCI accesses _before_
pci_arch_init() or acpi_init() has been invoked. That's handled by the
early accessors which are hardcoded to use PCI type 1 configuration access
via CF8/CFC. These are completely seperate and not in any way related to
this.

So lets look how this works:

   pci_arch_init()

	Setup of raw_pci_ops and raw_pci_ext_ops

	This sets mmconfig, when the information is available already.

   acpi_init()

	Parses the ACPI tables and sets up the PCI root.

	Sets mmconfig when not yet set.

   pci_subsys_init()

	Final x86 pci init calls, which might affect pci ops.

So ideally we would switch to ECAM before acpi_init(), but

   - mmconfig might not yet be available

   - x86_init.pci.init() which is called from pci_subsys_init() can modify
     pci_root_ops or raw_pci_ops

Though that's a non issue simple because after x86_init.pci.init() still
nothing operates on PCI devices and it's safe and simple to replace the
pci_root_ops read and write pointers with ECAM based variants.

> > The locking aspect is interesting as well. The type0/1 functions are having
> > their own internal locking. Oh, well.

> Right it could set lockless too. The internal locking is still needed
> because there are other users too.

Looking at the x86 pci ops variants, there is only the ce4100 one, which
relies on the external locking in the generic pci code. That's reasonable
easy to fix and once that is done the whole conditional locking in the
generic PCI accessors can be avoided. The locking can simply be compiled
out.

> > What we really want is to differentiate bus segments. That means a PCIe
> > segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> > can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> > lockless.
> 
> There's no need to separate PCIe and PCI. mmconfig has nothing to do
> with that.

What? If the system does not have a PCIe compliant root complex/host
bridge, then you cannot use mmconfig at all. So yes, there needs to be a
decision made.

Sure, we don't have to treat PCI busses behind a PCIe to PCI(-X) bridge
differently as that handled by the host bridge and the PCIe/PCI(-X)
bridge. There might be dragons lurking, but those can be handled with a
date cutoff or a small set of quirks.

> > Sure that's more work than just whacking a sloppy quirk into the code, but
> > the right thing to do.
> 
> Before proposing grandiose plans it would be better if you acquired some
> basic understanding of the constraints this code is operating under
> first.

Contrary to you I studied the code and the spec before making uneducated
claims and accusations.

And contrary to you I care about the correctness and the maintainability of
the code. Your works for me and know everything better attitude is the main
reason for the mess which exists today. Your thought termination cliche,
that others do not understand what they are talking about has been proven
wrong over and over.

I did not claim that it's simple and I merily talked about the ideal
solution while I was well aware that there are dependencies and corner
cases.

It took me a only couple of hours to analyze all possible corner cases
which reconfigure pci_root_ops or raw_pci_*ops to find a spot where this
can be done in a sane way.

Patches come with a seperate mail. They get rid of the global pci_lock in
the generic accessors completely and avoid the extra pointer indirection
and do not even get near a driver.

It might look like a grandiose plan to you, but that might be due to a
gross overestimation of the complexity of that code or the lack of basic
engineering principles.

Thanks,

	tglx

[3/4] x86, pci: Add interface to force mmconfig

Commit Message

Comments

Patch