
[v2,19/27] pci: PCIe driver for Marvell Armada 370/XP systems

Message ID 20130130154602.GL23505@n2100.arm.linux.org.uk

Commit Message

Russell King - ARM Linux Jan. 30, 2013, 3:46 p.m. UTC
On Wed, Jan 30, 2013 at 04:36:59PM +0100, Thomas Petazzoni wrote:
> Dear Russell King - ARM Linux,
> 
> On Wed, 30 Jan 2013 15:19:34 +0000, Russell King - ARM Linux wrote:
> > diff --git a/arch/arm/include/asm/mach/pci.h b/arch/arm/include/asm/mach/pci.h
> > index db9fedb..bba0cf3 100644
> > --- a/arch/arm/include/asm/mach/pci.h
> > +++ b/arch/arm/include/asm/mach/pci.h
> > @@ -29,6 +29,7 @@ struct hw_pci {
> >  	void		(*postinit)(void);
> >  	u8		(*swizzle)(struct pci_dev *dev, u8 *pin);
> >  	int		(*map_irq)(const struct pci_dev *dev, u8 slot, u8 pin);
> > +	resource_size_t	(*window_align)(struct pci_bus *, unsigned long);
> >  };
> 
> Didn't you say just yesterday that you would prefer a numerical value
> rather than a hook that could do random things? :-)

Hrh, right. :)

And it also helps if I modify the right struct too!  Try this instead.
Overwrite sys->win_align_mem / sys->win_align_io in the setup function
as required.
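
E.g. something like this in your ->setup hook (untested sketch - the
function name is made up, and pick whatever mem alignment your windows
actually need):

static int mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
{
	/* Marvell MBUS windows want 64K-aligned I/O bridge windows */
	sys->win_align_io = SZ_64K;
	sys->win_align_mem = SZ_1M;

	/* ... existing window/resource setup ... */
	return 1;
}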

 arch/arm/include/asm/mach/pci.h |    2 ++
 arch/arm/kernel/bios32.c        |   17 +++++++++++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

Comments

Thomas Petazzoni Jan. 31, 2013, 2:30 p.m. UTC | #1
Dear Russell King - ARM Linux,

On Wed, 30 Jan 2013 15:46:02 +0000, Russell King - ARM Linux wrote:

> +resource_size_t pcibios_window_alignment(struct pci_bus *bus,
> +					 unsigned long type)
> +{
> +	struct pci_sys_data *sys = bus->sysdata;
> +
> +	/* Ignore downstream buses */
> +	if (!bus->parent) {
> +		if (type & IORESOURCE_MEM)
> +			return sys->win_align_mem;
> +		if (type & IORESOURCE_IO)
> +			return sys->win_align_io;
> +	}
> +	return 1;
> +}
> +

Unfortunately, this doesn't work as is for me: the if (!bus->parent)
prevents the thing from being effective. Here is my lspci output:

# /usr/sbin/lspci 
00:00.0 Host bridge: Marvell Technology Group Ltd. Device 102d
00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:03.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:04.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:05.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:06.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

And the function pcibios_window_alignment() only gets called for bus 1,
2, 3, 4, 5, 6 and never for bus 0. And therefore, the !bus->parent test
is always false. So, if I keep your implementation, I have the following
addresses assigned to my bridges:

pci 0000:00:03.0: PCI bridge to [bus 03]
pci 0000:00:03.0:   bridge window [io  0xc0000000-0xc0000fff]
pci 0000:00:03.0:   bridge window [mem 0xc1000000-0xc10fffff]
pci 0000:00:03.0:   bridge window [mem 0xc1100000-0xc11fffff pref]
pci 0000:00:05.0: PCI bridge to [bus 05]
pci 0000:00:05.0:   bridge window [io  0xc0001000-0xc0001fff]
pci 0000:00:05.0:   bridge window [mem 0xc1200000-0xc12fffff]
pci 0000:00:05.0:   bridge window [mem 0xc1300000-0xc13fffff pref]

Notice how the io window of the second bridge starts 4K after the io
window of the first bridge. Which cannot work on Marvell SoC, due to
the 64KB alignment.

If, however, I remove the !bus->parent test, the I/O addresses
correctly take into account the 64K requirement:

pci 0000:00:03.0: PCI bridge to [bus 03]
pci 0000:00:03.0:   bridge window [io  0xc0000000-0xc000ffff]
pci 0000:00:03.0:   bridge window [mem 0xc1000000-0xc10fffff]
pci 0000:00:03.0:   bridge window [mem 0xc1100000-0xc11fffff pref]
pci 0000:00:05.0: PCI bridge to [bus 05]
pci 0000:00:05.0:   bridge window [io  0xc0010000-0xc001ffff]
pci 0000:00:05.0:   bridge window [mem 0xc1200000-0xc12fffff]
pci 0000:00:05.0:   bridge window [mem 0xc1300000-0xc13fffff pref]

Any idea?

Thanks,

Thomas
Russell King - ARM Linux Jan. 31, 2013, 2:50 p.m. UTC | #2
On Thu, Jan 31, 2013 at 03:30:41PM +0100, Thomas Petazzoni wrote:
> Dear Russell King - ARM Linux,
> 
> On Wed, 30 Jan 2013 15:46:02 +0000, Russell King - ARM Linux wrote:
> 
> > +resource_size_t pcibios_window_alignment(struct pci_bus *bus,
> > +					 unsigned long type)
> > +{
> > +	struct pci_sys_data *sys = bus->sysdata;
> > +
> > +	/* Ignore downstream buses */
> > +	if (!bus->parent) {
> > +		if (type & IORESOURCE_MEM)
> > +			return sys->win_align_mem;
> > +		if (type & IORESOURCE_IO)
> > +			return sys->win_align_io;
> > +	}
> > +	return 1;
> > +}
> > +
> 
> Unfortunately, this doesn't work as is for me: the if (!bus->parent)
> prevents the thing from being effective. Here is my lspci output:
> 
> # /usr/sbin/lspci 
> 00:00.0 Host bridge: Marvell Technology Group Ltd. Device 102d
> 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:03.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:04.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:05.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:06.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
> 05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
> 
> And the function pcibios_window_alignment() only gets called for bus 1,
> 2, 3, 4, 5, 6 and never for bus 0.

That's the exact reverse of what I'd expect: the child buses should
have a non-NULL parent pointer.  Hmm.  Try changing that to !bus->self -
that should make it effective only on the host bridge.

But... hang on...

/*
 * Returns true if the pci bus is root (behind host-pci bridge),
 * false otherwise
 */
static inline bool pci_is_root_bus(struct pci_bus *pbus)
{
        return !(pbus->parent);
}

So the original test _is_ correct, and should only be triggered for
the _root_ bus, that being bus 0 in the above case.

But... wait a moment, what are you saying?  Which bridges need this
fixup?  The Marvell PCI-to-PCI bridges or the host bridge?
Thomas Petazzoni Jan. 31, 2013, 2:57 p.m. UTC | #3
Dear Russell King - ARM Linux,

On Thu, 31 Jan 2013 14:50:02 +0000, Russell King - ARM Linux wrote:

> > > +	/* Ignore downstream buses */
> > > +	if (!bus->parent) {
> > > +		if (type & IORESOURCE_MEM)
> > > +			return sys->win_align_mem;
> > > +		if (type & IORESOURCE_IO)
> > > +			return sys->win_align_io;
> > > +	}
> > > +	return 1;
> > > +}
> > > +
> > 
> > Unfortunately, this doesn't work as is for me: the if (!bus->parent)
> > prevents the thing from being effective. Here is my lspci output:
> > 
> > # /usr/sbin/lspci 
> > 00:00.0 Host bridge: Marvell Technology Group Ltd. Device 102d
> > 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 00:03.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 00:04.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 00:05.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 00:06.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> > 03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
> > 05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
> > 
> > And the function pcibios_window_alignment() only gets called for bus 1,
> > 2, 3, 4, 5, 6 and never for bus 0.
> 
> That's the exact reverse of what I'd expect: the child buses should
> have a non-NULL parent pointer.

Indeed. But this function never gets called with bus->number == 0, only
with bus->number = 1, 2, 3, 4, 5, 6. So those are child busses, and
therefore they have a parent.

If I add a debug message in this pcibios_window_alignment() function
(which gets shown unconditionally, i.e. the debug message is outside the
if condition we are discussing), then I get:

pcibios_window_alignment: called for bus ef371c00 (sysdata=ef2f6bc0), number 1
pcibios_window_alignment: called for bus ef371c00 (sysdata=ef2f6bc0), number 1
pcibios_window_alignment: called for bus ef371c00 (sysdata=ef2f6bc0), number 1
pcibios_window_alignment: called for bus ef371a00 (sysdata=ef2f6bc0), number 2
pcibios_window_alignment: called for bus ef371a00 (sysdata=ef2f6bc0), number 2
pcibios_window_alignment: called for bus ef371a00 (sysdata=ef2f6bc0), number 2
pcibios_window_alignment: called for bus ef371800 (sysdata=ef2f6bc0), number 3
pcibios_window_alignment: called for bus ef371800 (sysdata=ef2f6bc0), number 3
pcibios_window_alignment: called for bus ef371800 (sysdata=ef2f6bc0), number 3
pcibios_window_alignment: called for bus ef371600 (sysdata=ef2f6bc0), number 4
pcibios_window_alignment: called for bus ef371600 (sysdata=ef2f6bc0), number 4
pcibios_window_alignment: called for bus ef371600 (sysdata=ef2f6bc0), number 4
pcibios_window_alignment: called for bus ef371400 (sysdata=ef2f6bc0), number 5
pcibios_window_alignment: called for bus ef371400 (sysdata=ef2f6bc0), number 5
pcibios_window_alignment: called for bus ef371400 (sysdata=ef2f6bc0), number 5
pcibios_window_alignment: called for bus ef371200 (sysdata=ef2f6bc0), number 6
pcibios_window_alignment: called for bus ef371200 (sysdata=ef2f6bc0), number 6
pcibios_window_alignment: called for bus ef371200 (sysdata=ef2f6bc0), number 6

See, it never gets called for bus number 0.

> Hmm.  Try changing that to !bus->self -
> that should make it effective only on the host bridge.
> 
> But... hang on...
> 
> /*
>  * Returns true if the pci bus is root (behind host-pci bridge),
>  * false otherwise
>  */
> static inline bool pci_is_root_bus(struct pci_bus *pbus)
> {
>         return !(pbus->parent);
> }
> 
> So the original test _is_ correct, and should only be triggered for
> the _root_ bus, that being bus 0 in the above case.

Except that this pcibios_window_alignment() function is apparently
never called for the root bus.

> 
> But... wait a moment, what are you saying?  Which bridges need this
> fixup?  The Marvell PCI-to-PCI bridges or the host bridge?

I am talking about the PCI-to-PCI bridges. I want the I/O windows
assigned to each PCI-to-PCI bridge to be 64K aligned. The PCI-to-PCI
bridges are devices that sit on bus 0, each giving access to the child
buses 1, 2, 3, 4, 5, 6.

I have the impression that the pcibios_window_alignment() function is
called on the *child* bus to know the requested alignments for the
bridge that sits on the parent bus and gives access to this child bus.

Best regards,

Thomas
Russell King - ARM Linux Jan. 31, 2013, 3:08 p.m. UTC | #4
On Thu, Jan 31, 2013 at 03:57:37PM +0100, Thomas Petazzoni wrote:
> Indeed. But this function never gets called with bus->number == 0, only
> with bus->number = 1, 2, 3, 4, 5, 6. So those are child busses, and
> therefore they have a parent.

Having thought about it, yes, that's what I expect, because root bus
resources are never resized.  Root bus resources are normally setup
before probing to indicate what size they _can_ be and define what
space is available to the downstream devices.

Child busses (behind a PCI-2-PCI bridge) are a different matter - these
will be adjusted according to their on-bus devices and the windows for
them sized and allocated appropriately _within_ the confines of the
root bus resource.

> I am talking about the PCI-to-PCI bridges. I want the I/O windows
> assigned to each PCI-to-PCI bridge to be 64K aligned. The PCI-to-PCI
> bridges are devices that sit on bus 0, each giving access to the child
> buses 1, 2, 3, 4, 5, 6.

Right, so you've just confirmed that this _is_ the right hook and it
_is_ being called at the right time.

However, I had interpreted your requirement as the _host_ bridge only
(insufficient information in your previous emails, or I missed it).

If that's what your bridge requires, then we need to detect it via
its vendor and device IDs and only apply this fixup to those bridges
which require a 64K alignment.

So, the IDs are vendor:device = 0x11ab:0x1092 ?  And let me get this
straight, it _is_ a specific requirement for this particular P2P
bridge?
Thomas Petazzoni Jan. 31, 2013, 3:22 p.m. UTC | #5
Dear Russell King - ARM Linux,

On Thu, 31 Jan 2013 15:08:01 +0000, Russell King - ARM Linux wrote:
> On Thu, Jan 31, 2013 at 03:57:37PM +0100, Thomas Petazzoni wrote:
> > Indeed. But this function never gets called with bus->number == 0, only
> > with bus->number = 1, 2, 3, 4, 5, 6. So those are child busses, and
> > therefore they have a parent.
> 
> Having thought about it, yes, that's what I expect, because root bus
> resources are never resized.  Root bus resources are normally setup
> before probing to indicate what size they _can_ be and define what
> space is available to the downstream devices.
> 
> Child busses (behind a PCI-2-PCI bridge) are a different matter - these
> will be adjusted according to their on-bus devices and the windows for
> them sized and allocated appropriately _within_ the confines of the
> root bus resource.

Ok.

> > I am talking about the PCI-to-PCI bridges. I want the I/O windows
> > assigned to each PCI-to-PCI bridge to be 64K aligned. The PCI-to-PCI
> > bridges are devices that sit on bus 0, each giving access to the child
> > buses 1, 2, 3, 4, 5, 6.
> 
> Right, so you've just confirmed that this _is_ the right hook and it
> _is_ being called at the right time.

Ok.

> However, I had interpreted your requirement as the _host_ bridge only
> (insufficient information in your previous emails, or I missed it).

Might be insufficient information, or wrong terminology on my side.

> If that's what your bridge requires, then we need to detect it via
> its vendor and device IDs and only apply this fixup to those bridges
> which require a 64K alignment.
> 
> So, the IDs are vendor:device = 0x11ab:0x1092 ?  And let me get this
> straight, it _is_ a specific requirement for this particular P2P
> bridge?

The vendor/device ID will change. This P2P bridge is emulated. However,
I'm not sure why you need to know the vendor:device ID to make the
fixup specific.

The fixup is already specific to those bridges, since I am just setting
pci_sys_data->win_align_io to 64K for the particular buses that are
downstream the problematic bridges. So it doesn't affect any other bus
on the system, and therefore I don't think this fixup needs to be made
specific to a given vendor:device, no?

Best regards,

Thomas
Russell King - ARM Linux Jan. 31, 2013, 3:36 p.m. UTC | #6
On Thu, Jan 31, 2013 at 04:22:37PM +0100, Thomas Petazzoni wrote:
> Dear Russell King - ARM Linux,
> 
> On Thu, 31 Jan 2013 15:08:01 +0000, Russell King - ARM Linux wrote:
> > If that's what your bridge requires, then we need to detect it via
> > its vendor and device IDs and only apply this fixup to those bridges
> > which require a 64K alignment.
> > 
> > So, the IDs are vendor:device = 0x11ab:0x1092 ?  And let me get this
> > straight, it _is_ a specific requirement for this particular P2P
> > bridge?
> 
> The vendor/device ID will change. This P2P bridge is emulated. However,
> I'm not sure why you need to know the vendor:device ID to make the
> fixup specific.
> 
> The fixup is already specific to those bridges, since I am just setting
> pci_sys_data->win_align_io to 64K for the particular buses that are
> downstream the problematic bridges. So it doesn't affect any other bus
> on the system, and therefore I don't think this fixup needs to be made
> specific to a given vendor:device, no?

The pci_sys_data is not specific to one bus.  It's specific from the
root bus downwards, and is shared by all child busses.

The problem is if you have some card or a conventional P2P bridge which
has 4K windows.  If you merely set the alignment to 64K for all bridges,
then all bridges get this treatment whether or not they need it.  That's
what I'm trying to avoid.

Take, for instance, a cardbus bridge (remember, there are PCI cards which
can be plugged in to give you a cardbus slot.)  I have a device here which
can be plugged into a cardbus slot which has not just one P2P bridge but
two, and a bunch of downstream devices, including VGA, ethernet, USB, PS/2
etc.  (Okay, Linux doesn't support this hardware because of crappy X86
stuff, despite the fact Windows copes with it just fine.)

There have been cards in the past which have had P2P bridges on them as
well.

So, simply believing that the only P2P bridges in the system will be
those on the physical board is a mistake.
Thomas Petazzoni Jan. 31, 2013, 3:47 p.m. UTC | #7
Dear Russell King - ARM Linux,

On Thu, 31 Jan 2013 15:36:25 +0000, Russell King - ARM Linux wrote:

> > The fixup is already specific to those bridges, since I am just setting
> > pci_sys_data->win_align_io to 64K for the particular buses that are
> > downstream the problematic bridges. So it doesn't affect any other bus
> > on the system, and therefore I don't think this fixup needs to be made
> > specific to a given vendor:device, no?
> 
> The pci_sys_data is not specific to one bus.  It's specific from the
> root bus downwards, and is shared by all child busses.

Ah, ok, that's the part I was missing.

> The problem is if you have some card or a conventional P2P bridge which
> has 4K windows.  If you merely set the alignment to 64K for all bridges,
> then all bridges get this treatment whether or not they need it.  That's
> what I'm trying to avoid.
> 
> Take, for instance, a cardbus bridge (remember, there are PCI cards which
> can be plugged in to give you a cardbus slot.)  I have a device here which
> can be plugged into a cardbus slot which has not just one P2P bridge but
> two, and a bunch of downstream devices, including VGA, ethernet, USB, PS/2
> etc.  (Okay, Linux doesn't support this hardware because of crappy X86
> stuff, despite the fact Windows copes with it just fine.)
> 
> There have been cards in the past which have had P2P bridges on them as
> well.
> 
> So, simply believing that the only P2P bridges in the system will be
> those on the physical board is a mistake.

Yes, indeed, I understand this. I just thought this pci_sys_data
structure was per-bus. Of course, if it's shared by all buses on the
system, we need a way to apply this fixup only to the Marvell bridges.

Should I just hard-code this special fixup in
pcibios_window_alignment() with a check on VID/PID ?

Thanks,

Thomas
Russell King - ARM Linux Jan. 31, 2013, 3:48 p.m. UTC | #8
On Thu, Jan 31, 2013 at 04:47:35PM +0100, Thomas Petazzoni wrote:
> Dear Russell King - ARM Linux,
> 
> On Thu, 31 Jan 2013 15:36:25 +0000, Russell King - ARM Linux wrote:
> 
> > > The fixup is already specific to those bridges, since I am just setting
> > > pci_sys_data->win_align_io to 64K for the particular buses that are
> > > downstream the problematic bridges. So it doesn't affect any other bus
> > > on the system, and therefore I don't think this fixup needs to be made
> > > specific to a given vendor:device, no?
> > 
> > The pci_sys_data is not specific to one bus.  It's specific from the
> > root bus downwards, and is shared by all child busses.
> 
> Ah, ok, that's the part I was missing.
> 
> > The problem is if you have some card or a conventional P2P bridge which
> > has 4K windows.  If you merely set the alignment to 64K for all bridges,
> > then all bridges get this treatment whether or not they need it.  That's
> > what I'm trying to avoid.
> > 
> > Take, for instance, a cardbus bridge (remember, there are PCI cards which
> > can be plugged in to give you a cardbus slot.)  I have a device here which
> > can be plugged into a cardbus slot which has not just one P2P bridge but
> > two, and a bunch of downstream devices, including VGA, ethernet, USB, PS/2
> > etc.  (Okay, Linux doesn't support this hardware because of crappy X86
> > stuff, despite the fact Windows copes with it just fine.)
> > 
> > There have been cards in the past which have had P2P bridges on them as
> > well.
> > 
> > So, simply believing that the only P2P bridges in the system will be
> > those on the physical board is a mistake.
> 
> Yes, indeed, I understand this. I just thought this pci_sys_data
> structure was per-bus. Of course, if it's shared by all buses on the
> system, we need a way to apply this fixup only to the Marvell bridges.
> 
> Should I just hard-code this special fixup in
> pcibios_window_alignment() with a check on VID/PID ?

Yes please.
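
Something along these lines (sketch only - 0x1092 is the ID you quoted,
substitute whatever the emulated bridge ends up using):

resource_size_t pcibios_window_alignment(struct pci_bus *bus,
					 unsigned long type)
{
	struct pci_dev *bridge = bus->self;

	/* only Marvell's emulated P2P bridges need 64K-aligned I/O */
	if (bridge && bridge->vendor == PCI_VENDOR_ID_MARVELL &&
	    bridge->device == 0x1092 && (type & IORESOURCE_IO))
		return SZ_64K;

	return 1;
}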
Arnd Bergmann Jan. 31, 2013, 4:18 p.m. UTC | #9
On Thursday 31 January 2013 15:36:25 Russell King - ARM Linux wrote:
> The pci_sys_data is not specific to one bus.  It's specific from the
> root bus downwards, and is shared by all child busses.
> 
> The problem is if you have some card or a conventional P2P bridge which
> has 4K windows.  If you merely set the alignment to 64K for all bridges,
> then all bridges get this treatment whether or not they need it.  That's
> what I'm trying to avoid.
> 
> Take, for instance, a cardbus bridge (remember, there are PCI cards which
> can be plugged in to give you a cardbus slot.)  I have a device here which
> can be plugged into a cardbus slot which has not just one P2P bridge but
> two, and a bunch of downstream devices, including VGA, ethernet, USB, PS/2
> etc.  (Okay, Linux doesn't support this hardware because of crappy X86
> stuff, despite the fact Windows copes with it just fine.)
> 
> There have been cards in the past which have had P2P bridges on them as
> well.
> 
> So, simply believing that the only P2P bridges in the system will be
> those on the physical board is a mistake.

I was going to write something similar. Actually I think it's worse because
the case of an extra P2P bridge is quite likely for devices that actually
use I/O space, given that the use of I/O space is deprecated on PCIe.

This also means that a lot of devices using I/O space are legacy crap
and have random bugs regarding PCI standard compliance. I would not
expect those devices in general to do the right thing when I/O ports
beyond 65535 are used, although a lot of them would work.

For all I could tell, the safest solution with the I/O space would
be to pretend we had a shared 64K I/O space for all of the PCIe
ports on Armada XP, and map a separate 64K window for each port
using a different io_offset for each one.
This way, you can have a device on the second PCIe port use e.g. I/O
port number 0x3f8 for a legacy UART on the bus, which gets translated
into the Linux-visible port number 0x103f8.

The currently used method to have io_offset=0 for all PCIe ports
and use separate I/O port ranges of 64K for each PCIe port probably
still works for most devices, except those where we hardcode a port
number in the Linux device driver, or where the high address bits
don't get decoded properly.

	Arnd
Jason Gunthorpe Jan. 31, 2013, 6:02 p.m. UTC | #10
On Thu, Jan 31, 2013 at 05:18:46PM +0100, Arnd Bergmann wrote:

> For all I could tell, the safest solution with the I/O space would
> be to pretend we had a shared 64K I/O space for all of the PCIe
> ports on Armada XP, and map a separate 64K window for each port
> using a different io_offset for each one.
> This way, you can have a device on the second PCIe port use e.g. I/O
> port number 0x3f8 for a legacy UART on the bus, which gets translated
> into the Linux-visible port number 0x103f8.
> 
> The currently used method to have io_offset=0 for all PCIe ports
> and use separate I/O port ranges of 64K for each PCIe port probably
> still works for most devices, except those where we hardcode a port
> number in the Linux device driver, or where the high address bits
> don't get decoded properly.

Thinking about this some more, which of these methods to choose is
going to be dictated by what the Marvell HW does.

Since the IO space in the TLP is a full 32 bits, it matters what 32
bit value the HW PCI core places in the IO Rd/Wr transaction. This
value *must* match the value given to the Linux PCI core for resource
allocation, because it must be correctly programmed by Linux into the
downstream BARs/bridge windows.

So there are probably two choices for what the HW does, given a MBUS
window of 0xDEAD0000 -> 0xDEADFFFF set for IO, a read from physical
address 0xDEAD0000 produces an IO Rd TLP with either '0x00000000' or
'0xDEAD0000' in the address field.

If it is 0xDEAD0000, then Thomas has to keep what he has now, you
can't mess with this address. Verify that the full 32 bit address
exactly matching the MBUS window address is written to the PCI-PCI
bridge IO base/limit registers.

If it is 0x00000000 then the mmap scheme I outlined before must be
used, and verify that only 0->0xFFFF is written to the PCI-PCI bridge
IO base/limit registers..

My guess is the Marvell PCI-E copies whatever address it is given into
the IO TLP, so it would be the 0xDEAD0000 behaviour, however I bet you
can use the MBUS window target address remapping feature to get the
0x00000000 behaviour as well (though there are a limited number of
remappable MBUS windows, so that is probably not a good idea)

Jason
Arnd Bergmann Jan. 31, 2013, 8:46 p.m. UTC | #11
On Thursday 31 January 2013, Jason Gunthorpe wrote:
> Thinking about this some more, which of these methods to choose is
> going to be dictated by what the Marvell HW does.
> 
> Since the IO space in the TLP is a full 32 bits, it matters what 32
> bit value the HW PCI core places in the IO Rd/Wr transaction. This
> value must match the value given to the Linux PCI core for resource
> allocation, because it must be correctly programmed by Linux into the
> downstream BARs/bridge windows.
> 
> So there are probably two choices for what the HW does, given a MBUS
> window of 0xDEAD0000 -> 0xDEADFFFF set for IO, a read from physical
> address 0xDEAD0000 produces an IO Rd TLP with either '0x00000000' or
> '0xDEAD0000' in the address field.
> 
> If it is 0xDEAD0000, then Thomas has to keep what he has now, you
> can't mess with this address. Verify that the full 32 bit address
> exactly matching the MBUS window address is written to the PCI-PCI
> bridge IO base/limit registers.

If you do this, you break all sorts of expectations in the kernel and
I guess you'd have to set the io_offset value of that bus to 0x21530000
in order to make Linux I/O port 0 go to the first byte of the window
and come out as 0xDEAD0000 on the bus, but you still won't be able to
use legacy devices with hardcoded I/O port numbers.

> If it is 0x00000000 then the mmap scheme I outlined before must be
> used, and verify that only 0->0xFFFF is written to the PCI-PCI bridge
> IO base/limit registers..

For the primary bus, yes, but there are still two options for the
second one: you can either start at 0 again or you can continue
at 0x10000 as we do for mv78xx0 and kirkwood for instance. Both
approaches probably have their merit.

	Arnd
Jason Gunthorpe Jan. 31, 2013, 10:44 p.m. UTC | #12
On Thu, Jan 31, 2013 at 08:46:22PM +0000, Arnd Bergmann wrote:

> > If it is 0xDEAD0000, then Thomas has to keep what he has now, you
> > can't mess with this address. Verify that the full 32 bit address
> > exactly matching the MBUS window address is written to the PCI-PCI
> > bridge IO base/limit registers.
> 
> If you do this, you break all sorts of expectations in the kernel and
> I guess you'd have to set the io_offset value of that bus to 0x21530000
> in order to make Linux I/O port 0 go to the first byte of the window
> and come out as 0xDEAD0000 on the bus, but you still won't be able to
> use legacy devices with hardcoded I/O port numbers.

I'm not sure exactly how the PCI core handles this, but it does look
like pci_add_resource_offset via io_offset is the answer. I'm not sure
what goes in the struct resource passed to the PCI core - the *bus* IO
address range or the *kernel* IO address range..

> > If it is 0x00000000 then the mmap scheme I outlined before must be
> > used, and verify that only 0->0xFFFF is written to the PCI-PCI bridge
> > IO base/limit registers..
> 
> For the primary bus, yes, but there are still two options for the
> second one: you can either start at 0 again or you can continue

No, for *all* links. You use a mmap scheme with 4k granularity, I
explained in a past email, but to quickly review..

- Each link gets 64k of reserved physical address space for IO,
  this is just set aside, no MBUS windows are permanently assigned.
- Linux is told to use a 64k IO range with bus IO address 0->0xFFFF
- When the IO base/limit register in the link PCI-PCI bridge is programmed
  the driver gets a 4k aligned region somewhere from 0->0xFFFF and then:
    - Allocates a 64k MBUS window that translates physical address
      0xZZZZxxxx to IO bus address 0x0000xxxx (goes in the TLP) for
      that link
    - Uses pci_ioremap_io to map the fraction of the link's 64k MBUS window
      allocated to that bridge to the correct offset in the 
      PCI_IO_VIRT_BASE region

So you'd end up with a MMU mapping something like:
  PCI_IO_VIRT_BASE    MBUS_IO_PHYS_BASE
    0->4k          => 0      -> 4k             // 4k assigned to link0
    4k->8k         => 64k+4k -> 64k+8k         // 4k assigned to link1
    8k->24k        => 128k+8k -> 128k+24k      // 8k assigned to link2

Where the physical mbus window for each link starts on each 64k block.
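
In rough code, per link, something like this (sketch only -
setup_mbus_io_window() and pci_ioremap_io_size() are made-up names;
today's pci_ioremap_io() can only map a full 64k):

	/* each link owns a remap-capable 64k MBUS window; the remap
	   makes bus IO address 0 correspond to the window base, so
	   the TLP address always stays within 16 bits */
	phys = MBUS_IO_PHYS_BASE + link * SZ_64K;
	setup_mbus_io_window(link, phys, SZ_64K, /* remap to bus */ 0x0);

	/* map only the 4k-aligned fraction the PCI core assigned to
	   this bridge into the shared PCI_IO_VIRT_BASE area */
	pci_ioremap_io_size(res->start, phys + res->start,
			    resource_size(res));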

Thomas: This solves the need to have alignment of the IO regions, and
gets rid of any trouble with 32 bit IO addresses, however you'll need
to allocate the remap capable mbus windows separately for use by IO
mappings..

Though, there is still a problem with the MMIO mbus window
alignment. mbus windows are aligned to a multiple of their size, PCI
MMIO bridge windows are always aligned to 1M...

> at 0x10000 as we do for mv78xx0 and kirkwood for instance. Both
> approaches probably have their merit.

Kirkwood uses the MBUS remapping registers to set the TLP address of
link 0 to start at 0 and of link 1 to start at 0x10000 - so it is
consistent with what you describe..

However, this is a suboptimal way to run the HW. It would be much
better to place each link in a separate PCI domain and have each link
start its bus IO address at 0, and assign the kernel IO address in
sequential 64k blocks as today.

Though, it is my hope that Thomas's driver will work on Kirkwood as
well...

Jason
Arnd Bergmann Feb. 1, 2013, 11:30 a.m. UTC | #13
On Thursday 31 January 2013, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2013 at 08:46:22PM +0000, Arnd Bergmann wrote:
> 
> > > If it is 0xDEAD0000, then Thomas has to keep what he has now, you
> > > can't mess with this address. Verify that the full 32 bit address
> > > exactly matching the MBUS window address is written to the PCI-PCI
> > > bridge IO base/limit registers.
> > 
> > If you do this, you break all sorts of expectations in the kernel and
> > I guess you'd have to set the io_offset value of that bus to 0x21530000
> > in order to make Linux I/O port 0 go to the first byte of the window
> > and come out as 0xDEAD0000 on the bus, but you still won't be able to
> > use legacy devices with hardcoded I/O port numbers.
> 
> I'm not sure exactly how the PCI core handles this, but it does look
> like pci_add_resource_offset via io_offset is the answer. I'm not sure
> what goes in the struct resource passed to the PCI core - the *bus* IO
> address range or the *kernel* IO address range..

IO Resources are always expressed in the kernel's view, so they are in
the range from 0 to IO_SPACE_LIMIT. The idea is that you can have multiple
buses that each have their own address space start at 0, but can put
them into the kernel address space at a different address.

Each device on any bus can still use I/O addresses starting at zero,
and you could have e.g. a VGA card on two buses each respond to I/O cycles
on port 0x3c0, but the PCI core will translate the resources to appear
in the kernel space at 0x103c0 for the second one.

> > > If it is 0x00000000 then the mmap scheme I outlined before must be
> > > used, and verify that only 0->0xFFFF is written to the PCI-PCI bridge
> > > IO base/limit registers..
> > 
> > For the primary bus, yes, but there are still two options for the
> > second one: you can either start at 0 again or you can continue
> 
> No, for *all* links. You use a mmap scheme with 4k granularity, I
> explained in a past email, but to quickly review..
> 
> - Each link gets 64k of reserved physical address space for IO,
>   this is just set aside, no MBUS windows are permanently assigned.
> - Linux is told to use a 64k IO range with bus IO address 0->0xFFFF
> - When the IO base/limit register in the link PCI-PCI bridge is programmed
>   the driver gets a 4k aligned region somewhere from 0->0xFFFF and then:
>     - Allocates a 64k MBUS window that translates physical address
>       0xZZZZxxxx to IO bus address 0x0000xxxx (goes in the TLP) for
>       that link
>     - Uses pci_ioremap_io to map the fraction of the link's 64k MBUS window
>       allocated to that bridge to the correct offset in the 
>       PCI_IO_VIRT_BASE region

We'd have to change pci_ioremap_io to allow mapping less than 64k, but
yes, that would work, too. I don't see an advantage to it though,
other than having io_offset always be zero.

> > at 0x10000 as we do for mv78xx0 and kirkwood for instance. Both
> > approaches probably have their merit.
> 
> Kirkwood uses the MBUS remapping registers to set the TLP address of
> link 0 to start at 0 and of link 1 to start at 0x10000 - so it is
> consistent with what you describe..

Right, so it also uses io_offset = 0 all the time, which means the
bus I/O port numbers are identical to the Linux I/O port numbers,
but they go beyond 64K on the bus on the second and later links.

> However, this is a suboptimal way to run the HW. It would be much
> better to place each link in a seperate PCI domain and have each link
> start its bus IO address at 0, and assign the kernel IO address in
> sequential 64k blocks as today.

I agree.

	Arnd
Jason Gunthorpe Feb. 1, 2013, 7:52 p.m. UTC | #14
On Fri, Feb 01, 2013 at 11:30:18AM +0000, Arnd Bergmann wrote:

> IO Resources are always expressed in the kernel's view, so they are
> in the range from 0 to IO_SPACE_LIMIT. The idea is that you can have
> multiple buses that each have their own address space start at 0,
> but can put them into the kernel address space at a different
> address.

Sure, I see that.. and that seems reasonable as long as any IO bus
address aliases are put in separate PCI domains. It would be wonky if
devices on different PCI bus numbers in a single PCI domain had
overlaping IO addresses.

> > No, for *all* links. You use a mmap scheme with 4k granularity, I
> > explained in a past email, but to quickly review..
> > 
> > - Each link gets 64k of reserved physical address space for IO,
> >   this is just set aside, no MBUS windows are permanently assigned.
> > - Linux is told to use a 64k IO range with bus IO address 0->0xFFFF
> > - When the IO base/limit register in the link PCI-PCI bridge is programmed
> >   the driver gets a 4k aligned region somewhere from 0->0xFFFF and then:
> >     - Allocates a 64k MBUS window that translates physical address
> >       0xZZZZxxxx to IO bus address 0x0000xxxx (goes in the TLP) for
> >       that link
> >     - Uses pci_ioremap_io to map the fraction of the link's 64k MBUS window
> >       allocated to that bridge to the correct offset in the 
> >       PCI_IO_VIRT_BASE region
> 
> We'd have to change pci_ioremap_io to allow mapping less than 64k, but
> yes, that would work, too. I don't see an advantage to it though,
> other than having io_offset always be zero.

Erm, that is the whole point. No PCI device in the system, on any of
the 10 links, would be required to use a 32 bit IO address. All are 16
bit and there is no compatibility problem on any links. You don't need
to declare any one link as being 'io supporting' or something like
that, it just works out of the box.

Jason
Thomas Petazzoni Feb. 6, 2013, 4:51 p.m. UTC | #15
Dear Jason Gunthorpe,

On Thu, 31 Jan 2013 15:44:59 -0700, Jason Gunthorpe wrote:

> > For the primary bus, yes, but there are still two options for the
> > second one: you can either start at 0 again or you can continue
> 
> No, for *all* links. You use a mmap scheme with 4k granularity, I
> explained in a past email, but to quickly review..
> 
> - Each link gets 64k of reserved physical address space for IO,
>   this is just set aside, no MBUS windows are permanently assigned.
> - Linux is told to use a 64k IO range with bus IO address 0->0xFFFF
> - When the IO base/limit register in the link PCI-PCI bridge is programmed
>   the driver gets a 4k aligned region somewhere from 0->0xFFFF and then:
>     - Allocates a 64k MBUS window that translates physical address
>       0xZZZZxxxx to IO bus address 0x0000xxxx (goes in the TLP) for
>       that link
>     - Uses pci_ioremap_io to map the fraction of the link's 64k MBUS window
>       allocated to that bridge to the correct offset in the 
>       PCI_IO_VIRT_BASE region

This, I think I now understand.

> So you'd end up with a MMU mapping something like:
>   PCI_IO_VIRT_BASE    MBUS_IO_PHYS_BASE
>     0->4k          => 0      -> 4k             // 4k assigned to link0
>     4k->8k         => 64k+4k -> 64k+8k         // 4k assigned to link1
>     8k->24k        => 128k+8k -> 128k+24k      // 8k assigned to link2

I am not sure I understand your example, starting at the second line.
Shouldn't the second line have been

      4k->8k         => 64k -> 64k+4k

 ?

If you do:

      4k->8k         => 64k+4k -> 64k+8k

as you suggested, then when the device driver does an inl(0x4) on
this device, the device will receive the equivalent of an inl(0x1004),
no?

I understand that I have two choices here:

 * First one is to make the I/O regions of all PCIe links fit below the
   default IO_SPACE_LIMIT (0xffff) by doing the mapping trick you
   described above.

 * Second one is to have one 64 KB block for each PCIe link, which
   would require raising the IO_SPACE_LIMIT on this platform.

Is this correct?

If so, then what I don't understand is that Kirkwood does the second
thing (from arch/arm/mach-kirkwood/pcie.c) :

        switch (index) {
        case 0:
		[...]
		/* Here the code is mapping 0 -> 64k */
                pci_ioremap_io(SZ_64K * sys->busnr, KIRKWOOD_PCIE_IO_PHYS_BASE);
                break;
        case 1:
		[...]
		/* And here 64k -> 128k */
                pci_ioremap_io(SZ_64K * sys->busnr,
                               KIRKWOOD_PCIE1_IO_PHYS_BASE);
                break;

So it has PCI I/O space from 0 to 128k, but still it seems to use the
default IO_SPACE_LIMIT of 0xffff. How can this work? Maybe nobody ever
used a device on the second PCIe link that required I/O accesses.

> Where the physical mbus window for each link starts on each 64k block.
> 
> Thomas: This solves the need to have alignment of the IO regions, and
> gets rid of any trouble with 32 bit IO addresses, however you'll need
> to allocate the remap capable mbus windows separately for use by IO
> mappings..
> 
> Though, there is still a problem with the MMIO mbus window
> alignment. mbus windows are aligned to a multiple of their size, PCI
> MMIO bridge windows are always aligned to 1M...

Can't this be solved using the window_alignment() hook we've been
discussing separately? Just like we teach the Linux PCI core about our
alignment requirements of 64K for the I/O regions, we could teach it
about our alignment requirement on memory regions as well. No?

> Though, it is my hope that Thomas's driver will work on Kirkwood as
> well...

Yes, my plan is to have it working on Kirkwood. This weekend, I was
given a Kirkwood-based machine that has a usable PCIe device.

Best regards,

Thomas
Jason Gunthorpe Feb. 6, 2013, 5:09 p.m. UTC | #16
On Wed, Feb 06, 2013 at 05:51:28PM +0100, Thomas Petazzoni wrote:

> > So you'd end up with a MMU mapping something like:
> >   PCI_IO_VIRT_BASE    MBUS_IO_PHYS_BASE
> >     0->4k          => 0      -> 4k             // 4k assigned to link0
> >     4k->8k         => 64k+4k -> 64k+8k         // 4k assigned to link1
> >     8k->24k        => 128k+8k -> 128k+24k      // 8k assigned to link2
> 
> I am not sure I understand your example, starting at the second line.
> Shouldn't the second line have been
> 
>       4k->8k         => 64k -> 64k+4k

No..
 
> as you suggested, then when the device driver does an inl(0x4) on
> this device, the device will receive the equivalent of an inl(0x1004),
> no?

Link 0 translates like:

- Linux driver does inl(0x4)
- ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x4
- The CPU TLB converts that into a read from CPU physical
  0xc0000000 + 0x4
- The MBUS window remap register converts that into a read from IO
  space 0x4
- The address 0x4 is placed in the PCI-E IO transaction of link 0

Link 1 translates like:

- Linux driver does inl(0x1004)
- ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x1004
- The CPU TLB converts that into a read from CPU physical
  0xc0000000 + 0x11004 (ie the mbus window for the link 1)
- The MBUS window remap register converts that into a read from IO
  space 0x1004
- The address 0x1004 is placed in the PCI-E IO transaction of link 1

Noting that in both instances the IO address passed to inl is what
eventually appears on the PCI-E link after all the translation is
completed.

The CPU MMU is being used to route 4k-aligned ranges to the
correct link.

> I understand that I have two choices here:
> 
>  * First one is to make the I/O regions of all PCIe links fit below the
>    default IO_SPACE_LIMIT (0xffff) by doing the mapping trick you
>    described above.
> 
>  * Second one is to have one 64 KB block for each PCIe link, which
>    would require raising the IO_SPACE_LIMIT on this platform.

Yes, however, AFAIK this is the environment you should be running in:

#define IO_SPACE_LIMIT  ((resource_size_t)0xfffff)

Which is 5 f's not 4.

> > Though, there is still a problem with the MMIO mbus window
> > alignment. mbus windows are aligned to a multiple of their size, PCI
> > MMIO bridge windows are always aligned to 1M...
> 
> Can't this be solved using the window_alignment() hook we've been
> discussing separately? Just like we teach the Linux PCI core about our
> alignment requirements of 64K for the I/O regions, we could teach it
> about our alignment requirement on memory regions as well. No?

Hopefully :) As long as it can adjust the start and length you should
be fine.

Jason
Thomas Petazzoni Feb. 6, 2013, 5:18 p.m. UTC | #17
Dear Jason Gunthorpe,

On Wed, 6 Feb 2013 10:09:03 -0700, Jason Gunthorpe wrote:

> Link 0 translates like:
> 
> - Linux driver does inl(0x4)
> - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x4
> - The CPU TLB converts that into a read from CPU physical
>   0xc0000000 + 0x4
> - The MBUS window remap register converts that into a read from IO
>   space 0x4
> - The address 0x4 is placed in the PCI-E IO transaction of link 0
> 
> Link 1 translates like:
> 
> - Linux driver does inl(0x1004)
> - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x1004
> - The CPU TLB converts that into a read from CPU physical
>   0xc0000000 + 0x11004 (ie the mbus window for the link 1)
> - The MBUS window remap register converts that into a read from IO
>   space 0x1004
> - The address 0x1004 is placed in the PCI-E IO transaction of link 1

This last step is exactly what I thought would not work. If the PCIe
device has say 64 bytes of I/O space, then this 0x1004 PCI-E
transaction will be out of bounds, no?

Sorry, still learning how PCI works...

> > I understand that I have two choices here:
> > 
> >  * First one is to make the I/O regions of all PCIe links fit below the
> >    default IO_SPACE_LIMIT (0xffff) by doing the mapping trick you
> >    described above.
> > 
> >  * Second one is to have one 64 KB block for each PCIe link, which
> >    would require raising the IO_SPACE_LIMIT on this platform.
> 
> Yes, however, AFAIK this is the environment you should be running in:
> 
> #define IO_SPACE_LIMIT  ((resource_size_t)0xfffff)
> 
> Which is 5 f's not 4.

Aaah, you're right. My eyes got the number of f's wrong. So I have 16
times 64 KB. So why would I bother doing the MMU trick if I can just
nicely remap 64 KB for each PCIe link?

> > Can't this be solved using the window_alignment() hook we've been
> > discussing separately? Just like we teach the Linux PCI core about our
> > alignment requirements of 64K for the I/O regions, we could teach it
> > about our alignment requirement on memory regions as well. No?
> 
> Hopefully :) As long as it can adjust the start and length you should
> be fine.

Why would you need to adjust the length? If Linux allocates a 2 MB
resource on a 1 MB boundary, we simply increase the start address to
the next 2 MB boundary, and that's it. Why would the length need to
change?

Thanks,

Thomas
Jason Gunthorpe Feb. 6, 2013, 5:50 p.m. UTC | #18
On Wed, Feb 06, 2013 at 06:18:52PM +0100, Thomas Petazzoni wrote:
> Dear Jason Gunthorpe,
> 
> On Wed, 6 Feb 2013 10:09:03 -0700, Jason Gunthorpe wrote:
> 
> > Link 0 translates like:
> > 
> > - Linux driver does inl(0x4)
> > - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x4
> > - The CPU TLB converts that into a read from CPU physical
> >   0xc0000000 + 0x4
> > - The MBUS window remap register converts that into a read from IO
> >   space 0x4
> > - The address 0x4 is placed in the PCI-E IO transaction of link 0
> > 
> > Link 1 translates like:
> > 
> > - Linux driver does inl(0x1004)
> > - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x1004
> > - The CPU TLB converts that into a read from CPU physical
> >   0xc0000000 + 0x11004 (ie the mbus window for the link 1)
> > - The MBUS window remap register converts that into a read from IO
> >   space 0x1004
> > - The address 0x1004 is placed in the PCI-E IO transaction of link 1
> 
> This last step is exactly what I thought would not work. If the PCIe
> device has say 64 bytes of I/O space, then this 0x1004 PCI-E
> transaction will be out of bounds, no?

No.. PCI end devices are required to decode all 32 bits of address,
less the bits required for their allocation. So a device with 64 bytes
of IO will match bits 31:6 and then use bits 5:0 for the internal
register.
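
Ie. the decode is just:

	/* 64-byte I/O BAR programmed with base 'bar' */
	hit = (addr & ~63u) == bar;	/* compare bits 31:6 */
	reg =  addr &  63u;		/* bits 5:0 pick the register */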

So a full 32 bit address is technically fine, according to the spec,
however:
 - The 32 bit decode is an optional feature on bridges
 - Some devices are broken because x86 only uses the low 64k.

So for best compatibility it is ideal to put everything in the low
64k.

However, IO space really should not be used by anything except very
legacy devices, so if the MMU business is a hassle for some reason I'd
just go with the 64k aligned method.

> > > Can't this be solved using the window_alignment() hook we've
> > > discussing separately? Just like we teach the Linux PCI core about our
> > > alignment requirements of 64K for the I/O regions, we could teach it
> > > about our alignment requirement on memory regions as well. No?
> > 
> > Hopefully :) As long as it can adjust the start and length you should
> > be fine.
> 
> Why would you need to adjust the length? If Linux allocates a 2 MB
> resource on a 1 MB boundary, we simply increase the start address to
> the next 2 MB boundary, and that's it. Why would the length need to
> change?

Well, let's say 3MB is the example. A 3mb region needs to fit inside a
4mb MBUS window. If you align the start to 4mb then the pci-e core
needs to know that it can't use the extra 1mb covered by the mbus
window. mbus windows must not overlap.

Adjusting the bridge window length to be 4mb communicates that dead
space to the PCI core, and presumably this shows up in lspci and
whatnot.

I suppose if you align the end to 4mb (thus creating the gap before,
not after) things should work out OK, but the information that the gap
is routed to a specific PCI link is lost..

Jason
Thomas Petazzoni Feb. 6, 2013, 6:02 p.m. UTC | #19
Dear Jason Gunthorpe,

On Wed, 6 Feb 2013 10:50:19 -0700, Jason Gunthorpe wrote:

> No.. PCI end devices are required to decode all 32 bits of address,
> less the bits requires for their allocation. So a device with 64 bytes
> of IO will match bits 31:6 and then use bits 5:0 for the internal
> register.
> 
> So a full 32 bit address is technically fine, according to the spec,
> however:
>  - The 32 bit decode is an optional feature on bridges
>  - Some devices are broken because x86 only uses the low 64k.

Thanks again for the great explanation, it is now clear to me.

> So for best compatibility it is ideal to put everything in the low
> 64k.
> 
> However, IO space really should not be used by anything except very
> legacy devices, so if the MMU business is a hassle for some reason I'd
> just go with the 64k aligned method.

Me too :-)

It's what has been done on Kirkwood for a long time, and apparently it
hasn't caused any trouble so far.

> > Why would you need to adjust the length? If Linux allocates a 2 MB
> > resource on a 1 MB boundary, we simply increase the start address to
> > the next 2 MB boundary, and that's it. Why would the length need to
> > change?
> 
> Well, let's say 3MB is the example. A 3mb region needs to fit inside a
> 4mb MBUS window. If you align the start to 4mb then the pci-e core
> needs to know that it can't use the extra 1mb covered by the mbus
> window. mbus windows must not overlap.

Grumble, grumble, you're right. I now understand. I need to think about
it, because the current pcibios_window_alignment() thing only allows
adjusting the start address, if I'm correct. But I'll have a more detailed
look. Or maybe Russell can comment on this specific topic?

Thanks!

Thomas
Stephen Warren Feb. 6, 2013, 6:22 p.m. UTC | #20
On 02/06/2013 10:50 AM, Jason Gunthorpe wrote:
> On Wed, Feb 06, 2013 at 06:18:52PM +0100, Thomas Petazzoni wrote:
>> Dear Jason Gunthorpe,
>>
>> On Wed, 6 Feb 2013 10:09:03 -0700, Jason Gunthorpe wrote:
>>
>>> Link 0 translates like:
>>>
>>> - Linux driver does inl(0x4)
>>> - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x4
>>> - The CPU TLB converts that into a read from CPU physical
>>>   0xc0000000 + 0x4
>>> - The MBUS window remap register converts that into a read from IO
>>>   space 0x4
>>> - The address 0x4 is placed in the PCI-E IO transaction of link 0
>>>
>>> Link 1 translates like:
>>>
>>> - Linux driver does inl(0x1004)
>>> - ARM layer converts that into a read from PCI_IO_VIRT_BASE + 0x1004
>>> - The CPU TLB converts that into a read from CPU physical
>>>   0xc0000000 + 0x11004 (ie the mbus window for the link 1)
>>> - The MBUS window remap register converts that into a read from IO
>>>   space 0x1004
>>> - The address 0x1004 is placed in the PCI-E IO transaction of link 1
>>
>> This last step is exactly what I thought would not work. If the PCIe
>> device has say 64 bytes of I/O space, then this 0x1004 PCI-E
>> transaction will be out of bounds, no?
> 
> No.. PCI end devices are required to decode all 32 bits of address,
> less the bits required for their allocation. So a device with 64 bytes
> of IO will match bits 31:6 and then use bits 5:0 for the internal
> register.

Didn't Arnd say (earlier this thread) that PCI devices using IO BARs
were probably fairly legacy and hence might be buggy and might not obey
that rule? Now, I'd guess it's safe within the first 64k of IO space
though, so perhaps he was only talking about IO BAR bases >= 64k being
dubious? That would imply a device might only use bits 15:6 for matching
the BAR base and 5:0 for the internal register for a 64-byte BAR.
Jason Gunthorpe Feb. 6, 2013, 6:39 p.m. UTC | #21
On Wed, Feb 06, 2013 at 11:22:35AM -0700, Stephen Warren wrote:

> > No.. PCI end devices are required to decode all 32 bits of address,
> > less the bits required for their allocation. So a device with 64 bytes
> > of IO will match bits 31:6 and then use bits 5:0 for the internal
> > register.
> 
> Didn't Arnd say (earlier this thread) that PCI devices using IO BARs
> were probably fairly legacy and hence might be buggy and might not obey
> that rule? Now, I'd guess it's safe within the first 64k of IO space
> though, so perhaps he was only talking about IO BAR bases >= 64k being
> dubious? That would imply a device might only use bits 15:6 for matching
> the BAR base and 5:0 for the internal register for a 64-byte BAR.

Right, that is what I was referring to when I said:

> - Some devices are broken because x86 only uses the low 64k.

Fortunately on PCI-E IO TLPs will be fully routed before they are sent
down a link, so downstream of a PCI-E link we will never see aliasing
of the low 16 bits.

This means if you do bridge to legacy PCI, and you do use devices that
don't decode the upper 16 bits, it will still work OK, because the
low 16 bits on the legacy PCI bus will still be unique in each device
on that bus.

That doesn't save you from weird legacy ISA stuff, or stuff that
doesn't respect the BARs, or other craziness..

My personal hope would be that nobody using a PCI-E ARM SOC ever has
to deal with anything to do with IO space ;)

Jason
Thomas Petazzoni Feb. 6, 2013, 6:42 p.m. UTC | #22
Dear Stephen Warren,

On Wed, 06 Feb 2013 11:22:35 -0700, Stephen Warren wrote:

> > No.. PCI end devices are required to decode all 32 bits of address,
> > less the bits required for their allocation. So a device with 64 bytes
> > of IO will match bits 31:6 and then use bits 5:0 for the internal
> > register.
> 
> Didn't Arnd say (earlier this thread) that PCI devices using IO BARs
> were probably fairly legacy and hence might be buggy and might not obey
> that rule? Now, I'd guess it's safe within the first 64k of IO space
> though, so perhaps he was only talking about IO BAR bases >= 64k being
> dubious? That would imply a device might only use bits 15:6 for matching
> the BAR base and 5:0 for the internal register for a 64-byte BAR.

The thing is that the existing PCIe support for earlier Marvell SoC
families already uses more than the first 64 KB to map the I/O BARs, and
this hasn't apparently caused any problems. We're talking about PCIe
support, not PCI, so I guess a lot of the very legacy devices are
simply not part of the equation.

Can't we simply agree on having a first implementation that does the
simple thing, like the existing PCIe implementation for earlier Marvell
SoC families, and improve that if it happens to be needed, depending on
user feedback?

Best regards,

Thomas
Arnd Bergmann Feb. 6, 2013, 10:04 p.m. UTC | #23
On Wednesday 06 February 2013, Thomas Petazzoni wrote:
> Can't we simply agree on having a first implementation that does the
> simple thing, like the existing PCIe implementation for earlier Marvell
> SoC families, and improve that if it happens to be needed, depending on
> user feedback?

Makes sense. I just looked up the kirkwood source to verify that the
window is set up to map PCI IO address 0x10000-0x1ffff for the second
bus to KIRKWOOD_PCIE1_IO_PHYS_BASE, which is mapped to logical port
number 0x10000-0x1ffff (identity mapping).

	Arnd
Arnd Bergmann Feb. 7, 2013, 11:33 p.m. UTC | #24
On Thursday 07 February 2013, Thomas Petazzoni wrote:
> I am unfortunately starting to believe that using the standard PCI
> resource allocator is too complicated for our hardware, and that we
> should maybe have a dedicated allocator. But I would really like to
> avoid that if possible.

I see this just as more evidence that the emulated P2P bridge approach
is not the easiest solution and that it would be easier to go back
to adding the ports separately and make it possible to have every
port assign the BARs first and then set the resources based on
the physical address of the window we pick for it.

	Arnd
Bjorn Helgaas Feb. 8, 2013, 4:21 a.m. UTC | #25
On Thu, Feb 7, 2013 at 8:50 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Bjorn,
>
> I guess I need your advice on the below problem, but here is a summary.
>
> Basically, the PCI-to-PCI bridge specifications require that the memory
> address assigned to a PCI-to-PCI bridge is aligned on a 1 MB boundary.
>
> Unfortunately, on Marvell hardware, we can only create address decoding
> windows that are aligned on a multiple of their size (a 1 MB window
> must be 1 MB aligned, a 2 MB window must be 2 MB aligned, etc.).
>
> How can we teach the Linux PCI core about this requirement, so that it
> does a proper assignment of addresses at the PCI-to-PCI bridge level?
> For the I/O addresses, Russell suggested the pcibios_window_alignment()
> hook, but it doesn't receive the size of the resource, so we can't
> determine what alignment is needed.
>
> As Jason points out below, we need to be able to tell the PCI core that
> a given memory area needs some alignment, but also that its size is
> larger than what the PCIe device claims, because we cannot create
> address decoding windows of an arbitrary size. For example, an address
> decoding window of 3 MB is not possible, so if a device wants 3 MB,
> then we would need to extend this memory area to 4 MB so that the next
> device doesn't get an address decoding window that overlaps with the
> previous one.
>
> I was hoping that the emulated PCI-to-PCI bridge could, by its
> behavior, teach the Linux PCI core about these special constraints.
> However, reading the PCI-to-PCI bridge specification, I don't see how
> to achieve that.
>
> Do you have some suggestions?

Huh.  That hardware looks less and less like a P2P bridge all the time
:(  You can't configure it via standard PCI config accesses, and the
aperture alignment and size constraints sound completely non-standard.
 Are the specs for this thing public?

I could imagine changing pcibios_window_alignment() to take the
resource, so it could deal with the alignment question (though I
haven't looked in detail and there might be some implementation issue
with that).

With regard to the size issue (3MB window using 4MB of address space),
I can't think of a reasonable way to teach the PCI core about both
sizes.  But is there any reason to program the bridge for a 3MB window
instead of a 4MB window, given that there's nothing else we can do
with the extra 1MB anyway?  Is a 3MB window even possible?  I would
think something that must be aligned on its size would be restricted
to power-of-2 sizes anyway, just like PCI BARs are.  Maybe you can
just always round up window sizes to a power of 2?
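
Something along these lines is what I mean (untested sketch;
roundup_pow_of_two() is from <linux/log2.h>):

	/* round the window up to a power-of-2 size, aligned on that size */
	resource_size_t win_size  = roundup_pow_of_two(resource_size(res));
	resource_size_t win_start = ALIGN(res->start, win_size);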

Bjorn

> On Wed, 6 Feb 2013 10:50:19 -0700, Jason Gunthorpe wrote:
>
>> > > > Can't this be solved using the window_alignement() hook we've
>> > > > been discussing separately? Just like we teach the Linux PCI
>> > > > core about our alignment requirements of 64K for the I/O
>> > > > regions, we could teach it about our alignment requirement on
>> > > > memory regions as well. No?
>> > >
>> > > Hopefully :) As long as it can adjust the start and length you
>> > > should be fine.
>> >
>> > Why would you need to adjust the length? If Linux allocates a 2 MB
>> > resource on a 1 MB boundary, we simply increase the start address to
>> > the next 2 MB boundary, and that's it. Why would the length need to
>> > change?
>>
>> Well, let's say 3 MB is the example. A 3 MB region needs to fit inside a
>> 4 MB MBUS window. If you align the start to 4 MB then the PCIe core
>> needs to know that it can't use the extra 1 MB covered by the mbus
>> window. Mbus windows must not overlap.
>>
>> Adjusting the bridge window length to be 4 MB communicates that dead
>> space to the PCI core, and presumably this shows up in lspci and
>> whatnot.
>>
>> I suppose if you align the end to 4 MB (thus creating the gap before,
>> not after) things should work out OK, but the information that the gap
>> is routed to a specific PCI link is lost.
>>
>> Jason
>>
>
>
>
> --
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
Thomas Petazzoni Feb. 8, 2013, 8:14 a.m. UTC | #26
Dear Bjorn Helgaas,

On Thu, 7 Feb 2013 21:21:57 -0700, Bjorn Helgaas wrote:

> Huh.  That hardware looks less and less like a P2P bridge all the time
> :(  You can't configure it via standard PCI config accesses, and the
> aperture alignment and size constraints sound completely non-standard.
>  Are the specs for this thing public?

The specs for the Armada XP are not yet public, but the ones for earlier
Marvell SoC families are, and the PCIe stuff works basically the same.
The main difference between the Kirkwood family and Armada XP is that
Kirkwood had only 2 PCIe interfaces, so we could perfectly well do a
static allocation of address decoding windows (see the four PCIe windows
in arch/arm/mach-kirkwood/addr-map.c:addr_map_info[]), while the Armada
XP has 10 PCIe interfaces, which makes the static allocation solution
unreasonable.
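
For reference, the four Kirkwood entries in that table look roughly
like this (quoting from memory, so take the exact field layout with a
grain of salt; the order is win/base/size/target/attr):

	{ 0, KIRKWOOD_PCIE_IO_PHYS_BASE,   KIRKWOOD_PCIE_IO_SIZE,
	     TARGET_PCIE, ATTR_PCIE_IO   },
	{ 1, KIRKWOOD_PCIE_MEM_PHYS_BASE,  KIRKWOOD_PCIE_MEM_SIZE,
	     TARGET_PCIE, ATTR_PCIE_MEM  },
	{ 2, KIRKWOOD_PCIE1_IO_PHYS_BASE,  KIRKWOOD_PCIE1_IO_SIZE,
	     TARGET_PCIE, ATTR_PCIE1_IO  },
	{ 3, KIRKWOOD_PCIE1_MEM_PHYS_BASE, KIRKWOOD_PCIE1_MEM_SIZE,
	     TARGET_PCIE, ATTR_PCIE1_MEM },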

The list of publicly available specifications for Marvell EBU SoCs is
available at Documentation/arm/Marvell/README. For Kirkwood, I would
recommend
http://www.marvell.com/embedded-processors/kirkwood/assets/FS_88F6180_9x_6281_OpenSource.pdf.
See chapter 2.3:

  The PCI Express address decoding scheme restricts the address window
  to a size of 2^n, and to a start address that is aligned to the window
  size.

> I could imagine changing pcibios_window_alignment() to take the
> resource, so it could deal with the alignment question (though I
> haven't looked in detail and there might be some implementation issue
> with that).
> 
> With regard to the size issue (3MB window using 4MB of address space),
> I can't think of a reasonable way to teach the PCI core about both
> sizes.  But is there any reason to program the bridge for a 3MB window
> instead of a 4MB window, given that there's nothing else we can do
> with the extra 1MB anyway?  Is a 3MB window even possible?  I would
> think something that must be aligned on its size would be restricted
> to power-of-2 sizes anyway, just like PCI BARs are.  Maybe you can
> just always round up window sizes to a power of 2?

The window sizes are powers of two. I didn't realize that this was
also the case for PCI BARs. Then there is no problem with the size,
I guess, only a problem of alignment. Having the possibility to get
the resource and return a fixed-up start address would solve the
problem, I'd say.
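
In other words, a prototype along these lines would be enough for us
(just a sketch; the size argument is the proposed change, and the
power-of-two rounding matches our window constraint):

resource_size_t pcibios_window_alignment(struct pci_bus *bus,
					 unsigned long type,
					 resource_size_t size)
{
	/* mbus windows must be aligned on their power-of-two size */
	return roundup_pow_of_two(size);
}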

Best regards,

Thomas
Arnd Bergmann Feb. 12, 2013, 4 p.m. UTC | #27
On Friday 08 February 2013, Thomas Petazzoni wrote:
> Dear Bjorn Helgaas,
> 
> On Thu, 7 Feb 2013 21:21:57 -0700, Bjorn Helgaas wrote:
> 
> > Huh.  That hardware looks less and less like a P2P bridge all the time
> > :(  You can't configure it via standard PCI config accesses, and the
> > aperture alignment and size constraints sound completely non-standard.

Right.

> > I could imagine changing pcibios_window_alignment() to take the
> > resource, so it could deal with the alignment question (though I
> > haven't looked in detail and there might be some implementation issue
> > with that).
> > 
> > With regard to the size issue (3MB window using 4MB of address space),
> > I can't think of a reasonable way to teach the PCI core about both
> > sizes.  But is there any reason to program the bridge for a 3MB window
> > instead of a 4MB window, given that there's nothing else we can do
> > with the extra 1MB anyway?  Is a 3MB window even possible?  I would
> > think something that must be aligned on its size would be restricted
> > to power-of-2 sizes anyway, just like PCI BARs are.  Maybe you can
> > just always round up window sizes to a power of 2?
> 
> The window sizes are powers of two. I didn't realize that this was
> also the case for PCI BARs. Then there is no problem with the size,
> I guess, only a problem of alignment. Having the possibility to get
> the resource and return a fixed-up start address would solve the
> problem, I'd say.

I thought that only device BARs in PCI had natural alignment, while
bridges don't.

I tried understanding the actual problem we have with the current
procedure, which on today's kirkwood is roughly implemented in ARM's
pci_common_init() as follows:

for_each_root_bus() {
	pci_scan_root_bus();		/* enumerate the devices */
}
for_each_root_bus() {
	pci_bus_size_bridges();		/* size the bridge windows */
	pci_bus_assign_resources();	/* assign bus addresses */
	pci_enable_bridges();
	pci_bus_add_devices();
}

This is using hardcoded windows today, which are set up before calling
pci_scan_root_bus(). With your change, there is only one root bus,
and you intercept the pci_bus_assign_resources() stage in order to
set up the hardware-specific window configuration for each PCIe
port of that root bus.

My feeling is that an easier solution would be to keep separate
root buses for each port, which then behaves completely PCIe
compliant, but add a hook in the procedure above to set up the
address translation windows between the pci_bus_size_bridges()
and the pci_bus_assign_resources() calls.
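
I.e. extend the second loop above roughly like this (the setup_windows
hook name is made up):

for_each_root_bus() {
	pci_bus_size_bridges();		/* window sizes now known */
	if (hw->setup_windows)
		hw->setup_windows(sys);	/* program matching mbus windows */
	pci_bus_assign_resources();
	pci_enable_bridges();
	pci_bus_add_devices();
}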

	Arnd
Jason Gunthorpe Feb. 12, 2013, 6:41 p.m. UTC | #28
On Tue, Feb 12, 2013 at 04:00:08PM +0000, Arnd Bergmann wrote:

> > The window sizes are powers of two. I didn't realize that this was
> > also the case for PCI BARs. Then there is no problem with the size,
> > I guess, only a problem of alignment. Having the possibility to get
> > the resource and return a fixed-up start address would solve the
> > problem, I'd say.
> 
> I thought that only device BARs in PCI had natural alignment, while
> bridges don't.

Right

> My feeling is that an easier solution would be to keep separate
> root buses for each port, which then behaves completely PCIe
> compliant, but add a hook in the procedure above to set up the
> address translation windows between the pci_bus_size_bridges()
> and the pci_bus_assign_resources() calls.

This process is only done during driver initialization. How would you
support PCI-E device hotplug (my systems rely on this)? Hotplug works
today with the existing Marvell driver, however that relies on
pre-allocated windows.

Jason
Arnd Bergmann Feb. 12, 2013, 7:02 p.m. UTC | #29
On Tuesday 12 February 2013, Jason Gunthorpe wrote:
> > My feeling is that an easier solution would be to keep separate
> > root buses for each port, which then behaves completely PCIe
> > compliant, but add a hook in the procedure above to set up the
> > address translation windows between the pci_bus_size_bridges()
> > and the pci_bus_assign_resources() calls.
> 
> This process is only done during driver initialization. How would you
> support PCI-E device hotplug (my systems rely on this)? Hotplug works
> today with the existing Marvell driver, however that relies on
> pre-allocated windows.

I did not expect hotplug to work with either approach. How does
it work with the existing driver? From my understanding, you still
assign all the top-level P2P bridge resources at bootup, and only
if that happens to leave some space before the next bridge would
it be possible to fit in a hotplug device.

	Arnd
Jason Gunthorpe Feb. 12, 2013, 7:38 p.m. UTC | #30
On Tue, Feb 12, 2013 at 07:02:14PM +0000, Arnd Bergmann wrote:
> On Tuesday 12 February 2013, Jason Gunthorpe wrote:
> > > My feeling is that an easier solution would be to keep separate
> > > root buses for each port, which then behaves completely PCIe
> > > compliant, but add a hook in the procedure above to set up the
> > > address translation windows between the pci_bus_size_bridges()
> > > and the pci_bus_assign_resources() calls.
> > 
> > This process is only done during driver initialization. How would you
> > support PCI-E device hotplug (my systems rely on this)? Hotplug works
> > today with the existing Marvell driver, however that relies on
> > pre-allocated windows.
> 
> I did not expect hotplug to work with either approach. How does
> it work with the existing driver? From my understanding, you still
> assign all the top-level P2P bridge resources at bootup, and only
> if that happens to leave some space before the next bridge would
> it be possible to fit in a hotplug device.

PCI-E hotplug can be supported through the generic /sys/bus/pci/rescan
mechanism, which forces a rediscovery/re-evaluation of all the buses
in the system. The PCI core is smart enough to know what it can/can
not reassign and can move the bridge windows around (IIRC most of the
issues here are resolved these days?).
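
For example, after plugging in a card:

# echo 1 > /sys/bus/pci/rescan

is all it takes to trigger the re-enumeration and resource assignment.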

For PCI-E, the root port bridge can be placed anywhere in the host
aperture, so as long as the host aperture isn't filled the core can
allocate a memory region for the now active port.

Thomas's driver should support this as the regions and windows are all
properly dynamic.

This is one reason why re-using the PCI core code is so desirable: it
handles all these complexities.

Jason
Arnd Bergmann Feb. 12, 2013, 11:05 p.m. UTC | #31
On Tuesday 12 February 2013, Jason Gunthorpe wrote:
> PCI-E hotplug can be supported through the generic /sys/bus/pci/rescan
> mechanism, which forces a rediscovery/re-evaluation of all the buses
> in the system. The PCI core is smart enough to know what it can/can
> not reassign and can move the bridge windows around (IIRC most of the
> issues here are resolved these days?).
> 
> For PCI-E, the root port bridge can be placed anywhere in the host
> aperture, so as long as the host aperture isn't filled the core can
> allocate a memory region for the now active port.
> 
> Thomas's driver should support this as the regions and windows are all
> properly dynamic.

Ah, so you only allow hotplugging into the root ports, but not behind
additional bridges that have active devices on them, right?

I guess that is a common limitation for PCIe hotplugging.
 
> This is one reason why re-using the PCI core code is so desirable, it
> handles all these complexities.

Ok, I'm starting to understand what the entire discussion is about ;)

	Arnd
Jason Gunthorpe Feb. 13, 2013, 12:32 a.m. UTC | #32
On Tue, Feb 12, 2013 at 11:05:28PM +0000, Arnd Bergmann wrote:
> On Tuesday 12 February 2013, Jason Gunthorpe wrote:
> > PCI-E hotplug can be supported through the generic /sys/bus/pci/rescan
> > mechanism, which forces a rediscovery/re-evaluation of all the buses
> > in the system. The PCI core is smart enough to know what it can/can
> > not reassign and can move the bridge windows around (IIRC most of the
> > issues here are resolved these days?).
> > 
> > For PCI-E, the root port bridge can be placed anywhere in the host
> > aperture, so as long as the host aperture isn't filled the core can
> > allocate a memory region for the now active port.
> > 
> > Thomas's driver should support this as the regions and windows are all
> > properly dynamic.
> 
> Ah, so you only allow hotplugging into the root ports, but not behind
> additional bridges that have active devices on them, right?
 
> I guess that is a common limitation for PCIe hotplugging.

In all the cases I've worked with, it has been a root port hot plug,
but I could imagine something like ExpressCard requiring a second
bridge.

The standard answer is to leave appropriate gaps. My *guess* on this
matter is that on x86 the gaps are left, as appropriate, by the boot
firmware. Eg an ExpressCard slot will always have a window assigned to
its bridge and Linux would typically not reassign it (or similar).

PCI core support for firmware-less embedded will someday need to do
something similar, eg via a special DT attribute on hot plug capable
ports.

Just to circle back on this whole thread - Thomas's solution is pretty
good, it covers pretty much all the use cases. I think it is a good
place to start, and as the firmware-less 'drivers/pci/host' concept
develops the right support will eventually come, as everyone is now
aware of the need to control the host bridge aperture from the core
PCI code.

Jason
Arnd Bergmann Feb. 13, 2013, 6:53 p.m. UTC | #33
On Wednesday 13 February 2013, Jason Gunthorpe wrote:
> On Tue, Feb 12, 2013 at 11:05:28PM +0000, Arnd Bergmann wrote:
> > On Tuesday 12 February 2013, Jason Gunthorpe wrote:
> > > PCI-E hotplug can be supported through the generic /sys/bus/pci/rescan
> > > mechanism, which forces a rediscovery/re-evaluation of all the buses
> > > in the system. The PCI core is smart enough to know what it can/can
> > > not reassign and can move the bridge windows around (IIRC most of the
> > > issues here are resolved these days?).
> > > 
> > > For PCI-E, the root port bridge can be placed anywhere in the host
> > > aperture, so as long as the host aperture isn't filled the core can
> > > allocate a memory region for the now active port.
> > > 
> > > Thomas's driver should support this as the regions and windows are all
> > > properly dynamic.
> > 
> > Ah, so you only allow hotplugging into the root ports, but not behind
> > additional bridges that have active devices on them, right?
>  
> > I guess that is a common limitation for PCIe hotplugging.
> 
> In all the cases I've worked with, it has been a root port hot plug,
> but I could imagine something like ExpressCard requiring a second
> bridge.

Ok, I see.

> The standard answer is to leave appropriate gaps. My *guess* on this
> matter is that on x86 the gaps are left, as appropriate, by the boot
> firmware. Eg an ExpressCard slot will always have a window assigned to
> its bridge and Linux would typically not reassign it (or similar).
> 
> PCI core support for firmware-less embedded will someday need to do
> something similar, eg via a special DT attribute on hot plug capable
> ports.

I saw that the PCI core reserves 2MB memory space and 256 bytes of
I/O space per hotplug capable bridge by default, and you can override
these at boot time if you need more. I wonder if this means that
we end up using two of the precious address space windows for each
unused root port just to map these at boot time. That certainly
works for most adapters, but it does not seem better than assigning
static windows of the same size at boot time for each port.
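
(The overrides are the pci=hpmemsize= and pci=hpiosize= command line
parameters, if I remember correctly, e.g. pci=hpmemsize=4M,hpiosize=4K
when the defaults are too small.)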

> Just to circle back on this whole thread - Thomas's solution is pretty
> good, it covers pretty much all the use cases. I think it is a good
> place to start, and as the firmware-less 'drivers/pci/host' concept
> develops the right support will eventually come, as everyone is now
> aware of the need to control the host bridge aperture from the core
> PCI code.

I agree the solution is not all that bad, I just want to be convinced
that it actually has advantages over the simpler approaches.

	Arnd
Jason Gunthorpe Feb. 13, 2013, 7:12 p.m. UTC | #34
On Wed, Feb 13, 2013 at 06:53:14PM +0000, Arnd Bergmann wrote:

> > The standard answer is to leave appropriate gaps. My *guess* on this
> > matter is that on x86 the gaps are left, as appropriate, by the boot
> > firmware. Eg an ExpressCard slot will always have a window assigned to
> > its bridge and Linux would typically not reassign it (or similar).
> > 
> > PCI core support for firmware-less embedded will someday need to do
> > something similar, eg via a special DT attribute on hot plug capable
> > ports.
> 
> I saw that the PCI core reserves 2MB memory space and 256 bytes of
> I/O space per hotplug capable bridge by default, and you can
> override

Haven't looked at how it determines what is hot plug
capable. Technically every PCI-E port is hot plug capable; it really
depends on the specific board whether a port can actually be hot
plugged or not - so maybe that is what gets set in the DT?

> these at boot time if you need more. I wonder if this means that
> we end up using two of the precious address space windows for each
> unused root port just to map these at boot time. That certainly
> works for most adapters, but it does not seem better than assigning
> static windows of the same size at boot time for each port.

If the PCI core programs the decoder on the bridge, then it will
consume a window - however, if there is nothing behind the bridge,
leaving the bridge window disabled but reserving the memory region is
a sensible thing to do.

I'm not sure what the state of the PCI core is today on this point,
but it could be altered..

Also the host driver can check the link status before consuming a
window, no link = no window.
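
With the existing plat-orion helper that would be something like this
(untested; "port" stands for whatever per-port state the driver keeps):

	/* no link = no window: don't burn an mbus window on a dead port */
	if (!orion_pcie_link_up(port->base))
		return;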

Thomas, what were your test results on your 10 slot system? Did all 10
P2P bridges appear in lspci? Was there any address space reservation
for hot plug?

> > Just to circle back on this whole thread - Thomas's solution is pretty
> > good, it covers pretty much all the use cases. I think it is a good
> > place to start, and as the firmware-less 'drivers/pci/host' concept
> > develops the right support will eventually come, as everyone is now
> > aware of the need to control the host bridge aperture from the core
> > PCI code.
> 
> I agree the solution is not all that bad, I just want to be convinced
> that it actually has advantages over the simpler approaches.

Unfortunately my Marvell systems do not have oversubscribed mbus
windows, so I can't really comment on this :( However, I do use the
hotplug capability in the current driver, so at least for me it is
important not to lose that when trying to solve the oversubscription.

Jason
Thomas Petazzoni Feb. 13, 2013, 7:51 p.m. UTC | #35
Dear Jason Gunthorpe,

On Wed, 13 Feb 2013 12:12:04 -0700, Jason Gunthorpe wrote:

> Thomas, what were your test results on your 10 slot system? Did all 10
> P2P bridges appear in lspci? Was there any address space reservation
> for hot plug?

My test system "only" has 6 PCIe slots, so I couldn't test with all
10 possible PCIe interfaces. But 6 PCIe slots is still a good
number :-)

If you look at the cover letter of the PATCHv3, you'll see the output
of lspci -vvv. It shows that I create one PCI-to-PCI bridge for each
PCIe interface, regardless of whether the link is up or not. And for
those where there is nothing behind the bridge, no address space
reservation occurs, so I am not allocating address decoding windows for
those unused PCIe interfaces.

Best regards,

Thomas
Yinghai Lu Feb. 13, 2013, 9:02 p.m. UTC | #36
On Tue, Feb 12, 2013 at 4:32 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Feb 12, 2013 at 11:05:28PM +0000, Arnd Bergmann wrote:
>>
>> Ah, so you only allow hotplugging into the root ports, but not behind
>> additional bridges that have active devices on them, right?
>
>> I guess that is a common limitation for PCIe hotplugging.
>
> In all the cases I've worked with, it has been a root port hot plug,
> but I could imagine something like ExpressCard requiring a second
> bridge.
>
> The standard answer is to leave appropriate gaps. My *guess* on this
> matter is that on x86 the gaps are left, as appropriate, by the boot
> firmware. Eg an ExpressCard slot will always have a window assigned to
> its bridge and Linux would typically not reassign it (or similar).

Yes, the x86 BIOS will leave some padding for every PCIe root port.

Also, with the patches in the for-pci-busn-alloc branch of
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
the bus ranges can be reallocated to get bigger bus number ranges.

Thanks

Yinghai
Arnd Bergmann Feb. 13, 2013, 9:10 p.m. UTC | #37
On Wednesday 13 February 2013, Jason Gunthorpe wrote:
> On Wed, Feb 13, 2013 at 06:53:14PM +0000, Arnd Bergmann wrote:
> 
> > > The standard answer is to leave appropriate gaps. My *guess* on this
> > > matter is that on x86 the gaps are left, as appropriate, by the boot
> > > firmware. Eg an ExpressCard slot will always have a window assigned to
> > > its bridge and Linux would typically not reassign it (or similar).
> > > 
> > > PCI core support for firmware-less embedded will someday need to do
> > > something similar, eg via a special DT attribute on hot plug capable
> > > ports.
> > 
> > I saw that the PCI core reserves 2MB memory space and 256 bytes of
> > I/O space per hotplug capable bridge by default, and you can
> > override
> 
> Haven't looked at how it determines what is hot plug
> capable. Technically every PCI-E port is hot plug capable; it really
> depends on the specific board whether a port can actually be hot
> plugged or not - so maybe that is what gets set in the DT?

The "is_hotplug_bridge" flag that determines this gets set for PCIe
bridges with the PCI_EXP_SLTCAP_HPC (hot plug capable) bit set in the
PCI_EXP_SLTCAP register.
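
From memory, the check boils down to something like this (see
set_pcie_hotplug_bridge() in drivers/pci/probe.c):

	u32 sltcap;

	pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &sltcap);
	if (sltcap & PCI_EXP_SLTCAP_HPC)
		pdev->is_hotplug_bridge = 1;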

> > these at boot time if you need more. I wonder if this means that
> > we end up using two of the precious address space windows for each
> > unused root port just to map these at boot time. That certainly
> > works for most adapters, but it does not seem better than assigning
> > static windows of the same size at boot time for each port.
> 
> If the PCI core programs the decoder on the bridge, then it will
> consume a window - however, if there is nothing behind the bridge,
> leaving the bridge window disabled but reserving the memory region is
> a sensible thing to do.
>
> I'm not sure what the state of the PCI core is today on this point,
> but it could be altered..

The problem I see with the current implementation is that it reserves
a fixed size window and does not reassign the window of the bridge
itself, only the devices below it, at least if I am reading the
code correctly. I have not tried this myself.
 
> Also the host driver can check the link status before consuming a
> window, no link = no window.

Right, that works. Even if the link is up, the port might require only
an I/O window or only a memory window, rather than always using both.

> > > Just to circle back on this whole thread - Thomas's solution is pretty
> > > good, it covers pretty much all the use cases. I think it is a good
> > > place to start, and as the firmware-less 'drivers/pci/host' concept
> > > develops the right support will eventually come, as everyone is now
> > > aware of the need to control the host bridge aperture from the core
> > > PCI code.
> > 
> > I agree the solution is not all that bad, I just want to be convinced
> > that it actually has advantages over the simpler approaches.
> 
> Unfortunately my Marvell systems do not have oversubscribed mbus
> windows, so I can't really comment on this :( However, I do use the
> hotplug capability in the current driver, so at least for me it is
> important not to lose that when trying to solve the oversubscription.

One thing worth trying is probably to hack the driver to only use
a couple of the available windows and see what happens when you hotplug
one card into all the slots one at a time.

	Arnd
Yinghai Lu Feb. 13, 2013, 9:20 p.m. UTC | #38
On Wed, Feb 13, 2013 at 1:10 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> The problem I see with the current implementation is that it reserves
> a fixed size window and does not reassign the window of the bridge
> itself, only the devices below it, at least if I am reading the
> code correctly. I have not tried this myself.

Which file?

The current code *does* change the bridge's own resources (mmio, pref
mmio, and io) in
pciehp_configure_device()/pci_assign_unassigned_bridge_resources().

Let me know if it does not work for you.

Thanks

Yinghai
Arnd Bergmann Feb. 13, 2013, 10:24 p.m. UTC | #39
On Wednesday 13 February 2013, Yinghai Lu wrote:
> On Wed, Feb 13, 2013 at 1:10 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > The problem I see with the current implementation is that it reserves
> > a fixed size window and does not reassign the window of the bridge
> > itself, only the devices below it, at least if I am reading the
> > code correctly. I have not tried this myself.
> 
> Which file?
> 
> The current code *does* change the bridge's own resources (mmio, pref
> mmio, and io) in
> pciehp_configure_device()/pci_assign_unassigned_bridge_resources().
> 
> Let me know if it does not work for you.

That is the code I was looking at, but I probably misunderstood something
there. I did not actually run it, just attempted to understand what it
does by inspection.

I found the code now, sorry for the confusion on my end.

Jason, Thomas: you win :-)

The concept that root ports don't get resized is hardwired in a lot of places
along the way. That could be changed, but there is a significant risk
of regressions if we try that. Adding fake bridges to work around that
isn't the nicest solution, but the code is there and works without
being able to break something else, so let's do that unless there are
new problems that make it harder.

	Arnd
diff mbox

Patch

diff --git a/arch/arm/include/asm/mach/pci.h b/arch/arm/include/asm/mach/pci.h
index db9fedb..a2301ae 100644
--- a/arch/arm/include/asm/mach/pci.h
+++ b/arch/arm/include/asm/mach/pci.h
@@ -42,6 +42,8 @@  struct pci_sys_data {
 	int		busnr;		/* primary bus number			*/
 	u64		mem_offset;	/* bus->cpu memory mapping offset	*/
 	unsigned long	io_offset;	/* bus->cpu IO mapping offset		*/
+	resource_size_t	win_align_mem;
+	resource_size_t	win_align_io;
 	struct pci_bus	*bus;		/* PCI bus				*/
 	struct list_head resources;	/* root bus resources (apertures)       */
 	struct resource io_res;
diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c
index 379cf32..ba81630 100644
--- a/arch/arm/kernel/bios32.c
+++ b/arch/arm/kernel/bios32.c
@@ -463,6 +463,8 @@  static void __init pcibios_init_hw(struct hw_pci *hw, struct list_head *head)
 		sys->swizzle = hw->swizzle;
 		sys->map_irq = hw->map_irq;
 		INIT_LIST_HEAD(&sys->resources);
+		sys->win_align_mem = 1;
+		sys->win_align_io = 1;
 
 		ret = hw->setup(nr, sys);
 
@@ -581,6 +583,21 @@  resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 	return start;
 }
 
+resource_size_t pcibios_window_alignment(struct pci_bus *bus,
+					 unsigned long type)
+{
+	struct pci_sys_data *sys = bus->sysdata;
+
+	/* Ignore downstream buses */
+	if (!bus->parent) {
+		if (type & IORESOURCE_MEM)
+			return sys->win_align_mem;
+		if (type & IORESOURCE_IO)
+			return sys->win_align_io;
+	}
+	return 1;
+}
+
 /**
  * pcibios_enable_device - Enable I/O and memory.
  * @dev: PCI device to be enabled