| Message ID | 1359399397-29729-6-git-send-email-thomas.petazzoni@free-electrons.com (mailing list archive) |
|---|---|
| State | New, archived |
| Delegated to | Bjorn Helgaas |
Russell, As the arch/arm/kernel/ maintainer, what is your position regarding the below patch? Thanks for your review, Thomas On Mon, 28 Jan 2013 19:56:14 +0100, Thomas Petazzoni wrote: > The PCI specifications says that an I/O region must be aligned on a 4 > KB boundary, and a memory region aligned on a 1 MB boundary. > > However, the Marvell PCIe interfaces rely on address decoding windows > (which allow to associate a range of physical addresses with a given > device). For PCIe memory windows, those windows are defined with a 1 > MB granularity (which matches the PCI specs), but PCIe I/O windows can > only be defined with a 64 KB granularity, so they have to be 64 KB > aligned. We therefore need to tell the PCI core about this special > alignement requirement. > > The PCI core already calls pcibios_align_resource() in the ARM PCI > core, specifically for such purposes. So this patch extends the ARM > PCI core so that it calls a ->align_resource() hook registered by the > PCI driver, exactly like the existing ->map_irq() and ->swizzle() > hooks. > > A particular PCI driver can register a align_resource() hook, and do > its own specific alignement, depending on the specific constraints of > the underlying hardware. > > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> > Cc: Russell King <linux@arm.linux.org.uk> > --- > arch/arm/include/asm/mach/pci.h | 11 +++++++++++ > arch/arm/kernel/bios32.c | 6 ++++++ > 2 files changed, 17 insertions(+) > > diff --git a/arch/arm/include/asm/mach/pci.h b/arch/arm/include/asm/mach/pci.h > index 5cf2e97..7d2c3c8 100644 > --- a/arch/arm/include/asm/mach/pci.h > +++ b/arch/arm/include/asm/mach/pci.h > @@ -30,6 +30,11 @@ struct hw_pci { > void (*postinit)(void); > u8 (*swizzle)(struct pci_dev *dev, u8 *pin); > int (*map_irq)(const struct pci_dev *dev, u8 slot, u8 pin); > + resource_size_t (*align_resource)(struct pci_dev *dev, > + const struct resource *res, > + resource_size_t start, > + resource_size_t size, > + resource_size_t align); > }; > > /* > @@ -51,6 +56,12 @@ struct pci_sys_data { > u8 (*swizzle)(struct pci_dev *, u8 *); > /* IRQ mapping */ > int (*map_irq)(const struct pci_dev *, u8, u8); > + /* Resource alignement requirements */ > + resource_size_t (*align_resource)(struct pci_dev *dev, > + const struct resource *res, > + resource_size_t start, > + resource_size_t size, > + resource_size_t align); > void *private_data; /* platform controller private data */ > }; > > diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c > index 5401645..be2e6c9 100644 > --- a/arch/arm/kernel/bios32.c > +++ b/arch/arm/kernel/bios32.c > @@ -462,6 +462,7 @@ static void __init pcibios_init_hw(struct hw_pci *hw, struct list_head *head) > sys->busnr = busnr; > sys->swizzle = hw->swizzle; > sys->map_irq = hw->map_irq; > + sys->align_resource = hw->align_resource; > INIT_LIST_HEAD(&sys->resources); > > if (hw->private_data) > @@ -574,6 +575,8 @@ char * __init pcibios_setup(char *str) > resource_size_t pcibios_align_resource(void *data, const struct resource *res, > resource_size_t size, resource_size_t align) > { > + struct pci_dev *dev = data; > + struct pci_sys_data *sys = dev->sysdata; > resource_size_t start = res->start; > > if (res->flags & IORESOURCE_IO && start & 0x300) > @@ -581,6 +584,9 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res, > > start = (start + align - 1) & ~(align - 1); > > + if (sys->align_resource) > + return sys->align_resource(dev, res, start, size, align); > + > return start; > } >
On Tue, Jan 29, 2013 at 04:12:11PM +0100, Thomas Petazzoni wrote:
> Russell,
>
> As the arch/arm/kernel/ maintainer, what is your position regarding the
> below patch?

Given the description, I'd feel much happier with this if we specified
the alignment numerically rather than allowing "some random code" to do
something with the passed values.
Russell,

Thanks for your quick feedback!

On Tue, 29 Jan 2013 15:15:01 +0000, Russell King - ARM Linux wrote:

> Given the description, I'd feel much happier with this if we specified
> the alignment numerically rather than allowing "some random code" to do
> something with the passed values.

So, you'd prefer to have two new members added in the hw_pci structure
to give the alignment requirements for I/O regions and memory regions?
Something like:

	struct hw_pci {
		[...]
		unsigned long io_align;
		unsigned long mem_align;
	};

If that's fine with you, I'll go ahead and change the implementation in
this direction. As long as I can express my special 64 KB alignment
requirement for I/O regions, I'm just fine :-)

Thanks again,

Thomas
On Tue, Jan 29, 2013 at 04:23:37PM +0100, Thomas Petazzoni wrote:
> Russell,
>
> Thanks for your quick feedback!
>
> On Tue, 29 Jan 2013 15:15:01 +0000, Russell King - ARM Linux wrote:
>
> > Given the description, I'd feel much happier with this if we specified
> > the alignment numerically rather than allowing "some random code" to do
> > something with the passed values.
>
> So, you'd prefer to have two new members added in the hw_pci structure
> to give the alignment requirements for I/O regions and memory regions?

Yep, otherwise we'll have yet more code to review rather than one
algorithm with a set of numbers...

I work on the principle that if something can be expressed numerically,
that's always better than expressing it with code.
Russell,

On Tue, 29 Jan 2013 15:25:18 +0000, Russell King - ARM Linux wrote:

> Yep, otherwise we'll have yet more code to review rather than one
> algorithm with a set of numbers...
>
> I work on the principle that if something can be expressed numerically,
> that's always better than expressing it with code.

Having a hook allows for more flexibility (for example having special
alignment requirements depending on the device or something like that),
but I don't need this flexibility for the specific case I'm interested
in. So as you suggest, let's not over-engineer this, and use numerical
values. I'll rework my patch according to this suggestion.

Thanks,

Thomas
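To make the alternative being agreed on here concrete, the following is a minimal sketch of the purely numeric approach, under the assumption of hypothetical io_align/mem_align fields; it is illustration only, not code from any posted patch.

```c
/*
 * Hypothetical sketch of the "two numbers instead of a hook" variant
 * discussed above. The io_align/mem_align names are assumptions.
 */
#include <linux/ioport.h>
#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/types.h>

struct pci_align_params {
	resource_size_t io_align;	/* e.g. SZ_64K for Marvell PCIe I/O windows */
	resource_size_t mem_align;	/* e.g. SZ_1M for PCIe memory windows */
};

/* What pcibios_align_resource() could do with plain numbers. */
static resource_size_t apply_numeric_align(const struct pci_align_params *p,
					   const struct resource *res,
					   resource_size_t start)
{
	if ((res->flags & IORESOURCE_IO) && p->io_align)
		return ALIGN(start, p->io_align);
	if ((res->flags & IORESOURCE_MEM) && p->mem_align)
		return ALIGN(start, p->mem_align);
	return start;
}
```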
On Tue, Jan 29, 2013 at 04:12:11PM +0100, Thomas Petazzoni wrote:
> On Mon, 28 Jan 2013 19:56:14 +0100, Thomas Petazzoni wrote:
> > The PCI specifications says that an I/O region must be aligned on a 4
> > KB boundary, and a memory region aligned on a 1 MB boundary.

BTW, this, as a general statement, is wrong - though it really depends
what you mean by "region".

Remember that BARs can be set where-ever provided that they satisfy
their _individual_ alignment requirements. So, an IO BAR which occupies
16 bytes must be set to a 16-byte boundary.

Now, there's an additional complication there which occurs if you have
ISA devices sharing the PCI IO space: ISA devices used to only decode 10
bits of IO space, which means that their registers repeat throughout the
IO space. Therefore, it is generally accepted that within any 1K block,
only the first 256 locations are usable. Moreover, some PCI cards have
taken advantage of this, particularly VGA cards. For example, S3 VGA
cards put different registers on 1K multiples of the standard PC VGA IO
addresses...

Also, another reason why I suspect your statement is wrong, if I were to
interpret "region" as "BAR", is this: consider a bunch of PCI
peripherals behind a PCI bridge, where the total number of IO BARs on
the peripherals is 16. If you allocate each of those IO BARs to be 4K
aligned, then you consume all 64K of IO space behind one bridge, which
leaves no space for any other IO peripherals elsewhere in the bus
structure.

The last reason I think that interpretation is wrong is that, on this
PC, I see:

	Region 4: I/O ports at 1800 [size=8]
	Region 0: I/O ports at 1830 [size=8]
	Region 2: I/O ports at 1840 [size=32]
	Region 4: I/O ports at 1860 [size=32]
	Region 4: I/O ports at 1880 [size=32]
	Region 4: I/O ports at 18a0 [size=32]
	Region 4: I/O ports at 18c0 [size=32]
	Region 4: I/O ports at 18e0 [size=32]
	Region 4: I/O ports at 1c00 [size=32]
	Region 0: I/O ports at 1c48 [size=8]
	Region 1: I/O ports at 183c [size=4]
	Region 2: I/O ports at 1c40 [size=8]
	Region 3: I/O ports at 1838 [size=4]
	Region 4: I/O ports at 1c20 [size=32]
	Region 4: I/O ports at 1c60 [size=32]

which doesn't follow any 4K alignment - and I can pull out other
examples on other x86 platforms where IO BARs aren't aligned to 4K...
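For readers less familiar with the sizing rule being invoked here: a BAR's required alignment is its own power-of-two size, which software discovers by writing all-ones to the BAR and reading it back. A small stand-alone illustration of that rule (not kernel code, values are made up to match the lspci output above):

```c
/*
 * Standard BAR sizing: the decoded size (and therefore the required
 * alignment) falls out of the bits the device leaves writable.
 */
#include <stdint.h>
#include <stdio.h>

#define PCI_BAR_IO		0x1
#define PCI_BAR_IO_MASK		0xfffffffcu
#define PCI_BAR_MEM_MASK	0xfffffff0u

/* 'readback' is the value read from the BAR after writing 0xffffffff. */
static uint32_t bar_size_and_alignment(uint32_t readback)
{
	uint32_t mask = (readback & PCI_BAR_IO) ? PCI_BAR_IO_MASK : PCI_BAR_MEM_MASK;
	return ~(readback & mask) + 1;
}

int main(void)
{
	/* A 16-byte I/O BAR reads back as 0xfffffff1: it only needs
	 * 16-byte alignment, not 4 KB. */
	printf("%u\n", bar_size_and_alignment(0xfffffff1));	/* 16 */
	printf("%u\n", bar_size_and_alignment(0xffffffe1));	/* 32 */
	return 0;
}
```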
Russell,

On Tue, 29 Jan 2013 15:58:20 +0000, Russell King - ARM Linux wrote:

> On Tue, Jan 29, 2013 at 04:12:11PM +0100, Thomas Petazzoni wrote:
> > On Mon, 28 Jan 2013 19:56:14 +0100, Thomas Petazzoni wrote:
> > > The PCI specifications says that an I/O region must be aligned on a 4
> > > KB boundary, and a memory region aligned on a 1 MB boundary.
>
> BTW, this, as a general statement, is wrong - though it really depends
> what you mean by "region".

Yes, sorry, my statement does not correctly reflect reality. My
knowledge of the PCI terminology is still quite fuzzy (as you found
out).

What I am referring to is that the PCI standard requires the I/O base
register of a PCI-to-PCI bridge to contain a 4 KB aligned address. From
the PCI-to-PCI Bridge Architecture Specification, Revision 1.1, section
3.2.5.6, I/O Base Register and I/O Limit Register:

"""
If a bridge implements an I/O address range, the upper 4 bits of both
the I/O Base and I/O Limit registers are writable and correspond to
address bits AD[15::12]. For the purpose of address decoding, the bridge
assumes that the lower 12 address bits, AD[11::00], of the I/O base
address (not implemented in the I/O Base register) are zero. Similarly,
the bridge assumes that the lower 12 address bits, AD[11::00], of the
I/O limit address (not implemented in the I/O Limit register) are FFFh.
Thus, the bottom of the defined I/O address range will be aligned to a 4
KB boundary and the top of the defined I/O address range will be one
less than a 4 KB boundary.
"""

And the Linux PCI resource allocation code complies with this, so that
if I have two PCI-to-PCI bridges (each having downstream a device with
an I/O BAR), then the first PCI-to-PCI bridge gets its I/O base address
register set to ADDR + 0x0, and the second bridge gets its I/O base
address set to ADDR + 0x1000. And this doesn't play well with the
requirements of Marvell address decoding windows for PCIe I/O regions,
which must be 64 KB aligned.

So I guess I should simply rewrite the commit log to make it clear that
I'm referring to the I/O base address register of PCI-to-PCI bridges.
Would this be more correct?

In that case, maybe in fact I really need a hook, so that this alignment
requirement is only applied to the resources allocated to bridges, and
not to their downstream devices?

Thanks,

Thomas
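The 4 KB granularity quoted above falls directly out of the register layout: only AD[15:12] of the I/O Base and I/O Limit are programmable. A small stand-alone illustration of that decoding (16-bit I/O addressing case only, not code from the series under review):

```c
/*
 * Decoding of a PCI-to-PCI bridge's 8-bit I/O Base/Limit registers into
 * a 4 KB-aligned window, as described by the spec text quoted above.
 */
#include <stdint.h>
#include <stdio.h>

static void decode_io_window(uint8_t io_base_reg, uint8_t io_limit_reg,
			     uint16_t *base, uint16_t *limit)
{
	/* Upper 4 bits map to AD[15:12]; the low 12 bits are implied. */
	*base  = (uint16_t)(io_base_reg  & 0xf0) << 8;            /* AD[11:0] = 0x000 */
	*limit = ((uint16_t)(io_limit_reg & 0xf0) << 8) | 0xfff;  /* AD[11:0] = 0xfff */
}

int main(void)
{
	uint16_t base, limit;

	/* Example register values: base 0x10, limit 0x20. */
	decode_io_window(0x10, 0x20, &base, &limit);
	printf("I/O window: 0x%04x - 0x%04x\n", base, limit);	/* 0x1000 - 0x2fff */
	return 0;
}
```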
On Tuesday 29 January 2013, Thomas Petazzoni wrote:
> And the Linux PCI resource allocation code complies with this, so that
> if I have two PCI-to-PCI bridges (each having downstream a device with
> an I/O BAR), then the first PCI-to-PCI bridge gets its I/O base address
> register set to ADDR + 0x0, and the second bridge gets its I/O base
> address set to ADDR + 0x1000. And this doesn't play well with the
> requirements of Marvell address decoding windows for PCIe I/O regions,
> which must be 64 KB aligned.

But we normally only assign a 64 KB I/O window to each PCI host bridge.
Requiring PCI bridges to be spaced 64 KB apart would mean that we cannot
actually support bridges at all.

Is this just about your "virtual" bridges? If each one has its own 64 KB
I/O range and its own configuration space, that sounds a lot like you
should make them appear as individual domains instead.

	Arnd
Dear Arnd Bergmann,

On Tue, 29 Jan 2013 16:45:07 +0000, Arnd Bergmann wrote:

> On Tuesday 29 January 2013, Thomas Petazzoni wrote:
> > And the Linux PCI resource allocation code complies with this, so
> > that if I have two PCI-to-PCI bridges (each having downstream a
> > device with an I/O BAR), then the first PCI-to-PCI bridge gets its
> > I/O base address register set to ADDR + 0x0, and the second bridge
> > gets its I/O base address set to ADDR + 0x1000. And this doesn't
> > play well with the requirements of Marvell address decoding windows
> > for PCIe I/O regions, which must be 64 KB aligned.
>
> But we normally only assign a 64 KB I/O window to each PCI host
> bridge. Requiring PCI bridges to be space 64 KB apart would mean that
> we cannot actually support bridges at all.
>
> Is this just about your "virtual" bridges? If each one has its
> own 64 KB I/O range and its own configuration space, that sounds
> a lot like you should make them appear as individual domains instead.

Yes, it is about the emulated PCI-to-PCI bridges. Each emulated
PCI-to-PCI bridge corresponds to one hardware PCIe interface, and I need
the I/O base address assigned to each PCIe interface to be aligned on a
64 KB boundary. I am not sure I understand why you think this is a
problem.

Also, what do you mean exactly by making them appear as individual
domains?

Remember that the very reason to use emulated PCI-to-PCI bridges is that
we want to assign a global range of addresses of I/O regions and a
global range of addresses of memory regions, and let the Linux PCI core
allocate from those two ranges to the different devices connected
downstream of the PCI-to-PCI bridges. This gives us for free the rather
complex allocation of addresses we need to set up our address decoding
windows.

If we have separate domains for each of our hardware PCIe interfaces,
can we still benefit from this allocation of resources from a globally
defined range of I/O addresses and memory addresses?

Thanks,

Thomas
On Tuesday 29 January 2013, Thomas Petazzoni wrote:
> Yes, it is about the emulated PCI-to-PCI bridges. Each
> emulated PCI-to-PCI bridge corresponds to one hardware PCIe interface,
> and I need the I/O base address assigned to each PCIe interface to be
> aligned on a 64 KB boundary. I am not sure to understand why you think
> this is a problem.
>
> Also, what do you mean exactly by making them appear as individual
> domains?

I mean you could make each root port look like a separate host bridge
that is not related to the others, and not have any emulated PCI-to-PCI
bridges at all.

> Remember that the very reason to use emulated PCI-to-PCI bridges is
> that we want to assign a global range of addresses of I/O regions and a
> global range of addresses of memory regions, and let the Linux PCI core
> allocate from those two ranges to the different devices connected
> downstream of the PCI-to-PCI bridges. This gives us for free the rather
> complex allocation of addresses we need to set up our address decoding
> windows.
>
> If we have have separate domains for each of our hardware PCIe
> interface, can we still benefit from this allocation of resources from
> a globally defined range of I/O addresses and memory addresses?

My interpretation of what you told me in the previous mail is that
each root port has

* A separate configuration space
* A separate 64KB I/O window that is not shared with the other ports,
  or potentially multiple 64KB windows, which we would not want to use
* A configurable range of the memory space that does not overlap
  with the other ports

Is the above a correct description?

If so, I think it would be most sensible to not try to put all ports
into the same domain, but give each port the full view of its own
256 buses, and 64KB I/O space. The memory space can still be directly
mapped, if you only set up the physical address window for that after
the bus scan is complete.

	Arnd
Dear Arnd Bergmann,

On Tue, 29 Jan 2013 20:15:21 +0000, Arnd Bergmann wrote:

> I mean you could make each root port look like a separate host
> bridge that is not related to the others, and not have any
> emulated PCI-to-PCI bridges at all.

Ok.

> > Remember that the very reason to use emulated PCI-to-PCI bridges is
> > that we want to assign a global range of addresses of I/O regions
> > and a global range of addresses of memory regions, and let the
> > Linux PCI core allocate from those two ranges to the different
> > devices connected downstream of the PCI-to-PCI bridges. This gives
> > us for free the rather complex allocation of addresses we need to
> > set up our address decoding windows.
> >
> > If we have have separate domains for each of our hardware PCIe
> > interface, can we still benefit from this allocation of resources
> > from a globally defined range of I/O addresses and memory addresses?
>
> My interpretation of what you told me in the previous mail is that
> each root port has
>
> * A separate configuration space
> * A separate 64KB I/O window that is not shared with the other ports,
>   or potentially multiple 64KB windows, which we would not want to use
> * A configurable range of the memory space that does not overlap
>   with the other ports
>
> Is the above a correct description?
>
> If so, I think it would be most sensible to not try to put all ports
> into the same domain, but give each port the full view of its own
> 256 buses, and 64KB I/O space. The memory space can still be directly
> mapped, if you only set up the physical address window for that after
> the bus scan is complete.

Does this still allow me to give the Linux PCI core *one* global range
of addresses for I/O space, and *one* global range of addresses for
memory space, and let the Linux PCI core assign ranges, within those
global ranges, to each host bridge?

This is absolutely essential for me, as I then read those allocated
ranges to configure the address decoding windows.

Basically, I have currently two suggestions:

* From Jason Gunthorpe, to not use any host bridge, and instead use
  only PCI-to-PCI bridges, one per PCIe interface.

* From you, to not use any PCI-to-PCI bridge, and use only host
  bridges, one per PCIe interface.

Would it be possible to get some consensus on this? In the review of
RFCv1, I was already told to use one global host bridge, and then one
PCI-to-PCI bridge per PCIe interface, and now we're talking about doing
something different. I'd like to avoid having to try gazillions of
different possible implementations :-)

Best regards,

Thomas
Arnd,

On Tue, 29 Jan 2013 21:33:08 +0100, Thomas Petazzoni wrote:

> Basically, I have currently two suggestions:
>
> * From Jason Gunthorpe, to not use any host bridge, and instead use
>   only PCI-to-PCI bridges, one per PCIe interface.
>
> * From you, to not use any PCI-to-PCI bridge, and use only host
>   bridges, one per PCIe interface.

Thinking more about this, this solution (using one emulated host bridge
per PCIe interface) would cause one problem: the PCIe device itself
would no longer be in slot 0.

If I'm correct, with one host bridge per PCIe interface, we would have
the following topology:

	bus 0, slot 0: emulated host bridge 0
	bus 0, slot 1: PCIe device connected to PCIe interface 0
	bus 1, slot 0: emulated host bridge 1
	bus 1, slot 1: PCIe device connected to PCIe interface 1
	bus 2, slot 0: emulated host bridge 2
	bus 2, slot 1: PCIe device connected to PCIe interface 2
	etc.

However, one of the reasons to use a PCI-to-PCI bridge was to ensure
that the PCIe devices were all listed in slot 0. According to the
Marvell engineers who work on the PCIe stuff, some new PCIe devices have
this requirement. I don't have a lot of details about this, but I was
told that most of the new Intel NICs require this, for example the
Intel X520 fiber NIC. Maybe PCIe experts (Jason?) could provide more
details about this, and confirm or refute this statement.

The usage of PCI-to-PCI bridges allows each PCIe device to be on its own
bus, at slot 0, which also solves this problem.

Best regards,

Thomas
On 01/29/2013 02:59 PM, Thomas Petazzoni wrote:
> Arnd,
>
> On Tue, 29 Jan 2013 21:33:08 +0100, Thomas Petazzoni wrote:
>
>> Basically, I have currently two suggestions:
>>
>> * From Jason Gunthorpe, to not use any host bridge, and instead use
>>   only PCI-to-PCI bridges, one per PCIe interface.
>>
>> * From you, to not use any PCI-to-PCI bridge, and use only host
>>   bridges, one per PCIe interface.
>
> Thinking more about this, this solution (using one emulated host bridge
> per PCIe interface) would cause one problem: the PCIe device itself
> would no longer be in slot 0.

I think that's device 0, not slot 0, right?

> If I'm correct, with one host bridge per PCIe interface, we would have
> the following topology:
>
> bus 0, slot 0: emulated host bridge 0
> bus 0, slot 1: PCIe device connected to PCIe interface 0

I /think/ the bus that the root port itself is on is different from the
bus that the downstream device is on, so wouldn't you end up with:

	bus 0, slot 0: emulated host bridge 0
	bus 1, slot 0: PCIe device connected to PCIe interface 0

(and isn't that "root port" not "host bridge" in the first line above?)
On Tuesday 29 January 2013, Thomas Petazzoni wrote:
> Does this still allows me to give the Linux PCI one global range of
> addresses for I/O space, and one global range of addresses for memory
> space, and the the Linux PCI core assign ranges, within those global
> ranges, to each host bridge?
>
> This is absolutely essential for me, as I then read those allocated
> ranges to configure the address decoding windows.
>
> Basically, I have currently two suggestions:
>
> * From Jason Gunthorpe, to not use any host bridge, and instead use
>   only PCI-to-PCI bridges, one per PCIe interface.
>
> * From you, to not use any PCI-to-PCI bridge, and use only host
>   bridges, one per PCIe interface.
>
> Would it be possible to get some consensus on this? In the review of
> RFCv1, I was already told to use one global host bridge, and then one
> PCI-to-PCI bridge per PCIe interface, and now we're talking about doing
> something different. I'd like to avoid having to try gazillions of
> different possible implementations :-)

I'm actually fine with either of the two suggestions you mentioned
above; whichever is easier to implement and/or more closely matches what
the hardware actually implements is better IMHO.

The part that I did not like about having emulated PCI-to-PCI bridges is
that it seems to just work around a (perceived or real) limitation in
the Linux kernel by adding a piece of infrastructure, rather than
lifting that limitation by making the kernel deal with what the hardware
provides. That reminded me of the original mach-vt8500 PCI
implementation that faked a complete PCI host bridge and a bunch of PCI
devices on it, in order to use the via-velocity ethernet controller,
instead of adding a simple 'platform_driver' struct to that driver.

	Arnd
On Tue, Jan 29, 2013 at 10:54:00PM +0000, Arnd Bergmann wrote:
> I'm actually fine with either of the two suggestions you mentioned above,
> whichever is easier to implement and/or more closely matches what the
> hardware actually implements is better IMHO.
>
> The part that I did not like about having emulated PCI-to-PCI bridges
> is that it seems to just work around a (percieved or real) limitation
> in the Linux kernel by adding a piece of infrastructure, rather than
> lifting that limitation by making the kernel deal with what the
> hardware provides. That reminded me of the original mach-vt8500

Well.. in this case there is a standard - PCI-E - for what HW vendors
are supposed to do. The kernel core code follows it and works with
compliant hardware.

Marvell HW is not compliant.

So..

Should the kernel core PCI code support this particular non-compliance?
Should the driver work around the non-compliance and present a compliant
interface to the kernel and userspace?

My take is that the kernel core PCI code is fine, and I hope this will
be an isolated issue with one family of Marvell IP. So working around
the HW problem in the driver seems best.

If we learn of many more instances like this then, yah, update the core
code and rip out this driver workaround...

Jason
On Tue, Jan 29, 2013 at 10:59:32PM +0100, Thomas Petazzoni wrote:
> Arnd,
>
> On Tue, 29 Jan 2013 21:33:08 +0100, Thomas Petazzoni wrote:
>
> > Basically, I have currently two suggestions:
> >
> > * From Jason Gunthorpe, to not use any host bridge, and instead use
> >   only PCI-to-PCI bridges, one per PCIe interface.
> >
> > * From you, to not use any PCI-to-PCI bridge, and use only host
> >   bridges, one per PCIe interface.

Arnd is suggesting to use multiple *linux* host bridges (i.e. host
drivers); there is never any need for a 'host bridge config space' as in
patch #7, in either case.

> However, one of the reason to use a PCI-to-PCI bridge was to ensure
> that the PCIe devices were all listed in slot 0. According to the
> Marvell engineers who work on the PCIe stuff, some new PCIe devices
> have this requirement. I don't have a lot of details about this, but I
> was told that most of the new Intel NICs require this, for example the
> Intel X520 fiber NIC. Maybe PCIe experts (Jason?) could provide more
> details about this, and confirm/infirm this statement.

I'm not sure what this is referring to.. I don't recall any specific
requirements in PCI-E for the device number; I think the spec requires
it to be learned based on the config TLPs received. There might be a
device number sensitivity in INTx translation, but that is defined by
the spec.

That said, if your root complex is PCI-E compliant then all downstream
end ports attached to a root port should have a device number of 0.

> The usage of PCI-to-PCI bridge allows to have each PCIe device on its
> own bus, at slot 0, which also solves this problem.

Hrm.... Looking at the docs, you will also need to change the internal
device number (probably reg 41a04 again) to something other than 0,
otherwise the Marvell itself will claim device number 0 and the
downstream end port will be device number 1. You should see this happen
today??

You should set the Marvell internal device number to something like all
ones, and then deny any Linux config register access to the all-ones
device number on the subordinate bus, to hide the Marvell end port
config space registers from Linux.

As far as this process goes, it doesn't matter which approach you take.
If you use multiple PCI domains then you'd still be able to arrange
things via the above so that the downstream device could always claim
device number 0.

Jason
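A rough sketch of the config-access filtering described above, hiding the controller's own (relocated) device number from the kernel; the function name, the choice of device number, and the bus check are assumptions, not the actual Marvell driver code.

```c
/*
 * Sketch only: a pci_ops read op that makes the controller's own device
 * number appear empty, as suggested above. Names and the chosen device
 * number are hypothetical.
 */
#include <linux/pci.h>

#define HIDDEN_DEVFN	PCI_DEVFN(31, 0)	/* assumed "all ones" device number */

static int example_rd_conf(struct pci_bus *bus, unsigned int devfn,
			   int where, int size, u32 *val)
{
	/* Pretend nothing is present at the hidden device number on the
	 * subordinate bus. */
	if (bus->number != 0 && devfn == HIDDEN_DEVFN) {
		*val = 0xffffffff;
		return PCIBIOS_DEVICE_NOT_FOUND;
	}

	/* ... otherwise forward the access to the real config space ... */
	return PCIBIOS_SUCCESSFUL;
}
```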
On Tue, Jan 29, 2013 at 04:45:07PM +0000, Arnd Bergmann wrote:
> On Tuesday 29 January 2013, Thomas Petazzoni wrote:
> > And the Linux PCI resource allocation code complies with this, so that
> > if I have two PCI-to-PCI bridges (each having downstream a device with
> > an I/O BAR), then the first PCI-to-PCI bridge gets its I/O base address
> > register set to ADDR + 0x0, and the second bridge gets its I/O base
> > address set to ADDR + 0x1000. And this doesn't play well with the
> > requirements of Marvell address decoding windows for PCIe I/O regions,
> > which must be 64 KB aligned.
>
> But we normally only assign a 64 KB I/O window to each PCI host bridge.
> Requiring PCI bridges to be space 64 KB apart would mean that we cannot
> actually support bridges at all.

The PCI resource code uses full 32 bit integers when it handles IO
addresses, so this actually does sort of work out.

However, Thomas, how did you recover the high bits of the IO window
address from the bridge configuration? Are you reading the struct
resource directly? That probably causes problems with hotplug/etc...

If you look back in your old emails I outlined a solution to this using
the MMU.

Jason
Dear Jason Gunthorpe,

On Tue, 29 Jan 2013 21:56:55 -0700, Jason Gunthorpe wrote:

> However, Thomas how did you recover the high bits of the
> IO window address from the bridge configuration? Are you reading the
> struct resource directly? That probably causes problems with
> hotplug/etc...

The PCI-to-PCI bridge configuration space has a register with the high
bits of the I/O window address. If you look at the PCI-to-PCI emulation
code, I set the bit that says "I'm a bridge capable of 32 bits
addressing of I/O addresses", and then when setting up the windows, I
reconstruct the full 32 bits address by reading the two I/O address
registers. See 3.2.5.6 in the PCI-to-PCI bridge specification:

"""
If the low four bits of the I/O Base and I/O Limit registers are 01h,
then the bridge supports 32-bit I/O address decoding, and the I/O Base
Upper 16 Bits and the I/O Limit Upper 16 Bits hold the upper 16 bits,
corresponding to AD[31::16], of the 32-bit I/O Base and I/O Limit
addresses respectively. In this case, system configuration software is
permitted to locate the I/O address range supported by the bridge
anywhere in the 4-GB I/O Space. Note that the 4-KB alignment and
granularity restrictions still apply when the bridge supports 32-bit
I/O addressing.
"""

(And my code does ensure that the low four bits of the I/O Base and I/O
Limit registers are 01h.)

Best regards,

Thomas
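For completeness, this is the reconstruction described above reduced to its arithmetic; illustrative stand-alone C, not the emulation code itself.

```c
/*
 * Rebuild a bridge's 32-bit I/O base from the 8-bit I/O Base register
 * and the I/O Base Upper 16 Bits register (PCI-to-PCI bridge spec
 * section 3.2.5.6).
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t bridge_io_base_32(uint8_t io_base_reg, uint16_t io_base_upper16)
{
	/* Low 4 bits of the I/O Base register read 0x1 for 32-bit decoding. */
	if ((io_base_reg & 0x0f) != 0x01)
		return (uint32_t)(io_base_reg & 0xf0) << 8;	/* 16-bit decoding only */

	return ((uint32_t)io_base_upper16 << 16) |
	       ((uint32_t)(io_base_reg & 0xf0) << 8);
}

int main(void)
{
	/* e.g. upper 16 bits = 0xc001, base register = 0x01 -> 0xc0010000 */
	printf("0x%08x\n", bridge_io_base_32(0x01, 0xc001));
	return 0;
}
```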
On Wednesday 30 January 2013, Jason Gunthorpe wrote:
> > But we normally only assign a 64 KB I/O window to each PCI host bridge.
> > Requiring PCI bridges to be space 64 KB apart would mean that we cannot
> > actually support bridges at all.
>
> The PCI resource code uses full 32 bit integers when it handles IO
> addresses, so this actually does sort of work out.

However, we only reserve a 1 MB (I think) virtual address window for all
I/O spaces of all PCI domains combined, at a fixed location (0xfee00000).
This means we can have at most 16 such windows at run-time. That can be
changed if necessary, but it seems like overkill when in practice you
only need a few bytes at most.

	Arnd
Dear Arnd Bergmann,

On Wed, 30 Jan 2013 09:46:53 +0000, Arnd Bergmann wrote:

> On Wednesday 30 January 2013, Jason Gunthorpe wrote:
> > > But we normally only assign a 64 KB I/O window to each PCI host bridge.
> > > Requiring PCI bridges to be space 64 KB apart would mean that we cannot
> > > actually support bridges at all.
> >
> > The PCI resource code uses full 32 bit integers when it handles IO
> > addresses, so this actually does sort of work out.
>
> However, we only reserve 1 MB (I think) virtual address window for all
> I/O spaces of all PCI domains combined, at a fixed location (0xfee00000).
> This means we can have at most 16 such windows at run-time. That can
> be changed if necessary, but it seems like overkill when in practice
> you only need a few bytes at most.

I am not sure where this 0xfee00000 address comes from, but in my case
(and I think in the Tegra PCI driver as well), we tell the Linux PCI
core from which addresses the I/O ranges should be allocated. In my DT,
I have:

	ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
		  0x00004800 0 0xd0042000 0xd0042000 0 0x00002000   /* port 2.0 registers */
		  0x00001000 0 0xd0044000 0xd0044000 0 0x00002000   /* port 0.1 registers */
		  0x00001800 0 0xd0048000 0xd0048000 0 0x00002000   /* port 0.2 registers */
		  0x00002000 0 0xd004C000 0xd004C000 0 0x00002000   /* port 0.3 registers */
		  0x00002800 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
		  0x00005000 0 0xd0082000 0xd0082000 0 0x00002000   /* port 3.0 registers */
		  0x00003000 0 0xd0084000 0xd0084000 0 0x00002000   /* port 1.1 registers */
		  0x00003800 0 0xd0088000 0xd0088000 0 0x00002000   /* port 1.2 registers */
		  0x00004000 0 0xd008C000 0xd008C000 0 0x00002000   /* port 1.3 registers */
		  0x81000000 0 0          0xc0000000 0 0x00100000   /* downstream I/O */
		  0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */

And then, the Marvell PCI driver gets the "downstream I/O" range, parses
it into a "struct resource", and then does (where &pcie->io is the
struct resource into which we parsed the "downstream I/O" range):

	pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset);
	[...]
	pci_ioremap_io(nr * SZ_64K, pcie->io.start);

And it works just fine: I get my I/O ranges allocated at 0xc0000000 for
the first device, 0xc0010000 (i.e. base address + 64 KB) for the second
device, etc. The Tegra PCI driver does exactly the same (I shamelessly
copied what Thierry has done).

I somehow have the feeling that we are looking for problems that simply
don't exist...

Best regards,

Thomas
On Wednesday 30 January 2013, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2013 at 10:54:00PM +0000, Arnd Bergmann wrote:
> > The part that I did not like about having emulated PCI-to-PCI bridges
> > is that it seems to just work around a (percieved or real) limitation
> > in the Linux kernel by adding a piece of infrastructure, rather than
> > lifting that limitation by making the kernel deal with what the
> > hardware provides. That reminded me of the original mach-vt8500
>
> Well.. in this case there is a standard - PCI-E for what HW vendors
> are supposed to do. The kernel core code follows it and works with
> compliant hardware.
>
> Marvell HW is not compliant.
>
> So..
>
> Should the kernel core PCI code support this particular non-compliance?
> Should the driver work around the non-compliance and present a
> compliant interface to the kernel and userspace?
>
> My take is the kernel core PCI code is fine, and I hope
> this will be an isolated issue with one family of Marvell IP. So
> working around the HW problem in the driver seems best.

I don't remember the kernel ever caring about whether hardware complies
with a standard or not. The kernel's job is to make hardware work, based
on the actual implementation of that hardware. In a lot of cases that
means taking the standard document as a reference, and adding quirks for
the devices that are different.

In the end, it comes down to the impact on the code complexity, and the
run-time overhead for whatever hardware is most common when adding those
quirks.

Can you (or someone else) describe what kind of changes to the core code
we would actually need to make it work without emulating the bridge?

> If we learn of many more instances like this then, yah, update the
> core code and rip out this driver work around...

But the code was specifically written to be reusable, which is normally
a good thing.

	Arnd
On Wednesday 30 January 2013, Thomas Petazzoni wrote:
> I am not sure where this 0xfee00000 address comes from, but in my case
> (and I think in the Tegra PCI driver as well), we tell the Linux PCI
> core from which addresses the I/O ranges should be allocated. In my DT,
> I have:
>
> ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000 /* port 0.0 registers */
>           0x00004800 0 0xd0042000 0xd0042000 0 0x00002000 /* port 2.0 registers */
>           0x00001000 0 0xd0044000 0xd0044000 0 0x00002000 /* port 0.1 registers */
>           0x00001800 0 0xd0048000 0xd0048000 0 0x00002000 /* port 0.2 registers */
>           0x00002000 0 0xd004C000 0xd004C000 0 0x00002000 /* port 0.3 registers */
>           0x00002800 0 0xd0080000 0xd0080000 0 0x00002000 /* port 1.0 registers */
>           0x00005000 0 0xd0082000 0xd0082000 0 0x00002000 /* port 3.0 registers */
>           0x00003000 0 0xd0084000 0xd0084000 0 0x00002000 /* port 1.1 registers */
>           0x00003800 0 0xd0088000 0xd0088000 0 0x00002000 /* port 1.2 registers */
>           0x00004000 0 0xd008C000 0xd008C000 0 0x00002000 /* port 1.3 registers */
>           0x81000000 0 0          0xc0000000 0 0x00100000 /* downstream I/O */
>           0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */
>
> And then, the Marvell PCI driver gets the "downstream I/O" range,
> parses it into a "struct resource", and then does (where &pcie->io is
> the struct resource into which we parsed the "downstream I/O" range):
>
> pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset);
> [...]
> pci_ioremap_io(nr * SZ_64K, pcie->io.start);

0xfee00000 is the platform independent virtual address that
pci_ioremap_io maps your platform specific physical address (from
pcie->io.start) to. It's defined (in the kernel I am looking at) in
asm/io.h as

	#define PCI_IO_VIRT_BASE	0xfee00000

and used by pci_ioremap_io as

	return ioremap_page_range(PCI_IO_VIRT_BASE + offset,
				  PCI_IO_VIRT_BASE + offset + SZ_64K,
				  phys_addr,
				  __pgprot(get_mem_type(MT_DEVICE)->prot_pte));

> And it works just fine, I get my I/O ranges allocated at 0xc0000000 for
> the first device, 0xc0010000 (i.e base address + 64KB) for the second
> device, etc.

(void*)0xc0000000 is the normal PAGE_OFFSET. If you map your I/O space
there, you are in big trouble because that is supposed to have the start
of your physical memory mapping.

	Arnd
Dear Arnd Bergmann,

On Wed, 30 Jan 2013 10:03:43 +0000, Arnd Bergmann wrote:

> 0xfee00000 is the platform independent virtual address that pci_ioremap_io
> maps your platform specific physical address (from pcie->io.start) to. It's
> defined (in the kernel I am looking at) in asm/io.h as
>
> #define PCI_IO_VIRT_BASE	0xfee00000
>
> and used by pci_ioremap_io as
>
> 	return ioremap_page_range(PCI_IO_VIRT_BASE + offset,
> 				  PCI_IO_VIRT_BASE + offset + SZ_64K,
> 				  phys_addr,
> 				  __pgprot(get_mem_type(MT_DEVICE)->prot_pte));
>
> > And it works just fine, I get my I/O ranges allocated at 0xc0000000 for
> > the first device, 0xc0010000 (i.e base address + 64KB) for the second
> > device, etc.
>
> (void*)0xc0000000 is the normal PAGE_OFFSET. If you map your I/O space there,
> you are in big trouble because that is supposed to have the start of your
> physical memory mapping.

Aaah, I know where the confusion comes from. You are talking about
virtual addresses, while I am talking about physical addresses.
0xC0000000 in my DT is a *physical* address.

Basically, with Marvell SoCs, we have the following behavior:

	                    -------                         --------------------
	Virtual address --> | MMU | --> Physical address --> | Address decoding | --> real hardware
	                    -------                         --------------------

The MMU is the usual stuff everybody knows about. What's more special
about Marvell SoCs is this "Address decoding" thing. Basically, instead
of having the physical addresses of things fully hardcoded and mentioned
in the datasheets, they are configurable. So for each PCIe interface,
you have to set up an address decoding window for the I/O accesses and
another address decoding window for the memory accesses. And the
physical address associated to each of these "address decoding windows"
can be freely chosen, so they must be "assigned" for each PCIe
interface.

So, my 0xC0000000 is a *physical* address in the diagram above. The fact
that it gets mapped at 0xfee00000 as a virtual address doesn't really
matter for me, I'm just fine with that.

Does that clarify things?

Thomas
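Putting the two mails together: the virtual side is fixed by the ARM core (PCI_IO_VIRT_BASE plus one 64 KB slot per pci_ioremap_io call), while the physical side is whatever the driver chose for its decoding window. A sketch, with the per-port physical layout assumed purely for illustration:

```c
/*
 * Sketch: one 64 KB I/O slot per port in the fixed virtual area, backed
 * by a per-port physical decoding window. The physical layout below is
 * an assumption for illustration, not the actual driver's choice.
 */
#include <asm/io.h>		/* PCI_IO_VIRT_BASE, pci_ioremap_io() on ARM */
#include <linux/sizes.h>

#define EXAMPLE_IO_PHYS_BASE	0xc0000000UL	/* "downstream I/O" base from the DT above */

static int example_map_port_io(unsigned int nr)
{
	/* Virtual: PCI_IO_VIRT_BASE + nr * 64K; physical: this port's window. */
	return pci_ioremap_io(nr * SZ_64K, EXAMPLE_IO_PHYS_BASE + nr * SZ_64K);
}
```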
Dear Arnd Bergmann,

On Wed, 30 Jan 2013 09:55:49 +0000, Arnd Bergmann wrote:

> I don't remember the kernel ever caring about whether hardware complies
> to a standard or not. The kernel's job is to make hardware work, based
> on the actual implementation of that hardware. In a lot of cases that
> means taking the standard document as a reference, and adding quirks
> for the devices that are different.
>
> In the end, it comes down to the impact on the code complexity, and
> the run-time overhead for whatever hardware is most common when adding
> those quirks.

This is not only about standards, it is also about re-using the PCI
resource allocation code.

In my RFCv1, sent December 7th, I wasn't using any emulated PCI-to-PCI
bridge. So it *can* perfectly work without it. However, one major
drawback of my RFCv1 version is that since I didn't know how much I/O
space and memory space was needed for each PCIe device, I had to
oversize the address decoding windows. And also, I had to have a special
allocator (certainly simple, but still) to find an available physical
address to set up each address decoding window.

Emulating a PCI-to-PCI bridge very nicely allows re-using the PCI core
resource allocation code. I think that's really the main reason for
emulating those PCI-to-PCI bridges, rather than a wish to comply with
some standard.

So what I'm going to do now is rework my patch series by removing the
emulated host bridge (which is normally mandated by the PCIe standard,
but Linux doesn't need it, so we don't care), but I'll keep the emulated
PCI-to-PCI bridges in order to benefit from the PCI core resource
allocation mechanisms.

Is this ok for you? I'd like to settle on the strategy to follow,
because we're really going down a funny road here: on December 7th, I
submitted a series that doesn't use any PCI-to-PCI bridge, and I was
told that I should emulate some. I spent a long time working on an
implementation that uses emulated PCI-to-PCI bridges, which I submitted
on Monday, only to be told that I should work really hard not to use
PCI-to-PCI bridges. I hope you can feel my little embarrassment here...

Thanks,

Thomas
On Wednesday 30 January 2013, Thomas Petazzoni wrote:
> Dear Arnd Bergmann,
>
> On Wed, 30 Jan 2013 09:55:49 +0000, Arnd Bergmann wrote:
>
> > I don't remember the kernel ever caring about whether hardware complies
> > to a standard or not. The kernel's job is to make hardware work, based
> > on the actual implementation of that hardware. In a lot of cases that
> > means taking the standard document as a reference, and adding quirks
> > for the devices that are different.
> >
> > In the end, it comes down to the impact on the code complexity, and
> > the run-time overhead for whatever hardware is most common when adding
> > those quirks.
>
> This is not only about standards, it is also about re-using the PCI
> resource allocation code.
>
> In my RFCv1, sent December, 7th, I wasn't using any emulated PCI-to-PCI
> bridge. So it *can* perfectly work without it.

Ok, I see.

> However, one major drawback of my RFCv1 version is that since I didn't
> know how much I/O space and memory space was needed for each PCIe
> device, I had to oversize the address decoding windows. And also, I had
> to have a special allocator (certainly simple, but still) to find an
> available physical address to set up each address decoding window.

Well, for the I/O space, there is no oversizing because either way you
end up with exactly 64KB per root port, right?

> Emulating a PCI-to-PCI bridge very nicely allows to re-use the PCI core
> resource allocation code. I think it's really the main reason for
> emulated those PCI-to-PCI bridges, rather than willing to comply to
> some standards.
>
> So what I'm going to do now is rework my patch series by removing the
> emulated host bridge (which is normally mandatory by PCIe standard, but
> Linux doesn't need it, so we don't care), but I'll keep the emulated
> PCI-to-PCI bridges in order to benefit for the PCI core resource
> allocation mechanisms.
>
> Is this ok for you?

Using the Linux allocator for memory resources does sound useful, so if
that requires using the emulated PCI-to-PCI bridges, I guess it's the
best compromise.

> I'd like to settle on the strategy to follow, because we're really
> going a funny road here: on December 7th, I submit a series that
> doesn't use any PCI-to-PCI bridge, and I'm told that I should emulate
> some. I spent a long time working on an implementation that uses
> emumlated PCI-to-PCI bridges, which I submitted on Monday, now to be
> told that I should work really hard not to use PCI-to-PCI bridges. I
> hope you can feel my little embarrassment here...

I'm sorry about this situation. Unfortunately the way that such
decisions are made is not always straightforward, and what seems like a
good idea at one point turns out to be a mistake or more complex than
anticipated later. With the description of the first patch set, I did
not think it would be necessary to fake a bridge device and their config
space.

What I had guessed you could do was to call pci_scan_root_bus on each of
the ports, and then set up the memory space window for the bus including
all of its child devices.

	Arnd
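For reference, a rough sketch of the per-port scan being suggested here; the helper and its parameter names are placeholders, not code from the series under review.

```c
/*
 * Sketch of the "one root bus per port" idea: scan each port as its own
 * root bus with its own resource list, then size the port's memory
 * decoding window from what was assigned. Hypothetical helper.
 */
#include <linux/pci.h>

static struct pci_bus *example_scan_port(struct device *parent,
					 struct pci_ops *port_ops,
					 void *sysdata,
					 struct list_head *port_resources)
{
	struct pci_bus *bus;

	bus = pci_scan_root_bus(parent, 0, port_ops, sysdata, port_resources);
	if (!bus)
		return NULL;

	pci_bus_assign_resources(bus);

	/*
	 * ...then program this port's address decoding window to cover
	 * the memory range that ended up assigned to the bus...
	 */
	return bus;
}
```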
Dear Arnd Bergmann,

On Wed, 30 Jan 2013 16:17:38 +0000, Arnd Bergmann wrote:

> > However, one major drawback of my RFCv1 version is that since I didn't
> > know how much I/O space and memory space was needed for each PCIe
> > device, I had to oversize the address decoding windows. And also, I had
> > to have a special allocator (certainly simple, but still) to find an
> > available physical address to set up each address decoding window.
>
> Well, for the I/O space, there is no oversizing because either way you
> end up with exactly 64KB per root port, right?

Correct, I/O space is not an issue, of course. Only the memory windows
are an issue (in terms of quantity of address space used).

That said, the PCI-to-PCI bridge solution doesn't solve the fact that
I/O addresses get assigned even though the device drivers will most
likely not use them. This means that one I/O window is consumed for each
PCIe interface, even though it is not being used in practice. And you
see, I have 10 PCIe interfaces in this SoC, and only 20 windows
available globally (not only for PCIe, but also for NAND, NOR, etc.).
But for now, I'd like to leave this potential problem on the side, and
get something working. If it seems useful to remove this problem later,
we'll work on it.

> > Emulating a PCI-to-PCI bridge very nicely allows to re-use the PCI core
> > resource allocation code. I think it's really the main reason for
> > emulated those PCI-to-PCI bridges, rather than willing to comply to
> > some standards.
> >
> > So what I'm going to do now is rework my patch series by removing the
> > emulated host bridge (which is normally mandatory by PCIe standard, but
> > Linux doesn't need it, so we don't care), but I'll keep the emulated
> > PCI-to-PCI bridges in order to benefit for the PCI core resource
> > allocation mechanisms.
> >
> > Is this ok for you?
>
> Using the Linux allocator for memory resources does sound useful,
> so if that requires using the emulated PCI-to-PCI bridges, I guess
> it's the best compromise.

Yes, that was Jason's original idea when he suggested to use PCI-to-PCI
bridges. And when I did the implementation, it really worked nicely. And
this PCI-to-PCI bridge emulation stuff is really not a big deal, look at
drivers/pci/sw-pci-pci-bridge.c: 185 lines in total, including 10 lines
of comment header at the top.

> I'm sorry about this situation. Unfortunatly the way that such decisions
> are made is not always straightforward, and what seems like a good idea
> at one point turns out to be a mistake or more complex than anticipated
> later. With the description of the first patch set, I did not think
> it would be necessary to fake a bridge device and their config space.

Sure, I understand this. I guess you also understand my slight
frustration when I propose A, I'm told to do B, I propose B, and I'm
then suggested to do A again :-) But I agree, it's part of the technical
discussion, and we can't get it right on the first shot.

> What I had guessed you could do was to call pci_scan_root_bus on
> each of the ports, and then set up the memory space window for
> the bus including all of its child devices.

But where would I read how much space is needed for the I/O and memory
regions of each bus?

Thanks,

Thomas
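The pay-off being described here is that, once the core has assigned windows to the emulated bridges, the driver only has to decode each bridge's memory base/limit registers to size its decoding window. A sketch of that decode (the register layout is from the PCI-to-PCI bridge spec; the surrounding helper is hypothetical):

```c
/*
 * Turn an emulated bridge's 16-bit Memory Base/Limit registers into a
 * base/size pair for an address decoding window. Upper 12 bits of each
 * register correspond to AD[31:20], hence the 1 MB granularity.
 */
#include <linux/pci.h>

static void example_window_from_bridge(u16 mem_base_reg, u16 mem_limit_reg,
				       phys_addr_t *base, size_t *size)
{
	phys_addr_t start = (phys_addr_t)(mem_base_reg  & 0xfff0) << 16;
	phys_addr_t end   = ((phys_addr_t)(mem_limit_reg & 0xfff0) << 16) | 0xfffff;

	*base = start;
	*size = end - start + 1;	/* always a multiple of 1 MB */
}
```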
On Wed, Jan 30, 2013 at 4:47 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> So what I'm going to do now is rework my patch series by removing the
> emulated host bridge (which is normally mandatory by PCIe standard, but
> Linux doesn't need it, so we don't care), ...

This is a tangent since you're removing the emulated host bridge anyway,
but it's been mentioned a couple of times, and I'd like to understand
this. Jason mentioned earlier in the [07/27] emulated host bridge thread
that the PCIe spec requires a host bridge at 00:00.0. I've never seen
that mentioned in the spec; can somebody point me to the actual
requirement that host bridges appear in config space?

My understanding has been that host bridges, whether PCI or PCIe, are
required to *exist*, but that the way you enumerate them and configure
them is outside the scope of the PCI/PCIe specs. I know that many chips,
especially for x86, *do* make the host bridge appear in config space,
but I've never seen a requirement for that.

Bjorn
On Wed, Jan 30, 2013 at 01:48:33PM -0700, Bjorn Helgaas wrote:
> This is a tangent since you're removing the emulated host bridge
> anyway, but it's been mentioned a couple of times, and I'd like to
> understand this. Jason mentioned earlier in the [07/27] emulated host
> bridge thread that the PCIe spec requires a host bridge at 00:00.0.
> I've never seen that mentioned in the spec; can somebody point me to
> the actual requirement that host bridges appear in config space?

Hum, a more careful search/reading brings up this:

7.2.2.1. Host Bridge Requirements

[...] The use of Host Bridge PCI class code is reserved for backwards
compatibility; host Bridge configuration space is opaque to standard PCI
Express software and may be implemented in an implementation specific
manner that is compatible with PCI Host Bridge Type 0 configuration
space. A PCI Express Host Bridge is not required to signal errors
through a Root Complex Event Collector. This support is optional for PCI
Express Host Bridges.

So, if it is present it is required to be compatible with the 'PCI Host
Bridge' stuff, but it is not mandatory.

My bad, I believe I also got confused with the spec language regarding a
'host bridge' vs a 'host bridge configuration space'.

Regards,
Jason
diff --git a/arch/arm/include/asm/mach/pci.h b/arch/arm/include/asm/mach/pci.h
index 5cf2e97..7d2c3c8 100644
--- a/arch/arm/include/asm/mach/pci.h
+++ b/arch/arm/include/asm/mach/pci.h
@@ -30,6 +30,11 @@ struct hw_pci {
 	void		(*postinit)(void);
 	u8		(*swizzle)(struct pci_dev *dev, u8 *pin);
 	int		(*map_irq)(const struct pci_dev *dev, u8 slot, u8 pin);
+	resource_size_t (*align_resource)(struct pci_dev *dev,
+					  const struct resource *res,
+					  resource_size_t start,
+					  resource_size_t size,
+					  resource_size_t align);
 };
 
 /*
@@ -51,6 +56,12 @@ struct pci_sys_data {
 	u8		(*swizzle)(struct pci_dev *, u8 *);
 					/* IRQ mapping				*/
 	int		(*map_irq)(const struct pci_dev *, u8, u8);
+					/* Resource alignment requirements	*/
+	resource_size_t (*align_resource)(struct pci_dev *dev,
+					  const struct resource *res,
+					  resource_size_t start,
+					  resource_size_t size,
+					  resource_size_t align);
 	void		*private_data;	/* platform controller private data	*/
 };
 
diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c
index 5401645..be2e6c9 100644
--- a/arch/arm/kernel/bios32.c
+++ b/arch/arm/kernel/bios32.c
@@ -462,6 +462,7 @@ static void __init pcibios_init_hw(struct hw_pci *hw, struct list_head *head)
 		sys->busnr   = busnr;
 		sys->swizzle = hw->swizzle;
 		sys->map_irq = hw->map_irq;
+		sys->align_resource = hw->align_resource;
 		INIT_LIST_HEAD(&sys->resources);
 
 		if (hw->private_data)
@@ -574,6 +575,8 @@ char * __init pcibios_setup(char *str)
 resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 				resource_size_t size, resource_size_t align)
 {
+	struct pci_dev *dev = data;
+	struct pci_sys_data *sys = dev->sysdata;
 	resource_size_t start = res->start;
 
 	if (res->flags & IORESOURCE_IO && start & 0x300)
@@ -581,6 +584,9 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 
 	start = (start + align - 1) & ~(align - 1);
 
+	if (sys->align_resource)
+		return sys->align_resource(dev, res, start, size, align);
+
 	return start;
 }
 
The PCI specification says that an I/O region must be aligned on a 4 KB
boundary, and a memory region aligned on a 1 MB boundary.

However, the Marvell PCIe interfaces rely on address decoding windows
(which associate a range of physical addresses with a given device). For
PCIe memory windows, those windows are defined with a 1 MB granularity
(which matches the PCI specs), but PCIe I/O windows can only be defined
with a 64 KB granularity, so they have to be 64 KB aligned. We therefore
need to tell the PCI core about this special alignment requirement.

The PCI core already calls pcibios_align_resource() in the ARM PCI core,
specifically for such purposes. So this patch extends the ARM PCI core
so that it calls an ->align_resource() hook registered by the PCI
driver, exactly like the existing ->map_irq() and ->swizzle() hooks.

A particular PCI driver can register an align_resource() hook, and do
its own specific alignment, depending on the specific constraints of the
underlying hardware.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Russell King <linux@arm.linux.org.uk>
---
 arch/arm/include/asm/mach/pci.h | 11 +++++++++++
 arch/arm/kernel/bios32.c        |  6 ++++++
 2 files changed, 17 insertions(+)
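To show how a controller driver could consume the new hook, here is a hedged sketch of the Marvell use case described in the commit log (64 KB I/O alignment, 1 MB memory alignment); the names are illustrative and this is not the actual mvebu driver code.

```c
/*
 * Sketch of a driver-side user of the new hw_pci->align_resource hook.
 * Function and structure names are illustrative only.
 */
#include <linux/init.h>
#include <linux/ioport.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/sizes.h>
#include <asm/mach/pci.h>

static resource_size_t example_pcie_align_resource(struct pci_dev *dev,
						   const struct resource *res,
						   resource_size_t start,
						   resource_size_t size,
						   resource_size_t align)
{
	if (res->flags & IORESOURCE_IO)
		return round_up(start, SZ_64K);	/* I/O decoding windows: 64 KB granular */
	if (res->flags & IORESOURCE_MEM)
		return round_up(start, SZ_1M);	/* memory decoding windows: 1 MB granular */
	return start;
}

static struct hw_pci example_pci __initdata = {
	/* .nr_controllers, .setup, .map_irq, ... */
	.align_resource	= example_pcie_align_resource,
};
```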