[v4] powerpc/pci: Assign fixed PHB number based on device-tree properties

Message ID 1458337746-20337-1-git-send-email-gpiccoli@linux.vnet.ibm.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Guilherme G. Piccoli March 18, 2016, 9:49 p.m. UTC
The domain/PHB field of PCI addresses has its value obtained from a
global variable, incremented each time a new domain (represented by
struct pci_controller) is added on the system. The domain addition
process happens during boot or due to PCI device hotplug.

As recent kernels are using predictable naming for network interfaces,
the network stack is more tied to PCI naming. This can be a problem in
hotplug scenarios, because PCI addresses will change if devices are
removed and then re-added. This situation seems unusual, but it can
happen if a user wants to replace a NIC without rebooting the machine,
for example.

This patch changes the way PCI domain values are generated: now, we use
device-tree properties to assign fixed PHB numbers to PCI addresses
when available (meaning pSeries and PowerNV cases). We also use a bitmap
to allow dynamic PHB numbering when device-tree properties are not
used. This bitmap keeps track of used PHB numbers and if a PHB is
released (by hotplug operations for example), it allows the reuse of
this PHB number, so the PCI address does not change when a device is
removed and re-added soon after. No functional changes were introduced.

Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

Comments

Michael Ellerman March 25, 2016, 9:33 a.m. UTC | #1
Hi Guilherme,

Some comments below ...

On Fri, 2016-18-03 at 21:49:06 UTC, "Guilherme G. Piccoli" wrote:
> The domain/PHB field of PCI addresses has its value obtained from a
> global variable, incremented each time a new domain (represented by
> struct pci_controller) is added on the system. The domain addition
> process happens during boot or due to PCI device hotplug.
> 
> As recent kernels are using predictable naming for network interfaces,
> the network stack is more tied to PCI naming. This can be a problem in
> hotplug scenarios, because PCI addresses will change if devices are
> removed and then re-added. This situation seems unusual, but it can
> happen if a user wants to replace a NIC without rebooting the machine,
> for example.
> 
> This patch changes the way PCI domain values are generated: now, we use
> device-tree properties to assign fixed PHB numbers to PCI addresses
> when available (meaning pSeries and PowerNV cases). We also use a bitmap
> to allow dynamic PHB numbering when device-tree properties are not
> used. This bitmap keeps track of used PHB numbers and if a PHB is
> released (by hotplug operations for example), it allows the reuse of
> this PHB number, so the PCI address does not change when a device is
> removed and re-added soon after. No functional changes were introduced.
> 
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/pci-common.c | 40 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 0f7a60f..bc31ac1 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -44,8 +44,11 @@
>  static DEFINE_SPINLOCK(hose_spinlock);
>  LIST_HEAD(hose_list);
>  
> -/* XXX kill that some day ... */
> -static int global_phb_number;		/* Global phb counter */
> +/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */
> +#define	MAX_PHBS	8192

Did we just make that up? It seems like a lot, but then we have some big
systems?

> +/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */

Locking? It looks like it's protected by the hose_spinlock, but you should say
that here, and also in the comment for hose_spinlock.

> +static DECLARE_BITMAP(phb_bitmap, MAX_PHBS);
>  
>  /* ISA Memory physical address */
>  resource_size_t isa_mem_base;
> @@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void)
>  }
>  EXPORT_SYMBOL(get_pci_dma_ops);
>  

There should be a comment here saying what the locking requirements are for
this function.

> +static int get_phb_number(struct device_node *dn)
> +{
> +	const __be64 *prop64;
> +	const __be32 *regs;
> +	int phb_id = 0;
> +
> +	/* try fixed PHB numbering first, by checking archs and reading
> +	 * the respective device-tree property. */
> +	if (machine_is(pseries)) {

Firstly I don't see why this check needs to be conditional on pseries. Any
machine where the PHB has a 'reg' property should be able to use 'reg' for
numbering.

> +		regs = of_get_property(dn, "reg", NULL);
> +		if (regs)
> +			return (int)(be32_to_cpu(regs[1]) & 0xFFFF);

This should use of_property_read_u32().

> +	} else if (machine_is(powernv)) {

This shouldn't be a machine check, it should just look for 'ibm,opal-phbid'
first, before 'reg'.

> +		prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
> +		if (prop64)
> +			return (int)(be64_to_cpup(prop64) & 0xFFFF);

of_property_read_u64().

> +	}

And finally in either case above, where you get a number from the device tree,
you must check that it's not already allocated. Otherwise if you have a system
where some PHBs have a property but others don't, you may give out the same
number twice. Also you could have firmware give you the same number twice
(which would be a firmware bug, but those happen).

If the number is allocated you fall back to dynamic numbering.

If it's not allocated you must mark it as allocated in the bitmap.

> +
> +	/* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */
> +	phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS);
> +	BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */
> +	set_bit(phb_id, phb_bitmap);
> +
> +	return phb_id;
> +}
> +
>  struct pci_controller *pcibios_alloc_controller(struct device_node *dev)
>  {
>  	struct pci_controller *phb;

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
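
The review points above (check a firmware-provided number against the bitmap, mark it when it is free, fall back to dynamic numbering when it is taken) can be modelled in user space. This is a minimal sketch under stated assumptions, not the kernel code: get_phb_number_model(), phb_test_and_set(), phb_first_free() and the convention of passing -1 for "no device-tree property" are hypothetical names for illustration, MAX_PHBS is set to 65536 to match the 0xFFFF mask, and callers are assumed to serialize access the way hose_spinlock does in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PHBS	65536	/* assumption: one slot per 16-bit masked value */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for phb_bitmap; callers must serialize access. */
static unsigned long phb_bitmap[MAX_PHBS / BITS_PER_WORD];

/* Set bit `id` and report whether it was already set. */
static bool phb_test_and_set(unsigned int id)
{
	unsigned long mask = 1UL << (id % BITS_PER_WORD);
	unsigned long *word = &phb_bitmap[id / BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Return the lowest clear bit, or -1 when every number is in use. */
static int phb_first_free(void)
{
	for (unsigned int id = 0; id < MAX_PHBS; id++)
		if (!(phb_bitmap[id / BITS_PER_WORD] &
		      (1UL << (id % BITS_PER_WORD))))
			return (int)id;
	return -1;
}

/*
 * fw_id is the number read from the device tree, or -1 when the node
 * has no usable property (hypothetical convention for this sketch).
 */
static int get_phb_number_model(long fw_id)
{
	int id;

	if (fw_id >= 0) {
		id = (int)(fw_id & (MAX_PHBS - 1));
		if (!phb_test_and_set((unsigned int)id))
			return id;	/* fixed number was free, now claimed */
	}

	/* No property, or the fixed number was taken: dynamic numbering. */
	id = phb_first_free();
	assert(id >= 0);		/* the patch uses BUG_ON() here */
	phb_test_and_set((unsigned int)id);
	return id;
}
```

In this model a duplicate firmware number does not produce a duplicate domain: the second caller falls through to the first free slot.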
Guilherme G. Piccoli March 28, 2016, 12:36 p.m. UTC | #2
On 03/25/2016 06:33 AM, Michael Ellerman wrote:
> Hi Guilherme,
>
> Some comments below ...

Hi Michael, thanks for the comments.


>> +/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */
>> +#define	MAX_PHBS	8192
>
> Did we just make that up? It seems like a lot, but then we have some big
> systems?

Well, this is not documented AFAICT. I asked Benjamin on IRC and he
pointed out that the PCI stack (in particular user space tools, like
lspci) can deal with at most a 16-bit domain number (meaning 65536 bits
in a bitmap). I thought that was too much, and after chatting with Gavin
we ended up with 8192 (== 1kB of memory, not too much I believe). What do
you think about this number, Michael? Should we decrease it? Or even
increase it? Below, following your last comment, I'll discuss this value
further.


>> +/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */
>
> Locking? It looks like it's protected by the hose_spinlock, but you should say
> that here, and also in the comment for hose_spinlock.
>
>> +static DECLARE_BITMAP(phb_bitmap, MAX_PHBS);
>>
>>   /* ISA Memory physical address */
>>   resource_size_t isa_mem_base;
>> @@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void)
>>   }
>>   EXPORT_SYMBOL(get_pci_dma_ops);
>>
>
> There should be a comment here saying what the locking requirements are for
> this function.

Good point Michael, I'll add the comments.


>> +static int get_phb_number(struct device_node *dn)
>> +{
>> +	const __be64 *prop64;
>> +	const __be32 *regs;
>> +	int phb_id = 0;
>> +
>> +	/* try fixed PHB numbering first, by checking archs and reading
>> +	 * the respective device-tree property. */
>> +	if (machine_is(pseries)) {
>
> Firstly I don't see why this check needs to be conditional on pseries. Any
> machine where the PHB has a 'reg' property should be able to use 'reg' for
> numbering.

I'm not sure this holds for all the powerpc sub-architectures, like
Cell; that's the reason for the check. If you are sure about this, I'll
gladly remove the check =)


>
>> +		regs = of_get_property(dn, "reg", NULL);
>> +		if (regs)
>> +			return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
>
> This should use of_property_read_u32().
>
>> +	} else if (machine_is(powernv)) {
>
> This shouldn't be a machine check, it should just look for 'ibm,opal-phbid'
> first, before 'reg'.
>
>> +		prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
>> +		if (prop64)
>> +			return (int)(be64_to_cpup(prop64) & 0xFFFF);
>
> of_property_read_u64().
>
>> +	}

OK, I'll implement these changes.


> And finally in either case above, where you get a number from the device tree,
> you must check that it's not already allocated. Otherwise if you have a system
> where some PHBs have a property but others don't, you may give out the same
> number twice. Also you could have firmware give you the same number twice
> (which would be a firmware bug, but those happen).
>
> If the number is allocated you fall back to dynamic numbering.
>
> If it's not allocated you must mark it as allocated in the bitmap.

Hmm..interesting. I thought about performing such a check, but I wasn't
able to imagine a system in which some PHBs are indexed by device-tree
properties and others aren't; it seemed impossible to me. The buggy fw
case is an example, so I can implement this modification if you think
it's valid.

But notice that, for consistency in the implementation, I might need to
increase the MAX_PHBS value to 65536, otherwise we won't cover all the
possible wrong cases, since I'm performing an AND with a 0xFFFF mask
(imagine a buggy fw exposing the same value for two different PHBs, with
this value higher than 8192). What do you think about this?

Cheers,


Guilherme
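
The sizing argument in this message can be checked with a little arithmetic. The sketch below is a user-space illustration only; bitmap_bytes() and mask_fits_bitmap() are hypothetical helpers, and the rounding to whole unsigned longs is assumed to mirror what DECLARE_BITMAP does in the kernel. It shows the two memory figures discussed in the thread and why a 0xFFFF mask can yield values that an 8192-bit bitmap has no slot for.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PHB_ID_MASK	0xFFFFu	/* mask applied to firmware values in the patch */

/* Bytes needed for a bitmap of nbits bits, rounded up to whole longs. */
static size_t bitmap_bytes(size_t nbits)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	size_t longs = (nbits + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

/*
 * Nonzero when every value that can survive the mask has a slot in a
 * bitmap of max_phbs bits.
 */
static int mask_fits_bitmap(unsigned int mask, size_t max_phbs)
{
	assert(max_phbs > 0);
	return (size_t)mask < max_phbs;
}
```

With MAX_PHBS at 8192 the bitmap costs 1024 bytes but cannot record a masked value such as 0x3000; at 65536 it costs 8192 bytes and covers the whole masked range.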

Ian Munsie April 6, 2016, 7:38 p.m. UTC | #3
Excerpts from Guilherme G. Piccoli's message of 2016-03-18 16:49:06 -0500:
> +static int get_phb_number(struct device_node *dn)

...

> +    /* try fixed PHB numbering first, by checking archs and reading
> +     * the respective device-tree property. */
> +    if (machine_is(pseries)) {
> +        regs = of_get_property(dn, "reg", NULL);
> +        if (regs)
> +            return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
> +    } else if (machine_is(powernv)) {
> +        prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
> +        if (prop64)
> +            return (int)(be64_to_cpup(prop64) & 0xFFFF);
> +    }

I think these cases should still set the bit in phb_bitmap, otherwise a
virtual PHB (e.g. as used in cxl/cxlflash) will be assigned PHB 0, and
since that is already taken it will fail - we're already seeing a
failure in Ubuntu Xenial since Canonical picked this patch up already
(though have not confirmed that this is definitely the cause yet).

There might also be some interesting races to think about here if a
virtual PHB grabs a PHB number before the real one gets a chance.

> +
> +    /* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */
> +    phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS);
> +    BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */
> +    set_bit(phb_id, phb_bitmap);

-Ian
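
The collision described above can be reproduced with a small user-space model of the v4 logic. This is a sketch under assumptions: v4_get_phb_number() is a hypothetical stand-in, a plain bool array replaces the kernel bitmap, and -1 stands for a node without a device-tree property (as a virtual PHB would be in this scenario).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PHBS	8192

static bool used[MAX_PHBS];	/* one flag per PHB number */

/*
 * Model of the v4 behaviour: a firmware-provided number is returned
 * without being recorded, so only dynamic numbers are tracked.
 */
static int v4_get_phb_number(int fw_id)
{
	if (fw_id >= 0)
		return fw_id & 0xFFFF;	/* bit is NOT set: this is the bug */

	for (int id = 0; id < MAX_PHBS; id++) {
		if (!used[id]) {
			used[id] = true;
			return id;
		}
	}
	assert(0 && "out of PHB numbers");
	return -1;
}
```

A real PHB with firmware number 0 followed by a virtual PHB without a property both receive 0 in this model, which matches the failure mode reported here.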

Guilherme G. Piccoli April 6, 2016, 9:51 p.m. UTC | #4
On 04/06/2016 04:38 PM, Ian Munsie wrote:
>> +    /* try fixed PHB numbering first, by checking archs and reading
>> +     * the respective device-tree property. */
>> +    if (machine_is(pseries)) {
>> +        regs = of_get_property(dn, "reg", NULL);
>> +        if (regs)
>> +            return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
>> +    } else if (machine_is(powernv)) {
>> +        prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
>> +        if (prop64)
>> +            return (int)(be64_to_cpup(prop64) & 0xFFFF);
>> +    }
>
> I think these cases should still set the bit in phb_bitmap, otherwise a
> virtual PHB (e.g. as used in cxl/cxlflash) will be assigned PHB 0, and
> since that is already taken it will fail - we're already seeing a
> failure in Ubuntu Xenial since Canonical picked this patch up already
> (though have not confirmed that this is definitely the cause yet).
>
> There might also be some interesting races to think about here if a
> virtual PHB grabs a PHB number before the real one gets a chance.

This is a very interesting case that I hadn't thought of before. Thanks
for pointing it out, Ian.

We can, as you suggested, set the bit in the bitmap in any case to avoid
conflicts with virtual PHBs.

And in the case where a virtual PHB grabs the number first, we just need
to add Michael's suggested check and fall back to bitmap PHB numbering.

Do you think this is enough to avoid issues with cxl's virtual PHBs?

Thanks,


Guilherme

Ian Munsie April 7, 2016, 2:08 a.m. UTC | #5
Excerpts from Guilherme G. Piccoli's message of 2016-04-06 16:51:43 -0500:
> And in the case where a virtual PHB grabs the number first, we just need
> to add Michael's suggested check and fall back to bitmap PHB numbering.
> 
> Do you think this is enough to avoid issues with cxl's virtual PHBs?

Yep, that should be fine :)

Cheers,
-Ian

Michael Ellerman May 25, 2016, 5:45 a.m. UTC | #6
Hi Guilherme,

Sorry for the very late reply, this got lost in my email filters.


On Mon, 2016-03-28 at 09:36 -0300, Guilherme G. Piccoli wrote:
> On 03/25/2016 06:33 AM, Michael Ellerman wrote:

> > > +static int get_phb_number(struct device_node *dn)
> > > +{
> > > +	const __be64 *prop64;
> > > +	const __be32 *regs;
> > > +	int phb_id = 0;
> > > +
> > > +	/* try fixed PHB numbering first, by checking archs and reading
> > > +	 * the respective device-tree property. */
> > > +	if (machine_is(pseries)) {
> > 
> > Firstly I don't see why this check needs to be conditional on pseries. Any
> > machine where the PHB has a 'reg' property should be able to use 'reg' for
> > numbering.
> 
> I'm not sure this holds for all the powerpc sub-architectures, like
> Cell; that's the reason for the check. If you are sure about this, I'll
> gladly remove the check =)

Please do.

I'll test on Cell & other platforms. If there are bugs we can fix them. Maybe
if we can't get it to work on eg. Cell then we need a machine_is() check, but
that should be the last resort.

> > > +		regs = of_get_property(dn, "reg", NULL);
> > > +		if (regs)
> > > +			return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
> > 
> > This should use of_property_read_u32().

You missed this in v5 ^

> > And finally in either case above, where you get a number from the device tree,
> > you must check that it's not already allocated. Otherwise if you have a system
> > where some PHBs have a property but others don't, you may give out the same
> > number twice. Also you could have firmware give you the same number twice
> > (which would be a firmware bug, but those happen).
> > 
> > If the number is allocated you fall back to dynamic numbering.
> > 
> > If it's not allocated you must mark it as allocated in the bitmap.
> 
> Hmm..interesting. I thought about performing such a check, but I wasn't
> able to imagine a system in which some PHBs are indexed by device-tree
> properties and others aren't; it seemed impossible to me. The buggy fw
> case is an example, so I can implement this modification if you think
> it's valid.
> 
> But notice that, for consistency in the implementation, I might need to
> increase the MAX_PHBS value to 65536, otherwise we won't cover all the
> possible wrong cases, since I'm performing an AND with a 0xFFFF mask
> (imagine a buggy fw exposing the same value for two different PHBs, with
> this value higher than 8192). What do you think about this?

Yeah please increase the bitmap size to 65536. It will only take 8KB of memory,
which is negligible.

cheers
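
One detail worth keeping in mind for the 'reg' conversion: the v4 code reads the second cell (regs[1]), while of_property_read_u32() returns the first cell of a property. The kernel also provides of_property_read_u32_index(), which takes a cell index. The user-space sketch below only illustrates the decoding itself; read_be32_cell(), phb_id_from_reg() and the sample_reg bytes are hypothetical and do not come from any real device tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-cell 'reg' property as raw big-endian bytes. */
static const uint8_t sample_reg[8] = {
	0x80, 0x00, 0x00, 0x20,	/* cell 0 */
	0x00, 0x01, 0x00, 0x12,	/* cell 1 */
};

/* Decode big-endian 32-bit cell `index`; returns 0 on success, -1 if short. */
static int read_be32_cell(const uint8_t *prop, size_t len, size_t index,
			  uint32_t *out)
{
	size_t off = index * 4;

	if (prop == NULL || len < 4 || off > len - 4)
		return -1;

	*out = ((uint32_t)prop[off] << 24) | ((uint32_t)prop[off + 1] << 16) |
	       ((uint32_t)prop[off + 2] << 8) | (uint32_t)prop[off + 3];
	return 0;
}

/* PHB number from cell 1 of 'reg', masked to 16 bits; -1 if unavailable. */
static int phb_id_from_reg(const uint8_t *prop, size_t len)
{
	uint32_t cell;

	if (read_be32_cell(prop, len, 1, &cell) != 0)
		return -1;
	return (int)(cell & 0xFFFF);
}
```

A property too short to contain the second cell is reported as unavailable here, so the caller can fall back to dynamic numbering.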

Guilherme G. Piccoli May 25, 2016, 1:03 p.m. UTC | #7
On 05/25/2016 02:45 AM, Michael Ellerman wrote:
> Hi Guilherme,
>
> Sorry for the very late reply, this got lost in my email filters.

No problem Michael, thanks for replying!


> On Mon, 2016-03-28 at 09:36 -0300, Guilherme G. Piccoli wrote:
>> On 03/25/2016 06:33 AM, Michael Ellerman wrote:
>
>>>> +static int get_phb_number(struct device_node *dn)
>>>> +{
>>>> +	const __be64 *prop64;
>>>> +	const __be32 *regs;
>>>> +	int phb_id = 0;
>>>> +
>>>> +	/* try fixed PHB numbering first, by checking archs and reading
>>>> +	 * the respective device-tree property. */
>>>> +	if (machine_is(pseries)) {
>>>
>>> Firstly I don't see why this check needs to be conditional on pseries. Any
>>> machine where the PHB has a 'reg' property should be able to use 'reg' for
>>> numbering.
>>
>> I'm not sure this holds for all the powerpc sub-architectures, like
>> Cell; that's the reason for the check. If you are sure about this, I'll
>> gladly remove the check =)
>
> Please do.
>
> I'll test on Cell & other platforms. If there are bugs we can fix them. Maybe
> if we can't get it to work on eg. Cell then we need a machine_is() check, but
> that should be the last resort.
>
>>>> +		regs = of_get_property(dn, "reg", NULL);
>>>> +		if (regs)
>>>> +			return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
>>>
>>> This should use of_property_read_u32().
>
> You missed this in v5 ^
>
>>> And finally in either case above, where you get a number from the device tree,
>>> you must check that it's not already allocated. Otherwise if you have a system
>>> where some PHBs have a property but others don't, you may give out the same
>>> number twice. Also you could have firmware give you the same number twice
>>> (which would be a firmware bug, but those happen).
>>>
>>> If the number is allocated you fall back to dynamic numbering.
>>>
>>> If it's not allocated you must mark it as allocated in the bitmap.
>>
>> Hmm..interesting. I thought about performing such a check, but I wasn't
>> able to imagine a system in which some PHBs are indexed by device-tree
>> properties and others aren't; it seemed impossible to me. The buggy fw
>> case is an example, so I can implement this modification if you think
>> it's valid.
>>
>> But notice that, for consistency in the implementation, I might need to
>> increase the MAX_PHBS value to 65536, otherwise we won't cover all the
>> possible wrong cases, since I'm performing an AND with a 0xFFFF mask
>> (imagine a buggy fw exposing the same value for two different PHBs, with
>> this value higher than 8192). What do you think about this?
>
> Yeah please increase the bitmap size to 65536. It will only take 8KB of memory,
> which is negligible.

Well, since I sent a v6 and you replied there too, I guess we can
continue our iterations there. Most of the suggestions you gave here
(all except one) were implemented in v6.

Thanks,


Guilherme

>
> cheers
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

Michael Ellerman May 26, 2016, 1 a.m. UTC | #8
On Wed, 2016-05-25 at 10:03 -0300, Guilherme G. Piccoli wrote:
> On 05/25/2016 02:45 AM, Michael Ellerman wrote:
> > 
> > Yeah please increase the bitmap size to 65536. It will only take 8KB of memory,
> > which is negligible.
> 
> Well, since I sent a v6 and you replied there too, I guess we can
> continue our iterations there. Most of the suggestions you gave here
> (all except one) were implemented in v6.

Yeah sorry, I'm very behind on email :)

cheers

Patch

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0f7a60f..bc31ac1 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -44,8 +44,11 @@ 
 static DEFINE_SPINLOCK(hose_spinlock);
 LIST_HEAD(hose_list);
 
-/* XXX kill that some day ... */
-static int global_phb_number;		/* Global phb counter */
+/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */
+#define	MAX_PHBS	8192
+
+/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */
+static DECLARE_BITMAP(phb_bitmap, MAX_PHBS);
 
 /* ISA Memory physical address */
 resource_size_t isa_mem_base;
@@ -64,6 +67,32 @@  struct dma_map_ops *get_pci_dma_ops(void)
 }
 EXPORT_SYMBOL(get_pci_dma_ops);
 
+static int get_phb_number(struct device_node *dn)
+{
+	const __be64 *prop64;
+	const __be32 *regs;
+	int phb_id = 0;
+
+	/* try fixed PHB numbering first, by checking archs and reading
+	 * the respective device-tree property. */
+	if (machine_is(pseries)) {
+		regs = of_get_property(dn, "reg", NULL);
+		if (regs)
+			return (int)(be32_to_cpu(regs[1]) & 0xFFFF);
+	} else if (machine_is(powernv)) {
+		prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
+		if (prop64)
+			return (int)(be64_to_cpup(prop64) & 0xFFFF);
+	}
+
+	/* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */
+	phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS);
+	BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */
+	set_bit(phb_id, phb_bitmap);
+
+	return phb_id;
+}
+
 struct pci_controller *pcibios_alloc_controller(struct device_node *dev)
 {
 	struct pci_controller *phb;
@@ -72,7 +101,7 @@  struct pci_controller *pcibios_alloc_controller(struct device_node *dev)
 	if (phb == NULL)
 		return NULL;
 	spin_lock(&hose_spinlock);
-	phb->global_number = global_phb_number++;
+	phb->global_number = get_phb_number(dev);
 	list_add_tail(&phb->list_node, &hose_list);
 	spin_unlock(&hose_spinlock);
 	phb->dn = dev;
@@ -94,6 +123,11 @@  EXPORT_SYMBOL_GPL(pcibios_alloc_controller);
 void pcibios_free_controller(struct pci_controller *phb)
 {
 	spin_lock(&hose_spinlock);
+
+	/* clear bit of phb_bitmap to allow reuse of this phb number */
+	if (phb->global_number < MAX_PHBS)
+		clear_bit(phb->global_number, phb_bitmap);
+
 	list_del(&phb->list_node);
 	spin_unlock(&hose_spinlock);
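
The effect of the last two hunks (allocate from the bitmap, clear the bit on free) can be contrasted with the old global counter in a user-space model. This is a sketch only: phb_alloc(), phb_free() and counter_alloc() are hypothetical names, a bool array replaces the kernel bitmap, and locking is left out on the assumption that callers serialize access as hose_spinlock does in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PHBS	8192

static bool used[MAX_PHBS];	/* stand-in for phb_bitmap */
static int global_phb_number;	/* the counter the patch removes */

/* Dynamic numbering: lowest free number, marked as used. */
static int phb_alloc(void)
{
	for (int id = 0; id < MAX_PHBS; id++) {
		if (!used[id]) {
			used[id] = true;
			return id;
		}
	}
	assert(0 && "reached maximum number of PHBs");
	return -1;
}

/* Release a number so a later allocation can reuse it. */
static void phb_free(int id)
{
	if (id >= 0 && id < MAX_PHBS)
		used[id] = false;
}

/* Old behaviour: numbers only ever grow, even after a removal. */
static int counter_alloc(void)
{
	return global_phb_number++;
}
```

After removing and re-adding the PHB numbered 1, the bitmap model hands out 1 again, whereas the counter model moves on to 3, which is the address change the commit message sets out to avoid.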