Message ID | 1458337746-20337-1-git-send-email-gpiccoli@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Hi Guilherme, Some comments below ... On Fri, 2016-18-03 at 21:49:06 UTC, "Guilherme G. Piccoli" wrote: > The domain/PHB field of PCI addresses has its value obtained from a > global variable, incremented each time a new domain (represented by > struct pci_controller) is added on the system. The domain addition > process happens during boot or due to PCI device hotplug. > > As recent kernels are using predictable naming for network interfaces, > the network stack is more tied to PCI naming. This can be a problem in > hotplug scenarios, because PCI addresses will change if devices are > removed and then re-added. This situation seems unusual, but it can > happen if a user wants to replace a NIC without rebooting the machine, > for example. > > This patch changes the way PCI domain values are generated: now, we use > device-tree properties to assign fixed PHB numbers to PCI addresses > when available (meaning pSeries and PowerNV cases). We also use a bitmap > to allow dynamic PHB numbering when device-tree properties are not > used. This bitmap keeps track of used PHB numbers and if a PHB is > released (by hotplug operations for example), it allows the reuse of > this PHB number, avoiding PCI address to change in case of device remove > and re-add soon after. No functional changes were introduced. > > Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> > --- > arch/powerpc/kernel/pci-common.c | 40 +++++++++++++++++++++++++++++++++++++--- > 1 file changed, 37 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c > index 0f7a60f..bc31ac1 100644 > --- a/arch/powerpc/kernel/pci-common.c > +++ b/arch/powerpc/kernel/pci-common.c > @@ -44,8 +44,11 @@ > static DEFINE_SPINLOCK(hose_spinlock); > LIST_HEAD(hose_list); > > -/* XXX kill that some day ... */ > -static int global_phb_number; /* Global phb counter */ > +/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */ > +#define MAX_PHBS 8192 Did we just make that up? It seems like a lot, but then we have some big systems? > +/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */ Locking? It looks like it's protected by the hose_spinlock, but you should say that here, and also in the comment for hose_spinlock. > +static DECLARE_BITMAP(phb_bitmap, MAX_PHBS); > > /* ISA Memory physical address */ > resource_size_t isa_mem_base; > @@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void) > } > EXPORT_SYMBOL(get_pci_dma_ops); > There should be a comment here saying what the locking requirements are for this function. > +static int get_phb_number(struct device_node *dn) > +{ > + const __be64 *prop64; > + const __be32 *regs; > + int phb_id = 0; > + > + /* try fixed PHB numbering first, by checking archs and reading > + * the respective device-tree property. */ > + if (machine_is(pseries)) { Firstly I don't see why this check needs to be conditional on pseries. Any machine where the PHB has a 'reg' property should be able to use 'reg' for numbering. > + regs = of_get_property(dn, "reg", NULL); > + if (regs) > + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); This should use of_property_read_u32(). > + } else if (machine_is(powernv)) { This shouldn't be a machine check, it should just look for 'ibm,opal-phbid' first, before 'reg'. > + prop64 = of_get_property(dn, "ibm,opal-phbid", NULL); > + if (prop64) > + return (int)(be64_to_cpup(prop64) & 0xFFFF); of_property_read_u64(). > + } And finally in either case above, where you get a number from the device tree, you must check that it's not already allocated. Otherwise if you have a system where some PHBs have a property but others don't, you may give out the same number twice. Also you could have firmware give you the same number twice (which would be a firmware bug, but those happen). If the number is allocated you fall back to dynamic numbering. If it's not allocated you must mark it as allocated in the bitmap. > + > + /* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */ > + phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS); > + BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */ > + set_bit(phb_id, phb_bitmap); > + > + return phb_id; > +} > + > struct pci_controller *pcibios_alloc_controller(struct device_node *dev) > { > struct pci_controller *phb; cheers -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/25/2016 06:33 AM, Michael Ellerman wrote: > Hi Guilherme, > > Some comments below ... Hi Michael, thanks for the comments. >> +/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */ >> +#define MAX_PHBS 8192 > > Did we just make that up? It seems like a lot, but then we have some big > systems? Well, this is not documented AFAICT. I asked Benjamin on IRC and he pointed me the PCI stack (in special user space tools, like lspci) would be able to deal with at most 16 bit domain number (meaning 65536 bits in a bitmap). I thought it was too much, and chatting with Gavin, we ended up with 8192 ( == 1kB of memory, not too much I believe). What do you think about this number Michael? Should we decrease? Or even increase? Below, following the last comment of yours, I'll discuss more about this value. >> +/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */ > > Locking? It looks like it's protected by the hose_spinlock, but you should say > that here, and also in the comment for hose_spinlock. > >> +static DECLARE_BITMAP(phb_bitmap, MAX_PHBS); >> >> /* ISA Memory physical address */ >> resource_size_t isa_mem_base; >> @@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void) >> } >> EXPORT_SYMBOL(get_pci_dma_ops); >> > > There should be a comment here saying what the locking requirements are for > this function. Well pointed Michael, I'll add the comments. >> +static int get_phb_number(struct device_node *dn) >> +{ >> + const __be64 *prop64; >> + const __be32 *regs; >> + int phb_id = 0; >> + >> + /* try fixed PHB numbering first, by checking archs and reading >> + * the respective device-tree property. */ >> + if (machine_is(pseries)) { > > Firstly I don't see why this check needs to be conditional on pseries. Any > machine where the PHB has a 'reg' property should be able to use 'reg' for > numbering. This is something I'm not sure for all the powerpc sub-architectures, like Cell - that's the reason of the check. If you are sure about this, I'll gladly remove this check =) > >> + regs = of_get_property(dn, "reg", NULL); >> + if (regs) >> + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); > > This should use of_property_read_u32(). > >> + } else if (machine_is(powernv)) { > > This shouldn't be a machine check, it should just look for 'ibm,opal-phbid' > first, before 'reg'. > >> + prop64 = of_get_property(dn, "ibm,opal-phbid", NULL); >> + if (prop64) >> + return (int)(be64_to_cpup(prop64) & 0xFFFF); > > of_property_read_u64(). > >> + } OK, I'll implement these changes. > And finally in either case above, where you get a number from the device tree, > you must check that it's not already allocated. Otherwise if you have a system > where some PHBs have a property but others don't, you may give out the same > number twice. Also you could have firmware give you the same number twice > (which would be a firmware bug, but those happen). > > If the number is allocated you fall back to dynamic numbering. > > If it's not allocated you must mark it as allocated in the bitmap. Hmm..interesting. I thought in performing such check, but I wasn't able to imagine a system in which we can have some PHBs indexed by device-tree properties and others don't, seemed impossible to me. The buggy fw case is an example, I can implement this modification if you think it's valid. But, notice that for consistency in implementation, I'll might need to increase the MAX_PHBS value to 65536, otherwise we won't cover all the possible wrong cases, since I'm performing an AND with 0xFFFF mask (imagine if we can have a buggy fw exposing same value for two different PHBs, and this value is higher than 8192). What do you think about this? Cheers, Guilherme -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Guilherme G. Piccoli's message of 2016-03-18 16:49:06 -0500: > +static int get_phb_number(struct device_node *dn) ... > + /* try fixed PHB numbering first, by checking archs and reading > + * the respective device-tree property. */ > + if (machine_is(pseries)) { > + regs = of_get_property(dn, "reg", NULL); > + if (regs) > + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); > + } else if (machine_is(powernv)) { > + prop64 = of_get_property(dn, "ibm,opal-phbid", NULL); > + if (prop64) > + return (int)(be64_to_cpup(prop64) & 0xFFFF); > + } I think these cases should still set the bit in phb_bitmap, otherwise a virtual PHB (e.g. as used in cxl/cxlflash) will be assigned PHB 0, and since that is already taken it will fail - we're already seeing a failure in Ubuntu Xenial since Canonical picked this patch up already (though have not confirmed that this is definitely the cause yet). There might also be some interesting races to think about here if a virtual PHB grabs a PHB number before the real one gets a chance. > + > + /* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */ > + phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS); > + BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */ > + set_bit(phb_id, phb_bitmap); -Ian -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/06/2016 04:38 PM, Ian Munsie wrote: >> + /* try fixed PHB numbering first, by checking archs and reading >> + * the respective device-tree property. */ >> + if (machine_is(pseries)) { >> + regs = of_get_property(dn, "reg", NULL); >> + if (regs) >> + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); >> + } else if (machine_is(powernv)) { >> + prop64 = of_get_property(dn, "ibm,opal-phbid", NULL); >> + if (prop64) >> + return (int)(be64_to_cpup(prop64) & 0xFFFF); >> + } > > I think these cases should still set the bit in phb_bitmap, otherwise a > virtual PHB (e.g. as used in cxl/cxlflash) will be assigned PHB 0, and > since that is already taken it will fail - we're already seeing a > failure in Ubuntu Xenial since Canonical picked this patch up already > (though have not confirmed that this is definitely the cause yet). > > There might also be some interesting races to think about here if a > virtual PHB grabs a PHB number before the real one gets a chance. This is a very interesting case I didn't think before. Thanks for pointing this Ian. We can, as you suggested, set the bitmap in any case to avoid conflicts with virtual PHBs. And in the case a virtual PHB grabs the bitmap before, we just need to add Michael's suggested check and fallback to bitmap PHB numbering in this case. Do you think this is enough to avoid issues with cxl'a virtual PHBs? Thanks, Guilherme -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Guilherme G. Piccoli's message of 2016-04-06 16:51:43 -0500: > And in the case a virtual PHB grabs the bitmap before, we just need to > add Michael's suggested check and fallback to bitmap PHB numbering in > this case. > > Do you think this is enough to avoid issues with cxl'a virtual PHBs? Yep, that should be fine :) Cheers, -Ian -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Guilherme, Sorry for the very late reply, this got lost in my email filters. On Mon, 2016-03-28 at 09:36 -0300, Guilherme G. Piccoli wrote: > On 03/25/2016 06:33 AM, Michael Ellerman wrote: > > > +static int get_phb_number(struct device_node *dn) > > > +{ > > > + const __be64 *prop64; > > > + const __be32 *regs; > > > + int phb_id = 0; > > > + > > > + /* try fixed PHB numbering first, by checking archs and reading > > > + * the respective device-tree property. */ > > > + if (machine_is(pseries)) { > > > > Firstly I don't see why this check needs to be conditional on pseries. Any > > machine where the PHB has a 'reg' property should be able to use 'reg' for > > numbering. > > This is something I'm not sure for all the powerpc sub-architectures, > like Cell - that's the reason of the check. If you are sure about this, > I'll gladly remove this check =) Please do. I'll test on Cell & other platforms. If there are bugs we can fix them. Maybe if we can't get it to work on eg. Cell then we need a machine_is() check, but that should be the last resort. > > > + regs = of_get_property(dn, "reg", NULL); > > > + if (regs) > > > + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); > > > > This should use of_property_read_u32(). You missed this in v5 ^ > > And finally in either case above, where you get a number from the device tree, > > you must check that it's not already allocated. Otherwise if you have a system > > where some PHBs have a property but others don't, you may give out the same > > number twice. Also you could have firmware give you the same number twice > > (which would be a firmware bug, but those happen). > > > > If the number is allocated you fall back to dynamic numbering. > > > > If it's not allocated you must mark it as allocated in the bitmap. > > Hmm..interesting. I thought in performing such check, but I wasn't able > to imagine a system in which we can have some PHBs indexed by > device-tree properties and others don't, seemed impossible to me. The > buggy fw case is an example, I can implement this modification if you > think it's valid. > > But, notice that for consistency in implementation, I'll might need to > increase the MAX_PHBS value to 65536, otherwise we won't cover all the > possible wrong cases, since I'm performing an AND with 0xFFFF mask > (imagine if we can have a buggy fw exposing same value for two different > PHBs, and this value is higher than 8192). What do you think about this? Yeah please increase the bitmap size to 65536. It will only take 8KB of memory, which is negligible. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/25/2016 02:45 AM, Michael Ellerman wrote: > Hi Guilherme, > > Sorry for the very late reply, this got lost in my email filters. No problem Michael, thanks for replying! > On Mon, 2016-03-28 at 09:36 -0300, Guilherme G. Piccoli wrote: >> On 03/25/2016 06:33 AM, Michael Ellerman wrote: > >>>> +static int get_phb_number(struct device_node *dn) >>>> +{ >>>> + const __be64 *prop64; >>>> + const __be32 *regs; >>>> + int phb_id = 0; >>>> + >>>> + /* try fixed PHB numbering first, by checking archs and reading >>>> + * the respective device-tree property. */ >>>> + if (machine_is(pseries)) { >>> >>> Firstly I don't see why this check needs to be conditional on pseries. Any >>> machine where the PHB has a 'reg' property should be able to use 'reg' for >>> numbering. >> >> This is something I'm not sure for all the powerpc sub-architectures, >> like Cell - that's the reason of the check. If you are sure about this, >> I'll gladly remove this check =) > > Please do. > > I'll test on Cell & other platforms. If there are bugs we can fix them. Maybe > if we can't get it to work on eg. Cell then we need a machine_is() check, but > that should be the last resort. > >>>> + regs = of_get_property(dn, "reg", NULL); >>>> + if (regs) >>>> + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); >>> >>> This should use of_property_read_u32(). > > You missed this in v5 ^ > >>> And finally in either case above, where you get a number from the device tree, >>> you must check that it's not already allocated. Otherwise if you have a system >>> where some PHBs have a property but others don't, you may give out the same >>> number twice. Also you could have firmware give you the same number twice >>> (which would be a firmware bug, but those happen). >>> >>> If the number is allocated you fall back to dynamic numbering. >>> >>> If it's not allocated you must mark it as allocated in the bitmap. >> >> Hmm..interesting. I thought in performing such check, but I wasn't able >> to imagine a system in which we can have some PHBs indexed by >> device-tree properties and others don't, seemed impossible to me. The >> buggy fw case is an example, I can implement this modification if you >> think it's valid. >> >> But, notice that for consistency in implementation, I'll might need to >> increase the MAX_PHBS value to 65536, otherwise we won't cover all the >> possible wrong cases, since I'm performing an AND with 0xFFFF mask >> (imagine if we can have a buggy fw exposing same value for two different >> PHBs, and this value is higher than 8192). What do you think about this? > > Yeah please increase the bitmap size to 65536. It will only take 8KB of memory, > which is negligible. Well, since I sent a v6 and you replied there too, I guess we can continue our iterations there - mostly suggestions (all except one) you gave here were implemented in v6. Thanks, Guilherme > > cheers > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2016-05-25 at 10:03 -0300, Guilherme G. Piccoli wrote: > On 05/25/2016 02:45 AM, Michael Ellerman wrote: > > > > Yeah please increase the bitmap size to 65536. It will only take 8KB of memory, > > which is negligible. > > Well, since I sent a v6 and you replied there too, I guess we can > continue our iterations there - mostly suggestions (all except one) you > gave here were implemented in v6. Yeah sorry, I'm very behind on email :) cheers -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 0f7a60f..bc31ac1 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -44,8 +44,11 @@ static DEFINE_SPINLOCK(hose_spinlock); LIST_HEAD(hose_list); -/* XXX kill that some day ... */ -static int global_phb_number; /* Global phb counter */ +/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */ +#define MAX_PHBS 8192 + +/* For dynamic PHB numbering: used/free PHBs tracking bitmap. */ +static DECLARE_BITMAP(phb_bitmap, MAX_PHBS); /* ISA Memory physical address */ resource_size_t isa_mem_base; @@ -64,6 +67,32 @@ struct dma_map_ops *get_pci_dma_ops(void) } EXPORT_SYMBOL(get_pci_dma_ops); +static int get_phb_number(struct device_node *dn) +{ + const __be64 *prop64; + const __be32 *regs; + int phb_id = 0; + + /* try fixed PHB numbering first, by checking archs and reading + * the respective device-tree property. */ + if (machine_is(pseries)) { + regs = of_get_property(dn, "reg", NULL); + if (regs) + return (int)(be32_to_cpu(regs[1]) & 0xFFFF); + } else if (machine_is(powernv)) { + prop64 = of_get_property(dn, "ibm,opal-phbid", NULL); + if (prop64) + return (int)(be64_to_cpup(prop64) & 0xFFFF); + } + + /* if not pSeries nor PowerNV, fallback to dynamic PHB numbering */ + phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS); + BUG_ON(phb_id >= MAX_PHBS); /* reached maximum number of PHBs */ + set_bit(phb_id, phb_bitmap); + + return phb_id; +} + struct pci_controller *pcibios_alloc_controller(struct device_node *dev) { struct pci_controller *phb; @@ -72,7 +101,7 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev) if (phb == NULL) return NULL; spin_lock(&hose_spinlock); - phb->global_number = global_phb_number++; + phb->global_number = get_phb_number(dev); list_add_tail(&phb->list_node, &hose_list); spin_unlock(&hose_spinlock); phb->dn = dev; @@ -94,6 +123,11 @@ EXPORT_SYMBOL_GPL(pcibios_alloc_controller); void pcibios_free_controller(struct pci_controller *phb) { spin_lock(&hose_spinlock); + + /* clear bit of phb_bitmap to allow reuse of this phb number */ + if (phb->global_number < MAX_PHBS) + clear_bit(phb->global_number, phb_bitmap); + list_del(&phb->list_node); spin_unlock(&hose_spinlock);