Message ID | 54F9EAA8.30007@ti.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On 03/06/2015 12:58 PM, Murali Karicheri wrote: > On 03/06/2015 11:55 AM, Guenter Roeck wrote: >> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >> [ ... ] >> >>> > From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 00:00:00 2001 >>> From: Murali Karicheri<m-karicheri2@ti.com> >>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>> Subject: [PATCH] pci: of : fix kernel crash >>> >>> This is a debug patch to root cause the kernel crash >>> >>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>> >>> PCI: Update DMA configuration from DT >>> >>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>> --- >>> drivers/of/of_pci.c | 8 ++++++++ >>> drivers/pci/host-bridge.c | 5 +++++ >>> 2 files changed, 13 insertions(+) >>> >>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>> index 86d3c38..5a59fb8 100644 >>> --- a/drivers/of/of_pci.c >>> +++ b/drivers/of/of_pci.c >>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev *pci_dev) >>> struct device *dev =&pci_dev->dev; >>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>> >>> + if (!bridge || !bridge->parent) { >>> + if (!bridge) >>> + pr_err("PCI bridge not found\n"); >>> + if (!bridge->parent) >>> + pr_err("PCI bridge parent not found\n"); >> >> You'll see a crash here if bridge is NULL. Maybe add an else before >> the second >> if statement ? Also, dev_err might be a bit more useful and would be >> available. >> > Fixed and attached. > > Murali >> Thanks, >> Guenter >> >>> + return; >>> + } >>> + >>> of_dma_configure(dev, bridge->parent->of_node); >>> pci_put_host_bridge_device(bridge); >>> } >>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>> index 3e5bbf9..ef2ab51 100644 >>> --- a/drivers/pci/host-bridge.c >>> +++ b/drivers/pci/host-bridge.c >>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>> pci_dev *dev) >>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>> struct device *bridge = root_bus->bridge; >>> >>> + if (!bridge) { >>> + pr_err("PCI: bridge not found\n"); >>> + return NULL; >>> + } >>> + >>> kobject_get(&bridge->kobj); >>> return bridge; >>> } >>> -- >>> 1.7.9.5 >>> >> > BJorn, Any chance of applying the attached debug patch to see if this fixes and provide some additional information on this BUG? Not sure who will pick this one and apply.
On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri <m-karicheri2@ti.com> wrote: > On 03/06/2015 12:58 PM, Murali Karicheri wrote: >> >> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>> >>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>> [ ... ] >>> >>>> > From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 00:00:00 2001 >>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>> Subject: [PATCH] pci: of : fix kernel crash >>>> >>>> This is a debug patch to root cause the kernel crash >>>> >>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>> >>>> PCI: Update DMA configuration from DT >>>> >>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>> --- >>>> drivers/of/of_pci.c | 8 ++++++++ >>>> drivers/pci/host-bridge.c | 5 +++++ >>>> 2 files changed, 13 insertions(+) >>>> >>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>> index 86d3c38..5a59fb8 100644 >>>> --- a/drivers/of/of_pci.c >>>> +++ b/drivers/of/of_pci.c >>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev *pci_dev) >>>> struct device *dev =&pci_dev->dev; >>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>> >>>> + if (!bridge || !bridge->parent) { >>>> + if (!bridge) >>>> + pr_err("PCI bridge not found\n"); >>>> + if (!bridge->parent) >>>> + pr_err("PCI bridge parent not found\n"); >>> >>> >>> You'll see a crash here if bridge is NULL. Maybe add an else before >>> the second >>> if statement ? Also, dev_err might be a bit more useful and would be >>> available. >>> >> Fixed and attached. >> >> Murali >>> >>> Thanks, >>> Guenter >>> >>>> + return; >>>> + } >>>> + >>>> of_dma_configure(dev, bridge->parent->of_node); >>>> pci_put_host_bridge_device(bridge); >>>> } >>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>> index 3e5bbf9..ef2ab51 100644 >>>> --- a/drivers/pci/host-bridge.c >>>> +++ b/drivers/pci/host-bridge.c >>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>>> pci_dev *dev) >>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>> struct device *bridge = root_bus->bridge; >>>> >>>> + if (!bridge) { >>>> + pr_err("PCI: bridge not found\n"); >>>> + return NULL; >>>> + } >>>> + >>>> kobject_get(&bridge->kobj); >>>> return bridge; >>>> } >>>> -- >>>> 1.7.9.5 >>>> >>> >> > BJorn, > > Any chance of applying the attached debug patch to see if this fixes and > provide some additional information on this BUG? Not sure who will pick this > one and apply. The change that caused the oops (0b2af171520e ("PCI: Update DMA configuration from DT")) only exists on my pci/iommu branch, so I'm the one to apply it. It's much easier for me to deal with plain text patches (not attachments). I'm hesitating because I don't want to encourage use of the 0-day testing robot as a tool at which we can just throw debug patches. The robot is a service that costs somebody real money, and I want to be a good neighbor when using it. Was the information in the robot's report enough to reproduce the oops? If not, is there additional information we could add to the report that would enable you to reproduce it? Even if we can't reproduce the oops, the report seems detailed enough that we should be able to deduce the problem and produce a fix in which we have high confidence. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: > On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> wrote: >> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>> >>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>> >>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>> [ ... ] >>>> >>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 00:00:00 2001 >>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>> >>>>> This is a debug patch to root cause the kernel crash >>>>> >>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>> >>>>> PCI: Update DMA configuration from DT >>>>> >>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>> --- >>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>> 2 files changed, 13 insertions(+) >>>>> >>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>> index 86d3c38..5a59fb8 100644 >>>>> --- a/drivers/of/of_pci.c >>>>> +++ b/drivers/of/of_pci.c >>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev *pci_dev) >>>>> struct device *dev =&pci_dev->dev; >>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>> >>>>> + if (!bridge || !bridge->parent) { >>>>> + if (!bridge) >>>>> + pr_err("PCI bridge not found\n"); >>>>> + if (!bridge->parent) >>>>> + pr_err("PCI bridge parent not found\n"); >>>> >>>> >>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>> the second >>>> if statement ? Also, dev_err might be a bit more useful and would be >>>> available. >>>> >>> Fixed and attached. >>> >>> Murali >>>> >>>> Thanks, >>>> Guenter >>>> >>>>> + return; >>>>> + } >>>>> + >>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>> pci_put_host_bridge_device(bridge); >>>>> } >>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>> index 3e5bbf9..ef2ab51 100644 >>>>> --- a/drivers/pci/host-bridge.c >>>>> +++ b/drivers/pci/host-bridge.c >>>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>>>> pci_dev *dev) >>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>> struct device *bridge = root_bus->bridge; >>>>> >>>>> + if (!bridge) { >>>>> + pr_err("PCI: bridge not found\n"); >>>>> + return NULL; >>>>> + } >>>>> + >>>>> kobject_get(&bridge->kobj); >>>>> return bridge; >>>>> } >>>>> -- >>>>> 1.7.9.5 >>>>> >>>> >>> >> BJorn, >> >> Any chance of applying the attached debug patch to see if this fixes and >> provide some additional information on this BUG? Not sure who will pick this >> one and apply. > > The change that caused the oops (0b2af171520e ("PCI: Update DMA > configuration from DT")) only exists on my pci/iommu branch, so I'm > the one to apply it. > > It's much easier for me to deal with plain text patches (not attachments). > > I'm hesitating because I don't want to encourage use of the 0-day > testing robot as a tool at which we can just throw debug patches. The > robot is a service that costs somebody real money, and I want to be a > good neighbor when using it. Thanks for the clarification as I don't have much information on the testing robot. At the same time the question is how similar incidence in the past have been handled. I am a newbie w.r.t to this. This is first time I have introduced a patch that impacts multiple arch/machines. > > Was the information in the robot's report enough to reproduce the > oops? If not, is there additional information we could add to the > report that would enable you to reproduce it? Even if we can't > reproduce the oops, the report seems detailed enough that we should be > able to deduce the problem and produce a fix in which we have high > confidence. The BUG report essentially indicates the crash happened in of_pci_dma_configure(). The machine specific log make sense to a person familiar with this arch and I am not familiar with the same. So anyone can help narrow down the root cause of this? Looking at the code, there are two ptr variables that are accessed without checking for NULL as initial thinking was that these can never be NULL. So the debug patch is just adding addition check before accessing the ptr. I can send this patch without debug prints if that make sense. I was thinking to get confirmation that this is indeed the case before adding the check. What do you think the right approach here? Send a patch for this to the ML for adding the check as a potential fix? Or someone can help me investigate the crash dump and root cause it? or if we can use test robot to confirm this, I can re-send the patch ASIS to the list. Please clarify. Thanks and regards, Murali > > Bjorn > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2015 08:53 AM, Murali Karicheri wrote: > On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> wrote: >>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>> >>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>> >>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>> [ ... ] >>>>> >>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 00:00:00 2001 >>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>> >>>>>> This is a debug patch to root cause the kernel crash >>>>>> >>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>> >>>>>> PCI: Update DMA configuration from DT >>>>>> >>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>> --- >>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>> 2 files changed, 13 insertions(+) >>>>>> >>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>> index 86d3c38..5a59fb8 100644 >>>>>> --- a/drivers/of/of_pci.c >>>>>> +++ b/drivers/of/of_pci.c >>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev *pci_dev) >>>>>> struct device *dev =&pci_dev->dev; >>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>> >>>>>> + if (!bridge || !bridge->parent) { >>>>>> + if (!bridge) >>>>>> + pr_err("PCI bridge not found\n"); >>>>>> + if (!bridge->parent) >>>>>> + pr_err("PCI bridge parent not found\n"); >>>>> >>>>> >>>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>>> the second >>>>> if statement ? Also, dev_err might be a bit more useful and would be >>>>> available. >>>>> >>>> Fixed and attached. >>>> >>>> Murali >>>>> >>>>> Thanks, >>>>> Guenter >>>>> >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>> pci_put_host_bridge_device(bridge); >>>>>> } >>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>> --- a/drivers/pci/host-bridge.c >>>>>> +++ b/drivers/pci/host-bridge.c >>>>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>>>>> pci_dev *dev) >>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>> struct device *bridge = root_bus->bridge; >>>>>> >>>>>> + if (!bridge) { >>>>>> + pr_err("PCI: bridge not found\n"); >>>>>> + return NULL; >>>>>> + } >>>>>> + >>>>>> kobject_get(&bridge->kobj); >>>>>> return bridge; >>>>>> } >>>>>> -- >>>>>> 1.7.9.5 >>>>>> >>>>> >>>> >>> BJorn, >>> >>> Any chance of applying the attached debug patch to see if this fixes and >>> provide some additional information on this BUG? Not sure who will pick this >>> one and apply. >> >> The change that caused the oops (0b2af171520e ("PCI: Update DMA >> configuration from DT")) only exists on my pci/iommu branch, so I'm >> the one to apply it. >> >> It's much easier for me to deal with plain text patches (not attachments). >> >> I'm hesitating because I don't want to encourage use of the 0-day >> testing robot as a tool at which we can just throw debug patches. The >> robot is a service that costs somebody real money, and I want to be a >> good neighbor when using it. > > Thanks for the clarification as I don't have much information on the testing robot. At the same time the question is how similar incidence in the past have been handled. I am a newbie w.r.t to this. This is first time I have introduced a patch that impacts multiple arch/machines. > >> >> Was the information in the robot's report enough to reproduce the >> oops? If not, is there additional information we could add to the >> report that would enable you to reproduce it? Even if we can't >> reproduce the oops, the report seems detailed enough that we should be >> able to deduce the problem and produce a fix in which we have high >> confidence. > > The BUG report essentially indicates the crash happened in of_pci_dma_configure(). The machine specific log make sense to a person familiar with this arch and I am not familiar with the same. So anyone can help narrow down the root cause of this? > > Looking at the code, there are two ptr variables that are accessed without checking for NULL as initial thinking was that these can never be NULL. So the debug patch is just adding addition check before accessing the ptr. I can send this patch without debug prints if that make sense. I was thinking to get confirmation that this is indeed the case before adding the check. What do you think the right approach here? Send a patch for this to the ML for adding the check as a potential fix? Or someone can help me investigate the crash dump and root cause it? or if we can use test robot to confirm this, I can re-send the patch ASIS to the list. Please clarify. > If the assumption is that the pointers can never be NULL, wouldn't it be important to see a call trace and to find out if the NULL pointers can actually be seen by design, or if there is some other bug ? I am a bit concerned that adding those NULL pointer checks might end up hiding some other bug, ie that they just hide the real bug without fixing it. Thanks, Guenter -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2015 12:07 PM, Guenter Roeck wrote: > On 03/09/2015 08:53 AM, Murali Karicheri wrote: >> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> >>> wrote: >>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>> >>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>> >>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>> [ ... ] >>>>>> >>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>> 00:00:00 2001 >>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>> >>>>>>> This is a debug patch to root cause the kernel crash >>>>>>> >>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>> >>>>>>> PCI: Update DMA configuration from DT >>>>>>> >>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>> --- >>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>> 2 files changed, 13 insertions(+) >>>>>>> >>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>> --- a/drivers/of/of_pci.c >>>>>>> +++ b/drivers/of/of_pci.c >>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>> *pci_dev) >>>>>>> struct device *dev =&pci_dev->dev; >>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>> >>>>>>> + if (!bridge || !bridge->parent) { >>>>>>> + if (!bridge) >>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>> + if (!bridge->parent) >>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>> >>>>>> >>>>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>>>> the second >>>>>> if statement ? Also, dev_err might be a bit more useful and would be >>>>>> available. >>>>>> >>>>> Fixed and attached. >>>>> >>>>> Murali >>>>>> >>>>>> Thanks, >>>>>> Guenter >>>>>> >>>>>>> + return; >>>>>>> + } >>>>>>> + >>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>> pci_put_host_bridge_device(bridge); >>>>>>> } >>>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>>>>>> pci_dev *dev) >>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>> struct device *bridge = root_bus->bridge; >>>>>>> >>>>>>> + if (!bridge) { >>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>> + return NULL; >>>>>>> + } >>>>>>> + >>>>>>> kobject_get(&bridge->kobj); >>>>>>> return bridge; >>>>>>> } >>>>>>> -- >>>>>>> 1.7.9.5 >>>>>>> >>>>>> >>>>> >>>> BJorn, >>>> >>>> Any chance of applying the attached debug patch to see if this fixes >>>> and >>>> provide some additional information on this BUG? Not sure who will >>>> pick this >>>> one and apply. >>> >>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>> the one to apply it. >>> >>> It's much easier for me to deal with plain text patches (not >>> attachments). >>> >>> I'm hesitating because I don't want to encourage use of the 0-day >>> testing robot as a tool at which we can just throw debug patches. The >>> robot is a service that costs somebody real money, and I want to be a >>> good neighbor when using it. >> >> Thanks for the clarification as I don't have much information on the >> testing robot. At the same time the question is how similar incidence >> in the past have been handled. I am a newbie w.r.t to this. This is >> first time I have introduced a patch that impacts multiple arch/machines. >> >>> >>> Was the information in the robot's report enough to reproduce the >>> oops? If not, is there additional information we could add to the >>> report that would enable you to reproduce it? Even if we can't >>> reproduce the oops, the report seems detailed enough that we should be >>> able to deduce the problem and produce a fix in which we have high >>> confidence. >> >> The BUG report essentially indicates the crash happened in >> of_pci_dma_configure(). The machine specific log make sense to a >> person familiar with this arch and I am not familiar with the same. So >> anyone can help narrow down the root cause of this? >> >> Looking at the code, there are two ptr variables that are accessed >> without checking for NULL as initial thinking was that these can never >> be NULL. So the debug patch is just adding addition check before >> accessing the ptr. I can send this patch without debug prints if that >> make sense. I was thinking to get confirmation that this is indeed the >> case before adding the check. What do you think the right approach >> here? Send a patch for this to the ML for adding the check as a >> potential fix? Or someone can help me investigate the crash dump and >> root cause it? or if we can use test robot to confirm this, I can >> re-send the patch ASIS to the list. Please clarify. >> > If the assumption is that the pointers can never be NULL, > wouldn't it be important to see a call trace and to find out > if the NULL pointers can actually be seen by design, > or if there is some other bug ? Call trace shows [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 And BUG seems to be in of_pci_dma_configure() as shown in the BUG report. of_pci_dma_configure() calls newly added API call to pci_get_host_bridge_device(). Seems like this has succeeded which means bridge is non NULL IMO. However in this function it passes bridge->parent->of_node to of_dma_configure(). So I suspect bridge->parent is NULL for some reason. Is there a chance for parent being NULL in this or any other platform? > > I am a bit concerned that adding those NULL pointer checks > might end up hiding some other bug, ie that they just hide > the real bug without fixing it. > I agree with you as well. That is the reason I had added the debug prints in the attached patch to check what is NULL here and that can help us dig into this more. If BJorn can accept this debug patch for finding this, that will help. I can re-send it to the list as a debug patch if everyone agrees. Otherwise I don't know how to proceed from here. Thanks and regards, Murali > Thanks, > Guenter >
On 03/09/2015 10:03 AM, Murali Karicheri wrote: > On 03/09/2015 12:07 PM, Guenter Roeck wrote: >> On 03/09/2015 08:53 AM, Murali Karicheri wrote: >>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> >>>> wrote: >>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>>> >>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>>> >>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>>> [ ... ] >>>>>>> >>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>>> 00:00:00 2001 >>>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>>> >>>>>>>> This is a debug patch to root cause the kernel crash >>>>>>>> >>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>>> >>>>>>>> PCI: Update DMA configuration from DT >>>>>>>> >>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>> --- >>>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>>> 2 files changed, 13 insertions(+) >>>>>>>> >>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>>> --- a/drivers/of/of_pci.c >>>>>>>> +++ b/drivers/of/of_pci.c >>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>>> *pci_dev) >>>>>>>> struct device *dev =&pci_dev->dev; >>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>>> >>>>>>>> + if (!bridge || !bridge->parent) { >>>>>>>> + if (!bridge) >>>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>>> + if (!bridge->parent) >>>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>>> >>>>>>> >>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>>>>> the second >>>>>>> if statement ? Also, dev_err might be a bit more useful and would be >>>>>>> available. >>>>>>> >>>>>> Fixed and attached. >>>>>> >>>>>> Murali >>>>>>> >>>>>>> Thanks, >>>>>>> Guenter >>>>>>> >>>>>>>> + return; >>>>>>>> + } >>>>>>>> + >>>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>>> pci_put_host_bridge_device(bridge); >>>>>>>> } >>>>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct >>>>>>>> pci_dev *dev) >>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>>> struct device *bridge = root_bus->bridge; >>>>>>>> >>>>>>>> + if (!bridge) { >>>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>>> + return NULL; >>>>>>>> + } >>>>>>>> + >>>>>>>> kobject_get(&bridge->kobj); >>>>>>>> return bridge; >>>>>>>> } >>>>>>>> -- >>>>>>>> 1.7.9.5 >>>>>>>> >>>>>>> >>>>>> >>>>> BJorn, >>>>> >>>>> Any chance of applying the attached debug patch to see if this fixes >>>>> and >>>>> provide some additional information on this BUG? Not sure who will >>>>> pick this >>>>> one and apply. >>>> >>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>>> the one to apply it. >>>> >>>> It's much easier for me to deal with plain text patches (not >>>> attachments). >>>> >>>> I'm hesitating because I don't want to encourage use of the 0-day >>>> testing robot as a tool at which we can just throw debug patches. The >>>> robot is a service that costs somebody real money, and I want to be a >>>> good neighbor when using it. >>> >>> Thanks for the clarification as I don't have much information on the >>> testing robot. At the same time the question is how similar incidence >>> in the past have been handled. I am a newbie w.r.t to this. This is >>> first time I have introduced a patch that impacts multiple arch/machines. >>> >>>> >>>> Was the information in the robot's report enough to reproduce the >>>> oops? If not, is there additional information we could add to the >>>> report that would enable you to reproduce it? Even if we can't >>>> reproduce the oops, the report seems detailed enough that we should be >>>> able to deduce the problem and produce a fix in which we have high >>>> confidence. >>> >>> The BUG report essentially indicates the crash happened in >>> of_pci_dma_configure(). The machine specific log make sense to a >>> person familiar with this arch and I am not familiar with the same. So >>> anyone can help narrow down the root cause of this? >>> >>> Looking at the code, there are two ptr variables that are accessed >>> without checking for NULL as initial thinking was that these can never >>> be NULL. So the debug patch is just adding addition check before >>> accessing the ptr. I can send this patch without debug prints if that >>> make sense. I was thinking to get confirmation that this is indeed the >>> case before adding the check. What do you think the right approach >>> here? Send a patch for this to the ML for adding the check as a >>> potential fix? Or someone can help me investigate the crash dump and >>> root cause it? or if we can use test robot to confirm this, I can >>> re-send the patch ASIS to the list. Please clarify. >>> >> If the assumption is that the pointers can never be NULL, >> wouldn't it be important to see a call trace and to find out >> if the NULL pointers can actually be seen by design, >> or if there is some other bug ? > > Call trace shows > > [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 > [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 > > > And BUG seems to be in of_pci_dma_configure() as shown in the BUG report. > > of_pci_dma_configure() calls newly added API call to pci_get_host_bridge_device(). Seems like this has succeeded which means bridge is non NULL IMO. However in this function it passes bridge->parent->of_node to of_dma_configure(). So I suspect bridge->parent is NULL for some reason. Is there a chance for parent being NULL in this or any other platform? > Can bridge be the root bridge ? Guenter >> >> I am a bit concerned that adding those NULL pointer checks >> might end up hiding some other bug, ie that they just hide >> the real bug without fixing it. >> > > I agree with you as well. That is the reason I had added the debug prints in the attached patch to check what is NULL here and that can help us dig into this more. > > If BJorn can accept this debug patch for finding this, that will help. I can re-send it to the list as a debug patch if everyone agrees. Otherwise I don't know how to proceed from here. > > Thanks and regards, > > Murali >> Thanks, >> Guenter >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2015 01:34 PM, Guenter Roeck wrote: > On 03/09/2015 10:03 AM, Murali Karicheri wrote: >> On 03/09/2015 12:07 PM, Guenter Roeck wrote: >>> On 03/09/2015 08:53 AM, Murali Karicheri wrote: >>>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> >>>>> wrote: >>>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>>>> >>>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>>>> >>>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>>>> [ ... ] >>>>>>>> >>>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>>>> 00:00:00 2001 >>>>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>>>> >>>>>>>>> This is a debug patch to root cause the kernel crash >>>>>>>>> >>>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>>>> >>>>>>>>> PCI: Update DMA configuration from DT >>>>>>>>> >>>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>> --- >>>>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>>>> 2 files changed, 13 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>>>> --- a/drivers/of/of_pci.c >>>>>>>>> +++ b/drivers/of/of_pci.c >>>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>>>> *pci_dev) >>>>>>>>> struct device *dev =&pci_dev->dev; >>>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>>>> >>>>>>>>> + if (!bridge || !bridge->parent) { >>>>>>>>> + if (!bridge) >>>>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>>>> + if (!bridge->parent) >>>>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>>>> >>>>>>>> >>>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>>>>>> the second >>>>>>>> if statement ? Also, dev_err might be a bit more useful and >>>>>>>> would be >>>>>>>> available. >>>>>>>> >>>>>>> Fixed and attached. >>>>>>> >>>>>>> Murali >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Guenter >>>>>>>> >>>>>>>>> + return; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>>>> pci_put_host_bridge_device(bridge); >>>>>>>>> } >>>>>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>>>> @@ -28,6 +28,11 @@ struct device >>>>>>>>> *pci_get_host_bridge_device(struct >>>>>>>>> pci_dev *dev) >>>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>>>> struct device *bridge = root_bus->bridge; >>>>>>>>> >>>>>>>>> + if (!bridge) { >>>>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>>>> + return NULL; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> kobject_get(&bridge->kobj); >>>>>>>>> return bridge; >>>>>>>>> } >>>>>>>>> -- >>>>>>>>> 1.7.9.5 >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> BJorn, >>>>>> >>>>>> Any chance of applying the attached debug patch to see if this fixes >>>>>> and >>>>>> provide some additional information on this BUG? Not sure who will >>>>>> pick this >>>>>> one and apply. >>>>> >>>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>>>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>>>> the one to apply it. >>>>> >>>>> It's much easier for me to deal with plain text patches (not >>>>> attachments). >>>>> >>>>> I'm hesitating because I don't want to encourage use of the 0-day >>>>> testing robot as a tool at which we can just throw debug patches. The >>>>> robot is a service that costs somebody real money, and I want to be a >>>>> good neighbor when using it. >>>> >>>> Thanks for the clarification as I don't have much information on the >>>> testing robot. At the same time the question is how similar incidence >>>> in the past have been handled. I am a newbie w.r.t to this. This is >>>> first time I have introduced a patch that impacts multiple >>>> arch/machines. >>>> >>>>> >>>>> Was the information in the robot's report enough to reproduce the >>>>> oops? If not, is there additional information we could add to the >>>>> report that would enable you to reproduce it? Even if we can't >>>>> reproduce the oops, the report seems detailed enough that we should be >>>>> able to deduce the problem and produce a fix in which we have high >>>>> confidence. >>>> >>>> The BUG report essentially indicates the crash happened in >>>> of_pci_dma_configure(). The machine specific log make sense to a >>>> person familiar with this arch and I am not familiar with the same. So >>>> anyone can help narrow down the root cause of this? >>>> >>>> Looking at the code, there are two ptr variables that are accessed >>>> without checking for NULL as initial thinking was that these can never >>>> be NULL. So the debug patch is just adding addition check before >>>> accessing the ptr. I can send this patch without debug prints if that >>>> make sense. I was thinking to get confirmation that this is indeed the >>>> case before adding the check. What do you think the right approach >>>> here? Send a patch for this to the ML for adding the check as a >>>> potential fix? Or someone can help me investigate the crash dump and >>>> root cause it? or if we can use test robot to confirm this, I can >>>> re-send the patch ASIS to the list. Please clarify. >>>> >>> If the assumption is that the pointers can never be NULL, >>> wouldn't it be important to see a call trace and to find out >>> if the NULL pointers can actually be seen by design, >>> or if there is some other bug ? >> >> Call trace shows >> >> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >> >> >> And BUG seems to be in of_pci_dma_configure() as shown in the BUG report. >> >> of_pci_dma_configure() calls newly added API call to >> pci_get_host_bridge_device(). Seems like this has succeeded which >> means bridge is non NULL IMO. However in this function it passes >> bridge->parent->of_node to of_dma_configure(). So I suspect >> bridge->parent is NULL for some reason. Is there a chance for parent >> being NULL in this or any other platform? >> > > Can bridge be the root bridge ? Going by the code below, bridge is assigned the ptr to bridge on the root bus. +struct device *pci_get_host_bridge_device(struct pci_dev *dev) +{ + struct pci_bus *root_bus = find_pci_root_bus(dev->bus); + struct device *bridge = root_bus->bridge; + + kobject_get(&bridge->kobj); + return bridge; +} + So to answer your question, yes it is the root bridge.
On 03/09/2015 11:09 AM, Murali Karicheri wrote: > On 03/09/2015 01:34 PM, Guenter Roeck wrote: >> On 03/09/2015 10:03 AM, Murali Karicheri wrote: >>> On 03/09/2015 12:07 PM, Guenter Roeck wrote: >>>> On 03/09/2015 08:53 AM, Murali Karicheri wrote: >>>>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>>>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@ti.com> >>>>>> wrote: >>>>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>>>>> >>>>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>>>>> >>>>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>>>>> [ ... ] >>>>>>>>> >>>>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>>>>> 00:00:00 2001 >>>>>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>>>>> >>>>>>>>>> This is a debug patch to root cause the kernel crash >>>>>>>>>> >>>>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>>>>> >>>>>>>>>> PCI: Update DMA configuration from DT >>>>>>>>>> >>>>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>> --- >>>>>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>>>>> 2 files changed, 13 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>>>>> --- a/drivers/of/of_pci.c >>>>>>>>>> +++ b/drivers/of/of_pci.c >>>>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>>>>> *pci_dev) >>>>>>>>>> struct device *dev =&pci_dev->dev; >>>>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>>>>> >>>>>>>>>> + if (!bridge || !bridge->parent) { >>>>>>>>>> + if (!bridge) >>>>>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>>>>> + if (!bridge->parent) >>>>>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>>>>> >>>>>>>>> >>>>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else before >>>>>>>>> the second >>>>>>>>> if statement ? Also, dev_err might be a bit more useful and >>>>>>>>> would be >>>>>>>>> available. >>>>>>>>> >>>>>>>> Fixed and attached. >>>>>>>> >>>>>>>> Murali >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Guenter >>>>>>>>> >>>>>>>>>> + return; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>>>>> pci_put_host_bridge_device(bridge); >>>>>>>>>> } >>>>>>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c >>>>>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>>>>> @@ -28,6 +28,11 @@ struct device >>>>>>>>>> *pci_get_host_bridge_device(struct >>>>>>>>>> pci_dev *dev) >>>>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>>>>> struct device *bridge = root_bus->bridge; >>>>>>>>>> >>>>>>>>>> + if (!bridge) { >>>>>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>>>>> + return NULL; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> kobject_get(&bridge->kobj); >>>>>>>>>> return bridge; >>>>>>>>>> } >>>>>>>>>> -- >>>>>>>>>> 1.7.9.5 >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> BJorn, >>>>>>> >>>>>>> Any chance of applying the attached debug patch to see if this fixes >>>>>>> and >>>>>>> provide some additional information on this BUG? Not sure who will >>>>>>> pick this >>>>>>> one and apply. >>>>>> >>>>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>>>>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>>>>> the one to apply it. >>>>>> >>>>>> It's much easier for me to deal with plain text patches (not >>>>>> attachments). >>>>>> >>>>>> I'm hesitating because I don't want to encourage use of the 0-day >>>>>> testing robot as a tool at which we can just throw debug patches. The >>>>>> robot is a service that costs somebody real money, and I want to be a >>>>>> good neighbor when using it. >>>>> >>>>> Thanks for the clarification as I don't have much information on the >>>>> testing robot. At the same time the question is how similar incidence >>>>> in the past have been handled. I am a newbie w.r.t to this. This is >>>>> first time I have introduced a patch that impacts multiple >>>>> arch/machines. >>>>> >>>>>> >>>>>> Was the information in the robot's report enough to reproduce the >>>>>> oops? If not, is there additional information we could add to the >>>>>> report that would enable you to reproduce it? Even if we can't >>>>>> reproduce the oops, the report seems detailed enough that we should be >>>>>> able to deduce the problem and produce a fix in which we have high >>>>>> confidence. >>>>> >>>>> The BUG report essentially indicates the crash happened in >>>>> of_pci_dma_configure(). The machine specific log make sense to a >>>>> person familiar with this arch and I am not familiar with the same. So >>>>> anyone can help narrow down the root cause of this? >>>>> >>>>> Looking at the code, there are two ptr variables that are accessed >>>>> without checking for NULL as initial thinking was that these can never >>>>> be NULL. So the debug patch is just adding addition check before >>>>> accessing the ptr. I can send this patch without debug prints if that >>>>> make sense. I was thinking to get confirmation that this is indeed the >>>>> case before adding the check. What do you think the right approach >>>>> here? Send a patch for this to the ML for adding the check as a >>>>> potential fix? Or someone can help me investigate the crash dump and >>>>> root cause it? or if we can use test robot to confirm this, I can >>>>> re-send the patch ASIS to the list. Please clarify. >>>>> >>>> If the assumption is that the pointers can never be NULL, >>>> wouldn't it be important to see a call trace and to find out >>>> if the NULL pointers can actually be seen by design, >>>> or if there is some other bug ? >>> >>> Call trace shows >>> >>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>> >>> >>> And BUG seems to be in of_pci_dma_configure() as shown in the BUG report. >>> >>> of_pci_dma_configure() calls newly added API call to >>> pci_get_host_bridge_device(). Seems like this has succeeded which >>> means bridge is non NULL IMO. However in this function it passes >>> bridge->parent->of_node to of_dma_configure(). So I suspect >>> bridge->parent is NULL for some reason. Is there a chance for parent >>> being NULL in this or any other platform? >>> >> >> Can bridge be the root bridge ? > > Going by the code below, bridge is assigned the ptr to bridge on the root bus. > > +struct device *pci_get_host_bridge_device(struct pci_dev *dev) > +{ > + struct pci_bus *root_bus = find_pci_root_bus(dev->bus); > + struct device *bridge = root_bus->bridge; > + > + kobject_get(&bridge->kobj); > + return bridge; > +} > + > > So to answer your question, yes it is the root bridge. > AFAIK the root bridge does not have a parent. Guenter -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2015 02:12 PM, Guenter Roeck wrote: > On 03/09/2015 11:09 AM, Murali Karicheri wrote: >> On 03/09/2015 01:34 PM, Guenter Roeck wrote: >>> On 03/09/2015 10:03 AM, Murali Karicheri wrote: >>>> On 03/09/2015 12:07 PM, Guenter Roeck wrote: >>>>> On 03/09/2015 08:53 AM, Murali Karicheri wrote: >>>>>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>>>>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali >>>>>>> Karicheri<m-karicheri2@ti.com> >>>>>>> wrote: >>>>>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>>>>>> >>>>>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>>>>>> >>>>>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>>>>>> [ ... ] >>>>>>>>>> >>>>>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>>>>>> 00:00:00 2001 >>>>>>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>>>>>> >>>>>>>>>>> This is a debug patch to root cause the kernel crash >>>>>>>>>>> >>>>>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>>>>>> >>>>>>>>>>> PCI: Update DMA configuration from DT >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>>> --- >>>>>>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>>>>>> 2 files changed, 13 insertions(+) >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>>>>>> --- a/drivers/of/of_pci.c >>>>>>>>>>> +++ b/drivers/of/of_pci.c >>>>>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>>>>>> *pci_dev) >>>>>>>>>>> struct device *dev =&pci_dev->dev; >>>>>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>>>>>> >>>>>>>>>>> + if (!bridge || !bridge->parent) { >>>>>>>>>>> + if (!bridge) >>>>>>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>>>>>> + if (!bridge->parent) >>>>>>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else >>>>>>>>>> before >>>>>>>>>> the second >>>>>>>>>> if statement ? Also, dev_err might be a bit more useful and >>>>>>>>>> would be >>>>>>>>>> available. >>>>>>>>>> >>>>>>>>> Fixed and attached. >>>>>>>>> >>>>>>>>> Murali >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Guenter >>>>>>>>>> >>>>>>>>>>> + return; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>>>>>> pci_put_host_bridge_device(bridge); >>>>>>>>>>> } >>>>>>>>>>> diff --git a/drivers/pci/host-bridge.c >>>>>>>>>>> b/drivers/pci/host-bridge.c >>>>>>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>>>>>> @@ -28,6 +28,11 @@ struct device >>>>>>>>>>> *pci_get_host_bridge_device(struct >>>>>>>>>>> pci_dev *dev) >>>>>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>>>>>> struct device *bridge = root_bus->bridge; >>>>>>>>>>> >>>>>>>>>>> + if (!bridge) { >>>>>>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>>>>>> + return NULL; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> kobject_get(&bridge->kobj); >>>>>>>>>>> return bridge; >>>>>>>>>>> } >>>>>>>>>>> -- >>>>>>>>>>> 1.7.9.5 >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> BJorn, >>>>>>>> >>>>>>>> Any chance of applying the attached debug patch to see if this >>>>>>>> fixes >>>>>>>> and >>>>>>>> provide some additional information on this BUG? Not sure who will >>>>>>>> pick this >>>>>>>> one and apply. >>>>>>> >>>>>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>>>>>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>>>>>> the one to apply it. >>>>>>> >>>>>>> It's much easier for me to deal with plain text patches (not >>>>>>> attachments). >>>>>>> >>>>>>> I'm hesitating because I don't want to encourage use of the 0-day >>>>>>> testing robot as a tool at which we can just throw debug patches. >>>>>>> The >>>>>>> robot is a service that costs somebody real money, and I want to >>>>>>> be a >>>>>>> good neighbor when using it. >>>>>> >>>>>> Thanks for the clarification as I don't have much information on the >>>>>> testing robot. At the same time the question is how similar incidence >>>>>> in the past have been handled. I am a newbie w.r.t to this. This is >>>>>> first time I have introduced a patch that impacts multiple >>>>>> arch/machines. >>>>>> >>>>>>> >>>>>>> Was the information in the robot's report enough to reproduce the >>>>>>> oops? If not, is there additional information we could add to the >>>>>>> report that would enable you to reproduce it? Even if we can't >>>>>>> reproduce the oops, the report seems detailed enough that we >>>>>>> should be >>>>>>> able to deduce the problem and produce a fix in which we have high >>>>>>> confidence. >>>>>> >>>>>> The BUG report essentially indicates the crash happened in >>>>>> of_pci_dma_configure(). The machine specific log make sense to a >>>>>> person familiar with this arch and I am not familiar with the >>>>>> same. So >>>>>> anyone can help narrow down the root cause of this? >>>>>> >>>>>> Looking at the code, there are two ptr variables that are accessed >>>>>> without checking for NULL as initial thinking was that these can >>>>>> never >>>>>> be NULL. So the debug patch is just adding addition check before >>>>>> accessing the ptr. I can send this patch without debug prints if that >>>>>> make sense. I was thinking to get confirmation that this is indeed >>>>>> the >>>>>> case before adding the check. What do you think the right approach >>>>>> here? Send a patch for this to the ML for adding the check as a >>>>>> potential fix? Or someone can help me investigate the crash dump and >>>>>> root cause it? or if we can use test robot to confirm this, I can >>>>>> re-send the patch ASIS to the list. Please clarify. >>>>>> >>>>> If the assumption is that the pointers can never be NULL, >>>>> wouldn't it be important to see a call trace and to find out >>>>> if the NULL pointers can actually be seen by design, >>>>> or if there is some other bug ? >>>> >>>> Call trace shows >>>> >>>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>>> >>>> >>>> And BUG seems to be in of_pci_dma_configure() as shown in the BUG >>>> report. >>>> >>>> of_pci_dma_configure() calls newly added API call to >>>> pci_get_host_bridge_device(). Seems like this has succeeded which >>>> means bridge is non NULL IMO. However in this function it passes >>>> bridge->parent->of_node to of_dma_configure(). So I suspect >>>> bridge->parent is NULL for some reason. Is there a chance for parent >>>> being NULL in this or any other platform? >>>> >>> >>> Can bridge be the root bridge ? >> >> Going by the code below, bridge is assigned the ptr to bridge on the >> root bus. >> >> +struct device *pci_get_host_bridge_device(struct pci_dev *dev) >> +{ >> + struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >> + struct device *bridge = root_bus->bridge; >> + >> + kobject_get(&bridge->kobj); >> + return bridge; >> +} >> + >> >> So to answer your question, yes it is the root bridge. >> > AFAIK the root bridge does not have a parent. Aha! That seems to be the problem. A grep on the tree showed up a bunch of calling pci_create_root_bus() with first arg (parent) being null (x86, ia64). So the code needs to check for the parent node being non null and potentially causing this crash. I will send an incremental patch to add this check. Murali > > Guenter >
On 03/09/2015 02:12 PM, Guenter Roeck wrote: > On 03/09/2015 11:09 AM, Murali Karicheri wrote: >> On 03/09/2015 01:34 PM, Guenter Roeck wrote: >>> On 03/09/2015 10:03 AM, Murali Karicheri wrote: >>>> On 03/09/2015 12:07 PM, Guenter Roeck wrote: >>>>> On 03/09/2015 08:53 AM, Murali Karicheri wrote: >>>>>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote: >>>>>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali >>>>>>> Karicheri<m-karicheri2@ti.com> >>>>>>> wrote: >>>>>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote: >>>>>>>>> >>>>>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote: >>>>>>>>>> >>>>>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote: >>>>>>>>>> [ ... ] >>>>>>>>>> >>>>>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17 >>>>>>>>>>>> 00:00:00 2001 >>>>>>>>>>> From: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500 >>>>>>>>>>> Subject: [PATCH] pci: of : fix kernel crash >>>>>>>>>>> >>>>>>>>>>> This is a debug patch to root cause the kernel crash >>>>>>>>>>> >>>>>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 >>>>>>>>>>> >>>>>>>>>>> PCI: Update DMA configuration from DT >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@ti.com> >>>>>>>>>>> --- >>>>>>>>>>> drivers/of/of_pci.c | 8 ++++++++ >>>>>>>>>>> drivers/pci/host-bridge.c | 5 +++++ >>>>>>>>>>> 2 files changed, 13 insertions(+) >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c >>>>>>>>>>> index 86d3c38..5a59fb8 100644 >>>>>>>>>>> --- a/drivers/of/of_pci.c >>>>>>>>>>> +++ b/drivers/of/of_pci.c >>>>>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev >>>>>>>>>>> *pci_dev) >>>>>>>>>>> struct device *dev =&pci_dev->dev; >>>>>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev); >>>>>>>>>>> >>>>>>>>>>> + if (!bridge || !bridge->parent) { >>>>>>>>>>> + if (!bridge) >>>>>>>>>>> + pr_err("PCI bridge not found\n"); >>>>>>>>>>> + if (!bridge->parent) >>>>>>>>>>> + pr_err("PCI bridge parent not found\n"); >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else >>>>>>>>>> before >>>>>>>>>> the second >>>>>>>>>> if statement ? Also, dev_err might be a bit more useful and >>>>>>>>>> would be >>>>>>>>>> available. >>>>>>>>>> >>>>>>>>> Fixed and attached. >>>>>>>>> >>>>>>>>> Murali >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Guenter >>>>>>>>>> >>>>>>>>>>> + return; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> of_dma_configure(dev, bridge->parent->of_node); >>>>>>>>>>> pci_put_host_bridge_device(bridge); >>>>>>>>>>> } >>>>>>>>>>> diff --git a/drivers/pci/host-bridge.c >>>>>>>>>>> b/drivers/pci/host-bridge.c >>>>>>>>>>> index 3e5bbf9..ef2ab51 100644 >>>>>>>>>>> --- a/drivers/pci/host-bridge.c >>>>>>>>>>> +++ b/drivers/pci/host-bridge.c >>>>>>>>>>> @@ -28,6 +28,11 @@ struct device >>>>>>>>>>> *pci_get_host_bridge_device(struct >>>>>>>>>>> pci_dev *dev) >>>>>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >>>>>>>>>>> struct device *bridge = root_bus->bridge; >>>>>>>>>>> >>>>>>>>>>> + if (!bridge) { >>>>>>>>>>> + pr_err("PCI: bridge not found\n"); >>>>>>>>>>> + return NULL; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> kobject_get(&bridge->kobj); >>>>>>>>>>> return bridge; >>>>>>>>>>> } >>>>>>>>>>> -- >>>>>>>>>>> 1.7.9.5 >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> BJorn, >>>>>>>> >>>>>>>> Any chance of applying the attached debug patch to see if this >>>>>>>> fixes >>>>>>>> and >>>>>>>> provide some additional information on this BUG? Not sure who will >>>>>>>> pick this >>>>>>>> one and apply. >>>>>>> >>>>>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA >>>>>>> configuration from DT")) only exists on my pci/iommu branch, so I'm >>>>>>> the one to apply it. >>>>>>> >>>>>>> It's much easier for me to deal with plain text patches (not >>>>>>> attachments). >>>>>>> >>>>>>> I'm hesitating because I don't want to encourage use of the 0-day >>>>>>> testing robot as a tool at which we can just throw debug patches. >>>>>>> The >>>>>>> robot is a service that costs somebody real money, and I want to >>>>>>> be a >>>>>>> good neighbor when using it. >>>>>> >>>>>> Thanks for the clarification as I don't have much information on the >>>>>> testing robot. At the same time the question is how similar incidence >>>>>> in the past have been handled. I am a newbie w.r.t to this. This is >>>>>> first time I have introduced a patch that impacts multiple >>>>>> arch/machines. >>>>>> >>>>>>> >>>>>>> Was the information in the robot's report enough to reproduce the >>>>>>> oops? If not, is there additional information we could add to the >>>>>>> report that would enable you to reproduce it? Even if we can't >>>>>>> reproduce the oops, the report seems detailed enough that we >>>>>>> should be >>>>>>> able to deduce the problem and produce a fix in which we have high >>>>>>> confidence. >>>>>> >>>>>> The BUG report essentially indicates the crash happened in >>>>>> of_pci_dma_configure(). The machine specific log make sense to a >>>>>> person familiar with this arch and I am not familiar with the >>>>>> same. So >>>>>> anyone can help narrow down the root cause of this? >>>>>> >>>>>> Looking at the code, there are two ptr variables that are accessed >>>>>> without checking for NULL as initial thinking was that these can >>>>>> never >>>>>> be NULL. So the debug patch is just adding addition check before >>>>>> accessing the ptr. I can send this patch without debug prints if that >>>>>> make sense. I was thinking to get confirmation that this is indeed >>>>>> the >>>>>> case before adding the check. What do you think the right approach >>>>>> here? Send a patch for this to the ML for adding the check as a >>>>>> potential fix? Or someone can help me investigate the crash dump and >>>>>> root cause it? or if we can use test robot to confirm this, I can >>>>>> re-send the patch ASIS to the list. Please clarify. >>>>>> >>>>> If the assumption is that the pointers can never be NULL, >>>>> wouldn't it be important to see a call trace and to find out >>>>> if the NULL pointers can actually be seen by design, >>>>> or if there is some other bug ? >>>> >>>> Call trace shows >>>> >>>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>>> [ 0.576666] [<7976c1ac>] pci_device_add+0xbc/0x820 >>>> >>>> >>>> And BUG seems to be in of_pci_dma_configure() as shown in the BUG >>>> report. >>>> >>>> of_pci_dma_configure() calls newly added API call to >>>> pci_get_host_bridge_device(). Seems like this has succeeded which >>>> means bridge is non NULL IMO. However in this function it passes >>>> bridge->parent->of_node to of_dma_configure(). So I suspect >>>> bridge->parent is NULL for some reason. Is there a chance for parent >>>> being NULL in this or any other platform? >>>> >>> >>> Can bridge be the root bridge ? >> >> Going by the code below, bridge is assigned the ptr to bridge on the >> root bus. >> >> +struct device *pci_get_host_bridge_device(struct pci_dev *dev) >> +{ >> + struct pci_bus *root_bus = find_pci_root_bus(dev->bus); >> + struct device *bridge = root_bus->bridge; >> + >> + kobject_get(&bridge->kobj); >> + return bridge; >> +} >> + >> >> So to answer your question, yes it is the root bridge. >> > AFAIK the root bridge does not have a parent. > > Guenter > BJorn, Just posted a patch to the PCI list which I believe is a potential fix for this issue. Please review and apply if looks good. Subject patch : pci: of : fix BUG: unable to handle kernel
From 4965a4580e0ffe13e7e6e9fd58d6e56b08c730c4 Mon Sep 17 00:00:00 2001 From: Murali Karicheri <m-karicheri2@ti.com> Date: Fri, 6 Mar 2015 10:23:08 -0500 Subject: [PATCH] pci: of : fix kernel crash This is a debug patch to root cause the kernel crash commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6 PCI: Update DMA configuration from DT Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> --- drivers/of/of_pci.c | 9 +++++++++ drivers/pci/host-bridge.c | 5 +++++ 2 files changed, 14 insertions(+) diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c index 86d3c38..6e1b111 100644 --- a/drivers/of/of_pci.c +++ b/drivers/of/of_pci.c @@ -129,6 +129,15 @@ void of_pci_dma_configure(struct pci_dev *pci_dev) struct device *dev = &pci_dev->dev; struct device *bridge = pci_get_host_bridge_device(pci_dev); + if (!bridge || !bridge->parent) { + if (!bridge) + dev_err(&pci_dev->dev, "PCI bridge not found\n"); + else if (!bridge->parent) + dev_err(&pci_dev->dev, + "PCI bridge parent not found\n"); + return; + } + of_dma_configure(dev, bridge->parent->of_node); pci_put_host_bridge_device(bridge); } diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c index 3e5bbf9..ba642b8 100644 --- a/drivers/pci/host-bridge.c +++ b/drivers/pci/host-bridge.c @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct pci_dev *dev) struct pci_bus *root_bus = find_pci_root_bus(dev->bus); struct device *bridge = root_bus->bridge; + if (!bridge) { + dev_err(&dev->dev, "PCI: bridge not found\n"); + return NULL; + } + kobject_get(&bridge->kobj); return bridge; } -- 1.7.9.5