Message ID | 1438229360-370-1-git-send-email-wangyijing@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Hi Yijing, On Thu, Jul 30, 2015 at 12:09:20PM +0800, Yijing Wang wrote: > Now we stop the pci_bus->devices in reverse order, but in > pcie_aspm_exit_link_state(), we only would do something when > the device is the last one. > > void pcie_aspm_exit_link_state(struct pci_dev *pdev) > { > ... > if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) Ugh. This was caused by a confusion between two different meanings of "last": 1) the element at the end of the list, and 2) the only remaining element in the list 3419c75e15f8 ("PCI: properly clean up ASPM link state on device remove"), which added this line, clearly intended the second, but list_is_last() implements the first. But that's a trivial problem. I think the real problem is that the way we manage ASPM link_state is a complete disaster. I want to make steps toward cleaning that up rather than apply band-aids to a broken design. I struggled to understand this, so I'm going to ramble a bit to see if I understand the problem correctly. Your hierarchy is this: b7:02.0 bridge to [bus bb-bd] Downstream Port; ASPM on Link to bus bb bb:00.0 bridge to [bus bc-bd] Switch Upstream Port; no ASPM bb:00.1 endpoint bb:00.2 endpoint bb:00.3 endpoint bb:00.4 endpoint bc:01.0 bridge to [bus bd] Switch Downstream Port; ASPM on Link to bus bd bd:00.0 endpoint There are only two Links in this picture: 1) from b7:02.0 to bb:00.0 2) from bc:01.0 to bd:00.0 Those are the two Links where ASPM is important. Bus bc is the switch's internal bus, so the connection from bb:00.0 to bc:01.0 is not a Link and ASPM is not applicable. Both ends of the Link participate in ASPM, but we allocate ASPM link_state only for the component on the *upstream* end of a Link. We do the allocation during enumeration, like this: pcie_aspm_init_link_state(pdev=b7:02.0) alloc_pcie_link_state(pdev=b7:02.0) link = kzalloc(...) link->pdev = pdev # b7:02.0 pdev->link_state = link # alloc link_state for link #1 pcie_aspm_init_link_state(pdev=bc:01.0) alloc_pcie_link_state(pdev=bc:01.0) link = kzalloc(...) link->pdev = pdev # bc:01.0 link->parent = pdev->bus->parent->self->link_state # b7:02.0 link_state pdev->link_state = link # alloc link_state for link #2 The allocation path makes sense, at least in the sense that we allocate link_state for device X when we enumerate device X. Now we remove the tree rooted at b7:02.0: pci_stop_bus_device(pdev=b7:02.0) pci_stop_bus_device(pdev=bb:00.4) # iterate in reverse pci_stop_dev(pdev=bb:00.4) pcie_aspm_exit_link_state(pdev=bb:00.4) parent = pdev->bus->self # parent=b7:02.0 link = parent->link_state free_link_state(link) # b7:02.0 link_state link->pdev->link_state = NULL A kfree(link) # free link_state for #1 pci_stop_bus_device(pdev=bb:00.3) pci_stop_dev(pdev=bb:00.3) pcie_aspm_exit_link_state(pdev=bb:00.3) parent = pdev->bus->self # parent=b7:02.0 return # parent->link_state == NULL ... pci_stop_bus_device(pdev=bb:00.0) pci_stop_bus_device(pdev=bc:01.0) pci_stop_bus_device(pdev=bd:00.0) pci_stop_dev(pdev=bd:00.0) pcie_aspm_exit_link_state(pdev=bd:00.0) parent = pdev->bus->self # parent=bc:01.0 link = parent->link_state # bc:01.0 link_state parent_link = link->parent # b7:02.0 link_state free_link_state(link) # bc:01.0 link_state B kfree(link) # free link_state for #2 C pcie_config_aspm_path(b7:02.0 link_state) # use link_state for #1 At "C", we try to use the b7:02.0 link_state, which we've already deallocated at "A", so this is a "use-after-free" problem. What seems wrong to me is that when we're removing device X, we free the link_state for a *parent* of X. I think the code would be much simpler and easier to get right if we freed the link_state for X when we remove X. Can you look at fixing the problem that way? > goto out; > ... > } > > So if we have the following pcie tree, system may crash. > > [b7-bd]--+-02.0-[bb-bd]--+-00.0-[bc-bd]----01.0-[bd]----00.0 PLX Technology, Inc. Device 0002 > +-00.1 PLX Technology, Inc. Device 0002 > +-00.2 PLX Technology, Inc. Device 0002 > +-00.3 PLX Technology, Inc. Device 0002 > \-00.4 PLX Technology, Inc. Device 0002 > > In this case, we would stop bb:00.4 before bb:00.0, so when we touch bb:00.4, > we would call pcie_aspm_exit_link_state(), and free the pcie_link_state. > So when we want to stop bd:00.0 and free related pcie_link_state, > it would try to access the parent pcie_link_state which has been freed. > > Part crash call trace: > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > CPU 16 Pid: 33262, comm: IVS_PowerOn > RIP: 0010:[<ffffffffa0d7c14f>] [<ffffffffa0d7c14f>] pcie_config_aspm_link+0x3f/0x100 > RSP: 0018:ffff8801bc577790 EFLAGS: 00010282 > RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000e7e6 > RDX: 000000000000e6e6 RSI: 00000000ffffc5ec RDI: 0000000000000246 > RBP: ffff8801bc5777d0 R08: ffff88007b001000 R09: 00000000003fffff > ... > Call Trace: > [<ffffffff8124a542>] pcie_config_aspm_path+0x32/0x60 > [<ffffffffa0d7cc00>] pcie_aspm_exit_link_state+0x160/0x560 > [<ffffffffa0d7c0bc>] pci_stop_bus_device+0x8c/0xe0 > [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 > [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 > [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 > [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 > [<ffffffff8123eca1>] pci_stop_and_remove_bus_device+0x11/0x20 > ... > > Signed-off-by: Yijing Wang <wangyijing@huawei.com> > CC: stable@vger.kernel.org #3.4+ I need a clue about why you picked v3.4 here. Is it because ac205b7bb72f ("PCI: make sriov work with hotplug remove") appeared in v3.4? Bjorn > --- > drivers/pci/pcie/aspm.c | 3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c > index 317e355..c81f549 100644 > --- a/drivers/pci/pcie/aspm.c > +++ b/drivers/pci/pcie/aspm.c > @@ -648,7 +648,8 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev) > * All PCIe functions are in one slot, remove one function will remove > * the whole slot, so just wait until we are the last function left. > */ > - if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) > + if (!(pdev == list_first_entry(&parent->subordinate->devices, > + struct pci_dev, bus_list))) > goto out; > > link = parent->link_state; > -- > 1.7.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
? 2015/8/29 20:20, Bjorn Helgaas ??: > Hi Yijing, > > On Thu, Jul 30, 2015 at 12:09:20PM +0800, Yijing Wang wrote: >> Now we stop the pci_bus->devices in reverse order, but in >> pcie_aspm_exit_link_state(), we only would do something when >> the device is the last one. >> >> void pcie_aspm_exit_link_state(struct pci_dev *pdev) >> { >> ... >> if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) > > Ugh. This was caused by a confusion between two different meanings of > "last": > > 1) the element at the end of the list, and > 2) the only remaining element in the list > > 3419c75e15f8 ("PCI: properly clean up ASPM link state on device remove"), > which added this line, clearly intended the second, but list_is_last() > implements the first. > > But that's a trivial problem. I think the real problem is that the way we > manage ASPM link_state is a complete disaster. I want to make steps toward > cleaning that up rather than apply band-aids to a broken design. > > I struggled to understand this, so I'm going to ramble a bit to see if I > understand the problem correctly. Your hierarchy is this: > > b7:02.0 bridge to [bus bb-bd] Downstream Port; ASPM on Link to bus bb > bb:00.0 bridge to [bus bc-bd] Switch Upstream Port; no ASPM > bb:00.1 endpoint > bb:00.2 endpoint > bb:00.3 endpoint > bb:00.4 endpoint > bc:01.0 bridge to [bus bd] Switch Downstream Port; ASPM on Link to bus bd > bd:00.0 endpoint > > There are only two Links in this picture: > > 1) from b7:02.0 to bb:00.0 > 2) from bc:01.0 to bd:00.0 > > Those are the two Links where ASPM is important. Bus bc is the switch's > internal bus, so the connection from bb:00.0 to bc:01.0 is not a Link and > ASPM is not applicable. > > Both ends of the Link participate in ASPM, but we allocate ASPM link_state > only for the component on the *upstream* end of a Link. We do the > allocation during enumeration, like this: > > pcie_aspm_init_link_state(pdev=b7:02.0) > alloc_pcie_link_state(pdev=b7:02.0) > link = kzalloc(...) > link->pdev = pdev # b7:02.0 > pdev->link_state = link # alloc link_state for link #1 > > pcie_aspm_init_link_state(pdev=bc:01.0) > alloc_pcie_link_state(pdev=bc:01.0) > link = kzalloc(...) > link->pdev = pdev # bc:01.0 > link->parent = pdev->bus->parent->self->link_state # b7:02.0 link_state > pdev->link_state = link # alloc link_state for link #2 > > The allocation path makes sense, at least in the sense that we allocate > link_state for device X when we enumerate device X. Now we remove the tree > rooted at b7:02.0: > > pci_stop_bus_device(pdev=b7:02.0) > pci_stop_bus_device(pdev=bb:00.4) # iterate in reverse > pci_stop_dev(pdev=bb:00.4) > pcie_aspm_exit_link_state(pdev=bb:00.4) > parent = pdev->bus->self # parent=b7:02.0 > link = parent->link_state > free_link_state(link) # b7:02.0 link_state > link->pdev->link_state = NULL > A kfree(link) # free link_state for #1 > pci_stop_bus_device(pdev=bb:00.3) > pci_stop_dev(pdev=bb:00.3) > pcie_aspm_exit_link_state(pdev=bb:00.3) > parent = pdev->bus->self # parent=b7:02.0 > return # parent->link_state == NULL > ... > pci_stop_bus_device(pdev=bb:00.0) > pci_stop_bus_device(pdev=bc:01.0) > pci_stop_bus_device(pdev=bd:00.0) > pci_stop_dev(pdev=bd:00.0) > pcie_aspm_exit_link_state(pdev=bd:00.0) > parent = pdev->bus->self # parent=bc:01.0 > link = parent->link_state # bc:01.0 link_state > parent_link = link->parent # b7:02.0 link_state > free_link_state(link) # bc:01.0 link_state > B kfree(link) # free link_state for #2 > C pcie_config_aspm_path(b7:02.0 link_state) # use link_state for #1 > > At "C", we try to use the b7:02.0 link_state, which we've already > deallocated at "A", so this is a "use-after-free" problem. Yes, I agree with you. > > What seems wrong to me is that when we're removing device X, we free the > link_state for a *parent* of X. I think the code would be much simpler and What I am worried about here is if we hot remove a endpoint device here, and leave the parent device, so we don't call pci_aspm_exit_link_state() for this link anymore ? pcie link downstream port ---------- endpoint device > easier to get right if we freed the link_state for X when we remove X. > > Can you look at fixing the problem that way? I'm sorry, I don't have the platform now, this issue was found in product department, and they moved the platform away. > >> goto out; >> ... >> } >> >> So if we have the following pcie tree, system may crash. >> >> [b7-bd]--+-02.0-[bb-bd]--+-00.0-[bc-bd]----01.0-[bd]----00.0 PLX Technology, Inc. Device 0002 >> +-00.1 PLX Technology, Inc. Device 0002 >> +-00.2 PLX Technology, Inc. Device 0002 >> +-00.3 PLX Technology, Inc. Device 0002 ... >> Signed-off-by: Yijing Wang <wangyijing@huawei.com> >> CC: stable@vger.kernel.org #3.4+ > > I need a clue about why you picked v3.4 here. Is it because ac205b7bb72f > ("PCI: make sriov work with hotplug remove") appeared in v3.4? Actually, this issue was found at v3.4 stable kernel, which was introduced in 3419c75e15f8 ("PCI: properly clean up ASPM link state on device remove") I think. Thanks! Yijing. > > Bjorn > >> --- >> drivers/pci/pcie/aspm.c | 3 ++- >> 1 files changed, 2 insertions(+), 1 deletions(-) >> >> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c >> index 317e355..c81f549 100644 >> --- a/drivers/pci/pcie/aspm.c >> +++ b/drivers/pci/pcie/aspm.c >> @@ -648,7 +648,8 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev) >> * All PCIe functions are in one slot, remove one function will remove >> * the whole slot, so just wait until we are the last function left. >> */ >> - if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) >> + if (!(pdev == list_first_entry(&parent->subordinate->devices, >> + struct pci_dev, bus_list))) >> goto out; >> >> link = parent->link_state; >> -- >> 1.7.1 >> > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 31, 2015 at 09:24:56AM +0800, wangyijing wrote: > ? 2015/8/29 20:20, Bjorn Helgaas ??: > > On Thu, Jul 30, 2015 at 12:09:20PM +0800, Yijing Wang wrote: > >> Now we stop the pci_bus->devices in reverse order, but in > >> pcie_aspm_exit_link_state(), we only would do something when > >> the device is the last one. > >> > >> void pcie_aspm_exit_link_state(struct pci_dev *pdev) > >> { > >> ... > >> if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) > ... > > What seems wrong to me is that when we're removing device X, we free the > > link_state for a *parent* of X. I think the code would be much simpler and > > What I am worried about here is if we hot remove a endpoint device here, and leave > the parent device, so we don't call pci_aspm_exit_link_state() for this link anymore ? > > pcie link > downstream port ---------- endpoint device If we remove the endpoint and leave the parent, we do call pcie_aspm_exit_link_state() for the endpoint. I'm proposing to change that path so that when the endpoint is removed, we do whatever is necessary for changing the ASPM configuration on the link, but leave the ASPM structure allocated, since the parent still exists. Today we allocate link_state only when a port has a downstream link *and* there's a device on the other end of the link. I'm proposing that we allocate link_state even if there's no downstream device. Then the alloc/free path is always tied directly to the parent. We can still change ASPM configuration as needed when downstream devices are added or removed. It's not a trivial change, but it seems possible. > > easier to get right if we freed the link_state for X when we remove X. > > > > Can you look at fixing the problem that way? > > I'm sorry, I don't have the platform now, this issue was found in product department, and they moved > the platform away. OK. I'll drop this patch for now. I definitely agree this is a problem, but the fact that you don't have the platform any more doesn't mean we need to throw in a band-aid instead of designing a better solution. This problem should be pretty easy to reproduce on any platform, possibly with a little bit of scaffolding to tweak the topology. > >> goto out; > >> ... > >> } > >> > >> So if we have the following pcie tree, system may crash. > >> > >> [b7-bd]--+-02.0-[bb-bd]--+-00.0-[bc-bd]----01.0-[bd]----00.0 PLX Technology, Inc. Device 0002 > >> +-00.1 PLX Technology, Inc. Device 0002 > >> +-00.2 PLX Technology, Inc. Device 0002 > >> +-00.3 PLX Technology, Inc. Device 0002 > ... > >> Signed-off-by: Yijing Wang <wangyijing@huawei.com> > >> CC: stable@vger.kernel.org #3.4+ > > > > I need a clue about why you picked v3.4 here. Is it because ac205b7bb72f > > ("PCI: make sriov work with hotplug remove") appeared in v3.4? > > Actually, this issue was found at v3.4 stable kernel, which was introduced in > 3419c75e15f8 ("PCI: properly clean up ASPM link state on device remove") I think. Did you bisect to this or figure it out by analysis? It's good to mention the actual commit that broke it so people backporting can figure out if they need the fix. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
? 2015/8/31 21:56, Bjorn Helgaas ??: > On Mon, Aug 31, 2015 at 09:24:56AM +0800, wangyijing wrote: >> ? 2015/8/29 20:20, Bjorn Helgaas ??: >>> On Thu, Jul 30, 2015 at 12:09:20PM +0800, Yijing Wang wrote: >>>> Now we stop the pci_bus->devices in reverse order, but in >>>> pcie_aspm_exit_link_state(), we only would do something when >>>> the device is the last one. >>>> >>>> void pcie_aspm_exit_link_state(struct pci_dev *pdev) >>>> { >>>> ... >>>> if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) >> ... > >>> What seems wrong to me is that when we're removing device X, we free the >>> link_state for a *parent* of X. I think the code would be much simpler and >> >> What I am worried about here is if we hot remove a endpoint device here, and leave >> the parent device, so we don't call pci_aspm_exit_link_state() for this link anymore ? >> >> pcie link >> downstream port ---------- endpoint device > > If we remove the endpoint and leave the parent, we do call > pcie_aspm_exit_link_state() for the endpoint. I'm proposing to change > that path so that when the endpoint is removed, we do whatever is > necessary for changing the ASPM configuration on the link, but leave > the ASPM structure allocated, since the parent still exists. > > Today we allocate link_state only when a port has a downstream link > *and* there's a device on the other end of the link. I'm proposing > that we allocate link_state even if there's no downstream device. > > Then the alloc/free path is always tied directly to the parent. We > can still change ASPM configuration as needed when downstream devices > are added or removed. It's not a trivial change, but it seems > possible. > Basically, I agree with you, but it's not a trival change, and now we have no platform in hand, so I think it's better to leave it until we have platform to reproduce it. >>> easier to get right if we freed the link_state for X when we remove X. >>> >>> Can you look at fixing the problem that way? >> >> I'm sorry, I don't have the platform now, this issue was found in product department, and they moved >> the platform away. > > OK. I'll drop this patch for now. I definitely agree this is a > problem, but the fact that you don't have the platform any more > doesn't mean we need to throw in a band-aid instead of designing a > better solution. OK. > > This problem should be pretty easy to reproduce on any platform, > possibly with a little bit of scaffolding to tweak the topology. > >>>> goto out; >>>> ... >>>> } >>>> >>>> So if we have the following pcie tree, system may crash. >>>> >>>> [b7-bd]--+-02.0-[bb-bd]--+-00.0-[bc-bd]----01.0-[bd]----00.0 PLX Technology, Inc. Device 0002 >>>> +-00.1 PLX Technology, Inc. Device 0002 >>>> +-00.2 PLX Technology, Inc. Device 0002 >>>> +-00.3 PLX Technology, Inc. Device 0002 >> ... >>>> Signed-off-by: Yijing Wang <wangyijing@huawei.com> >>>> CC: stable@vger.kernel.org #3.4+ >>> >>> I need a clue about why you picked v3.4 here. Is it because ac205b7bb72f >>> ("PCI: make sriov work with hotplug remove") appeared in v3.4? >> >> Actually, this issue was found at v3.4 stable kernel, which was introduced in >> 3419c75e15f8 ("PCI: properly clean up ASPM link state on device remove") I think. > > Did you bisect to this or figure it out by analysis? It's good to > mention the actual commit that broke it so people backporting can > figure out if they need the fix. By analysis, if I have the platform agagin, I would try to bisect it. Thanks! Yijing. > > Bjorn > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c index 317e355..c81f549 100644 --- a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c @@ -648,7 +648,8 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev) * All PCIe functions are in one slot, remove one function will remove * the whole slot, so just wait until we are the last function left. */ - if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) + if (!(pdev == list_first_entry(&parent->subordinate->devices, + struct pci_dev, bus_list))) goto out; link = parent->link_state;
Now we stop the pci_bus->devices in reverse order, but in pcie_aspm_exit_link_state(), we only would do something when the device is the last one. void pcie_aspm_exit_link_state(struct pci_dev *pdev) { ... if (!list_is_last(&pdev->bus_list, &parent->subordinate->devices)) goto out; ... } So if we have the following pcie tree, system may crash. [b7-bd]--+-02.0-[bb-bd]--+-00.0-[bc-bd]----01.0-[bd]----00.0 PLX Technology, Inc. Device 0002 +-00.1 PLX Technology, Inc. Device 0002 +-00.2 PLX Technology, Inc. Device 0002 +-00.3 PLX Technology, Inc. Device 0002 \-00.4 PLX Technology, Inc. Device 0002 In this case, we would stop bb:00.4 before bb:00.0, so when we touch bb:00.4, we would call pcie_aspm_exit_link_state(), and free the pcie_link_state. So when we want to stop bd:00.0 and free related pcie_link_state, it would try to access the parent pcie_link_state which has been freed. Part crash call trace: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 CPU 16 Pid: 33262, comm: IVS_PowerOn RIP: 0010:[<ffffffffa0d7c14f>] [<ffffffffa0d7c14f>] pcie_config_aspm_link+0x3f/0x100 RSP: 0018:ffff8801bc577790 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000e7e6 RDX: 000000000000e6e6 RSI: 00000000ffffc5ec RDI: 0000000000000246 RBP: ffff8801bc5777d0 R08: ffff88007b001000 R09: 00000000003fffff ... Call Trace: [<ffffffff8124a542>] pcie_config_aspm_path+0x32/0x60 [<ffffffffa0d7cc00>] pcie_aspm_exit_link_state+0x160/0x560 [<ffffffffa0d7c0bc>] pci_stop_bus_device+0x8c/0xe0 [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 [<ffffffffa0d7c068>] pci_stop_bus_device+0x38/0xe0 [<ffffffff8123eca1>] pci_stop_and_remove_bus_device+0x11/0x20 ... Signed-off-by: Yijing Wang <wangyijing@huawei.com> CC: stable@vger.kernel.org #3.4+ --- drivers/pci/pcie/aspm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)