Message ID | 12097002.O9o76ZdvQC@kreacher (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | ACPI: PCI: Fix device reference counting in acpi_get_pci_dev() | expand |
On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > to reference count the device returned by acpi_get_pci_dev() as > expected by its callers which in some cases may cause device objects > to be dropped prematurely. > > Add the missing get_device() to acpi_get_pci_dev(). > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") FYI this (and the rtc-cmos regression discussed in https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) took down the entire Intel gfx CI. I've applied both fixes into our fixup branch and things are looking much healthier now. This one caused i915 selftests to eat a lot of POISON_FREE in the CI. While bisecting it locally I didn't have poisoning enabled so I got refcount_t undeflows instead. https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot of colorful boxes to click if you're interested in any of the logs. The fixes are included in the CI_DRM_12259 build. Earlier builds were broken. > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/acpi/pci_root.c | 1 + > 1 file changed, 1 insertion(+) > > Index: linux-pm/drivers/acpi/pci_root.c > =================================================================== > --- linux-pm.orig/drivers/acpi/pci_root.c > +++ linux-pm/drivers/acpi/pci_root.c > @@ -323,6 +323,7 @@ struct pci_dev *acpi_get_pci_dev(acpi_ha > > list_for_each_entry(pn, &adev->physical_node_list, node) { > if (dev_is_pci(pn->dev)) { > + get_device(pn->dev); > pci_dev = to_pci_dev(pn->dev); > break; > } > > >
On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > > to reference count the device returned by acpi_get_pci_dev() as > > expected by its callers which in some cases may cause device objects > > to be dropped prematurely. > > > > Add the missing get_device() to acpi_get_pci_dev(). > > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") > > FYI this (and the rtc-cmos regression discussed in > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) > took down the entire Intel gfx CI. Sorry for the disturbance. > I've applied both fixes into our fixup branch and things are looking much > healthier now. Thanks for letting me know. I've just added the $subject patch to my linux-next branch as an urgent fix and the other one has been applied to the RTC tree. > This one caused i915 selftests to eat a lot of POISON_FREE > in the CI. While bisecting it locally I didn't have > poisoning enabled so I got refcount_t undeflows instead. Unfortunately, making no mistakes is generally hard to offer. If catching things like this early is better, what about pulling my bleeding-edge branch, where all of my changes are staged before going into linux-next, into the CI? > https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot > of colorful boxes to click if you're interested in any of the > logs. The fixes are included in the CI_DRM_12259 build. Earlier > builds were broken. Thanks!
On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote: > On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä > <ville.syrjala@linux.intel.com> wrote: > > > > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > > > to reference count the device returned by acpi_get_pci_dev() as > > > expected by its callers which in some cases may cause device objects > > > to be dropped prematurely. > > > > > > Add the missing get_device() to acpi_get_pci_dev(). > > > > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") > > > > FYI this (and the rtc-cmos regression discussed in > > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) > > took down the entire Intel gfx CI. > > Sorry for the disturbance. > > > I've applied both fixes into our fixup branch and things are looking much > > healthier now. > > Thanks for letting me know. > > I've just added the $subject patch to my linux-next branch as an > urgent fix and the other one has been applied to the RTC tree. > > > This one caused i915 selftests to eat a lot of POISON_FREE > > in the CI. While bisecting it locally I didn't have > > poisoning enabled so I got refcount_t undeflows instead. > > Unfortunately, making no mistakes is generally hard to offer. > > If catching things like this early is better, what about pulling my > bleeding-edge branch, where all of my changes are staged before going > into linux-next, into the CI? Pretty sure we don't have the resources to become the CI for everyone. So testing random trees is not really possible. And the alternative of pulling random trees into drm-tip is probably a not a popular idea either. We used to pull in the sound tree since it's pretty closely tied to graphics, but I think we stopped even that because it eneded up pulling the whole of -rc1 in at random points in time when we were't expecting it. Ideally each subsystem would have its own CI, or there should be some kernel wide thing. But I suppose the progress towards something like that is glacial. That said, we do test linux-next to some degree. And looks like at least one of these could have been caught a bit earlier through that. Unfortunately no one is really keeping an eye on that so things tend to slip through. Probably need to figure out something to make better use of that. > > > https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot > > of colorful boxes to click if you're interested in any of the > > logs. The fixes are included in the CI_DRM_12259 build. Earlier > > builds were broken. > > Thanks!
On Wed, 19 Oct 2022, Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote: >> If catching things like this early is better, what about pulling my >> bleeding-edge branch, where all of my changes are staged before going >> into linux-next, into the CI? > > Pretty sure we don't have the resources to become the CI for > everyone. So testing random trees is not really possible. And > the alternative of pulling random trees into drm-tip is probably > a not a popular idea either. We used to pull in the sound tree > since it's pretty closely tied to graphics, but I think we > stopped even that because it eneded up pulling the whole of > -rc1 in at random points in time when we were't expecting it. Basically, we only pull branches to drm-tip that are managed using our tools and our development model and under our control. It was too much trouble dealing with conflicts, Linus' master being pulled in at random points (like in the middle of the merge window), and stuff like that, with the external trees. > Ideally each subsystem would have its own CI, or there should > be some kernel wide thing. But I suppose the progress towards > something like that is glacial. > > That said, we do test linux-next to some degree. And looks like > at least one of these could have been caught a bit earlier through > that. Unfortunately no one is really keeping an eye on that so > things tend to slip through. Probably need to figure out something > to make better use of that. Yeah, we need to pay more attention to linux-next test results, as well as Linus' master during the merge window. It's not necessarily easy with the volatility of linux-next, you could easily have very broken runs followed by good ones, but the low hanging fruit is raising more flags and being louder about it earlier when everything's busted for several days in linux-next or Linus' master. BR, Jani.
On Wed, Oct 19, 2022 at 2:22 PM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > > On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote: > > On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä > > <ville.syrjala@linux.intel.com> wrote: > > > > > > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > > > > to reference count the device returned by acpi_get_pci_dev() as > > > > expected by its callers which in some cases may cause device objects > > > > to be dropped prematurely. > > > > > > > > Add the missing get_device() to acpi_get_pci_dev(). > > > > > > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") > > > > > > FYI this (and the rtc-cmos regression discussed in > > > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) > > > took down the entire Intel gfx CI. > > > > Sorry for the disturbance. > > > > > I've applied both fixes into our fixup branch and things are looking much > > > healthier now. > > > > Thanks for letting me know. > > > > I've just added the $subject patch to my linux-next branch as an > > urgent fix and the other one has been applied to the RTC tree. > > > > > This one caused i915 selftests to eat a lot of POISON_FREE > > > in the CI. While bisecting it locally I didn't have > > > poisoning enabled so I got refcount_t undeflows instead. > > > > Unfortunately, making no mistakes is generally hard to offer. > > > > If catching things like this early is better, what about pulling my > > bleeding-edge branch, where all of my changes are staged before going > > into linux-next, into the CI? > > Pretty sure we don't have the resources to become the CI for > everyone. So testing random trees is not really possible. And > the alternative of pulling random trees into drm-tip is probably > a not a popular idea either. We used to pull in the sound tree > since it's pretty closely tied to graphics, but I think we > stopped even that because it eneded up pulling the whole of > -rc1 in at random points in time when we were't expecting it. I see. > Ideally each subsystem would have its own CI, or there should > be some kernel wide thing. But I suppose the progress towards > something like that is glacial. Well, I definitely cannot afford a dedicated CI just for my tree and I haven't got any useful information from KernlCI yet (even though hey pull and test my linux-next branch on a regular basis). KernelCI seems to be focusing on different set of hardware, so to speak. > That said, we do test linux-next to some degree. And looks like > at least one of these could have been caught a bit earlier through > that. Unfortunately no one is really keeping an eye on that so > things tend to slip through. Probably need to figure out something > to make better use of that. I think it could also be possible to contribute to KernelCI to get more useful x86 coverage from it.
On Wed, Oct 19, 2022 at 11:54:42AM +0300, Ville Syrjälä wrote: > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > > to reference count the device returned by acpi_get_pci_dev() as > > expected by its callers which in some cases may cause device objects > > to be dropped prematurely. > > > > Add the missing get_device() to acpi_get_pci_dev(). > > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") > > FYI this (and the rtc-cmos regression discussed in > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) > took down the entire Intel gfx CI. From 1000 miles away and zero background with the gfx CI, this sounds like "our CI system, whose purpose is to find bugs, found one", which is a good thing. Bjorn
On Wed, Oct 19, 2022 at 11:53:26AM -0500, Bjorn Helgaas wrote: > On Wed, Oct 19, 2022 at 11:54:42AM +0300, Ville Syrjälä wrote: > > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed > > > to reference count the device returned by acpi_get_pci_dev() as > > > expected by its callers which in some cases may cause device objects > > > to be dropped prematurely. > > > > > > Add the missing get_device() to acpi_get_pci_dev(). > > > > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") > > > > FYI this (and the rtc-cmos regression discussed in > > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/) > > took down the entire Intel gfx CI. > > >From 1000 miles away and zero background with the gfx CI, this sounds > like "our CI system, whose purpose is to find bugs, found one", which > is a good thing. Mostly. It's certainly better than it going entirely undetected. Sadly we found it after rc1 because no one was really looking at linux-next results. Something we need to improve. But ideally it would have been found by some other CI system whose primary job is to prevent bugs in those subsystems, rather than the one whose primary job is to prevent bugs in gfx drivers. Also ideally it wouldn't have been me bisecting this :P The biggest downside of bugs reaching our CI via rc1/etc. is that it pretty much stops everyone from getting premerge results for their graphics driver patches since the CI keeps tripping over the already existing bugs. But I guess you can call this one a somewhat self inflicted wound and we should just try harder to keep new code out of our tree until it's known to be healthy.
Index: linux-pm/drivers/acpi/pci_root.c =================================================================== --- linux-pm.orig/drivers/acpi/pci_root.c +++ linux-pm/drivers/acpi/pci_root.c @@ -323,6 +323,7 @@ struct pci_dev *acpi_get_pci_dev(acpi_ha list_for_each_entry(pn, &adev->physical_node_list, node) { if (dev_is_pci(pn->dev)) { + get_device(pn->dev); pci_dev = to_pci_dev(pn->dev); break; }