
[3/3] pciehp: Fix race condition handling surprise link-down

Message ID 1479544367-7195-4-git-send-email-ashok.raj@intel.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Ashok Raj Nov. 19, 2016, 8:32 a.m. UTC
A surprise link down may retrain very quickly, causing the same slot to
generate a link up event before handling the link down completes.

Since the link is active, the power off work queued from the first link
down will cause a second down event when the power is disabled. The second
down event should be ignored because the slot is already powering off;
however, the "link up" event sets the slot state to POWERON before the
event to handle this is enqueued, making the second down event believe
it needs to do something. This creates a constant link up and down
event cycle.

This patch fixes that by setting the slot state only when the work to
handle the power event is executing, protected by the hot plug mutex.

Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp_ctrl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
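The core of the fix can be modeled in a few lines: the slot state transition happens only when the queued request actually executes, with the mutex held, rather than at enqueue time. The following is a simplified userspace sketch using pthreads; the names loosely mirror the patch but this is a model of the idea, not the pciehp code itself.

```c
#include <assert.h>
#include <pthread.h>

/* Simplified model of the fix: the state is written only inside the
 * worker, under the hotplug mutex, so a racing link-up event cannot
 * flip the state while a link-down work item is still pending.
 * Userspace sketch, not kernel code. */

enum slot_state { STATIC_STATE, POWERON_STATE, POWEROFF_STATE };
enum req_type { ENABLE_REQ, DISABLE_REQ };

struct slot {
	pthread_mutex_t hotplug_lock;
	enum slot_state state;
};

static void power_worker(struct slot *s, enum req_type req)
{
	pthread_mutex_lock(&s->hotplug_lock);
	/* Before the fix, this assignment lived in the enqueue path. */
	s->state = (req == ENABLE_REQ) ? POWERON_STATE : POWEROFF_STATE;
	/* ... pciehp_enable_slot()/pciehp_disable_slot() would run here ... */
	pthread_mutex_unlock(&s->hotplug_lock);
}
```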

Comments

Bjorn Helgaas Dec. 7, 2016, 11:40 p.m. UTC | #1
On Sat, Nov 19, 2016 at 12:32:47AM -0800, Ashok Raj wrote:
> A surprise link down may retrain very quickly, causing the same slot to
> generate a link up event before handling the link down completes.
> 
> Since the link is active, the power off work queued from the first link
> down will cause a second down event when the power is disabled. The second
> down event should be ignored because the slot is already powering off;
> however, the "link up" event sets the slot state to POWERON before the
> event to handle this is enqueued, making the second down event believe
> it needs to do something. This creates a constant link up and down
> event cycle.
> 
> This patch fixes that by setting the slot state only when the work to
> handle the power event is executing, protected by the hot plug mutex.

Please mention the mutex specifically by name, e.g.,
"p_slot->hotplug_lock".

> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> 
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/hotplug/pciehp_ctrl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
> index ec0b4c1..7ae068c 100644
> --- a/drivers/pci/hotplug/pciehp_ctrl.c
> +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> @@ -182,6 +182,7 @@ static void pciehp_power_thread(struct work_struct *work)
>  	switch (info->req) {
>  	case DISABLE_REQ:
>  		mutex_lock(&p_slot->hotplug_lock);
> +		p_slot->state = POWEROFF_STATE;

It sounds right that p_slot->state should be protected.

It looks like handle_button_press_event() and
pciehp_sysfs_enable_slot() hold p_slot->lock while updating
p_slot->state.

You're setting "state = POWEROFF_STATE" while holding
p_slot->hotplug_lock (not p_slot->lock).  Four lines down, we set
"state = STATIC_STATE", but this time we're holding p_slot->lock.

What is the difference between the p_slot->lock and
p_slot->hotplug_lock?  Do we need both?  How do we know which one to
use?

I'm very confused.

>  		pciehp_disable_slot(p_slot);
>  		mutex_unlock(&p_slot->hotplug_lock);
>  		mutex_lock(&p_slot->lock);
> @@ -190,6 +191,7 @@ static void pciehp_power_thread(struct work_struct *work)
>  		break;
>  	case ENABLE_REQ:
>  		mutex_lock(&p_slot->hotplug_lock);
> +		p_slot->state = POWERON_STATE;
>  		ret = pciehp_enable_slot(p_slot);
>  		mutex_unlock(&p_slot->hotplug_lock);
>  		if (ret)
> @@ -209,8 +211,6 @@ static void pciehp_queue_power_work(struct slot *p_slot, int req)
>  {
>  	struct power_work_info *info;
>  
> -	p_slot->state = (req == ENABLE_REQ) ? POWERON_STATE : POWEROFF_STATE;
> -
>  	info = kmalloc(sizeof(*info), GFP_KERNEL);
>  	if (!info) {
>  		ctrl_err(p_slot->ctrl, "no memory to queue %s request\n",
> -- 
> 2.7.4
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Keith Busch Dec. 8, 2016, 12:04 a.m. UTC | #2
On Wed, Dec 07, 2016 at 05:40:54PM -0600, Bjorn Helgaas wrote:
> On Sat, Nov 19, 2016 at 12:32:47AM -0800, Ashok Raj wrote:
> > --- a/drivers/pci/hotplug/pciehp_ctrl.c
> > +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> > @@ -182,6 +182,7 @@ static void pciehp_power_thread(struct work_struct *work)
> >  	switch (info->req) {
> >  	case DISABLE_REQ:
> >  		mutex_lock(&p_slot->hotplug_lock);
> > +		p_slot->state = POWEROFF_STATE;
> 
> It sounds right that p_slot->state should be protected.
> 
> It looks like handle_button_press_event() and
> pciehp_sysfs_enable_slot() hold p_slot->lock while updating
> p_slot->state.
> 
> You're setting "state = POWEROFF_STATE" while holding
> p_slot->hotplug_lock (not p_slot->lock).  Four lines down, we set
> "state = STATIC_STATE", but this time we're holding p_slot->lock.
> 
> What is the difference between the p_slot->lock and
> p_slot->hotplug_lock?  Do we need both?  How do we know which one to
> use?
> 
> I'm very confused.

This is _very_ confusing. :)

The p_slot->hotplug_lock serializes a power off and a power on event. We
only want to set the state when one of those events gets to execute
rather than when the event is queued. Changing the state when the event
is queued can trigger a never ending up-down sequence.

It currently looks safe to nest the p_slot->lock under the
p_slot->hotplug_lock if that is your recommendation.

Alternatively we could fix this if we used an ordered work queue for
the slot's work, but that is a significantly more complex change.
Bjorn Helgaas Dec. 8, 2016, 3:11 p.m. UTC | #3
On Wed, Dec 07, 2016 at 07:04:33PM -0500, Keith Busch wrote:
> On Wed, Dec 07, 2016 at 05:40:54PM -0600, Bjorn Helgaas wrote:
> > On Sat, Nov 19, 2016 at 12:32:47AM -0800, Ashok Raj wrote:
> > > --- a/drivers/pci/hotplug/pciehp_ctrl.c
> > > +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> > > @@ -182,6 +182,7 @@ static void pciehp_power_thread(struct work_struct *work)
> > >  	switch (info->req) {
> > >  	case DISABLE_REQ:
> > >  		mutex_lock(&p_slot->hotplug_lock);
> > > +		p_slot->state = POWEROFF_STATE;
> > 
> > It sounds right that p_slot->state should be protected.
> > 
> > It looks like handle_button_press_event() and
> > pciehp_sysfs_enable_slot() hold p_slot->lock while updating
> > p_slot->state.
> > 
> > You're setting "state = POWEROFF_STATE" while holding
> > p_slot->hotplug_lock (not p_slot->lock).  Four lines down, we set
> > "state = STATIC_STATE", but this time we're holding p_slot->lock.
> > 
> > What is the difference between the p_slot->lock and
> > p_slot->hotplug_lock?  Do we need both?  How do we know which one to
> > use?
> > 
> > I'm very confused.
> 
> This is _very_ confusing. :)
> 
> The p_slot->hotplug_lock serializes a power off and a power on event. We
> only want to set the state when one of those events gets to execute
> rather than when the event is queued. Changing the state when the event
> is queued can trigger a never ending up-down sequence.
> 
> It currently looks safe to nest the p_slot->lock under the
> p_slot->hotplug_lock if that is your recommendation.

I'm not making a recommendation; that would take a lot more thought
than I've given this.

There are at least three locks in pciehp:

  struct controller.ctrl_lock
  struct slot.lock
  struct slot.hotplug_lock

There shouldn't really be any performance paths in pciehp, so I'm
pretty doubtful that we need such complicated locking.

> Alternatively we could fix this if we used an ordered work queue for
> the slot's work, but that is a significantly more complex change.

You mean we can currently execute things from the work queue in a
different order than they were enqueued?  That sounds ... difficult to
analyze, to say the least.

I don't know much about work queues, and Documentation/workqueue.txt
doesn't mention alloc_ordered_workqueue().  Is that what you are
referring to?

Bjorn
Keith Busch Dec. 8, 2016, 5:20 p.m. UTC | #4
On Thu, Dec 08, 2016 at 09:11:58AM -0600, Bjorn Helgaas wrote:
> On Wed, Dec 07, 2016 at 07:04:33PM -0500, Keith Busch wrote:
> > 
> > It currently looks safe to nest the p_slot->lock under the
> > p_slot->hotplug_lock if that is your recommendation.
> 
> I'm not making a recommendation; that would take a lot more thought
> than I've given this.
> 
> There are at least three locks in pciehp:
> 
>   struct controller.ctrl_lock
>   struct slot.lock
>   struct slot.hotplug_lock
> 
> There shouldn't really be any performance paths in pciehp, so I'm
> pretty doubtful that we need such complicated locking.

They are protecting different things, but I agree it looks like room
for simplification exists.
 
> > Alternatively we could fix this if we used an ordered work queue for
> > the slot's work, but that is a significantly more complex change.
> 
> You mean we can currently execute things from the work queue in a
> different order than they were enqueued?  That sounds ... difficult to
> analyze, to say the least.

The events are dequeued in order, but they don't wait for the previous
to complete, so pciehp's current work queue can have multiple events
executing in parallel. That's part of why rapid pciehp slot events are
a little more difficult to follow, and I think we may even be unsafely
relying on the order in which the mutexes are acquired by these work events.
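The ordering property being discussed can be demonstrated in a small model: a single worker thread draining a FIFO guarantees that each handler fully completes before the next one is dequeued, which is what an ordered work queue provides and a concurrent queue does not. This is a hedged userspace sketch with pthreads, not the kernel workqueue API.

```c
#include <assert.h>
#include <pthread.h>

#define NITEMS 5

/* A FIFO of pending request ids, guarded by a mutex + condvar.
 * One worker thread == ordered execution. */
static int queue[NITEMS];
static int q_head, q_tail;		/* pop from head, push at tail */
static int executed[NITEMS];
static int n_executed;
static int done;			/* producer finished enqueuing */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

static void enqueue(int id)
{
	pthread_mutex_lock(&qlock);
	queue[q_tail++] = id;
	pthread_cond_signal(&qcond);
	pthread_mutex_unlock(&qlock);
}

static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&qlock);
	while (!(done && q_head == q_tail)) {
		if (q_head == q_tail) {
			pthread_cond_wait(&qcond, &qlock);
			continue;
		}
		int id = queue[q_head++];
		pthread_mutex_unlock(&qlock);
		/* "Handle" the event; nothing else runs until this returns. */
		executed[n_executed++] = id;
		pthread_mutex_lock(&qlock);
	}
	pthread_mutex_unlock(&qlock);
	return NULL;
}

static void run_ordered_demo(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	for (int i = 0; i < NITEMS; i++)
		enqueue(i);
	pthread_mutex_lock(&qlock);
	done = 1;
	pthread_cond_broadcast(&qcond);
	pthread_mutex_unlock(&qlock);
	pthread_join(t, NULL);
}
```

With the pciehp queue as described, two handlers (a link-down and a link-up) may run in parallel; in this single-worker model that interleaving is impossible by construction.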

Partly unrelated: we could process surprise removals significantly
faster (microseconds vs. seconds) with the limited pci access series
here, leaving fewer simultaneously executing events to consider:

 https://www.spinics.net/lists/linux-pci/msg55585.html

Do you have any concerns with that series?

> I don't know much about work queues, and Documentation/workqueue.txt
> doesn't mention alloc_ordered_workqueue().  Is that what you are
> referring to?

Yes, the alloc_ordered_workqueue is what I had in mind, though switching
to that is not as simple as calling the different API. I am looking into
that for longer term, but for the incremental fix, do you think we can
go forward with Raj's proposal?
Bjorn Helgaas Dec. 8, 2016, 5:50 p.m. UTC | #5
On Thu, Dec 08, 2016 at 12:20:58PM -0500, Keith Busch wrote:
> On Thu, Dec 08, 2016 at 09:11:58AM -0600, Bjorn Helgaas wrote:
> > On Wed, Dec 07, 2016 at 07:04:33PM -0500, Keith Busch wrote:
> > > 
> > > It currently looks safe to nest the p_slot->lock under the
> > > p_slot->hotplug_lock if that is your recommendation.
> > 
> > I'm not making a recommendation; that would take a lot more thought
> > than I've given this.
> > 
> > There are at least three locks in pciehp:
> > 
> >   struct controller.ctrl_lock
> >   struct slot.lock
> >   struct slot.hotplug_lock
> > 
> > There shouldn't really be any performance paths in pciehp, so I'm
> > pretty doubtful that we need such complicated locking.
> 
> They are protecting different things, but I agree it looks like room
> for simplification exists.

If we can't simplify this immediately, can we add a comment about what
the different locks protect so people have a hint about which one to
use?  For example, it looks like this patch might benefit from that
knowledge.

> > > Alternatively we could fix this if we used an ordered work queue for
> > > the slot's work, but that is a significantly more complex change.
> > 
> > You mean we can currently execute things from the work queue in a
> > different order than they were enqueued?  That sounds ... difficult to
> > analyze, to say the least.
> 
> The events are dequeued in order, but they don't wait for the previous
> to complete, so pciehp's current work queue can have multiple events
> executing in parallel. That's part of why rapid pciehp slot events are
> a little more difficult to follow, and I think we may even be unsafely
> relying on the order the mutexes are obtained from these work events.

Hmm.  I certainly did not expect multiple events executing in
parallel.  That sounds like a pretty serious issue to me.

> Partly unrelated, we could process surprise removals significantly
> faster (microseconds vs. seconds), with the limited pci access series
> here, giving fewer simultaneously executing events to consider:
> 
>  https://www.spinics.net/lists/linux-pci/msg55585.html
> 
> Do you have any concerns with that series?

I'm dragging my feet because I want the removal process to become
simpler to understand, not more complicated, and we're exposing more
issues that I didn't even know about.

> > I don't know much about work queues, and Documentation/workqueue.txt
> > doesn't mention alloc_ordered_workqueue().  Is that what you are
> > referring to?
> 
> Yes, the alloc_ordered_workqueue is what I had in mind, though switching
> to that is not as simple as calling the different API. I am looking into
> that for longer term, but for the incremental fix, do you think we can
> go forward with Raj's proposal?

I'd like to at least see a consistent locking strategy for protecting
p_slot->state.  All the existing updates are protected by
p_slot->lock, but the one Raj is adding is protected by
p_slot->hotplug_lock.

Bjorn
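One low-churn way to get the consistency being asked for here is to funnel every write of the slot state through a single helper that always takes the same lock, so no caller can accidentally protect the field with a different mutex. The helper name and pthread modeling below are hypothetical, a userspace sketch rather than pciehp's actual interface.

```c
#include <assert.h>
#include <pthread.h>

/* Sketch: all writers of the slot state go through one helper that
 * always takes the same lock (modeled on p_slot->lock). */

enum slot_state { STATIC_STATE, POWERON_STATE, POWEROFF_STATE };

struct slot {
	pthread_mutex_t lock;		/* guards state, like p_slot->lock */
	enum slot_state state;
};

static void slot_set_state(struct slot *s, enum slot_state new_state)
{
	pthread_mutex_lock(&s->lock);
	s->state = new_state;
	pthread_mutex_unlock(&s->lock);
}
```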
Ashok Raj Dec. 9, 2016, 9:11 p.m. UTC | #6
Hi Bjorn

Thanks. I have resent the last patch with consistent lock usage, as you
requested.

I did attempt to make things a bit easier to understand in one of my
earlier experiments, but that resulted in very substantial changes.
It is certainly something we should look into in the future to make
the state machine more robust.

Cheers,
Ashok

On Thu, Dec 08, 2016 at 11:50:09AM -0600, Bjorn Helgaas wrote:
> > Yes, the alloc_ordered_workqueue is what I had in mind, though switching
> > to that is not as simple as calling the different API. I am looking into
> > that for longer term, but for the incremental fix, do you think we can
> > go forward with Raj's proposal?
> 
> I'd like to at least see a consistent locking strategy for protecting
> p_slot->state.  All the existing updates are protected by
> p_slot->lock, but the one Raj is adding is protected by
> p_slot->hotplug_lock.
> 
> Bjorn

Patch

diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index ec0b4c1..7ae068c 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -182,6 +182,7 @@  static void pciehp_power_thread(struct work_struct *work)
 	switch (info->req) {
 	case DISABLE_REQ:
 		mutex_lock(&p_slot->hotplug_lock);
+		p_slot->state = POWEROFF_STATE;
 		pciehp_disable_slot(p_slot);
 		mutex_unlock(&p_slot->hotplug_lock);
 		mutex_lock(&p_slot->lock);
@@ -190,6 +191,7 @@  static void pciehp_power_thread(struct work_struct *work)
 		break;
 	case ENABLE_REQ:
 		mutex_lock(&p_slot->hotplug_lock);
+		p_slot->state = POWERON_STATE;
 		ret = pciehp_enable_slot(p_slot);
 		mutex_unlock(&p_slot->hotplug_lock);
 		if (ret)
@@ -209,8 +211,6 @@  static void pciehp_queue_power_work(struct slot *p_slot, int req)
 {
 	struct power_work_info *info;
 
-	p_slot->state = (req == ENABLE_REQ) ? POWERON_STATE : POWEROFF_STATE;
-
 	info = kmalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
 		ctrl_err(p_slot->ctrl, "no memory to queue %s request\n",