diff mbox

[for-4.14,RFC,1/2] dm rq: avoid deadlock if dm-mq is stacked on old .request_fn device(s)

Message ID 20170713211217.52361-2-snitzer@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mike Snitzer July 13, 2017, 9:12 p.m. UTC
Conditionally use __blk_put_request() or blk_put_request() instead of
just blk_put_request() in multipath_release_clone().

Otherwise a deadlock will occur because scsi_end_request() will take the
clone request's queue_lock, around its call to blk_finish_request(), and
then the later blk_put_request() also tries to take it:

[12749.916332]  queued_spin_lock_slowpath+0xb/0xf
[12749.916335]  _raw_spin_lock_irqsave+0x37/0x40
[12749.916339]  blk_put_request+0x39/0x60
[12749.916342]  multipath_release_clone+0xe/0x10 [dm_multipath]
[12749.916350]  dm_softirq_done+0x156/0x240 [dm_mod]
[12749.916353]  __blk_mq_complete_request+0x90/0x140
[12749.916355]  blk_mq_complete_request+0x16/0x20
[12749.916360]  dm_complete_request+0x23/0x40 [dm_mod]
[12749.916365]  end_clone_request+0x1d/0x20 [dm_mod]
[12749.916367]  blk_finish_request+0x83/0x120
[12749.916370]  scsi_end_request+0x12d/0x1d0
[12749.916371]  scsi_io_completion+0x13c/0x630
[12749.916374]  ? set_next_entity+0x7c/0x780
[12749.916376]  scsi_finish_command+0xd9/0x120
[12749.916378]  scsi_softirq_done+0x12a/0x150
[12749.916380]  blk_done_softirq+0x9e/0xd0
[12749.916382]  __do_softirq+0xc9/0x269
[12749.916384]  run_ksoftirqd+0x29/0x50
[12749.916385]  smpboot_thread_fn+0x110/0x160
[12749.916387]  kthread+0x109/0x140
[12749.916389]  ? sort_range+0x30/0x30
[12749.916390]  ? kthread_park+0x60/0x60
[12749.916391]  ret_from_fork+0x25/0x30

This "fix" is gross in that the long-term fitness of stacking blk-mq DM
multipath (dm-mq) ontop of old .request_fn devices is questionable.  The
above stack trace shows just how ugly it is to have old .request_fn SCSI
code cascade into blk-mq code during DM multipath request completion.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c         | 16 ++++++++++++++--
 drivers/md/dm-rq.c            |  8 +++++---
 drivers/md/dm-target.c        |  4 ++--
 include/linux/device-mapper.h |  3 ++-
 4 files changed, 23 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig July 14, 2017, 7:22 a.m. UTC | #1
The problem here is the following:

blk_finish_request must always be called with the queue lock held,
it even has an assert.

Without blk-mq used by dm-rq, dm uses the block softirq to execute the
completion, which means we always have a different execution context and
can take the queue lock again without issuesi.

With blk-mq used by dm-rq, the the dm .complete handler that is the rough
equivalent of the softirq handler is called either directly if were are
on the same CPU, or using a IPI (hardirq) if not.  If this handler gets
called from a legacy request function it will be called with the
queue_lock held, but if it's called from a blk-mq driver or actually
uses the IPI no lock will be held.

When I did my blk-mq only for dm-mpath WIP patch my solution to that
was that I removed the ->complete handler entirely and just ran the
whole dm completion from the original hardirq context.  With that change
I know that for blk-mq we'll never hold the queue_lock (and the blk-mq
request free path doesn't care), and for legacy we always hold it,
so __blk_put_request can always be used.
Mike Snitzer July 14, 2017, 2:19 p.m. UTC | #2
On Fri, Jul 14 2017 at  3:22am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> The problem here is the following:
> 
> blk_finish_request must always be called with the queue lock held,
> it even has an assert.
> 
> Without blk-mq used by dm-rq, dm uses the block softirq to execute the
> completion, which means we always have a different execution context and
> can take the queue lock again without issuesi.
> 
> With blk-mq used by dm-rq, the the dm .complete handler that is the rough
> equivalent of the softirq handler is called either directly if were are
> on the same CPU, or using a IPI (hardirq) if not.  If this handler gets
> called from a legacy request function it will be called with the
> queue_lock held, but if it's called from a blk-mq driver or actually
> uses the IPI no lock will be held.

Yeap, very well explained!  I found exactly that yesterday when I
developed this patch.  I stopped short of getting into those details in
my header though, but as you know it comes down to dm_complete_request's
blk-mq-vs-not branching (blk_mq_complete_request vs blk_complete_request).

> When I did my blk-mq only for dm-mpath WIP patch my solution to that
> was that I removed the ->complete handler entirely and just ran the
> whole dm completion from the original hardirq context.  With that change
> I know that for blk-mq we'll never hold the queue_lock (and the blk-mq
> request free path doesn't care), and for legacy we always hold it,
> so __blk_put_request can always be used.

Do you see a benefit to extracting that portion of your WIP patch
(removing the ->complete handler entirely)?

Or leave well enough alone and just continue to disable dm-mq's ability
to stack on .request_fn paths?

Given SCSI's switch to scsi-mq by default I cannot see value in propping
up stacking on the old .request_fn devices.

But interested to get your thoughts, thanks.
Ewan Milne July 14, 2017, 5:17 p.m. UTC | #3
On Fri, 2017-07-14 at 10:19 -0400, Mike Snitzer wrote:
> 
> Do you see a benefit to extracting that portion of your WIP patch
> (removing the ->complete handler entirely)?
> 
> Or leave well enough alone and just continue to disable dm-mq's ability
> to stack on .request_fn paths?
> 
> Given SCSI's switch to scsi-mq by default I cannot see value in propping
> up stacking on the old .request_fn devices.

So, the dm_mod.use_blk_mq flag is global, right?  I guess the question
is whether all of the block device types used on a system under DM are
supported under MQ.  If that is the case then we would be OK.

The other question is whether there are negative performance
consequences in any (corner?) cases with MQ that would result in it
being preferable to run in non-MQ mode (e.g. tag space with lpfc, did
we ever resolve that?) but the right approach there is to put the effort
into the MQ path going forward, as has been the case.

-Ewan
Mike Snitzer July 14, 2017, 9:15 p.m. UTC | #4
On Fri, Jul 14 2017 at  1:17pm -0400,
Ewan D. Milne <emilne@redhat.com> wrote:

> On Fri, 2017-07-14 at 10:19 -0400, Mike Snitzer wrote:
> > 
> > Do you see a benefit to extracting that portion of your WIP patch
> > (removing the ->complete handler entirely)?
> > 
> > Or leave well enough alone and just continue to disable dm-mq's ability
> > to stack on .request_fn paths?
> > 
> > Given SCSI's switch to scsi-mq by default I cannot see value in propping
> > up stacking on the old .request_fn devices.
> 
> So, the dm_mod.use_blk_mq flag is global, right?  I guess the question
> is whether all of the block device types used on a system under DM are
> supported under MQ.  If that is the case then we would be OK.

I didn't quite understand Ewan's question so we talked in person.  His
concern was whether other DM targets needed to be worried about blk-mq
vs not.  I clarified that DM multipath is the only target that is
request-based and that it is fine with stacking on scsi-mq.  And all the
bio-based targets obviously stack just fine on scsi-mq devices.

> The other question is whether there are negative performance
> consequences in any (corner?) cases with MQ that would result in it
> being preferable to run in non-MQ mode (e.g. tag space with lpfc, did
> we ever resolve that?) but the right approach there is to put the effort
> into the MQ path going forward, as has been the case.

Yeap.
diff mbox

Patch

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 0e8ab5b..34cf7b6 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -520,9 +520,21 @@  static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 	return DM_MAPIO_REMAPPED;
 }
 
-static void multipath_release_clone(struct request *clone)
+static void multipath_release_clone(struct dm_target *ti, struct request *clone)
 {
-	blk_put_request(clone);
+	struct multipath *m = ti->private;
+	struct request_queue *q = clone->q;
+
+	if (!q->mq_ops && m->queue_mode == DM_TYPE_MQ_REQUEST_BASED) {
+		/*
+		 * dm-mq on .request_fn already holds clone->q->queue_lock
+		 * via blk_finish_request()...
+		 * - true for .request_fn SCSI, but is it _always_ true?
+		 */
+		lockdep_assert_held(q->queue_lock);
+		__blk_put_request(q, clone);
+	} else
+		blk_put_request(clone);
 }
 
 /*
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index c6ebc5b..95bb44c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -220,11 +220,12 @@  static void dm_end_request(struct request *clone, blk_status_t error)
 {
 	int rw = rq_data_dir(clone);
 	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 
 	blk_rq_unprep_clone(clone);
-	tio->ti->type->release_clone_rq(clone);
+	ti->type->release_clone_rq(ti, clone);
 
 	rq_end_stats(md, rq);
 	if (!rq->q->mq_ops)
@@ -267,6 +268,7 @@  static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)
 {
+	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	int rw = rq_data_dir(rq);
@@ -274,7 +276,7 @@  static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_
 	rq_end_stats(md, rq);
 	if (tio->clone) {
 		blk_rq_unprep_clone(tio->clone);
-		tio->ti->type->release_clone_rq(tio->clone);
+		ti->type->release_clone_rq(ti, tio->clone);
 	}
 
 	if (!rq->q->mq_ops)
@@ -488,7 +490,7 @@  static int map_request(struct dm_rq_target_io *tio)
 	case DM_MAPIO_REMAPPED:
 		if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
 			/* -ENOMEM */
-			ti->type->release_clone_rq(clone);
+			ti->type->release_clone_rq(ti, clone);
 			return DM_MAPIO_REQUEUE;
 		}
 
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index c0d7e60..adbd17b 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -138,12 +138,12 @@  static int io_err_clone_and_map_rq(struct dm_target *ti, struct request *rq,
 	return DM_MAPIO_KILL;
 }
 
-static void io_err_release_clone_rq(struct request *clone)
+static void io_err_release_clone_rq(struct dm_target *ti, struct request *clone)
 {
 }
 
 static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+				     long nr_pages, void **kaddr, pfn_t *pfn)
 {
 	return -EIO;
 }
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 0c1b50ad..f2ca0ab 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -61,7 +61,8 @@  typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
 					    struct request *rq,
 					    union map_info *map_context,
 					    struct request **clone);
-typedef void (*dm_release_clone_request_fn) (struct request *clone);
+typedef void (*dm_release_clone_request_fn) (struct dm_target *ti,
+					     struct request *clone);
 
 /*
  * Returns: