
v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone

Message ID 20170207024906.4oswyuvxfnqkvbhr@moria.home.lan (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer

Commit Message

Kent Overstreet Feb. 7, 2017, 2:49 a.m. UTC
On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > 
> > So.. what needs to be done there?

> But, I just got an idea for how to handle this that might be halfway sane, maybe
> I'll try and come up with a patch...

Ok, here's such a patch, only lightly tested:

-- >8 --
Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset

Note: this patch is very lightly tested.

Also, trigger rescuing whenever we block with bios on current->bio_list,
instead of only when we block in bio_alloc_bioset(). This is more correct,
and should result in fewer rescuer threads.

XXX: The current->bio_list plugging needs to be unified with the
blk_plug mechanism.

TODO: If we change normal request_queue drivers to handle arbitrary size
bios by processing requests incrementally, instead of splitting bios,
then we can get rid of rescuer threads from those devices.
---
 block/bio.c            | 107 ++++---------------------------------------------
 block/blk-core.c       |  58 ++++++++++++++++++++++++---
 block/blk-sysfs.c      |   2 +
 include/linux/bio.h    |  16 ++++----
 include/linux/blkdev.h |  10 +++++
 include/linux/sched.h  |   2 +-
 kernel/sched/core.c    |   4 ++
 7 files changed, 83 insertions(+), 116 deletions(-)
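
For context, this is the deadlock the rescuer machinery exists to break,
shown as a hypothetical bio-splitting ->make_request_fn (an illustrative
sketch only, not part of the patch; the example_*/EXAMPLE_* names are made
up):

static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
{
	if (bio_sectors(bio) > EXAMPLE_MAX_SECTORS) {
		/*
		 * bio_split() allocates from q->bio_split's mempool and may
		 * sleep.  The remainder we resubmit below only lands on
		 * current->bio_list -- it isn't processed (so its splits
		 * never complete and refill the mempool) until we return.
		 * Exhaust the mempool reserve this way and we sleep forever;
		 * the rescuer breaks the cycle by taking the parked bios and
		 * submitting them from another thread.
		 */
		struct bio *split = bio_split(bio, EXAMPLE_MAX_SECTORS,
					      GFP_NOIO, q->bio_split);

		bio_chain(split, bio);
		generic_make_request(bio);	/* park the remainder */
		bio = split;
	}

	return example_issue(q, bio);		/* made up: do the actual I/O */
}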

Comments

Mike Snitzer Feb. 7, 2017, 5:13 p.m. UTC | #1
On Mon, Feb 06 2017 at  9:49pm -0500,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > 
> > > So.. what needs to be done there?
> 
> > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > I'll try and come up with a patch...
> 
> Ok, here's such a patch, only lightly tested:
> 
> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
> 
> Note: this patch is very lightly tested.
> 
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.
> 
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.
> 
> TODO: If we change normal request_queue drivers to handle arbitrary size
> bios by processing requests incrementally, instead of splitting bios,
> then we can get rid of rescuer threads from those devices.

Hi Kent,

I really appreciate you working on this further.  Thanks.

As I think you're probably already aware, a long-standing issue with the
per-bio_set rescuer is this bug (which manifests as dm-snapshot
deadlocks): https://bugzilla.kernel.org/show_bug.cgi?id=119841

Please also see this patch header, from a private branch from a while
ago, that describes the problem in detail:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=cd2c760b5a609e2aaf3735a7b9503a953535c368

I'd welcome your consideration of that BZ as you think further and/or
iterate on this line of work.

Mike

Pavel Machek Feb. 7, 2017, 8:39 p.m. UTC | #2
On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > 
> > > So.. what needs to be done there?
> 
> > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > I'll try and come up with a patch...
> 
> Ok, here's such a patch, only lightly tested:

I guess it would be nice for me to test it... but what is it against?
I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.

Thanks,
								Pavel

> [...]
Ming Lei Feb. 8, 2017, 2:47 a.m. UTC | #3
On Tue, Feb 7, 2017 at 10:49 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
>> On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
>> > Still there on v4.9, 36 threads on nokia n900 cellphone.
>> >
>> > So.. what needs to be done there?
>
>> But, I just got an idea for how to handle this that might be halfway sane, maybe
>> I'll try and come up with a patch...
>
> Ok, here's such a patch, only lightly tested:
>
> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
>
> Note: this patch is very lightly tested.
>
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.

Looks like the rescuer stuff gets simplified a lot with this patch.

>
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.

Yeah, that can be another benefit :-)

>
> TODO: If we change normal request_queue drivers to handle arbitrary size
> bios by processing requests incrementally, instead of splitting bios,
> then we can get rid of rescuer threads from those devices.

Also, the rescuer threads often belong to reserved block devices such as
loop/nbd; we should allow these drivers to delay allocating the thread
until just before the disk is activated. Then the thread count could be
decreased a lot.
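
One possible shape for that (an untested sketch; blk_queue_alloc_rescuer()
is a hypothetical helper): move the workqueue allocation out of
blk_alloc_queue_node() and into something called when the disk is
activated, e.g. from add_disk():

static int blk_queue_alloc_rescuer(struct request_queue *q)
{
	if (q->rescue_workqueue)
		return 0;	/* already allocated */

	/* Lazy allocation: reserved devices (loop/nbd) don't pay for a
	 * rescuer thread until their disk actually goes live. */
	q->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
	return q->rescue_workqueue ? 0 : -ENOMEM;
}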

> [...]
> +void blk_punt_blocked_bios(struct bio_plug_list *list)
> +{
> +       spin_lock(&list->q->rescue_lock);
> +       bio_list_merge(&list->q->rescue_list, &list->bios);
> +       bio_list_init(&list->bios);
> +       spin_unlock(&list->q->rescue_lock);
> +
> +       queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
> +}

I guess we may need to move the bios onto their own queue's rescue list;
otherwise it can still deadlock in the following situation.

Take md as an example:

generic_make_request(bio_m0, md_q)
    ->.make_request(md_q)
        ->bio_l = mempool_alloc(md_pool)
            ->bio_l->q = ll_q
            ->generic_make_request(bio_l, ll_q)

    ....
    mempool_alloc() reclaim is triggered when allocating a new bio for the
    low-level device:
        -> suppose the current bio_list holds bio_m1, bio_l0, bio_l1
        -> all are queued onto md_q's rescue_list, and handled in md_q's
           rescue workqueue context
        -> bio_m1 is handled first, with mempool_alloc() reclaim triggered
           again
        -> when bio_l0 is handled in generic_make_request(), the rescue_list
           is switched to ll_q's (which can be empty), and then nothing can
           move forward
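
Something along these lines might avoid that (an untested sketch of the
idea; it routes each bio via bdev_get_queue() rather than using the single
queue cached in the plug list, and assumes every queue has a rescuer as in
this patch):

void blk_punt_blocked_bios(struct bio_plug_list *list)
{
	struct bio *bio;

	/*
	 * Punt each bio to the rescuer of the queue it actually targets,
	 * so bios destined for a lower-level queue can't be stranded
	 * behind the upper queue's rescuer.
	 */
	while ((bio = bio_list_pop(&list->bios))) {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		spin_lock(&q->rescue_lock);
		bio_list_add(&q->rescue_list, bio);
		spin_unlock(&q->rescue_lock);

		queue_work(q->rescue_workqueue, &q->rescue_work);
	}
}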

Mike Galbraith Feb. 8, 2017, 3:12 a.m. UTC | #4
On Tue, 2017-02-07 at 21:39 +0100, Pavel Machek wrote:
> On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > 
> > > > So.. what needs to be done there?
> > 
> > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > I'll try and come up with a patch...
> > 
> > Ok, here's such a patch, only lightly tested:
> 
> I guess it would be nice for me to test it... but what is it against?
> I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.

It wedged into master easily enough (the box still seems to work... but
I'll be rebooting in a very few seconds just in case :-), but threads on
my desktop box only dropped from 73 to 71.  Poo.

	-Mike

Mike Galbraith Feb. 8, 2017, 6:57 a.m. UTC | #5
On Tue, 2017-02-07 at 19:58 -0900, Kent Overstreet wrote:
> On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > 
> > > > > So.. what needs to be done there?
> > > 
> > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > I'll try and come up with a patch...
> > > 
> > > Ok, here's such a patch, only lightly tested:
> > 
> > I guess it would be nice for me to test it... but what is it against?
> > I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.
> 
> Sorry, I forgot I had a few other patches in my branch that touch
> mempool/biosets code.
> 
> Also, after thinking about it more and looking at the relevant code, I'm pretty
> sure we don't need rescuer threads for block devices that just split bios (i.e.
> most of them), so I changed my patch to do that.
> 
> Tested it by ripping out the current->bio_list checks/workarounds from the
> bcache code, appears to work:

The patch killed every last one of them, but...

homer:/root # dmesg|grep WARNING
[   11.701447] WARNING: CPU: 4 PID: 801 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   11.711027] WARNING: CPU: 4 PID: 801 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.728989] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.737020] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.746173] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.755260] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.763837] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.772526] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0


Patch

diff --git a/block/bio.c b/block/bio.c
index f3b5786202..9ad54a9b12 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -336,54 +336,6 @@  void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
-static void bio_alloc_rescue(struct work_struct *work)
-{
-	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
-	struct bio *bio;
-
-	while (1) {
-		spin_lock(&bs->rescue_lock);
-		bio = bio_list_pop(&bs->rescue_list);
-		spin_unlock(&bs->rescue_lock);
-
-		if (!bio)
-			break;
-
-		generic_make_request(bio);
-	}
-}
-
-static void punt_bios_to_rescuer(struct bio_set *bs)
-{
-	struct bio_list punt, nopunt;
-	struct bio *bio;
-
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
-
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
-}
-
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -421,54 +373,27 @@  static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	struct bio_vec *bvl = NULL;
 	struct bio *bio;
 	void *p;
 
-	if (!bs) {
-		if (nr_iovecs > UIO_MAXIOV)
-			return NULL;
+	WARN(current->bio_list &&
+	     !current->bio_list->q->rescue_workqueue,
+	     "allocating bio beneath generic_make_request() without rescuer");
 
+	if (nr_iovecs > UIO_MAXIOV)
+		return NULL;
+
+	if (!bs) {
 		p = kmalloc(sizeof(struct bio) +
 			    nr_iovecs * sizeof(struct bio_vec),
 			    gfp_mask);
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
 		p = mempool_alloc(&bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(&bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -483,12 +408,6 @@  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		unsigned long idx = 0;
 
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
@@ -1938,10 +1857,6 @@  int biovec_init_pool(mempool_t *pool, int pool_entries)
 
 void bioset_exit(struct bio_set *bs)
 {
-	if (bs->rescue_workqueue)
-		destroy_workqueue(bs->rescue_workqueue);
-	bs->rescue_workqueue = NULL;
-
 	mempool_exit(&bs->bio_pool);
 	mempool_exit(&bs->bvec_pool);
 
@@ -1968,10 +1883,6 @@  static int __bioset_init(struct bio_set *bs,
 
 	bs->front_pad = front_pad;
 
-	spin_lock_init(&bs->rescue_lock);
-	bio_list_init(&bs->rescue_list);
-	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
-
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab)
 		return -ENOMEM;
@@ -1983,10 +1894,6 @@  static int __bioset_init(struct bio_set *bs,
 	    biovec_init_pool(&bs->bvec_pool, pool_size))
 		goto bad;
 
-	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
-	if (!bs->rescue_workqueue)
-		goto bad;
-
 	return 0;
 bad:
 	bioset_exit(bs);
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e3cfa9c88..f716164cb3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -48,6 +48,8 @@  EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
 
+static void bio_rescue_work(struct work_struct *);
+
 /*
  * For the allocated request tables
  */
@@ -759,11 +761,21 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
-	if (blkcg_init_queue(q))
+	spin_lock_init(&q->rescue_lock);
+	bio_list_init(&q->rescue_list);
+	INIT_WORK(&q->rescue_work, bio_rescue_work);
+
+	q->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!q->rescue_workqueue)
 		goto fail_ref;
 
+	if (blkcg_init_queue(q))
+		goto fail_rescue;
+
 	return q;
 
+fail_rescue:
+	destroy_workqueue(q->rescue_workqueue);
 fail_ref:
 	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
@@ -1994,7 +2006,7 @@  generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct bio_plug_list bio_list_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2011,7 +2023,9 @@  blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		WARN(!current->bio_list->q->rescue_workqueue,
+		     "submitting bio beneath generic_make_request() without rescuer");
+		bio_list_add(&current->bio_list->bios, bio);
 		goto out;
 	}
 
@@ -2030,19 +2044,23 @@  blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
+	bio_list_init(&bio_list_on_stack.bios);
 	current->bio_list = &bio_list_on_stack;
+
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		current->bio_list->q = q;
+
 		if (likely(blk_queue_enter(q, false) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(&current->bio_list->bios);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
+			struct bio *bio_next =
+				bio_list_pop(&current->bio_list->bios);
 
 			bio_io_error(bio);
 			bio = bio_next;
@@ -2055,6 +2073,34 @@  blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+static void bio_rescue_work(struct work_struct *work)
+{
+	struct request_queue *q =
+		container_of(work, struct request_queue, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&q->rescue_lock);
+		bio = bio_list_pop(&q->rescue_list);
+		spin_unlock(&q->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+void blk_punt_blocked_bios(struct bio_plug_list *list)
+{
+	spin_lock(&list->q->rescue_lock);
+	bio_list_merge(&list->q->rescue_list, &list->bios);
+	bio_list_init(&list->bios);
+	spin_unlock(&list->q->rescue_lock);
+
+	queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
+}
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7f27a18cc4..77529238d1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -660,6 +660,8 @@  static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	if (q->rescue_workqueue)
+		destroy_workqueue(q->rescue_workqueue);
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1ffe8e37ae..87eeec7eda 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -658,6 +658,13 @@  static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+struct bio_plug_list {
+	struct bio_list		bios;
+	struct request_queue	*q;
+};
+
+void blk_punt_blocked_bios(struct bio_plug_list *);
+
 /*
  * Increment chain count for the bio. Make sure the CHAIN flag update
  * is visible before the raised count.
@@ -687,15 +694,6 @@  struct bio_set {
 	mempool_t bio_integrity_pool;
 	mempool_t bvec_integrity_pool;
 #endif
-
-	/*
-	 * Deadlock avoidance for stacking block drivers: see comments in
-	 * bio_alloc_bioset() for details
-	 */
-	spinlock_t		rescue_lock;
-	struct bio_list		rescue_list;
-	struct work_struct	rescue_work;
-	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358ba0..f64b886c65 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -476,6 +476,16 @@  struct request_queue {
 	struct bio_set		*bio_split;
 
 	bool			mq_sysfs_init_done;
+
+	/*
+	 * Deadlock avoidance, to deal with the plugging in
+	 * generic_make_request() that converts recursion to iteration to avoid
+	 * stack overflow:
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2865d10a28..59df7a1030 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1762,7 +1762,7 @@  struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct bio_plug_list *bio_list;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd39d698cb..23b6290ba1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3439,6 +3439,10 @@  static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
+
+	if (tsk->bio_list && !bio_list_empty(&tsk->bio_list->bios))
+		blk_punt_blocked_bios(tsk->bio_list);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.