INFO: task hung in blk_queue_enter

Message ID 201805222020.FEJ82897.OFtJMFHOVLQOSF@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived

Commit Message

Tetsuo Handa May 22, 2018, 11:20 a.m. UTC
I checked counter values using debug printk() patch shown below, and
found that q->q_usage_counter.count == 1 when this deadlock occurs.

Since sum of percpu_count did not change after percpu_ref_kill(), this is
not a race condition while folding percpu counter values into atomic counter
value. That is, for some reason, someone who is responsible for calling
percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
unable to call percpu_ref_put().


But I don't know how to find someone who is failing to call percpu_ref_put()...
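
For reference, the pairing that has to hold here is roughly the following
(a simplified sketch, not the actual block layer code): every successful
blk_queue_enter() must eventually be balanced by blk_queue_exit(), otherwise
q_usage_counter can never drain after percpu_ref_kill().

----------------------------------------
if (blk_queue_enter(q, 0) == 0) {   /* percpu_ref_tryget_live(&q->q_usage_counter) */
	/* ... issue I/O against q ... */
	blk_queue_exit(q);           /* percpu_ref_put(&q->q_usage_counter) */
}
----------------------------------------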

Comments

Tetsuo Handa June 1, 2018, 10:10 a.m. UTC | #1
Tetsuo Handa wrote:
> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> not a race condition while folding percpu counter values into atomic counter
> value. That is, for some reason, someone who is responsible for calling
> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> unable to call percpu_ref_put().
> But I don't know how to find someone who is failing to call percpu_ref_put()...

I found the someone. It was already there in the backtrace...

----------------------------------------
[   62.065852] a.out           D    0  4414   4337 0x00000000
[   62.067677] Call Trace:
[   62.068545]  __schedule+0x40b/0x860
[   62.069726]  schedule+0x31/0x80
[   62.070796]  schedule_timeout+0x1c1/0x3c0
[   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
[   62.073670]  blk_queue_enter+0x218/0x520
[   62.074985]  ? remove_wait_queue+0x70/0x70
[   62.076361]  generic_make_request+0x3d/0x540
[   62.077785]  ? __bio_clone_fast+0x6b/0x80
[   62.079147]  ? bio_clone_fast+0x2c/0x70
[   62.080456]  blk_queue_split+0x29b/0x560
[   62.081772]  ? blk_queue_split+0x29b/0x560
[   62.083162]  blk_mq_make_request+0x7c/0x430
[   62.084562]  generic_make_request+0x276/0x540
[   62.086034]  submit_bio+0x6e/0x140
[   62.087185]  ? submit_bio+0x6e/0x140
[   62.088384]  ? guard_bio_eod+0x9d/0x1d0
[   62.089681]  do_mpage_readpage+0x328/0x730
[   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
[   62.092726]  mpage_readpages+0x120/0x190
[   62.094034]  ? check_disk_change+0x70/0x70
[   62.095454]  ? check_disk_change+0x70/0x70
[   62.096849]  ? alloc_pages_current+0x65/0xd0
[   62.098277]  blkdev_readpages+0x18/0x20
[   62.099568]  __do_page_cache_readahead+0x298/0x360
[   62.101157]  ondemand_readahead+0x1f6/0x490
[   62.102546]  ? ondemand_readahead+0x1f6/0x490
[   62.103995]  page_cache_sync_readahead+0x29/0x40
[   62.105539]  generic_file_read_iter+0x7d0/0x9d0
[   62.107067]  ? futex_wait+0x221/0x240
[   62.108303]  ? trace_hardirqs_on+0xd/0x10
[   62.109654]  blkdev_read_iter+0x30/0x40
[   62.110954]  generic_file_splice_read+0xc5/0x140
[   62.112538]  do_splice_to+0x74/0x90
[   62.113726]  splice_direct_to_actor+0xa4/0x1f0
[   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
[   62.116773]  do_splice_direct+0x8a/0xb0
[   62.118056]  do_sendfile+0x1aa/0x390
[   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
[   62.120666]  do_syscall_64+0x6e/0x210
[   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------

The someone is blk_queue_split() from blk_mq_make_request(), which depends on
the assumption that blk_queue_enter() from the recursively called
generic_make_request() does not get blocked due to
percpu_ref_tryget_live(&q->q_usage_counter) failure.

----------------------------------------
generic_make_request(struct bio *bio) {
  if (blk_queue_enter(q, flags) < 0) { /* <= not taken here: percpu_ref_tryget_live() succeeds for this outer call. */
    if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
      bio_wouldblock_error(bio);
    else
      bio_io_error(bio);
    return ret;
  }
(...snipped...)
  ret = q->make_request_fn(q, bio);
(...snipped...)
  if (q)
    blk_queue_exit(q);
}
----------------------------------------

where q->make_request_fn == blk_mq_make_request which does

----------------------------------------
blk_mq_make_request(struct request_queue *q, struct bio *bio) {
   blk_queue_split(q, &bio);
}

blk_queue_split(struct request_queue *q, struct bio **bio) {
  generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
}
----------------------------------------

and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
percpu_ref_kill() are called by blk_freeze_queue_start()...
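
Put differently, the two contexts block each other roughly as follows (a
sketch of the ordering only, not actual code; "task A" and "task B" are
placeholder names):

----------------------------------------
/*
 * task A                                 task B (freezer)
 *
 * generic_make_request(bio)
 *   blk_queue_enter(q)         <= takes a q_usage_counter reference
 *   blk_mq_make_request(q, bio)
 *     blk_queue_split(q, &bio)
 *                                        blk_freeze_queue_start(q)
 *                                          atomic_inc_return(&q->mq_freeze_depth)
 *                                          percpu_ref_kill(&q->q_usage_counter)
 *                                        (then typically waits for the counter
 *                                         to drain, e.g. blk_mq_freeze_queue_wait())
 *       generic_make_request(*bio)
 *         blk_queue_enter(q)   <= percpu_ref_tryget_live() fails; sleeps until
 *                                 q->mq_freeze_depth becomes 0
 *
 * A's outer reference is only dropped after the inner generic_make_request()
 * returns, so neither side can make progress.
 */
----------------------------------------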

Now, it is up to you to decide how to fix this race problem.
Dmitry Vyukov June 4, 2018, 11:46 a.m. UTC | #2
On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...

Nice!

Do I understand it correctly that this bug is probably the root cause
of a whole lot of syzbot "task hung" reports? E.g. this one too?
https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
I guess we will need to sweep close everything related to
filesystems/block devices when this is committed?


> ----------------------------------------
> [   62.065852] a.out           D    0  4414   4337 0x00000000
> [   62.067677] Call Trace:
> [   62.068545]  __schedule+0x40b/0x860
> [   62.069726]  schedule+0x31/0x80
> [   62.070796]  schedule_timeout+0x1c1/0x3c0
> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> [   62.073670]  blk_queue_enter+0x218/0x520
> [   62.074985]  ? remove_wait_queue+0x70/0x70
> [   62.076361]  generic_make_request+0x3d/0x540
> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> [   62.079147]  ? bio_clone_fast+0x2c/0x70
> [   62.080456]  blk_queue_split+0x29b/0x560
> [   62.081772]  ? blk_queue_split+0x29b/0x560
> [   62.083162]  blk_mq_make_request+0x7c/0x430
> [   62.084562]  generic_make_request+0x276/0x540
> [   62.086034]  submit_bio+0x6e/0x140
> [   62.087185]  ? submit_bio+0x6e/0x140
> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> [   62.089681]  do_mpage_readpage+0x328/0x730
> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> [   62.092726]  mpage_readpages+0x120/0x190
> [   62.094034]  ? check_disk_change+0x70/0x70
> [   62.095454]  ? check_disk_change+0x70/0x70
> [   62.096849]  ? alloc_pages_current+0x65/0xd0
> [   62.098277]  blkdev_readpages+0x18/0x20
> [   62.099568]  __do_page_cache_readahead+0x298/0x360
> [   62.101157]  ondemand_readahead+0x1f6/0x490
> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> [   62.103995]  page_cache_sync_readahead+0x29/0x40
> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> [   62.107067]  ? futex_wait+0x221/0x240
> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> [   62.109654]  blkdev_read_iter+0x30/0x40
> [   62.110954]  generic_file_splice_read+0xc5/0x140
> [   62.112538]  do_splice_to+0x74/0x90
> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> [   62.116773]  do_splice_direct+0x8a/0xb0
> [   62.118056]  do_sendfile+0x1aa/0x390
> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> [   62.120666]  do_syscall_64+0x6e/0x210
> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request(), which depends on
> the assumption that blk_queue_enter() from the recursively called
> generic_make_request() does not get blocked due to
> percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
>   if (blk_queue_enter(q, flags) < 0) { /* <= not taken here: percpu_ref_tryget_live() succeeds for this outer call. */
>     if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>       bio_wouldblock_error(bio);
>     else
>       bio_io_error(bio);
>     return ret;
>   }
> (...snipped...)
>   ret = q->make_request_fn(q, bio);
> (...snipped...)
>   if (q)
>     blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>    blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
>   generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you to decide how to fix this race problem.
>
Tetsuo Handa June 4, 2018, 1:13 p.m. UTC | #3
On 2018/06/04 20:46, Dmitry Vyukov wrote:
> On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> Tetsuo Handa wrote:
>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>> not a race condition while folding percpu counter values into atomic counter
>>> value. That is, for some reason, someone who is responsible for calling
>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>> unable to call percpu_ref_put().
>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>
>> I found the someone. It was already there in the backtrace...
> 
> Nice!
> 
> Do I understand it correctly that this bug is probably the root cause
> of a whole lot of syzbot "task hung" reports? E.g. this one too?
> https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
> I guess we will need to sweep close everything related to
> filesystems/block devices when this is committed?
> 

I can't tell. We still have a lockdep warning for the loop module. Also, I suspect
that the hung tasks waiting on a page bit might be related to the binder module.

Since we are about to merge the "kernel/hung_task.c: show all hung tasks before panic"
patch, we might be able to get more hints after 4.18-rc1.

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@  struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -950,10 +956,22 @@  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		 */
 		smp_rmb();
 
-		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
-			   blk_queue_dying(q));
+		while (wait_event_timeout(q->mq_freeze_wq,
+					  (atomic_read(&q->mq_freeze_depth) == 0 &&
+					   (preempt || !blk_queue_preempt_only(q))) ||
+					  blk_queue_dying(q), 3 * HZ) == 0) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+		}
+
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@  static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	for_each_possible_cpu(cpu)
 		count += *per_cpu_ptr(percpu_count, cpu);
 
-	pr_debug("global %ld percpu %ld",
-		 atomic_long_read(&ref->count), (long)count);
+	printk("%px global %ld percpu %ld\n", ref,
+	       atomic_long_read(&ref->count), (long)count);
 
 	/*
 	 * It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@  static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	 */
 	atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
+	printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
 	WARN_ONCE(atomic_long_read(&ref->count) <= 0,
 		  "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
 		  ref->release, atomic_long_read(&ref->count));

If I change blk_queue_enter() not to wait at wait_event() if
q->mq_freeze_depth != 0, this deadlock problem does not occur.
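
One possible reading of that change, as a minimal sketch against the
blk_queue_enter() hunk shown in the patch above (the -EAGAIN return value is
an arbitrary choice for illustration, not a proposed interface):

----------------------------------------
		/* After percpu_ref_tryget_live() has failed: bail out instead
		 * of sleeping on q->mq_freeze_wq while a freeze is in
		 * progress. */
		if (atomic_read(&q->mq_freeze_depth))
			return -EAGAIN;

		wait_event(q->mq_freeze_wq,
			   (atomic_read(&q->mq_freeze_depth) == 0 &&
			    (preempt || !blk_queue_preempt_only(q))) ||
			   blk_queue_dying(q));
----------------------------------------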

Also, I found that if blk_freeze_queue_start() tries to wait for
counters to become 1 before calling percpu_ref_kill() (like shown
below), this deadlock problem does not occur.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@  void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
 }
 
+#define PERCPU_COUNT_BIAS       (1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
+		int i;
+		for (i = 0; i < 10; i++) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+				break;
+			printk("%s(%d):! %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+			schedule_timeout_uninterruptible(HZ / 10);
+		}
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
 			blk_mq_run_hw_queues(q, false);