Message ID | 201805222020.FEJ82897.OFtJMFHOVLQOSF@I-love.SAKURA.ne.jp (mailing list archive) |
---|---|
State | New, archived |
Tetsuo Handa wrote:
> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> not a race condition while folding percpu counter values into atomic counter
> value. That is, for some reason, someone who is responsible for calling
> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> unable to call percpu_ref_put().
> But I don't know how to find someone who is failing to call percpu_ref_put()...

I found the someone. It was already there in the backtrace...

----------------------------------------
[   62.065852] a.out           D    0  4414   4337 0x00000000
[   62.067677] Call Trace:
[   62.068545]  __schedule+0x40b/0x860
[   62.069726]  schedule+0x31/0x80
[   62.070796]  schedule_timeout+0x1c1/0x3c0
[   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
[   62.073670]  blk_queue_enter+0x218/0x520
[   62.074985]  ? remove_wait_queue+0x70/0x70
[   62.076361]  generic_make_request+0x3d/0x540
[   62.077785]  ? __bio_clone_fast+0x6b/0x80
[   62.079147]  ? bio_clone_fast+0x2c/0x70
[   62.080456]  blk_queue_split+0x29b/0x560
[   62.081772]  ? blk_queue_split+0x29b/0x560
[   62.083162]  blk_mq_make_request+0x7c/0x430
[   62.084562]  generic_make_request+0x276/0x540
[   62.086034]  submit_bio+0x6e/0x140
[   62.087185]  ? submit_bio+0x6e/0x140
[   62.088384]  ? guard_bio_eod+0x9d/0x1d0
[   62.089681]  do_mpage_readpage+0x328/0x730
[   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
[   62.092726]  mpage_readpages+0x120/0x190
[   62.094034]  ? check_disk_change+0x70/0x70
[   62.095454]  ? check_disk_change+0x70/0x70
[   62.096849]  ? alloc_pages_current+0x65/0xd0
[   62.098277]  blkdev_readpages+0x18/0x20
[   62.099568]  __do_page_cache_readahead+0x298/0x360
[   62.101157]  ondemand_readahead+0x1f6/0x490
[   62.102546]  ? ondemand_readahead+0x1f6/0x490
[   62.103995]  page_cache_sync_readahead+0x29/0x40
[   62.105539]  generic_file_read_iter+0x7d0/0x9d0
[   62.107067]  ? futex_wait+0x221/0x240
[   62.108303]  ? trace_hardirqs_on+0xd/0x10
[   62.109654]  blkdev_read_iter+0x30/0x40
[   62.110954]  generic_file_splice_read+0xc5/0x140
[   62.112538]  do_splice_to+0x74/0x90
[   62.113726]  splice_direct_to_actor+0xa4/0x1f0
[   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
[   62.116773]  do_splice_direct+0x8a/0xb0
[   62.118056]  do_sendfile+0x1aa/0x390
[   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
[   62.120666]  do_syscall_64+0x6e/0x210
[   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------

The someone is blk_queue_split() from blk_mq_make_request(), which depends on
the assumption that blk_queue_enter(), reached from the recursively called
generic_make_request(), does not get blocked by a
percpu_ref_tryget_live(&q->q_usage_counter) failure.

----------------------------------------
generic_make_request(struct bio *bio) {
        if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
                if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
                        bio_wouldblock_error(bio);
                else
                        bio_io_error(bio);
                return ret;
        }
        (...snipped...)
        ret = q->make_request_fn(q, bio);
        (...snipped...)
        if (q)
                blk_queue_exit(q);
}
----------------------------------------

where q->make_request_fn == blk_mq_make_request, which does

----------------------------------------
blk_mq_make_request(struct request_queue *q, struct bio *bio) {
        blk_queue_split(q, &bio);
}

blk_queue_split(struct request_queue *q, struct bio **bio) {
        generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits
                                          until atomic_read(&q->mq_freeze_depth)
                                          becomes 0. */
}
----------------------------------------

and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
percpu_ref_kill() are called by blk_freeze_queue_start()...

Now, it is up to you how to fix this race problem.
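[Editorial note: to make the interleaving concrete, here is a minimal userspace
mock of the scenario. This is a sketch, not kernel code: a mutex/condvar pair
stands in for q->mq_freeze_wq, plain counters stand in for q->q_usage_counter
and q->mq_freeze_depth, and the two threads mirror the submitter from the
backtrace and blk_freeze_queue_start(). Built with `cc -pthread`, it hangs the
same way; a watchdog in main() reports the stuck state after two seconds.]

----------------------------------------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t freeze_wq = PTHREAD_COND_INITIALIZER;
static int refs;          /* stands in for q->q_usage_counter */
static int ref_dead;      /* set by the mock percpu_ref_kill() */
static int freeze_depth;  /* stands in for q->mq_freeze_depth */

static void queue_enter(void)  /* mock blk_queue_enter() */
{
        pthread_mutex_lock(&lock);
        /* tryget_live() fails once the ref is killed; then we sleep. */
        while (ref_dead || freeze_depth)
                pthread_cond_wait(&freeze_wq, &lock);
        refs++;
        pthread_mutex_unlock(&lock);
}

static void *submitter(void *arg)  /* the task from the backtrace */
{
        (void)arg;
        queue_enter();              /* outer generic_make_request(): OK */
        usleep(100 * 1000);         /* freezer kills the ref meanwhile */
        puts("submitter: recursive enter from the split bio...");
        queue_enter();              /* sleeps forever: deadlock */
        puts("submitter: unreachable");
        return NULL;
}

static void *freezer(void *arg)  /* mock blk_freeze_queue_start() */
{
        (void)arg;
        usleep(50 * 1000);
        pthread_mutex_lock(&lock);
        freeze_depth++;
        ref_dead = 1;               /* mock percpu_ref_kill() */
        puts("freezer: ref killed, waiting for refs to drain...");
        while (refs > 0)            /* never drains: submitter holds one */
                pthread_cond_wait(&freeze_wq, &lock);
        pthread_mutex_unlock(&lock);
        puts("freezer: unreachable");
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, submitter, NULL);
        pthread_create(&b, NULL, freezer, NULL);
        sleep(2);  /* watchdog: both threads are stuck by now */
        printf("deadlocked: refs=%d freeze_depth=%d ref_dead=%d\n",
               refs, freeze_depth, ref_dead);
        return 0;
}
----------------------------------------

The submitter still holds the reference taken by its outer enter, so the
freezer never sees the count drain; the freezer holds the freeze, so the
submitter's nested enter never wakes: each side waits on the other.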
On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...

Nice!

Do I understand it correctly that this bug is probably the root cause
of a whole lot of syzbot "task hung" reports? E.g. this one too?
https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
I guess we will need to sweep close everything related to
filesystems/block devices when this is committed?

> (...snipped: backtrace and analysis quoted from the previous message...)
On 2018/06/04 20:46, Dmitry Vyukov wrote:
> On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> (...snipped...)
>>
>> I found the someone. It was already there in the backtrace...
>
> Nice!
>
> Do I understand it correctly that this bug is probably the root cause
> of a whole lot of syzbot "task hung" reports? E.g. this one too?
> https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
> I guess we will need to sweep close everything related to
> filesystems/block devices when this is committed?
>

I can't tell. We still have a lockdep warning for the loop module. Also, I
suspect that the hung tasks waiting for a page bit might be related to the
binder module. Since we are about to merge the "kernel/hung_task.c: show all
hung tasks before panic" patch, we might be able to get more hints after
4.18-rc1.
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -950,10 +956,22 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		 */
 		smp_rmb();
 
-		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
-			   blk_queue_dying(q));
+		while (wait_event_timeout(q->mq_freeze_wq,
+					  (atomic_read(&q->mq_freeze_depth) == 0 &&
+					   (preempt || !blk_queue_preempt_only(q))) ||
+					  blk_queue_dying(q), 3 * HZ) == 0) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+		}
+
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	for_each_possible_cpu(cpu)
 		count += *per_cpu_ptr(percpu_count, cpu);
 
-	pr_debug("global %ld percpu %ld",
-		 atomic_long_read(&ref->count), (long)count);
+	printk("%px global %ld percpu %ld\n", ref,
+	       atomic_long_read(&ref->count), (long)count);
 
 	/*
 	 * It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	 */
 	atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
+	printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
 	WARN_ONCE(atomic_long_read(&ref->count) <= 0,
 		  "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
 		  ref->release, atomic_long_read(&ref->count));

If I change blk_queue_enter() not to wait at wait_event() when
q->mq_freeze_depth != 0, this deadlock problem does not occur.

Also, I found that if blk_freeze_queue_start() waits for the refcount sum to
become 1 before calling percpu_ref_kill() (as shown below), this deadlock
problem does not occur.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
 }
 
+#define PERCPU_COUNT_BIAS	(1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+	return (unsigned long __percpu *)
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
+		int i;
+		for (i = 0; i < 10; i++) {
+			struct percpu_ref *ref = &q->q_usage_counter;
+			unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+			unsigned long count = 0;
+			int cpu;
+
+			for_each_possible_cpu(cpu)
+				count += *per_cpu_ptr(percpu_count, cpu);
+
+			if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+				break;
+			printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+			       ref, atomic_long_read(&ref->count), count);
+			schedule_timeout_uninterruptible(HZ / 10);
+		}
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
 			blk_mq_run_hw_queues(q, false);
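[Editorial note: the first experiment above ("not to wait at wait_event() when
q->mq_freeze_depth != 0") is described only in prose. The following is a
hypothetical sketch of what such a blk_queue_enter() variant could look like,
reconstructed from the function quoted in the first diff; it is not a patch
that was posted in this thread.]

----------------------------------------
/*
 * Hypothetical sketch: make blk_queue_enter() fail instead of sleeping
 * while a freeze is in flight. Simplified from the 4.17-era function;
 * the RCU/preempt-only bookkeeping of the real code is omitted.
 */
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

	while (true) {
		if (percpu_ref_tryget_live(&q->q_usage_counter))
			return 0;

		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;

		/*
		 * The experimental change: a freeze is in progress, so do
		 * not sleep. A generic_make_request() nested inside
		 * blk_queue_split() already pins q->q_usage_counter via its
		 * outer blk_queue_enter(), so sleeping here can never end.
		 */
		if (atomic_read(&q->mq_freeze_depth))
			return -EBUSY;

		if (blk_queue_dying(q))
			return -ENODEV;

		wait_event(q->mq_freeze_wq,
			   (atomic_read(&q->mq_freeze_depth) == 0 &&
			    (preempt || !blk_queue_preempt_only(q))) ||
			   blk_queue_dying(q));
	}
}
----------------------------------------

In this form the nested generic_make_request() issued by blk_queue_split()
gets an error back instead of sleeping against the reference its outer caller
still holds, so the hang becomes an I/O error on the split bio. It also fails
every unrelated opener for the duration of a freeze, which is presumably why
both experiments here are diagnostics rather than the eventual fix.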