From patchwork Sun May 22 17:07:33 2022
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12858285
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org,
    Coly Li, stable@vger.kernel.org
Subject: [PATCH 1/4] bcache: improve multithreaded bch_btree_check()
Date: Mon, 23 May 2022 01:07:33 +0800
Message-Id: <20220522170736.6582-2-colyli@suse.de>
In-Reply-To: <20220522170736.6582-1-colyli@suse.de>
References: <20220522170736.6582-1-colyli@suse.de>

Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() much faster when checking all
btree nodes during cache device registration. But it isn't in ideal
shape yet and can still be improved.

This patch makes the following changes to improve the current parallel
btree node checking in bch_btree_check():

- Add a read lock to the root node while checking all the btree nodes
  with multiple threads. Although it is not strictly mandatory at the
  moment, it is good to hold a read lock here in the code logic.
- Remove the local variable 'char name[32]' and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state"
  on the stack and avoid the unnecessary dynamic memory allocation for
  it.
- Increase check_state.started to count a created kernel thread only
  after the thread is successfully created.
- When waiting for all checking kernel threads to finish, use
  wait_event() to replace wait_event_interruptible().

With these changes, the code is clearer and some potential error
conditions are avoided.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/btree.c | 58 ++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ad9f16689419..2362bb8ef6d1 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c)
 	int i;
 	struct bkey *k = NULL;
 	struct btree_iter iter;
-	struct btree_check_state *check_state;
-	char name[32];
+	struct btree_check_state check_state;
 
 	/* check and mark root node keys */
 	for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid)
@@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c)
 	if (c->root->level == 0)
 		return 0;
 
-	check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL);
-	if (!check_state)
-		return -ENOMEM;
-
-	check_state->c = c;
-	check_state->total_threads = bch_btree_chkthread_nr();
-	check_state->key_idx = 0;
-	spin_lock_init(&check_state->idx_lock);
-	atomic_set(&check_state->started, 0);
-	atomic_set(&check_state->enough, 0);
-	init_waitqueue_head(&check_state->wait);
+	check_state.c = c;
+	check_state.total_threads = bch_btree_chkthread_nr();
+	check_state.key_idx = 0;
+	spin_lock_init(&check_state.idx_lock);
+	atomic_set(&check_state.started, 0);
+	atomic_set(&check_state.enough, 0);
+	init_waitqueue_head(&check_state.wait);
+	rw_lock(0, c->root, c->root->level);
 
 	/*
 	 * Run multiple threads to check btree nodes in parallel,
-	 * if check_state->enough is non-zero, it means current
+	 * if check_state.enough is non-zero, it means current
 	 * running check threads are enough, unncessary to create
 	 * more.
 	 */
-	for (i = 0; i < check_state->total_threads; i++) {
-		/* fetch latest check_state->enough earlier */
+	for (i = 0; i < check_state.total_threads; i++) {
+		/* fetch latest check_state.enough earlier */
 		smp_mb__before_atomic();
-		if (atomic_read(&check_state->enough))
+		if (atomic_read(&check_state.enough))
 			break;
 
-		check_state->infos[i].result = 0;
-		check_state->infos[i].state = check_state;
-		snprintf(name, sizeof(name), "bch_btrchk[%u]", i);
-		atomic_inc(&check_state->started);
+		check_state.infos[i].result = 0;
+		check_state.infos[i].state = &check_state;
 
-		check_state->infos[i].thread =
+		check_state.infos[i].thread =
 			kthread_run(bch_btree_check_thread,
-				    &check_state->infos[i],
-				    name);
-		if (IS_ERR(check_state->infos[i].thread)) {
+				    &check_state.infos[i],
+				    "bch_btrchk[%d]", i);
+		if (IS_ERR(check_state.infos[i].thread)) {
 			pr_err("fails to run thread bch_btrchk[%d]\n", i);
 			for (--i; i >= 0; i--)
-				kthread_stop(check_state->infos[i].thread);
+				kthread_stop(check_state.infos[i].thread);
 			ret = -ENOMEM;
 			goto out;
 		}
+		atomic_inc(&check_state.started);
 	}
 
 	/*
 	 * Must wait for all threads to stop.
 	 */
-	wait_event_interruptible(check_state->wait,
-				 atomic_read(&check_state->started) == 0);
+	wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
 
-	for (i = 0; i < check_state->total_threads; i++) {
-		if (check_state->infos[i].result) {
-			ret = check_state->infos[i].result;
+	for (i = 0; i < check_state.total_threads; i++) {
+		if (check_state.infos[i].result) {
+			ret = check_state.infos[i].result;
 			goto out;
 		}
 	}
 
 out:
-	kfree(check_state);
+	rw_unlock(0, c->root);
 	return ret;
 }
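Taken together, the changes in this patch converge on a small, reusable
thread-pool pattern: per-pool state on the caller's stack, a counter
incremented only after kthread_run() has succeeded, and an
uninterruptible wait_event() so the stack frame provably outlives the
workers. The following is a minimal, self-contained sketch of that
pattern; struct work_state, worker_fn() and run_workers() are
hypothetical names for illustration, not the bcache code itself.

    #include <linux/atomic.h>
    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/wait.h>

    struct work_state {
    	atomic_t started;		/* number of live workers */
    	wait_queue_head_t wait;		/* caller sleeps here */
    };

    static int worker_fn(void *arg)
    {
    	struct work_state *s = arg;

    	/* ... per-thread work goes here ... */

    	/* Last act of a worker: drop the count, wake the waiter. */
    	if (atomic_dec_and_test(&s->started))
    		wake_up(&s->wait);
    	return 0;
    }

    static void run_workers(int nr)
    {
    	struct work_state s;	/* stack allocation, no kzalloc() */
    	struct task_struct *tsk;
    	int i;

    	atomic_set(&s.started, 0);
    	init_waitqueue_head(&s.wait);

    	for (i = 0; i < nr; i++) {
    		/* kthread_run() takes a printf-style name: no snprintf() */
    		tsk = kthread_run(worker_fn, &s, "worker[%d]", i);
    		if (IS_ERR(tsk))
    			break;
    		/* count only after creation succeeded, as in the patch */
    		atomic_inc(&s.started);
    	}

    	/*
    	 * Not wait_event_interruptible(): a signal must not let the
    	 * caller return (and free the on-stack state) while worker
    	 * threads are still running and dereferencing it.
    	 */
    	wait_event(s.wait, atomic_read(&s.started) == 0);
    }

The last point is the subtle one: with the state moved from the heap to
the stack, an interruptible wait that returns early on a signal would
free the workers' shared state out from under them.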
From patchwork Sun May 22 17:07:34 2022
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12858286
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org,
    Coly Li, stable@vger.kernel.org
Subject: [PATCH 2/4] bcache: improve multithreaded bch_sectors_dirty_init()
Date: Mon, 23 May 2022 01:07:34 +0800
Message-Id: <20220522170736.6582-3-colyli@suse.de>
In-Reply-To: <20220522170736.6582-1-colyli@suse.de>
References: <20220522170736.6582-1-colyli@suse.de>

Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be
multithreaded") makes bch_sectors_dirty_init() much faster when
counting dirty sectors by iterating all dirty keys in the btree. But it
isn't in ideal shape yet and can still be improved.

This patch makes the following changes to improve the current parallel
dirty key iteration on the btree:

- Add a read lock to the root node while multiple threads are iterating
  the btree, to prevent the root node from being split by I/O from
  other registered bcache devices.
- Remove the local variable "char name[32]" and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on the stack
  and avoid the unnecessary dynamic memory allocation for it.
- Increase state.started to count a created kernel thread only after
  the thread is successfully created.
- When waiting for all dirty key counting threads to finish, use
  wait_event() to replace wait_event_interruptible().

With the above changes, the code is clearer and some potential error
conditions are avoided.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/writeback.c | 62 ++++++++++++++---------------------
 1 file changed, 25 insertions(+), 37 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 9ee0005874cd..d24c09490f8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -948,10 +948,10 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 	struct btree_iter iter;
 	struct sectors_dirty_init op;
 	struct cache_set *c = d->c;
-	struct bch_dirty_init_state *state;
-	char name[32];
+	struct bch_dirty_init_state state;
 
 	/* Just count root keys if no leaf node */
+	rw_lock(0, c->root, c->root->level);
 	if (c->root->level == 0) {
 		bch_btree_op_init(&op.op, -1);
 		op.inode = d->id;
@@ -961,54 +961,42 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		for_each_key_filter(&c->root->keys, k, &iter,
 				    bch_ptr_invalid)
 			sectors_dirty_init_fn(&op.op, c->root, k);
+		rw_unlock(0, c->root);
 		return;
 	}
 
-	state = kzalloc(sizeof(struct bch_dirty_init_state), GFP_KERNEL);
-	if (!state) {
-		pr_warn("sectors dirty init failed: cannot allocate memory\n");
-		return;
-	}
-
-	state->c = c;
-	state->d = d;
-	state->total_threads = bch_btre_dirty_init_thread_nr();
-	state->key_idx = 0;
-	spin_lock_init(&state->idx_lock);
-	atomic_set(&state->started, 0);
-	atomic_set(&state->enough, 0);
-	init_waitqueue_head(&state->wait);
-
-	for (i = 0; i < state->total_threads; i++) {
-		/* Fetch latest state->enough earlier */
+	state.c = c;
+	state.d = d;
+	state.total_threads = bch_btre_dirty_init_thread_nr();
+	state.key_idx = 0;
+	spin_lock_init(&state.idx_lock);
+	atomic_set(&state.started, 0);
+	atomic_set(&state.enough, 0);
+	init_waitqueue_head(&state.wait);
+
+	for (i = 0; i < state.total_threads; i++) {
+		/* Fetch latest state.enough earlier */
 		smp_mb__before_atomic();
-		if (atomic_read(&state->enough))
+		if (atomic_read(&state.enough))
 			break;
 
-		state->infos[i].state = state;
-		atomic_inc(&state->started);
-		snprintf(name, sizeof(name), "bch_dirty_init[%d]", i);
-
-		state->infos[i].thread =
-			kthread_run(bch_dirty_init_thread,
-				    &state->infos[i],
-				    name);
-		if (IS_ERR(state->infos[i].thread)) {
+		state.infos[i].state = &state;
+		state.infos[i].thread =
+			kthread_run(bch_dirty_init_thread, &state.infos[i],
+				    "bch_dirtcnt[%d]", i);
+		if (IS_ERR(state.infos[i].thread)) {
 			pr_err("fails to run thread bch_dirty_init[%d]\n", i);
 			for (--i; i >= 0; i--)
-				kthread_stop(state->infos[i].thread);
+				kthread_stop(state.infos[i].thread);
 			goto out;
 		}
+		atomic_inc(&state.started);
 	}
 
-	/*
-	 * Must wait for all threads to stop.
-	 */
-	wait_event_interruptible(state->wait,
-				 atomic_read(&state->started) == 0);
-
 out:
-	kfree(state);
+	/* Must wait for all threads to stop. */
+	wait_event(state.wait, atomic_read(&state.started) == 0);
+	rw_unlock(0, c->root);
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
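Both this patch and patch 1 bracket the whole multithreaded scan with
rw_lock()/rw_unlock() on the btree root. The intent is classic
reader/writer exclusion. The sketch below restates it with a plain
rw_semaphore purely as an analogy: bcache's rw_lock() is its own
btree-node lock, and scan_tree()/split_root() are hypothetical names.

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(root_lock);

    /*
     * The dirty-init / btree-check path: many sibling worker threads
     * may run, but their caller holds the read side across the scan.
     */
    static void scan_tree(void)
    {
    	down_read(&root_lock);	/* root cannot be split while held */
    	/* ... spawn worker threads, iterate keys, wait for them ... */
    	up_read(&root_lock);
    }

    /*
     * I/O from another registered bcache device that wants to grow the
     * tree by one level must take the write side, so it waits until
     * the scan is finished.
     */
    static void split_root(void)
    {
    	down_write(&root_lock);
    	/* ... replace the root node ... */
    	up_write(&root_lock);
    }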
From patchwork Sun May 22 17:07:35 2022
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12858287
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org,
    Coly Li, stable@vger.kernel.org
Subject: [PATCH 3/4] bcache: remove incremental dirty sector counting for
 bch_sectors_dirty_init()
Date: Mon, 23 May 2022 01:07:35 +0800
Message-Id: <20220522170736.6582-4-colyli@suse.de>
In-Reply-To: <20220522170736.6582-1-colyli@suse.de>
References: <20220522170736.6582-1-colyli@suse.de>

After making bch_sectors_dirty_init() multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() no
longer releases its btree occupation after iterating 500000
(INIT_KEYS_EACH_TIME) bkeys. Because a read lock is now held on the
btree root node to prevent the btree from being split during the dirty
sector counting, other I/O requesters have no chance to gain the write
lock, even after restarting bcache_btree().

That is to say, the incremental dirty sector counting is incompatible
with the multithreaded bch_sectors_dirty_init(). We have to choose one
and drop the other.

In my testing, with 512 byte random writes, I generated 1.2T of dirty
data and a btree with 400K nodes. With a single thread and incremental
dirty sector counting, it takes 30+ minutes to register the backing
device. With multithreaded dirty sector counting, the backing device
registration can be accomplished within 2 minutes.

The difference between 30+ minutes and less than 2 minutes makes me
decide to keep the multithreaded bch_sectors_dirty_init() and drop the
incremental dirty sector counting. This is what this patch does.

But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU is
now released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This avoids the watchdog reporting a bogus soft lockup
warning.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/writeback.c | 41 +++++++++++------------------------
 1 file changed, 13 insertions(+), 28 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index d24c09490f8e..75b71199800d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -805,13 +805,11 @@ static int bch_writeback_thread(void *arg)
 /* Init */
 
 #define INIT_KEYS_EACH_TIME	500000
-#define INIT_KEYS_SLEEP_MS	100
 
 struct sectors_dirty_init {
 	struct btree_op	op;
 	unsigned int	inode;
 	size_t		count;
-	struct bkey	start;
 };
 
 static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -827,11 +825,8 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
 					     KEY_START(k), KEY_SIZE(k));
 
 	op->count++;
-	if (atomic_read(&b->c->search_inflight) &&
-	    !(op->count % INIT_KEYS_EACH_TIME)) {
-		bkey_copy_key(&op->start, k);
-		return -EAGAIN;
-	}
+	if (!(op->count % INIT_KEYS_EACH_TIME))
+		cond_resched();
 
 	return MAP_CONTINUE;
 }
@@ -846,24 +841,16 @@ static int bch_root_node_dirty_init(struct cache_set *c,
 	bch_btree_op_init(&op.op, -1);
 	op.inode = d->id;
 	op.count = 0;
-	op.start = KEY(op.inode, 0, 0);
-
-	do {
-		ret = bcache_btree(map_keys_recurse,
-				   k,
-				   c->root,
-				   &op.op,
-				   &op.start,
-				   sectors_dirty_init_fn,
-				   0);
-		if (ret == -EAGAIN)
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(INIT_KEYS_SLEEP_MS));
-		else if (ret < 0) {
-			pr_warn("sectors dirty init failed, ret=%d!\n", ret);
-			break;
-		}
-	} while (ret == -EAGAIN);
+
+	ret = bcache_btree(map_keys_recurse,
+			   k,
+			   c->root,
+			   &op.op,
+			   &KEY(op.inode, 0, 0),
+			   sectors_dirty_init_fn,
+			   0);
+	if (ret < 0)
+		pr_warn("sectors dirty init failed, ret=%d!\n", ret);
 
 	return ret;
 }
@@ -907,7 +894,6 @@ static int bch_dirty_init_thread(void *arg)
 				goto out;
 			}
 			skip_nr--;
-			cond_resched();
 		}
 
 		if (p) {
@@ -917,7 +903,6 @@ static int bch_dirty_init_thread(void *arg)
 
 		p = NULL;
 		prev_idx = cur_idx;
-		cond_resched();
 	}
 
 out:
@@ -956,11 +941,11 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		bch_btree_op_init(&op.op, -1);
 		op.inode = d->id;
 		op.count = 0;
-		op.start = KEY(op.inode, 0, 0);
 
 		for_each_key_filter(&c->root->keys, k, &iter,
 				    bch_ptr_invalid)
 			sectors_dirty_init_fn(&op.op, c->root, k);
 
+		rw_unlock(0, c->root);
 		return;
 	}
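The trade-off this patch settles is between two ways of being polite
during a very long key scan. A compact sketch of both, with
hypothetical names (visit_key_old()/visit_key_new()); only the constant
mirrors the patch, the rest is illustration:

    #include <linux/sched.h>

    #define KEYS_EACH_TIME	500000

    /*
     * Old scheme: bail out every N keys with -EAGAIN so the caller can
     * drop its btree position, sleep 100ms, and restart from a saved
     * key. With the root now read-locked for the entire scan, the
     * restart can no longer let writers in, so the bail-out buys
     * nothing and only slows registration down.
     */
    static int visit_key_old(unsigned long *count)
    {
    	if (!(++(*count) % KEYS_EACH_TIME))
    		return -EAGAIN;	/* caller: save position, sleep, retry */
    	return 0;
    }

    /*
     * New scheme: keep going, but yield the CPU every N keys so the
     * soft-lockup watchdog stays quiet on huge btrees.
     */
    static int visit_key_new(unsigned long *count)
    {
    	if (!(++(*count) % KEYS_EACH_TIME))
    		cond_resched();
    	return 0;
    }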
From patchwork Sun May 22 17:07:36 2022
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12858288
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org,
    Coly Li, Nikhil Kshirsagar, stable@vger.kernel.org
Subject: [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1
 journal bucket
Date: Mon, 23 May 2022 01:07:36 +0800
Message-Id: <20220522170736.6582-5-colyli@suse.de>
In-Reply-To: <20220522170736.6582-1-colyli@suse.de>
References: <20220522170736.6582-1-colyli@suse.de>

The journal no-space deadlock has been reported from time to time. Such
a deadlock can happen in the following situation.

When all journal buckets are fully filled by active jsets under a heavy
write I/O load, cache set registration (after a reboot) will load all
active jsets and insert them into the btree again (this is called
journal replay). If a journaled bkey is inserted into a btree node and
results in a btree node split, a new journal request might be
triggered. For example, if the btree grows one more level after the
node split, the root node record in the cache device super block will
be updated by bch_journal_meta() from bch_btree_set_root(). But there
is no space left in the journal buckets, so the journal replay has to
wait for a new journal bucket to be reclaimed after at least one
journal bucket is replayed. This is one example of how the journal
no-space deadlock happens.
The solution to avoid the deadlock is to reserve 1 journal bucket at
run time, and only permit the reserved journal bucket to be used during
the cache set registration procedure, for things like journal replay.
Then the journal space will never be fully filled, and there is no
chance for the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" to struct journal. It is
initialized to 0 (false) when struct journal is allocated, and set to 1
(true) by bch_journal_space_reserve() once all initialization is done
in run_cache_set().

At run time, when journal_reclaim() tries to allocate a new journal
bucket, free_journal_buckets() is called to check whether there are
enough free journal buckets to use. If there is only 1 free journal
bucket left and journal->do_reserve is 1 (true), the last bucket is
reserved and free_journal_buckets() will return 0 to indicate that
there is no free journal bucket. Then journal_reclaim() gives up and
tries again next time to see whether a free journal bucket can be
allocated. With this method, there is always 1 journal bucket reserved
at run time.

During cache set registration, journal->do_reserve is 0 (false), so the
reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar
Signed-off-by: Coly Li
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	return ret;
 }
 
+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */
 
 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
 	}
 }
 
+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;
 
@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
 	if (c->journal.blocks_free)
 		goto out;
 
-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;
 
-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);
 
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);
 
 #endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)
 
 	flash_devs_run(c);
 
+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
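The reservation arithmetic in free_journal_buckets() is easy to check
in isolation. Below is a user-space sketch of the same computation;
free_buckets() is a hypothetical stand-in for the kernel function, with
the ring geometry passed in as plain parameters. Worked example: with 8
journal buckets, cur_idx == 6 and discard_idx == 0, the ring holds 2
buckets between the write cursor and the discard cursor; at run time
(do_reserve true) that reports 0 free, so the last bucket stays in
reserve, while during replay (do_reserve false) the same geometry still
yields 1 usable bucket.

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Same ring arithmetic as free_journal_buckets(): nbuckets slots in
     * a ring, cur_idx chases discard_idx, and (1 + do_reserve) slots
     * are held back from the caller.
     */
    static unsigned int free_buckets(unsigned int nbuckets,
                                     unsigned int cur_idx,
                                     unsigned int discard_idx,
                                     bool do_reserve)
    {
    	unsigned int n;

    	/* Works even when nbuckets is not a power of 2 (no mask trick). */
    	if (cur_idx >= discard_idx)
    		n = nbuckets + discard_idx - cur_idx;	/* ring wrap */
    	else
    		n = discard_idx - cur_idx;

    	if (n > (1u + do_reserve))
    		return n - (1u + do_reserve);
    	return 0;	/* the last bucket(s) stay reserved */
    }

    int main(void)
    {
    	/* Run time: the last free bucket is held in reserve. */
    	printf("%u\n", free_buckets(8, 6, 0, true));	/* prints 0 */
    	/* Journal replay: the reserve may be spent. */
    	printf("%u\n", free_buckets(8, 6, 0, false));	/* prints 1 */
    	return 0;
    }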