Message ID | 20220521170502.20026-4-colyli@suse.de (mailing list archive) |
---|---|
State | New, archived |
Series | [1/4] bcache: improve multithreaded bch_btree_check() |
On 21 May 2022, Coly Li spake thusly:

> When all journal buckets are fully filled by active jset with heavy
> write I/O load, the cache set registration (after a reboot) will load
> all active jsets and inserting them into the btree again (which is
> called journal replay). If a journaled bkey is inserted into a btree
> node and results btree node split, new journal request might be
> triggered. For example, the btree grows one more level after the node
> split, then the root node record in cache device super block will be
> upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
> space in journal buckets, the journal replay has to wait for new journal
> bucket to be reclaimed after at least one journal bucket replayed. This
> is one example that how the journal no-space deadlock happens.
>
> The solution to avoid the deadlock is to reserve 1 journal bucket in

It seems to me that this could happen more than once in a single journal
replay (multiple nodes might be split, etc). Is one bucket actually
always enough, or is it merely enough nearly all the time?
> On 9 June 2022, at 04:45, Nix <nix@esperi.org.uk> wrote:
>
> On 21 May 2022, Coly Li spake thusly:
>
>> When all journal buckets are fully filled by active jset with heavy
>> write I/O load, the cache set registration (after a reboot) will load
>> all active jsets and inserting them into the btree again (which is
>> called journal replay). If a journaled bkey is inserted into a btree
>> node and results btree node split, new journal request might be
>> triggered. For example, the btree grows one more level after the node
>> split, then the root node record in cache device super block will be
>> upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
>> space in journal buckets, the journal replay has to wait for new journal
>> bucket to be reclaimed after at least one journal bucket replayed. This
>> is one example that how the journal no-space deadlock happens.
>>
>> The solution to avoid the deadlock is to reserve 1 journal bucket in
>
> It seems to me that this could happen more than once in a single journal
> replay (multiple nodes might be split, etc). Is one bucket actually
> always enough, or is it merely enough nearly all the time?

It is possible for multiple leaf nodes to split during journal replay,
but bch_journal_meta() only gets called when the root node is updated.
The new bkey that a split node inserts into the root node does not go
into the journal, because the journal only records bkeys inserted into
leaf nodes. Only when a btree node split causes the root node to split
does the new root node location (bkey) have to be recorded in a journal
set.

Therefore, almost all the time the btree root node splits only once
during journal replay. It is very rare for the oldest journal entry to
remain unreplayed between two root node splits (that would take a very
large number of inserted bkeys); this is almost impossible in practice.
So reserving 8K of journal space is indeed enough for the no-space
deadlock situation.
The default bucket size is much larger than 8K, so we don’t have to
worry about the reserved journal space being exhausted, even with a much
larger number of journal buckets.

Indeed, my initial effort was to reserve 8K of space within a journal
bucket if the bucket size > 8KB. But too many places needed care, the
logic of the patch was complicated, and the total change set was 200+
lines. I found that if I reserve a whole bucket, the change set is only
30+ lines. So in the end I decided to reserve a whole journal bucket,
because the change is much simpler and easier to understand.

Coly Li
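To put a rough number on the headroom described above, here is a minimal arithmetic sketch. The 8K figure is taken from the reply; the helper name `root_updates_per_bucket()` and the 512 KiB bucket size mentioned below are illustrative assumptions, not values read from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* 8K of journal space per root update, the figure quoted in the reply. */
#define ROOT_UPDATE_BYTES ((size_t)8 * 1024)

/*
 * Hypothetical helper: how many 8K journal writes fit into one reserved
 * journal bucket of the given size.
 */
static size_t root_updates_per_bucket(size_t bucket_bytes)
{
	assert(bucket_bytes > 0);

	return bucket_bytes / ROOT_UPDATE_BYTES;
}
```

Under these assumptions a 512 KiB bucket leaves room for 64 such writes, whereas the argument in the reply needs only one of them between two reclaims.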
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	return ret;
 }
 
+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */
 
 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
 	}
 }
 
+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;
 
@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
 	if (c->journal.blocks_free)
 		goto out;
 
-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;
 
-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);
 
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);
 
 #endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)
 
 	flash_devs_run(c);
 
+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
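One way to sanity-check the hunk in journal_reclaim() is to compare the removed "next == discard_idx" test with the new free-bucket count in a userspace model. The sketch below copies the arithmetic from the diff; the plain-integer parameters and the names `old_has_space()`, `new_free_buckets()` and `count_mismatches()` are hypothetical stand-ins for the kernel structures, and it assumes cur_idx and discard_idx are always valid indexes into the ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Pre-patch check from journal_reclaim(): is the next bucket usable? */
static bool old_has_space(unsigned int nbuckets, unsigned int cur_idx,
			  unsigned int discard_idx)
{
	unsigned int next = (cur_idx + 1) % nbuckets;

	return next != discard_idx;
}

/* Same arithmetic as free_journal_buckets() in the patch. */
static unsigned int new_free_buckets(unsigned int nbuckets,
				     unsigned int cur_idx,
				     unsigned int discard_idx,
				     bool do_reserve)
{
	unsigned int n;

	assert(nbuckets > 0 && cur_idx < nbuckets && discard_idx < nbuckets);

	if (cur_idx >= discard_idx)
		n = nbuckets + discard_idx - cur_idx;
	else
		n = discard_idx - cur_idx;

	if (n > 1u + do_reserve)
		return n - (1u + do_reserve);

	return 0;
}

/*
 * Count the (cur_idx, discard_idx) pairs for which the old and new checks
 * disagree while do_reserve is false, i.e. during registration.
 */
static unsigned int count_mismatches(unsigned int nbuckets)
{
	unsigned int cur, dis, bad = 0;

	for (cur = 0; cur < nbuckets; cur++)
		for (dis = 0; dis < nbuckets; dis++)
			if (old_has_space(nbuckets, cur, dis) !=
			    (new_free_buckets(nbuckets, cur, dis, false) > 0))
				bad++;

	return bad;
}
```

In this model count_mismatches() comes out as zero for every ring size tried, which suggests that with do_reserve clear the new check matches the old one, and that setting do_reserve only shifts the "no space" threshold by one bucket.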
The journal no-space deadlock has been reported from time to time. Such
a deadlock can happen in the following situation.

When all journal buckets are fully filled by active jsets under heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and insert them into the btree again (which is called
journal replay). If a journaled bkey is inserted into a btree node and
results in a btree node split, a new journal request might be triggered.
For example, if the btree grows one more level after the node split, the
root node record in the cache device super block will be updated by
bch_journal_meta() from bch_btree_set_root(). But there is no space in
the journal buckets, and the journal replay has to wait for a new
journal bucket to be reclaimed after at least one journal bucket has
been replayed. This is one example of how the journal no-space deadlock
happens.

The solution to avoid the deadlock is to reserve 1 journal bucket at run
time, and only permit the reserved journal bucket to be used during the
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, and there is no chance for
the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" to struct journal. It is
initialized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization is done
in run_cache_set(). At run time, when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
that there is no free journal bucket. Then journal_reclaim() will give
up, and try next time to see whether there is a free journal bucket to
allocate. By this method, there is always 1 journal bucket reserved at
run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)
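The scenario in the commit message can be sketched as a small userspace simulation: fill the journal ring at run time with no reclaim, then clear do_reserve as a freshly allocated struct journal would have it after a reboot, and see whether replay can still claim a bucket. The `struct ring` type and the functions `ring_free()`, `ring_advance()` and `replay_can_allocate()` are hypothetical; only the free-bucket arithmetic is copied from the patch, and reclaim during replay is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the journal bucket ring. */
struct ring {
	unsigned int nbuckets;    /* models ca->sb.njournal_buckets */
	unsigned int cur_idx;     /* bucket currently being written */
	unsigned int discard_idx; /* oldest bucket not yet reclaimed */
	bool do_reserve;          /* true once the cache set is running */
};

/* Same arithmetic as free_journal_buckets() in the patch. */
static unsigned int ring_free(const struct ring *r)
{
	unsigned int n;

	if (r->cur_idx >= r->discard_idx)
		n = r->nbuckets + r->discard_idx - r->cur_idx;
	else
		n = r->discard_idx - r->cur_idx;

	if (n > 1u + r->do_reserve)
		return n - (1u + r->do_reserve);

	return 0;
}

/* Models the tail of journal_reclaim(): move to the next bucket if allowed. */
static bool ring_advance(struct ring *r)
{
	if (!ring_free(r))
		return false;

	r->cur_idx = (r->cur_idx + 1) % r->nbuckets;
	return true;
}

/*
 * Fill the ring at run time without any reclaim, then switch to
 * registration mode (do_reserve == false) and report whether journal
 * replay could still get one more bucket.
 */
static bool replay_can_allocate(unsigned int nbuckets, bool reserve_at_runtime)
{
	struct ring r = { nbuckets, 0, 0, reserve_at_runtime };

	assert(nbuckets > 0);

	while (ring_advance(&r))
		;

	r.do_reserve = false; /* struct journal is freshly allocated after reboot */
	return ring_advance(&r);
}
```

In this model, replay_can_allocate(8, true) succeeds while replay_can_allocate(8, false) fails, which mirrors the deadlock that the reserved bucket is meant to prevent.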