Message ID | 8e334e4412410a46d3928950c796c9914cebdf92.1729537203.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Series | [v5] btrfs: try to search for data csums in commit root |
On Mon, Oct 21, 2024 at 12:01:53PM -0700, Boris Burkov wrote:
> If you run a workload like:
> - a cgroup that does tons of data reading, with a harsh memory limit
> - a second cgroup that tries to write new files
[...]
> Changelog:
> v5:
> - static inline flag functions
> - make bbio const for the getter
> - move around and improve the comments

v5 looks good, thanks.

[...]
> +	/*
> +	 * If we are searching for a csum of an extent from a past
> +	 * transaction, we can search in the commit root and reduce
> +	 * lock contention on the csum tree root node's extent buffer.
> +	 *
> +	 * This is important because that lock is an rwswem which gets
> +	 * pretty heavy write load, unlike the commit_root_csum.
                                             ^^^^^^^^^^^^^^^^

commit_root_sem?

> +	 *
> +	 * Due to how rwsem is implemented, there is a possible
> +	 * priority inversion where the readers holding the lock don't
> +	 * get scheduled (say they're in a cgroup stuck in heavy reclaim)
> +	 * which then blocks writers, including transaction commit. By
> +	 * using a semaphore with fewer writers (only a commit switching
> +	 * the roots), we make this issue less likely.
> +	 *
> +	 * Note that we don't rely on btrfs_search_slot to lock the
> +	 * commit root csum. We call search_slot multiple times, which would
                      sem
> +	 * create a potential race where a commit comes in between searches
> +	 * while we are not holding the commit_root_csum, and we get csums
                                                 sem

Only typos so you can fix it once you add it to for-next, also please
reflow the comments so they fit 80 columns per line.

Reviewed-by: David Sterba <dsterba@suse.com>
On Mon, Oct 21, 2024 at 09:10:35PM +0200, David Sterba wrote:
> On Mon, Oct 21, 2024 at 12:01:53PM -0700, Boris Burkov wrote:
> > If you run a workload like:
> > - a cgroup that does tons of data reading, with a harsh memory limit
> > - a second cgroup that tries to write new files
[...]
> > +	 * This is important because that lock is an rwswem which gets
> > +	 * pretty heavy write load, unlike the commit_root_csum.
>                                              ^^^^^^^^^^^^^^^^
>
> commit_root_sem?
>
[...]
> Only typos so you can fix it once you add it to for-next, also please
> reflow the comments so they fit 80 columns per line.

Oops, my bad on the typos. I could not find any comment over 80 columns,
so I pushed it as-is. Let me know what I missed and I am happy to do the
fixup...

> Reviewed-by: David Sterba <dsterba@suse.com>
On Mon, Oct 21, 2024 at 12:52:23PM -0700, Boris Burkov wrote:
> On Mon, Oct 21, 2024 at 09:10:35PM +0200, David Sterba wrote:
> > On Mon, Oct 21, 2024 at 12:01:53PM -0700, Boris Burkov wrote:
> > > +	 * pretty heavy write load, unlike the commit_root_csum.
> >                                            ^^^^^^^^^^^^^^^^
> >
> > commit_root_sem?
[...]
> > Only typos so you can fix it once you add it to for-next, also please
> > reflow the comments so they fit 80 columns per line.
>
> Oops, my bad on the typos. I could not find any comment over 80 columns,
> so I pushed it as-is. Let me know what I missed and I am happy to do the
> fixup...

Not over 80 but too much under 80, I'll fix it in for-next.
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 1f216d07eff6..4d675f6dd3e9 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -482,6 +482,14 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 static void btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
                              struct btrfs_io_stripe *smap, int mirror_num)
 {
+        /*
+         * It is important to clear the bits we used in bio->bi_flags.
+         * Because bio->bi_flags belongs to the block layer, we should
+         * avoid leaving stray bits set when we transfer ownership of
+         * the bio by submitting it.
+         */
+        btrfs_bio_clear_csum_search_commit_root(btrfs_bio(bio));
+
         if (!bioc) {
                 /* Single mirror read/write fast path. */
                 btrfs_bio(bio)->mirror_num = mirror_num;
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e2fe16074ad6..8915863d0d0f 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -104,6 +104,23 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
                                   btrfs_bio_end_io_t end_io, void *private);
 void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
 
+#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT (1U << (BIO_FLAG_LAST + 1))
+
+static inline void btrfs_bio_set_csum_search_commit_root(struct btrfs_bio *bbio)
+{
+        bbio->bio.bi_flags |= BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
+static inline void btrfs_bio_clear_csum_search_commit_root(struct btrfs_bio *bbio)
+{
+        bbio->bio.bi_flags &= ~BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
+static inline bool btrfs_bio_csum_search_commit_root(const struct btrfs_bio *bbio)
+{
+        return bbio->bio.bi_flags & BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
+}
+
 /* Submit using blkcg_punt_bio_submit. */
 #define REQ_BTRFS_CGROUP_PUNT REQ_FS_PRIVATE
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1beaba232532..bdd7673d989c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -99,6 +99,15 @@ struct btrfs_bio_ctrl {
         enum btrfs_compression_type compress_type;
         u32 len_to_oe_boundary;
         blk_opf_t opf;
+        /*
+         * If this is a data read bio, we set this to true if the extent is
+         * from a past transaction and thus it is safe to search for csums
+         * in the commit root. Otherwise, we set it to false.
+         *
+         * See the comment in btrfs_lookup_bio_sums for more detail on the
+         * need for this optimization.
+         */
+        bool csum_search_commit_root;
         btrfs_bio_end_io_t end_io_func;
         struct writeback_control *wbc;
 
@@ -771,6 +780,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
                         alloc_new_bio(inode, bio_ctrl, disk_bytenr,
                                       folio_pos(folio) + pg_offset);
                 }
+                if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ &&
+                    is_data_inode(inode) && bio_ctrl->csum_search_commit_root)
+                        btrfs_bio_set_csum_search_commit_root(bio_ctrl->bbio);
 
                 /* Cap to the current ordered extent boundary if there is one. */
                 if (len > bio_ctrl->len_to_oe_boundary) {
@@ -1049,6 +1061,14 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
                 if (prev_em_start)
                         *prev_em_start = em->start;
 
+                /*
+                 * At this point, we know it is safe to search for
+                 * csums in the commit root, but we have not yet
+                 * allocated a bio, so stash it in bio_ctrl.
+                 */
+                if (em->generation < btrfs_get_fs_generation(fs_info))
+                        bio_ctrl->csum_search_commit_root = true;
+
                 free_extent_map(em);
                 em = NULL;
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 886749b39672..cd63769959f9 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -401,6 +401,33 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
                 path->skip_locking = 1;
         }
 
+        /*
+         * If we are searching for a csum of an extent from a past
+         * transaction, we can search in the commit root and reduce
+         * lock contention on the csum tree root node's extent buffer.
+         *
+         * This is important because that lock is an rwswem which gets
+         * pretty heavy write load, unlike the commit_root_csum.
+         *
+         * Due to how rwsem is implemented, there is a possible
+         * priority inversion where the readers holding the lock don't
+         * get scheduled (say they're in a cgroup stuck in heavy reclaim)
+         * which then blocks writers, including transaction commit. By
+         * using a semaphore with fewer writers (only a commit switching
+         * the roots), we make this issue less likely.
+         *
+         * Note that we don't rely on btrfs_search_slot to lock the
+         * commit root csum. We call search_slot multiple times, which would
+         * create a potential race where a commit comes in between searches
+         * while we are not holding the commit_root_csum, and we get csums
+         * from across transactions.
+         */
+        if (btrfs_bio_csum_search_commit_root(bbio)) {
+                path->search_commit_root = 1;
+                path->skip_locking = 1;
+                down_read(&fs_info->commit_root_sem);
+        }
+
         while (bio_offset < orig_len) {
                 int count;
                 u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
@@ -446,6 +473,9 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
                 bio_offset += count * sectorsize;
         }
 
+        if (btrfs_bio_csum_search_commit_root(bbio))
+                up_read(&fs_info->commit_root_sem);
+
         btrfs_free_path(path);
         return ret;
 }
If you run a workload like:
- a cgroup that does tons of data reading, with a harsh memory limit
- a second cgroup that tries to write new files
e.g.:
https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh

then what quickly occurs is:
- a high degree of contention on the csum root node eb rwsem
- memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.

This results in VERY long transactions. On my test system, that script
produces 20-30s long transaction commits. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

This reproducer is a bit silly, as the villainous cgroup is using almost
all of its memory.max for kernel memory (specifically pagetables), but it
sort of doesn't matter, as I am most interested in the btrfs locking
behavior. It isn't an academic problem, as we see this exact problem in
production at Meta with one cgroup over memory.max ruining btrfs
performance for the whole system.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known. e.g.
https://lpc.events/event/18/contributions/1883/

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems. In the case of the csum tree, we can
either redesign btree locking (hard...) or try to use the commit root
when we can. Luckily, it seems likely that many reads are for old extents
written many transactions ago, and that for those we *can* in fact
search the commit root!

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. Luckily, we don't need this flag in the bio after
submitting, so we can save space by setting it on bbio->bio.bi_flags
and clear before submitting, so the block layer is unaffected.

When we go to lookup the csums in lookup_bio_sums we can check this
condition on the btrfs_bio and do the commit root lookup accordingly.

With the fix, on that same test case, commit latencies no longer exceed
~400ms on my system.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Changelog:
v5:
- static inline flag functions
- make bbio const for the getter
- move around and improve the comments
v4:
- replace generic private flag machinery with specific function for the
  one flag
- move the bio_ctrl field to take advantage of alignment
v3:
- add some simple machinery for setting/getting/clearing btrfs private
  flags in bi_flags
- clear those flags before bio_submit (ensure no-op wrt block layer)
- store the csum commit root flag there to save space
v2:
- hold the commit_root_sem for the duration of the entire lookup, not
  just per btrfs_search_slot. Note that we can't naively do the thing
  where we release the lock every loop as that is exactly what we are
  trying to avoid. Theoretically, we could re-grab the lock and fully
  start over if the lock is write contended or something. I suspect the
  rwsem fairness features will let the commit writer get it fast enough
  anyway.

---
 fs/btrfs/bio.c       |  8 ++++++++
 fs/btrfs/bio.h       | 17 +++++++++++++++++
 fs/btrfs/extent_io.c | 20 ++++++++++++++++++++
 fs/btrfs/file-item.c | 30 ++++++++++++++++++++++++++++++
 4 files changed, 75 insertions(+)
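To condense the flow the patch wires up (extent map generation check,
a private bit stashed above the block layer's flags, the commit-root
csum lookup, and clearing the bit before the bio leaves btrfs), here is
a minimal stand-alone sketch. All toy_*/TOY_* names and the generation
values are made up for illustration and are not the kernel API.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for kernel state; names are illustrative only. */
#define TOY_BIO_FLAG_LAST            4
#define TOY_CSUM_SEARCH_COMMIT_ROOT  (1U << (TOY_BIO_FLAG_LAST + 1))

struct toy_bio { unsigned int bi_flags; };

static unsigned long long fs_generation = 100;  /* current transaction */

/* Decision made while building the bio: old extents can use the commit root. */
static bool extent_is_old(unsigned long long em_generation)
{
        return em_generation < fs_generation;
}

static void lookup_csums(struct toy_bio *bio)
{
        if (bio->bi_flags & TOY_CSUM_SEARCH_COMMIT_ROOT)
                printf("search commit root (read-lock the commit root sem)\n");
        else
                printf("search current csum tree (contended eb rwsem)\n");
}

int main(void)
{
        struct toy_bio bio = { .bi_flags = 0 };

        /* Reading an extent written in an earlier transaction. */
        if (extent_is_old(42))
                bio.bi_flags |= TOY_CSUM_SEARCH_COMMIT_ROOT;

        lookup_csums(&bio);

        /* The private bit must be cleared before the block layer sees the bio. */
        bio.bi_flags &= ~TOY_CSUM_SEARCH_COMMIT_ROOT;
        return 0;
}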