Message ID: 20230314165910.373347-4-hch@lst.de (mailing list archive)
State: New, archived
Series: [01/10] btrfs: use a plain workqueue for ordered_extent processing
On 14.03.23 17:59, Christoph Hellwig wrote:
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
> +		if (btrfs_is_free_space_inode(bbio->inode))
> +			return fs_info->endio_freespace_worker;
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_write_meta_workers;
> +		return fs_info->endio_write_workers;
> +	} else {
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_meta_workers;
> +		return fs_info->endio_workers;
> +	}

Can't that else be dropped as well?
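For reference, a minimal sketch of what the reviewer is suggesting: the write branch returns on every path, so the else can be dropped without changing behavior (this assumes the helper is otherwise unchanged from the posted patch):

static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
{
	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;

	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
		if (btrfs_is_free_space_inode(bbio->inode))
			return fs_info->endio_freespace_worker;
		if (bbio->bio.bi_opf & REQ_META)
			return fs_info->endio_write_meta_workers;
		return fs_info->endio_write_workers;
	}

	/* Read completions: the write branch above always returned. */
	if (bbio->bio.bi_opf & REQ_META)
		return fs_info->endio_meta_workers;
	return fs_info->endio_workers;
}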
On 2023/3/15 00:59, Christoph Hellwig wrote:
> Currently all reads are offloaded to workqueues for checksum
> verification. Writes are initially processed in the bi_end_io
> context that usually happens in hard or soft interrupts, and are
> only offloaded to a workqueue for ordered extent completion or
> compressed extent handling, and in the future for raid stripe
> tree handling.
>
> Switch to offloading all writes to workqueues instead of doing that
> only for the final ordered extent processing. This means that
> there are slightly more workqueue offloads for large writes as each
> submitted btrfs_bio can only hold 1MiB worth of data on systems
> with a 4K page size. On the other hand this will allow large
> simplifications by always having a process context.
>
> For metadata writes there was no offload at all before, which
> creates various problems like not being able to drop the final
> extent_buffer reference from the end_io handler.
>
> With the entire series applied this shows no performance differences
> on data heavy workloads, while for metadata heavy workloads like
> Filipe's fs_mark there also is no change in averages, while it
> seems to somewhat reduce the outliers in terms of best and worst
> performance. This is probably due to the removal of the irq
> disabling in the next patches.

I'm uncertain what the proper or better solution for handling the endio
functions is.

Sure, they are called in a very strict context, so we should keep them
short.
But on the other hand, we already have too many workqueues, and I keep
wondering under what circumstances they can lead to deadlocks.
(E.g. why do we need to queue endios for free space and regular data
inodes into different workqueues?)

My current approach is to always assume a workqueue has only 1
max_active, but even then I'm not sure what happens if the only active
work item sleeps. Would the workqueue be able to pick the next work
item, or is it stalled until the only active one wakes up?

Thus I'm uncertain whether we're going the right way.

Personally, I'd prefer to keep the btrfs bio endio calls in the old
soft/hard irq context and let the higher layer queue the work.
However, we have already loosened the endio context for btrfs bios
from the old soft/hard irq context to the current workqueue context...
> > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/btrfs/bio.c | 74 ++++++++++++++++++++++++++++------------- > fs/btrfs/disk-io.c | 5 +++ > fs/btrfs/fs.h | 1 + > fs/btrfs/ordered-data.c | 17 +--------- > fs/btrfs/ordered-data.h | 2 -- > 5 files changed, 57 insertions(+), 42 deletions(-) > > diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c > index 85539964864a34..037eded3b33ba2 100644 > --- a/fs/btrfs/bio.c > +++ b/fs/btrfs/bio.c > @@ -300,20 +300,33 @@ static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio) > { > struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info; > > - if (bbio->bio.bi_opf & REQ_META) > - return fs_info->endio_meta_workers; > - return fs_info->endio_workers; > + if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) { > + if (btrfs_is_free_space_inode(bbio->inode)) > + return fs_info->endio_freespace_worker; > + if (bbio->bio.bi_opf & REQ_META) > + return fs_info->endio_write_meta_workers; > + return fs_info->endio_write_workers; > + } else { > + if (bbio->bio.bi_opf & REQ_META) > + return fs_info->endio_meta_workers; > + return fs_info->endio_workers; > + } > } > > -static void btrfs_end_bio_work(struct work_struct *work) > +static void btrfs_simple_end_bio_work(struct work_struct *work) > { > struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work); > + struct bio *bio = &bbio->bio; > + struct btrfs_device *dev = bio->bi_private; > > /* Metadata reads are checked and repaired by the submitter. */ > - if (bbio->bio.bi_opf & REQ_META) > - bbio->end_io(bbio); > - else > - btrfs_check_read_bio(bbio, bbio->bio.bi_private); > + if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) { > + btrfs_check_read_bio(bbio, dev); > + } else { > + if (bio_op(bio) == REQ_OP_ZONE_APPEND) > + btrfs_record_physical_zoned(bbio); > + btrfs_orig_bbio_end_io(bbio); > + } > } > > static void btrfs_simple_end_io(struct bio *bio) > @@ -322,18 +335,19 @@ static void btrfs_simple_end_io(struct bio *bio) > struct btrfs_device *dev = bio->bi_private; > > btrfs_bio_counter_dec(bbio->inode->root->fs_info); > - > if (bio->bi_status) > btrfs_log_dev_io_error(bio, dev); > > - if (bio_op(bio) == REQ_OP_READ) { > - INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work); > - queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); > - } else { > - if (bio_op(bio) == REQ_OP_ZONE_APPEND) > - btrfs_record_physical_zoned(bbio); > - btrfs_orig_bbio_end_io(bbio); > - } I'm already seeing in the future, there would be extra if checks to skip the work queue for scrub bios... > + INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work); > + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); > +} > + > +static void btrfs_write_end_bio_work(struct work_struct *work) > +{ > + struct btrfs_bio *bbio = > + container_of(work, struct btrfs_bio, end_io_work); > + > + btrfs_orig_bbio_end_io(bbio); > } > > static void btrfs_raid56_end_io(struct bio *bio) > @@ -343,12 +357,23 @@ static void btrfs_raid56_end_io(struct bio *bio) > > btrfs_bio_counter_dec(bioc->fs_info); > bbio->mirror_num = bioc->mirror_num; > - if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META)) > - btrfs_check_read_bio(bbio, NULL); > - else > - btrfs_orig_bbio_end_io(bbio); > - > btrfs_put_bioc(bioc); > + > + /* > + * The RAID5/6 code already completes all bios from workqueues, but for > + * writs we need to offload them again to avoid deadlocks in the ordered > + * extent processing. > + */ Mind to share some details about the possible deadlock here? 
To me, it's just a different workqueue doing the finish ordered io workload. Thanks, Qu > + if (bio_op(bio) == REQ_OP_READ) { > + /* Metadata reads are checked and repaired by the submitter. */ > + if (!(bio->bi_opf & REQ_META)) > + btrfs_check_read_bio(bbio, NULL); > + else > + btrfs_orig_bbio_end_io(bbio); > + } else { > + INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work); > + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); > + } > } > > static void btrfs_orig_write_end_io(struct bio *bio) > @@ -372,9 +397,10 @@ static void btrfs_orig_write_end_io(struct bio *bio) > bio->bi_status = BLK_STS_IOERR; > else > bio->bi_status = BLK_STS_OK; > - > - btrfs_orig_bbio_end_io(bbio); > btrfs_put_bioc(bioc); > + > + INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work); > + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); > } > > static void btrfs_clone_write_end_io(struct bio *bio) > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 0889eb81e71a7d..5b63a5161cedea 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -2008,6 +2008,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) > * the queues used for metadata I/O, since tasks from those other work > * queues can do metadata I/O operations. > */ > + if (fs_info->endio_write_meta_workers) > + destroy_workqueue(fs_info->endio_write_meta_workers); > if (fs_info->endio_meta_workers) > destroy_workqueue(fs_info->endio_meta_workers); > } > @@ -2204,6 +2206,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info) > alloc_workqueue("btrfs-endio", flags, max_active); > fs_info->endio_meta_workers = > alloc_workqueue("btrfs-endio-meta", flags, max_active); > + fs_info->endio_write_meta_workers = > + alloc_workqueue("btrfs-endio-write-meta", flags, max_active); > fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active); > fs_info->endio_write_workers = > alloc_workqueue("btrfs-endio-write", flags, max_active); > @@ -2224,6 +2228,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info) > fs_info->endio_workers && fs_info->endio_meta_workers && > fs_info->compressed_write_workers && > fs_info->endio_write_workers && > + fs_info->endio_write_meta_workers && > fs_info->endio_freespace_worker && fs_info->rmw_workers && > fs_info->caching_workers && fs_info->fixup_workers && > fs_info->delayed_workers && fs_info->qgroup_rescan_workers && > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h > index 276a17780f2b1b..e2643bc5c039ad 100644 > --- a/fs/btrfs/fs.h > +++ b/fs/btrfs/fs.h > @@ -543,6 +543,7 @@ struct btrfs_fs_info { > struct workqueue_struct *rmw_workers; > struct workqueue_struct *compressed_write_workers; > struct workqueue_struct *endio_write_workers; > + struct workqueue_struct *endio_write_meta_workers; > struct workqueue_struct *endio_freespace_worker; > struct btrfs_workqueue *caching_workers; > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c > index 23f496f0d7b776..48c7510df80a0d 100644 > --- a/fs/btrfs/ordered-data.c > +++ b/fs/btrfs/ordered-data.c > @@ -303,14 +303,6 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, > spin_unlock_irq(&tree->lock); > } > > -static void finish_ordered_fn(struct work_struct *work) > -{ > - struct btrfs_ordered_extent *ordered_extent; > - > - ordered_extent = container_of(work, struct btrfs_ordered_extent, work); > - btrfs_finish_ordered_io(ordered_extent); > -} > - > /* > * Mark all ordered extents io inside the specified range finished. 
> * > @@ -330,17 +322,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, > { > struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree; > struct btrfs_fs_info *fs_info = inode->root->fs_info; > - struct workqueue_struct *wq; > struct rb_node *node; > struct btrfs_ordered_extent *entry = NULL; > unsigned long flags; > u64 cur = file_offset; > > - if (btrfs_is_free_space_inode(inode)) > - wq = fs_info->endio_freespace_worker; > - else > - wq = fs_info->endio_write_workers; > - > if (page) > ASSERT(page->mapping && page_offset(page) <= file_offset && > file_offset + num_bytes <= page_offset(page) + PAGE_SIZE); > @@ -439,8 +425,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, > refcount_inc(&entry->refs); > trace_btrfs_ordered_extent_mark_finished(inode, entry); > spin_unlock_irqrestore(&tree->lock, flags); > - INIT_WORK(&entry->work, finish_ordered_fn); > - queue_work(wq, &entry->work); > + btrfs_finish_ordered_io(entry); > spin_lock_irqsave(&tree->lock, flags); > } > cur += len; > diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h > index b8a92f040458f0..64a37d42e7a24a 100644 > --- a/fs/btrfs/ordered-data.h > +++ b/fs/btrfs/ordered-data.h > @@ -146,8 +146,6 @@ struct btrfs_ordered_extent { > /* a per root list of all the pending ordered extents */ > struct list_head root_extent_list; > > - struct work_struct work; > - > struct completion completion; > struct btrfs_work flush_work; > struct list_head work_list;
On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
> Sure, they are called in a very strict context, so we should keep them
> short.
> But on the other hand, we already have too many workqueues, and I keep
> wondering under what circumstances they can lead to deadlocks.
> (E.g. why do we need to queue endios for free space and regular data
> inodes into different workqueues?)

In general the reason for separate workqueues is that one workqueue
depends on the execution of another. It seems like this is Josef's
area, but my impression is that finishing an ordered extent can cause
writeback and a possible wait for data in the freespace inode. Normally
such workqueue splits should have comments in the code to explain them,
but so far I haven't found one.

> My current approach is to always assume a workqueue has only 1
> max_active, but even then I'm not sure what happens if the only active
> work item sleeps.

That's my understanding of the workqueue mechanism, yes.

> Would the workqueue be able to pick the next work item, or is it
> stalled until the only active one wakes up?

I think it is stalled. That's why the workqueue code heavily
discourages limiting max_active unless you have a good reason to, and
most callers follow that advice.

> Personally, I'd prefer to keep the btrfs bio endio calls in the old
> soft/hard irq context and let the higher layer queue the work.

Can you explain why?

> However, we have already loosened the endio context for btrfs bios
> from the old soft/hard irq context to the current workqueue context...

Read I/O completions are already executed in a workqueue. For write
completions there also are various offloads, but none that is
consistent and dependable so far.
On 2023/3/20 20:30, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
>> Sure, they are called in a very strict context, so we should keep them
>> short.
>> But on the other hand, we already have too many workqueues, and I keep
>> wondering under what circumstances they can lead to deadlocks.
>> (E.g. why do we need to queue endios for free space and regular data
>> inodes into different workqueues?)
>
> In general the reason for separate workqueues is that one workqueue
> depends on the execution of another. It seems like this is Josef's
> area, but my impression is that finishing an ordered extent can cause
> writeback and a possible wait for data in the freespace inode.
> Normally such workqueue splits should have comments in the code to
> explain them, but so far I haven't found one.
>
>> My current approach is to always assume a workqueue has only 1
>> max_active, but even then I'm not sure what happens if the only active
>> work item sleeps.
>
> That's my understanding of the workqueue mechanism, yes.
>
>> Would the workqueue be able to pick the next work item, or is it
>> stalled until the only active one wakes up?
>
> I think it is stalled. That's why the workqueue code heavily
> discourages limiting max_active unless you have a good reason to, and
> most callers follow that advice.

To me, this looks more like hiding a bug in the workqueue user.
Shouldn't we expose such bugs instead of hiding them?

Furthermore, although I'm a big fan of plain workqueues, the old btrfs
workqueues allowed us to apply max_active limits to them, but now we no
longer have that ability.

Thus at the very least we should bring back the old behavior, and
possibly fix whatever bugs there are in our workqueue usage.

>> Personally, I'd prefer to keep the btrfs bio endio calls in the old
>> soft/hard irq context and let the higher layer queue the work.
>
> Can you explain why?

Just to keep the context consistent.

Imagine a situation where we put sleeping functions into an endio
function without a second thought, fully relying on the fact that the
endio function is executed in workqueue context. Otherwise we have to
add extra comments explaining the situation every time we do something
like this.

Thanks,
Qu

>> However, we have already loosened the endio context for btrfs bios
>> from the old soft/hard irq context to the current workqueue context...
>
> Read I/O completions are already executed in a workqueue. For write
> completions there also are various offloads, but none that is
> consistent and dependable so far.
On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>> I think it is stalled. That's why the workqueue code heavily
>> discourages limiting max_active unless you have a good reason to, and
>> most callers follow that advice.
>
> To me, this looks more like hiding a bug in the workqueue user.
> Shouldn't we expose such bugs instead of hiding them?

If you limit max_active to a certain value, you clearly tell the
workqueue code not to use more workers than that. That is what the
argument is for.

I don't see where there is a bug, or that someone is hiding it.

> Furthermore, although I'm a big fan of plain workqueues, the old btrfs
> workqueues allowed us to apply max_active limits to them, but now we no
> longer have that ability.

You can pass max_active to plain Linux workqueues and there is a bunch
of places that do that, although the vast majority passes either 0 to
use the default, or 1 to limit themselves to a single active worker.

The core also allows changing it, but nothing but the btrfs_workqueue
code ever does. We can wire up btrfs to change the concurrency if we
have a good reason to. And I'd like to figure out if there is one, and
if yes for which workqueue, but for that we'd need an argument for why a
given workqueue should use a larger or smaller than default value. The
old argument of higher resource usage with a bigger number of workers
does not apply any more with the concurrency managed workqueue
implementation.

>>> Personally, I'd prefer to keep the btrfs bio endio calls in the old
>>> soft/hard irq context and let the higher layer queue the work.
>>
>> Can you explain why?
>
> Just to keep the context consistent.

Which is what this series does. Before, all read I/O completion
handlers were called in workqueue context, while write ones weren't.
With the series, write completion handlers are called from workqueue
context as well, and with that all significant btrfs code except for
tiny bits of the bi_end_io handlers is called from process context.

> Imagine a situation where we put sleeping functions into an endio
> function without a second thought, fully relying on the fact that the
> endio function is executed in workqueue context.

Being able to do that is the point of this series.
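To make the plain-workqueue API referred to above concrete, here is a small illustrative sketch. The queue names and the helper are made up; only alloc_workqueue() and workqueue_set_max_active() are the real core interfaces. At allocation time a max_active of 0 means "use the default concurrency", while 1 limits the queue to a single active work item:

#include <linux/workqueue.h>

static struct workqueue_struct *example_default_wq;
static struct workqueue_struct *example_serial_wq;

static int example_alloc_wqs(void)
{
	/* 0: let the workqueue core pick its default max_active */
	example_default_wq = alloc_workqueue("example-default", 0, 0);
	/* 1: at most one work item from this queue runs at a time */
	example_serial_wq = alloc_workqueue("example-serial", WQ_MEM_RECLAIM, 1);
	if (!example_default_wq || !example_serial_wq)
		return -ENOMEM;

	/* The core also allows changing the limit later at runtime. */
	workqueue_set_max_active(example_default_wq, 4);
	return 0;
}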
On 2023/3/21 20:55, Christoph Hellwig wrote:
> On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>>> I think it is stalled. That's why the workqueue code heavily
>>> discourages limiting max_active unless you have a good reason to, and
>>> most callers follow that advice.
>>
>> To me, this looks more like hiding a bug in the workqueue user.
>> Shouldn't we expose such bugs instead of hiding them?
>
> If you limit max_active to a certain value, you clearly tell the
> workqueue code not to use more workers than that. That is what the
> argument is for.

And if a workload can only be deadlock free using the default
max_active, but not any smaller value, then I'd say the workload itself
is buggy.

> I don't see where there is a bug, or that someone is hiding it.
>
>> Furthermore, although I'm a big fan of plain workqueues, the old btrfs
>> workqueues allowed us to apply max_active limits to them, but now we
>> no longer have that ability.
>
> You can pass max_active to plain Linux workqueues and there is a bunch
> of places that do that, although the vast majority passes either 0 to
> use the default, or 1 to limit themselves to a single active worker.
>
> The core also allows changing it, but nothing but the btrfs_workqueue
> code ever does. We can wire up btrfs to change the concurrency if we
> have a good reason to. And I'd like to figure out if there is one, and
> if yes for which workqueue, but for that we'd need an argument for why
> a given workqueue should use a larger or smaller than default value.
> The old argument of higher resource usage with a bigger number of
> workers does not apply any more with the concurrency managed workqueue
> implementation.

The use case is still there: to limit the amount of CPU time spent by
btrfs workloads, from csum verification to compression.

And if limiting max_active for plain workqueues could help us expose
possible deadlocks, that would be an extra benefit.

Thanks,
Qu

>>>> Personally, I'd prefer to keep the btrfs bio endio calls in the old
>>>> soft/hard irq context and let the higher layer queue the work.
>>>
>>> Can you explain why?
>>
>> Just to keep the context consistent.
>
> Which is what this series does. Before, all read I/O completion
> handlers were called in workqueue context, while write ones weren't.
> With the series, write completion handlers are called from workqueue
> context as well, and with that all significant btrfs code except for
> tiny bits of the bi_end_io handlers is called from process context.
>
>> Imagine a situation where we put sleeping functions into an endio
>> function without a second thought, fully relying on the fact that the
>> endio function is executed in workqueue context.
>
> Being able to do that is the point of this series.
On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>> If you limit max_active to a certain value, you clearly tell the
>> workqueue code not to use more workers than that. That is what the
>> argument is for.
>
> And if a workload can only be deadlock free using the default
> max_active, but not any smaller value, then I'd say the workload itself
> is buggy.

Anything that has an interaction between two instances of a work_struct
can deadlock. Only a single execution context is guaranteed (and even
that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
to e.g. only using a global workqueue in file systems or block devices
that can stack.

Fortunately these days lockdep is generally able to catch these
dependencies as well.

> The use case is still there: to limit the amount of CPU time spent by
> btrfs workloads, from csum verification to compression.

So this is the first time I see an actual explanation, thanks for that.
If this is the reason, we should apply the max_active limit to all
workqueues that do csum and compression work, but not to other random
workqueues.

Dave, Josef, Chris: do you agree with this interpretation?
On 2023/3/22 16:32, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not to use more workers than that. That is what the
>>> argument is for.
>>
>> And if a workload can only be deadlock free using the default
>> max_active, but not any smaller value, then I'd say the workload
>> itself is buggy.
>
> Anything that has an interaction between two instances of a work_struct
> can deadlock. Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.

Shouldn't we avoid such cross-workqueue workloads at all cost?

IIRC at least we don't have such usage in btrfs. And I hope we never
will.

> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
>
>> The use case is still there: to limit the amount of CPU time spent by
>> btrfs workloads, from csum verification to compression.
>
> So this is the first time I see an actual explanation, thanks for that.
> If this is the reason, we should apply the max_active limit to all
> workqueues that do csum and compression work, but not to other random
> workqueues.

If we're limiting max_active for certain workqueues, then I'd say why
not for all workqueues?

If we have some usage relying on the number of workers, at least we
should be able to expose it and fix it.
(IIRC we should have a better way with less cross-workqueue dependency)

Thanks,
Qu

> Dave, Josef, Chris: do you agree with this interpretation?
On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>> And if a workload can only be deadlock free using the default
>>> max_active, but not any smaller value, then I'd say the workload
>>> itself is buggy.
>>
>> Anything that has an interaction between two instances of a
>> work_struct can deadlock. Only a single execution context is
>> guaranteed (and even that only with WQ_MEM_RECLAIM), and we've seen
>> plenty of deadlocks due to e.g. only using a global workqueue in file
>> systems or block devices that can stack.
>
> Shouldn't we avoid such cross-workqueue workloads at all cost?

Yes, btrfs uses per-sb workqueues. As do most other places now, but
there was a bit of a learning curve years ago.

>> So this is the first time I see an actual explanation, thanks for
>> that. If this is the reason, we should apply the max_active limit to
>> all workqueues that do csum and compression work, but not to other
>> random workqueues.
>
> If we're limiting max_active for certain workqueues, then I'd say why
> not for all workqueues?
>
> If we have some usage relying on the number of workers, at least we
> should be able to expose it and fix it.

Again, btrfs is the odd one out allowing the user to set arbitrary
limits. This code predates using the kernel workqueues, and I'm a
little doubtful it still is useful. But for that I need to figure out
why it was even kept when converting btrfs to use workqueues.

> (IIRC we should have a better way with less cross-workqueue dependency)

I've been very actively working on reducing the number of different
workqueues. This series is an important part of that.
On 2023/3/23 16:12, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>>> And if a workload can only be deadlock free using the default
>>>> max_active, but not any smaller value, then I'd say the workload
>>>> itself is buggy.
>>>
>>> Anything that has an interaction between two instances of a
>>> work_struct can deadlock. Only a single execution context is
>>> guaranteed (and even that only with WQ_MEM_RECLAIM), and we've seen
>>> plenty of deadlocks due to e.g. only using a global workqueue in file
>>> systems or block devices that can stack.
>>
>> Shouldn't we avoid such cross-workqueue workloads at all cost?
>
> Yes, btrfs uses per-sb workqueues. As do most other places now, but
> there was a bit of a learning curve years ago.
>
>>> So this is the first time I see an actual explanation, thanks for
>>> that. If this is the reason, we should apply the max_active limit to
>>> all workqueues that do csum and compression work, but not to other
>>> random workqueues.
>>
>> If we're limiting max_active for certain workqueues, then I'd say why
>> not for all workqueues?
>>
>> If we have some usage relying on the number of workers, at least we
>> should be able to expose it and fix it.
>
> Again, btrfs is the odd one out allowing the user to set arbitrary
> limits. This code predates using the kernel workqueues, and I'm a
> little doubtful it still is useful. But for that I need to figure out
> why it was even kept when converting btrfs to use workqueues.
>
>> (IIRC we should have a better way with less cross-workqueue dependency)
>
> I've been very actively working on reducing the number of different
> workqueues. This series is an important part of that.

Yeah, your work is very much appreciated, that's without any doubt.

I'm just wondering: if we all know that we should avoid cross-workqueue
dependencies to avoid deadlocks, why is no one trying to expose any such
deadlocks by simply limiting max_active to 1 for all workqueues?
Especially with lockdep, once we get a reproducer it should not be that
hard to fix.

That's the only other valid use case besides limiting CPU usage for
csum/compression. I know regular users should not touch this, but we
developers should still be able to create a worst case for our CI
systems to suffer through.

Thanks,
Qu
On 3/22/23 4:32 AM, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not to use more workers than that. That is what the
>>> argument is for.
>>
>> And if a workload can only be deadlock free using the default
>> max_active, but not any smaller value, then I'd say the workload
>> itself is buggy.
>
> Anything that has an interaction between two instances of a work_struct
> can deadlock. Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.
>
> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
>
>> The use case is still there: to limit the amount of CPU time spent by
>> btrfs workloads, from csum verification to compression.
>
> So this is the first time I see an actual explanation, thanks for that.
> If this is the reason, we should apply the max_active limit to all
> workqueues that do csum and compression work, but not to other random
> workqueues.
>
> Dave, Josef, Chris: do you agree with this interpretation?

The original reason for limiting the number of workers was that work
like crc calculations was causing tons of context switches. This isn't
a surprise: there were a ton of work items, and each one was processed
very quickly.

So the original btrfs thread pools had optimizations meant to limit
context switches by preferring to give work to a smaller number of
workers.

For compression it's more clear cut. I wanted the ability to saturate
all the CPUs with compression work, but also wanted a default that
didn't take over the whole machine.

-chris
On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
> The original reason for limiting the number of workers was that work
> like crc calculations was causing tons of context switches. This isn't
> a surprise: there were a ton of work items, and each one was processed
> very quickly.

Yeah. Although quite a bit of that went away since, by not doing the
offload for sync I/O, nor for metadata when there is a fast crc32
implementation (we need to do the same for small data I/O and hardware
accelerated sha256, btw), and by doing these in larger chunks, but ..

> So the original btrfs thread pools had optimizations meant to limit
> context switches by preferring to give work to a smaller number of
> workers.

.. this is something that the workqueue code already does under the
hood anyway.

> For compression it's more clear cut. I wanted the ability to saturate
> all the CPUs with compression work, but also wanted a default that
> didn't take over the whole machine.

And that's actually a very good use case. It almost asks for a separate
option just for compression, or at least for compression and
checksumming only.

Is there consensus that for now we should limit thread_pool to the
workqueues that do compression and checksumming? Then I'll send a
series to wire it up again for the read completion workqueue that does
the checksum verification and for the compressed write workqueue, but
drop it for all others except the write submission one?
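A rough sketch of the split proposed above, not actual btrfs code: apply the user's thread_pool value only to the queues that do checksum verification and compression, and leave everything else at the workqueue default. The helper name and the exact queue selection here are assumptions for illustration only:

static int btrfs_init_limited_workqueues(struct btrfs_fs_info *fs_info,
					 unsigned int flags, int thread_pool)
{
	/* Read completions do checksum verification: honor thread_pool. */
	fs_info->endio_workers =
		alloc_workqueue("btrfs-endio", flags, thread_pool);
	/* Compressed writes do the actual compression: honor thread_pool. */
	fs_info->compressed_write_workers =
		alloc_workqueue("btrfs-compressed-write", flags, thread_pool);
	/* Everything else: 0 asks for the default workqueue concurrency. */
	fs_info->endio_write_workers =
		alloc_workqueue("btrfs-endio-write", flags, 0);

	if (!fs_info->endio_workers || !fs_info->compressed_write_workers ||
	    !fs_info->endio_write_workers)
		return -ENOMEM;
	return 0;
}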
On Thu, Mar 23, 2023 at 04:20:45PM +0800, Qu Wenruo wrote:
> I'm just wondering: if we all know that we should avoid cross-workqueue
> dependencies to avoid deadlocks, why is no one trying to expose any
> such deadlocks by simply limiting max_active to 1 for all workqueues?

Cross-workqueue dependencies are not a problem per se, and a lot of
places in the kernel rely on them. Dependencies on another work_struct
on the same workqueue, on the other hand, are a problem.

> Especially with lockdep, once we get a reproducer it should not be that
> hard to fix.

lockdep helps you find them without requiring you to limit max_active.
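A generic illustration (not btrfs code) of the distinction made above: a work item that waits on another work item queued behind it on the same max_active == 1 workqueue deadlocks, while queueing the second item on a different workqueue (a cross-workqueue dependency) would be fine. All names here are made up:

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;
static struct work_struct example_work_a;
static struct work_struct example_work_b;

static void example_work_b_fn(struct work_struct *work)
{
	/* Never runs: the only active slot is held by example_work_a_fn(). */
}

static void example_work_a_fn(struct work_struct *work)
{
	queue_work(example_wq, &example_work_b);
	/*
	 * Deadlock: with max_active == 1, B cannot start until A finishes,
	 * but A blocks here waiting for B.  Putting B on a different
	 * workqueue would break the cycle.
	 */
	flush_work(&example_work_b);
}

static int example_init(void)
{
	example_wq = alloc_workqueue("example-serial", 0, 1);
	if (!example_wq)
		return -ENOMEM;
	INIT_WORK(&example_work_a, example_work_a_fn);
	INIT_WORK(&example_work_b, example_work_b_fn);
	queue_work(example_wq, &example_work_a);
	return 0;
}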
On 3/23/23 9:09 PM, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
>> The original reason for limiting the number of workers was that work
>> like crc calculations was causing tons of context switches. This isn't
>> a surprise: there were a ton of work items, and each one was processed
>> very quickly.
>
> Yeah. Although quite a bit of that went away since, by not doing the
> offload for sync I/O, nor for metadata when there is a fast crc32
> implementation (we need to do the same for small data I/O and hardware
> accelerated sha256, btw), and by doing these in larger chunks, but ..
>
>> So the original btrfs thread pools had optimizations meant to limit
>> context switches by preferring to give work to a smaller number of
>> workers.
>
> .. this is something that the workqueue code already does under the
> hood anyway.

Yeah, between hardware changes and kernel changes, these motivations
don't really apply anymore. Even if they did, we're probably better off
fixing that in the workqueue code directly now.

>> For compression it's more clear cut. I wanted the ability to saturate
>> all the CPUs with compression work, but also wanted a default that
>> didn't take over the whole machine.
>
> And that's actually a very good use case. It almost asks for a separate
> option just for compression, or at least for compression and
> checksumming only.
>
> Is there consensus that for now we should limit thread_pool to the
> workqueues that do compression and checksumming? Then I'll send a
> series to wire it up again for the read completion workqueue that does
> the checksum verification and for the compressed write workqueue, but
> drop it for all others except the write submission one?

As you mentioned above, we're currently doing synchronous crcs for
metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
We've benchmarked this a few times and I think with modern hardware a
better default is just always doing synchronous crcs for data too.

Qu or David, have you looked at synchronous data crcs recently?

Looking ahead, encryption is going to want similar knobs to compression.

My preference would be:

- crcs are always inline if your hardware is fast
- Compression, encryption, and slow hardware crcs use the
  thread_pool_size knob
- We don't try to limit the other workers

The idea behind the knob is just "how much of your CPU should each btrfs
mount consume?" Obviously we'll silently judge anyone that chooses less
than 100%, but I guess it's good to give people options.

-chris
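As a sketch of the policy preference above (not the actual btrfs code): only fall back to the thread_pool-limited offload when the checksum implementation is slow. The helper name is made up; BTRFS_FS_CSUM_IMPL_FAST is the existing flag btrfs sets when an accelerated checksum implementation is available:

static bool btrfs_should_offload_csum(struct btrfs_fs_info *fs_info)
{
	/* Fast (accelerated) checksums: compute them inline. */
	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
		return false;

	/* Slow checksums: use the thread_pool-limited workqueue. */
	return true;
}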
On 3/24/23 9:25 AM, Chris Mason wrote:
> On 3/23/23 9:09 PM, Christoph Hellwig wrote:
> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.

I asked Jens to run some numbers on his fast drives (thanks Jens!). He
did 1M buffered writes w/fio.

Unpatched Linus:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

Jens said there was a lot of variation during the run, going down as low
as 1100MB/s.

Synchronous crcs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

The synchronous run had about the same peak write speed but much lower
dips, and fewer kworkers churning around. Both tests had fio saturated
at 100% CPU.

I think we should flip crcs to synchronous as long as it's HW
accelerated.

-chris
On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
> I asked Jens to run some numbers on his fast drives (thanks Jens!). He
> did 1M buffered writes w/fio.
>
> Unpatched Linus:
>
> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
>
> Jens said there was a lot of variation during the run, going down as
> low as 1100MB/s.
>
> Synchronous crcs:
>
> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
>
> The synchronous run had about the same peak write speed but much lower
> dips, and fewer kworkers churning around.

FYI, I did some similar runs on this just with kvm on consumer drives,
and the results were similar. Interestingly enough I first accidentally
ran with the non-accelerated crc32 code, as kvm by default doesn't pass
through cpu flags. Even with that the workqueue offload only looked
better for really large writes, and then only marginally.
On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.
>
> Qu or David, have you looked at synchronous data crcs recently?

As mentioned in the other mail, I have looked a bit. But only for crc32
so far, and only on x86, so the matrix might be a little more
complicated.

> My preference would be:
>
> - crcs are always inline if your hardware is fast
> - Compression, encryption, and slow hardware crcs use the
>   thread_pool_size knob
> - We don't try to limit the other workers
>
> The idea behind the knob is just "how much of your CPU should each
> btrfs mount consume?" Obviously we'll silently judge anyone that
> chooses less than 100%, but I guess it's good to give people options.

Ok. I'll do a series about the knobs ASAP, and then the inline CRCs
next.
On 2023/3/25 16:15, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
>> As you mentioned above, we're currently doing synchronous crcs for
>> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
>> We've benchmarked this a few times and I think with modern hardware a
>> better default is just always doing synchronous crcs for data too.
>>
>> Qu or David, have you looked at synchronous data crcs recently?
>
> As mentioned in the other mail, I have looked a bit. But only for crc32
> so far, and only on x86, so the matrix might be a little more
> complicated.

I haven't really looked at the async csum part at all.

In fact I'm not even sure what the point of generating data checksums
asynchronously is... I didn't see any extra splitting to let multiple
threads calculate the hash on different parts.
Furthermore, I'm not sure all the supported hash algos
(CRC32/SHA256/BLAKE2/XXHASH) can even support such split-and-merge
multi-threading. (IIRC SHA256 can not?)

Thus I'm not really sure of the benefit of async csum generation anyway.
(Not to mention that asynchronous code itself is not straightforward.)

Considering we used to do data csum verification in the endio function
synchronously, I see no problem doing the generation path synchronously
as well, even without hardware accelerated checksum support.

Thanks,
Qu

>> My preference would be:
>>
>> - crcs are always inline if your hardware is fast
>> - Compression, encryption, and slow hardware crcs use the
>>   thread_pool_size knob
>> - We don't try to limit the other workers
>>
>> The idea behind the knob is just "how much of your CPU should each
>> btrfs mount consume?" Obviously we'll silently judge anyone that
>> chooses less than 100%, but I guess it's good to give people options.
>
> Ok. I'll do a series about the knobs ASAP, and then the inline CRCs
> next.
On 3/25/23 4:13 AM, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
>> I asked Jens to run some numbers on his fast drives (thanks Jens!). He
>> did 1M buffered writes w/fio.
>>
>> Unpatched Linus:
>>
>> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
>>
>> Jens said there was a lot of variation during the run, going down as
>> low as 1100MB/s.
>>
>> Synchronous crcs:
>>
>> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
>>
>> The synchronous run had about the same peak write speed but much lower
>> dips, and fewer kworkers churning around.
>
> FYI, I did some similar runs on this just with kvm on consumer drives,
> and the results were similar. Interestingly enough I first accidentally
> ran with the non-accelerated crc32 code, as kvm by default doesn't pass
> through cpu flags. Even with that the workqueue offload only looked
> better for really large writes, and then only marginally.

There is pretty clearly an opportunity to tune the crc workers, but I
wouldn't make that a requirement for your series. It'll be a bigger win
to just force inline most of the time.

-chris
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index 85539964864a34..037eded3b33ba2 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -300,20 +300,33 @@ static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio) { struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info; - if (bbio->bio.bi_opf & REQ_META) - return fs_info->endio_meta_workers; - return fs_info->endio_workers; + if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) { + if (btrfs_is_free_space_inode(bbio->inode)) + return fs_info->endio_freespace_worker; + if (bbio->bio.bi_opf & REQ_META) + return fs_info->endio_write_meta_workers; + return fs_info->endio_write_workers; + } else { + if (bbio->bio.bi_opf & REQ_META) + return fs_info->endio_meta_workers; + return fs_info->endio_workers; + } } -static void btrfs_end_bio_work(struct work_struct *work) +static void btrfs_simple_end_bio_work(struct work_struct *work) { struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work); + struct bio *bio = &bbio->bio; + struct btrfs_device *dev = bio->bi_private; /* Metadata reads are checked and repaired by the submitter. */ - if (bbio->bio.bi_opf & REQ_META) - bbio->end_io(bbio); - else - btrfs_check_read_bio(bbio, bbio->bio.bi_private); + if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) { + btrfs_check_read_bio(bbio, dev); + } else { + if (bio_op(bio) == REQ_OP_ZONE_APPEND) + btrfs_record_physical_zoned(bbio); + btrfs_orig_bbio_end_io(bbio); + } } static void btrfs_simple_end_io(struct bio *bio) @@ -322,18 +335,19 @@ static void btrfs_simple_end_io(struct bio *bio) struct btrfs_device *dev = bio->bi_private; btrfs_bio_counter_dec(bbio->inode->root->fs_info); - if (bio->bi_status) btrfs_log_dev_io_error(bio, dev); - if (bio_op(bio) == REQ_OP_READ) { - INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work); - queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); - } else { - if (bio_op(bio) == REQ_OP_ZONE_APPEND) - btrfs_record_physical_zoned(bbio); - btrfs_orig_bbio_end_io(bbio); - } + INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work); + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); +} + +static void btrfs_write_end_bio_work(struct work_struct *work) +{ + struct btrfs_bio *bbio = + container_of(work, struct btrfs_bio, end_io_work); + + btrfs_orig_bbio_end_io(bbio); } static void btrfs_raid56_end_io(struct bio *bio) @@ -343,12 +357,23 @@ static void btrfs_raid56_end_io(struct bio *bio) btrfs_bio_counter_dec(bioc->fs_info); bbio->mirror_num = bioc->mirror_num; - if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META)) - btrfs_check_read_bio(bbio, NULL); - else - btrfs_orig_bbio_end_io(bbio); - btrfs_put_bioc(bioc); + + /* + * The RAID5/6 code already completes all bios from workqueues, but for + * writs we need to offload them again to avoid deadlocks in the ordered + * extent processing. + */ + if (bio_op(bio) == REQ_OP_READ) { + /* Metadata reads are checked and repaired by the submitter. 
*/ + if (!(bio->bi_opf & REQ_META)) + btrfs_check_read_bio(bbio, NULL); + else + btrfs_orig_bbio_end_io(bbio); + } else { + INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work); + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); + } } static void btrfs_orig_write_end_io(struct bio *bio) @@ -372,9 +397,10 @@ static void btrfs_orig_write_end_io(struct bio *bio) bio->bi_status = BLK_STS_IOERR; else bio->bi_status = BLK_STS_OK; - - btrfs_orig_bbio_end_io(bbio); btrfs_put_bioc(bioc); + + INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work); + queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work); } static void btrfs_clone_write_end_io(struct bio *bio) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 0889eb81e71a7d..5b63a5161cedea 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2008,6 +2008,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) * the queues used for metadata I/O, since tasks from those other work * queues can do metadata I/O operations. */ + if (fs_info->endio_write_meta_workers) + destroy_workqueue(fs_info->endio_write_meta_workers); if (fs_info->endio_meta_workers) destroy_workqueue(fs_info->endio_meta_workers); } @@ -2204,6 +2206,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info) alloc_workqueue("btrfs-endio", flags, max_active); fs_info->endio_meta_workers = alloc_workqueue("btrfs-endio-meta", flags, max_active); + fs_info->endio_write_meta_workers = + alloc_workqueue("btrfs-endio-write-meta", flags, max_active); fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active); fs_info->endio_write_workers = alloc_workqueue("btrfs-endio-write", flags, max_active); @@ -2224,6 +2228,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info) fs_info->endio_workers && fs_info->endio_meta_workers && fs_info->compressed_write_workers && fs_info->endio_write_workers && + fs_info->endio_write_meta_workers && fs_info->endio_freespace_worker && fs_info->rmw_workers && fs_info->caching_workers && fs_info->fixup_workers && fs_info->delayed_workers && fs_info->qgroup_rescan_workers && diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h index 276a17780f2b1b..e2643bc5c039ad 100644 --- a/fs/btrfs/fs.h +++ b/fs/btrfs/fs.h @@ -543,6 +543,7 @@ struct btrfs_fs_info { struct workqueue_struct *rmw_workers; struct workqueue_struct *compressed_write_workers; struct workqueue_struct *endio_write_workers; + struct workqueue_struct *endio_write_meta_workers; struct workqueue_struct *endio_freespace_worker; struct btrfs_workqueue *caching_workers; diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 23f496f0d7b776..48c7510df80a0d 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -303,14 +303,6 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, spin_unlock_irq(&tree->lock); } -static void finish_ordered_fn(struct work_struct *work) -{ - struct btrfs_ordered_extent *ordered_extent; - - ordered_extent = container_of(work, struct btrfs_ordered_extent, work); - btrfs_finish_ordered_io(ordered_extent); -} - /* * Mark all ordered extents io inside the specified range finished. 
* @@ -330,17 +322,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, { struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree; struct btrfs_fs_info *fs_info = inode->root->fs_info; - struct workqueue_struct *wq; struct rb_node *node; struct btrfs_ordered_extent *entry = NULL; unsigned long flags; u64 cur = file_offset; - if (btrfs_is_free_space_inode(inode)) - wq = fs_info->endio_freespace_worker; - else - wq = fs_info->endio_write_workers; - if (page) ASSERT(page->mapping && page_offset(page) <= file_offset && file_offset + num_bytes <= page_offset(page) + PAGE_SIZE); @@ -439,8 +425,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, refcount_inc(&entry->refs); trace_btrfs_ordered_extent_mark_finished(inode, entry); spin_unlock_irqrestore(&tree->lock, flags); - INIT_WORK(&entry->work, finish_ordered_fn); - queue_work(wq, &entry->work); + btrfs_finish_ordered_io(entry); spin_lock_irqsave(&tree->lock, flags); } cur += len; diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index b8a92f040458f0..64a37d42e7a24a 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -146,8 +146,6 @@ struct btrfs_ordered_extent { /* a per root list of all the pending ordered extents */ struct list_head root_extent_list; - struct work_struct work; - struct completion completion; struct btrfs_work flush_work; struct list_head work_list;
Currently all reads are offloaded to workqueues for checksum
verification. Writes are initially processed in the bi_end_io context
that usually happens in hard or soft interrupts, and are only offloaded
to a workqueue for ordered extent completion or compressed extent
handling, and in the future for raid stripe tree handling.

Switch to offloading all writes to workqueues instead of doing that only
for the final ordered extent processing. This means that there are
slightly more workqueue offloads for large writes as each submitted
btrfs_bio can only hold 1MiB worth of data on systems with a 4K page
size. On the other hand this will allow large simplifications by always
having a process context.

For metadata writes there was no offload at all before, which creates
various problems like not being able to drop the final extent_buffer
reference from the end_io handler.

With the entire series applied this shows no performance differences on
data heavy workloads, while for metadata heavy workloads like Filipe's
fs_mark there also is no change in averages, while it seems to somewhat
reduce the outliers in terms of best and worst performance. This is
probably due to the removal of the irq disabling in the next patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c          | 74 ++++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.c      |  5 +++
 fs/btrfs/fs.h           |  1 +
 fs/btrfs/ordered-data.c | 17 +---------
 fs/btrfs/ordered-data.h |  2 --
 5 files changed, 57 insertions(+), 42 deletions(-)