Message ID | 87r328j00i.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
I started looking further at the improvements we can make once generic_make_request is fixed, and realised that I had missed an important detail in this patch. Several places test current->bio_list, and two actually edit the list. With this change, that cannot see the whole lists, so it could cause a regression. So I've revised the patch to make sure that the entire list of queued bios remains visible, and change the relevant code to look at both the new list and the old list. Following that there are some patches which make the rescuer thread optional, and then starts removing it from some easy-to-fix places. The series summary is below. NeilBrown NeilBrown (5): blk: improve order of bio handling in generic_make_request() blk: remove bio_set arg from blk_queue_split() blk: make the bioset rescue_workqueue optional. blk: use non-rescuing bioset for q->bio_split. block_dev: make blkdev_dio_pool a non-rescuing bioset block/bio.c | 39 +++++++++++++++++++++++++++------ block/blk-core.c | 42 ++++++++++++++++++++++++++++------- block/blk-merge.c | 7 +++--- block/blk-mq.c | 4 ++- drivers/block/drbd/drbd_main.c | 2 +- drivers/block/drbd/drbd_req.c | 2 +- drivers/block/pktcdvd.c | 2 +- drivers/block/ps3vram.c | 2 +- drivers/block/rsxx/dev.c | 2 +- drivers/block/umem.c | 2 +- drivers/block/zram/zram_drv.c | 2 +- drivers/lightnvm/rrpc.c | 2 +- drivers/md/bcache/super.c | 4 ++- drivers/md/dm-crypt.c | 2 +- drivers/md/dm-io.c | 2 +- drivers/md/dm.c | 32 +++++++++++++++------------ drivers/md/md.c | 4 ++- drivers/md/raid10.c | 3 ++- drivers/md/raid5-cache.c | 2 +- drivers/s390/block/dcssblk.c | 2 +- drivers/s390/block/xpram.c | 2 +- drivers/target/target_core_iblock.c | 2 +- fs/btrfs/extent_io.c | 4 ++- fs/xfs/xfs_super.c | 2 +- include/linux/bio.h | 2 ++ include/linux/blkdev.h | 3 +-- 26 files changed, 114 insertions(+), 60 deletions(-)
On 03/09/2017 09:32 PM, NeilBrown wrote: > > I started looking further at the improvements we can make once > generic_make_request is fixed, and realised that I had missed an > important detail in this patch. > Several places test current->bio_list, and two actually edit the list. > With this change, that cannot see the whole lists, so it could cause a > regression. > > So I've revised the patch to make sure that the entire list of queued > bios remains visible, and change the relevant code to look at both > the new list and the old list. > > Following that there are some patches which make the rescuer thread > optional, and then starts removing it from some easy-to-fix places. Neil, note that the v2 patch is already in Linus tree as of earlier today. You need to rebase the series, and if we need fixups on top of v2, then that should be done separately and with increased urgency.
On 03/09/2017 09:38 PM, Jens Axboe wrote: > On 03/09/2017 09:32 PM, NeilBrown wrote: >> >> I started looking further at the improvements we can make once >> generic_make_request is fixed, and realised that I had missed an >> important detail in this patch. >> Several places test current->bio_list, and two actually edit the list. >> With this change, that cannot see the whole lists, so it could cause a >> regression. >> >> So I've revised the patch to make sure that the entire list of queued >> bios remains visible, and change the relevant code to look at both >> the new list and the old list. >> >> Following that there are some patches which make the rescuer thread >> optional, and then starts removing it from some easy-to-fix places. > > Neil, note that the v2 patch is already in Linus tree as of earlier > today. You need to rebase the series, and if we need fixups on > top of v2, then that should be done separately and with increased > urgency. Additionally, at least the first patch appears to be badly mangled. The formatting is screwed up.
On Thu, Mar 09 2017, Jens Axboe wrote: > On 03/09/2017 09:32 PM, NeilBrown wrote: >> >> I started looking further at the improvements we can make once >> generic_make_request is fixed, and realised that I had missed an >> important detail in this patch. >> Several places test current->bio_list, and two actually edit the list. >> With this change, that cannot see the whole lists, so it could cause a >> regression. >> >> So I've revised the patch to make sure that the entire list of queued >> bios remains visible, and change the relevant code to look at both >> the new list and the old list. >> >> Following that there are some patches which make the rescuer thread >> optional, and then starts removing it from some easy-to-fix places. > > Neil, note that the v2 patch is already in Linus tree as of earlier > today. You need to rebase the series, and if we need fixups on > top of v2, then that should be done separately and with increased > urgency. I had checked linux-next, but not the latest from Linus. I see it now - thanks! I'll rebase (and ensure nothing gets mangled) Thanks, NeilBrown
> --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) > */ > blk_qc_t generic_make_request(struct bio *bio) > { > - struct bio_list bio_list_on_stack; > + /* > + * bio_list_on_stack[0] contains bios submitted by the current > + * make_request_fn. > + * bio_list_on_stack[1] contains bios that were submitted before > + * the current make_request_fn, but that haven't been processed > + * yet. > + */ > + struct bio_list bio_list_on_stack[2]; > blk_qc_t ret = BLK_QC_T_NONE; May I suggest that, if you intend to assign something that is not a plain &(struct bio_list), but a &(struct bio_list[2]), you change the task member so it is renamed (current->bio_list vs current->bio_lists, plural, is what I did last year). Or you will break external modules, silently, and horribly (or, rather, they won't notice, but break the kernel). Examples of such modules would be DRBD, ZFS, quite possibly others. Thanks, Lars
On Fri, Mar 10 2017 at 7:34am -0500, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) > > */ > > blk_qc_t generic_make_request(struct bio *bio) > > { > > - struct bio_list bio_list_on_stack; > > + /* > > + * bio_list_on_stack[0] contains bios submitted by the current > > + * make_request_fn. > > + * bio_list_on_stack[1] contains bios that were submitted before > > + * the current make_request_fn, but that haven't been processed > > + * yet. > > + */ > > + struct bio_list bio_list_on_stack[2]; > > blk_qc_t ret = BLK_QC_T_NONE; > > May I suggest that, if you intend to assign something that is not a > plain &(struct bio_list), but a &(struct bio_list[2]), > you change the task member so it is renamed (current->bio_list vs > current->bio_lists, plural, is what I did last year). > Or you will break external modules, silently, and horribly (or, > rather, they won't notice, but break the kernel). > Examples of such modules would be DRBD, ZFS, quite possibly others. drbd is upstream -- so what is the problem? (if you are having to distribute drbd independent of the upstream drbd then why is drbd upstream?) As for ZFS, worrying about ZFS kABI breakage is the last thing we should be doing. So Nack from me on this defensive make-work for external modules.
On Fri, 10 Mar 2017, Mike Snitzer wrote: > On Fri, Mar 10 2017 at 7:34am -0500, > Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > > > > --- a/block/blk-core.c > > > +++ b/block/blk-core.c > > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) > > > */ > > > blk_qc_t generic_make_request(struct bio *bio) > > > { > > > - struct bio_list bio_list_on_stack; > > > + /* > > > + * bio_list_on_stack[0] contains bios submitted by the current > > > + * make_request_fn. > > > + * bio_list_on_stack[1] contains bios that were submitted before > > > + * the current make_request_fn, but that haven't been processed > > > + * yet. > > > + */ > > > + struct bio_list bio_list_on_stack[2]; > > > blk_qc_t ret = BLK_QC_T_NONE; > > > > May I suggest that, if you intend to assign something that is not a > > plain &(struct bio_list), but a &(struct bio_list[2]), > > you change the task member so it is renamed (current->bio_list vs > > current->bio_lists, plural, is what I did last year). > > Or you will break external modules, silently, and horribly (or, > > rather, they won't notice, but break the kernel). > > Examples of such modules would be DRBD, ZFS, quite possibly others. > > drbd is upstream -- so what is the problem? (if you are having to > distribute drbd independent of the upstream drbd then why is drbd > upstream?) > > As for ZFS, worrying about ZFS kABI breakage is the last thing we should > be doing. It's better to make external modules not compile than to silently introduce bugs in them. So yes, I would rename that. Mikulas > So Nack from me on this defensive make-work for external modules. >
On 10.03.2017 15:55, Mikulas Patocka wrote: > > > On Fri, 10 Mar 2017, Mike Snitzer wrote: > >> On Fri, Mar 10 2017 at 7:34am -0500, >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote: >> >>>> --- a/block/blk-core.c >>>> +++ b/block/blk-core.c >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) >>>> */ >>>> blk_qc_t generic_make_request(struct bio *bio) >>>> { >>>> - struct bio_list bio_list_on_stack; >>>> + /* >>>> + * bio_list_on_stack[0] contains bios submitted by the current >>>> + * make_request_fn. >>>> + * bio_list_on_stack[1] contains bios that were submitted before >>>> + * the current make_request_fn, but that haven't been processed >>>> + * yet. >>>> + */ >>>> + struct bio_list bio_list_on_stack[2]; >>>> blk_qc_t ret = BLK_QC_T_NONE; >>> >>> May I suggest that, if you intend to assign something that is not a >>> plain &(struct bio_list), but a &(struct bio_list[2]), >>> you change the task member so it is renamed (current->bio_list vs >>> current->bio_lists, plural, is what I did last year). >>> Or you will break external modules, silently, and horribly (or, >>> rather, they won't notice, but break the kernel). >>> Examples of such modules would be DRBD, ZFS, quite possibly others. >> >> drbd is upstream -- so what is the problem? (if you are having to >> distribute drbd independent of the upstream drbd then why is drbd >> upstream?) >> >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should >> be doing. > > It's better to make external modules not compile than to silently > introduce bugs in them. So yes, I would rename that. > > Mikulas Agree, better rename current->bio_list to current->bio_lists Regards, Jack
On Fri, Mar 10 2017 at 10:07am -0500, Jack Wang <jinpu.wang@profitbricks.com> wrote: > > > On 10.03.2017 15:55, Mikulas Patocka wrote: > > > > > > On Fri, 10 Mar 2017, Mike Snitzer wrote: > > > >> On Fri, Mar 10 2017 at 7:34am -0500, > >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > >> > >>>> --- a/block/blk-core.c > >>>> +++ b/block/blk-core.c > >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) > >>>> */ > >>>> blk_qc_t generic_make_request(struct bio *bio) > >>>> { > >>>> - struct bio_list bio_list_on_stack; > >>>> + /* > >>>> + * bio_list_on_stack[0] contains bios submitted by the current > >>>> + * make_request_fn. > >>>> + * bio_list_on_stack[1] contains bios that were submitted before > >>>> + * the current make_request_fn, but that haven't been processed > >>>> + * yet. > >>>> + */ > >>>> + struct bio_list bio_list_on_stack[2]; > >>>> blk_qc_t ret = BLK_QC_T_NONE; > >>> > >>> May I suggest that, if you intend to assign something that is not a > >>> plain &(struct bio_list), but a &(struct bio_list[2]), > >>> you change the task member so it is renamed (current->bio_list vs > >>> current->bio_lists, plural, is what I did last year). > >>> Or you will break external modules, silently, and horribly (or, > >>> rather, they won't notice, but break the kernel). > >>> Examples of such modules would be DRBD, ZFS, quite possibly others. > >> > >> drbd is upstream -- so what is the problem? (if you are having to > >> distribute drbd independent of the upstream drbd then why is drbd > >> upstream?) > >> > >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should > >> be doing. > > > > It's better to make external modules not compile than to silently > > introduce bugs in them. So yes, I would rename that. > > > > Mikulas > > Agree, better rename current->bio_list to current->bio_lists Fine, normally wouldn't do so but I'm not so opposed that we need to get hung up on this detail. If Neil and Jens agree then so be it.
On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote: > On 10.03.2017 15:55, Mikulas Patocka wrote: > > On Fri, 10 Mar 2017, Mike Snitzer wrote: > >> On Fri, Mar 10 2017 at 7:34am -0500, > >> Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > >> > >>>> --- a/block/blk-core.c > >>>> +++ b/block/blk-core.c > >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) > >>>> */ > >>>> blk_qc_t generic_make_request(struct bio *bio) > >>>> { > >>>> - struct bio_list bio_list_on_stack; > >>>> + /* > >>>> + * bio_list_on_stack[0] contains bios submitted by the current > >>>> + * make_request_fn. > >>>> + * bio_list_on_stack[1] contains bios that were submitted before > >>>> + * the current make_request_fn, but that haven't been processed > >>>> + * yet. > >>>> + */ > >>>> + struct bio_list bio_list_on_stack[2]; > >>>> blk_qc_t ret = BLK_QC_T_NONE; > >>> > >>> May I suggest that, if you intend to assign something that is not a > >>> plain &(struct bio_list), but a &(struct bio_list[2]), > >>> you change the task member so it is renamed (current->bio_list vs > >>> current->bio_lists, plural, is what I did last year). > >>> Or you will break external modules, silently, and horribly (or, > >>> rather, they won't notice, but break the kernel). > >>> Examples of such modules would be DRBD, ZFS, quite possibly others. > > It's better to make external modules not compile than to silently > > introduce bugs in them. So yes, I would rename that. > > > > Mikulas > > Agree, better rename current->bio_list to current->bio_lists > > Regards, > Jack Thank you. (I don't know if some one does, but...) Thing is: *IF* some external thing messes with current->bio_list in "interesting" ways, and not just the "I don't care, one level of real recursion fixes this for me" pattern of struct bio_list *tmp = current->bio_list; current->bio_list = NULL; submit_bio() current->bio_list = tmp; you get a chance of stack corruption, without even as much as a compiler warning. Which is why I wrote https://lkml.org/lkml/2016/7/8/469 ... Instead, I suggest to distinguish between recursive calls to generic_make_request(), and pushing back the remainder part in blk_queue_split(), by pointing current->bio_lists to a struct recursion_to_iteration_bio_lists { struct bio_list recursion; struct bio_list queue; } By providing each q->make_request_fn() with an empty "recursion" bio_list, then merging any recursively submitted bios to the head of the "queue" list, we can make the recursion-to-iteration logic in generic_make_request() process deepest level bios first, and "sibling" bios of the same level in "natural" order. ... Cheers, Lars
On Fri, Mar 10 2017, Lars Ellenberg wrote: >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio) >> */ >> blk_qc_t generic_make_request(struct bio *bio) >> { >> - struct bio_list bio_list_on_stack; >> + /* >> + * bio_list_on_stack[0] contains bios submitted by the current >> + * make_request_fn. >> + * bio_list_on_stack[1] contains bios that were submitted before >> + * the current make_request_fn, but that haven't been processed >> + * yet. >> + */ >> + struct bio_list bio_list_on_stack[2]; >> blk_qc_t ret = BLK_QC_T_NONE; > > May I suggest that, if you intend to assign something that is not a > plain &(struct bio_list), but a &(struct bio_list[2]), > you change the task member so it is renamed (current->bio_list vs > current->bio_lists, plural, is what I did last year). > Or you will break external modules, silently, and horribly (or, > rather, they won't notice, but break the kernel). > Examples of such modules would be DRBD, ZFS, quite possibly others. > This is exactly what I didn't in my first draft (bio_list -> bio_lists), but then I reverted that change because it didn't seem to be worth the noise. It isn't much noise, sched.h, bcache/btree.c, md/dm-bufio.c, and md/raid1.c get minor changes. But as I'm hoping to get rid of all of those uses, renaming before removing seemed pointless ... though admittedly that is what I did for bioset_create().... I wondered about that too. The example you give later: struct bio_list *tmp = current->bio_list; current->bio_list = NULL; submit_bio() current->bio_list = tmp; won't cause any problem. Whatever lists the parent generic_make_request is holding onto will be untouched during the submit_bio() call, and will be exactly as it expects them when this caller returns. If some out-of-tree code does anything with ->bio_list that makes sense with the previous code, then it will still make sense with the new code. However there will be a few bios that it didn't get too look at. These will all be bios that were submitted by a device further up the stack (closer to the filesystem), so they *should* be irrelevant. I could probably come up with some weird behaviour that might have worked before but now wouldn't quite work the same way. But just fixing bugs can sometimes affect an out-of-tree driver in a strange way because it was assuming those bugs. I hope that I'll soon be able to remove punt_bios_to_rescuer and flush_current_bio_list, after which current->bio_list can really be just a list again. I don't think it is worth changing the name for a transient situation. But thanks for the review - it encouraged me to think though the consequences again and I'm now more confident. I actually now think that change probably wasn't necessary. It is safer though. It ensures that current functionality isn't removed without a clear justification. Thanks, NeilBrown
diff --git a/block/blk-core.c b/block/blk-core.c index b9e857f4afe8..9520e82aa78c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + hold = bio_list_on_stack; + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); - bio = bio_list_pop(current->bio_list); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); } else { - struct bio *bio_next = bio_list_pop(current->bio_list); - bio_io_error(bio); - bio = bio_next; } + bio = bio_list_pop(current->bio_list); } while (bio); current->bio_list = NULL; /* deactivate */