Message ID | 20220601145110.18162-3-jack@suse.cz (mailing list archive)
---|---
State | New, archived
Series | block: Fix IO priority mess
On 2022/06/01 23:51, Jan Kara wrote:
> Commit e70344c05995 ("block: fix default IO priority handling")
> introduced an inconsistency in get_current_ioprio(): tasks without an
> IO context return IOPRIO_DEFAULT priority while tasks with a freshly
> allocated IO context return 0 (IOPRIO_CLASS_NONE/0) IO priority.
> Tasks without an IO context used to be rare before 5a9d041ba2f6
> ("block: move io_context creation into where it's needed") but after
> this commit they became common because now only the BFQ IO scheduler
> sets up a task's IO context. A similar inconsistency exists in
> get_task_ioprio(), so the problem is now exposed to userspace:
> userspace will see a different IO priority for tasks operating on
> devices with BFQ compared to devices without BFQ. Furthermore, the
> changes done by commit e70344c05995 alter the behavior when no IO
> priority is set for the BFQ IO scheduler, behavior which is documented
> in the ioprio_set(2) manpage - namely that tasks without a set IO
> priority will use an IO priority based on their nice value.
>
> So make sure we default to IOPRIO_CLASS_NONE as used to be the case
> before commit e70344c05995. Also clean up alloc_io_context() to
> explicitly set this IO priority for the allocated IO context.
>
> Fixes: e70344c05995 ("block: fix default IO priority handling")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-ioc.c        | 2 ++
>  include/linux/ioprio.h | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index df9cfe4ca532..63fc02042408 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -247,6 +247,8 @@ static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
>  	INIT_HLIST_HEAD(&ioc->icq_list);
>  	INIT_WORK(&ioc->release_work, ioc_release_fn);
>  #endif
> +	ioc->ioprio = IOPRIO_DEFAULT;
> +
>  	return ioc;
>  }
>
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 774bb90ad668..d9dc78a15301 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -11,7 +11,7 @@
>  /*
>   * Default IO priority.
>   */
> -#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
> +#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)

"man ioprio_set" says:

    IOPRIO_CLASS_BE (2)
           This is the best-effort scheduling class, which is the default
           for any process that hasn't set a specific I/O priority.

which is why patch e70344c05995 introduced the IOPRIO_DEFAULT definition
using the BE class: to keep the kernel in sync with the manual.

The different ioprio leading to no BIO merging is definitely a problem,
but this patch is not really fixing anything in my opinion. It simply
goes back to the previous "all 0s" ioprio initialization, which does not
show the places where we actually have a missing ioprio initialization
to IOPRIO_DEFAULT.

Isn't it simply that IOPRIO_DEFAULT should be set as the default for any
bio being allocated (in bio_alloc?) before it is set up and inherits the
user IO priority? Otherwise the bio IO priority is indeed
IOPRIO_CLASS_NONE/0, and changing IOPRIO_DEFAULT to that value removes
the differences you observed.

>
>  /*
>   * Check that a priority value has a valid class.
On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
> On 2022/06/01 23:51, Jan Kara wrote:
> > Commit e70344c05995 ("block: fix default IO priority handling")
> > introduced an inconsistency in get_current_ioprio(): tasks without an
> > IO context return IOPRIO_DEFAULT priority while tasks with a freshly
> > allocated IO context return 0 (IOPRIO_CLASS_NONE/0) IO priority.
> > [...]
> > -#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
> > +#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)
>
> "man ioprio_set" says:
>
>     IOPRIO_CLASS_BE (2)
>            This is the best-effort scheduling class, which is the default
>            for any process that hasn't set a specific I/O priority.
>
> Which is why patch e70344c05995 introduced the IOPRIO_DEFAULT definition
> using the BE class, to have the kernel in sync with the manual.

Yes, but it also has:

    If no I/O scheduler has been set for a thread, then by default the I/O
    priority will follow the CPU nice value (setpriority(2)). In Linux
    kernels before version 2.6.24, once an I/O priority had been set using
    ioprio_set(), there was no way to reset the I/O scheduling behavior to
    the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
    reset to the default I/O scheduling behavior.

So apparently even the manpage itself is inconsistent ;) (and I will
neglect the mistake that the text says "no I/O scheduler" instead of "no
I/O priority"). And there is actually code in BFQ (and there used to be
code in CFQ) that does:

	switch (ioprio_class) {
	...
	case IOPRIO_CLASS_NONE:
		/*
		 * No prio set, inherit CPU scheduling settings.
		 */
		bfqq->new_ioprio = task_nice_ioprio(tsk);
		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
		break;

So IOPRIO_CLASS_NONE indeed has a meaning, and it used to be the default
in most cases until Christoph's 5a9d041ba2f6. Your change e70344c05995,
which landed before that, actually didn't change much in practice because
IO contexts were initialized with 0 priority anyway, and that was what
get_current_ioprio() was returning.

> The different ioprio leading to no BIO merging is definitely a problem
> but this patch is not really fixing anything in my opinion. It simply
> gets back to the previous "all 0s" ioprio initialization, which does not
> show the places where we actually have missing ioprio initialization to
> IOPRIO_DEFAULT.

So I agree we should settle on how to treat IOs with an unset IO
priority. The behavior of using the task's CPU priority when the IO
priority is unset has been there for a *long* time, so I think we should
preserve it. The question is where in the stack the switch from "unset"
to "effective ioprio" should happen. Switching in the IO context is IMO
too early, since userspace needs to be able to differentiate "unset"
from "set to IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper
like current_effective_ioprio() that does the magic of mangling an unset
IO priority based on the task's CPU priority.

The fact is that bio->bi_ioprio gets set to anything only for direct IO
in iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches
the priority from the task's IO context and ignores the priority on
bios, BTW). And the only places where req->ioprio (inherited from
bio->bi_ioprio) gets used are a few drivers that set the HIGHPRI flag
for IOPRIO_CLASS_RT IO, and the relatively new code in mq-deadline.

So it all is a very inconsistent mess :-|

> Isn't it simply that IOPRIO_DEFAULT should be set as the default for any
> bio being allocated (in bio_alloc?) before it is set up and inherits the
> user IO priority? Otherwise the bio IO priority is indeed
> IOPRIO_CLASS_NONE/0, and changing IOPRIO_DEFAULT to that value removes
> the differences you observed.

Yes, I think that would make sense, although as I explain above this is
somewhat independent of what the default IO priority behavior should be.

								Honza
On 2022/06/02 1:04, Jan Kara wrote:
> On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
>> On 2022/06/01 23:51, Jan Kara wrote:
>>> Commit e70344c05995 ("block: fix default IO priority handling")
>>> introduced an inconsistency in get_current_ioprio() [...]
>>
>> "man ioprio_set" says:
>> [...]
>> Which is why patch e70344c05995 introduced the IOPRIO_DEFAULT definition
>> using the BE class, to have the kernel in sync with the manual.
>
> Yes, but it also has:
>
>     If no I/O scheduler has been set for a thread, then by default the I/O
>     priority will follow the CPU nice value (setpriority(2)). [...]
>
> So apparently even the manpage itself is inconsistent ;) (and I will
> neglect the mistake that the text says "no I/O scheduler" instead of "no
> I/O priority"). And there is actually code in BFQ (and there used to be
> code in CFQ) that does:

Argh, indeed, it is a mess. We need to fix this manpage too.

>	switch (ioprio_class) {
>	...
>	case IOPRIO_CLASS_NONE:
>		/*
>		 * No prio set, inherit CPU scheduling settings.
>		 */
>		bfqq->new_ioprio = task_nice_ioprio(tsk);
>		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
>		break;
>
> So IOPRIO_CLASS_NONE indeed has a meaning, and it used to be the default
> in most cases until Christoph's 5a9d041ba2f6. Your change e70344c05995,
> which landed before that, actually didn't change much in practice because
> IO contexts were initialized with 0 priority anyway, and that was what
> get_current_ioprio() was returning.

5a9d041ba2f6 was from Jens, but agreed, IO priority initialization has
never been correct.

> So I agree we should settle on how to treat IOs with an unset IO
> priority. The behavior of using the task's CPU priority when the IO
> priority is unset has been there for a *long* time, so I think we should
> preserve it. The question is where in the stack the switch from "unset"
> to "effective ioprio" should happen. Switching in the IO context is IMO
> too early, since userspace needs to be able to differentiate "unset"
> from "set to IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper
> like current_effective_ioprio() that does the magic of mangling an unset
> IO priority based on the task's CPU priority.

I agree that the task's CPU priority is a more sensible default. However,
I do not understand your point about the IO context being too early to
set the effective priority. If we do not do that, getting the issuer's
CPU priority will not be easily possible, right?

> The fact is that bio->bi_ioprio gets set to anything only for direct IO
> in iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches
> the priority from the task's IO context and ignores the priority on
> bios, BTW). And the only places where req->ioprio (inherited from
> bio->bi_ioprio) gets used are a few drivers that set the HIGHPRI flag
> for IOPRIO_CLASS_RT IO, and the relatively new code in mq-deadline.

Yes, only IOPRIO_CLASS_RT has an effect at the hardware level right now.
Everything else is only for the IO scheduler to play with. I am preparing
a patch series to support scsi/ata command duration limits though. That
adds a new IOPRIO_CLASS_DL, and that one will also have an effect on the
hardware.

Note that if a process/thread uses ioprio_set(), we should also honor the
ioprio set for buffered reads, at least for those page IOs that are
issued directly from the IO issuer's context (which may include
readahead... hmmm).

> So it all is a very inconsistent mess :-|

Indeed it is.

>> Isn't it simply that IOPRIO_DEFAULT should be set as the default for any
>> bio being allocated (in bio_alloc?) before it is set up and inherits the
>> user IO priority? Otherwise the bio IO priority is indeed
>> IOPRIO_CLASS_NONE/0, and changing IOPRIO_DEFAULT to that value removes
>> the differences you observed.
>
> Yes, I think that would make sense, although as I explain above this is
> somewhat independent of what the default IO priority behavior should be.

I am OK with using the task CPU priority/ionice value as the default when
no other ioprio is set for a bio via the user aio_reqprio or ioprio_set().
If this relies on task_nice_ioclass() as it does today (I see no reason to
change that), then the default class for regular tasks remains
IOPRIO_CLASS_BE, as defined by IOPRIO_DEFAULT.

But to avoid the performance regression you observed, we really need to be
100% sure that all bios have their ->bi_ioprio field correctly
initialized. Something like:

void bio_set_effective_ioprio(struct bio *bio)
{
	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
	case IOPRIO_CLASS_IDLE:
		/*
		 * The bio ioprio was already set from an aio kiocb ioprio
		 * (aio->aio_reqprio) or from the issuer context ioprio if
		 * that context used ioprio_set().
		 */
		return;
	case IOPRIO_CLASS_NONE:
	default:
		/* Use the current task CPU priority. */
		bio->bi_ioprio =
			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
					  task_nice_ioprio(current));
		return;
	}
}

being called before a bio is inserted into a scheduler or bypass-inserted
into the dispatch queues should result in all bios having an ioprio set
to something other than IOPRIO_CLASS_NONE. The obvious place may simply
be at the beginning of submit_bio(), before submit_bio_noacct() is
called.

I am tempted to argue that block device drivers should never see any
request with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio
should ever enter the block stack with that ioprio either. With the above
solution, bios from DM targets submitted with submit_bio_noacct() could
still have IOPRIO_CLASS_NONE... So would submit_bio_noacct() be the
better place to call the effective-ioprio helper?
On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> On 2022/06/02 1:04, Jan Kara wrote:
> > On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
> > > The different ioprio leading to no BIO merging is definitely a problem
> > > but this patch is not really fixing anything in my opinion. It simply
> > > gets back to the previous "all 0s" ioprio initialization, which does
> > > not show the places where we actually have missing ioprio
> > > initialization to IOPRIO_DEFAULT.
> >
> > So I agree we should settle on how to treat IOs with an unset IO
> > priority. The behavior of using the task's CPU priority when the IO
> > priority is unset has been there for a *long* time, so I think we should
> > preserve it. The question is where in the stack the switch from "unset"
> > to "effective ioprio" should happen. Switching in the IO context is IMO
> > too early, since userspace needs to be able to differentiate "unset"
> > from "set to IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper
> > like current_effective_ioprio() that does the magic of mangling an unset
> > IO priority based on the task's CPU priority.
>
> I agree that the task's CPU priority is a more sensible default. However,
> I do not understand your point about the IO context being too early to
> set the effective priority. If we do not do that, getting the issuer's
> CPU priority will not be easily possible, right?

I just meant that in the IO context we need to keep the information of
whether the IO priority is set to a particular value, or whether it is
set to 0, meaning "inherit from CPU priority". So we cannot just store
the effective IO priority in the IO context instead of 0.

> > The fact is that bio->bi_ioprio gets set to anything only for direct IO
> > in iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches
> > the priority from the task's IO context and ignores the priority on
> > bios, BTW). And the only places where req->ioprio (inherited from
> > bio->bi_ioprio) gets used are a few drivers that set the HIGHPRI flag
> > for IOPRIO_CLASS_RT IO, and the relatively new code in mq-deadline.
>
> Yes, only IOPRIO_CLASS_RT has an effect at the hardware level right now.
> Everything else is only for the IO scheduler to play with. I am preparing
> a patch series to support scsi/ata command duration limits though. That
> adds a new IOPRIO_CLASS_DL, and that one will also have an effect on the
> hardware.
>
> Note that if a process/thread uses ioprio_set(), we should also honor the
> ioprio set for buffered reads, at least for those page IOs that are
> issued directly from the IO issuer's context (which may include
> readahead... hmmm).

Yes, that would make sense as well.

> > > Isn't it simply that IOPRIO_DEFAULT should be set as the default for
> > > any bio being allocated (in bio_alloc?) before it is set up and
> > > inherits the user IO priority? [...]
> >
> > Yes, I think that would make sense, although as I explain above this is
> > somewhat independent of what the default IO priority behavior should be.
>
> I am OK with using the task CPU priority/ionice value as the default when
> no other ioprio is set for a bio via the user aio_reqprio or
> ioprio_set(). If this relies on task_nice_ioclass() as it does today (I
> see no reason to change that), then the default class for regular tasks
> remains IOPRIO_CLASS_BE, as defined by IOPRIO_DEFAULT.

Yes, good.

> But to avoid the performance regression you observed, we really need to
> be 100% sure that all bios have their ->bi_ioprio field correctly
> initialized. Something like:
>
> [...the bio_set_effective_ioprio() sketch quoted upthread...]
>
> being called before a bio is inserted into a scheduler or bypass-inserted
> into the dispatch queues should result in all bios having an ioprio set
> to something other than IOPRIO_CLASS_NONE. And the obvious place may
> simply be at the beginning of submit_bio(), before submit_bio_noacct()
> is called.
>
> I am tempted to argue that block device drivers should never see any
> request with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio
> should ever enter the block stack with that ioprio either. With the
> above solution, bios from DM targets submitted with submit_bio_noacct()
> could still have IOPRIO_CLASS_NONE... So would submit_bio_noacct() be
> the better place to call the effective-ioprio helper?

Yes, I also think it would be the cleanest if we made sure bio->bi_ioprio
is always set to some value other than IOPRIO_CLASS_NONE. I'll see how we
can make that happen in the least painful way :). Thanks for your input!

								Honza
On Mon 06-06-22 12:42:02, Jan Kara wrote:
> On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> > But to avoid the performance regression you observed, we really need to
> > be 100% sure that all bios have their ->bi_ioprio field correctly
> > initialized. Something like:
> >
> > [...the bio_set_effective_ioprio() sketch quoted upthread...]
> >
> > being called before a bio is inserted into a scheduler or
> > bypass-inserted into the dispatch queues should result in all bios
> > having an ioprio set to something other than IOPRIO_CLASS_NONE. [...]
> > So would submit_bio_noacct() be the better place to call the
> > effective-ioprio helper?
>
> Yes, I also think it would be the cleanest if we made sure bio->bi_ioprio
> is always set to some value other than IOPRIO_CLASS_NONE. I'll see how we
> can make that happen in the least painful way :). Thanks for your input!

When looking into this I've hit a snag: the ioprio rq_qos policy relies
on the fact that bio->bi_ioprio remains 0 (unless explicitly set to some
other value by userspace) until we call rq_qos_track() in
blk_mq_submit_bio(). BTW, this happens after we have attempted to merge
the bio into existing requests, so the ioprio rq_qos policy is going to
show strange behavior wrt merging - most of the bios will not be able to
merge into existing queued requests due to an ioprio mismatch.

I'd say the .track hook gets called too late to properly set
bio->bi_ioprio. Shouldn't we set the IO priority much earlier? I'd be
tempted to use bio_associate_blkg_from_css() for this... What do people
think?

								Honza
On Mon, Jun 06, 2022 at 04:21:36PM +0200, Jan Kara wrote:
> On Mon 06-06-22 12:42:02, Jan Kara wrote:
> > On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> > > [...]
> >
> > Yes, I also think it would be the cleanest if we made sure
> > bio->bi_ioprio is always set to some value other than
> > IOPRIO_CLASS_NONE. I'll see how we can make that happen in the least
> > painful way :). Thanks for your input!
>
> When looking into this I've hit a snag: the ioprio rq_qos policy relies
> on the fact that bio->bi_ioprio remains 0 (unless explicitly set to some
> other value by userspace) until we call rq_qos_track() in
> blk_mq_submit_bio(). BTW, this happens after we have attempted to merge
> the bio into existing requests, so the ioprio rq_qos policy is going to
> show strange behavior wrt merging - most of the bios will not be able to
> merge into existing queued requests due to an ioprio mismatch.
>
> I'd say the .track hook gets called too late to properly set
> bio->bi_ioprio. Shouldn't we set the IO priority much earlier? I'd be
> tempted to use bio_associate_blkg_from_css() for this... What do people
> think?

Hello Jan,

bio_associate_blkg_from_css() is just an empty stub if CONFIG_BLK_CGROUP
is not set.

Having the effective ioprio set correctly shouldn't depend on whether
CONFIG_BLK_CGROUP is set, no?

The function name bio_associate_blkg_from_css() (css ==
cgroup_subsys_state) also seems to imply that it should only perform
cgroup-related things, no?

AFAICT, both bfq and mq-deadline can currently prioritize requests
without CONFIG_BLK_CGROUP enabled.

Kind regards,
Niklas
On Tue 07-06-22 12:13:48, Niklas Cassel wrote:
> On Mon, Jun 06, 2022 at 04:21:36PM +0200, Jan Kara wrote:
> > [...]
> > When looking into this I've hit a snag: the ioprio rq_qos policy relies
> > on the fact that bio->bi_ioprio remains 0 (unless explicitly set to
> > some other value by userspace) until we call rq_qos_track() in
> > blk_mq_submit_bio(). BTW, this happens after we have attempted to merge
> > the bio into existing requests, so the ioprio rq_qos policy is going to
> > show strange behavior wrt merging - most of the bios will not be able
> > to merge into existing queued requests due to an ioprio mismatch.
> >
> > I'd say the .track hook gets called too late to properly set
> > bio->bi_ioprio. Shouldn't we set the IO priority much earlier? I'd be
> > tempted to use bio_associate_blkg_from_css() for this... What do people
> > think?
>
> Hello Jan,
>
> bio_associate_blkg_from_css() is just an empty stub if CONFIG_BLK_CGROUP
> is not set.
>
> Having the effective ioprio set correctly shouldn't depend on whether
> CONFIG_BLK_CGROUP is set, no?
>
> The function name bio_associate_blkg_from_css() (css ==
> cgroup_subsys_state) also seems to imply that it should only perform
> cgroup-related things, no?
>
> AFAICT, both bfq and mq-deadline can currently prioritize requests
> without CONFIG_BLK_CGROUP enabled.

Correct on all points. However, the ioprio rq_qos policy very much
depends on cgroup support, so at least the update of bio->bi_ioprio based
on that policy would make sense in bio_associate_blkg_from_css(). OTOH,
thinking about it now, we would have a problem: if the bio's blkcg
association changes, we don't have enough information to update
bi_ioprio accordingly (since we don't know whether the resulting ioprio
was set by userspace or by the ioprio rq_qos policy).

So we probably need to make the ioprio rq_qos policy set the priority
later (likely at bio submission time, when the bio's blkcg association is
stable), but we need to do it earlier than the point where we try to
merge the bio...

								Honza
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index df9cfe4ca532..63fc02042408 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -247,6 +247,8 @@ static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 	INIT_HLIST_HEAD(&ioc->icq_list);
 	INIT_WORK(&ioc->release_work, ioc_release_fn);
 #endif
+	ioc->ioprio = IOPRIO_DEFAULT;
+
 	return ioc;
 }

diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 774bb90ad668..d9dc78a15301 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -11,7 +11,7 @@
 /*
  * Default IO priority.
  */
-#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
+#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)

 /*
  * Check that a priority value has a valid class.
Commit e70344c05995 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio(): tasks without an IO
context return IOPRIO_DEFAULT priority while tasks with a freshly
allocated IO context return 0 (IOPRIO_CLASS_NONE/0) IO priority. Tasks
without an IO context used to be rare before 5a9d041ba2f6 ("block: move
io_context creation into where it's needed") but after this commit they
became common because now only the BFQ IO scheduler sets up a task's IO
context. A similar inconsistency exists in get_task_ioprio(), so the
problem is now exposed to userspace: userspace will see a different IO
priority for tasks operating on devices with BFQ compared to devices
without BFQ. Furthermore, the changes done by commit e70344c05995 alter
the behavior when no IO priority is set for the BFQ IO scheduler,
behavior which is documented in the ioprio_set(2) manpage - namely that
tasks without a set IO priority will use an IO priority based on their
nice value.

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c05995. Also clean up alloc_io_context() to
explicitly set this IO priority for the allocated IO context.

Fixes: e70344c05995 ("block: fix default IO priority handling")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-ioc.c        | 2 ++
 include/linux/ioprio.h | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)