| Message ID | 20190114192137.111275-1-bo.liu@linux.alibaba.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | blk-iolatency: fix IO hang due to negative inflight counter |
ping? thanks, liubo On Mon, Jan 14, 2019 at 11:25 AM Liu Bo <bo.liu@linux.alibaba.com> wrote: > > Our test reported the following stack, and vmcore showed that > ->inflight counter is -1. > > [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > [ffffc9003fcc3958] schedule at ffffffff8173de26 > [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > > The ->inflight counter may be negative (-1) if > > 0) blk-throttle had been enabled when the IO was issued, so its bio > has a associated blkg, > > 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > in this blkg was not available by then, > > 2) blk-iolatency was enabled before this IO reached its endio, so that > iolatency_grp became available when the IO did the endio. > > 3) the ->inflight counter is decreased from 0 to -1. > > This uses atomic_dec_is_positive() instead to avoid the negative > inflight counter. > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> > --- > block/blk-iolatency.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c > index 159af76d60e6..da4ede5c5ddf 100644 > --- a/block/blk-iolatency.c > +++ b/block/blk-iolatency.c > @@ -660,7 +660,12 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio) > } > rqw = &iolat->rq_wait; > > - atomic_dec(&rqw->inflight); > + /* > + * if blkg is enabled by another throttler (e.g. blk-throttle) > + * and blk-iolatency is enabled while IO is inflight, > + * blk-iolatency's ->inflight counter may be still 0. > + */ > + atomic_dec_if_positive(&rqw->inflight); > if (!enabled || iolat->min_lat_nsec == 0) > goto next; > iolatency_record_time(iolat, &bio->bi_issue, now, > @@ -692,7 +697,8 @@ static void blkcg_iolatency_cleanup(struct rq_qos *rqos, struct bio *bio) > goto next; > > rqw = &iolat->rq_wait; > - atomic_dec(&rqw->inflight); > + /* see comments in blkcg_iolatency_done_io() */ > + atomic_dec_if_positive(&rqw->inflight); > wake_up(&rqw->wait); > next: > blkg = blkg->parent; > -- > 2.20.1.2.gb21ebb6 >
On 1/14/19 12:21 PM, Liu Bo wrote: > Our test reported the following stack, and vmcore showed that > ->inflight counter is -1. > > [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > [ffffc9003fcc3958] schedule at ffffffff8173de26 > [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > > The ->inflight counter may be negative (-1) if > > 0) blk-throttle had been enabled when the IO was issued, so its bio > has a associated blkg, > > 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > in this blkg was not available by then, > > 2) blk-iolatency was enabled before this IO reached its endio, so that > iolatency_grp became available when the IO did the endio. > > 3) the ->inflight counter is decreased from 0 to -1. > > This uses atomic_dec_is_positive() instead to avoid the negative > inflight counter. The problem with that is that it'll hide a lot of other issues, too. Any way we can either track if this rqw is in flight, and only dec if it is, or quiesce when enabling?
On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: > On 1/14/19 12:21 PM, Liu Bo wrote: > > Our test reported the following stack, and vmcore showed that > > ->inflight counter is -1. > > > > [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > > [ffffc9003fcc3958] schedule at ffffffff8173de26 > > [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > > [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > > [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > > [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > > [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > > [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > > [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > > [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > > [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > > [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > > [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > > [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > > [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > > [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > > [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > > [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > > [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > > > > The ->inflight counter may be negative (-1) if > > > > 0) blk-throttle had been enabled when the IO was issued, so its bio > > has a associated blkg, > > > > 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > > in this blkg was not available by then, > > > > 2) blk-iolatency was enabled before this IO reached its endio, so that > > iolatency_grp became available when the IO did the endio. > > > > 3) the ->inflight counter is decreased from 0 to -1. > > > > This uses atomic_dec_is_positive() instead to avoid the negative > > inflight counter. > > The problem with that is that it'll hide a lot of other issues, too. > Any way we can either track if this rqw is in flight, and only dec > if it is, or quiesce when enabling? > I worried about this too, but really the side-effect of allowing more through because of mis-counting means we just let more IO through. I think maybe we add a debug option that we can turn on to see if we're messing up accounting, but in general I don't see a problem with this approach. The problem we're running into here is there's not really a good way to tag a bio as "seen by io.latency." We just have to assume if we're on and there's a bi_blkg associated that we saw it at submit time. We can't just add a flag for every io controller that starts tracking inflight io's, so for now I think this is a reasonable solution. Thanks, Josef
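One possible shape for the debug option mentioned above, shown purely as a hypothetical sketch rather than anything in the posted patch: keep the clamped decrement, but leave a one-time warning when it fires, so that a broken increment elsewhere still shows up in the logs. The snippet assumes it is dropped into blkcg_iolatency_done_bio(), where rqw is already in scope.

```c
/*
 * Hypothetical debug aid, not part of the posted patch: clamp the
 * decrement as before, but warn once if the counter was already zero,
 * so mis-accounting elsewhere does not go completely unnoticed.
 */
if (atomic_dec_if_positive(&rqw->inflight) < 0)
	WARN_ONCE(1, "blk-iolatency: ->inflight underflow avoided\n");
```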
On 1/18/19 9:21 AM, Josef Bacik wrote: > On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: >> On 1/14/19 12:21 PM, Liu Bo wrote: >>> Our test reported the following stack, and vmcore showed that >>> ->inflight counter is -1. >>> >>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d >>> [ffffc9003fcc3958] schedule at ffffffff8173de26 >>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 >>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb >>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 >>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a >>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 >>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d >>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] >>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] >>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae >>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 >>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 >>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] >>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b >>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd >>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 >>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 >>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e >>> >>> The ->inflight counter may be negative (-1) if >>> >>> 0) blk-throttle had been enabled when the IO was issued, so its bio >>> has a associated blkg, >>> >>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp >>> in this blkg was not available by then, >>> >>> 2) blk-iolatency was enabled before this IO reached its endio, so that >>> iolatency_grp became available when the IO did the endio. >>> >>> 3) the ->inflight counter is decreased from 0 to -1. >>> >>> This uses atomic_dec_is_positive() instead to avoid the negative >>> inflight counter. >> >> The problem with that is that it'll hide a lot of other issues, too. >> Any way we can either track if this rqw is in flight, and only dec >> if it is, or quiesce when enabling? >> > > I worried about this too, but really the side-effect of allowing more through > because of mis-counting means we just let more IO through. I think maybe we add > a debug option that we can turn on to see if we're messing up accounting, but in > general I don't see a problem with this approach. The problem is that a problem in accounting elsewhere (like missing increment) will now go unnoticed, which could completely screw it up. An occasional blip like the one described is totally fine, but that's not guaranteed to be the case. > The problem we're running into here is there's not really a good way to tag a > bio as "seen by io.latency." We just have to assume if we're on and there's a > bi_blkg associated that we saw it at submit time. We can't just add a flag for > every io controller that starts tracking inflight io's, so for now I think this > is a reasonable solution. Thanks, Can we quiesce instead when enabling/disabling?
On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: > On 1/18/19 9:21 AM, Josef Bacik wrote: > > On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: > >> On 1/14/19 12:21 PM, Liu Bo wrote: > >>> Our test reported the following stack, and vmcore showed that > >>> ->inflight counter is -1. > >>> > >>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > >>> [ffffc9003fcc3958] schedule at ffffffff8173de26 > >>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > >>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > >>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > >>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > >>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > >>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > >>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > >>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > >>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > >>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > >>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > >>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > >>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > >>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > >>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > >>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > >>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > >>> > >>> The ->inflight counter may be negative (-1) if > >>> > >>> 0) blk-throttle had been enabled when the IO was issued, so its bio > >>> has a associated blkg, > >>> > >>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > >>> in this blkg was not available by then, > >>> > >>> 2) blk-iolatency was enabled before this IO reached its endio, so that > >>> iolatency_grp became available when the IO did the endio. > >>> > >>> 3) the ->inflight counter is decreased from 0 to -1. > >>> > >>> This uses atomic_dec_is_positive() instead to avoid the negative > >>> inflight counter. > >> > >> The problem with that is that it'll hide a lot of other issues, too. > >> Any way we can either track if this rqw is in flight, and only dec > >> if it is, or quiesce when enabling? > >> > > > > I worried about this too, but really the side-effect of allowing more through > > because of mis-counting means we just let more IO through. I think maybe we add > > a debug option that we can turn on to see if we're messing up accounting, but in > > general I don't see a problem with this approach. > > The problem is that a problem in accounting elsewhere (like missing increment) > will now go unnoticed, which could completely screw it up. An occasional > blip like the one described is totally fine, but that's not guaranteed to be > the case. > Yeah I agree it's kind of shitty. > > The problem we're running into here is there's not really a good way to tag a > > bio as "seen by io.latency." We just have to assume if we're on and there's a > > bi_blkg associated that we saw it at submit time. We can't just add a flag for > > every io controller that starts tracking inflight io's, so for now I think this > > is a reasonable solution. Thanks, > > Can we quiesce instead when enabling/disabling? Actually that's not a bad idea, you want to look into that Liu? That way we can make sure everything is correct always. Thanks, Josef
On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@toxicpanda.com> wrote: > > On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: > > On 1/18/19 9:21 AM, Josef Bacik wrote: > > > On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: > > >> On 1/14/19 12:21 PM, Liu Bo wrote: > > >>> Our test reported the following stack, and vmcore showed that > > >>> ->inflight counter is -1. > > >>> > > >>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > > >>> [ffffc9003fcc3958] schedule at ffffffff8173de26 > > >>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > > >>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > > >>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > > >>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > > >>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > > >>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > > >>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > > >>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > > >>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > > >>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > > >>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > > >>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > > >>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > > >>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > > >>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > > >>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > > >>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > > >>> > > >>> The ->inflight counter may be negative (-1) if > > >>> > > >>> 0) blk-throttle had been enabled when the IO was issued, so its bio > > >>> has a associated blkg, > > >>> > > >>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > > >>> in this blkg was not available by then, > > >>> > > >>> 2) blk-iolatency was enabled before this IO reached its endio, so that > > >>> iolatency_grp became available when the IO did the endio. > > >>> > > >>> 3) the ->inflight counter is decreased from 0 to -1. > > >>> > > >>> This uses atomic_dec_is_positive() instead to avoid the negative > > >>> inflight counter. > > >> > > >> The problem with that is that it'll hide a lot of other issues, too. > > >> Any way we can either track if this rqw is in flight, and only dec > > >> if it is, or quiesce when enabling? > > >> > > > > > > I worried about this too, but really the side-effect of allowing more through > > > because of mis-counting means we just let more IO through. I think maybe we add > > > a debug option that we can turn on to see if we're messing up accounting, but in > > > general I don't see a problem with this approach. > > > > The problem is that a problem in accounting elsewhere (like missing increment) > > will now go unnoticed, which could completely screw it up. An occasional > > blip like the one described is totally fine, but that's not guaranteed to be > > the case. > > > > Yeah I agree it's kind of shitty. > > > > The problem we're running into here is there's not really a good way to tag a > > > bio as "seen by io.latency." We just have to assume if we're on and there's a > > > bi_blkg associated that we saw it at submit time. We can't just add a flag for > > > every io controller that starts tracking inflight io's, so for now I think this > > > is a reasonable solution. Thanks, > > > > Can we quiesce instead when enabling/disabling? 
> > Actually that's not a bad idea, you want to look into that Liu? That way we can > make sure everything is correct always. Thanks, > Just to confirm, does "quiesce when enabling/disabling" mean "waiting for inflight IOs to complete when enabling/disabling iolatency"? thanks, liubo
On 1/18/19 6:39 PM, Liu Bo wrote: > On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@toxicpanda.com> wrote: >> >> On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: >>> On 1/18/19 9:21 AM, Josef Bacik wrote: >>>> On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: >>>>> On 1/14/19 12:21 PM, Liu Bo wrote: >>>>>> Our test reported the following stack, and vmcore showed that >>>>>> ->inflight counter is -1. >>>>>> >>>>>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d >>>>>> [ffffc9003fcc3958] schedule at ffffffff8173de26 >>>>>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 >>>>>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb >>>>>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 >>>>>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a >>>>>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 >>>>>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d >>>>>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] >>>>>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] >>>>>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae >>>>>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 >>>>>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 >>>>>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] >>>>>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b >>>>>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd >>>>>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 >>>>>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 >>>>>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e >>>>>> >>>>>> The ->inflight counter may be negative (-1) if >>>>>> >>>>>> 0) blk-throttle had been enabled when the IO was issued, so its bio >>>>>> has a associated blkg, >>>>>> >>>>>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp >>>>>> in this blkg was not available by then, >>>>>> >>>>>> 2) blk-iolatency was enabled before this IO reached its endio, so that >>>>>> iolatency_grp became available when the IO did the endio. >>>>>> >>>>>> 3) the ->inflight counter is decreased from 0 to -1. >>>>>> >>>>>> This uses atomic_dec_is_positive() instead to avoid the negative >>>>>> inflight counter. >>>>> >>>>> The problem with that is that it'll hide a lot of other issues, too. >>>>> Any way we can either track if this rqw is in flight, and only dec >>>>> if it is, or quiesce when enabling? >>>>> >>>> >>>> I worried about this too, but really the side-effect of allowing more through >>>> because of mis-counting means we just let more IO through. I think maybe we add >>>> a debug option that we can turn on to see if we're messing up accounting, but in >>>> general I don't see a problem with this approach. >>> >>> The problem is that a problem in accounting elsewhere (like missing increment) >>> will now go unnoticed, which could completely screw it up. An occasional >>> blip like the one described is totally fine, but that's not guaranteed to be >>> the case. >>> >> >> Yeah I agree it's kind of shitty. >> >>>> The problem we're running into here is there's not really a good way to tag a >>>> bio as "seen by io.latency." We just have to assume if we're on and there's a >>>> bi_blkg associated that we saw it at submit time. We can't just add a flag for >>>> every io controller that starts tracking inflight io's, so for now I think this >>>> is a reasonable solution. Thanks, >>> >>> Can we quiesce instead when enabling/disabling? 
>> >> Actually that's not a bad idea, you want to look into that Liu? That way we can >> make sure everything is correct always. Thanks, >> > > Just to confirm, does "quiesce when enabling/disabling" mean "waiting > for inflight IOs to complete when enabling/disabling iolatency"? Precisely.
On Fri, Jan 18, 2019 at 5:51 PM Jens Axboe <axboe@kernel.dk> wrote: > > On 1/18/19 6:39 PM, Liu Bo wrote: > > On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@toxicpanda.com> wrote: > >> > >> On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: > >>> On 1/18/19 9:21 AM, Josef Bacik wrote: > >>>> On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: > >>>>> On 1/14/19 12:21 PM, Liu Bo wrote: > >>>>>> Our test reported the following stack, and vmcore showed that > >>>>>> ->inflight counter is -1. > >>>>>> > >>>>>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > >>>>>> [ffffc9003fcc3958] schedule at ffffffff8173de26 > >>>>>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > >>>>>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > >>>>>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > >>>>>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > >>>>>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > >>>>>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > >>>>>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > >>>>>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > >>>>>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > >>>>>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > >>>>>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > >>>>>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > >>>>>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > >>>>>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > >>>>>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > >>>>>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > >>>>>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > >>>>>> > >>>>>> The ->inflight counter may be negative (-1) if > >>>>>> > >>>>>> 0) blk-throttle had been enabled when the IO was issued, so its bio > >>>>>> has a associated blkg, > >>>>>> > >>>>>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > >>>>>> in this blkg was not available by then, > >>>>>> > >>>>>> 2) blk-iolatency was enabled before this IO reached its endio, so that > >>>>>> iolatency_grp became available when the IO did the endio. > >>>>>> > >>>>>> 3) the ->inflight counter is decreased from 0 to -1. > >>>>>> > >>>>>> This uses atomic_dec_is_positive() instead to avoid the negative > >>>>>> inflight counter. > >>>>> > >>>>> The problem with that is that it'll hide a lot of other issues, too. > >>>>> Any way we can either track if this rqw is in flight, and only dec > >>>>> if it is, or quiesce when enabling? > >>>>> > >>>> > >>>> I worried about this too, but really the side-effect of allowing more through > >>>> because of mis-counting means we just let more IO through. I think maybe we add > >>>> a debug option that we can turn on to see if we're messing up accounting, but in > >>>> general I don't see a problem with this approach. > >>> > >>> The problem is that a problem in accounting elsewhere (like missing increment) > >>> will now go unnoticed, which could completely screw it up. An occasional > >>> blip like the one described is totally fine, but that's not guaranteed to be > >>> the case. > >>> > >> > >> Yeah I agree it's kind of shitty. > >> > >>>> The problem we're running into here is there's not really a good way to tag a > >>>> bio as "seen by io.latency." We just have to assume if we're on and there's a > >>>> bi_blkg associated that we saw it at submit time. 
We can't just add a flag for > >>>> every io controller that starts tracking inflight io's, so for now I think this > >>>> is a reasonable solution. Thanks, > >>> > >>> Can we quiesce instead when enabling/disabling? > >> > >> Actually that's not a bad idea, you want to look into that Liu? That way we can > >> make sure everything is correct always. Thanks, > >> > > > > Just to confirm, does "quiesce when enabling/disabling" mean "waiting > > for inflight IOs to complete when enabling/disabling iolatency"? > So it might not be an easy thing to quiesce queue because... iolatency_set_limit() blkg_conf_prep() // return with rcu and queue's lock neither freeze_queue nor quiesce_queue is allowed in rcu reader side context. If we do it outside of blkg_conf_prep/finish(), I feel like the overhead is more than it worths. Any thoughts? thanks, liubo > Precisely. > > -- > Jens Axboe >
On 1/23/19 12:52 AM, Liu Bo wrote: > On Fri, Jan 18, 2019 at 5:51 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> On 1/18/19 6:39 PM, Liu Bo wrote: >>> On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@toxicpanda.com> wrote: >>>> >>>> On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: >>>>> On 1/18/19 9:21 AM, Josef Bacik wrote: >>>>>> On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: >>>>>>> On 1/14/19 12:21 PM, Liu Bo wrote: >>>>>>>> Our test reported the following stack, and vmcore showed that >>>>>>>> ->inflight counter is -1. >>>>>>>> >>>>>>>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d >>>>>>>> [ffffc9003fcc3958] schedule at ffffffff8173de26 >>>>>>>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 >>>>>>>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb >>>>>>>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 >>>>>>>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a >>>>>>>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 >>>>>>>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d >>>>>>>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] >>>>>>>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] >>>>>>>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae >>>>>>>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 >>>>>>>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 >>>>>>>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] >>>>>>>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b >>>>>>>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd >>>>>>>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 >>>>>>>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 >>>>>>>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e >>>>>>>> >>>>>>>> The ->inflight counter may be negative (-1) if >>>>>>>> >>>>>>>> 0) blk-throttle had been enabled when the IO was issued, so its bio >>>>>>>> has a associated blkg, >>>>>>>> >>>>>>>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp >>>>>>>> in this blkg was not available by then, >>>>>>>> >>>>>>>> 2) blk-iolatency was enabled before this IO reached its endio, so that >>>>>>>> iolatency_grp became available when the IO did the endio. >>>>>>>> >>>>>>>> 3) the ->inflight counter is decreased from 0 to -1. >>>>>>>> >>>>>>>> This uses atomic_dec_is_positive() instead to avoid the negative >>>>>>>> inflight counter. >>>>>>> >>>>>>> The problem with that is that it'll hide a lot of other issues, too. >>>>>>> Any way we can either track if this rqw is in flight, and only dec >>>>>>> if it is, or quiesce when enabling? >>>>>>> >>>>>> >>>>>> I worried about this too, but really the side-effect of allowing more through >>>>>> because of mis-counting means we just let more IO through. I think maybe we add >>>>>> a debug option that we can turn on to see if we're messing up accounting, but in >>>>>> general I don't see a problem with this approach. >>>>> >>>>> The problem is that a problem in accounting elsewhere (like missing increment) >>>>> will now go unnoticed, which could completely screw it up. An occasional >>>>> blip like the one described is totally fine, but that's not guaranteed to be >>>>> the case. >>>>> >>>> >>>> Yeah I agree it's kind of shitty. >>>> >>>>>> The problem we're running into here is there's not really a good way to tag a >>>>>> bio as "seen by io.latency." 
We just have to assume if we're on and there's a >>>>>> bi_blkg associated that we saw it at submit time. We can't just add a flag for >>>>>> every io controller that starts tracking inflight io's, so for now I think this >>>>>> is a reasonable solution. Thanks, >>>>> >>>>> Can we quiesce instead when enabling/disabling? >>>> >>>> Actually that's not a bad idea, you want to look into that Liu? That way we can >>>> make sure everything is correct always. Thanks, >>>> >>> >>> Just to confirm, does "quiesce when enabling/disabling" mean "waiting >>> for inflight IOs to complete when enabling/disabling iolatency"? >> > > So it might not be an easy thing to quiesce queue because... > > iolatency_set_limit() > blkg_conf_prep() // return with rcu and queue's lock > > neither freeze_queue nor quiesce_queue is allowed in rcu reader side context. > > If we do it outside of blkg_conf_prep/finish(), I feel like the > overhead is more than it worths. I think it's hard to say it's more overhead than it's worth, when the current situation is clearly buggy and the proposed fix has definite flaws. The only other alternative I can see is to have a flag per throttling entity, which isn't going to work due to flag shortage. So I'd very much encourage you to continue going down the path of making quiesce work. You could potentially do a callback based solution, if the context within the prep/finish is too limiting?
On Tue, Jan 22, 2019 at 11:52:06PM -0800, Liu Bo wrote: > On Fri, Jan 18, 2019 at 5:51 PM Jens Axboe <axboe@kernel.dk> wrote: > > > > On 1/18/19 6:39 PM, Liu Bo wrote: > > > On Fri, Jan 18, 2019 at 8:43 AM Josef Bacik <josef@toxicpanda.com> wrote: > > >> > > >> On Fri, Jan 18, 2019 at 09:28:06AM -0700, Jens Axboe wrote: > > >>> On 1/18/19 9:21 AM, Josef Bacik wrote: > > >>>> On Fri, Jan 18, 2019 at 05:58:18AM -0700, Jens Axboe wrote: > > >>>>> On 1/14/19 12:21 PM, Liu Bo wrote: > > >>>>>> Our test reported the following stack, and vmcore showed that > > >>>>>> ->inflight counter is -1. > > >>>>>> > > >>>>>> [ffffc9003fcc38d0] __schedule at ffffffff8173d95d > > >>>>>> [ffffc9003fcc3958] schedule at ffffffff8173de26 > > >>>>>> [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 > > >>>>>> [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb > > >>>>>> [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 > > >>>>>> [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a > > >>>>>> [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 > > >>>>>> [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d > > >>>>>> [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] > > >>>>>> [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] > > >>>>>> [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae > > >>>>>> [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 > > >>>>>> [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 > > >>>>>> [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] > > >>>>>> [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b > > >>>>>> [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd > > >>>>>> [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 > > >>>>>> [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 > > >>>>>> [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e > > >>>>>> > > >>>>>> The ->inflight counter may be negative (-1) if > > >>>>>> > > >>>>>> 0) blk-throttle had been enabled when the IO was issued, so its bio > > >>>>>> has a associated blkg, > > >>>>>> > > >>>>>> 1) blk-iolatency was disabled when the IO was issued, so iolatency_grp > > >>>>>> in this blkg was not available by then, > > >>>>>> > > >>>>>> 2) blk-iolatency was enabled before this IO reached its endio, so that > > >>>>>> iolatency_grp became available when the IO did the endio. > > >>>>>> > > >>>>>> 3) the ->inflight counter is decreased from 0 to -1. > > >>>>>> > > >>>>>> This uses atomic_dec_is_positive() instead to avoid the negative > > >>>>>> inflight counter. > > >>>>> > > >>>>> The problem with that is that it'll hide a lot of other issues, too. > > >>>>> Any way we can either track if this rqw is in flight, and only dec > > >>>>> if it is, or quiesce when enabling? > > >>>>> > > >>>> > > >>>> I worried about this too, but really the side-effect of allowing more through > > >>>> because of mis-counting means we just let more IO through. I think maybe we add > > >>>> a debug option that we can turn on to see if we're messing up accounting, but in > > >>>> general I don't see a problem with this approach. > > >>> > > >>> The problem is that a problem in accounting elsewhere (like missing increment) > > >>> will now go unnoticed, which could completely screw it up. An occasional > > >>> blip like the one described is totally fine, but that's not guaranteed to be > > >>> the case. > > >>> > > >> > > >> Yeah I agree it's kind of shitty. 
> > >> > > >>>> The problem we're running into here is there's not really a good way to tag a > > >>>> bio as "seen by io.latency." We just have to assume if we're on and there's a > > >>>> bi_blkg associated that we saw it at submit time. We can't just add a flag for > > >>>> every io controller that starts tracking inflight io's, so for now I think this > > >>>> is a reasonable solution. Thanks, > > >>> > > >>> Can we quiesce instead when enabling/disabling? > > >> > > >> Actually that's not a bad idea, you want to look into that Liu? That way we can > > >> make sure everything is correct always. Thanks, > > >> > > > > > > Just to confirm, does "quiesce when enabling/disabling" mean "waiting > > > for inflight IOs to complete when enabling/disabling iolatency"? > > > > So it might not be an easy thing to quiesce queue because... > > iolatency_set_limit() > blkg_conf_prep() // return with rcu and queue's lock > > neither freeze_queue nor quiesce_queue is allowed in rcu reader side context. > > If we do it outside of blkg_conf_prep/finish(), I feel like the > overhead is more than it worths. I had to do this for io.weight yesterday, I just put it in blkg_conf_prep/finish. I don't think that's unreasonable, it's not like we mess with settings a lot, it'll get setup on cgroup creation before something is really doing IO, and saves us if things are running currently. Thanks, Josef
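For reference, a rough sketch of what "quiesce around the enable/disable" could look like. This is hypothetical and not the actual io.latency or io.weight change; the helper name and the callback shape are made up, and it assumes the freeze happens outside the rcu_read_lock()/queue_lock section that blkg_conf_prep() returns with, since blk_mq_freeze_queue() may sleep.

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/*
 * Hypothetical helper, not taken from any posted patch: drain every
 * in-flight request before the policy's enable/disable state flips,
 * so a completion can never see an iolatency_grp whose inflight
 * counter it did not increment at submit time.
 */
static void iolatency_update_quiesced(struct request_queue *q,
				      void (*update)(void *data), void *data)
{
	blk_mq_freeze_queue(q);		/* waits for outstanding requests */
	update(data);			/* flip enabled state, reset counters */
	blk_mq_unfreeze_queue(q);
}
```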
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 159af76d60e6..da4ede5c5ddf 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -660,7 +660,12 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
 		}
 		rqw = &iolat->rq_wait;
 
-		atomic_dec(&rqw->inflight);
+		/*
+		 * If the blkg was set up by another throttler (e.g. blk-throttle)
+		 * and blk-iolatency was enabled while this IO was in flight,
+		 * blk-iolatency's ->inflight counter may still be 0.
+		 */
+		atomic_dec_if_positive(&rqw->inflight);
 		if (!enabled || iolat->min_lat_nsec == 0)
 			goto next;
 		iolatency_record_time(iolat, &bio->bi_issue, now,
@@ -692,7 +697,8 @@ static void blkcg_iolatency_cleanup(struct rq_qos *rqos, struct bio *bio)
 			goto next;
 
 		rqw = &iolat->rq_wait;
-		atomic_dec(&rqw->inflight);
+		/* see the comment in blkcg_iolatency_done_bio() */
+		atomic_dec_if_positive(&rqw->inflight);
 		wake_up(&rqw->wait);
 next:
 		blkg = blkg->parent;
Our test reported the following stack, and vmcore showed that the
->inflight counter is -1.

[ffffc9003fcc38d0] __schedule at ffffffff8173d95d
[ffffc9003fcc3958] schedule at ffffffff8173de26
[ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
[ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
[ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
[ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
[ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
[ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
[ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
[ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
[ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
[ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
[ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
[ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
[ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
[ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
[ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
[ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
[ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e

The ->inflight counter may become negative (-1) if:

0) blk-throttle had been enabled when the IO was issued, so its bio
   has an associated blkg,

1) blk-iolatency was disabled when the IO was issued, so the
   iolatency_grp in this blkg was not available by then,

2) blk-iolatency was enabled before this IO reached its endio, so the
   iolatency_grp became available when the IO did its endio,

3) the ->inflight counter is decreased from 0 to -1.

Use atomic_dec_if_positive() instead to avoid the negative inflight
counter.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 block/blk-iolatency.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
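To make the failure mode concrete, below is a minimal userspace sketch (plain C11 atomics, not kernel code, and all names are illustrative) of the sequence in the commit message: the submit-side increment is skipped because io.latency was off, but the completion side still decrements. A plain decrement drives the counter to -1, while a dec-if-positive helper that mirrors the kernel's atomic_dec_if_positive() semantics leaves it at 0.

```c
#include <stdatomic.h>
#include <stdio.h>

/*
 * Same effect as the kernel's atomic_dec_if_positive(): decrement only
 * if the old value is positive, and return the old value minus one
 * whether or not the decrement happened.
 */
static int dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 0) {
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			break;
	}
	return old - 1;
}

int main(void)
{
	atomic_int inflight = 0;	/* stands in for rqw->inflight */

	/* submit side: io.latency disabled, so no increment happens */

	/* completion side: io.latency was enabled while the IO was in flight */
	atomic_fetch_sub(&inflight, 1);		/* plain dec -> -1 */
	printf("plain dec:       %d\n", atomic_load(&inflight));

	atomic_store(&inflight, 0);
	dec_if_positive(&inflight);		/* guarded dec -> 0 */
	printf("dec_if_positive: %d\n", atomic_load(&inflight));
	return 0;
}
```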