Message ID | 20170315215107.5628-6-rgoldwyn@suse.de
---|---
State | New, archived
On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0eeb99e..2e5cba2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>          do {
>                  struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> -                if (likely(blk_queue_enter(q, false) == 0)) {
> +                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>                          struct bio_list hold;
>                          struct bio_list lower, same;
>
> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>                          bio_list_merge(&bio_list_on_stack, &same);
>                          bio_list_merge(&bio_list_on_stack, &hold);
>                  } else {
> -                        bio_io_error(bio);
> +                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
> +                                bio_wouldblock_error(bio);
> +                        else
> +                                bio_io_error(bio);

This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
happened to be set?

And you're missing wbt_wait() as a blocking point as well. Ditto in
blk-mq.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..942ce8c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>          rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>          if (unlikely(!rq)) {
>                  __wbt_done(q->rq_wb, wb_acct);
> +                if (bio && bio_flagged(bio, BIO_NOWAIT))
> +                        bio_wouldblock_error(bio);
>                  return BLK_QC_T_NONE;
>          }

This seems a little fragile now, since the two paths don't both free
the bio.
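For illustration, a minimal sketch of the distinction Jens is asking
for: map a refused queue entry to -EAGAIN only while the queue is still
alive, and keep -EIO for a dying queue. The helper name is invented for
illustration and is not part of the posted patch.

    /*
     * Sketch only: bio_enter_failed() is a hypothetical helper. Fail a
     * bio whose blk_queue_enter() was refused, distinguishing a dying
     * queue (hard -EIO) from a "would block" refusal (-EAGAIN).
     */
    static void bio_enter_failed(struct request_queue *q, struct bio *bio)
    {
            if (bio_flagged(bio, BIO_NOWAIT) && !blk_queue_dying(q))
                    bio_wouldblock_error(bio);  /* -EAGAIN, caller may retry */
            else
                    bio_io_error(bio);          /* -EIO, hard failure */
    }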
On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> A new flag BIO_NOWAIT is introduced to identify bios originating
> from an iocb with IOCB_NOWAIT. This flag indicates that we should
> return immediately if a request cannot be made, instead of retrying.

So this makes a congested block device run the bio IO completion
callback with an -EAGAIN error present? Are all the filesystem direct
IO submission and completion routines OK with that? i.e. does such a
congestion case cause filesystems to temporarily expose stale data to
unprivileged users when the IO is requeued in this way?

e.g. the filesystem does allocation without blocking, submits the bio,
the device is congested, IO completion runs with an error, so nothing
is written to the allocated blocks; the write gets queued, another
read comes in while the write is queued, and it reads data from the
uninitialised blocks that were allocated during the write....

Seems kinda problematic to me to have an undocumented design
constraint (i.e. a landmine) where we submit the AIO only to have it
error out, and then expect the filesystem to do something special and
different /without blocking/ on EAGAIN.

Why isn't the congestion check at a higher layer, like we do for page
cache readahead? i.e. using the bdi*congested() API at the time we are
doing all the other filesystem blocking checks.

-Dave.
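A rough sketch of what Dave is proposing, using the bdi congestion API
of this era (inode_to_bdi(), bdi_read_congested(),
bdi_write_congested()); the helper and its exact placement in a
filesystem's submission path are hypothetical:

    /*
     * Hypothetical up-front check in a filesystem's direct IO
     * submission routine: bail out with -EAGAIN before any block
     * allocation is done, rather than erroring the bio afterwards.
     */
    static int nowait_check_congestion(struct kiocb *iocb, struct inode *inode)
    {
            struct backing_dev_info *bdi = inode_to_bdi(inode);

            if (!(iocb->ki_flags & IOCB_NOWAIT))
                    return 0;
            /* a write may also need reads (e.g. RMW), so check both */
            if (bdi_write_congested(bdi) || bdi_read_congested(bdi))
                    return -EAGAIN;
            return 0;
    }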
On Thu, Mar 16, 2017 at 10:33 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>          do {
>>                  struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> -                if (likely(blk_queue_enter(q, false) == 0)) {
>> +                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>                          struct bio_list hold;
>>                          struct bio_list lower, same;
>>
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>                          bio_list_merge(&bio_list_on_stack, &same);
>>                          bio_list_merge(&bio_list_on_stack, &hold);
>>                  } else {
>> -                        bio_io_error(bio);
>> +                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +                                bio_wouldblock_error(bio);
>> +                        else
>> +                                bio_io_error(bio);
>
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?
>
> And you're missing wbt_wait() as a blocking point as well. Ditto in
> blk-mq.

There are also tons of blocking points in dm/md/bcache, which need to
be considered, or at least documented. :-)

Thanks,
Ming Lei
On 03/16/2017 04:31 PM, Dave Chinner wrote:
> On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> A new flag BIO_NOWAIT is introduced to identify bios originating
>> from an iocb with IOCB_NOWAIT. This flag indicates that we should
>> return immediately if a request cannot be made, instead of retrying.
>
> So this makes a congested block device run the bio IO completion
> callback with an -EAGAIN error present? Are all the filesystem direct
> IO submission and completion routines OK with that? i.e. does such a
> congestion case cause filesystems to temporarily expose stale data to
> unprivileged users when the IO is requeued in this way?
>
> e.g. the filesystem does allocation without blocking, submits the bio,
> the device is congested, IO completion runs with an error, so nothing
> is written to the allocated blocks; the write gets queued, another
> read comes in while the write is queued, and it reads data from the
> uninitialised blocks that were allocated during the write....
>
> Seems kinda problematic to me to have an undocumented design
> constraint (i.e. a landmine) where we submit the AIO only to have it
> error out, and then expect the filesystem to do something special and
> different /without blocking/ on EAGAIN.

If the filesystem has to perform block allocation, we would return
-EAGAIN early enough. However, I agree there is a problem, since not
all filesystems know this. I worked on only three of them.

> Why isn't the congestion check at a higher layer, like we do for page
> cache readahead? i.e. using the bdi*congested() API at the time we are
> doing all the other filesystem blocking checks.

Yes, that may work better. We would have to call bdi_read_congested()
on the write path as well (and comment that part of the code). Would
it encompass all possible waits in the block layer?
On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>          do {
>>                  struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> -                if (likely(blk_queue_enter(q, false) == 0)) {
>> +                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>                          struct bio_list hold;
>>                          struct bio_list lower, same;
>>
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>                          bio_list_merge(&bio_list_on_stack, &same);
>>                          bio_list_merge(&bio_list_on_stack, &hold);
>>                  } else {
>> -                        bio_io_error(bio);
>> +                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +                                bio_wouldblock_error(bio);
>> +                        else
>> +                                bio_io_error(bio);
>
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?

Yes, need a check for that.

> And you're missing wbt_wait() as a blocking point as well. Ditto in
> blk-mq.

Agree.

>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>          rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>          if (unlikely(!rq)) {
>>                  __wbt_done(q->rq_wb, wb_acct);
>> +                if (bio && bio_flagged(bio, BIO_NOWAIT))
>> +                        bio_wouldblock_error(bio);
>>                  return BLK_QC_T_NONE;
>>          }
>
> This seems a little fragile now, since the two paths don't both free
> the bio.

bio_put() is handled in dio_bio_complete(). However, I am not sure why
there is no bio_endio() call for !rq; IIRC, it was not meant to be
there.
On Fri 17-03-17 07:23:24, Goldwyn Rodrigues wrote:
> On 03/16/2017 04:31 PM, Dave Chinner wrote:
> > On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >>
> >> A new flag BIO_NOWAIT is introduced to identify bios originating
> >> from an iocb with IOCB_NOWAIT. This flag indicates that we should
> >> return immediately if a request cannot be made, instead of retrying.
> >
> > So this makes a congested block device run the bio IO completion
> > callback with an -EAGAIN error present? Are all the filesystem direct
> > IO submission and completion routines OK with that? i.e. does such a
> > congestion case cause filesystems to temporarily expose stale data to
> > unprivileged users when the IO is requeued in this way?
> >
> > e.g. the filesystem does allocation without blocking, submits the bio,
> > the device is congested, IO completion runs with an error, so nothing
> > is written to the allocated blocks; the write gets queued, another
> > read comes in while the write is queued, and it reads data from the
> > uninitialised blocks that were allocated during the write....
> >
> > Seems kinda problematic to me to have an undocumented design
> > constraint (i.e. a landmine) where we submit the AIO only to have it
> > error out, and then expect the filesystem to do something special and
> > different /without blocking/ on EAGAIN.
>
> If the filesystem has to perform block allocation, we would return
> -EAGAIN early enough. However, I agree there is a problem, since not
> all filesystems know this. I worked on only three of them.

So Dave has a good point about the missing documentation explaining the
expectation from the filesystem. Other than that, the filesystem is
responsible for determining in its direct IO submission routine whether
it can deal with NOWAIT direct IO without blocking (including backing
out if the block layer refuses to do the IO) and, if not, just
returning EAGAIN right away. Probably we should also have a feature
flag in struct file_system_type telling whether a particular filesystem
knows about NOWAIT direct IO at all, so that we don't have to add a
stub returning EAGAIN when NOWAIT is set to each and every filesystem
supporting direct IO (OTOH, adding that stub won't be that hard either,
since there are not *that* many filesystems supporting direct IO). But
one or the other has to be done.

> > Why isn't the congestion check at a higher layer, like we do for page
> > cache readahead? i.e. using the bdi*congested() API at the time we are
> > doing all the other filesystem blocking checks.
>
> Yes, that may work better. We would have to call bdi_read_congested()
> on the write path as well (and comment that part of the code). Would
> it encompass all possible waits in the block layer?

The congestion mechanism is not really reliable. The bdi can show the
device is not congested, but the block layer will still block you
waiting for some resource. And for some of the stuff, like tag
allocation, you cannot really do a reliable check in advance, so I
don't think congestion checks are really a viable alternative.

								Honza
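Jan's feature-flag idea might look roughly like the sketch below;
FS_NOWAIT_DIO, its value, and the helper are invented for illustration,
and the errno choice is an open question:

    /* Hypothetical flag for struct file_system_type's fs_flags */
    #define FS_NOWAIT_DIO   0x08000000      /* fs understands NOWAIT DIO */

    static int nowait_dio_supported(struct kiocb *iocb)
    {
            struct super_block *sb = file_inode(iocb->ki_filp)->i_sb;

            if ((iocb->ki_flags & IOCB_NOWAIT) &&
                !(sb->s_type->fs_flags & FS_NOWAIT_DIO))
                    return -EOPNOTSUPP;     /* fs cannot honor NOWAIT */
            return 0;
    }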
On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>          do {
>>                  struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> -                if (likely(blk_queue_enter(q, false) == 0)) {
>> +                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>                          struct bio_list hold;
>>                          struct bio_list lower, same;
>>
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>                          bio_list_merge(&bio_list_on_stack, &same);
>>                          bio_list_merge(&bio_list_on_stack, &hold);
>>                  } else {
>> -                        bio_io_error(bio);
>> +                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +                                bio_wouldblock_error(bio);
>> +                        else
>> +                                bio_io_error(bio);
>
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?

Yes, I need to add a condition here to check for blk_queue_dying().
Thanks.

> And you're missing wbt_wait() as a blocking point as well. Ditto in
> blk-mq.

wbt_wait() does not apply to WRITE_ODIRECT.

>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>          rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>          if (unlikely(!rq)) {
>>                  __wbt_done(q->rq_wb, wb_acct);
>> +                if (bio && bio_flagged(bio, BIO_NOWAIT))
>> +                        bio_wouldblock_error(bio);
>>                  return BLK_QC_T_NONE;
>>          }
>
> This seems a little fragile now, since the two paths don't both free
> the bio.

Direct I/O should free the bios in dio_bio_complete(). I am not sure
why the bio was not freed here originally, but IIRC, this path was for
bio == NULL only. So, with this patch, we would get rq == NULL here and
hence the bio_wouldblock_error() call.
On 03/24/2017 05:32 AM, Goldwyn Rodrigues wrote:
>
>
> On 03/16/2017 09:33 AM, Jens Axboe wrote:
>> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0eeb99e..2e5cba2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>          do {
>>>                  struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>>
>>> -                if (likely(blk_queue_enter(q, false) == 0)) {
>>> +                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>>                          struct bio_list hold;
>>>                          struct bio_list lower, same;
>>>
>>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>                          bio_list_merge(&bio_list_on_stack, &same);
>>>                          bio_list_merge(&bio_list_on_stack, &hold);
>>>                  } else {
>>> -                        bio_io_error(bio);
>>> +                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>>> +                                bio_wouldblock_error(bio);
>>> +                        else
>>> +                                bio_io_error(bio);
>>
>> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
>> happened to be set?
>
> Yes, I need to add a condition here to check for blk_queue_dying().
> Thanks.
>
>> And you're missing wbt_wait() as a blocking point as well. Ditto in
>> blk-mq.
>
> wbt_wait() does not apply to WRITE_ODIRECT.

It doesn't _currently_ block for it, but that could change. It should
have an explicit check. Additionally, the more weird assumptions you
make, the more half-assed this will be. The other assumption made here
is that only direct IO would set nonblock. We have to assume that any
IO can have NONBLOCK set, and not make any assumptions about the
caller.
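One way to give wbt_wait() the explicit check Jens asks for is an
early-out before the throttle point, sketched here; wbt_should_wait()
is a hypothetical predicate (the real writeback-throttling code keeps
that decision internal to wbt_wait()):

    /* Sketch: in blk_mq_make_request(), before the throttle point */
    if (bio_flagged(bio, BIO_NOWAIT) && wbt_should_wait(q->rq_wb, bio)) {
            bio_wouldblock_error(bio);      /* don't sleep on throttling */
            return BLK_QC_T_NONE;
    }
    wb_acct = wbt_wait(q->rq_wb, bio, NULL);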
diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99e..2e5cba2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
         if (!IS_ERR(rq))
                 return rq;
 
+        if (bio && bio_flagged(bio, BIO_NOWAIT)) {
+                blk_put_rl(rl);
+                return ERR_PTR(-EAGAIN);
+        }
+
         if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
                 blk_put_rl(rl);
                 return rq;
@@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
         do {
                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-                if (likely(blk_queue_enter(q, false) == 0)) {
+                if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
                         struct bio_list hold;
                         struct bio_list lower, same;
 
@@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
                         bio_list_merge(&bio_list_on_stack, &same);
                         bio_list_merge(&bio_list_on_stack, &hold);
                 } else {
-                        bio_io_error(bio);
+                        if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
+                                bio_wouldblock_error(bio);
+                        else
+                                bio_io_error(bio);
                 }
                 bio = bio_list_pop(current->bio_list);
         } while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..40e78b5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
         if (likely(!data->hctx))
                 data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+        if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+                data->flags |= BLK_MQ_REQ_NOWAIT;
+
         if (e) {
                 data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a..942ce8c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
         rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
         if (unlikely(!rq)) {
                 __wbt_done(q->rq_wb, wb_acct);
+                if (bio && bio_flagged(bio, BIO_NOWAIT))
+                        bio_wouldblock_error(bio);
                 return BLK_QC_T_NONE;
         }
 
@@ -1642,6 +1644,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
         rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
         if (unlikely(!rq)) {
                 __wbt_done(q->rq_wb, wb_acct);
+                if (bio && bio_flagged(bio, BIO_NOWAIT))
+                        bio_wouldblock_error(bio);
                 return BLK_QC_T_NONE;
         }
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea..f6835d3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
         else
                 bio->bi_end_io = dio_bio_end_io;
 
+        if (dio->iocb->ki_flags & IOCB_NOWAIT)
+                bio_set_flag(bio, BIO_NOWAIT);
+
         sdio->bio = bio;
         sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
         unsigned i;
         int err;
 
-        if (bio->bi_error)
-                dio->io_error = -EIO;
+        if (bio->bi_error) {
+                if (bio_flagged(bio, BIO_NOWAIT))
+                        dio->io_error = -EAGAIN;
+                else
+                        dio->io_error = -EIO;
+        }
 
         if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
                 err = bio->bi_error;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..1a92707 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
         bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+        bio->bi_error = -EAGAIN;
+        bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..514c08e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,7 @@ struct bio {
 #define BIO_REFFED      8       /* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED   9       /* This bio has already been subjected to
                                  * throttling rules. Don't do it again. */
+#define BIO_NOWAIT      10      /* don't block over blk device congestion */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
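For context, the user-visible side of this series is a per-IO flag: the
caller passes RWF_NOWAIT (which sets IOCB_NOWAIT on the iocb) and must
handle EAGAIN. A minimal userspace sketch, assuming the RWF_NOWAIT
value that was eventually merged and a libc providing pwritev2():

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    #ifndef RWF_NOWAIT
    #define RWF_NOWAIT 0x00000008   /* value from the later merged ABI */
    #endif

    int main(void)
    {
            struct iovec iov;
            void *buf;
            int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                    return 1;
            memset(buf, 'x', 4096);
            iov.iov_base = buf;
            iov.iov_len = 4096;

            /* ask the kernel for EAGAIN instead of blocking */
            if (pwritev2(fd, &iov, 1, 0, RWF_NOWAIT) < 0 && errno == EAGAIN)
                    fprintf(stderr, "would block; retry later\n");
            return 0;
    }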