Message ID | 20170726235806.12148-2-rgoldwyn@suse.de (mailing list archive)
---|---
State | New, archived
On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Nowait is a feature of direct AIO, where users can request
> to return immediately if the I/O is going to block. This translates
> to REQ_NOWAIT in bio.bi_opf flags. While request based devices
> don't wait, stacked devices such as md/dm will.
>
> In order to explicitly mark stacked devices as supported, we
> set the QUEUE_FLAG_NOWAIT in the queue_flags and return -EAGAIN
> whenever the device would block.

Probably you should route this patch to Jens first; DM/MD are different trees.

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  block/blk-core.c       | 3 ++-
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 970b9c9638c5..1c9a981d88e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
>  	 * if queue is not a request based queue.
>  	 */
>
> -	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
> +	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
> +	    !blk_queue_supports_nowait(q))
>  		goto not_supported;
>
>  	part = bio->bi_bdev->bd_part;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 25f6a0cb27d3..fae021ebec1b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -633,6 +633,7 @@ struct request_queue {
>  #define QUEUE_FLAG_REGISTERED  29	/* queue has been registered to a disk */
>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 30	/* queue supports SCSI commands */
>  #define QUEUE_FLAG_QUIESCED    31	/* queue has been quiesced */
> +#define QUEUE_FLAG_NOWAIT      32	/* stack device driver supports REQ_NOWAIT */
>
>  #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |	\
>  				 (1 << QUEUE_FLAG_STACKABLE) |	\
> @@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
>  #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
>  #define blk_queue_scsi_passthrough(q)	\
>  	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
> +#define blk_queue_supports_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)

Should this bit consider the underlying disks? For example, if one raid array
disk doesn't support NOWAIT, shouldn't we disable NOWAIT for the array?

I have another generic question. If a bio is split into 2 bios, one bio
doesn't need to wait but the other needs to wait. We will return -EAGAIN for
the second bio, so the whole bio will return -EAGAIN, but the first bio is
already dispatched to disk. Is this correct behavior?
On 08/08/2017 02:32 PM, Shaohua Li wrote:
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
[...]
>> +#define QUEUE_FLAG_NOWAIT 32	/* stack device driver supports REQ_NOWAIT */

Does this work on 32-bit, where unsigned long is only 32 bits wide
(sizeof(unsigned long) == 4)?
On 08/08/2017 03:32 PM, Shaohua Li wrote:
> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
>> In order to explicitly mark stacked devices as supported, we
>> set the QUEUE_FLAG_NOWAIT in the queue_flags and return -EAGAIN
>> whenever the device would block.
>
> Probably you should route this patch to Jens first; DM/MD are different trees.

Yes, I have sent it to linux-block as well, and he has commented there.

[...]

> Should this bit consider the underlying disks? For example, if one raid array
> disk doesn't support NOWAIT, shouldn't we disable NOWAIT for the array?

Yes, it should. I will add a check before setting the flag. Thanks.
Request-based devices don't wait, so they would not have this flag set.
It is only the bio-based devices, with the make_request_fn hook, which need this.

> I have another generic question. If a bio is split into 2 bios, one bio
> doesn't need to wait but the other needs to wait. We will return -EAGAIN for
> the second bio, so the whole bio will return -EAGAIN, but the first bio is
> already dispatched to disk. Is this correct behavior?

No, from a multi-device point of view, this is inconsistent. I have tried to
make the request bio return -EAGAIN before the split, but I shall check again.
Where do you see this happening?
On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
[...]
>> No, from a multi-device point of view, this is inconsistent. I have tried to
>> make the request bio return -EAGAIN before the split, but I shall check again.
>> Where do you see this happening?

No, this isn't multi-device specific; any driver can do it. Please see
blk_queue_split.
On 08/09/2017 10:02 AM, Shaohua Li wrote:
[...]
> No, this isn't multi-device specific; any driver can do it. Please see
> blk_queue_split.

In that case, the bio end_io functions are chained, and the split bio will
replicate the error to the parent (if not already set).
On Wed, Aug 09, 2017 at 10:35:39AM -0500, Goldwyn Rodrigues wrote:
[...]
> In that case, the bio end_io functions are chained, and the split bio will
> replicate the error to the parent (if not already set).

This doesn't answer my question. If a bio returns -EAGAIN, part of it has
probably already been dispatched to disk (if the bio is split into 2 bios, one
returns -EAGAIN while the other doesn't block and is dispatched to disk), so
what is the application going to do? I think this is different from other IO
errors. For other IO errors, the application will handle the error, while here
we ask the app to retry the whole I/O, and the app doesn't know that part of it
was already written to disk.
On 08/09/2017 03:21 PM, Shaohua Li wrote:
[...]
> This doesn't answer my question. If a bio returns -EAGAIN, part of it has
> probably already been dispatched to disk, so what is the application going
> to do? I think this is different from other IO errors.

It is the same as for other I/O errors, such as EIO. You do not know which of
all the submitted bios returned EIO; the application would, and should,
consider the whole I/O as failed.

The user application does not know about bios, or how the I/O is going to be
split in the underlying layers; it knows only the system call level. In this
case, the EAGAIN is returned to the user for the whole I/O, not for a part of
it. It is up to the application to retry the I/O with or without RWF_NOWAIT
set. In direct I/O, it is bubbled out using dio->io_error. You can read about
it in the patch header for the initial patchset at [1].

Use case: it is for applications having two threads, a compute thread and an
I/O thread. The compute thread would try to push as much AIO as possible using
RWF_NOWAIT, and if that fails, would pass the I/O on to the I/O thread, which
would perform it without RWF_NOWAIT. The end result, if done right, is that
you save on context switches and all the synchronization/messaging machinery
needed to perform I/O.

[1] http://marc.info/?l=linux-block&m=149789003305876&w=2
On Wed, Aug 09, 2017 at 05:16:23PM -0500, Goldwyn Rodrigues wrote:
[...]
> Use case: it is for applications having two threads, a compute thread and an
> I/O thread.
>
> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2

Yes, I knew the concept, but I didn't see the previous patches mention that
-EAGAIN should actually be taken as a real IO error. This means a lot to
applications and makes the API hard to use. I'm wondering if we should disable
bio splitting for NOWAIT bios, which would make -EAGAIN mean only 'try again'.

Thanks,
Shaohua
On 08/09/2017 08:17 PM, Shaohua Li wrote:
[...]
> Yes, I knew the concept, but I didn't see the previous patches mention that
> -EAGAIN should actually be taken as a real IO error. This means a lot to
> applications and makes the API hard to use. I'm wondering if we should disable
> bio splitting for NOWAIT bios, which would make -EAGAIN mean only 'try again'.

Don't take it as EAGAIN; read it as EWOULDBLOCK. Why do you say the API is
hard to use? Do you have a case to back that up?

No, not splitting the bio does not make sense here. I do not see any advantage
in it, unless you can present a case otherwise.
On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>> I shall check again. Where do you see this happening? >>>>>> >>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>> Please see blk_queue_split. >>>>>> >>>>> >>>>> In that case, the bio end_io function is chained and the bio of >>>>> the split will replicate the error to the parent (if not already >>>>> set). >>>> >>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>> of the bio probably already dispatched to disk (if the bio is >>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>> block and dispatch to disk), what will application be going to do? >>>> I think this is different to other IO errors. FOr other IO errors, >>>> application will handle the error, while we ask app to retry the >>>> whole bio here and app doesn't know part of bio is already written >>>> to disk. >>> >>> It is the same as for other I/O errors as well, such as EIO. You do >>> not know which bio of all submitted bio's returned the error EIO. >>> The application would and should consider the whole I/O as failed. >>> >>> The user application does not know of bios, or how it is going to be >>> split in the underlying layers. It knows at the system call level. >>> In this case, the EAGAIN will be returned to the user for the whole >>> I/O not as a part of the I/O. It is up to application to try the I/O >>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>> out using dio->io_error. You can read about it at the patch header >>> for the initial patchset at [1]. >>> >>> Use case: It is for applications having two threads, a compute >>> thread and an I/O thread. 
It would try to push AIO as much as >>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>> would pass it on to I/O thread which would perform without >>> RWF_NOWAIT. End result if done right is you save on context switches >>> and all the synchronization/messaging machinery to perform I/O. >>> >>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >> >> Yes, I knew the concept, but I didn't see previous patches mentioned >> the -EAGAIN actually should be taken as a real IO error. This means a >> lot to applications and make the API hard to use. I'm wondering if we >> should disable bio split for NOWAIT bio, which will make the -EAGAIN >> only mean 'try again'. > > Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say > the API is hard to use? Do you have a case to back it up? Because it is hard to use, and potentially suboptimal. Let's say you're doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a short write, or do we return EWOULDBLOCK? If the latter, then that really sucks from an API point of view. > No, not splitting the bio does not make sense here. I do not see any > advantage in it, unless you can present a case otherwise. It ties back into the "hard to use" that I do agree with IFF we don't return the short write. It's hard for an application to use that efficiently: if we write 1MB-128K but get EWOULDBLOCK, it has to re-write the full 1MB from a different context.
On 08/08/2017 02:36 PM, Jens Axboe wrote: > On 08/08/2017 02:32 PM, Shaohua Li wrote: >>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>> index 25f6a0cb27d3..fae021ebec1b 100644 >>> --- a/include/linux/blkdev.h >>> +++ b/include/linux/blkdev.h >>> @@ -633,6 +633,7 @@ struct request_queue { >>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */ >>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */ >>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */ >>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */ > > Does this work on 32-bit, where sizeof(unsigned long) == 32? I didn't get an answer to this one.
On 08/09/2017 09:18 PM, Jens Axboe wrote: > On 08/08/2017 02:36 PM, Jens Axboe wrote: >> On 08/08/2017 02:32 PM, Shaohua Li wrote: >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>>> index 25f6a0cb27d3..fae021ebec1b 100644 >>>> --- a/include/linux/blkdev.h >>>> +++ b/include/linux/blkdev.h >>>> @@ -633,6 +633,7 @@ struct request_queue { >>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */ >>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */ >>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */ >>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */ >> >> Does this work on 32-bit, where sizeof(unsigned long) == 32? > > I didn't get an answer to this one. > Oh, I assumed the question is rhetorical. No, it will not work on 32-bit. I was planning to change the field queue_flags to u64. Is that okay?
On 08/09/2017 09:17 PM, Jens Axboe wrote: > On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>>> I shall check again. Where do you see this happening? >>>>>>> >>>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>>> Please see blk_queue_split. >>>>>>> >>>>>> >>>>>> In that case, the bio end_io function is chained and the bio of >>>>>> the split will replicate the error to the parent (if not already >>>>>> set). >>>>> >>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>>> of the bio probably already dispatched to disk (if the bio is >>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>>> block and dispatch to disk), what will application be going to do? >>>>> I think this is different to other IO errors. FOr other IO errors, >>>>> application will handle the error, while we ask app to retry the >>>>> whole bio here and app doesn't know part of bio is already written >>>>> to disk. >>>> >>>> It is the same as for other I/O errors as well, such as EIO. You do >>>> not know which bio of all submitted bio's returned the error EIO. >>>> The application would and should consider the whole I/O as failed. >>>> >>>> The user application does not know of bios, or how it is going to be >>>> split in the underlying layers. It knows at the system call level. >>>> In this case, the EAGAIN will be returned to the user for the whole >>>> I/O not as a part of the I/O. It is up to application to try the I/O >>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>>> out using dio->io_error. You can read about it at the patch header >>>> for the initial patchset at [1]. >>>> >>>> Use case: It is for applications having two threads, a compute >>>> thread and an I/O thread. 
It would try to push AIO as much as >>>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>>> would pass it on to I/O thread which would perform without >>>> RWF_NOWAIT. End result if done right is you save on context switches >>>> and all the synchronization/messaging machinery to perform I/O. >>>> >>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >>> >>> Yes, I knew the concept, but I didn't see previous patches mentioned >>> the -EAGAIN actually should be taken as a real IO error. This means a >>> lot to applications and make the API hard to use. I'm wondering if we >>> should disable bio split for NOWAIT bio, which will make the -EAGAIN >>> only mean 'try again'. >> >> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say >> the API is hard to use? Do you have a case to back it up? > > Because it is hard to use, and potentially suboptimal. Let's say you're > doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a > short write, or do we return EWOULDBLOCK? If the latter, then that > really sucks from an API point of view. > >> No, not splitting the bio does not make sense here. I do not see any >> advantage in it, unless you can present a case otherwise. > > It ties back into the "hard to use" that I do agree with IFF we don't > return the short write. It's hard for an application to use that > efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the > full 1MB from a different context. > It returns the error code only and not short reads/writes. But isn't that true for all system calls in case of error? For aio, there are two result fields in io_event out of which one could be used for error while the other be used for amount of writes/reads performed. However, only one is used. This will not work with pread()/pwrite() calls though because of the limitation of return values. 
Finally, what if the EWOULDBLOCK is returned for an earlier bio (say at offset 128k) of a 1MB pwrite(), while the remaining seven 128K bios are successful? What short return value should the system call return?
On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote: > > > On 08/09/2017 09:18 PM, Jens Axboe wrote: >> On 08/08/2017 02:36 PM, Jens Axboe wrote: >>> On 08/08/2017 02:32 PM, Shaohua Li wrote: >>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>>>> index 25f6a0cb27d3..fae021ebec1b 100644 >>>>> --- a/include/linux/blkdev.h >>>>> +++ b/include/linux/blkdev.h >>>>> @@ -633,6 +633,7 @@ struct request_queue { >>>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */ >>>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */ >>>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */ >>>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */ >>> >>> Does this work on 32-bit, where sizeof(unsigned long) == 32? >> >> I didn't get an answer to this one. >> > > Oh, I assumed the question is rhetorical. > No, it will not work on 32-bit. I was planning to change the field > queue_flags to u64. Is that okay? No, besides that would not work with set/test_bit() and friends. Grab a free bit instead.
On 08/10/2017 05:49 AM, Goldwyn Rodrigues wrote: > > > On 08/09/2017 09:17 PM, Jens Axboe wrote: >> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>>>> I shall check again. Where do you see this happening? >>>>>>>> >>>>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>>>> Please see blk_queue_split. >>>>>>>> >>>>>>> >>>>>>> In that case, the bio end_io function is chained and the bio of >>>>>>> the split will replicate the error to the parent (if not already >>>>>>> set). >>>>>> >>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>>>> of the bio probably already dispatched to disk (if the bio is >>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>>>> block and dispatch to disk), what will application be going to do? >>>>>> I think this is different to other IO errors. FOr other IO errors, >>>>>> application will handle the error, while we ask app to retry the >>>>>> whole bio here and app doesn't know part of bio is already written >>>>>> to disk. >>>>> >>>>> It is the same as for other I/O errors as well, such as EIO. You do >>>>> not know which bio of all submitted bio's returned the error EIO. >>>>> The application would and should consider the whole I/O as failed. >>>>> >>>>> The user application does not know of bios, or how it is going to be >>>>> split in the underlying layers. It knows at the system call level. >>>>> In this case, the EAGAIN will be returned to the user for the whole >>>>> I/O not as a part of the I/O. It is up to application to try the I/O >>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>>>> out using dio->io_error. You can read about it at the patch header >>>>> for the initial patchset at [1]. 
>>>>> >>>>> Use case: It is for applications having two threads, a compute >>>>> thread and an I/O thread. It would try to push AIO as much as >>>>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>>>> would pass it on to I/O thread which would perform without >>>>> RWF_NOWAIT. End result if done right is you save on context switches >>>>> and all the synchronization/messaging machinery to perform I/O. >>>>> >>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >>>> >>>> Yes, I knew the concept, but I didn't see previous patches mentioned >>>> the -EAGAIN actually should be taken as a real IO error. This means a >>>> lot to applications and make the API hard to use. I'm wondering if we >>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN >>>> only mean 'try again'. >>> >>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say >>> the API is hard to use? Do you have a case to back it up? >> >> Because it is hard to use, and potentially suboptimal. Let's say you're >> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a >> short write, or do we return EWOULDBLOCK? If the latter, then that >> really sucks from an API point of view. >> >>> No, not splitting the bio does not make sense here. I do not see any >>> advantage in it, unless you can present a case otherwise. >> >> It ties back into the "hard to use" that I do agree with IFF we don't >> return the short write. It's hard for an application to use that >> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the >> full 1MB from a different context. >> > > It returns the error code only and not short reads/writes. But isn't > that true for all system calls in case of error? It's not a hard error. If you wrote 896K in the example above, I'd really expect the return value to be 896*1024. The API is hard to use efficiently, if that's not the case. 
> For aio, there are two result fields in io_event out of which one could > be used for error while the other be used for amount of writes/reads > performed. However, only one is used. This will not work with > pread()/pwrite() calls though because of the limitation of return values. Don't invent something new for this, the mechanism already exists for returning a short read or write. That's how all of them have worked for decades. > Finally, what if the EWOULDBLOCK is returned for an earlier bio (say > offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are > successful. What short return value should the system call return? It should return 128*1024, since that's how much was successfully done from the start offset. But yes, this is exactly the point that I brought up, and why Shaohua's suggestion to perhaps treat splits differently should not be discarded so quickly.
On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote: > On 08/09/2017 09:17 PM, Jens Axboe wrote: > > On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: > >>>>>>>> No, from a multi-device point of view, this is inconsistent. I > >>>>>>>> have tried the request bio returns -EAGAIN before the split, but > >>>>>>>> I shall check again. Where do you see this happening? > >>>>>>> > >>>>>>> No, this isn't multi-device specific, any driver can do it. > >>>>>>> Please see blk_queue_split. > >>>>>>> > >>>>>> > >>>>>> In that case, the bio end_io function is chained and the bio of > >>>>>> the split will replicate the error to the parent (if not already > >>>>>> set). > >>>>> > >>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part > >>>>> of the bio probably already dispatched to disk (if the bio is > >>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't > >>>>> block and dispatch to disk), what will application be going to do? > >>>>> I think this is different to other IO errors. FOr other IO errors, > >>>>> application will handle the error, while we ask app to retry the > >>>>> whole bio here and app doesn't know part of bio is already written > >>>>> to disk. > >>>> > >>>> It is the same as for other I/O errors as well, such as EIO. You do > >>>> not know which bio of all submitted bio's returned the error EIO. > >>>> The application would and should consider the whole I/O as failed. > >>>> > >>>> The user application does not know of bios, or how it is going to be > >>>> split in the underlying layers. It knows at the system call level. > >>>> In this case, the EAGAIN will be returned to the user for the whole > >>>> I/O not as a part of the I/O. It is up to application to try the I/O > >>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled > >>>> out using dio->io_error. You can read about it at the patch header > >>>> for the initial patchset at [1]. 
> >>>> > >>>> Use case: It is for applications having two threads, a compute > >>>> thread and an I/O thread. It would try to push AIO as much as > >>>> possible in the compute thread using RWF_NOWAIT, and if it fails, > >>>> would pass it on to I/O thread which would perform without > >>>> RWF_NOWAIT. End result if done right is you save on context switches > >>>> and all the synchronization/messaging machinery to perform I/O. > >>>> > >>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 > >>> > >>> Yes, I knew the concept, but I didn't see previous patches mentioned > >>> the -EAGAIN actually should be taken as a real IO error. This means a > >>> lot to applications and make the API hard to use. I'm wondering if we > >>> should disable bio split for NOWAIT bio, which will make the -EAGAIN > >>> only mean 'try again'. > >> > >> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say > >> the API is hard to use? Do you have a case to back it up? > > > > Because it is hard to use, and potentially suboptimal. Let's say you're > > doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a > > short write, or do we return EWOULDBLOCK? If the latter, then that > > really sucks from an API point of view. > > > >> No, not splitting the bio does not make sense here. I do not see any > >> advantage in it, unless you can present a case otherwise. > > > > It ties back into the "hard to use" that I do agree with IFF we don't > > return the short write. It's hard for an application to use that > > efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the > > full 1MB from a different context. > > > > It returns the error code only and not short reads/writes. But isn't > that true for all system calls in case of error? > > For aio, there are two result fields in io_event out of which one could > be used for error while the other be used for amount of writes/reads > performed. However, only one is used. 
This will not work with > pread()/pwrite() calls though because of the limitation of return values. > > Finally, what if the EWOULDBLOCK is returned for an earlier bio (say > offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are > successful. What short return value should the system call return? This is indeed tricky. If an application submits a 1MB write, I don't think we can afford to write just an arbitrary subset of it. That IMHO violates too much of how writes have traditionally behaved. Even short writes trigger bugs in various applications, but I'm willing to require that applications using NOWAIT IO can handle those. However, writing an arbitrary subset looks like a nasty catch. IMHO, when splitting a larger bio, we should not submit further bios until we are sure the current one does not return EWOULDBLOCK... Honza
On 08/10/2017 08:25 AM, Jan Kara wrote: > On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote: >> On 08/09/2017 09:17 PM, Jens Axboe wrote: >>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>>>>> I shall check again. Where do you see this happening? >>>>>>>>> >>>>>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>>>>> Please see blk_queue_split. >>>>>>>>> >>>>>>>> >>>>>>>> In that case, the bio end_io function is chained and the bio of >>>>>>>> the split will replicate the error to the parent (if not already >>>>>>>> set). >>>>>>> >>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>>>>> of the bio probably already dispatched to disk (if the bio is >>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>>>>> block and dispatch to disk), what will application be going to do? >>>>>>> I think this is different to other IO errors. FOr other IO errors, >>>>>>> application will handle the error, while we ask app to retry the >>>>>>> whole bio here and app doesn't know part of bio is already written >>>>>>> to disk. >>>>>> >>>>>> It is the same as for other I/O errors as well, such as EIO. You do >>>>>> not know which bio of all submitted bio's returned the error EIO. >>>>>> The application would and should consider the whole I/O as failed. >>>>>> >>>>>> The user application does not know of bios, or how it is going to be >>>>>> split in the underlying layers. It knows at the system call level. >>>>>> In this case, the EAGAIN will be returned to the user for the whole >>>>>> I/O not as a part of the I/O. It is up to application to try the I/O >>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>>>>> out using dio->io_error. You can read about it at the patch header >>>>>> for the initial patchset at [1]. 
>>>>>> >>>>>> Use case: It is for applications having two threads, a compute >>>>>> thread and an I/O thread. It would try to push AIO as much as >>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>>>>> would pass it on to I/O thread which would perform without >>>>>> RWF_NOWAIT. End result if done right is you save on context switches >>>>>> and all the synchronization/messaging machinery to perform I/O. >>>>>> >>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >>>>> >>>>> Yes, I knew the concept, but I didn't see previous patches mentioned >>>>> the -EAGAIN actually should be taken as a real IO error. This means a >>>>> lot to applications and make the API hard to use. I'm wondering if we >>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN >>>>> only mean 'try again'. >>>> >>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say >>>> the API is hard to use? Do you have a case to back it up? >>> >>> Because it is hard to use, and potentially suboptimal. Let's say you're >>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a >>> short write, or do we return EWOULDBLOCK? If the latter, then that >>> really sucks from an API point of view. >>> >>>> No, not splitting the bio does not make sense here. I do not see any >>>> advantage in it, unless you can present a case otherwise. >>> >>> It ties back into the "hard to use" that I do agree with IFF we don't >>> return the short write. It's hard for an application to use that >>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the >>> full 1MB from a different context. >>> >> >> It returns the error code only and not short reads/writes. But isn't >> that true for all system calls in case of error? >> >> For aio, there are two result fields in io_event out of which one could >> be used for error while the other be used for amount of writes/reads >> performed. However, only one is used. 
This will not work with >> pread()/pwrite() calls though because of the limitation of return values. >> >> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say >> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are >> successful. What short return value should the system call return? > > This is indeed tricky. If an application submits 1MB write, I don't think > we can afford to just write arbitrary subset of it. That just IMHO too much > violates how writes traditionally behaved. Even short writes trigger bugs > in various applications but I'm willing to require that applications using > NOWAIT IO can handle these. However writing arbitrary subset looks like a > nasty catch. IMHO we should not submit further bios until we are sure > current one does not return EWOULDBLOCK when splitting a larger one... Exactly, that's the point that both Shaohua and I was getting at. Short writes should be fine, especially if NOWAIT is set. Discontig writes should also be OK, but it's horrible and inefficient. If we do that, then using this feature is a net-loss, not a win by any stretch.
On 08/10/2017 09:28 AM, Jens Axboe wrote: > On 08/10/2017 08:25 AM, Jan Kara wrote: >> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote: >>> On 08/09/2017 09:17 PM, Jens Axboe wrote: >>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>>>>>> I shall check again. Where do you see this happening? >>>>>>>>>> >>>>>>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>>>>>> Please see blk_queue_split. >>>>>>>>>> >>>>>>>>> >>>>>>>>> In that case, the bio end_io function is chained and the bio of >>>>>>>>> the split will replicate the error to the parent (if not already >>>>>>>>> set). >>>>>>>> >>>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>>>>>> of the bio probably already dispatched to disk (if the bio is >>>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>>>>>> block and dispatch to disk), what will application be going to do? >>>>>>>> I think this is different to other IO errors. FOr other IO errors, >>>>>>>> application will handle the error, while we ask app to retry the >>>>>>>> whole bio here and app doesn't know part of bio is already written >>>>>>>> to disk. >>>>>>> >>>>>>> It is the same as for other I/O errors as well, such as EIO. You do >>>>>>> not know which bio of all submitted bio's returned the error EIO. >>>>>>> The application would and should consider the whole I/O as failed. >>>>>>> >>>>>>> The user application does not know of bios, or how it is going to be >>>>>>> split in the underlying layers. It knows at the system call level. >>>>>>> In this case, the EAGAIN will be returned to the user for the whole >>>>>>> I/O not as a part of the I/O. It is up to application to try the I/O >>>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>>>>>> out using dio->io_error. 
You can read about it at the patch header >>>>>>> for the initial patchset at [1]. >>>>>>> >>>>>>> Use case: It is for applications having two threads, a compute >>>>>>> thread and an I/O thread. It would try to push AIO as much as >>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>>>>>> would pass it on to I/O thread which would perform without >>>>>>> RWF_NOWAIT. End result if done right is you save on context switches >>>>>>> and all the synchronization/messaging machinery to perform I/O. >>>>>>> >>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >>>>>> >>>>>> Yes, I knew the concept, but I didn't see previous patches mentioned >>>>>> the -EAGAIN actually should be taken as a real IO error. This means a >>>>>> lot to applications and make the API hard to use. I'm wondering if we >>>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN >>>>>> only mean 'try again'. >>>>> >>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say >>>>> the API is hard to use? Do you have a case to back it up? >>>> >>>> Because it is hard to use, and potentially suboptimal. Let's say you're >>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a >>>> short write, or do we return EWOULDBLOCK? If the latter, then that >>>> really sucks from an API point of view. >>>> >>>>> No, not splitting the bio does not make sense here. I do not see any >>>>> advantage in it, unless you can present a case otherwise. >>>> >>>> It ties back into the "hard to use" that I do agree with IFF we don't >>>> return the short write. It's hard for an application to use that >>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the >>>> full 1MB from a different context. >>>> >>> >>> It returns the error code only and not short reads/writes. But isn't >>> that true for all system calls in case of error? 
>>> >>> For aio, there are two result fields in io_event out of which one could >>> be used for error while the other be used for amount of writes/reads >>> performed. However, only one is used. This will not work with >>> pread()/pwrite() calls though because of the limitation of return values. >>> >>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say >>> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are >>> successful. What short return value should the system call return? >> >> This is indeed tricky. If an application submits 1MB write, I don't think >> we can afford to just write arbitrary subset of it. That just IMHO too much >> violates how writes traditionally behaved. Even short writes trigger bugs >> in various applications but I'm willing to require that applications using >> NOWAIT IO can handle these. However writing arbitrary subset looks like a >> nasty catch. IMHO we should not submit further bios until we are sure >> current one does not return EWOULDBLOCK when splitting a larger one... > > Exactly, that's the point that both Shaohua and I was getting at. Short > writes should be fine, especially if NOWAIT is set. Discontig writes > should also be OK, but it's horrible and inefficient. If we do that, > then using this feature is a net-loss, not a win by any stretch. > To make sure I understand this, we disable bio splits for NOWAIT bio so we return EWOULDBLOCK for the entire I/O.
On 08/10/2017 09:14 AM, Jens Axboe wrote:
> On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote:
>>
>>
>> On 08/09/2017 09:18 PM, Jens Axboe wrote:
>>> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>>>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>>>> --- a/include/linux/blkdev.h
>>>>>> +++ b/include/linux/blkdev.h
>>>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>>>>  #define QUEUE_FLAG_REGISTERED 29	/* queue has been registered to a disk */
>>>>>>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 30	/* queue supports SCSI commands */
>>>>>>  #define QUEUE_FLAG_QUIESCED 31	/* queue has been quiesced */
>>>>>> +#define QUEUE_FLAG_NOWAIT 32	/* stack device driver supports REQ_NOWAIT */
>>>>
>>>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>>>
>>> I didn't get an answer to this one.
>>>
>>
>> Oh, I assumed the question is rhetorical.
>> No, it will not work on 32-bit. I was planning to change the field
>> queue_flags to u64. Is that okay?
>
> No, besides that would not work with set/test_bit() and friends. Grab
> a free bit instead.
>

Which bit is free? I don't see any gaps in QUEUE_FLAG_*, and I am not
sure if any one is unused.
On 08/10/2017 11:15 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/10/2017 09:14 AM, Jens Axboe wrote:
>> On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote:
>>>
>>>
>>> On 08/09/2017 09:18 PM, Jens Axboe wrote:
>>>> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>>>>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>>>>> --- a/include/linux/blkdev.h
>>>>>>> +++ b/include/linux/blkdev.h
>>>>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>>>>>  #define QUEUE_FLAG_REGISTERED 29	/* queue has been registered to a disk */
>>>>>>>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 30	/* queue supports SCSI commands */
>>>>>>>  #define QUEUE_FLAG_QUIESCED 31	/* queue has been quiesced */
>>>>>>> +#define QUEUE_FLAG_NOWAIT 32	/* stack device driver supports REQ_NOWAIT */
>>>>>
>>>>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>>>>
>>>> I didn't get an answer to this one.
>>>>
>>>
>>> Oh, I assumed the question is rhetorical.
>>> No, it will not work on 32-bit. I was planning to change the field
>>> queue_flags to u64. Is that okay?
>>
>> No, besides that would not work with set/test_bit() and friends. Grab
>> a free bit instead.
>>
>
> Which bit is free? I don't see any gaps in QUEUE_FLAG_*, and I am not
> sure if any one is unused.

Bit 0 is free in mainline, and I just looked, and two other bits are
gone as well:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.14/block&id=e743eb1ecd5564b5ae0a4a76c1566f748a358839
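[Editorial note] The width problem discussed above can be shown in a few lines of userspace C: queue_flags is a single unsigned long bitmap, so a flag number must be strictly less than the word width (BITS_PER_LONG in kernel terms), which is 32 on a 32-bit build — making 32 the first invalid flag number. This sketch is illustrative only; `flag_fits()` is a hypothetical helper, not a kernel API.

```c
#include <assert.h>

/* Flag numbers mirroring the quoted patch hunk. */
#define QUEUE_FLAG_QUIESCED 31  /* last usable bit in a 32-bit word */
#define QUEUE_FLAG_NOWAIT   32  /* proposed bit -- out of range on 32-bit */

/* A flag fits a bitmap of word_bits bits iff flag < word_bits.
 * Shifting an unsigned long by >= its width (as 1UL << 32 would do on a
 * 32-bit build) is undefined behavior in C, and set_bit()/test_bit()
 * likewise index past the single-word bitmap. */
static int flag_fits(unsigned int flag, unsigned int word_bits)
{
    return flag < word_bits;
}
```

This is why the resolution in the thread is to reuse a freed bit below 32 rather than widen queue_flags to u64, which set_bit()/test_bit() (operating on unsigned long words) could not address directly.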
On 08/10/2017 11:15 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/10/2017 09:28 AM, Jens Axboe wrote:
>> On 08/10/2017 08:25 AM, Jan Kara wrote:
>>> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
>>>> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>>>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>>>>
>>>>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>>>>> Please see blk_queue_split.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>>>>> set).
>>>>>>>>>
>>>>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>>>>> block and dispatch to disk), what will application be going to do?
>>>>>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>>>>>> application will handle the error, while we ask app to retry the
>>>>>>>>> whole bio here and app doesn't know part of bio is already written
>>>>>>>>> to disk.
>>>>>>>>
>>>>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>>>>> not know which bio of all submitted bio's returned the error EIO.
>>>>>>>> The application would and should consider the whole I/O as failed.
>>>>>>>>
>>>>>>>> The user application does not know of bios, or how it is going to be
>>>>>>>> split in the underlying layers. It knows at the system call level.
>>>>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>>>>> out using dio->io_error. You can read about it at the patch header
>>>>>>>> for the initial patchset at [1].
>>>>>>>>
>>>>>>>> Use case: It is for applications having two threads, a compute
>>>>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>>>>> would pass it on to I/O thread which would perform without
>>>>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>>>>
>>>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>>>>
>>>>>>> Yes, I knew the concept, but I didn't see previous patches mentioned
>>>>>>> the -EAGAIN actually should be taken as a real IO error. This means a
>>>>>>> lot to applications and make the API hard to use. I'm wondering if we
>>>>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>>>>>>> only mean 'try again'.
>>>>>>
>>>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>>>>> the API is hard to use? Do you have a case to back it up?
>>>>>
>>>>> Because it is hard to use, and potentially suboptimal. Let's say you're
>>>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>>>>> short write, or do we return EWOULDBLOCK? If the latter, then that
>>>>> really sucks from an API point of view.
>>>>>
>>>>>> No, not splitting the bio does not make sense here. I do not see any
>>>>>> advantage in it, unless you can present a case otherwise.
>>>>>
>>>>> It ties back into the "hard to use" that I do agree with IFF we don't
>>>>> return the short write. It's hard for an application to use that
>>>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
>>>>> full 1MB from a different context.
>>>>>
>>>>
>>>> It returns the error code only and not short reads/writes. But isn't
>>>> that true for all system calls in case of error?
>>>>
>>>> For aio, there are two result fields in io_event out of which one could
>>>> be used for error while the other be used for amount of writes/reads
>>>> performed. However, only one is used. This will not work with
>>>> pread()/pwrite() calls though because of the limitation of return values.
>>>>
>>>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
>>>> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
>>>> successful. What short return value should the system call return?
>>>
>>> This is indeed tricky. If an application submits 1MB write, I don't think
>>> we can afford to just write arbitrary subset of it. That just IMHO too much
>>> violates how writes traditionally behaved. Even short writes trigger bugs
>>> in various applications but I'm willing to require that applications using
>>> NOWAIT IO can handle these. However writing arbitrary subset looks like a
>>> nasty catch. IMHO we should not submit further bios until we are sure
>>> current one does not return EWOULDBLOCK when splitting a larger one...
>>
>> Exactly, that's the point that both Shaohua and I was getting at. Short
>> writes should be fine, especially if NOWAIT is set. Discontig writes
>> should also be OK, but it's horrible and inefficient. If we do that,
>> then using this feature is a net-loss, not a win by any stretch.
>>
>
> To make sure I understand this, we disable bio splits for NOWAIT bio so
> we return EWOULDBLOCK for the entire I/O.

That's also not great, since splits is a common operation, and the
majority of splits can proceed without hitting out-of-resources. So
ideally we'd handle that case, but in a saner fashion than the laissez
faire approach that the current patchset takes.
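[Editorial note] The split semantics Jan Kara argues for above (don't submit further bios until the current one is known not to block, and report only a contiguous prefix) can be modeled in a few lines of userspace C. This is an illustrative sketch, not kernel code: `CHUNK`, `nowait_write()`, and `would_block()` are invented names, with `would_block()` standing in for the device refusing a chunk without waiting.

```c
#include <assert.h>
#include <errno.h>

#define CHUNK (128 * 1024)  /* split granularity, as in the 1MB/128K example */

typedef int (*would_block_fn)(unsigned long offset);

/* Dispatch split chunks in order; stop at the first chunk that would
 * block. The caller sees either a contiguous short write or -EAGAIN
 * (when nothing at all could be written) -- never a discontiguous
 * subset of the range. */
static long nowait_write(unsigned long len, would_block_fn would_block)
{
    unsigned long done = 0;

    while (done < len) {
        unsigned long n = len - done < CHUNK ? len - done : CHUNK;

        if (would_block(done))
            break;      /* do not submit further chunks past a blocker */
        done += n;      /* chunk dispatched without waiting */
    }
    return done ? (long)done : -EAGAIN;
}

/* Example blockers for the 1MB-write case from the thread. */
static int blocks_after_7_chunks(unsigned long off) { return off >= 7UL * CHUNK; }
static int blocks_immediately(unsigned long off)    { (void)off; return 1; }
```

Under this model the 1MB pwrite() that hits a blocker at the last 128K returns a short write of 7*128K, while an immediately-blocking write returns plain -EAGAIN, matching the "short writes are fine, discontiguous writes are not" position in the thread.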
diff --git a/block/blk-core.c b/block/blk-core.c
index 970b9c9638c5..1c9a981d88e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
 	 * if queue is not a request based queue.
 	 */
 
-	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
+	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
+	    !blk_queue_supports_nowait(q))
 		goto not_supported;
 
 	part = bio->bi_bdev->bd_part;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..fae021ebec1b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -633,6 +633,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED 29	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 30	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED 31	/* queue has been quiesced */
+#define QUEUE_FLAG_NOWAIT 32	/* stack device driver supports REQ_NOWAIT */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE) |		\
@@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)	\
 	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_supports_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
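[Editorial note] The behavioral change in the blk-core.c hunk above can be modeled as a small userspace predicate: a REQ_NOWAIT bio is rejected ("goto not_supported") only when the queue is neither request based nor flagged nowait-capable by a stacking driver. This is a sketch for illustration; the `REQ_NOWAIT` value, `struct q_model`, and function names are invented stand-ins, not real kernel definitions.

```c
#include <assert.h>

#define REQ_NOWAIT (1u << 21)  /* illustrative flag value only */

struct q_model {
    int rq_based;     /* stands in for queue_is_rq_based(q)         */
    int nowait_flag;  /* stands in for blk_queue_supports_nowait(q) */
};

/* Returns 1 when the bio passes the patched check in
 * generic_make_request_checks(), 0 for the "goto not_supported" path. */
static int nowait_check_passes(unsigned int bi_opf, const struct q_model *q)
{
    if ((bi_opf & REQ_NOWAIT) && !q->rq_based && !q->nowait_flag)
        return 0;
    return 1;
}
```

Before the patch, only request-based queues passed; the new QUEUE_FLAG_NOWAIT test lets an opted-in stacked device (md/dm) pass as well, while bios without REQ_NOWAIT are unaffected.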