Message ID | 20201206055332.3144-1-tom.ty89@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | [1/3] block: try one write zeroes request before going further | expand |
On 12/6/20 6:53 AM, Tom Yan wrote: > At least the SCSI disk driver is "benevolent" when it try to decide > whether the device actually supports write zeroes, i.e. unless the > device explicity report otherwise, it assumes it does at first. > > Therefore before we pile up bios that would fail at the end, we try > the command/request once, as not doing so could trigger quite a > disaster in at least certain case. For example, the host controller > can be messed up entirely when one does `blkdiscard -z` a UAS drive. > > Signed-off-by: Tom Yan <tom.ty89@gmail.com> > --- > block/blk-lib.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index e90614fd8d6a..c1e9388a8fb8 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > struct bio *bio = *biop; > unsigned int max_write_zeroes_sectors; > struct request_queue *q = bdev_get_queue(bdev); > + int i = 0; > > if (!q) > return -ENXIO; > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > return -EOPNOTSUPP; > > while (nr_sects) { > - bio = blk_next_bio(bio, 0, gfp_mask); > + if (i != 1) { > + bio = blk_next_bio(bio, 0, gfp_mask); > + } else { > + submit_bio_wait(bio); > + bio_put(bio); > + > + if (bdev_write_zeroes_sectors(bdev) == 0) > + return -EOPNOTSUPP; > + else > + bio = bio_alloc(gfp_mask, 0); > + } > bio->bi_iter.bi_sector = sector; > bio_set_dev(bio, bdev); > bio->bi_opf = REQ_OP_WRITE_ZEROES; > @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > nr_sects = 0; > } > cond_resched(); > + i++; > } > > *biop = bio; > We do want to keep the chain of bios intact such that end_io processing will recurse back to the original end_io callback. As such we need to call bio_chain on the first bio, submit that (possibly with submit_bio_wait()), and then decide whether we can / should continue. With your patch we'll lose the information that indeed other bios might be linked to the original one. Cheers, Hannes
I think you misunderstood it. The goal of this patch is to split the current situation into two chains (or one unchained bio + a series of chained bio). The first one is an attempt/trial which makes sure that the latter large bio chain can actually be handled (as per the "command capability" of the device). P.S. I think I missed the fact that it requires my blk_next_bio() patch to work properly. (It still seems like a typo bug to me.) On Sun, 6 Dec 2020 at 19:25, Hannes Reinecke <hare@suse.de> wrote: > > On 12/6/20 6:53 AM, Tom Yan wrote: > > At least the SCSI disk driver is "benevolent" when it try to decide > > whether the device actually supports write zeroes, i.e. unless the > > device explicity report otherwise, it assumes it does at first. > > > > Therefore before we pile up bios that would fail at the end, we try > > the command/request once, as not doing so could trigger quite a > > disaster in at least certain case. For example, the host controller > > can be messed up entirely when one does `blkdiscard -z` a UAS drive. > > > > Signed-off-by: Tom Yan <tom.ty89@gmail.com> > > --- > > block/blk-lib.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c > > index e90614fd8d6a..c1e9388a8fb8 100644 > > --- a/block/blk-lib.c > > +++ b/block/blk-lib.c > > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > > struct bio *bio = *biop; > > unsigned int max_write_zeroes_sectors; > > struct request_queue *q = bdev_get_queue(bdev); > > + int i = 0; > > > > if (!q) > > return -ENXIO; > > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > > return -EOPNOTSUPP; > > > > while (nr_sects) { > > - bio = blk_next_bio(bio, 0, gfp_mask); > > + if (i != 1) { > > + bio = blk_next_bio(bio, 0, gfp_mask); > > + } else { > > + submit_bio_wait(bio); > > + bio_put(bio); > > + > > + if (bdev_write_zeroes_sectors(bdev) == 0) > > + return -EOPNOTSUPP; > > + else > > + bio = bio_alloc(gfp_mask, 0); > > + } > > bio->bi_iter.bi_sector = sector; > > bio_set_dev(bio, bdev); > > bio->bi_opf = REQ_OP_WRITE_ZEROES; > > @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > > nr_sects = 0; > > } > > cond_resched(); > > + i++; > > } > > > > *biop = bio; > > > We do want to keep the chain of bios intact such that end_io processing > will recurse back to the original end_io callback. > As such we need to call bio_chain on the first bio, submit that > (possibly with submit_bio_wait()), and then decide whether we can / > should continue. > With your patch we'll lose the information that indeed other bios might > be linked to the original one. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On 12/6/20 2:25 PM, Tom Yan wrote: > I think you misunderstood it. The goal of this patch is to split the > current situation into two chains (or one unchained bio + a series of > chained bio). The first one is an attempt/trial which makes sure that > the latter large bio chain can actually be handled (as per the > "command capability" of the device). > Oh, I think I do get what you're trying to do. And, in fact, I don't argue with what you're trying to achieve. What I would like to see, though, is keep the current bio_chain logic intact (irrespective of your previous patch, which should actually be part of this series), and just lift the first check out of the loop: @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, if (max_write_zeroes_sectors == 0) return -EOPNOTSUPP; - + new = bio_alloc(gfp_mask, 0); + bio_chain(bio, new); + if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) { + bio_put(new); + return -ENOPNOTSUPP; + } + bio = new; while (nr_sects) { - bio = blk_next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; bio_set_dev(bio, bdev); bio->bi_opf = REQ_OP_WRITE_ZEROES; @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, bio->bi_iter.bi_size = nr_sects << 9; nr_sects = 0; } + bio = blk_next_bio(bio, 0, gfp_mask); cond_resched(); } (The error checking from submit_bio_wait() could be improved :-) Cheers, Hannes
Yes it does have "dependency" to the blk_next_bio() patch. I just somehow missed that. The problem is, I don't think I'm trying to change the logic of bio_chain(), or even that of blk_next_bio(). It really just looks like a careless mistake, that the arguments were typed in the wrong order. Adding those who signed off the original commit (block: remove struct bio_batch / 9082e87b) here too to the CC list. On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote: > > On 12/6/20 2:25 PM, Tom Yan wrote: > > I think you misunderstood it. The goal of this patch is to split the > > current situation into two chains (or one unchained bio + a series of > > chained bio). The first one is an attempt/trial which makes sure that > > the latter large bio chain can actually be handled (as per the > > "command capability" of the device). > > > Oh, I think I do get what you're trying to do. And, in fact, I don't > argue with what you're trying to achieve. > > What I would like to see, though, is keep the current bio_chain logic > intact (irrespective of your previous patch, which should actually be > part of this series), and just lift the first check out of the loop: > > @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct > block_device *bdev, > > if (max_write_zeroes_sectors == 0) > return -EOPNOTSUPP; > - > + new = bio_alloc(gfp_mask, 0); > + bio_chain(bio, new); > + if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) { > + bio_put(new); > + return -ENOPNOTSUPP; > + } > + bio = new; > while (nr_sects) { > - bio = blk_next_bio(bio, 0, gfp_mask); > bio->bi_iter.bi_sector = sector; > bio_set_dev(bio, bdev); > bio->bi_opf = REQ_OP_WRITE_ZEROES; > @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct > block_device *bdev, > bio->bi_iter.bi_size = nr_sects << 9; > nr_sects = 0; > } > + bio = blk_next_bio(bio, 0, gfp_mask); > cond_resched(); > } > > (The error checking from submit_bio_wait() could be improved :-) > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Btw, while this series relies on the blk_next_bio() patch to work, it was not the reason that I sent the latter. It was just because the way it calls bio_chain() doesn't look right to any of the functions that make use of it (or in other words, the apparent logic of itself). That's actually why I didn't have it in the same series. On Sun, 6 Dec 2020 at 22:07, Tom Yan <tom.ty89@gmail.com> wrote: > > Yes it does have "dependency" to the blk_next_bio() patch. I just > somehow missed that. > > The problem is, I don't think I'm trying to change the logic of > bio_chain(), or even that of blk_next_bio(). It really just looks like > a careless mistake, that the arguments were typed in the wrong order. > > Adding those who signed off the original commit (block: remove struct > bio_batch / 9082e87b) here too to the CC list. > > > On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote: > > > > On 12/6/20 2:25 PM, Tom Yan wrote: > > > I think you misunderstood it. The goal of this patch is to split the > > > current situation into two chains (or one unchained bio + a series of > > > chained bio). The first one is an attempt/trial which makes sure that > > > the latter large bio chain can actually be handled (as per the > > > "command capability" of the device). > > > > > Oh, I think I do get what you're trying to do. And, in fact, I don't > > argue with what you're trying to achieve. > > > > What I would like to see, though, is keep the current bio_chain logic > > intact (irrespective of your previous patch, which should actually be > > part of this series), and just lift the first check out of the loop: > > > > @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct > > block_device *bdev, > > > > if (max_write_zeroes_sectors == 0) > > return -EOPNOTSUPP; > > - > > + new = bio_alloc(gfp_mask, 0); > > + bio_chain(bio, new); > > + if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) { > > + bio_put(new); > > + return -ENOPNOTSUPP; > > + } > > + bio = new; > > while (nr_sects) { > > - bio = blk_next_bio(bio, 0, gfp_mask); > > bio->bi_iter.bi_sector = sector; > > bio_set_dev(bio, bdev); > > bio->bi_opf = REQ_OP_WRITE_ZEROES; > > @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct > > block_device *bdev, > > bio->bi_iter.bi_size = nr_sects << 9; > > nr_sects = 0; > > } > > + bio = blk_next_bio(bio, 0, gfp_mask); > > cond_resched(); > > } > > > > (The error checking from submit_bio_wait() could be improved :-) > > > > Cheers, > > > > Hannes > > -- > > Dr. Hannes Reinecke Kernel Storage Architect > > hare@suse.de +49 911 74053 688 > > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote: > At least the SCSI disk driver is "benevolent" when it try to decide > whether the device actually supports write zeroes, i.e. unless the > device explicity report otherwise, it assumes it does at first. > > Therefore before we pile up bios that would fail at the end, we try > the command/request once, as not doing so could trigger quite a > disaster in at least certain case. For example, the host controller > can be messed up entirely when one does `blkdiscard -z` a UAS drive. > > Signed-off-by: Tom Yan <tom.ty89@gmail.com> > --- > block/blk-lib.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index e90614fd8d6a..c1e9388a8fb8 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > struct bio *bio = *biop; > unsigned int max_write_zeroes_sectors; > struct request_queue *q = bdev_get_queue(bdev); > + int i = 0; > > if (!q) > return -ENXIO; > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > return -EOPNOTSUPP; > > while (nr_sects) { > - bio = blk_next_bio(bio, 0, gfp_mask); > + if (i != 1) { > + bio = blk_next_bio(bio, 0, gfp_mask); > + } else { > + submit_bio_wait(bio); > + bio_put(bio); > + > + if (bdev_write_zeroes_sectors(bdev) == 0) > + return -EOPNOTSUPP; > + else This means you now massively slow down say nvme operations by adding a wait. If at all we need a maybe supports write zeroes flag and only do that if the driver hasn't decided yet if write zeroes is actually supported.
You mean like submit_bio_wait() is a blocking operation? On Mon, 7 Dec 2020 at 21:36, Christoph Hellwig <hch@infradead.org> wrote: > > On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote: > > At least the SCSI disk driver is "benevolent" when it try to decide > > whether the device actually supports write zeroes, i.e. unless the > > device explicity report otherwise, it assumes it does at first. > > > > Therefore before we pile up bios that would fail at the end, we try > > the command/request once, as not doing so could trigger quite a > > disaster in at least certain case. For example, the host controller > > can be messed up entirely when one does `blkdiscard -z` a UAS drive. > > > > Signed-off-by: Tom Yan <tom.ty89@gmail.com> > > --- > > block/blk-lib.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c > > index e90614fd8d6a..c1e9388a8fb8 100644 > > --- a/block/blk-lib.c > > +++ b/block/blk-lib.c > > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > > struct bio *bio = *biop; > > unsigned int max_write_zeroes_sectors; > > struct request_queue *q = bdev_get_queue(bdev); > > + int i = 0; > > > > if (!q) > > return -ENXIO; > > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, > > return -EOPNOTSUPP; > > > > while (nr_sects) { > > - bio = blk_next_bio(bio, 0, gfp_mask); > > + if (i != 1) { > > + bio = blk_next_bio(bio, 0, gfp_mask); > > + } else { > > + submit_bio_wait(bio); > > + bio_put(bio); > > + > > + if (bdev_write_zeroes_sectors(bdev) == 0) > > + return -EOPNOTSUPP; > > + else > > This means you now massively slow down say nvme operations by adding > a wait. If at all we need a maybe supports write zeroes flag and > only do that if the driver hasn't decided yet if write zeroes is > actually supported.
On Sun, 2020-12-06 at 13:53 +0800, Tom Yan wrote: > At least the SCSI disk driver is "benevolent" when it try to decide > whether the device actually supports write zeroes, i.e. unless the > device explicity report otherwise, it assumes it does at first. > > Therefore before we pile up bios that would fail at the end, we try > the command/request once, as not doing so could trigger quite a > disaster in at least certain case. For example, the host controller > can be messed up entirely when one does `blkdiscard -z` a UAS drive. > It's not as simple as that. There are some SCSI devices that support WRITE ZEROES, but do not return the MAXIMUM WRITE SAME LENGTH in the block device limits VPD page. So, some commands might work, and others might not. In particular, a commonly-used hypervisor does this. The sd driver disables the use of write same if certain errors are returned (INVALID COMMAND w/ INVALID COMMAND OPCODE or INVALID FIELD IN CDB), but if you do a blkdiscard -z of an entire drive a whole lot of bios/requests are already queued by the time you get that. Higher level code checks to see if write zeroes is supported, and won't queue the requests once it is turned off, but that doesn't happen until a command fails. We also check in command setup, see my earlier patch which deals with spurious blk_update_request errors if the disablement of write same gets detected there... I explicitly did not try to fix that by "testing" with one bio for the reason Christoph mentions. -Ewan
On Tue, Dec 08, 2020 at 08:48:15PM +0800, Tom Yan wrote:
> You mean like submit_bio_wait() is a blocking operation?
Yes, it blocks to wait for the I/O to complete.
diff --git a/block/blk-lib.c b/block/blk-lib.c index e90614fd8d6a..c1e9388a8fb8 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, struct bio *bio = *biop; unsigned int max_write_zeroes_sectors; struct request_queue *q = bdev_get_queue(bdev); + int i = 0; if (!q) return -ENXIO; @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, return -EOPNOTSUPP; while (nr_sects) { - bio = blk_next_bio(bio, 0, gfp_mask); + if (i != 1) { + bio = blk_next_bio(bio, 0, gfp_mask); + } else { + submit_bio_wait(bio); + bio_put(bio); + + if (bdev_write_zeroes_sectors(bdev) == 0) + return -EOPNOTSUPP; + else + bio = bio_alloc(gfp_mask, 0); + } bio->bi_iter.bi_sector = sector; bio_set_dev(bio, bdev); bio->bi_opf = REQ_OP_WRITE_ZEROES; @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, nr_sects = 0; } cond_resched(); + i++; } *biop = bio;
At least the SCSI disk driver is "benevolent" when it try to decide whether the device actually supports write zeroes, i.e. unless the device explicity report otherwise, it assumes it does at first. Therefore before we pile up bios that would fail at the end, we try the command/request once, as not doing so could trigger quite a disaster in at least certain case. For example, the host controller can be messed up entirely when one does `blkdiscard -z` a UAS drive. Signed-off-by: Tom Yan <tom.ty89@gmail.com> --- block/blk-lib.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)