Message ID | cc14c28a-0f0a-640a-b576-cdfbaee75dc5@ce.jp.nec.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
On 02/03/2017 12:55 AM, Junichi Nomura wrote: > I found following ext4 error occurs on a certain storage since v4.10-rc1: > EXT4-fs (sdc1): Delayed block allocation failed for inode 12 at logical offset 100 with max blocks 2 with error 121 > EXT4-fs (sdc1): This should not happen!! Data will be lost > > Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout(). > That came from sd driver because WRITE SAME was sent to the device > which didn't support it. > > The problem was introduced by commit e73c23ff736e ("block: add async > variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout > fell back to normal zero writing when WRITE SAME failed and it seems > sd driver's heuristics depends on that behaviour. CC Christoph and Chaitanya.
On Fri, Feb 03, 2017 at 08:21:31AM -0700, Jens Axboe wrote: > > Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout(). > > That came from sd driver because WRITE SAME was sent to the device > > which didn't support it. > > > > The problem was introduced by commit e73c23ff736e ("block: add async > > variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout > > fell back to normal zero writing when WRITE SAME failed and it seems > > sd driver's heuristics depends on that behaviour. > > CC Christoph and Chaitanya. And adding Martin as the sd.c Write Same code is his. I suspect we'll have to restore the old way this works for 4.10 as it's too late in the cycle, but that whole idea of trying Write Same first and just disabling it if it doesn't work is a receipe for desaster - it kinda works for a synchronous blkdev_issue_zeroout, but if we want to be able to submit it asynchronously it's getting too hairy to handle. I think we should fix sd.c to only send WRITE SAME if either of the variants are explicitly listed as supported through REPORT SUPPORTED OPERATION CODES, or maybe through a whitelist if there are important enough devices.
On 02/03/2017 09:12 AM, Christoph Hellwig wrote: > On Fri, Feb 03, 2017 at 08:21:31AM -0700, Jens Axboe wrote: >>> Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout(). >>> That came from sd driver because WRITE SAME was sent to the device >>> which didn't support it. >>> >>> The problem was introduced by commit e73c23ff736e ("block: add async >>> variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout >>> fell back to normal zero writing when WRITE SAME failed and it seems >>> sd driver's heuristics depends on that behaviour. >> >> CC Christoph and Chaitanya. > > And adding Martin as the sd.c Write Same code is his. > > I suspect we'll have to restore the old way this works for 4.10 as it's > too late in the cycle, but that whole idea of trying Write Same first > and just disabling it if it doesn't work is a receipe for desaster - > it kinda works for a synchronous blkdev_issue_zeroout, but if we want > to be able to submit it asynchronously it's getting too hairy to handle. I agree, the current approach is a hot and ugly mess. > I think we should fix sd.c to only send WRITE SAME if either of the > variants are explicitly listed as supported through > REPORT SUPPORTED OPERATION CODES, or maybe through a whitelist if > there are important enough devices. Yep
>>>>> "Jens" == Jens Axboe <axboe@kernel.dk> writes: >> I think we should fix sd.c to only send WRITE SAME if either of the >> variants are explicitly listed as supported through REPORT SUPPORTED >> OPERATION CODES, or maybe through a whitelist if there are important >> enough devices. Jens> Yep I hate it too. But the reason it's assumed on is that there is essentially no heuristic that works. Just like we assume that READ always works. Out of the ~200 devices I have access to in the lab: - 100% of the SAS/FC disk drives and SSDs support WRITE SAME - Only 2 out of about 50 different drive models support RSOC - About half of the arrays support WRITE SAME(10/16) - None of the arrays I have support RSOC So even if we were to entertain using RSOC for "enterprise" transport classes (which I concur would be nice for other reasons), it wouldn't solve the WRITE SAME problem...
On 02/03/2017 03:45 PM, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <axboe@kernel.dk> writes: > >>> I think we should fix sd.c to only send WRITE SAME if either of the >>> variants are explicitly listed as supported through REPORT SUPPORTED >>> OPERATION CODES, or maybe through a whitelist if there are important >>> enough devices. > > Jens> Yep > > I hate it too. But the reason it's assumed on is that there is > essentially no heuristic that works. Just like we assume that READ > always works. > > Out of the ~200 devices I have access to in the lab: > > - 100% of the SAS/FC disk drives and SSDs support WRITE SAME > - Only 2 out of about 50 different drive models support RSOC > - About half of the arrays support WRITE SAME(10/16) > - None of the arrays I have support RSOC > > So even if we were to entertain using RSOC for "enterprise" transport > classes (which I concur would be nice for other reasons), it wouldn't > solve the WRITE SAME problem... We're at (almost) -rc7 time, we have to do more than hand wave about this. What's the plan for 4.10 final?
On Fri, Feb 03, 2017 at 08:17:10PM -0700, Jens Axboe wrote: > We're at (almost) -rc7 time, we have to do more than hand wave about > this. What's the plan for 4.10 final? I'll send your a fix to revert the async write same changes for now, as suggested in the last mail from me.
diff --git a/block/blk-lib.c b/block/blk-lib.c index f8c82a9..8e53474 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -360,6 +360,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, bool discard) { int ret; + int pass = 0; struct bio *bio = NULL; struct blk_plug plug; @@ -369,6 +370,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return 0; } + retry_other_method: blk_start_plug(&plug); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, &bio, discard); @@ -378,6 +380,11 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, } blk_finish_plug(&plug); + if (ret && pass++ == 0) { + bio = NULL; + goto retry_other_method; + } + return ret; } EXPORT_SYMBOL(blkdev_issue_zeroout);