diff mbox

[REGRESSION,v4.10-rc1] blkdev_issue_zeroout() returns -EREMOTEIO on the first call for SCSI device that doesn't support WRITE SAME

Message ID cc14c28a-0f0a-640a-b576-cdfbaee75dc5@ce.jp.nec.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Junichi Nomura Feb. 3, 2017, 7:55 a.m. UTC
I found following ext4 error occurs on a certain storage since v4.10-rc1:
  EXT4-fs (sdc1): Delayed block allocation failed for inode 12 at logical offset 100 with max blocks 2 with error 121
  EXT4-fs (sdc1): This should not happen!! Data will be lost

Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout().
That came from sd driver because WRITE SAME was sent to the device
which didn't support it.

The problem was introduced by commit e73c23ff736e ("block: add async
variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout
fell back to normal zero writing when WRITE SAME failed and it seems
sd driver's heuristics depends on that behaviour.

Below is a band-aid fix to restore the fallback behaviour for sd. Although
there should be better fix as retrying blindly is not a good idea...

v4.10-rc6:
  # cat /sys/block/sdc/queue/write_same_max_bytes
  33553920
  # fallocate -v -z -l 512 /dev/sdc1
  fallocate: fallocate failed: Remote I/O error
  # cat /sys/block/sdc/queue/write_same_max_bytes
  0
  # fallocate -v -z -l 512 /dev/sdc1
  # echo $?
  0

v4.9 or v4.10-rc6 + this patch:
  # grep . /sys/block/sdc/queue/write_same_max_bytes
  33553920
  # fallocate -v -z -l 512 /dev/sdc1
  # echo $?
  0
  # grep . /sys/block/sdc/queue/write_same_max_bytes
  0

Comments

Jens Axboe Feb. 3, 2017, 3:21 p.m. UTC | #1
On 02/03/2017 12:55 AM, Junichi Nomura wrote:
> I found following ext4 error occurs on a certain storage since v4.10-rc1:
>   EXT4-fs (sdc1): Delayed block allocation failed for inode 12 at logical offset 100 with max blocks 2 with error 121
>   EXT4-fs (sdc1): This should not happen!! Data will be lost
> 
> Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout().
> That came from sd driver because WRITE SAME was sent to the device
> which didn't support it.
> 
> The problem was introduced by commit e73c23ff736e ("block: add async
> variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout
> fell back to normal zero writing when WRITE SAME failed and it seems
> sd driver's heuristics depends on that behaviour.

CC Christoph and Chaitanya.
Christoph Hellwig Feb. 3, 2017, 4:12 p.m. UTC | #2
On Fri, Feb 03, 2017 at 08:21:31AM -0700, Jens Axboe wrote:
> > Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout().
> > That came from sd driver because WRITE SAME was sent to the device
> > which didn't support it.
> > 
> > The problem was introduced by commit e73c23ff736e ("block: add async
> > variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout
> > fell back to normal zero writing when WRITE SAME failed and it seems
> > sd driver's heuristics depends on that behaviour.
> 
> CC Christoph and Chaitanya.

And adding Martin as the sd.c Write Same code is his.

I suspect we'll have to restore the old way this works for 4.10 as it's
too late in the cycle, but that whole idea of trying Write Same first
and just disabling it if it doesn't work is a receipe for desaster -
it kinda works for a synchronous blkdev_issue_zeroout, but if we want
to be able to submit it asynchronously it's getting too hairy to handle.

I think we should fix sd.c to only send WRITE SAME if either of the
variants are explicitly listed as supported through
REPORT SUPPORTED OPERATION CODES, or maybe through a whitelist if
there are important enough devices.
Jens Axboe Feb. 3, 2017, 4:14 p.m. UTC | #3
On 02/03/2017 09:12 AM, Christoph Hellwig wrote:
> On Fri, Feb 03, 2017 at 08:21:31AM -0700, Jens Axboe wrote:
>>> Error 121 (EREMOTEIO) was returned from blkdev_issue_zeroout().
>>> That came from sd driver because WRITE SAME was sent to the device
>>> which didn't support it.
>>>
>>> The problem was introduced by commit e73c23ff736e ("block: add async
>>> variant of blkdev_issue_zeroout"). Before the commit, blkdev_issue_zeroout
>>> fell back to normal zero writing when WRITE SAME failed and it seems
>>> sd driver's heuristics depends on that behaviour.
>>
>> CC Christoph and Chaitanya.
> 
> And adding Martin as the sd.c Write Same code is his.
> 
> I suspect we'll have to restore the old way this works for 4.10 as it's
> too late in the cycle, but that whole idea of trying Write Same first
> and just disabling it if it doesn't work is a receipe for desaster -
> it kinda works for a synchronous blkdev_issue_zeroout, but if we want
> to be able to submit it asynchronously it's getting too hairy to handle.

I agree, the current approach is a hot and ugly mess.

> I think we should fix sd.c to only send WRITE SAME if either of the
> variants are explicitly listed as supported through
> REPORT SUPPORTED OPERATION CODES, or maybe through a whitelist if
> there are important enough devices.

Yep
Martin K. Petersen Feb. 3, 2017, 10:45 p.m. UTC | #4
>>>>> "Jens" == Jens Axboe <axboe@kernel.dk> writes:

>> I think we should fix sd.c to only send WRITE SAME if either of the
>> variants are explicitly listed as supported through REPORT SUPPORTED
>> OPERATION CODES, or maybe through a whitelist if there are important
>> enough devices.

Jens> Yep

I hate it too. But the reason it's assumed on is that there is
essentially no heuristic that works. Just like we assume that READ
always works.

Out of the ~200 devices I have access to in the lab:

 - 100% of the SAS/FC disk drives and SSDs support WRITE SAME
 - Only 2 out of about 50 different drive models support RSOC
 - About half of the arrays support WRITE SAME(10/16)
 - None of the arrays I have support RSOC

So even if we were to entertain using RSOC for "enterprise" transport
classes (which I concur would be nice for other reasons), it wouldn't
solve the WRITE SAME problem...
Jens Axboe Feb. 4, 2017, 3:17 a.m. UTC | #5
On 02/03/2017 03:45 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@kernel.dk> writes:
> 
>>> I think we should fix sd.c to only send WRITE SAME if either of the
>>> variants are explicitly listed as supported through REPORT SUPPORTED
>>> OPERATION CODES, or maybe through a whitelist if there are important
>>> enough devices.
> 
> Jens> Yep
> 
> I hate it too. But the reason it's assumed on is that there is
> essentially no heuristic that works. Just like we assume that READ
> always works.
> 
> Out of the ~200 devices I have access to in the lab:
> 
>  - 100% of the SAS/FC disk drives and SSDs support WRITE SAME
>  - Only 2 out of about 50 different drive models support RSOC
>  - About half of the arrays support WRITE SAME(10/16)
>  - None of the arrays I have support RSOC
> 
> So even if we were to entertain using RSOC for "enterprise" transport
> classes (which I concur would be nice for other reasons), it wouldn't
> solve the WRITE SAME problem...

We're at (almost) -rc7 time, we have to do more than hand wave about
this. What's the plan for 4.10 final?
Christoph Hellwig Feb. 4, 2017, 8:48 a.m. UTC | #6
On Fri, Feb 03, 2017 at 08:17:10PM -0700, Jens Axboe wrote:
> We're at (almost) -rc7 time, we have to do more than hand wave about
> this. What's the plan for 4.10 final?

I'll send your a fix to revert the async write same changes for now,
as suggested in the last mail from me.
diff mbox

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f8c82a9..8e53474 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -360,6 +360,7 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			 sector_t nr_sects, gfp_t gfp_mask, bool discard)
 {
 	int ret;
+	int pass = 0;
 	struct bio *bio = NULL;
 	struct blk_plug plug;
 
@@ -369,6 +370,7 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			return 0;
 	}
 
+  retry_other_method:
 	blk_start_plug(&plug);
 	ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
 			&bio, discard);
@@ -378,6 +380,11 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	}
 	blk_finish_plug(&plug);
 
+	if (ret && pass++ == 0) {
+		bio = NULL;
+		goto retry_other_method;
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);