diff mbox

[V2,10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq

Message ID 20170907161640.30465-11-damien.lemoal@wdc.com (mailing list archive)
State New, archived
Headers show

Commit Message

Damien Le Moal Sept. 7, 2017, 4:16 p.m. UTC
In the case of a ZBC disk used with scsi-mq, zone write locking does
not prevent write reordering in sequential zones. Unlike the legacy
case, zone locking can only be done after the command request is
removed from the scheduler dispatch queue. That is, at the time of
zone locking, the write command may already be out of order.

Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
used with scsi-mq. Write order guarantees can be provided by an
adapted I/O scheduler.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 drivers/scsi/sd_zbc.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Ming Lei Sept. 8, 2017, 12:43 p.m. UTC | #1
Hi Damien,

On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> In the case of a ZBC disk used with scsi-mq, zone write locking does
> not prevent write reordering in sequential zones. Unlike the legacy
> case, zone locking can only be done after the command request is
> removed from the scheduler dispatch queue. That is, at the time of
> zone locking, the write command may already be out of order.

Per my understanding, for legacy case, it can be quite tricky to let
the existed I/O scheduler guarantee the write order for ZBC disk.
I guess requeue still might cause write reorder even in legacy path,
since requeue can happen in both scsi_request_fn() and scsi_io_completion()
with q->queue_lock released, meantime new rq belonging to the same
zone can come and be inserted to queue.

> 
> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
> used with scsi-mq. Write order guarantees can be provided by an
> adapted I/O scheduler.

Sounds a good idea to enhance the order in a new scheduler, will
look at the following patch.
Damien Le Moal Sept. 8, 2017, 4:53 p.m. UTC | #2
Ming,

On 9/8/17 05:43, Ming Lei wrote:
> Hi Damien,
> 
> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
>> In the case of a ZBC disk used with scsi-mq, zone write locking does
>> not prevent write reordering in sequential zones. Unlike the legacy
>> case, zone locking can only be done after the command request is
>> removed from the scheduler dispatch queue. That is, at the time of
>> zone locking, the write command may already be out of order.
> 
> Per my understanding, for legacy case, it can be quite tricky to let
> the existed I/O scheduler guarantee the write order for ZBC disk.
> I guess requeue still might cause write reorder even in legacy path,
> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> with q->queue_lock released, meantime new rq belonging to the same
> zone can come and be inserted to queue.

Yes, the write ordering will always depend on the scheduler doing the
right thing. But both cfq, deadline and even noop do the right thing
there, even considering the aging case. The next write for a zone will
always be the oldest in the queue for that zone, if it is not, it means
that the application did not write sequentially. Extensive testing in
the legacy case never showed a problem due to the scheduler itself.

scsi_requeue_command() does the unprep (zone unlock) and requeue while
holding the queue lock. So this is atomic with new write command
insertion. Requeued commands are added to the dispatch queue head, and
since a zone will only have a single write in-flight, there is no
reordering possible. The next write command for a zone to go again is
the last requeued one or the next in lba order. It works.

Note that for write commands that failed due to an unaligned write
error, there is no retry done, so no requeue. The requeue case for
writes would only happen for other conditions (a dead drive being the
most likely in this case).

>> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
>> used with scsi-mq. Write order guarantees can be provided by an
>> adapted I/O scheduler.
> 
> Sounds a good idea to enhance the order in a new scheduler, will
> look at the following patch.

For blk-mq, I only tried mq-deadline. The zoned scheduler I posted is
based on it. There is no fundamental change to the ordering on
insertion. Only different choices on dispatch (using the zone lock).

For rotating rust and blk-mq, I think that getting calls to dispatch
serialized would naturally enhance ordering and also merging to some
extent. Ordering really gets killed when multiple context try to push
down requests, which each context ending up each with only a few
requests in their local dispatch lists. Some initial patch I wrote for
zbc that attacked the problem from within blk-mq did that serialization.
That is not mandatory anymore with the zoned scheduler, but I think
would still be benefitial to both ZBC disks and standard disks too.

Best regards.
Ming Lei Sept. 10, 2017, 5:10 a.m. UTC | #3
On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
> Ming,
> 
> On 9/8/17 05:43, Ming Lei wrote:
> > Hi Damien,
> > 
> > On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> >> In the case of a ZBC disk used with scsi-mq, zone write locking does
> >> not prevent write reordering in sequential zones. Unlike the legacy
> >> case, zone locking can only be done after the command request is
> >> removed from the scheduler dispatch queue. That is, at the time of
> >> zone locking, the write command may already be out of order.
> > 
> > Per my understanding, for legacy case, it can be quite tricky to let
> > the existed I/O scheduler guarantee the write order for ZBC disk.
> > I guess requeue still might cause write reorder even in legacy path,
> > since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> > with q->queue_lock released, meantime new rq belonging to the same
> > zone can come and be inserted to queue.
> 
> Yes, the write ordering will always depend on the scheduler doing the
> right thing. But both cfq, deadline and even noop do the right thing
> there, even considering the aging case. The next write for a zone will
> always be the oldest in the queue for that zone, if it is not, it means
> that the application did not write sequentially. Extensive testing in
> the legacy case never showed a problem due to the scheduler itself.

OK, I suggest to document this guarantee of no write reorder for ZBC
somewhere, so that people will keep it in mind when trying to change
the current code.

> 
> scsi_requeue_command() does the unprep (zone unlock) and requeue while
> holding the queue lock. So this is atomic with new write command
> insertion. Requeued commands are added to the dispatch queue head, and
> since a zone will only have a single write in-flight, there is no
> reordering possible. The next write command for a zone to go again is
> the last requeued one or the next in lba order. It works.

One special case is write with FLUSH/FUA, which may be added to
front of q->queue_head directly. Suppose one write with FUA is
just comes between requeue and run queue, write reorder may be
triggered.

If this assumption is true, there might be such issue on blk-mq
too.
Damien Le Moal Sept. 12, 2017, 8:24 a.m. UTC | #4
Ming,

On 9/10/17 14:10, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
>> Ming,
>>
>> On 9/8/17 05:43, Ming Lei wrote:
>>> Hi Damien,
>>>
>>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
>>>> In the case of a ZBC disk used with scsi-mq, zone write locking does
>>>> not prevent write reordering in sequential zones. Unlike the legacy
>>>> case, zone locking can only be done after the command request is
>>>> removed from the scheduler dispatch queue. That is, at the time of
>>>> zone locking, the write command may already be out of order.
>>>
>>> Per my understanding, for legacy case, it can be quite tricky to let
>>> the existed I/O scheduler guarantee the write order for ZBC disk.
>>> I guess requeue still might cause write reorder even in legacy path,
>>> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
>>> with q->queue_lock released, meantime new rq belonging to the same
>>> zone can come and be inserted to queue.
>>
>> Yes, the write ordering will always depend on the scheduler doing the
>> right thing. But both cfq, deadline and even noop do the right thing
>> there, even considering the aging case. The next write for a zone will
>> always be the oldest in the queue for that zone, if it is not, it means
>> that the application did not write sequentially. Extensive testing in
>> the legacy case never showed a problem due to the scheduler itself.
> 
> OK, I suggest to document this guarantee of no write reorder for ZBC
> somewhere, so that people will keep it in mind when trying to change
> the current code.

Have you looked at the comments in sd_zbc.c ? That is explained there.
Granted, this is a little deep in the stack, but this is after all
dependent on the implementation of scsi_request_fn(). I can add comments
there too if you prefer.

>> scsi_requeue_command() does the unprep (zone unlock) and requeue while
>> holding the queue lock. So this is atomic with new write command
>> insertion. Requeued commands are added to the dispatch queue head, and
>> since a zone will only have a single write in-flight, there is no
>> reordering possible. The next write command for a zone to go again is
>> the last requeued one or the next in lba order. It works.
> 
> One special case is write with FLUSH/FUA, which may be added to
> front of q->queue_head directly. Suppose one write with FUA is
> just comes between requeue and run queue, write reorder may be
> triggered.

Zoned disks are recent and all of them support FUA. This means that a
write with FUA will be processed like any other write request (if I read
the code in blk-flush.c correctly). Well, at least for the mq case,
which does a blk_mq_sched_insert_request(), while the direct call to
list_add_tail() for the legacy case is suspicious. Why not
blk_insert_request() ? This may be a problem, but all the testing I have
done on the legacy path never hit this one. The reason is I think that
the two in-kernel users of zoned devices (f2fs and dm-zoned) do not use
REQ_FUA, nor issue flush requests with data.

> If this assumption is true, there might be such issue on blk-mq
> too.

For the same above reason, I think blk-mq is OK too. But granted, this
seems all a little brittle. I will look into this case more carefully to
solidify the overall support.

Thank you for your comments.
Ming Lei Sept. 12, 2017, 9:26 a.m. UTC | #5
On Tue, Sep 12, 2017 at 05:24:02PM +0900, Damien Le Moal wrote:
> Ming,
> 
> On 9/10/17 14:10, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
> >> Ming,
> >>
> >> On 9/8/17 05:43, Ming Lei wrote:
> >>> Hi Damien,
> >>>
> >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> >>>> In the case of a ZBC disk used with scsi-mq, zone write locking does
> >>>> not prevent write reordering in sequential zones. Unlike the legacy
> >>>> case, zone locking can only be done after the command request is
> >>>> removed from the scheduler dispatch queue. That is, at the time of
> >>>> zone locking, the write command may already be out of order.
> >>>
> >>> Per my understanding, for legacy case, it can be quite tricky to let
> >>> the existed I/O scheduler guarantee the write order for ZBC disk.
> >>> I guess requeue still might cause write reorder even in legacy path,
> >>> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> >>> with q->queue_lock released, meantime new rq belonging to the same
> >>> zone can come and be inserted to queue.
> >>
> >> Yes, the write ordering will always depend on the scheduler doing the
> >> right thing. But both cfq, deadline and even noop do the right thing
> >> there, even considering the aging case. The next write for a zone will
> >> always be the oldest in the queue for that zone, if it is not, it means
> >> that the application did not write sequentially. Extensive testing in
> >> the legacy case never showed a problem due to the scheduler itself.
> > 
> > OK, I suggest to document this guarantee of no write reorder for ZBC
> > somewhere, so that people will keep it in mind when trying to change
> > the current code.
> 
> Have you looked at the comments in sd_zbc.c ? That is explained there.
> Granted, this is a little deep in the stack, but this is after all
> dependent on the implementation of scsi_request_fn(). I can add comments
> there too if you prefer.

Yeah, I looked at that, but seems it is too coarse.

> 
> >> scsi_requeue_command() does the unprep (zone unlock) and requeue while
> >> holding the queue lock. So this is atomic with new write command
> >> insertion. Requeued commands are added to the dispatch queue head, and
> >> since a zone will only have a single write in-flight, there is no
> >> reordering possible. The next write command for a zone to go again is
> >> the last requeued one or the next in lba order. It works.
> > 
> > One special case is write with FLUSH/FUA, which may be added to
> > front of q->queue_head directly. Suppose one write with FUA is
> > just comes between requeue and run queue, write reorder may be
> > triggered.
> 
> Zoned disks are recent and all of them support FUA. This means that a
> write with FUA will be processed like any other write request (if I read
> the code in blk-flush.c correctly). Well, at least for the mq case,
> which does a blk_mq_sched_insert_request(), while the direct call to

blk_mq_sched_bypass_insert() can be called for flush requests too,
since requests in flush sequence share one driver tag(rq->tag != -1),
then the rq can be added to front of hctx->dispatch directly.
Damien Le Moal Sept. 13, 2017, 12:13 a.m. UTC | #6
Ming,

On 9/12/17 18:26, Ming Lei wrote:
>>> OK, I suggest to document this guarantee of no write reorder for ZBC
>>> somewhere, so that people will keep it in mind when trying to change
>>> the current code.
>>
>> Have you looked at the comments in sd_zbc.c ? That is explained there.
>> Granted, this is a little deep in the stack, but this is after all
>> dependent on the implementation of scsi_request_fn(). I can add comments
>> there too if you prefer.
> 
> Yeah, I looked at that, but seems it is too coarse.

OK. Will improve that in V3.

>> Zoned disks are recent and all of them support FUA. This means that a
>> write with FUA will be processed like any other write request (if I read
>> the code in blk-flush.c correctly). Well, at least for the mq case,
>> which does a blk_mq_sched_insert_request(), while the direct call to
> 
> blk_mq_sched_bypass_insert() can be called for flush requests too,
> since requests in flush sequence share one driver tag(rq->tag != -1),
> then the rq can be added to front of hctx->dispatch directly.

For pure flush commands, any order is OK, since these are not actual
writes. the ZBC/ZAC standards do not modify the behavior of
flush/synchronize cache commands.

But I will look into more detail at the PREFLUSH/POSTFLUSH mechanic and
relation to write request ordering to make sure no corner cases are left
ignored.

Thanks for the comments.

Best regards.
diff mbox

Patch

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index bcece82a1d8d..6b1a7f9c1e90 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -264,6 +264,10 @@  int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	    (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
 		return BLKPREP_KILL;
 
+	/* No write locking with scsi-mq */
+	if (rq->q->mq_ops)
+		return BLKPREP_OK;
+
 	/*
 	 * There is no write constraint on conventional zones, but do not issue
 	 * more than one write at a time per sequential zone. This avoids write