Message ID | 20170907161640.30465-11-damien.lemoal@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Damien, On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: > In the case of a ZBC disk used with scsi-mq, zone write locking does > not prevent write reordering in sequential zones. Unlike the legacy > case, zone locking can only be done after the command request is > removed from the scheduler dispatch queue. That is, at the time of > zone locking, the write command may already be out of order. Per my understanding, for legacy case, it can be quite tricky to let the existed I/O scheduler guarantee the write order for ZBC disk. I guess requeue still might cause write reorder even in legacy path, since requeue can happen in both scsi_request_fn() and scsi_io_completion() with q->queue_lock released, meantime new rq belonging to the same zone can come and be inserted to queue. > > Disable zone write locking in sd_zbc_write_lock_zone() if the disk is > used with scsi-mq. Write order guarantees can be provided by an > adapted I/O scheduler. Sounds a good idea to enhance the order in a new scheduler, will look at the following patch.
Ming, On 9/8/17 05:43, Ming Lei wrote: > Hi Damien, > > On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: >> In the case of a ZBC disk used with scsi-mq, zone write locking does >> not prevent write reordering in sequential zones. Unlike the legacy >> case, zone locking can only be done after the command request is >> removed from the scheduler dispatch queue. That is, at the time of >> zone locking, the write command may already be out of order. > > Per my understanding, for legacy case, it can be quite tricky to let > the existed I/O scheduler guarantee the write order for ZBC disk. > I guess requeue still might cause write reorder even in legacy path, > since requeue can happen in both scsi_request_fn() and scsi_io_completion() > with q->queue_lock released, meantime new rq belonging to the same > zone can come and be inserted to queue. Yes, the write ordering will always depend on the scheduler doing the right thing. But both cfq, deadline and even noop do the right thing there, even considering the aging case. The next write for a zone will always be the oldest in the queue for that zone, if it is not, it means that the application did not write sequentially. Extensive testing in the legacy case never showed a problem due to the scheduler itself. scsi_requeue_command() does the unprep (zone unlock) and requeue while holding the queue lock. So this is atomic with new write command insertion. Requeued commands are added to the dispatch queue head, and since a zone will only have a single write in-flight, there is no reordering possible. The next write command for a zone to go again is the last requeued one or the next in lba order. It works. Note that for write commands that failed due to an unaligned write error, there is no retry done, so no requeue. The requeue case for writes would only happen for other conditions (a dead drive being the most likely in this case). >> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is >> used with scsi-mq. Write order guarantees can be provided by an >> adapted I/O scheduler. > > Sounds a good idea to enhance the order in a new scheduler, will > look at the following patch. For blk-mq, I only tried mq-deadline. The zoned scheduler I posted is based on it. There is no fundamental change to the ordering on insertion. Only different choices on dispatch (using the zone lock). For rotating rust and blk-mq, I think that getting calls to dispatch serialized would naturally enhance ordering and also merging to some extent. Ordering really gets killed when multiple context try to push down requests, which each context ending up each with only a few requests in their local dispatch lists. Some initial patch I wrote for zbc that attacked the problem from within blk-mq did that serialization. That is not mandatory anymore with the zoned scheduler, but I think would still be benefitial to both ZBC disks and standard disks too. Best regards.
On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote: > Ming, > > On 9/8/17 05:43, Ming Lei wrote: > > Hi Damien, > > > > On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: > >> In the case of a ZBC disk used with scsi-mq, zone write locking does > >> not prevent write reordering in sequential zones. Unlike the legacy > >> case, zone locking can only be done after the command request is > >> removed from the scheduler dispatch queue. That is, at the time of > >> zone locking, the write command may already be out of order. > > > > Per my understanding, for legacy case, it can be quite tricky to let > > the existed I/O scheduler guarantee the write order for ZBC disk. > > I guess requeue still might cause write reorder even in legacy path, > > since requeue can happen in both scsi_request_fn() and scsi_io_completion() > > with q->queue_lock released, meantime new rq belonging to the same > > zone can come and be inserted to queue. > > Yes, the write ordering will always depend on the scheduler doing the > right thing. But both cfq, deadline and even noop do the right thing > there, even considering the aging case. The next write for a zone will > always be the oldest in the queue for that zone, if it is not, it means > that the application did not write sequentially. Extensive testing in > the legacy case never showed a problem due to the scheduler itself. OK, I suggest to document this guarantee of no write reorder for ZBC somewhere, so that people will keep it in mind when trying to change the current code. > > scsi_requeue_command() does the unprep (zone unlock) and requeue while > holding the queue lock. So this is atomic with new write command > insertion. Requeued commands are added to the dispatch queue head, and > since a zone will only have a single write in-flight, there is no > reordering possible. The next write command for a zone to go again is > the last requeued one or the next in lba order. It works. One special case is write with FLUSH/FUA, which may be added to front of q->queue_head directly. Suppose one write with FUA is just comes between requeue and run queue, write reorder may be triggered. If this assumption is true, there might be such issue on blk-mq too.
Ming, On 9/10/17 14:10, Ming Lei wrote: > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote: >> Ming, >> >> On 9/8/17 05:43, Ming Lei wrote: >>> Hi Damien, >>> >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: >>>> In the case of a ZBC disk used with scsi-mq, zone write locking does >>>> not prevent write reordering in sequential zones. Unlike the legacy >>>> case, zone locking can only be done after the command request is >>>> removed from the scheduler dispatch queue. That is, at the time of >>>> zone locking, the write command may already be out of order. >>> >>> Per my understanding, for legacy case, it can be quite tricky to let >>> the existed I/O scheduler guarantee the write order for ZBC disk. >>> I guess requeue still might cause write reorder even in legacy path, >>> since requeue can happen in both scsi_request_fn() and scsi_io_completion() >>> with q->queue_lock released, meantime new rq belonging to the same >>> zone can come and be inserted to queue. >> >> Yes, the write ordering will always depend on the scheduler doing the >> right thing. But both cfq, deadline and even noop do the right thing >> there, even considering the aging case. The next write for a zone will >> always be the oldest in the queue for that zone, if it is not, it means >> that the application did not write sequentially. Extensive testing in >> the legacy case never showed a problem due to the scheduler itself. > > OK, I suggest to document this guarantee of no write reorder for ZBC > somewhere, so that people will keep it in mind when trying to change > the current code. Have you looked at the comments in sd_zbc.c ? That is explained there. Granted, this is a little deep in the stack, but this is after all dependent on the implementation of scsi_request_fn(). I can add comments there too if you prefer. >> scsi_requeue_command() does the unprep (zone unlock) and requeue while >> holding the queue lock. So this is atomic with new write command >> insertion. Requeued commands are added to the dispatch queue head, and >> since a zone will only have a single write in-flight, there is no >> reordering possible. The next write command for a zone to go again is >> the last requeued one or the next in lba order. It works. > > One special case is write with FLUSH/FUA, which may be added to > front of q->queue_head directly. Suppose one write with FUA is > just comes between requeue and run queue, write reorder may be > triggered. Zoned disks are recent and all of them support FUA. This means that a write with FUA will be processed like any other write request (if I read the code in blk-flush.c correctly). Well, at least for the mq case, which does a blk_mq_sched_insert_request(), while the direct call to list_add_tail() for the legacy case is suspicious. Why not blk_insert_request() ? This may be a problem, but all the testing I have done on the legacy path never hit this one. The reason is I think that the two in-kernel users of zoned devices (f2fs and dm-zoned) do not use REQ_FUA, nor issue flush requests with data. > If this assumption is true, there might be such issue on blk-mq > too. For the same above reason, I think blk-mq is OK too. But granted, this seems all a little brittle. I will look into this case more carefully to solidify the overall support. Thank you for your comments.
On Tue, Sep 12, 2017 at 05:24:02PM +0900, Damien Le Moal wrote: > Ming, > > On 9/10/17 14:10, Ming Lei wrote: > > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote: > >> Ming, > >> > >> On 9/8/17 05:43, Ming Lei wrote: > >>> Hi Damien, > >>> > >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: > >>>> In the case of a ZBC disk used with scsi-mq, zone write locking does > >>>> not prevent write reordering in sequential zones. Unlike the legacy > >>>> case, zone locking can only be done after the command request is > >>>> removed from the scheduler dispatch queue. That is, at the time of > >>>> zone locking, the write command may already be out of order. > >>> > >>> Per my understanding, for legacy case, it can be quite tricky to let > >>> the existed I/O scheduler guarantee the write order for ZBC disk. > >>> I guess requeue still might cause write reorder even in legacy path, > >>> since requeue can happen in both scsi_request_fn() and scsi_io_completion() > >>> with q->queue_lock released, meantime new rq belonging to the same > >>> zone can come and be inserted to queue. > >> > >> Yes, the write ordering will always depend on the scheduler doing the > >> right thing. But both cfq, deadline and even noop do the right thing > >> there, even considering the aging case. The next write for a zone will > >> always be the oldest in the queue for that zone, if it is not, it means > >> that the application did not write sequentially. Extensive testing in > >> the legacy case never showed a problem due to the scheduler itself. > > > > OK, I suggest to document this guarantee of no write reorder for ZBC > > somewhere, so that people will keep it in mind when trying to change > > the current code. > > Have you looked at the comments in sd_zbc.c ? That is explained there. > Granted, this is a little deep in the stack, but this is after all > dependent on the implementation of scsi_request_fn(). I can add comments > there too if you prefer. Yeah, I looked at that, but seems it is too coarse. > > >> scsi_requeue_command() does the unprep (zone unlock) and requeue while > >> holding the queue lock. So this is atomic with new write command > >> insertion. Requeued commands are added to the dispatch queue head, and > >> since a zone will only have a single write in-flight, there is no > >> reordering possible. The next write command for a zone to go again is > >> the last requeued one or the next in lba order. It works. > > > > One special case is write with FLUSH/FUA, which may be added to > > front of q->queue_head directly. Suppose one write with FUA is > > just comes between requeue and run queue, write reorder may be > > triggered. > > Zoned disks are recent and all of them support FUA. This means that a > write with FUA will be processed like any other write request (if I read > the code in blk-flush.c correctly). Well, at least for the mq case, > which does a blk_mq_sched_insert_request(), while the direct call to blk_mq_sched_bypass_insert() can be called for flush requests too, since requests in flush sequence share one driver tag(rq->tag != -1), then the rq can be added to front of hctx->dispatch directly.
Ming, On 9/12/17 18:26, Ming Lei wrote: >>> OK, I suggest to document this guarantee of no write reorder for ZBC >>> somewhere, so that people will keep it in mind when trying to change >>> the current code. >> >> Have you looked at the comments in sd_zbc.c ? That is explained there. >> Granted, this is a little deep in the stack, but this is after all >> dependent on the implementation of scsi_request_fn(). I can add comments >> there too if you prefer. > > Yeah, I looked at that, but seems it is too coarse. OK. Will improve that in V3. >> Zoned disks are recent and all of them support FUA. This means that a >> write with FUA will be processed like any other write request (if I read >> the code in blk-flush.c correctly). Well, at least for the mq case, >> which does a blk_mq_sched_insert_request(), while the direct call to > > blk_mq_sched_bypass_insert() can be called for flush requests too, > since requests in flush sequence share one driver tag(rq->tag != -1), > then the rq can be added to front of hctx->dispatch directly. For pure flush commands, any order is OK, since these are not actual writes. the ZBC/ZAC standards do not modify the behavior of flush/synchronize cache commands. But I will look into more detail at the PREFLUSH/POSTFLUSH mechanic and relation to write request ordering to make sure no corner cases are left ignored. Thanks for the comments. Best regards.
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index bcece82a1d8d..6b1a7f9c1e90 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -264,6 +264,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors) return BLKPREP_KILL; + /* No write locking with scsi-mq */ + if (rq->q->mq_ops) + return BLKPREP_OK; + /* * There is no write constraint on conventional zones, but do not issue * more than one write at a time per sequential zone. This avoids write