Message ID | 20250115224649.3973718-1-bvanassche@acm.org (mailing list archive)
---|---
Series | Improve write performance for zoned UFS devices
On 1/16/25 7:46 AM, Bart Van Assche wrote:
> Hi Damien and Christoph,
>
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup. Please help with reviewing this patch
> series.

Bart,

Given that this is a set of 14 patches, a more detailed summary here of what
the entire patch set intends to do and how it does it would really help with
the review. I do not want to have to reverse engineer your line of thought to
see if the patch set is correctly organized.

Also, patch 14 puts back a lot of the code that was recently removed. Not
nice, and its commit message is also far too short given the size of the
patch. Please spend more time explaining what the patches do and how they do
it to facilitate review.

I also fail to see why you treat request requeues as errors if you actually
expect them to happen... I do not think you need error handling and tracking
of the completion wp position: when you get a requeue event, requeue all
requests in the plug and set the plug wp position to the lowest sector of all
the requeued requests. That is necessarily the current location of the write
pointer. No need for re-introducing all the error handling for that, no?

I am going to wait for you to resend this with a better "big picture"
explanation of what you are trying to do so that I do not randomly comment on
things that I am not sure I completely understand.

Thanks.
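[For illustration, a minimal userspace model of the recovery rule Damien
describes above: because zoned writes are issued sequentially, the lowest
start sector among the requeued writes is necessarily where the device's
write pointer stopped. All names below are invented for this sketch; none of
this is blk-zoned.c code.]

/*
 * Model: restore the plug write pointer to the lowest start sector of
 * the requeued writes. Struct and function names are illustrative.
 */
#include <inttypes.h>
#include <stdio.h>

typedef uint64_t sector_t;

struct requeued_write {
	sector_t sector;	/* start sector of a requeued write */
};

static sector_t restore_plug_wp(const struct requeued_write *w, size_t nr)
{
	sector_t wp = w[0].sector;

	for (size_t i = 1; i < nr; i++)
		if (w[i].sector < wp)
			wp = w[i].sector;
	return wp;	/* the device's current write pointer position */
}

int main(void)
{
	/* Three in-flight writes were requeued, possibly out of order. */
	struct requeued_write w[] = { { 2048 }, { 1024 }, { 3072 } };

	printf("restored wp: %" PRIu64 "\n", restore_plug_wp(w, 3));
	return 0;
}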
On 1/17/25 2:47 PM, Damien Le Moal wrote:
> Also, patch 14 puts back a lot of the code that was recently removed. Not
> nice, and its commit message is also far too short given the size of the
> patch. Please spend more time explaining what the patches do and how they
> do it to facilitate review.

Regarding the following part of patch 07/14:

-	disk_zone_wplug_abort(zwplug);
-	zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+	if (!disk->queue->limits.driver_preserves_write_order)
+		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;

How about setting BLK_ZONE_WPLUG_ERROR only if
disk->queue->limits.driver_preserves_write_order has been set, such that the
restored error handling code is only used if the storage controller preserves
the write order? This should be sufficient to preserve the existing behavior
of the blk-zoned.c code for SMR disks.

The reason I restored the error handling code is that the code in blk-zoned.c
does not support error handling inside block drivers if QD > 1 and that
supporting the SCSI error handler is essential for UFS devices.

> I also fail to see why you treat request requeues as errors if you actually
> expect them to happen...

I only expect requeues to happen in exceptional cases, e.g. because the SCSI
error handler has been activated or because a SCSI device reports a unit
attention. Requeues may happen in any order, so I think that it is essential
to pause request submission after a requeue until all pending zoned writes
for the same zone have completed or failed.

> I do not think you need error handling and tracking of the completion wp
> position: when you get a requeue event, requeue all requests in the plug
> and set the plug wp position to the lowest sector of all the requeued
> requests. That is necessarily the current location of the write pointer.

That's an interesting proposal. I will look into setting the WP to the lowest
sector of all requeued requests.

> No need for re-introducing all the error handling for that, no?

The error handling code has been restored for another reason, namely to
support the SCSI error handler in case the queue depth is larger than one.

> I am going to wait for you to resend this with a better "big picture"
> explanation of what you are trying to do so that I do not randomly comment
> on things that I am not sure I completely understand.

Thanks. I will include a more detailed cover letter when I repost this series
(after the merge window has closed). While I work on this, feedback on this
version of the patch series is still welcome.

Thanks,

Bart.
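[For illustration, a userspace model of the flag policy Bart proposes above:
keep today's behavior (flag the zone so the write pointer is refreshed from
the device) when the driver does not preserve write order, and only take the
restored error-handling path when it does. The flag values and struct are
invented for this sketch; driver_preserves_write_order is the queue limit
introduced by the series under discussion, not an existing kernel field.]

#include <stdbool.h>

#define BLK_ZONE_WPLUG_NEED_WP_UPDATE	(1U << 0)
#define BLK_ZONE_WPLUG_ERROR		(1U << 1)

struct zone_wplug_model {
	unsigned int flags;
};

static void handle_zoned_write_failure(struct zone_wplug_model *zwplug,
				       bool driver_preserves_write_order)
{
	if (!driver_preserves_write_order)
		/* SMR-style devices: existing NEED_WP_UPDATE behavior. */
		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
	else
		/* Write-order-preserving drivers: enter error handling. */
		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
}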
On 1/22/25 6:57 AM, Bart Van Assche wrote:
> On 1/17/25 2:47 PM, Damien Le Moal wrote:
>> Also, patch 14 puts back a lot of the code that was recently removed. Not
>> nice, and its commit message is also far too short given the size of the
>> patch. Please spend more time explaining what the patches do and how they
>> do it to facilitate review.
>
> Regarding the following part of patch 07/14:
>
> -	disk_zone_wplug_abort(zwplug);
> -	zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> +	if (!disk->queue->limits.driver_preserves_write_order)
> +		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> +	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>
> How about setting BLK_ZONE_WPLUG_ERROR only if
> disk->queue->limits.driver_preserves_write_order has been set, such that
> the restored error handling code is only used if the storage controller
> preserves the write order? This should be sufficient to preserve the
> existing behavior of the blk-zoned.c code for SMR disks.

Your patch changes BLK_ZONE_WPLUG_NEED_WP_UPDATE into BLK_ZONE_WPLUG_ERROR
for regular (non write order preserving) zoned devices. I do not see why this
is needed. The main difference between this default behavior and what you are
trying to do is that BLK_ZONE_WPLUG_NEED_WP_UPDATE is to be expected with
your changes while it is NOT expected to ever happen for a regular zoned
device.

> The reason I restored the error handling code is that the code in
> blk-zoned.c does not support error handling inside block drivers if QD > 1
> and that supporting the SCSI error handler is essential for UFS devices.

What error handling? The only error handling we need is to tell the user that
there was an error and to track that the user actually does something about
it. That tracking is done with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. That's
it, nothing more.

I think that conflating the expected (even if infrequent) write requeue
events of zoned UFS devices with actual hard device write errors (e.g. the
write failed because the device is toast, or you hit a bad sector, or you
have a bad cable, or whatever other hardware reason there is for the write to
fail) is making a mess of everything. Treating the requeues like hard errors
makes things complicated, but also wrong, because even for zoned UFS devices
we may still get hard write errors, and treating these in the same way as
requeue events is wrong in my opinion. For hard errors, the write must not be
retried and the error must be propagated immediately to the issuer (e.g. the
FS).

For a requeue event, even though it is signaled initially at the bottom of
the stack by a device unit attention error, there is no reason for us to
treat these events as failed requests in the zone write plugging code. We
need a special treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request
flag (and that flag can be set in the SCSI mid-layer before calling into the
block layer requeue path). When the zone write plugging code sees this flag,
it can (see the sketch after this message):

1) Stop issuing write BIOs since we know they will fail
2) Wait for all requests already issued and in-flight to come back
3) Restore the zone write plug write pointer position to the lowest sector of
   all requests that were requeued
4) Re-issue the requeued writes (after somehow sorting them)
5) Re-start issuing BIOs waiting in the write plug

And any write that fails due to a hard error will still set
BLK_ZONE_WPLUG_NEED_WP_UPDATE on the zone write plug after aborting all BIOs
that are plugged in the zone write plug.

Now, I think that the biggest difficulty of this work is to make sure that a
write that fails with an unaligned write error because of a requeue event
before it can be clearly differentiated from the same failure without a
requeue event before it. For the former, we can recover. For the latter, it
is a user bug and we must NOT hide it.

Can this be done cleanly? I am not entirely sure about it because I am still
not 100% clear about the conditions that result in a zoned UFS device failing
a write with a unit attention. As far as I can tell, the first write issued
when the device is sleeping will be rejected with that error and must be
requeued. But other writes behind this failed write that are already
in-flight will end up failing with an unaligned write error, no? Or will they
be failed with a unit attention too? This is all unclear to me.
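[For illustration, a userspace model of the five-step flow Damien outlines
above. A hypothetical ZONE_WRITE_NEED_RETRY condition pauses BIO issuing;
once the last in-flight write has come back, the plug write pointer is
restored to the lowest requeued sector, the requeued writes are re-issued in
sector order, and issuing resumes. All names and fields are invented for this
sketch; none of this is kernel code.]

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct write_req {
	unsigned long long sector;
};

struct zone_plug_model {
	bool need_retry;		/* model of ZONE_WRITE_NEED_RETRY */
	unsigned long long wp;		/* write pointer tracked by the plug */
	struct write_req *requeued;	/* writes handed back by requeues */
	size_t nr_requeued;
	size_t nr_inflight;
};

static int cmp_sector(const void *a, const void *b)
{
	const struct write_req *ra = a, *rb = b;

	return (ra->sector > rb->sector) - (ra->sector < rb->sector);
}

/*
 * Step 1: a requeue event stops further BIO issuing for this zone. The
 * requeued write itself is assumed to have been added to p->requeued by
 * the caller, so nr_requeued >= 1 whenever need_retry is set.
 */
static void on_requeue(struct zone_plug_model *p)
{
	p->need_retry = true;
}

/* A write completed; recover once nothing is in flight anymore. */
static void on_write_completion(struct zone_plug_model *p,
				void (*issue)(const struct write_req *))
{
	if (--p->nr_inflight > 0 || !p->need_retry)
		return;				/* step 2: wait */

	qsort(p->requeued, p->nr_requeued, sizeof(*p->requeued), cmp_sector);
	p->wp = p->requeued[0].sector;		/* step 3: restore the wp */
	for (size_t i = 0; i < p->nr_requeued; i++)
		issue(&p->requeued[i]);		/* step 4: re-issue in order */
	p->nr_requeued = 0;
	p->need_retry = false;			/* step 5: resume plugged BIOs */
}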
On 1/22/25 8:16 PM, Damien Le Moal wrote:
> On 1/22/25 6:57 AM, Bart Van Assche wrote:
>> The reason I restored the error handling code is that the code in
>> blk-zoned.c does not support error handling inside block drivers if QD > 1
>> and that supporting the SCSI error handler is essential for UFS devices.
>
> What error handling?

The SCSI error handler. If the SCSI error handler is invoked while only one
request is outstanding, there is no chance of reordering and no additional
measures are necessary. If multiple requests are outstanding and the SCSI
error handler is invoked, care must be taken that the pending requests are
resubmitted in LBA order per zone. This is not possible without blocking
zoned write processing until all pending zoned writes have completed. Hence
the need to set BLK_ZONE_WPLUG_ERROR if write pipelining is enabled and a
request is requeued.

> I think that conflating the expected (even if infrequent) write requeue
> events of zoned UFS devices with actual hard device write errors (e.g. the
> write failed because the device is toast, or you hit a bad sector, or you
> have a bad cable, or whatever other hardware reason there is for the write
> to fail) is making a mess of everything. Treating the requeues like hard
> errors makes things complicated, but also wrong, because even for zoned UFS
> devices we may still get hard write errors, and treating these in the same
> way as requeue events is wrong in my opinion. For hard errors, the write
> must not be retried and the error must be propagated immediately to the
> issuer (e.g. the FS).
>
> For a requeue event, even though it is signaled initially at the bottom of
> the stack by a device unit attention error, there is no reason for us to
> treat these events as failed requests in the zone write plugging code. We
> need a special treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request
> flag (and that flag can be set in the SCSI mid-layer before calling into
> the block layer requeue path). When the zone write plugging code sees this
> flag, it can:
>
> 1) Stop issuing write BIOs since we know they will fail
> 2) Wait for all requests already issued and in-flight to come back
> 3) Restore the zone write plug write pointer position to the lowest sector
>    of all requests that were requeued
> 4) Re-issue the requeued writes (after somehow sorting them)
> 5) Re-start issuing BIOs waiting in the write plug
>
> And any write that fails due to a hard error will still set
> BLK_ZONE_WPLUG_NEED_WP_UPDATE on the zone write plug after aborting all
> BIOs that are plugged in the zone write plug.
>
> Now, I think that the biggest difficulty of this work is to make sure that
> a write that fails with an unaligned write error because of a requeue event
> before it can be clearly differentiated from the same failure without a
> requeue event before it. For the former, we can recover. For the latter, it
> is a user bug and we must NOT hide it.
>
> Can this be done cleanly? I am not entirely sure about it because I am
> still not 100% clear about the conditions that result in a zoned UFS device
> failing a write with a unit attention. As far as I can tell, the first
> write issued when the device is sleeping will be rejected with that error
> and must be requeued. But other writes behind this failed write that are
> already in-flight will end up failing with an unaligned write error, no? Or
> will they be failed with a unit attention too? This is all unclear to me.
I propose to keep the approach of the code in blk-zoned.c independent of the
details of how UFS devices work. That will make it more likely that the
blk-zoned.c code remains generic and that it can be reused for other storage
protocols.

Although I have not yet observed reordering due to a unit attention, I think
that unit attentions should be supported because unit attentions are such an
important SCSI mechanism.

Since UFSHCI 3 controllers use a 32-bit register in the UFSHCI controller to
indicate which requests are in flight, reordering of requests is likely to
happen when a UFSHCI controller comes out of the suspended state. When a
UFSHCI controller leaves the suspended state, it scans that 32-bit register
from the lowest to the highest bit, and hence ordering information that was
provided by the host is lost. It seems to me that the easiest way to address
this reordering is by resubmitting requests once in case the controller
reordered the requests due to leaving the suspended state. Please note that
UFSHCI 3 controllers are not my primary focus and that I would be fine with
not supporting write pipelining for these controllers if there were
disagreement about this aspect. UFSHCI 4 controllers have proper submission
queues in host memory, just like NVMe controllers, and hence request ordering
information for submission queues is not lost if the controller is suspended
and resumed.

The SCSI error handler can also cause reordering. If e.g. the UFS controller
reports a CRC error then the UFS driver will reset the UFS host controller
and activate the SCSI error handler. Pending writes must be resubmitted in
LBA order per zone after the SCSI error handler has finished in order to
prevent write errors from propagating to the file system and to user space.

At the time a UFS device reports an unaligned write error, I don't think that
it is possible to determine whether or not it is a retryable error. So
instead of introducing a ZONE_WRITE_NEED_RETRY flag, I propose the following
(see the sketch below):

* Use BLK_ZONE_WPLUG_ERROR for both hard and soft errors. Hard errors means
  errors that shouldn't be retried. Soft errors means errors that are
  retryable.
* After all pending writes have completed and before resubmitting any zoned
  writes, check whether or not these should be retried. For UFS devices this
  can be done by comparing the wp_offset_compl member with the lowest LBA of
  the pending zoned writes.

Do I remember correctly that residual information is not always reliable for
zoned writes to SMR disks and hence that another approach is needed for SMR
disks? How about setting BLK_ZONE_WPLUG_NEED_WP_UPDATE for SMR disks instead
of resubmitting writes after an unaligned write has been reported by the
storage device?

Thanks,

Bart.
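[For illustration, a userspace model of the retry test Bart proposes above:
once all pending writes have completed, compare the write pointer position
implied by the completed writes (the wp_offset_compl member from this series)
with the lowest LBA among the pending zoned writes, expressed in the same
units. A match means a retryable soft error; a mismatch means a hard error
that must be reported. The surrounding types are invented for this sketch.]

#include <stdbool.h>

struct zone_error_state {
	unsigned long long wp_offset_compl;	/* wp implied by completions */
	unsigned long long lowest_pending_lba;	/* lowest pending write LBA */
};

/*
 * True when the pending writes start exactly where the completed writes
 * stopped, i.e. they can simply be resubmitted in LBA order (soft
 * error). Otherwise the error must be propagated (hard error).
 */
static bool zoned_writes_retryable(const struct zone_error_state *s)
{
	return s->wp_offset_compl == s->lowest_pending_lba;
}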