Message ID | 20230609180805.736872-1-jpittman@redhat.com (mailing list archive)
---|---
State | New, archived
Series | block: set reasonable default for discard max
John,

> Some drive manufacturers export a very large supported max discard
> size. However, when the operating system sends I/O of the max size to
> the device, extreme I/O latency can often be encountered. Since
> hardware does not provide an optimal discard value in addition to the
> max, and there is no way to foreshadow how well a drive handles the
> large size, take the method from max_sectors setting, and use
> BLK_DEF_MAX_SECTORS to set a more reasonable default discard max. This
> should avoid the extreme latency while still allowing the user to
> increase the value for specific needs.

What's reasonable for one device may be completely unreasonable for
another. 4 * BLK_DEF_MAX_SECTORS is *tiny* and will penalize performance
on many devices.

If there's a problem with a device returning something that doesn't make
sense, let's quirk it.
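As a stopgap short of a kernel-side quirk, the two limits in question can
already be inspected and capped per device from userspace; a minimal
sketch, with /dev/sdX as a placeholder device:

    # discard_max_hw_bytes is the device-advertised ceiling (read-only);
    # discard_max_bytes is the limit the kernel will actually issue (writable).
    cat /sys/block/sdX/queue/discard_max_hw_bytes
    cat /sys/block/sdX/queue/discard_max_bytes

    # Cap discards to 64 MiB for this one suspect device, no kernel change:
    echo 67108864 > /sys/block/sdX/queue/discard_max_bytes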
Thanks a lot for responding Martin. Forgive my ignorance; just trying
to gain understanding. For example, if we find a device with a 2TiB
max discard (way too high for any device to handle reasonably from
what I've seen), and we make a quirk for it that brings the max
discard down, how do we decide what value to bring that down to? Would
we ask the hardware vendor for an optimal value? Is there some way we
could decide the value? Thanks again for any help.

On Fri, Jun 9, 2023 at 2:48 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> [...]
On 6/9/23 17:22, John Pittman wrote:
> Thanks a lot for responding Martin. Forgive my ignorance; just trying
> to gain understanding. For example, if we find a device with a 2TiB
> max discard (way too high for any device to handle reasonably from
> what I've seen), and we make a quirk for it that brings the max
> discard down, how do we decide what value to bring that down to?
> Would we ask the hardware vendor for an optimal value? Is there some
> way we could decide the value? Thanks again for any help.

Asking the H/W vendor is definitely a good start, but you have to test
that value and make sure it works before adding a quirk, because in my
experience those values sometimes won't work ...

Also, in the absence of information from the H/W vendor, I'd run a few
experiments across a range of values to establish a setting that is
neither so low that it significantly degrades performance nor so high
that it creates potential problems.

-ck
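One way to run such an experiment is to time whole-device discards issued
in progressively larger chunks; a rough, destructive sketch for a scratch
device only (blkdiscard's --step option splits the range, /dev/sdX is a
placeholder):

    # DESTROYS DATA -- run only against a scratch device.
    for step in 8M 64M 512M 4G; do
            echo "step=$step"
            time blkdiscard --step "$step" /dev/sdX
    done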
John,

> Thanks a lot for responding Martin. Forgive my ignorance; just trying
> to gain understanding. For example, if we find a device with a 2TiB
> max discard (way too high for any device to handle reasonably from
> what I've seen),

That really depends. On some devices performing a 2TB discard may just
involve updating a few entries in a block translation table. It could be
close to constant time regardless of the number of terabytes unmapped.

> and we make a quirk for it that brings the max discard down, how do we
> decide what value to bring that down to? Would we ask the hardware
> vendor for an optimal value? Is there some way we could decide the
> value?

The best thing would be to ask the vendor. Or maybe provide the output
of

  # sg_vpd -p bl /dev/sdN

and we can try to see if we can come up with a suitable heuristic for a
quirk.
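For context, the Block Limits VPD page is where a SCSI device reports its
unmap/discard limits; illustrative output only (field names as printed by
sg3_utils, values made up):

    # sg_vpd -p bl /dev/sdN
    Block limits VPD page (SBC):
      ...
      Maximum unmap LBA count: 4194304
      Maximum unmap block descriptor count: 1
      Optimal unmap granularity: 8
      ...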
On Fri, Jun 09, 2023 at 02:08:05PM -0400, John Pittman wrote:
> Some drive manufacturers export a very large supported max discard size.

Most devices export a very large max discard size; it shouldn't be just
some.

> However, when the operating system sends I/O of the max size to the
> device, extreme I/O latency can often be encountered. Since hardware

It often depends on the drive's internal state.

> does not provide an optimal discard value in addition to the max, and
> there is no way to foreshadow how well a drive handles the large size,
> take the method from max_sectors setting, and use BLK_DEF_MAX_SECTORS to
> set a more reasonable default discard max. This should avoid the extreme
> latency while still allowing the user to increase the value for specific
> needs.

As Martin and Chaitanya mentioned, reducing the max discard size to the
level of BLK_DEF_MAX_SECTORS isn't an ideal approach and shouldn't be
accepted, since the discard command is supposed to be cheap from the
device's viewpoint; SSD firmware usually just marks the LBA ranges as
invalid when handling a discard.

However, how well discard performs may depend on SSD internals, such as
whether GC is in progress. It might be a more generic issue for SSDs,
and it has been discussed several times, such as "Issues around discard"
at LSF/MM 2019:

https://lwn.net/Articles/787272/

BTW, blk-wbt does throttle discard, but only at the queue depth level.
If the max discard size is too big, a single big discard request may
still make the device unresponsive.

Thanks,
Ming
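(For reference: blk-wbt's tunable is latency-based and limits how many
requests are in flight, not how large each one is; the sysfs knob below
is documented in the stable sysfs-block ABI, assuming blk-wbt is built
in for the queue in question.)

    cat /sys/block/sdX/queue/wbt_lat_usec    # 0 means writeback throttling is disabled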
FWIW the most recent discard performance issue I looked at was an early
SSD that only supported a small max discard size. Doing a discard of the
whole device (mkfs) took several minutes with lots of little discards.
So it isn't always just an issue of the max size.

-Ewan

On Mon, Jun 12, 2023 at 10:00 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> [...]
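The arithmetic behind that observation is simple; a sketch with made-up
numbers (bash arithmetic):

    # A 1 TiB device discarded in 4 MiB chunks (a small advertised max)
    # turns a single mkfs-time sweep into a long stream of commands:
    echo $(( (1024**4) / (4 * 1024**2) ))    # 262144 discard requests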
Hi Jens,

I saw in the v2 of your original patchset that made max discard rw, you
were going to set a 64M default, but Mike mentioned some issues it might
present in dm, so you said you would come back to it later and decide on
a default max [1].

In the link that Ming provided [2], the general consensus was that the
hw manufacturers should do what they say; they've sold this function as
fast, so they should make it fast. But unfortunately, from what I've
read, it still seems there are very few, if any, devices that can truly
handle a 2TiB discard well. Of course, I don't have any insight into the
hw manufacturers' intentions, but in my opinion, we shouldn't be keeping
problematic kernel limits based on a selling point by the hw
manufacturers.

So, the thought (after bugging Ming offline) is that we could take one
of the below approaches:

1) Set a 64M limit as you've suggested in the past. It seems more
   prudent to tune the value upward for the few devices that can
   actually handle a 2TiB discard rather than tune downward for the
   large majority.

2) Create a udev rule to go into RHEL that would set 64M, as we still
   steadily see support cases where max discard needs to be tuned
   downward.

In either case, I suppose we'd have to figure out a way to not impact
dm. Maybe we could do that with a driver setting or udev rule?

Jens and Mike, do you guys have any thoughts on this? Asking here
instead of the RH bug (2160828) since Jens said he wanted to come back
to this issue later in the thread mentioned. Thanks for the continued
assistance.

[1] https://lkml.org/lkml/2015/7/15/859
[2] https://lwn.net/Articles/787272/

On Mon, Jun 12, 2023 at 9:37 PM Ming Lei <ming.lei@redhat.com> wrote:
> [...]
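A rule along the lines of option 2 might look like the following; this
is a sketch only -- the file name is hypothetical, the 64 MiB value is a
placeholder, and the match keys would need narrowing to just the
affected device models rather than every disk:

    # /etc/udev/rules.d/99-discard-cap.rules (hypothetical)
    ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{queue/discard_max_bytes}="67108864"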
John,

> 1) Set a 64M limit as you've suggested in the past. It seems more
> prudent to tune the value upward for the few devices that can actually
> handle a 2TiB discard rather than tune downward for the large
> majority.

This makes the assumption that smaller always equals faster. However,
some devices track allocations in very large units. As a result, sending
smaller discards will result in *slower* performance for those devices.
In addition, for doing things like the full block device sweep we do
during mkfs, it is often imperative that we issue large sequential
ranges.

Please provide the VPD pages from the device in question.
Thanks Martin. Sorry for the wait; it took a while for the user to
respond. They don't have scsi devs but rather nvme devs. The sg_vpd
command failed because there was no bl page available. I asked them to
provide the supported vpd pages that are there. I do have some nvme data
though from a sosreport. I'll send you that in a separate email. Thanks
again.

On Tue, Jun 20, 2023 at 11:20 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> [...]
John,

> Sorry for the wait; it took a while for the user to respond. They
> don't have scsi devs but rather nvme devs. The sg_vpd command failed
> because there was no bl page available. I asked them to provide the
> supported vpd pages that are there. I do have some nvme data though
> from a sosreport. I'll send you that in a separate email.

The device vendor should be fixing their firmware to report a suitable
set of DMRL/DMRSL/DMSL values which align with the expected performance
of the device.
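Those fields (the NVM command set's Dataset Management Ranges Limit,
Range Size Limit, and Size Limit) can be read with nvme-cli, assuming a
version recent enough to carry the nvm-id-ctrl subcommand:

    # nvme nvm-id-ctrl /dev/nvme0 | grep -Ei 'dmrl|dmrsl|dmsl'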
Thanks again Martin. We're going to try to engage the hardware folks
now.

On Wed, Jul 19, 2023 at 9:58 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> [...]
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..158a1e6f1f6d 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -235,7 +235,9 @@ Description:
 		large latencies when large discards are issued, setting this
 		value lower will make Linux issue smaller discards and
 		potentially help reduce latencies induced by large discard
-		operations.
+		operations. For this reason, the max is currently defaulted to
+		four times BLK_DEF_MAX_SECTORS, but can be increased via sysfs
+		as needed.
 
 
 What:		/sys/block/<disk>/queue/discard_max_hw_bytes
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..4401c0b8477e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -179,7 +179,8 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors)
 {
 	q->limits.max_hw_discard_sectors = max_discard_sectors;
-	q->limits.max_discard_sectors = max_discard_sectors;
+	q->limits.max_discard_sectors = min(max_discard_sectors,
+					    BLK_DEF_MAX_SECTORS * 4);
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
Some drive manufacturers export a very large supported max discard
size. However, when the operating system sends I/O of the max size to
the device, extreme I/O latency can often be encountered. Since
hardware does not provide an optimal discard value in addition to the
max, and there is no way to foreshadow how well a drive handles the
large size, take the method from max_sectors setting, and use
BLK_DEF_MAX_SECTORS to set a more reasonable default discard max. This
should avoid the extreme latency while still allowing the user to
increase the value for specific needs.

Signed-off-by: John Pittman <jpittman@redhat.com>
Suggested-by: David Jeffery <djeffery@redhat.com>
---
 Documentation/ABI/stable/sysfs-block | 4 +++-
 block/blk-settings.c                 | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)
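For anyone testing a kernel carrying this patch: with BLK_DEF_MAX_SECTORS
at 2560 (the value at the time of this thread), the new default works out
to 4 * 2560 * 512 = 5 MiB, and per the ABI text above it can be raised
again per device; a sketch with /dev/sdX as a placeholder and an
illustrative 1 GiB value:

    # Inspect the capped default and the hardware ceiling:
    cat /sys/block/sdX/queue/discard_max_bytes      # 5242880 with this patch
    cat /sys/block/sdX/queue/discard_max_hw_bytes   # device-advertised max

    # Raise the soft limit for a device known to handle large discards well:
    echo 1073741824 > /sys/block/sdX/queue/discard_max_bytes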