block: set reasonable default for discard max

Message ID 20230609180805.736872-1-jpittman@redhat.com (mailing list archive)
State New, archived
Series: block: set reasonable default for discard max

Commit Message

John Pittman June 9, 2023, 6:08 p.m. UTC
Some drive manufacturers export a very large supported max discard size.
However, when the operating system sends I/O of the max size to the
device, extreme I/O latency can often be encountered. Since hardware
does not provide an optimal discard value in addition to the max, and
there is no way to predict how well a drive will handle the large size,
take the method from max_sectors setting, and use BLK_DEF_MAX_SECTORS to
set a more reasonable default discard max. This should avoid the extreme
latency while still allowing the user to increase the value for specific
needs.
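
As an illustration (the device name and size here are made up), a user
whose device is known to handle large discards well could restore the
full hardware limit via sysfs:

# cat /sys/block/sda/queue/discard_max_hw_bytes
2199023255552
# echo 2199023255552 > /sys/block/sda/queue/discard_max_bytes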

Signed-off-by: John Pittman <jpittman@redhat.com>
Suggested-by: David Jeffery <djeffery@redhat.com>
---
 Documentation/ABI/stable/sysfs-block | 4 +++-
 block/blk-settings.c                 | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Comments

Martin K. Petersen June 9, 2023, 6:46 p.m. UTC | #1
John,

> Some drive manufacturers export a very large supported max discard
> size. However, when the operating system sends I/O of the max size to
> the device, extreme I/O latency can often be encountered. Since
> hardware does not provide an optimal discard value in addition to the
> max, and there is no way to predict how well a drive will handle the
> large size, take the method from max_sectors setting, and use
> BLK_DEF_MAX_SECTORS to set a more reasonable default discard max. This
> should avoid the extreme latency while still allowing the user to
> increase the value for specific needs.

What's reasonable for one device may be completely unreasonable for
another. 4 * BLK_DEF_MAX_SECTORS is *tiny* and will penalize performance
on many devices.

If there's a problem with a device returning something that doesn't make
sense, let's quirk it.
John Pittman June 10, 2023, 12:22 a.m. UTC | #2
Thanks a lot for responding, Martin.  Forgive my ignorance; just trying
to gain understanding.  For example, if we find a device with a 2TiB
max discard (way too high for any device to handle reasonably from
what I've seen), and we make a quirk for it that brings the max
discard down, how do we decide what value to bring that down to?
Would we ask the hardware vendor for an optimal value?  Is there some
way we could decide the value?  Thanks again for any help.

On Fri, Jun 9, 2023 at 2:48 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> John,
>
> > Some drive manufacturers export a very large supported max discard
> > size. However, when the operating system sends I/O of the max size to
> > the device, extreme I/O latency can often be encountered. Since
> > hardware does not provide an optimal discard value in addition to the
> > max, and there is no way to predict how well a drive will handle the
> > large size, take the method from max_sectors setting, and use
> > BLK_DEF_MAX_SECTORS to set a more reasonable default discard max. This
> > should avoid the extreme latency while still allowing the user to
> > increase the value for specific needs.
>
> What's reasonable for one device may be completely unreasonable for
> another. 4 * BLK_DEF_MAX_SECTORS is *tiny* and will penalize performance
> on many devices.
>
> If there's a problem with a device returning something that doesn't make
> sense, let's quirk it.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>
Chaitanya Kulkarni June 10, 2023, 10:52 p.m. UTC | #3
On 6/9/23 17:22, John Pittman wrote:
> Thanks a lot for responding, Martin.  Forgive my ignorance; just trying
> to gain understanding.  For example, if we find a device with a 2TiB
> max discard (way too high for any device to handle reasonably from
> what I've seen), and we make a quirk for it that brings the max
> discard down, how do we decide what value to bring that down to?
> Would we ask the hardware vendor for an optimal value?  Is there some
> way we could decide the value?  Thanks again for any help.

Asking the H/W vendor is definitely a good start, but you have to test
that value and make sure it works before adding a quirk, because in my
experience those values sometimes won't work ...

Also, in the absence of information from the H/W vendor, I'd run a few
experiments across a range of values to establish a setting that is
neither so low that it significantly degrades performance nor so high
that it creates potential problems.
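
For example (illustrative only -- blkdiscard is destructive, so use a
scratch device), comparing elapsed time at a few different settings:

# echo 134217728 > /sys/block/nvme0n1/queue/discard_max_bytes
# time blkdiscard /dev/nvme0n1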

-ck
Martin K. Petersen June 13, 2023, 12:55 a.m. UTC | #4
John,

> Thanks a lot for responding, Martin.  Forgive my ignorance; just trying
> to gain understanding.  For example, if we find a device with a 2TiB
> max discard (way too high for any device to handle reasonably from
> what I've seen),

That really depends. On some devices performing a 2TB discard may just
involve updating a few entries in a block translation table. It could be
close to constant time regardless of the number of terabytes unmapped.

> and we make a quirk for it that brings the max discard down, how do we
> decide what value to bring that down to? Would we ask the hardware
> vendor for an optimal value? Is there some way we could decide the
> value?

The best thing would be to ask the vendor. Or maybe if you provide the
output of

# sg_vpd -p bl /dev/sdN

and we can try to see if we can come up with a suitable heuristic for a
quirk.
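
For reference, the fields of interest in that page look something like
this (the values here are illustrative, not from a real device):

Maximum unmap LBA count: 4194304
Maximum unmap block descriptor count: 1
Optimal unmap granularity: 256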
Ming Lei June 13, 2023, 1:37 a.m. UTC | #5
On Fri, Jun 09, 2023 at 02:08:05PM -0400, John Pittman wrote:
> Some drive manufacturers export a very large supported max discard size.

Most devices export a very large max discard size, not just some.

> However, when the operating system sends I/O of the max size to the
> device, extreme I/O latency can often be encountered. Since hardware

It often depends on drive internal state.

> does not provide an optimal discard value in addition to the max, and
> there is no way to predict how well a drive will handle the large size,
> take the method from max_sectors setting, and use BLK_DEF_MAX_SECTORS to
> set a more reasonable default discard max. This should avoid the extreme
> latency while still allowing the user to increase the value for specific
> needs.

As Martin and Chaitanya mentioned, reducing the max discard size to the
level of BLK_DEF_MAX_SECTORS isn't an ideal approach and shouldn't be
accepted, since the discard command is supposed to be cheap from the
device's viewpoint; SSD FW usually just marks the LBA ranges as invalid
to handle a discard.

However, how well discard performs may depend on SSD internals, such as
whether GC is in progress. It might be a more generic issue for SSDs,
and it was discussed several times, such as "Issues around discard" at
LSF/MM 2019:

	https://lwn.net/Articles/787272/

BTW, blk-wbt actually throttles discard, but only at the queue depth
level. If the max discard size is too big, a single big discard request
may still make the device unresponsive.


Thanks,
Ming
Ewan Milne June 13, 2023, 11:23 a.m. UTC | #6
FWIW the most recent discard performance issue I looked at was
an early SSD that only supported a small max discard size.  Doing
a discard of the whole device (mkfs) took several minutes with
lots of little discards.  So it isn't always just an issue of the max
size being too large.

-Ewan

On Mon, Jun 12, 2023 at 10:00 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> John,
>
> > Thanks a lot for responding, Martin.  Forgive my ignorance; just trying
> > to gain understanding.  For example, if we find a device with a 2TiB
> > max discard (way too high for any device to handle reasonably from
> > what I've seen),
>
> That really depends. On some devices performing a 2TB discard may just
> involve updating a few entries in a block translation table. It could be
> close to constant time regardless of the number of terabytes unmapped.
>
> > and we make a quirk for it that brings the max discard down, how do we
> > decide what value to bring that down to? Would we ask the hardware
> > vendor for an optimal value? Is there some way we could decide the
> > value?
>
> The best thing would be to ask the vendor. Or maybe if you provide the
> output of
>
> # sg_vpd -p bl /dev/sdN
>
> and we can try to see if we can come up with a suitable heuristic for a
> quirk.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>
John Pittman June 19, 2023, 4:02 p.m. UTC | #7
Hi Jens, I saw in v2 of your original patchset that made max discard
rw that you were going to set a 64M default, but Mike mentioned some
issues it might present in dm, so you said you would come back to it
later and decide on a default max [1].  In the link that Ming provided
[2], the general consensus was that the hw manufacturers should do
what they say: they've sold this function as fast, so they should make
it fast.  But unfortunately, from what I've read, it still seems there
are very few, if any, devices that can truly handle a 2TiB discard
well.  Of course, I don't have any insight into the hw manufacturers'
intentions, but in my opinion we shouldn't be keeping problematic
kernel limits based on a selling point made by the hw manufacturers.

So, the thought (after bugging Ming offline) is that we could take one
of the approaches below:

1) Set a 64M limit as you've suggested in the past.  It seems more
prudent to tune the value upward for the few devices that can actually
handle a 2TiB discard rather than tune downward for the large
majority.
2) Create a udev rule to go into RHEL that would set 64M, as we still
steadily see support cases where the max discard needs to be tuned
downward.

In either case, I suppose we'd have to figure out a way to not impact
dm.  Maybe we could do that with a driver setting or a udev rule like
the sketch below?
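
A rough sketch of that rule (untested; the 64M value is just an
example, and the match would need refining to skip dm devices):

ACTION=="add|change", SUBSYSTEM=="block", \
  ATTR{queue/discard_max_bytes}=="?*", \
  ATTR{queue/discard_max_bytes}="67108864"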

Jens and Mike, do you guys have any thoughts on this?  Asking here
instead of in the RH bug (2160828) since Jens said he wanted to come
back to this issue later in the thread mentioned above.

Thanks for the continued assistance.

[1] https://lkml.org/lkml/2015/7/15/859
[2] https://lwn.net/Articles/787272/

On Mon, Jun 12, 2023 at 9:37 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, Jun 09, 2023 at 02:08:05PM -0400, John Pittman wrote:
> > Some drive manufacturers export a very large supported max discard size.
>
> Most devices export a very large max discard size, not just some.
>
> > However, when the operating system sends I/O of the max size to the
> > device, extreme I/O latency can often be encountered. Since hardware
>
> It often depends on drive internal state.
>
> > does not provide an optimal discard value in addition to the max, and
> > there is no way to predict how well a drive will handle the large size,
> > take the method from max_sectors setting, and use BLK_DEF_MAX_SECTORS to
> > set a more reasonable default discard max. This should avoid the extreme
> > latency while still allowing the user to increase the value for specific
> > needs.
>
> As Martin and Chaitanya mentioned, reducing the max discard size to the
> level of BLK_DEF_MAX_SECTORS isn't an ideal approach and shouldn't be
> accepted, since the discard command is supposed to be cheap from the
> device's viewpoint; SSD FW usually just marks the LBA ranges as invalid
> to handle a discard.
>
> However, how well discard performs may depend on SSD internals, such as
> whether GC is in progress. It might be a more generic issue for SSDs,
> and it was discussed several times, such as "Issues around discard" at
> LSF/MM 2019:
>
>         https://lwn.net/Articles/787272/
>
> BTW, blk-wbt actually throttles discard, but only at the queue depth
> level. If the max discard size is too big, a single big discard request
> may still make the device unresponsive.
>
>
> Thanks,
> Ming
>
Martin K. Petersen June 20, 2023, 3:06 p.m. UTC | #8
John,

> 1) Set a 64M limit as you've suggested in the past.  It seems more
> prudent to tune the value upward for the few devices that can actually
> handle a 2TiB discard rather than tune downward for the large
> majority.

This makes the assumption that smaller always equals faster. However,
some devices track allocations in very large units. And as a result,
sending smaller discards will result in *slower* performance for those
devices. In addition, for doing things like the full block device sweep
we do during mkfs, it is often imperative that we issue large sequential
ranges.

Please provide the VPD pages from the device in question.
John Pittman July 14, 2023, 5:15 p.m. UTC | #9
Thanks Martin.  Sorry for the wait; it took a while for the user to
respond.  They don't have scsi devs but rather nvme devs.  The sg_vpd
command failed b/c there was no bl page available.  I asked them to
provide the supported vpd pages that are there.  I do have some nvme
data though from a sosreport.  I'll send you that in a separate email.
Thanks again.

On Tue, Jun 20, 2023 at 11:20 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> John,
>
> > 1) Set a 64M limit as you've suggested in the past.  It seems more
> > prudent to tune the value upward for the few devices that can actually
> > handle a 2TiB discard rather than tune downward for the large
> > majority.
>
> This makes the assumption that smaller always equals faster. However,
> some devices track allocations in very large units. And as a result,
> sending smaller discards will result in *slower* performance for those
> devices. In addition, for doing things like the full block device sweep
> we do during mkfs, it is often imperative that we issue large sequential
> ranges.
>
> Please provide the VPD pages from the device in question.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>
Martin K. Petersen July 20, 2023, 1:58 a.m. UTC | #10
John,

> Sorry for the wait; it took a while for the user to respond. They
> don't have scsi devs but rather nvme devs. The sg_vpd command failed
> because there was no bl page available. I asked them to provide the
> supported vpd pages that are there. I do have some nvme data though
> from a sosreport. I'll send you that in a separate email.

The device vendor should be fixing their firmware to report a suitable
set of DMRL/DMRSL/DMSL values which align with the expected performance
of the device.
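
For nvme, those values live in the NVM command set identify controller
data; a reasonably recent nvme-cli should be able to dump them with
something like (exact subcommand may vary by nvme-cli version):

# nvme nvm-id-ctrl /dev/nvme0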
John Pittman July 20, 2023, 2:43 p.m. UTC | #11
Thanks again Martin.  We're going to try to engage the hardware folks now.

On Wed, Jul 19, 2023 at 9:58 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> John,
>
> > Sorry for the wait; it took a while for the user to respond. They
> > don't have scsi devs but rather nvme devs. The sg_vpd command failed
> > because there was no bl page available. I asked them to provide the
> > supported vpd pages that are there. I do have some nvme data though
> > from a sosreport. I'll send you that in a separate email.
>
> The device vendor should be fixing their firmware to report a suitable
> set of DMRL/DMRSL/DMSL values which align with the expected performance
> of the device.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>

Patch

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..158a1e6f1f6d 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -235,7 +235,9 @@  Description:
 		large latencies when large discards are issued, setting this
 		value lower will make Linux issue smaller discards and
 		potentially help reduce latencies induced by large discard
-		operations.
+		operations. For this reason, the max is currently defaulted to
+		four times BLK_DEF_MAX_SECTORS, but can be increased via sysfs
+		as needed.
 
 
 What:		/sys/block/<disk>/queue/discard_max_hw_bytes
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..4401c0b8477e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -179,7 +179,8 @@  void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors)
 {
 	q->limits.max_hw_discard_sectors = max_discard_sectors;
-	q->limits.max_discard_sectors = max_discard_sectors;
+	q->limits.max_discard_sectors = min(max_discard_sectors,
+						BLK_DEF_MAX_SECTORS * 4);
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);