Message ID | CAGnHSEmDQ6xaJsw9i=v23BffWQ8RncrgNBsGWy3ziC36QK+9yQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom> Today I check if blkdiscard really does a full device trim/wipe for
Tom> my Intel 530 SSD (240gb) with hexdump. I end up found that it fail
Tom> to do so because it report garbage info on its block limits VPD.
It is a SATA-attached drive, it has no block limits VPD. What you are
seeing is information prepared by libata's SATL.
Tom> The fact is, in each iteration, for this drive, blkdiscard can only
Tom> trim a maximum of 65528 sectors, which is the largest multiple of 8
Tom> sectors, which is the minimum possible.
Maximum write same length: 0x3fffc0 blocks
That's 2GB-e per request for a drive with 512-byte logical blocks.
Tom> but still I would like to know why the kernel doesn't allow
Tom> "discard_granularity" and "discard_max_bytes" to be configurable
Tom> for users.
Because if the vendor got these trivial values wrong there is little to
no chance that they implemented discard correctly in their firmware.
On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote: > It is a SATA-attached drive, it has no block limits VPD. What you are > seeing is information prepared by libata's SATL. > > Because if the vendor got these trivial values wrong there is little to > no chance that they implemented discard correctly in their firmware. I don't get it. So there's a chance that the VPDs is not purely from the hardware? Then how can I differentiate them? But then you said "the vendor got these trivial values wrong", so were you talking about this drive or just for real SCSI drives? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
By the way do you think it could be a bug of libata's SATL anyway? Like perhaps it should break a single unmap request to multiple ATA commands? I am not totally sure about it but it looks like there's a limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes, which is 65535). On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote: > On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote: >> It is a SATA-attached drive, it has no block limits VPD. What you are >> seeing is information prepared by libata's SATL. >> >> Because if the vendor got these trivial values wrong there is little to >> no chance that they implemented discard correctly in their firmware. > > I don't get it. So there's a chance that the VPDs is not purely from > the hardware? Then how can I differentiate them? But then you said > "the vendor got these trivial values wrong", so were you talking about > this drive or just for real SCSI drives? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=18f0f97850059303ed73b1f02084f55ca330a80c https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=e78db4dfb1355a895f7ea50133b702b55b8ed184 So I think I found the answer myself. But I don't see the rationale here at all. So values (arbitrary?) of granularity and maximum has been hardcoded in the kernel for ALL devices which make use of libata, and the parameter cannot be changed by user in runtime? So does it expect all devices to be of the same spec? Or is it just like a draft/example which people can only consider it as "may work"? On 21 June 2015 at 16:05, Tom Yan <tom.ty89@gmail.com> wrote: > By the way do you think it could be a bug of libata's SATL anyway? > Like perhaps it should break a single unmap request to multiple ATA > commands? I am not totally sure about it but it looks like there's a > limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes, > which is 65535). > > On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote: >> On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>> It is a SATA-attached drive, it has no block limits VPD. What you are >>> seeing is information prepared by libata's SATL. >>> >>> Because if the vendor got these trivial values wrong there is little to >>> no chance that they implemented discard correctly in their firmware. >> >> I don't get it. So there's a chance that the VPDs is not purely from >> the hardware? Then how can I differentiate them? But then you said >> "the vendor got these trivial values wrong", so were you talking about >> this drive or just for real SCSI drives? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
After rechecking the result of blkdiscard (without specifying --step) with hexdump, I see that the single ioctl is actually broken into multiple TRIM commands/ranges by 65535 sector. I wonder if it will be further tuned by ANY KIND of granularity though. Though it will probably only be a problem for device TRIM, and for that you can probably always rely on blkdiscard with --step (and avoid device TRIM with things like mkfs.btrfs) so I guess it won't be a big problem. And it will be fine for filesystem TRIM anyway I guess. So it seems that "Maximum write same length: 0x3fffc0 blocks" doesn't really matter (In fact when I ran `sg_write_same -U --lba=0 --num=0x7fffffff`, the command also trim the first 65528 sectors of the drive). However, I am curious what is the meaning of "Our current TRIM payload is a single sector that can accommodate 64 * 65535 blocks being unmapped. Report this value in the Block Limits Maximum Unmap LBA count field." in the commit message. What does the value actually mean/affect? Could it cause trouble to certain drives? By the way I think I got the answer for my USB TRIM question. Basically for USB->SATA drives the SATL is implemented in the bridges, so it must be able to "SAT" one of the three SCSI unmap methods to ATA TRIM, just like libata does, so that it can handle whatever requested through the scsi disk driver, otherwise one can only TRIM it through SCSI's ATA pass-through commands if the bridge supports them, e.g. hdparm. So my dirty hacking was silly. P.S.: [tom@localhost ~]$ sudo sg_vpd -p ai /dev/sda ATA information VPD page: SAT Vendor identification: linux SAT Product identification: libata SAT Product revision level: 3.00 So it actually tells. And I hope that the kernel wouldn't "falsify" anything for devices which do provide some VPD(s). : \ Another P.S.: I made a mistake in one of my last mails. 65535 is two bytes, not four bytes. : / On 21 June 2015 at 20:36, Tom Yan <tom.ty89@gmail.com> wrote: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=18f0f97850059303ed73b1f02084f55ca330a80c > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-scsi.c?id=e78db4dfb1355a895f7ea50133b702b55b8ed184 > > So I think I found the answer myself. But I don't see the rationale > here at all. So values (arbitrary?) of granularity and maximum has > been hardcoded in the kernel for ALL devices which make use of libata, > and the parameter cannot be changed by user in runtime? So does it > expect all devices to be of the same spec? Or is it just like a > draft/example which people can only consider it as "may work"? > > On 21 June 2015 at 16:05, Tom Yan <tom.ty89@gmail.com> wrote: >> By the way do you think it could be a bug of libata's SATL anyway? >> Like perhaps it should break a single unmap request to multiple ATA >> commands? I am not totally sure about it but it looks like there's a >> limit of addressed blocks in a single ATA DSM/TRIM command (4 bytes, >> which is 65535). >> >> On 21 June 2015 at 15:03, Tom Yan <tom.ty89@gmail.com> wrote: >>> On 21 June 2015 at 08:20, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>> It is a SATA-attached drive, it has no block limits VPD. What you are >>>> seeing is information prepared by libata's SATL. >>>> >>>> Because if the vendor got these trivial values wrong there is little to >>>> no chance that they implemented discard correctly in their firmware. >>> >>> I don't get it. So there's a chance that the VPDs is not purely from >>> the hardware? Then how can I differentiate them? But then you said >>> "the vendor got these trivial values wrong", so were you talking about >>> this drive or just for real SCSI drives? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom> However, I am curious what is the meaning of "Our current TRIM
Tom> payload is a single sector that can accommodate 64 * 65535 blocks
Tom> being unmapped. Report this value in the Block Limits Maximum Unmap
Tom> LBA count field." in the commit message. What does the value
Tom> actually mean/affect? Could it cause trouble to certain drives?
DSM Trim takes a payload of <block nr, block count> ranges. The block
count is constrained to 65535 by virtue of being 16 bits. We can fit 64
8-byte range descriptors in a single 512-byte sector payload. That's why
we set the limit for a single, contiguous discard request to
0x3fffc0. That results in 2GB minus change for a single command.
Tom> By the way I think I got the answer for my USB TRIM question.
Tom> Basically for USB->SATA drives the SATL is implemented in the
Tom> bridges, so it must be able to "SAT" one of the three SCSI unmap
Tom> methods to ATA TRIM, just like libata does,
Yep. I have set to see one that gets that right, though. There are even
many enterprise SAS/RAID controllers that don't support UNMAP-TRIM
translation.
Linux' libata was one of the first implementations of SCSI-ATA
translation for TRIM. Initially we bent the rules for UNMAP a bit. When
the standards caught up we switched to WRITE SAME.
Tom> So it actually tells. And I hope that the kernel wouldn't "falsify"
Tom> anything for devices which do provide some VPD(s). : \
SATA doesn't have VPDs.
Still, I wonder if the kernel should allow the user to configure the real granularity and send ATA commands that are rounded off by it (which works just like --step in blkdiscard). For example, (64 * 65535) % 8 = 0 but 65535 % 8 = 7 So the block count limit for the ATA commands would be 65535 - (65535 % real_gran) per range. Also, by experimenting on my SanDisk Extreme USB with hdparm TRIM and hexdump, I can see that the maximum block/sector allowed to be discarded per ATA command is 65536, inspite of the payload size. I don't know whether the USB bridging or the way hdparm does TRIM matters, but it seems that some devices can't really handle limit like 0x3fffc0 blocks. So I still think that the kernel should allow users to configure limits that can make libata send reliable ATA TRIM commands. On 23 June 2015 at 04:57, Martin K. Petersen <martin.petersen@oracle.com> wrote: > > Tom> So it actually tells. And I hope that the kernel wouldn't "falsify" > Tom> anything for devices which do provide some VPD(s). : \ > > SATA doesn't have VPDs. > I know now, that's why I think it's "alright" for libata to do so. I just don't know whether the kernel would tamper with them for devices like USB or real SCSI drives. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom> I don't know whether the USB bridging or the way hdparm does TRIM
Tom> matters, but it seems that some devices can't really handle limit
Tom> like 0x3fffc0 blocks.
The 0x3fffc0 limit is for SATA devices connected through Linux' libata
SCSI ATA translation. This number corresponds to the biggest range we
can discard while maintaining a 1:1 mapping between the SCSI command and
the ATA ditto.
If you connect the SATA drive to a USB/UAS bridge you are entirely at
the bridge's whim when it comes to translation of WRITE SAME(10/16) w/
UNMAP or the UNMAP command to DSM TRIM. libata is not involved and
neither is the 0x3fffc0 limit (although chances are that the drive has
the same limit internally). Even when using hdparm the bridge device
still needs to correctly translate the ATA PASSTHROUGH commands.
Tom> So I still think that the kernel should allow users to configure
Tom> limits that can make libata send reliable ATA TRIM commands.
You haven't given us any reason to. We are not aware of any ATA drives
that put constraints on the range block count.
Again, if you want to help please focus on your bridge devices and what,
if anything, can be done to uniquely identify them and override any
incorrect values they might report to the SCSI stack. Up to and
including us disabling discard entirely on these devices if their
translation isn't verifiable and bullet proof.
On 23 June 2015 at 23:36, Martin K. Petersen <martin.petersen@oracle.com> wrote: > > You haven't given us any reason to. We are not aware of any ATA drives > that put constraints on the range block count. > What I have been doing is trying to show you example of those constraints. When I talked about the block count limit of the SanDisk Extreme USB, I am not asking you to fix anything for the drive I HAVE (because I can only use hdparm to TRIM it anyway), but to show you that some devices MIGHT have such limit. I just can't be 100% sure of it, even though from what I see in the different results on payload size and range sector count, I don't think the bridge is intervening at all (but just the limit of the SATA controller behind). But for my Intel 530 SSD, the granularity problem is obvious and solid enough because it's connected directly with SATA. Currently what the kernel does is assume all devices support 1 sector granularity. And the problem doesn't even lie on the current "hardcoded" granularity in libata, because blkdev_issue_discard() only does "mod check" against the granularity on the maximum sector counts, so even if it's allowed to be configured, it wouldn't really help. And I have yet to see anything which actually does "mod check" against ANY granularity on the sector count per range. This is what the example in my last mail was about. And the only info I have ever saw about TRIM block limit of a SATA drive is in `hdparm -I`. For my Intel SSD, which actually has a lower limit of 8 sectors, it shows: Data Set Management TRIM supported (limit 1 block) while for the SanDisk Extreme USB, which actually has a lower limit of 256 sectors, it shows: Data Set Management TRIM supported (limit 8 blocks) Inspite of the accuracy, I don't see the kernel actually read this limit. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom> Currently what the kernel does is assume all devices support 1
Tom> sector granularity.
The ATA Command Set does not allow for any other granularity than 1
sector.
Bigger granularity reporting is supported in SCSI SBC to allow for
thinly provisioned disk arrays with a bigger allocation unit. It is
common for disk arrays to provision LUN space in units of 1MB or more.
Tom> For my Intel SSD, which actually has a lower limit of 8 sectors, it
Tom> shows: Data Set Management TRIM supported (limit 1 block) while for
Tom> the SanDisk Extreme USB, which actually has a lower limit of 256
Tom> sectors, it shows: Data Set Management TRIM supported (limit 8
Tom> blocks)
It don't know what these "lower limits" you are talking about are.
What hdparm tells you is that DSM TRIM on the Intel drive supports 1
block (512 bytes) of range payload data. Whereas the SanDisk drive
supports a full 4KB page of range payload data (8 * 512 bytes).
We did try to enable payloads bigger than 512 bytes back in the day but
most drives that reported supporting the bigger payload actually didn't
and all hell broke loose. So we reverted to a single sector payload for
libata.
I still have the payload patch in my SSD test branch and regularly test
with it. But there are still drives that fail in mysterious ways with it
in place and so far I haven't felt compelled to maintain yet another
blacklist.
On 24 June 2015 at 01:03, Martin K. Petersen <martin.petersen@oracle.com> wrote: > > It don't know what these "lower limits" you are talking about are. > [tom@localhost ~]$ sudo shred -v -n 1 /dev/sda3 [tom@localhost ~]$ sudo blkdiscard -l 512 /dev/sda3 [tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head 0000000 58c7 f3b0 0f97 2a9a 2da8 7c44 11f5 7852 0000010 4f1d d19c 682b a3d8 6be9 61df 4eb3 a4ee 0000020 f28b f2f6 ef36 c244 f1b0 eadd 5ad0 c5ec 0000030 54c4 abb4 f9c3 7ca0 d807 63ad 81cd 45e6 0000040 2b30 709c 550a 5e52 6928 4468 78c9 f671 0000050 eef1 dfa7 2a6d 3150 64f8 f9a2 240d bbf4 0000060 2cff 02e0 c893 16a7 67c2 8cd9 af76 71b6 0000070 4740 83cc 5b15 3db0 64b2 1a6d 3f40 5a98 0000080 09d0 035c b5ae 2cd2 1850 ea70 7a09 0491 0000090 6b27 9d81 9d74 a087 afd9 e75a 85ea adf6 [tom@localhost ~]$ sudo blkdiscard -l 3584 /dev/sda3 [tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head 0000000 58c7 f3b0 0f97 2a9a 2da8 7c44 11f5 7852 0000010 4f1d d19c 682b a3d8 6be9 61df 4eb3 a4ee 0000020 f28b f2f6 ef36 c244 f1b0 eadd 5ad0 c5ec 0000030 54c4 abb4 f9c3 7ca0 d807 63ad 81cd 45e6 0000040 2b30 709c 550a 5e52 6928 4468 78c9 f671 0000050 eef1 dfa7 2a6d 3150 64f8 f9a2 240d bbf4 0000060 2cff 02e0 c893 16a7 67c2 8cd9 af76 71b6 0000070 4740 83cc 5b15 3db0 64b2 1a6d 3f40 5a98 0000080 09d0 035c b5ae 2cd2 1850 ea70 7a09 0491 0000090 6b27 9d81 9d74 a087 afd9 e75a 85ea adf6 [tom@localhost ~]$ sudo blkdiscard -l 4096 /dev/sda3 [tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 | head 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * 0001000 So when libata issue ATA commands with ranges of 65535 sectors, only 65535-(65535%8) = 65528 sectors are discarded, yet it wouldn't know about that so the next range will start from LBA 65536. I can workaround this by specifying --step in blkdiscard, but I think the kernel should have a param configurable for general. > > What hdparm tells you is that DSM TRIM on the Intel drive supports 1 > block (512 bytes) of range payload data. Whereas the SanDisk drive > supports a full 4KB page of range payload data (8 * 512 bytes). > Oh so they are not reporting garbage. That's good to know. I guess that's why hdparm actually splits the range list I provided by 512-range per ATA command (which works) for the SanDisk drive. > > We did try to enable payloads bigger than 512 bytes back in the day but > most drives that reported supporting the bigger payload actually didn't > and all hell broke loose. So we reverted to a single sector payload for > libata. > > I still have the payload patch in my SSD test branch and regularly test > with it. But there are still drives that fail in mysterious ways with it > in place and so far I haven't felt compelled to maintain yet another > blacklist. > For now I have no problem with the single sector payload thing. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
Tom> So when libata issue ATA commands with ranges of 65535 sectors,
Tom> only 65535-(65535%8) = 65528 sectors are discarded,
That's unfortunate but TRIM is advisory so the drive is free to ignore
all or parts of the request.
What happens if you discard sectors 0-6 and then sector 7?
Tom> I can workaround this by specifying --step in blkdiscard, but I
Tom> think the kernel should have a param configurable for general.
This is on the Intel 530? What does the drive report in
/sys/block/sdN/queue/discard_zeroes_data?
First of all let me add another "statistic" about the issue: [tom@localhost ~]$ sudo shred -n 1 /dev/sda3 [tom@localhost ~]$ sudo blkdiscard /dev/sda3 [tom@localhost ~]$ sudo hexdump /dev/sda3 | wc -l 310635 [tom@localhost ~]$ sudo hexdump /dev/sda3 | pcregrep -M '0000 0000 0000 0000 0000 0000 0000 0000\n\*' | wc -l 2410 total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632 total ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376 average untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8) Also, FWIW: [tom@localhost ~]$ sudo shred -n 1 /dev/sda3 [tom@localhost ~]$ let step=(65535-65535%8)*512 [tom@localhost ~]$ echo $step 33550336 [tom@localhost ~]$ sudo blkdiscard -p "$step" /dev/sda3 [tom@localhost ~]$ sudo hexdump /dev/sda3 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * ac0000000 On 24 June 2015 at 02:26, Martin K. Petersen <martin.petersen@oracle.com> wrote: > > What happens if you discard sectors 0-6 and then sector 7? > [tom@localhost ~]$ sudo shred -n 1 /dev/sda3 [tom@localhost ~]$ sudo blkdiscard -l 3584 /dev/sda3 [tom@localhost ~]$ sudo blkdiscard -o 3584 -l 512 /dev/sda3 [tom@localhost ~]$ sudo hexdump -n 4096 /dev/sda3 0000000 f06d 8365 5e1b 616c 7362 4d61 2182 02fb ... 0000ff0 54ef 9579 51bc 9042 115a 375e c28f 4dcc 0001000 > > This is on the Intel 530? What does the drive report in > /sys/block/sdN/queue/discard_zeroes_data? > Yes. It reports 0. In `hdparm -I`: * Data Set Management TRIM supported (limit 1 block) * Deterministic read data after TRIM You may also want to check out and compare the two attached test case files.
diff --git a/drivers/scsi/sd.c~ b/drivers/scsi/sd.c index a661d33..a9cdfb6 100644 --- a/drivers/scsi/sd.c~ +++ b/drivers/scsi/sd.c @@ -99,6 +99,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC); #endif static void sd_config_discard(struct scsi_disk *, unsigned int); +static void sd_config_discard_novpd(struct scsi_disk *); static void sd_config_write_same(struct scsi_disk *); static int sd_revalidate_disk(struct gendisk *); static void sd_unlock_native_capacity(struct gendisk *disk); @@ -700,6 +701,22 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); } +static void sd_config_discard_novpd(struct scsi_disk *sdkp) +{ + struct request_queue *q = sdkp->disk->queue; + unsigned int logical_block_size = sdkp->device->sector_size; + unsigned int max_blocks = (u32)SD_MAX_WS16_BLOCKS; + + q->limits.discard_zeroes_data = 0; + q->limits.discard_alignment = 0; + q->limits.discard_granularity = sdkp->physical_block_size; + + sdkp->provisioning_mode = SD_LBP_WS16; + + q->limits.max_discard_sectors = max_blocks * (logical_block_size >> 9); + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); +} + /** * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device * @sdp: scsi device to operate one @@ -2776,6 +2793,9 @@ static int sd_revalidate_disk(struct gendisk *disk) sd_read_block_limits(sdkp); sd_read_block_characteristics(sdkp); } + else { + sd_config_discard_novpd(sdkp); + } sd_read_write_protect_flag(sdkp, buffer); sd_read_cache_type(sdkp, buffer);