Message ID | 1523524915-25170-1-git-send-email-jinpu.wangl@profitbricks.com (mailing list archive) |
---|---|
State | New, archived |
Jack,
> + pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
I'm a bit concerned about dropping records of potential data loss.
Also, what are you doing that compels all these to be logged? This
should be a very rare occurrence.
On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> Jack,
>
>> + pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
>
> I'm a bit concerned about dropping records of potential data loss.
>
> Also, what are you doing that compels all these to be logged? This
> should be a very rare occurrence.
>
> --
> Martin K. Petersen Oracle Linux Engineering

Hi Martin,

Thanks for asking. We updated the mpt3sas driver, which enables DIX support
(prot_mask=0x7f). All disks are SATA SSDs with no DIF support.

After reboot, the kernel reports I/O errors from all the drives behind the
HBA, seemingly for almost every read I/O, which makes the system unusable:

[ 13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[ 13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[ 13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[ 13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[ 13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[ 13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[ 13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[ 13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[ 13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[ 13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[ 13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[ 13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[ 13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[ 13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[ 13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)

This happens on kernel versions 4.15 and 4.14.28. I scanned the upstream
commits and have not found anything relevant. On 4.4.112 there are no such
errors.

Disabling DIX support (prot_mask=0x7) in mpt3sas fixes the problem.

Regards,
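For readers comparing the two prot_mask values in the message above, here is a minimal userspace sketch that decodes them. It assumes the mpt3sas prot_mask module parameter is handed to the SCSI host protection mask unchanged; the bit values mirror the kernel's enum scsi_host_prot_capabilities, while dif_bits(), dix_bits() and describe_prot_mask() are hypothetical helpers written for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stdio.h>

/* Bit values mirroring enum scsi_host_prot_capabilities from the kernel's
 * include/scsi/scsi_host.h, copied here so the sketch builds in userspace. */
enum {
	SHOST_DIF_TYPE1_PROTECTION = 1 << 0,
	SHOST_DIF_TYPE2_PROTECTION = 1 << 1,
	SHOST_DIF_TYPE3_PROTECTION = 1 << 2,
	SHOST_DIX_TYPE0_PROTECTION = 1 << 3,
	SHOST_DIX_TYPE1_PROTECTION = 1 << 4,
	SHOST_DIX_TYPE2_PROTECTION = 1 << 5,
	SHOST_DIX_TYPE3_PROTECTION = 1 << 6,
};

/* Hypothetical helper: the DIF (HBA-to-disk) bits of a mask. */
unsigned int dif_bits(unsigned int prot_mask)
{
	return prot_mask & (SHOST_DIF_TYPE1_PROTECTION |
			    SHOST_DIF_TYPE2_PROTECTION |
			    SHOST_DIF_TYPE3_PROTECTION);
}

/* Hypothetical helper: the DIX (host-memory-to-HBA) bits of a mask. */
unsigned int dix_bits(unsigned int prot_mask)
{
	return prot_mask & (SHOST_DIX_TYPE0_PROTECTION |
			    SHOST_DIX_TYPE1_PROTECTION |
			    SHOST_DIX_TYPE2_PROTECTION |
			    SHOST_DIX_TYPE3_PROTECTION);
}

/* Hypothetical helper: print the name of every capability bit that is set. */
void describe_prot_mask(unsigned int prot_mask)
{
	static const char *const names[] = {
		"DIF1", "DIF2", "DIF3", "DIX0", "DIX1", "DIX2", "DIX3",
	};
	const unsigned int count = sizeof(names) / sizeof(names[0]);

	assert(count == 7); /* one name per capability bit above */
	printf("prot_mask=0x%x:", prot_mask);
	for (unsigned int bit = 0; bit < count; bit++)
		if (prot_mask & (1u << bit))
			printf(" %s", names[bit]);
	printf("\n");
}
```

Under that assumption, 0x7f sets all seven bits (DIF Types 1 to 3 plus DIX Types 0 to 3), while 0x7 leaves only the three DIF bits, which matches the report that switching to 0x7 disables DIX.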
Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking, we updated mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, kernel reports the IO errors from all the drives behind
> HBA, seems for almost every read IO, which turns the system unusable:
> [ 13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [ 13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [ 13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [ 13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [ 13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

sg_readcap -l /dev/sda
sg_inq -x /dev/sda
sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?
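To make the "ref tag error" lines concrete, the following is a simplified userspace sketch of a Type 1 style reference tag check: each 8-byte protection tuple carries a big-endian 32-bit ref tag that is expected to equal the low 32 bits of the sector the block was read from. It is modelled on the comparison visible in the patch's t10_pi_verify() hunk; struct pi_tuple, pi_ref_tag() and check_ref_tags() are names invented for this sketch, and the guard tag check is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-byte T10 protection information tuple; all fields are big-endian.
 * Byte arrays are used so the sketch does not depend on host endianness. */
struct pi_tuple {
	uint8_t guard_tag[2];
	uint8_t app_tag[2];
	uint8_t ref_tag[4];
};

/* Decode the big-endian reference tag into a host-order integer. */
uint32_t pi_ref_tag(const struct pi_tuple *pi)
{
	return ((uint32_t)pi->ref_tag[0] << 24) |
	       ((uint32_t)pi->ref_tag[1] << 16) |
	       ((uint32_t)pi->ref_tag[2] << 8) |
	       (uint32_t)pi->ref_tag[3];
}

/* Check `count` consecutive tuples against the sector number `seed` of the
 * first block. Returns the index of the first mismatch, or -1 if all match.
 * Tuples whose app tag is 0xffff (the escape value) are skipped. */
long check_ref_tags(const struct pi_tuple *pi, size_t count, uint64_t seed)
{
	assert(pi != NULL);
	for (size_t i = 0; i < count; i++) {
		uint32_t want = (uint32_t)(seed + i); /* low 32 bits only */

		if (pi[i].app_tag[0] == 0xff && pi[i].app_tag[1] == 0xff)
			continue;
		if (pi_ref_tag(&pi[i]) != want)
			return (long)i;
	}
	return -1;
}
```

With this sketch, a tuple decoding to 143196159 checked against sector 0 fails on the first interval, which is the shape of the "location 0 (rcvd 143196159)" lines quoted above. Why the HBA returns that particular value is not something the sketch explains.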
On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> Jinpu,
>
> [CC:ed the mpt3sas maintainers]
>
> The ratelimit patch is just an attempt to treat the symptom, not the
> cause.

Agree. If we can fix the root cause, it will be great.

>
>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>> After reboot, kernel reports the IO errors from all the drives behind
>> HBA, seems for almost every read IO, which turns the system unusable:
>> [ 13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>> [ 13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>> [ 13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>> [ 13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>> [ 13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>
> That sounds like a bug in the mpt3sas driver or firmware. I guess the
> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
> the PI on the drive side. But that doesn't seem to be a particularly
> useful mode of operation.
>
> Jinpu: Which firmware are you running? Also, please send us the output
> of:
>
> sg_readcap -l /dev/sda
> sg_inq -x /dev/sda
> sg_vpd /dev/sda
>

Disks are INTEL SSDSC2BX48, directly attached to HBA.

LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
mpt3sas_cm2: Protocol=(Initiator,Target),
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)

jwang@x:~$ sudo sg_vpd /dev/sdz
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Unit serial number [sn]
  Device identification [di]
  Mode page policy [mpp]
  ATA information (SAT) [ai]
  Block limits (SBC) [bl]
  Block device characteristics (SBC) [bdc]
  Logical block provisioning (SBC) [lbpv]
jwang@x:~$ sudo sg_inq -x /dev/sdz
VPD INQUIRY: extended INQUIRY data page
inquiry: field in cdb illegal (page not supported)
jwang@x:~$ sudo sg_readcap -l /dev/sdz
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last logical block address=937703087 (0x37e436af), Number of logical blocks=937703088
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned logical block address=0
Hence:
   Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB

> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
> controller?
>
> --
> Martin K. Petersen Oracle Linux Engineering

Thanks!
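The derived figures in the sg_readcap output above follow from the raw fields by simple arithmetic. A small sketch, with device_bytes() and physical_block_len() as hypothetical helper names chosen for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Device size in bytes: the number of logical blocks is the last logical
 * block address plus one, multiplied by the logical block length. */
uint64_t device_bytes(uint64_t last_lba, uint32_t block_len)
{
	return (last_lba + 1) * (uint64_t)block_len;
}

/* Physical block length: logical block length times 2^exponent, where the
 * exponent is "logical blocks per physical block exponent". */
uint32_t physical_block_len(uint32_t block_len, unsigned int exponent)
{
	assert(exponent < 16); /* keep the shift well inside 32 bits */
	return block_len << exponent;
}
```

For the values reported here, device_bytes(937703087, 512) gives 480103981056 bytes and physical_block_len(512, 3) gives 4096 bytes, in agreement with the tool's own summary. Note also that prot_en=0 and p_type=0 mean the drive itself is not formatted with protection information.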
On Mon, Apr 16, 2018 at 1:46 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
>>
>> Jinpu,
>>
>> [CC:ed the mpt3sas maintainers]
>>
>> The ratelimit patch is just an attempt to treat the symptom, not the
>> cause.
> Agree. If we can fix the root cause, it will be great.
>>
>>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>>> After reboot, kernel reports the IO errors from all the drives behind
>>> HBA, seems for almost every read IO, which turns the system unusable:
>>> [ 13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>>> [ 13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>>> [ 13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>>> [ 13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>>> [ 13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>>
>> That sounds like a bug in the mpt3sas driver or firmware. I guess the
>> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
>> the PI on the drive side. But that doesn't seem to be a particularly
>> useful mode of operation.
>>
>> Jinpu: Which firmware are you running? Also, please send us the output
>> of:
>>
>> sg_readcap -l /dev/sda
>> sg_inq -x /dev/sda
>> sg_vpd /dev/sda
>>
> Disks are INTEL SSDSC2BX48, directly attached to HBA.
> LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
> mpt3sas_cm2: Protocol=(Initiator,Target),
> Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
> Full,NCQ)
>
> jwang@x:~$ sudo sg_vpd /dev/sdz
> Supported VPD pages VPD page:
>   Supported VPD pages [sv]
>   Unit serial number [sn]
>   Device identification [di]
>   Mode page policy [mpp]
>   ATA information (SAT) [ai]
>   Block limits (SBC) [bl]
>   Block device characteristics (SBC) [bdc]
>   Logical block provisioning (SBC) [lbpv]
> jwang@x:~$ sudo sg_inq -x /dev/sdz
> VPD INQUIRY: extended INQUIRY data page
> inquiry: field in cdb illegal (page not supported)
> jwang@x:~$ sudo sg_readcap -l /dev/sdz
> Read Capacity results:
>    Protection: prot_en=0, p_type=0, p_i_exponent=0
>    Logical block provisioning: lbpme=1, lbprz=1
>    Last logical block address=937703087 (0x37e436af), Number of
> logical blocks=937703088
>    Logical block length=512 bytes
>    Logical blocks per physical block exponent=3 [so physical block
> length=4096 bytes]
>    Lowest aligned logical block address=0
> Hence:
>    Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB
>
>
>> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
>> controller?

[Sreekanth] Current Upstream mpt3sas driver doesn't have DIX support
capabilities, it supports only DIF feature.

Thanks,
Sreekanth

>>
>> --
>> Martin K. Petersen Oracle Linux Engineering
>
>
> Thanks!
>
> --
> Jack Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin
diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..6faf8c1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -84,10 +84,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 
 			if (be32_to_cpu(pi->ref_tag) !=
 			    lower_32_bits(iter->seed)) {
-				pr_err("%s: ref tag error at location %llu " \
-				       "(rcvd %u)\n", iter->disk_name,
-				       (unsigned long long)
-				       iter->seed, be32_to_cpu(pi->ref_tag));
+				pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
+						   iter->disk_name,
+						   (unsigned long long)
+						   iter->seed,
+						   be32_to_cpu(pi->ref_tag));
 				return BLK_STS_PROTECTION;
 			}
 			break;
@@ -101,10 +102,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		csum = fn(iter->data_buf, iter->interval);
 
 		if (pi->guard_tag != csum) {
-			pr_err("%s: guard tag error at sector %llu " \
-			       "(rcvd %04x, want %04x)\n", iter->disk_name,
-			       (unsigned long long)iter->seed,
-			       be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
+			pr_err_ratelimited("%s: guard tag error at sector %llu (rcvd %04x, want %04x)\n",
+					   iter->disk_name,
+					   (unsigned long long)iter->seed,
+					   be16_to_cpu(pi->guard_tag),
+					   be16_to_cpu(csum));
 			return BLK_STS_PROTECTION;
 		}
 
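To illustrate what the switch to pr_err_ratelimited() trades away, here is a minimal userspace sketch of interval/burst rate limiting. It is not the kernel's implementation: struct ratelimit, ratelimit_init() and ratelimit_allow() are invented for this example, and time is an abstract counter supplied by the caller. As far as I know, the kernel's defaults for the *_ratelimited printk helpers are a burst of 10 messages per 5-second interval, which the usage note below borrows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative rate-limit state; not the kernel's struct ratelimit_state. */
struct ratelimit {
	uint64_t interval;    /* window length, in caller-chosen time units */
	unsigned int burst;   /* messages allowed per window */
	uint64_t begin;       /* start time of the current window */
	unsigned int printed; /* messages emitted in the current window */
	unsigned int missed;  /* total messages suppressed so far */
	bool started;         /* false until the first call opens a window */
};

void ratelimit_init(struct ratelimit *rs, uint64_t interval,
		    unsigned int burst)
{
	assert(rs != 0 && burst > 0);
	rs->interval = interval;
	rs->burst = burst;
	rs->begin = 0;
	rs->printed = 0;
	rs->missed = 0;
	rs->started = false;
}

/* Returns true if a message may be emitted at time `now`; otherwise the
 * message is counted as missed and the caller should drop it. */
bool ratelimit_allow(struct ratelimit *rs, uint64_t now)
{
	/* Open a fresh window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->started = true;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}
```

Feeding the 21 error lines from the report (all within about 20 ms) through a limiter initialised with an interval of 5000 ms and a burst of 10 would let 10 through and drop 11. That loss of individual records is the concern raised at the top of the thread.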