Message ID: 20220210114218.632725-1-damien.lemoal@opensource.wdc.com
Series: libsas and pm8001 fixes
On 10/02/2022 11:41, Damien Le Moal wrote:
> The first 3 patches fix a problem with libsas handling of NCQ NON DATA
> commands and simplify libsas code in a couple of places.
> The remaining patches are fixes for the pm8001 driver:
> * All sparse warnings are addressed, fixing along the way many le32
>   handling bugs for big-endian architectures
> * Fix handling of NCQ NON DATA commands
> * Fix NCQ error recovery (abort all task execution) that was causing a
>   crash
> * Simplify the code in many places
>
> With these fixes, the libzbc test suite passes all test cases. This test
> suite was used with an SMR drive for testing because it generates many
> NCQ NON DATA commands (for zone management commands) and also generates
> many NCQ command errors to check the ASC/ASCQ returned by the device. With
> the test suite, the error recovery path was extensively exercised.
>
> Note that without these patches, the libzbc test suite results in the
> controller hanging, or in kernel crashes.

Unfortunately I still see the hang on my arm64 system with this series :(

Thanks,
John
On 2/11/22 00:35, John Garry wrote:
> On 10/02/2022 11:41, Damien Le Moal wrote:
>> The first 3 patches fix a problem with libsas handling of NCQ NON DATA
>> commands and simplify libsas code in a couple of places.
>> The remaining patches are fixes for the pm8001 driver:
>> * All sparse warnings are addressed, fixing along the way many le32
>>   handling bugs for big-endian architectures
>> * Fix handling of NCQ NON DATA commands
>> * Fix NCQ error recovery (abort all task execution) that was causing a
>>   crash
>> * Simplify the code in many places
>>
>> With these fixes, the libzbc test suite passes all test cases. This test
>> suite was used with an SMR drive for testing because it generates many
>> NCQ NON DATA commands (for zone management commands) and also generates
>> many NCQ command errors to check the ASC/ASCQ returned by the device. With
>> the test suite, the error recovery path was extensively exercised.
>>
>> Note that without these patches, the libzbc test suite results in the
>> controller hanging, or in kernel crashes.
>
> Unfortunately I still see the hang on my arm64 system with this series :(

That is unfortunate. Any particular command sequence triggering the hang?
Or is it random? What workload are you running?

> Thanks,
> John
On 10/02/2022 22:44, Damien Le Moal wrote:

Hi Damien,

>>> Note that without these patches, the libzbc test suite results in the
>>> controller hanging, or in kernel crashes.
>> Unfortunately I still see the hang on my arm64 system with this series :(
> That is unfortunate. Any particular command sequence triggering the hang?
> Or is it random? What workload are you running?

mount/unmount mostly fails, even after as few as one attempt, but then
even fdisk -l fails sometimes:

root@(none)$ fdisk -l
[ 97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[ 97.937149] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 97.943232] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 97.952020] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 97.958099] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 97.966881] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 97.972961] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 97.981737] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 97.991241] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.000752] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.010223] pm80xx0:: pm8001_abort_task 1345:rc= -5
[ 98.015180] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[ 98.021645] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.027725] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.036495] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.042574] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.051344] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.057423] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.066154] pm80xx: rc= -5
[ 98.068854] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[ 98.075331] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.081411] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.090192] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.096271] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.105053] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.111132] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.119909] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.129414] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.138919] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.148388] pm80xx0:: pm8001_abort_task 1345:rc= -5
[ 98.153345] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[ 98.159812] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.165892] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.174661] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.180741] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.189511] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.195590] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.204320] pm80xx: rc= -5
[ 98.207019] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[ 98.213497] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.219577] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.228359] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.234438] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.243220] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.249299] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.258075] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.267580] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.277085] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.286554] pm80xx0:: pm8001_abort_task 1345:rc= -5
[ 98.291510] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[ 98.297978] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.304059] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.312827] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.318906] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.327677] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.333756] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.342487] pm80xx: rc= -5
[ 98.345186] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[ 98.351664] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.357743] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.366525] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.372604] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.381386] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.387465] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.396230] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.405734] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.415239] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.424709] pm80xx0:: pm8001_abort_task 1345:rc= -5
[ 98.429665] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[ 98.436133] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.442213] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.450982] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.457061] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.465825] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.471904] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.480629] pm80xx: rc= -5
[ 98.483327] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[ 98.489800] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.495880] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.504661] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.510740] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.519523] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.525602] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.534372] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.543877] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.553382] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0
[ 98.562851] pm80xx0:: pm8001_abort_task 1345:rc= -5
[ 98.567807] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[ 98.574275] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.580355] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.589124] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.595203] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.603968] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
[ 98.610047] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61
[ 98.618778] pm80xx: rc= -5
[ 98.621505] sas: --- Exit sas_scsi_recover_host: busy: 1 failed: 1 tries: 1

Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
subtle driver bug, but where ....?

BTW, the following log message needs removal/fixing at some stage by someone:

[ 98.480629] pm80xx: rc= -5

It's from pm8001_query_task().

Thanks,
John
On 2/11/22 18:24, John Garry wrote:
> On 10/02/2022 22:44, Damien Le Moal wrote:
>
> Hi Damien,
>
>>>> Note that without these patches, the libzbc test suite results in the
>>>> controller hanging, or in kernel crashes.
>>> Unfortunately I still see the hang on my arm64 system with this series :(
>> That is unfortunate. Any particular command sequence triggering the hang?
>> Or is it random? What workload are you running?
>
> mount/unmount mostly fails, even after as few as one attempt, but then
> even fdisk -l fails sometimes:

Try with patch 21 of my v2. It does fix a bug in the scsi/sas case. That
problem would likely lead to a crash though, but you never know...

> root@(none)$ fdisk -l
> [ 97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> [ 97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
> [ 97.937149] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b
> [ 97.943232] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO
> Failure Drive:5000c500a7babc61

[...]

> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
> subtle driver bug, but where ....?

What is the failing command? Always the same? Can you try adding scsi
trace to see the commands?

If you are "lucky", it is always the same type of command, like the
NCQ NON DATA commands in my case. Though on mount, I would only expect a
lot of read commands and not much else. There may be some writes and a
flush too, so there will be "data" commands and "non data" commands. It
may be an issue with non-data commands too?

> BTW, the following log message needs removal/fixing at some stage by someone:
>
> [ 98.480629] pm80xx: rc= -5
>
> It's from pm8001_query_task().
>
> Thanks,
> John
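The scsi trace suggested here can be enabled through the kernel's tracefs; a minimal sketch, assuming tracefs is mounted at the usual location and the scsi trace events are compiled into the kernel:

```shell
# Enable per-command SCSI tracing (paths assume tracefs mounted at
# /sys/kernel/tracing; older kernels use /sys/kernel/debug/tracing).
cd /sys/kernel/tracing
echo 1 > events/scsi/scsi_dispatch_cmd_start/enable
echo 1 > events/scsi/scsi_dispatch_cmd_done/enable
echo 1 > events/scsi/scsi_dispatch_cmd_error/enable
echo 1 > events/scsi/scsi_dispatch_cmd_timeout/enable

# Reproduce the hang (e.g. mount or fdisk -l), then read the trace to
# see which CDB was dispatched last without a matching completion.
cat trace
```

Comparing scsi_dispatch_cmd_start entries against their done/timeout counterparts shows exactly which command type never completes.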
On 11/02/2022 12:37, Damien Le Moal wrote:

Hi Damien,

>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>> subtle driver bug, but where ....?
> What is the failing command? Always the same? Can you try adding scsi
> trace to see the commands?

This is the same issue I have had since day #1. Generally mount/unmount
or even fdisk -l fails after booting into miniramfs. I wouldn't ever try
to boot a distro.

> If you are "lucky", it is always the same type of command, like the
> NCQ NON DATA commands in my case.

I'm just trying SAS disks to start with - so it's a SCSI READ command.
SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
bugs are common - as this series is evidence) so I start on something
more basic - however SATA/STP also has this issue.

The command is sent successfully but just never completes. Then
sometimes the TMFs for error handling time out and sometimes they
succeed. I don't have much to go on....

> Though on mount, I would only expect a lot of
> read commands and not much else.

Yes, and it is commonly the first SCSI read command which times out. It
reliably breaks quite early, so I think we can rule out issues like
memory barriers/timing.

> There may be some writes and a flush
> too, so there will be "data" commands and "non data" commands. It may be
> an issue with non-data commands too?

Not sure on that. I guess it isn't.

Thanks,
John
On 2/11/22 22:08, John Garry wrote:
> On 11/02/2022 12:37, Damien Le Moal wrote:
>
> Hi Damien,
>
>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>> subtle driver bug, but where ....?
>> What is the failing command? Always the same? Can you try adding scsi
>> trace to see the commands?
>
> This is the same issue I have had since day #1.
>
> Generally mount/unmount or even fdisk -l fails after booting into
> miniramfs. I wouldn't ever try to boot a distro.

busybox?

>> If you are "lucky", it is always the same type of command, like the
>> NCQ NON DATA commands in my case.
>
> I'm just trying SAS disks to start with - so it's a SCSI READ command.
> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
> bugs are common - as this series is evidence) so I start on something
> more basic - however SATA/STP also has this issue.
>
> The command is sent successfully but just never completes. Then
> sometimes the TMFs for error handling time out and sometimes they
> succeed. I don't have much to go on....

No SAS bus analyzer lying in a corner of the office? :)
That could help...

I will go to the office Monday, so I will get a chance to add SAS drives
to my setup to see what I get. I have only tested with SATA until now.
My controller is not the same chip as yours though.

>> Though on mount, I would only expect a lot of
>> read commands and not much else.
>
> Yes, and it is commonly the first SCSI read command which times out. It
> reliably breaks quite early, so I think we can rule out issues like
> memory barriers/timing.
>
>> There may be some writes and a flush
>> too, so there will be "data" commands and "non data" commands. It may be
>> an issue with non-data commands too?
>
> Not sure on that. I guess it isn't.

Anything special about the drives you are using? Have you tried other
drives to see if you get lucky?

> Thanks,
> John
Hi Damien,

>>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>>> subtle driver bug, but where ....?
>>> What is the failing command? Always the same? Can you try adding scsi
>>> trace to see the commands?
>>
>> This is the same issue I have had since day #1.
>>
>> Generally mount/unmount or even fdisk -l fails after booting into
>> miniramfs. I wouldn't ever try to boot a distro.
>
> busybox?

Yes

>>> If you are "lucky", it is always the same type of command, like the
>>> NCQ NON DATA commands in my case.
>>
>> I'm just trying SAS disks to start with - so it's a SCSI READ command.
>> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
>> bugs are common - as this series is evidence) so I start on something
>> more basic - however SATA/STP also has this issue.
>>
>> The command is sent successfully but just never completes. Then
>> sometimes the TMFs for error handling time out and sometimes they
>> succeed. I don't have much to go on....
>
> No SAS bus analyzer lying in a corner of the office? :)
> That could help...

None unfortunately

> I will go to the office Monday, so I will get a chance to add SAS drives
> to my setup to see what I get. I have only tested with SATA until now.
> My controller is not the same chip as yours though.

FYI, Ajish, cc'ed, from Microchip says that they see the same issue on
their arm64 system. Unfortunately fixing it is not a priority for them.
So it is something which arm64 is exposing. And I tried an old kernel -
like 4.10 - on the same board and the pm8001 driver was working somewhat
reliably (no hangs). It just looks like a driver issue.

I'll have a look at the driver code again if I get a chance. It might be
a DMA issue.

>> Though on mount, I would only expect a lot of
>> read commands and not much else.
>
> Yes, and it is commonly the first SCSI read command which times out. It
> reliably breaks quite early, so I think we can rule out issues like
> memory barriers/timing.
>
>> There may be some writes and a flush
>> too, so there will be "data" commands and "non data" commands. It may be
>> an issue with non-data commands too?
>
> Not sure on that. I guess it isn't.

> Anything special about the drives you are using? Have you tried other
> drives to see if you get lucky?

I think that the drives are ok. On the same board I originally had them
plugged into the hisi_sas HBA with no issues.

Thanks,
John
On 2/11/22 22:54, John Garry wrote:
> Hi Damien,
>
>>>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>>>> subtle driver bug, but where ....?
>>>> What is the failing command? Always the same? Can you try adding scsi
>>>> trace to see the commands?
>>>
>>> This is the same issue I have had since day #1.
>>>
>>> Generally mount/unmount or even fdisk -l fails after booting into
>>> miniramfs. I wouldn't ever try to boot a distro.
>>
>> busybox?
>
> Yes
>
>>>> If you are "lucky", it is always the same type of command, like the
>>>> NCQ NON DATA commands in my case.
>>>
>>> I'm just trying SAS disks to start with - so it's a SCSI READ command.
>>> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
>>> bugs are common - as this series is evidence) so I start on something
>>> more basic - however SATA/STP also has this issue.
>>>
>>> The command is sent successfully but just never completes. Then
>>> sometimes the TMFs for error handling time out and sometimes they
>>> succeed. I don't have much to go on....
>>
>> No SAS bus analyzer lying in a corner of the office? :)
>> That could help...
>
> None unfortunately
>
>> I will go to the office Monday, so I will get a chance to add SAS drives
>> to my setup to see what I get. I have only tested with SATA until now.
>> My controller is not the same chip as yours though.
>
> FYI, Ajish, cc'ed, from Microchip says that they see the same issue on
> their arm64 system. Unfortunately fixing it is not a priority for them.
> So it is something which arm64 is exposing.
>
> And I tried an old kernel - like 4.10 - on the same board and the pm8001
> driver was working somewhat reliably (no hangs). It just looks like a
> driver issue.
>
> I'll have a look at the driver code again if I get a chance. It might be
> a DMA issue.
There is one more thing that I find strange in the driver and that may
cause problems: tag 0 is a perfectly valid tag value that can be
returned by pm8001_tag_alloc(), since find_first_zero_bit() will return
0 if the first bit is 0. And yet, there are many places in the driver
that treat !tag as an error. Extremely weird, if not outright broken...
I patched that and tested, and everything seems OK... Could it be that
you are not seeing some completions because of that?

I added the patch to my v3. Will send Monday.