Message ID | 20210120184548.20219-1-mwilck@suse.com (mailing list archive)
---|---
State | New, archived
Series | scsi: scsi_host_queue_ready: increase busy count early
On 20/01/2021 18:45, mwilck@suse.com wrote: > From: Martin Wilck <mwilck@suse.com> > > Donald: please give this patch a try. > > Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") > contained this hunk: > > - busy = atomic_inc_return(&shost->host_busy) - 1; > if (atomic_read(&shost->host_blocked) > 0) { > - if (busy) > + if (scsi_host_busy(shost) > 0) > goto starved; > > The previous code would increase the busy count before checking host_blocked. > With 6eb045e092ef, the busy count would be increased (by setting the > SCMD_STATE_INFLIGHT bit) after the if clause for host_blocked above. > > Users have reported a regression with the smartpqi driver [1] which has been > shown to be caused by this commit [2]. > Please note these, where potential issue was discussed before: https://lore.kernel.org/linux-scsi/8a3c8d37-8d15-4060-f461-8d400b19cb34@suse.de/ https://lore.kernel.org/linux-scsi/20200714104437.GB602708@T590/ > It seems that by moving the increase of the busy counter further down, it could > happen that the can_queue limit of the controller could be exceeded if several > CPUs were executing this code in parallel on different queues. > > This patch attempts to fix it by moving setting the SCMD_STATE_INFLIGHT before > the host_blocked test again. It also inserts barriers to make sure > scsi_host_busy() on once CPU will notice the increase of the count from another. > > [1]: https://marc.info/?l=linux-scsi&m=160271263114829&w=2 > [2]: https://marc.info/?l=linux-scsi&m=161116163722099&w=2 > > Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") For failing test, it would be good to know: - host .can_queue - host .nr_hw_queues - number of LUNs in test and their queue depth All can be got from lsscsi, apart from nr_hw_queues, which can be got from scsi_host sysfs for kernel >= 5.10 Cheers, John > > Cc: Ming Lei <ming.lei@redhat.com> > Cc: Don Brace <Don.Brace@microchip.com> > Cc: Kevin Barnett <Kevin.Barnett@microchip.com> > Cc: Donald Buczek <buczek@molgen.mpg.de> > Cc: John Garry <john.garry@huawei.com> > Cc: Paul Menzel <pmenzel@molgen.mpg.de> > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > drivers/scsi/hosts.c | 2 ++ > drivers/scsi/scsi_lib.c | 8 +++++--- > 2 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c > index 2f162603876f..1c452a1c18fd 100644 > --- a/drivers/scsi/hosts.c > +++ b/drivers/scsi/hosts.c > @@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data, > int *count = data; > struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq); > > + /* This pairs with set_bit() in scsi_host_queue_ready() */ > + smp_mb__before_atomic(); > if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) > (*count)++; > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index b3f14f05340a..0a9a36c349ee 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -1353,8 +1353,12 @@ static inline int scsi_host_queue_ready(struct request_queue *q, > if (scsi_host_in_recovery(shost)) > return 0; > > + set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > + /* This pairs with test_bit() in scsi_host_check_in_flight() */ > + smp_mb__after_atomic(); > + > if (atomic_read(&shost->host_blocked) > 0) { > - if (scsi_host_busy(shost) > 0) > + if (scsi_host_busy(shost) > 1) > goto starved; > > /* > @@ -1379,8 +1383,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q, > spin_unlock_irq(shost->host_lock); > } > > - 
__set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > - > return 1; > > starved: >
On 20.01.21 19:45, mwilck@suse.com wrote: > From: Martin Wilck <mwilck@suse.com> > > Donald: please give this patch a try. > > Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") > [...] Unfortunately, this patch (on top of 6eb045e092ef) did not fix the problem. Same error ("controller is offline: status code 0x6100c") after about 20 minutes of the test load. Best Donald
On Thu, 2021-01-21 at 10:07 +0100, Donald Buczek wrote: > On 20.01.21 19:45, mwilck@suse.com wrote: > > From: Martin Wilck <mwilck@suse.com> > > > > Donald: please give this patch a try. > > > > Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter > > for scsi_mq") > > [...] > > > Unfortunately, this patch (on top of 6eb045e092ef) did not fix the > problem. Same error ("controller is offline: status code 0x6100c") > after about 20 minutes of the test load. Too bad. Well, I believe it was worth a try. Thanks for testing it. Martin
On 20.01.21 21:26, John Garry wrote: > On 20/01/2021 18:45, mwilck@suse.com wrote: >> From: Martin Wilck <mwilck@suse.com> >> >> Donald: please give this patch a try. >> >> Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") >> contained this hunk: >> >> - busy = atomic_inc_return(&shost->host_busy) - 1; >> if (atomic_read(&shost->host_blocked) > 0) { >> - if (busy) >> + if (scsi_host_busy(shost) > 0) >> goto starved; >> >> The previous code would increase the busy count before checking host_blocked. >> With 6eb045e092ef, the busy count would be increased (by setting the >> SCMD_STATE_INFLIGHT bit) after the if clause for host_blocked above. >> >> Users have reported a regression with the smartpqi driver [1] which has been >> shown to be caused by this commit [2]. >> > > Please note these, where potential issue was discussed before: > https://lore.kernel.org/linux-scsi/8a3c8d37-8d15-4060-f461-8d400b19cb34@suse.de/ > https://lore.kernel.org/linux-scsi/20200714104437.GB602708@T590/ > >> It seems that by moving the increase of the busy counter further down, it could >> happen that the can_queue limit of the controller could be exceeded if several >> CPUs were executing this code in parallel on different queues. >> >> This patch attempts to fix it by moving setting the SCMD_STATE_INFLIGHT before >> the host_blocked test again. It also inserts barriers to make sure >> scsi_host_busy() on once CPU will notice the increase of the count from another. >> >> [1]: https://marc.info/?l=linux-scsi&m=160271263114829&w=2 >> [2]: https://marc.info/?l=linux-scsi&m=161116163722099&w=2 >> >> Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") > > For failing test, it would be good to know: > - host .can_queue > - host .nr_hw_queues > - number of LUNs in test and their queue dept>> All can be got from lsscsi, apart from nr_hw_queues, which can be got from scsi_host sysfs for kernel >= 5.10 The last test was on 6eb045e092ef + Martins experimental patch, which technically is 5.4.0-rc1 I'm not 100% sure about which data you need and where to find nr_hw_queues exposed. Please ask, if you need more data. 
+ lsscsi [0:0:0:0] disk ATA ST500NM0011 PA09 /dev/sda [0:0:32:0] enclosu DP BP13G+ 2.23 - [12:0:0:0] disk SEAGATE ST8000NM001A E002 /dev/sdb [12:0:1:0] disk SEAGATE ST8000NM001A E002 /dev/sdc [12:0:2:0] disk SEAGATE ST8000NM001A E002 /dev/sdd [12:0:3:0] disk SEAGATE ST8000NM001A E002 /dev/sde [12:0:4:0] disk SEAGATE ST8000NM001A E002 /dev/sdf [12:0:5:0] disk SEAGATE ST8000NM001A E002 /dev/sdg [12:0:6:0] disk SEAGATE ST8000NM001A E002 /dev/sdh [12:0:7:0] disk SEAGATE ST8000NM001A E002 /dev/sdi [12:0:8:0] disk SEAGATE ST8000NM001A E002 /dev/sdj [12:0:9:0] disk SEAGATE ST8000NM001A E002 /dev/sdk [12:0:10:0] disk SEAGATE ST8000NM001A E002 /dev/sdl [12:0:11:0] disk SEAGATE ST8000NM001A E002 /dev/sdm [12:0:12:0] disk SEAGATE ST8000NM001A E002 /dev/sdn [12:0:13:0] disk SEAGATE ST8000NM001A E002 /dev/sdo [12:0:14:0] disk SEAGATE ST8000NM001A E002 /dev/sdp [12:0:15:0] disk SEAGATE ST8000NM001A E002 /dev/sdq [12:0:16:0] disk SEAGATE ST8000NM001A E002 /dev/sdr [12:0:17:0] disk SEAGATE ST8000NM001A E002 /dev/sds [12:0:18:0] disk SEAGATE ST8000NM001A E002 /dev/sdt [12:0:19:0] disk SEAGATE ST8000NM001A E002 /dev/sdu [12:0:20:0] disk SEAGATE ST8000NM001A E002 /dev/sdv [12:0:21:0] disk SEAGATE ST8000NM001A E002 /dev/sdw [12:0:22:0] disk SEAGATE ST8000NM001A E002 /dev/sdx [12:0:23:0] disk SEAGATE ST8000NM001A E002 /dev/sdy [12:0:24:0] disk SEAGATE ST8000NM001A E001 /dev/sdz [12:0:25:0] disk SEAGATE ST8000NM001A E001 /dev/sdaa [12:0:26:0] disk SEAGATE ST8000NM001A E002 /dev/sdab [12:0:27:0] disk SEAGATE ST8000NM001A E002 /dev/sdac [12:0:28:0] disk SEAGATE ST8000NM001A E001 /dev/sdad [12:0:29:0] disk SEAGATE ST8000NM001A E001 /dev/sdae [12:0:30:0] disk SEAGATE ST8000NM001A E001 /dev/sdaf [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag [12:0:32:0] enclosu AIC 12G 3U16SAS3swap 0c01 - [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - [12:2:0:0] storage Adaptec 1100-8e 3.21 - + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 + lsscsi -L 12:2:0:0 [12:2:0:0] storage Adaptec 1100-8e 3.21 - device_blocked=0 iocounterbits=32 iodone_cnt=0x1e ioerr_cnt=0x0 iorequest_cnt=0x1e queue_depth=1013 queue_type=simple scsi_level=6 state=running timeout=30 type=12 + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 + lsscsi -L 12:0:34:0 [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - device_blocked=0 iocounterbits=32 iodone_cnt=0x12 ioerr_cnt=0x0 iorequest_cnt=0x12 queue_depth=1 queue_type=none scsi_level=6 state=running timeout=30 type=13 + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 + lsscsi -L 12:0:33:0 [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - device_blocked=0 iocounterbits=32 iodone_cnt=0x12 ioerr_cnt=0x0 iorequest_cnt=0x12 queue_depth=1 queue_type=simple scsi_level=6 state=running timeout=30 type=13 + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 + lsscsi -L 12:0:31:0 [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag device_blocked=0 iocounterbits=32 iodone_cnt=0x19b94 ioerr_cnt=0x0 iorequest_cnt=0x19bba queue_depth=64 queue_type=simple scsi_level=8 state=running timeout=30 type=0 + cat /sys/bus/scsi/devices/host12/scsi_host/host12/can_queue 1013 Best Donald > > Cheers, > John > >> >> Cc: Ming Lei <ming.lei@redhat.com> >> Cc: Don Brace <Don.Brace@microchip.com> >> Cc: Kevin Barnett <Kevin.Barnett@microchip.com> >> Cc: Donald Buczek <buczek@molgen.mpg.de> >> Cc: John Garry <john.garry@huawei.com> >> Cc: Paul Menzel <pmenzel@molgen.mpg.de> >> Signed-off-by: Martin Wilck <mwilck@suse.com> >> --- >> drivers/scsi/hosts.c | 2 ++ >> drivers/scsi/scsi_lib.c | 8 
+++++--- >> 2 files changed, 7 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c >> index 2f162603876f..1c452a1c18fd 100644 >> --- a/drivers/scsi/hosts.c >> +++ b/drivers/scsi/hosts.c >> @@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data, >> int *count = data; >> struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq); >> + /* This pairs with set_bit() in scsi_host_queue_ready() */ >> + smp_mb__before_atomic(); >> if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) >> (*count)++; >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c >> index b3f14f05340a..0a9a36c349ee 100644 >> --- a/drivers/scsi/scsi_lib.c >> +++ b/drivers/scsi/scsi_lib.c >> @@ -1353,8 +1353,12 @@ static inline int scsi_host_queue_ready(struct request_queue *q, >> if (scsi_host_in_recovery(shost)) >> return 0; >> + set_bit(SCMD_STATE_INFLIGHT, &cmd->state); >> + /* This pairs with test_bit() in scsi_host_check_in_flight() */ >> + smp_mb__after_atomic(); >> + >> if (atomic_read(&shost->host_blocked) > 0) { >> - if (scsi_host_busy(shost) > 0) >> + if (scsi_host_busy(shost) > 1) >> goto starved; >> /* >> @@ -1379,8 +1383,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q, >> spin_unlock_irq(shost->host_lock); >> } >> - __set_bit(SCMD_STATE_INFLIGHT, &cmd->state); >> - >> return 1; >> starved: >> >
On 21/01/2021 12:01, Donald Buczek wrote: >>> pts to fix it by moving setting the SCMD_STATE_INFLIGHT before >>> the host_blocked test again. It also inserts barriers to make sure >>> scsi_host_busy() on once CPU will notice the increase of the count from another. >>> >>> [1]:https://marc.info/?l=linux-scsi&m=160271263114829&w=2 >>> [2]:https://marc.info/?l=linux-scsi&m=161116163722099&w=2 >>> >>> Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") >> For failing test, it would be good to know: >> - host .can_queue >> - host .nr_hw_queues >> - number of LUNs in test and their queue dept>> All can be got from lsscsi, apart from nr_hw_queues, which can be got from scsi_host sysfs for kernel >= 5.10 > The last test was on 6eb045e092ef + Martins experimental patch, which technically is 5.4.0-rc1 > > I'm not 100% sure about which data you need and where to find nr_hw_queues exposed. nr_hw_queues is not available on 5.4 kernel via sysfs it's prob same as count of CPUs in the system or you can check number of hctxX folders in /sys/kernel/debug/block/sdX (need to be root, and debugfs enabled) Please ask, if you need more data. > > + lsscsi > [0:0:0:0] disk ATA ST500NM0011 PA09 /dev/sda > [0:0:32:0] enclosu DP BP13G+ 2.23 - > [12:0:0:0] disk SEAGATE ST8000NM001A E002 /dev/sdb > [12:0:1:0] disk SEAGATE ST8000NM001A E002 /dev/sdc > [12:0:2:0] disk SEAGATE ST8000NM001A E002 /dev/sdd > [12:0:3:0] disk SEAGATE ST8000NM001A E002 /dev/sde > [12:0:4:0] disk SEAGATE ST8000NM001A E002 /dev/sdf > [12:0:5:0] disk SEAGATE ST8000NM001A E002 /dev/sdg > [12:0:6:0] disk SEAGATE ST8000NM001A E002 /dev/sdh > [12:0:7:0] disk SEAGATE ST8000NM001A E002 /dev/sdi > [12:0:8:0] disk SEAGATE ST8000NM001A E002 /dev/sdj > [12:0:9:0] disk SEAGATE ST8000NM001A E002 /dev/sdk > [12:0:10:0] disk SEAGATE ST8000NM001A E002 /dev/sdl > [12:0:11:0] disk SEAGATE ST8000NM001A E002 /dev/sdm > [12:0:12:0] disk SEAGATE ST8000NM001A E002 /dev/sdn > [12:0:13:0] disk SEAGATE ST8000NM001A E002 /dev/sdo > [12:0:14:0] disk SEAGATE ST8000NM001A E002 /dev/sdp > [12:0:15:0] disk SEAGATE ST8000NM001A E002 /dev/sdq > [12:0:16:0] disk SEAGATE ST8000NM001A E002 /dev/sdr > [12:0:17:0] disk SEAGATE ST8000NM001A E002 /dev/sds > [12:0:18:0] disk SEAGATE ST8000NM001A E002 /dev/sdt > [12:0:19:0] disk SEAGATE ST8000NM001A E002 /dev/sdu > [12:0:20:0] disk SEAGATE ST8000NM001A E002 /dev/sdv > [12:0:21:0] disk SEAGATE ST8000NM001A E002 /dev/sdw > [12:0:22:0] disk SEAGATE ST8000NM001A E002 /dev/sdx > [12:0:23:0] disk SEAGATE ST8000NM001A E002 /dev/sdy > [12:0:24:0] disk SEAGATE ST8000NM001A E001 /dev/sdz > [12:0:25:0] disk SEAGATE ST8000NM001A E001 /dev/sdaa > [12:0:26:0] disk SEAGATE ST8000NM001A E002 /dev/sdab > [12:0:27:0] disk SEAGATE ST8000NM001A E002 /dev/sdac > [12:0:28:0] disk SEAGATE ST8000NM001A E001 /dev/sdad > [12:0:29:0] disk SEAGATE ST8000NM001A E001 /dev/sdae > [12:0:30:0] disk SEAGATE ST8000NM001A E001 /dev/sdaf > [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag > [12:0:32:0] enclosu AIC 12G 3U16SAS3swap 0c01 - > [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - > [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - > [12:2:0:0] storage Adaptec 1100-8e 3.21 - > + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 I figure sdev queue depth is 64 for all disks, like /dev/sdag, below. lsscsi -l would give that But, as said here: https://lore.kernel.org/linux-scsi/20200714104437.GB602708@T590/ Looks like block layer can send too many commands to SCSI host if trying to achieve highest datarates with all those 32x disks. 
Before 6eb045e092ef, that would be avoided. That's how I see it. > + lsscsi -L 12:2:0:0 > [12:2:0:0] storage Adaptec 1100-8e 3.21 - > device_blocked=0 > iocounterbits=32 > iodone_cnt=0x1e > ioerr_cnt=0x0 > iorequest_cnt=0x1e > queue_depth=1013 > queue_type=simple > scsi_level=6 > state=running > timeout=30 > type=12 > + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 > + lsscsi -L 12:0:34:0 > [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - > device_blocked=0 > iocounterbits=32 > iodone_cnt=0x12 > ioerr_cnt=0x0 > iorequest_cnt=0x12 > queue_depth=1 > queue_type=none > scsi_level=6 > state=running > timeout=30 > type=13 > + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 > + lsscsi -L 12:0:33:0 > [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - > device_blocked=0 > iocounterbits=32 > iodone_cnt=0x12 > ioerr_cnt=0x0 > iorequest_cnt=0x12 > queue_depth=1 > queue_type=simple > scsi_level=6 > state=running > timeout=30 > type=13 > + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 > + lsscsi -L 12:0:31:0 > [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag > device_blocked=0 > iocounterbits=32 > iodone_cnt=0x19b94 > ioerr_cnt=0x0 > iorequest_cnt=0x19bba > queue_depth=64 > queue_type=simple > scsi_level=8 > state=running > timeout=30 > type=0 > + cat /sys/bus/scsi/devices/host12/scsi_host/host12/can_queue > 1013 > Thanks, John
On 21.01.21 13:35, John Garry wrote: > On 21/01/2021 12:01, Donald Buczek wrote: >>>> pts to fix it by moving setting the SCMD_STATE_INFLIGHT before >>>> the host_blocked test again. It also inserts barriers to make sure >>>> scsi_host_busy() on once CPU will notice the increase of the count from another. >>>> >>>> [1]:https://marc.info/?l=linux-scsi&m=160271263114829&w=2 >>>> [2]:https://marc.info/?l=linux-scsi&m=161116163722099&w=2 >>>> >>>> Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") >>> For failing test, it would be good to know: >>> - host .can_queue >>> - host .nr_hw_queues >>> - number of LUNs in test and their queue dept>> All can be got from lsscsi, apart from nr_hw_queues, which can be got from scsi_host sysfs for kernel >= 5.10 >> The last test was on 6eb045e092ef + Martins experimental patch, which technically is 5.4.0-rc1 >> >> I'm not 100% sure about which data you need and where to find nr_hw_queues exposed. > > nr_hw_queues is not available on 5.4 kernel via sysfs > > it's prob same as count of CPUs in the system > > or you can check number of hctxX folders in /sys/kernel/debug/block/sdX (need to be root, and debugfs enabled) root@deadbird:~# ls -d /sys/kernel/debug/block/sdz/hc* /sys/kernel/debug/block/sdz/hctx0 /sys/kernel/debug/block/sdz/hctx16 /sys/kernel/debug/block/sdz/hctx23 /sys/kernel/debug/block/sdz/hctx30 /sys/kernel/debug/block/sdz/hctx38 /sys/kernel/debug/block/sdz/hctx1 /sys/kernel/debug/block/sdz/hctx17 /sys/kernel/debug/block/sdz/hctx24 /sys/kernel/debug/block/sdz/hctx31 /sys/kernel/debug/block/sdz/hctx39 /sys/kernel/debug/block/sdz/hctx10 /sys/kernel/debug/block/sdz/hctx18 /sys/kernel/debug/block/sdz/hctx25 /sys/kernel/debug/block/sdz/hctx32 /sys/kernel/debug/block/sdz/hctx4 /sys/kernel/debug/block/sdz/hctx11 /sys/kernel/debug/block/sdz/hctx19 /sys/kernel/debug/block/sdz/hctx26 /sys/kernel/debug/block/sdz/hctx33 /sys/kernel/debug/block/sdz/hctx5 /sys/kernel/debug/block/sdz/hctx12 /sys/kernel/debug/block/sdz/hctx2 /sys/kernel/debug/block/sdz/hctx27 /sys/kernel/debug/block/sdz/hctx34 /sys/kernel/debug/block/sdz/hctx6 /sys/kernel/debug/block/sdz/hctx13 /sys/kernel/debug/block/sdz/hctx20 /sys/kernel/debug/block/sdz/hctx28 /sys/kernel/debug/block/sdz/hctx35 /sys/kernel/debug/block/sdz/hctx7 /sys/kernel/debug/block/sdz/hctx14 /sys/kernel/debug/block/sdz/hctx21 /sys/kernel/debug/block/sdz/hctx29 /sys/kernel/debug/block/sdz/hctx36 /sys/kernel/debug/block/sdz/hctx8 /sys/kernel/debug/block/sdz/hctx15 /sys/kernel/debug/block/sdz/hctx22 /sys/kernel/debug/block/sdz/hctx3 /sys/kernel/debug/block/sdz/hctx37 /sys/kernel/debug/block/sdz/hctx9 root@deadbird:~# ls -d /sys/kernel/debug/block/sdz/hc*|wc 40 40 1390 root@deadbird:~# nproc 40 > Please ask, if you need more data. 
>> >> + lsscsi >> [0:0:0:0] disk ATA ST500NM0011 PA09 /dev/sda >> [0:0:32:0] enclosu DP BP13G+ 2.23 - >> [12:0:0:0] disk SEAGATE ST8000NM001A E002 /dev/sdb >> [12:0:1:0] disk SEAGATE ST8000NM001A E002 /dev/sdc >> [12:0:2:0] disk SEAGATE ST8000NM001A E002 /dev/sdd >> [12:0:3:0] disk SEAGATE ST8000NM001A E002 /dev/sde >> [12:0:4:0] disk SEAGATE ST8000NM001A E002 /dev/sdf >> [12:0:5:0] disk SEAGATE ST8000NM001A E002 /dev/sdg >> [12:0:6:0] disk SEAGATE ST8000NM001A E002 /dev/sdh >> [12:0:7:0] disk SEAGATE ST8000NM001A E002 /dev/sdi >> [12:0:8:0] disk SEAGATE ST8000NM001A E002 /dev/sdj >> [12:0:9:0] disk SEAGATE ST8000NM001A E002 /dev/sdk >> [12:0:10:0] disk SEAGATE ST8000NM001A E002 /dev/sdl >> [12:0:11:0] disk SEAGATE ST8000NM001A E002 /dev/sdm >> [12:0:12:0] disk SEAGATE ST8000NM001A E002 /dev/sdn >> [12:0:13:0] disk SEAGATE ST8000NM001A E002 /dev/sdo >> [12:0:14:0] disk SEAGATE ST8000NM001A E002 /dev/sdp >> [12:0:15:0] disk SEAGATE ST8000NM001A E002 /dev/sdq >> [12:0:16:0] disk SEAGATE ST8000NM001A E002 /dev/sdr >> [12:0:17:0] disk SEAGATE ST8000NM001A E002 /dev/sds >> [12:0:18:0] disk SEAGATE ST8000NM001A E002 /dev/sdt >> [12:0:19:0] disk SEAGATE ST8000NM001A E002 /dev/sdu >> [12:0:20:0] disk SEAGATE ST8000NM001A E002 /dev/sdv >> [12:0:21:0] disk SEAGATE ST8000NM001A E002 /dev/sdw >> [12:0:22:0] disk SEAGATE ST8000NM001A E002 /dev/sdx >> [12:0:23:0] disk SEAGATE ST8000NM001A E002 /dev/sdy >> [12:0:24:0] disk SEAGATE ST8000NM001A E001 /dev/sdz >> [12:0:25:0] disk SEAGATE ST8000NM001A E001 /dev/sdaa >> [12:0:26:0] disk SEAGATE ST8000NM001A E002 /dev/sdab >> [12:0:27:0] disk SEAGATE ST8000NM001A E002 /dev/sdac >> [12:0:28:0] disk SEAGATE ST8000NM001A E001 /dev/sdad >> [12:0:29:0] disk SEAGATE ST8000NM001A E001 /dev/sdae >> [12:0:30:0] disk SEAGATE ST8000NM001A E001 /dev/sdaf >> [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag >> [12:0:32:0] enclosu AIC 12G 3U16SAS3swap 0c01 - >> [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - >> [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - >> [12:2:0:0] storage Adaptec 1100-8e 3.21 - >> + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 > > I figure sdev queue depth is 64 for all disks, like /dev/sdag, below. Yes, I send an example (one of two enclosures, 1 of 32 disks) but verified, that they are the same. Best Donald > > lsscsi -l would give that > > But, as said here: > https://lore.kernel.org/linux-scsi/20200714104437.GB602708@T590/ > > Looks like block layer can send too many commands to SCSI host if trying to achieve highest datarates with all those 32x disks. Before 6eb045e092ef, that would be avoided. That's how I see it. 
> >> + lsscsi -L 12:2:0:0 >> [12:2:0:0] storage Adaptec 1100-8e 3.21 - >> device_blocked=0 >> iocounterbits=32 >> iodone_cnt=0x1e >> ioerr_cnt=0x0 >> iorequest_cnt=0x1e >> queue_depth=1013 >> queue_type=simple >> scsi_level=6 >> state=running >> timeout=30 >> type=12 >> + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 >> + lsscsi -L 12:0:34:0 >> [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - >> device_blocked=0 >> iocounterbits=32 >> iodone_cnt=0x12 >> ioerr_cnt=0x0 >> iorequest_cnt=0x12 >> queue_depth=1 >> queue_type=none >> scsi_level=6 >> state=running >> timeout=30 >> type=13 >> + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 >> + lsscsi -L 12:0:33:0 >> [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - >> device_blocked=0 >> iocounterbits=32 >> iodone_cnt=0x12 >> ioerr_cnt=0x0 >> iorequest_cnt=0x12 >> queue_depth=1 >> queue_type=simple >> scsi_level=6 >> state=running >> timeout=30 >> type=13 >> + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 >> + lsscsi -L 12:0:31:0 >> [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag >> device_blocked=0 >> iocounterbits=32 >> iodone_cnt=0x19b94 >> ioerr_cnt=0x0 >> iorequest_cnt=0x19bba >> queue_depth=64 >> queue_type=simple >> scsi_level=8 >> state=running >> timeout=30 >> type=0 >> + cat /sys/bus/scsi/devices/host12/scsi_host/host12/can_queue >> 1013 >> > > Thanks, > John >
>>> I'm not 100% sure about which data you need and where to find >>> nr_hw_queues exposed. >> >> nr_hw_queues is not available on 5.4 kernel via sysfs >> >> it's prob same as count of CPUs in the system >> >> or you can check number of hctxX folders in >> /sys/kernel/debug/block/sdX (need to be root, and debugfs enabled) > > root@deadbird:~# ls -d /sys/kernel/debug/block/sdz/hc* > /sys/kernel/debug/block/sdz/hctx0 /sys/kernel/debug/block/sdz/hctx16 > /sys/kernel/debug/block/sdz/hctx23 /sys/kernel/debug/block/sdz/hctx30 > /sys/kernel/debug/block/sdz/hctx38 > /sys/kernel/debug/block/sdz/hctx1 /sys/kernel/debug/block/sdz/hctx17 > /sys/kernel/debug/block/sdz/hctx24 /sys/kernel/debug/block/sdz/hctx31 > /sys/kernel/debug/block/sdz/hctx39 > /sys/kernel/debug/block/sdz/hctx10 /sys/kernel/debug/block/sdz/hctx18 > /sys/kernel/debug/block/sdz/hctx25 /sys/kernel/debug/block/sdz/hctx32 > /sys/kernel/debug/block/sdz/hctx4 > /sys/kernel/debug/block/sdz/hctx11 /sys/kernel/debug/block/sdz/hctx19 > /sys/kernel/debug/block/sdz/hctx26 /sys/kernel/debug/block/sdz/hctx33 > /sys/kernel/debug/block/sdz/hctx5 > /sys/kernel/debug/block/sdz/hctx12 /sys/kernel/debug/block/sdz/hctx2 > /sys/kernel/debug/block/sdz/hctx27 /sys/kernel/debug/block/sdz/hctx34 > /sys/kernel/debug/block/sdz/hctx6 > /sys/kernel/debug/block/sdz/hctx13 /sys/kernel/debug/block/sdz/hctx20 > /sys/kernel/debug/block/sdz/hctx28 /sys/kernel/debug/block/sdz/hctx35 > /sys/kernel/debug/block/sdz/hctx7 > /sys/kernel/debug/block/sdz/hctx14 /sys/kernel/debug/block/sdz/hctx21 > /sys/kernel/debug/block/sdz/hctx29 /sys/kernel/debug/block/sdz/hctx36 > /sys/kernel/debug/block/sdz/hctx8 > /sys/kernel/debug/block/sdz/hctx15 /sys/kernel/debug/block/sdz/hctx22 > /sys/kernel/debug/block/sdz/hctx3 /sys/kernel/debug/block/sdz/hctx37 > /sys/kernel/debug/block/sdz/hctx9 > root@deadbird:~# ls -d /sys/kernel/debug/block/sdz/hc*|wc > 40 40 1390 > root@deadbird:~# nproc > 40 > > >> Please ask, if you need more data. 
>>> >>> + lsscsi >>> [0:0:0:0] disk ATA ST500NM0011 PA09 /dev/sda >>> [0:0:32:0] enclosu DP BP13G+ 2.23 - >>> [12:0:0:0] disk SEAGATE ST8000NM001A E002 /dev/sdb >>> [12:0:1:0] disk SEAGATE ST8000NM001A E002 /dev/sdc >>> [12:0:2:0] disk SEAGATE ST8000NM001A E002 /dev/sdd >>> [12:0:3:0] disk SEAGATE ST8000NM001A E002 /dev/sde >>> [12:0:4:0] disk SEAGATE ST8000NM001A E002 /dev/sdf >>> [12:0:5:0] disk SEAGATE ST8000NM001A E002 /dev/sdg >>> [12:0:6:0] disk SEAGATE ST8000NM001A E002 /dev/sdh >>> [12:0:7:0] disk SEAGATE ST8000NM001A E002 /dev/sdi >>> [12:0:8:0] disk SEAGATE ST8000NM001A E002 /dev/sdj >>> [12:0:9:0] disk SEAGATE ST8000NM001A E002 /dev/sdk >>> [12:0:10:0] disk SEAGATE ST8000NM001A E002 /dev/sdl >>> [12:0:11:0] disk SEAGATE ST8000NM001A E002 /dev/sdm >>> [12:0:12:0] disk SEAGATE ST8000NM001A E002 /dev/sdn >>> [12:0:13:0] disk SEAGATE ST8000NM001A E002 /dev/sdo >>> [12:0:14:0] disk SEAGATE ST8000NM001A E002 /dev/sdp >>> [12:0:15:0] disk SEAGATE ST8000NM001A E002 /dev/sdq >>> [12:0:16:0] disk SEAGATE ST8000NM001A E002 /dev/sdr >>> [12:0:17:0] disk SEAGATE ST8000NM001A E002 /dev/sds >>> [12:0:18:0] disk SEAGATE ST8000NM001A E002 /dev/sdt >>> [12:0:19:0] disk SEAGATE ST8000NM001A E002 /dev/sdu >>> [12:0:20:0] disk SEAGATE ST8000NM001A E002 /dev/sdv >>> [12:0:21:0] disk SEAGATE ST8000NM001A E002 /dev/sdw >>> [12:0:22:0] disk SEAGATE ST8000NM001A E002 /dev/sdx >>> [12:0:23:0] disk SEAGATE ST8000NM001A E002 /dev/sdy >>> [12:0:24:0] disk SEAGATE ST8000NM001A E001 /dev/sdz >>> [12:0:25:0] disk SEAGATE ST8000NM001A E001 /dev/sdaa >>> [12:0:26:0] disk SEAGATE ST8000NM001A E002 /dev/sdab >>> [12:0:27:0] disk SEAGATE ST8000NM001A E002 /dev/sdac >>> [12:0:28:0] disk SEAGATE ST8000NM001A E001 /dev/sdad >>> [12:0:29:0] disk SEAGATE ST8000NM001A E001 /dev/sdae >>> [12:0:30:0] disk SEAGATE ST8000NM001A E001 /dev/sdaf >>> [12:0:31:0] disk SEAGATE ST8000NM001A E001 /dev/sdag >>> [12:0:32:0] enclosu AIC 12G 3U16SAS3swap 0c01 - >>> [12:0:33:0] enclosu AIC 12G 3U16SAS3swap 0c01 - >>> [12:0:34:0] enclosu Adaptec Smart Adapter 3.21 - >>> [12:2:0:0] storage Adaptec 1100-8e 3.21 - >>> + for i in 12:2:0:0 12:0:34:0 12:0:33:0 12:0:31:0 >> >> I figure sdev queue depth is 64 for all disks, like /dev/sdag, below. > > Yes, I send an example (one of two enclosures, 1 of 32 disks) but > verified, that they are the same. Confirmed my suspicions - it looks like the host is sent more commands than it can handle. We would need many disks to see this issue though, which you have. So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and I suppose it could be simply fixed by setting .host_tagset in scsi host template there. Thanks, John
On Thu, 2021-01-21 at 13:05 +0000, John Garry wrote: > > > > > Confirmed my suspicions - it looks like the host is sent more > commands > than it can handle. We would need many disks to see this issue > though, > which you have. > > So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and > I > suppose it could be simply fixed by setting .host_tagset in scsi host > template there. If it's really just that, it should be easy enough to verify. @Donald, you'd need to test with a 5.10 kernel, and after reproducing the issue, add .host_tagset = 1, to the definition of pqi_driver_template in drivers/scsi/smartpqi/smartpqi_init.c. You don't need a patch to test that, I believe. Would you be able to do this test? Regards, Martin
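For reference, the one-line change Martin describes above would look roughly like the sketch below. The context line inside pqi_driver_template is an assumption for illustration; only the added .host_tagset line matters.

    --- a/drivers/scsi/smartpqi/smartpqi_init.c
    +++ b/drivers/scsi/smartpqi/smartpqi_init.c
    @@ static struct scsi_host_template pqi_driver_template = {
     	.module = THIS_MODULE,
    +	.host_tagset = 1,	/* share one tag space across all hw queues */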
On Thu, 2021-01-21 at 11:05 +0100, Martin Wilck wrote: > On Thu, 2021-01-21 at 10:07 +0100, Donald Buczek wrote: > > > > Unfortunately, this patch (on top of 6eb045e092ef) did not fix the > > problem. Same error ("controller is offline: status code > > 0x6100c") Rethinking, this patch couldn't have fixed your problem. I apologize for the dumb suggestion. However, I still believe I had a point in that patch, and would like to ask the experts for an opinion. Assume no commands in flight (busy == 0), host_blocked == 2, and that two CPUs enter scsi_host_queue_ready() at the same time. Before 6eb045e092ef, it was impossible that the function returned 1 on either CPU in this situation.

    CPU 1                            CPU 2
    scsi_host_queue_ready            scsi_host_queue_ready
    host_busy = 1
                                     host_busy = 2
    (busy = 0)                       (busy = 1)
    host_blocked = 1
                                     goto starved
    goto out_dec
                                     add sdev to starved list
    host_busy = 1
                                     host_busy = 0
    return 0                         return 0

With 6eb045e092ef (and without my patch), the result could be:

    CPU 1                            CPU 2
    scsi_host_queue_ready            scsi_host_queue_ready
    read scsi_host_busy() = 0        read scsi_host_busy() = 0
    host_blocked = 1
                                     host_blocked = 0
    goto out_dec
                                     remove sdev from starved list
    return 0
                                     set SCMD_STATE_INFLIGHT (host_busy = 1)
                                     return 1

So now, one command can be sent, and the host is unblocked. A very different outcome than before. Was this intentional? Thanks Martin
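As a minimal illustration of the second interleaving above, the standalone userspace program below (plain C11 atomics plus pthreads; the names and the simplified control flow are illustrative, not kernel code) mimics scsi_host_queue_ready() with the post-6eb045e092ef ordering: both threads read a busy count of 0 before either marks a command in flight, so one command is dispatched and host_blocked drops to 0, whereas the old ordering would have dispatched none.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int host_blocked = 2;	/* shost->host_blocked == 2 */
static atomic_int in_flight;		/* no commands in flight (busy == 0) */

/* Simplified stand-in for scsi_host_queue_ready() with the new ordering. */
static void *queue_ready(void *arg)
{
	int busy = atomic_load(&in_flight);		/* scsi_host_busy() */

	if (atomic_load(&host_blocked) > 0) {
		if (busy > 0)
			return (void *)0;		/* "goto starved" */
		/* busy == 0: consume one blocked credit */
		if (atomic_fetch_sub(&host_blocked, 1) - 1 > 0)
			return (void *)0;		/* "goto out_dec" */
	}

	atomic_fetch_add(&in_flight, 1);		/* SCMD_STATE_INFLIGHT set last */
	return (void *)1;				/* command dispatched */
}

int main(void)
{
	pthread_t t[2];
	long dispatched = 0;
	void *ret;

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, queue_ready, NULL);
	for (int i = 0; i < 2; i++) {
		pthread_join(t[i], &ret);
		dispatched += (ret != NULL);
	}

	/* Prints "dispatched=1 host_blocked=0": one command got through and
	 * the host was unblocked, which the pre-6eb045e092ef ordering (bump
	 * the busy count before the host_blocked check) would not allow. */
	printf("dispatched=%ld host_blocked=%d\n",
	       dispatched, atomic_load(&host_blocked));
	return 0;
}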
On Wed, Jan 20, 2021 at 07:45:48PM +0100, mwilck@suse.com wrote: > From: Martin Wilck <mwilck@suse.com> > > Donald: please give this patch a try. > > Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") > contained this hunk: > > - busy = atomic_inc_return(&shost->host_busy) - 1; > if (atomic_read(&shost->host_blocked) > 0) { > - if (busy) > + if (scsi_host_busy(shost) > 0) > goto starved; > > The previous code would increase the busy count before checking host_blocked. > With 6eb045e092ef, the busy count would be increased (by setting the > SCMD_STATE_INFLIGHT bit) after the if clause for host_blocked above. > > Users have reported a regression with the smartpqi driver [1] which has been > shown to be caused by this commit [2]. > > It seems that by moving the increase of the busy counter further down, it could > happen that the can_queue limit of the controller could be exceeded if several > CPUs were executing this code in parallel on different queues. can_queue limit should never be exceeded because it is respected by blk-mq since each hw queue's queue depth is .can_queue. smartpqi's issue is that its .can_queue does not represent each hw queue's depth, instead the .can_queue represents queue depth of the whole HBA. As John mentioned, smartpqi should have switched to hosttags. BTW, looks the following code has soft lockup risk: pqi_alloc_io_request(): while (1) { io_request = &ctrl_info->io_request_pool[i]; if (atomic_inc_return(&io_request->refcount) == 1) break; atomic_dec(&io_request->refcount); i = (i + 1) % ctrl_info->max_io_slots; } > > This patch attempts to fix it by moving setting the SCMD_STATE_INFLIGHT before > the host_blocked test again. It also inserts barriers to make sure > scsi_host_busy() on once CPU will notice the increase of the count from another. > > [1]: https://marc.info/?l=linux-scsi&m=160271263114829&w=2 > [2]: https://marc.info/?l=linux-scsi&m=161116163722099&w=2 If the above is true wrt. smartpqi's can_queue usage, your patch may not fix the issue completely in which you think '.can_queue is exceeded'. > > Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq") > > Cc: Ming Lei <ming.lei@redhat.com> > Cc: Don Brace <Don.Brace@microchip.com> > Cc: Kevin Barnett <Kevin.Barnett@microchip.com> > Cc: Donald Buczek <buczek@molgen.mpg.de> > Cc: John Garry <john.garry@huawei.com> > Cc: Paul Menzel <pmenzel@molgen.mpg.de> > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > drivers/scsi/hosts.c | 2 ++ > drivers/scsi/scsi_lib.c | 8 +++++--- > 2 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c > index 2f162603876f..1c452a1c18fd 100644 > --- a/drivers/scsi/hosts.c > +++ b/drivers/scsi/hosts.c > @@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data, > int *count = data; > struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq); > > + /* This pairs with set_bit() in scsi_host_queue_ready() */ > + smp_mb__before_atomic(); So the above barrier orders atomic_read(&shost->host_blocked) and test_bit()? 
> if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) > (*count)++; > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index b3f14f05340a..0a9a36c349ee 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -1353,8 +1353,12 @@ static inline int scsi_host_queue_ready(struct request_queue *q, > if (scsi_host_in_recovery(shost)) > return 0; > > + set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > + /* This pairs with test_bit() in scsi_host_check_in_flight() */ > + smp_mb__after_atomic(); > + > if (atomic_read(&shost->host_blocked) > 0) { > - if (scsi_host_busy(shost) > 0) > + if (scsi_host_busy(shost) > 1) > goto starved; > > /* > @@ -1379,8 +1383,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q, > spin_unlock_irq(shost->host_lock); > } > > - __set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > - Looks this patch fine. However, I'd suggest to confirm smartpqi's .can_queue usage first, which looks one big issue.
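Ming's point that .can_queue is applied per hardware queue rather than per host follows from how the midlayer sizes the blk-mq tag set. Below is a paraphrased sketch (not an exact quote of scsi_lib.c; field and flag names are from memory): without host_tagset, each of the nr_hw_queues maps gets its own can_queue tags, so a driver whose can_queue describes the whole HBA can be sent up to can_queue * nr_hw_queues commands in total.

static int scsi_mq_setup_tags_sketch(struct Scsi_Host *shost)
{
	struct blk_mq_tag_set *tag_set = &shost->tag_set;

	/* ops, cmd_size, numa_node etc. omitted for brevity */
	tag_set->nr_hw_queues = shost->nr_hw_queues ? : 1;
	tag_set->queue_depth = shost->can_queue;	/* per hw queue, not per host */
	if (shost->host_tagset)				/* available from v5.10 on */
		tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED; /* one shared tag space */

	return blk_mq_alloc_tag_set(tag_set);
}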
On Fri, 2021-01-22 at 11:23 +0800, Ming Lei wrote: > > > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c > > index 2f162603876f..1c452a1c18fd 100644 > > --- a/drivers/scsi/hosts.c > > +++ b/drivers/scsi/hosts.c > > @@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct > > request *rq, void *data, > > int *count = data; > > struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq); > > > > + /* This pairs with set_bit() in scsi_host_queue_ready() */ > > + smp_mb__before_atomic(); > > So the above barrier orders atomic_read(&shost->host_blocked) and > test_bit()? Yes, I believe the change to SCMD_STATE_INFLIGHT should be visible before host_blocked is tested. For that, I would need to insert a full smp_mb() instead of smp_mb__after_atomic() below, though ... right? Which means that the smp_mb__before_atomic() above could be removed. The important thing is that if two threads execute scsi_host_queue_ready() simultaneously, one of them must see the SCMD_STATE_INFLIGHT bit of the other. Btw, I realized that calling scsi_host_busy() in scsi_host_queue_ready() is inefficient, because all we need to know there is whether the value is > 1; we don't need to iterate through the entire tag space. Thanks, Martin > > > if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) > > (*count)++; > > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > > index b3f14f05340a..0a9a36c349ee 100644 > > --- a/drivers/scsi/scsi_lib.c > > +++ b/drivers/scsi/scsi_lib.c > > @@ -1353,8 +1353,12 @@ static inline int > > scsi_host_queue_ready(struct request_queue *q, > > if (scsi_host_in_recovery(shost)) > > return 0; > > > > + set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > > + /* This pairs with test_bit() in > > scsi_host_check_in_flight() */ > > + smp_mb__after_atomic(); > > + > > if (atomic_read(&shost->host_blocked) > 0) { > > - if (scsi_host_busy(shost) > 0) > > + if (scsi_host_busy(shost) > 1) > > goto starved; > > > > /* > > @@ -1379,8 +1383,6 @@ static inline int > > scsi_host_queue_ready(struct request_queue *q, > > spin_unlock_irq(shost->host_lock); > > } > > > > - __set_bit(SCMD_STATE_INFLIGHT, &cmd->state); > > - > > Looks this patch fine. > > However, I'd suggest to confirm smartpqi's .can_queue usage first, > which > looks one big issue. >
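Regarding Martin's last remark about not needing to walk the whole tag space: a rough, untested sketch of an early-exit variant could look like this. The function names here are made up, and it assumes that a busy_tag_iter_fn returning false makes blk_mq_tagset_busy_iter() stop iterating early.

static bool scsi_host_check_in_flight_capped(struct request *rq, void *data,
					     bool reserved)
{
	int *count = data;
	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state))
		(*count)++;

	/* Returning false stops the walk once a second command is seen. */
	return *count < 2;
}

/* True if more than one command is in flight; avoids scanning all tags. */
static bool scsi_host_busy_more_than_one(struct Scsi_Host *shost)
{
	int count = 0;

	blk_mq_tagset_busy_iter(&shost->tag_set,
				scsi_host_check_in_flight_capped, &count);
	return count > 1;
}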
-----Original Message----- From: John Garry [mailto:john.garry@huawei.com] Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early >>> I'm not 100% sure about which data you need and where to find >>> nr_hw_queues exposed. >> >> nr_hw_queues is not available on 5.4 kernel via sysfs >> >> it's prob same as count of CPUs in the system >> >> or you can check number of hctxX folders in >> /sys/kernel/debug/block/sdX (need to be root, and debugfs enabled) > >> I figure sdev queue depth is 64 for all disks, like /dev/sdag, below. > > Yes, I send an example (one of two enclosures, 1 of 32 disks) but > verified, that they are the same. Confirmed my suspicions - it looks like the host is sent more commands than it can handle. We would need many disks to see this issue though, which you have. So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and I suppose it could be simply fixed by setting .host_tagset in scsi host template there. Thanks, John John, I tried this for smartpqi, so far, setting host_tagset = 1 seems to be working. The issue normally repros very quickly. I want to run a few more tests before calling this a good fix. Thanks for your suggestion, Don Brace ---- [root@cleddyf ~]# lsscsi [0:0:0:0] disk Generic- SD/MMC CRW 1.00 /dev/sdc [1:0:0:0] disk ASMT 2115 0 /dev/sda [2:0:0:0] disk ASMT 2115 0 /dev/sdb [3:0:0:0] disk HP EG0900FBLSK HPD7 /dev/sdd [3:0:1:0] disk HP EG0900FBLSK HPD7 /dev/sde [3:0:2:0] disk HP EG0900FBLSK HPD7 /dev/sdf [3:0:3:0] disk HP EH0300FBQDD HPD5 /dev/sdg [3:0:4:0] disk HP EG0900FDJYR HPD4 /dev/sdh [3:0:5:0] disk HP EG0300FCVBF HPD9 /dev/sdi [3:0:6:0] disk HP EG0900FBLSK HPD7 /dev/sdj [3:0:7:0] disk HP EG0900FBLSK HPD7 /dev/sdk [3:0:8:0] disk HP EG0900FBLSK HPD7 /dev/sdl [3:0:9:0] disk HP MO0200FBRWB HPD9 /dev/sdm [3:0:10:0] disk HP MM0500FBFVQ HPD8 /dev/sdn [3:0:11:0] disk ATA MM0500GBKAK HPGC /dev/sdo [3:0:12:0] disk HP EG0900FBVFQ HPDC /dev/sdp [3:0:13:0] disk HP VO006400JWZJT HP00 /dev/sdq [3:0:14:0] disk HP VO015360JWZJN HP00 /dev/sdr [3:0:15:0] enclosu HP D3700 5.04 - [3:0:16:0] enclosu HP D3700 5.04 - [3:0:17:0] enclosu HPE Smart Adapter 3.00 - [3:1:0:0] disk HPE LOGICAL VOLUME 3.00 /dev/sds [3:2:0:0] storage HPE P408e-p SR Gen10 3.00 - --- [global] ioengine=libaio ; rw=randwrite ; percentage_random=40 rw=write size=100g bs=4k direct=1 ramp_time=15 ; filename=/mnt/fio_test ; cpus_allowed=0-27 iodepth=1024 [/dev/sdd] [/dev/sde] [/dev/sdf] [/dev/sdg] [/dev/sdh] [/dev/sdi] [/dev/sdj] [/dev/sdk] [/dev/sdl] [/dev/sdm] [/dev/sdn] [/dev/sdo] [/dev/sdp] [/dev/sdq] [/dev/sdr]
-----Original Message-----
From: John Garry [mailto:john.garry@huawei.com]
Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early
Confirmed my suspicions - it looks like the host is sent more commands than it can handle. We would need many disks to see this issue though, which you have.
So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and I suppose it could be simply fixed by setting .host_tagset in scsi host template there.
Thanks,
John
--
Don: Even though this works for current kernels, what would the chances be of this getting back-ported to 5.9 or even further?
Otherwise the original patch smartpqi_fix_host_qdepth_limit would correct this issue for older kernels.
Thanks,
Don Brace
On Tue, 2021-02-02 at 20:04 +0000, Don.Brace@microchip.com wrote: > -----Original Message----- > From: John Garry [mailto:john.garry@huawei.com] > Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count > early > > > Confirmed my suspicions - it looks like the host is sent more > commands than it can handle. We would need many disks to see this > issue though, which you have. > > So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and > I suppose it could be simply fixed by setting .host_tagset in scsi > host template there. > > Thanks, > John > -- > Don: Even though this works for current kernels, what would chances > of this getting back-ported to 5.9 or even further? > > Otherwise the original patch smartpqi_fix_host_qdepth_limit would > correct this issue for older kernels. True. However this is 5.12 material, so we shouldn't be bothered by that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure whether smartpqi_fix_host_qdepth_limit would be the solution. You could simply divide can_queue by nr_hw_queues, as suggested before, or even simpler, set nr_hw_queues = 1. How much performance would that cost you? Distribution kernels would be yet another issue, distros can backport host_tagset and get rid of the issue. Regards Martin
On 02/02/2021 20:48, Martin Wilck wrote: > On Tue, 2021-02-02 at 20:04 +0000, Don.Brace@microchip.com wrote: >> -----Original Message----- >> From: John Garry [mailto:john.garry@huawei.com] >> Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count >> early >> >> >> Confirmed my suspicions - it looks like the host is sent more >> commands than it can handle. We would need many disks to see this >> issue though, which you have. >> >> So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and >> I suppose it could be simply fixed by setting .host_tagset in scsi >> host template there. >> >> Thanks, >> John >> -- >> Don: Even though this works for current kernels, what would chances >> of this getting back-ported to 5.9 or even further? >> >> Otherwise the original patch smartpqi_fix_host_qdepth_limit would >> correct this issue for older kernels. > > True. However this is 5.12 material, so we shouldn't be bothered by > that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure > whether smartpqi_fix_host_qdepth_limit would be the solution. > You could simply divide can_queue by nr_hw_queues, as suggested before, > or even simpler, set nr_hw_queues = 1. > > How much performance would that cost you? > > Distribution kernels would be yet another issue, distros can backport > host_tagset and get rid of the issue. Aren't they (distros) the only issue? As I mentioned above, for 5.10 mainline stable, I think it's reasonable to backport a patch to set .host_tagset for the driver. Thanks, John
Dear Linux folks, Am 03.02.21 um 09:49 schrieb John Garry: > On 02/02/2021 20:48, Martin Wilck wrote: >> On Tue, 2021-02-02 at 20:04 +0000, Don.Brace@microchip.com wrote: >>> -----Original Message----- >>> From: John Garry [mailto:john.garry@huawei.com] >>> Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early >>> >>> >>> Confirmed my suspicions - it looks like the host is sent more >>> commands than it can handle. We would need many disks to see this >>> issue though, which you have. >>> >>> So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and >>> I suppose it could be simply fixed by setting .host_tagset in scsi >>> host template there. >>> >>> Thanks, >>> John >>> -- >>> Don: Even though this works for current kernels, what would chances >>> of this getting back-ported to 5.9 or even further? >>> >>> Otherwise the original patch smartpqi_fix_host_qdepth_limit would >>> correct this issue for older kernels. >> >> True. However this is 5.12 material, so we shouldn't be bothered by >> that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure >> whether smartpqi_fix_host_qdepth_limit would be the solution. >> You could simply divide can_queue by nr_hw_queues, as suggested before, >> or even simpler, set nr_hw_queues = 1. >> >> How much performance would that cost you? >> >> Distribution kernels would be yet another issue, distros can backport >> host_tagset and get rid of the issue. > > Aren't they (distros) the only issue? As I mentioned above, for 5.10 > mainline stable, I think it's reasonable to backport a patch to set > .host_tagset for the driver. Indeed. As per the Linux kernel Web site [1]: 5.5 and 5.9 are not maintained anymore (EOL) upstream. So, if distributions decided to go with another Linux kernel release, it’s their job to backport things. If the commit message is well written, and contains the Fixes tag, their tooling should be able to pick it up. Kind regards, Paul [1]: https://www.kernel.org/
-----Original Message----- From: Paul Menzel [mailto:pmenzel@molgen.mpg.de] Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early >>> Confirmed my suspicions - it looks like the host is sent more >>> commands than it can handle. We would need many disks to see this >>> issue though, which you have. >>> >>> So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, >>> and I suppose it could be simply fixed by setting .host_tagset in >>> scsi host template there. >>> >>> Thanks, >>> John >>> -- >>> Don: Even though this works for current kernels, what would chances >>> of this getting back-ported to 5.9 or even further? >>> >>> Otherwise the original patch smartpqi_fix_host_qdepth_limit would >>> correct this issue for older kernels. >> >> True. However this is 5.12 material, so we shouldn't be bothered by >> that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure >> whether smartpqi_fix_host_qdepth_limit would be the solution. >> You could simply divide can_queue by nr_hw_queues, as suggested >> before, or even simpler, set nr_hw_queues = 1. >> >> How much performance would that cost you? >> >> Distribution kernels would be yet another issue, distros can backport >> host_tagset and get rid of the issue. > > Aren't they (distros) the only issue? As I mentioned above, for 5.10 > mainline stable, I think it's reasonable to backport a patch to set > .host_tagset for the driver. Indeed. As per the Linux kernel Web site [1]: 5.5 and 5.9 are not maintained anymore (EOL) upstream. So, if distributions decided to go with another Linux kernel release, it’s their job to backport things. If the commit message is well written, and contains the Fixes tag, their tooling should be able to pick it up. Kind regards, Paul [1]: https://www.kernel.org/ I remove patch smartpqi-fix_host_qdepth_limit and replaced it with a patch that sets host_tagset = 1 in function pqi_register_scsi Thanks to all for your hard work. Don
-----Original Message----- From: Martin Wilck [mailto:mwilck@suse.com] Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early > > > Confirmed my suspicions - it looks like the host is sent more commands > than it can handle. We would need many disks to see this issue though, > which you have. > > So for stable kernels, 6eb045e092ef is not in 5.4 . Next is 5.10, and > I suppose it could be simply fixed by setting .host_tagset in scsi > host template there. > > Thanks, > John > -- > Don: Even though this works for current kernels, what would chances of > this getting back-ported to 5.9 or even further? > > Otherwise the original patch smartpqi_fix_host_qdepth_limit would > correct this issue for older kernels. True. However this is 5.12 material, so we shouldn't be bothered by that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure whether smartpqi_fix_host_qdepth_limit would be the solution. You could simply divide can_queue by nr_hw_queues, as suggested before, or even simpler, set nr_hw_queues = 1. How much performance would that cost you? Don: For my HBA disk tests... Dividing can_queue / nr_hw_queues is about a 40% drop. ~380K - 400K IOPS Setting nr_hw_queues = 1 results in a 1.5 X gain in performance. ~980K IOPS Setting host_tagset = 1 ~640K IOPS So, it seem that setting nr_hw_queues = 1 results in the best performance. Is this expected? Would this also be true for the future? Thanks, Don Brace Below is my setup. --- [3:0:0:0] disk HP EG0900FBLSK HPD7 /dev/sdd [3:0:1:0] disk HP EG0900FBLSK HPD7 /dev/sde [3:0:2:0] disk HP EG0900FBLSK HPD7 /dev/sdf [3:0:3:0] disk HP EH0300FBQDD HPD5 /dev/sdg [3:0:4:0] disk HP EG0900FDJYR HPD4 /dev/sdh [3:0:5:0] disk HP EG0300FCVBF HPD9 /dev/sdi [3:0:6:0] disk HP EG0900FBLSK HPD7 /dev/sdj [3:0:7:0] disk HP EG0900FBLSK HPD7 /dev/sdk [3:0:8:0] disk HP EG0900FBLSK HPD7 /dev/sdl [3:0:9:0] disk HP MO0200FBRWB HPD9 /dev/sdm [3:0:10:0] disk HP MM0500FBFVQ HPD8 /dev/sdn [3:0:11:0] disk ATA MM0500GBKAK HPGC /dev/sdo [3:0:12:0] disk HP EG0900FBVFQ HPDC /dev/sdp [3:0:13:0] disk HP VO006400JWZJT HP00 /dev/sdq [3:0:14:0] disk HP VO015360JWZJN HP00 /dev/sdr [3:0:15:0] enclosu HP D3700 5.04 - [3:0:16:0] enclosu HP D3700 5.04 - [3:0:17:0] enclosu HPE Smart Adapter 3.00 - [3:1:0:0] disk HPE LOGICAL VOLUME 3.00 /dev/sds [3:2:0:0] storage HPE P408e-p SR Gen10 3.00 - ----- [global] ioengine=libaio ; rw=randwrite ; percentage_random=40 rw=write size=100g bs=4k direct=1 ramp_time=15 ; filename=/mnt/fio_test ; cpus_allowed=0-27 iodepth=4096 [/dev/sdd] [/dev/sde] [/dev/sdf] [/dev/sdg] [/dev/sdh] [/dev/sdi] [/dev/sdj] [/dev/sdk] [/dev/sdl] [/dev/sdm] [/dev/sdn] [/dev/sdo] [/dev/sdp] [/dev/sdq] [/dev/sdr] Distribution kernels would be yet another issue, distros can backport host_tagset and get rid of the issue. Regards Martin
On 03/02/2021 15:56, Don.Brace@microchip.com wrote: > True. However this is 5.12 material, so we shouldn't be bothered by that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure whether smartpqi_fix_host_qdepth_limit would be the solution. > You could simply divide can_queue by nr_hw_queues, as suggested before, or even simpler, set nr_hw_queues = 1. > > How much performance would that cost you? > > Don: For my HBA disk tests... > > Dividing can_queue / nr_hw_queues is about a 40% drop. > ~380K - 400K IOPS > Setting nr_hw_queues = 1 results in a 1.5 X gain in performance. > ~980K IOPS So do you just set shost.nr_hw_queues = 1, yet leave the rest of the driver as is? Please note that when changing from nr_hw_queues many -> 1, then the default IO scheduler changes from none -> mq-deadline, but I would hope that would not make such a big difference. > Setting host_tagset = 1 For this, v5.11-rc6 has a fix which may affect you (2569063c7140), so please include it > ~640K IOPS > > So, it seem that setting nr_hw_queues = 1 results in the best performance. > > Is this expected? Would this also be true for the future? Not expected by me > > Thanks, > Don Brace > > Below is my setup. > --- > [3:0:0:0] disk HP EG0900FBLSK HPD7 /dev/sdd > [3:0:1:0] disk HP EG0900FBLSK HPD7 /dev/sde > [3:0:2:0] disk HP EG0900FBLSK HPD7 /dev/sdf > [3:0:3:0] disk HP EH0300FBQDD HPD5 /dev/sdg > [3:0:4:0] disk HP EG0900FDJYR HPD4 /dev/sdh > [3:0:5:0] disk HP EG0300FCVBF HPD9 /dev/sdi > [3:0:6:0] disk HP EG0900FBLSK HPD7 /dev/sdj > [3:0:7:0] disk HP EG0900FBLSK HPD7 /dev/sdk > [3:0:8:0] disk HP EG0900FBLSK HPD7 /dev/sdl > [3:0:9:0] disk HP MO0200FBRWB HPD9 /dev/sdm > [3:0:10:0] disk HP MM0500FBFVQ HPD8 /dev/sdn > [3:0:11:0] disk ATA MM0500GBKAK HPGC /dev/sdo > [3:0:12:0] disk HP EG0900FBVFQ HPDC /dev/sdp > [3:0:13:0] disk HP VO006400JWZJT HP00 /dev/sdq > [3:0:14:0] disk HP VO015360JWZJN HP00 /dev/sdr > [3:0:15:0] enclosu HP D3700 5.04 - > [3:0:16:0] enclosu HP D3700 5.04 - > [3:0:17:0] enclosu HPE Smart Adapter 3.00 - > [3:1:0:0] disk HPE LOGICAL VOLUME 3.00 /dev/sds > [3:2:0:0] storage HPE P408e-p SR Gen10 3.00 - > ----- > [global] > ioengine=libaio > ; rw=randwrite > ; percentage_random=40 > rw=write > size=100g > bs=4k > direct=1 > ramp_time=15 > ; filename=/mnt/fio_test > ; cpus_allowed=0-27 > iodepth=4096 I normally use iodepth circa 40 to 128, but then I normally just do rw=read for performance testing > > [/dev/sdd] > [/dev/sde] > [/dev/sdf] > [/dev/sdg] > [/dev/sdh] > [/dev/sdi] > [/dev/sdj] > [/dev/sdk] > [/dev/sdl] > [/dev/sdm] > [/dev/sdn] > [/dev/sdo] > [/dev/sdp] > [/dev/sdq] > [/dev/sdr]
-----Original Message----- From: John Garry [mailto:john.garry@huawei.com] Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe On 03/02/2021 15:56, Don.Brace@microchip.com wrote: > True. However this is 5.12 material, so we shouldn't be bothered by that here. For 5.5 up to 5.9, you need a workaround. But I'm unsure whether smartpqi_fix_host_qdepth_limit would be the solution. > You could simply divide can_queue by nr_hw_queues, as suggested before, or even simpler, set nr_hw_queues = 1. > > How much performance would that cost you? > > Don: For my HBA disk tests... > > Dividing can_queue / nr_hw_queues is about a 40% drop. > ~380K - 400K IOPS > Setting nr_hw_queues = 1 results in a 1.5 X gain in performance. > ~980K IOPS So do you just set shost.nr_hw_queues = 1, yet leave the rest of the driver as is? Please note that when changing from nr_hw_queues many -> 1, then the default IO scheduler changes from none -> mq-deadline, but I would hope that would not make such a big difference. > Setting host_tagset = 1 For this, v5.11-rc6 has a fix which may affect you (2569063c7140), so please include it > ~640K IOPS > > So, it seem that setting nr_hw_queues = 1 results in the best performance. > > Is this expected? Would this also be true for the future? Not expected by me Don: Ok, setting both host_tagset = 1 and nr_hw_queues = 1 yields the same better performance, about 940K IOPS. Thanks, Don Brace > > Thanks, > Don Brace > > Below is my setup. > --- > [3:0:0:0] disk HP EG0900FBLSK HPD7 /dev/sdd > [3:0:1:0] disk HP EG0900FBLSK HPD7 /dev/sde > [3:0:2:0] disk HP EG0900FBLSK HPD7 /dev/sdf > [3:0:3:0] disk HP EH0300FBQDD HPD5 /dev/sdg > [3:0:4:0] disk HP EG0900FDJYR HPD4 /dev/sdh > [3:0:5:0] disk HP EG0300FCVBF HPD9 /dev/sdi > [3:0:6:0] disk HP EG0900FBLSK HPD7 /dev/sdj > [3:0:7:0] disk HP EG0900FBLSK HPD7 /dev/sdk > [3:0:8:0] disk HP EG0900FBLSK HPD7 /dev/sdl > [3:0:9:0] disk HP MO0200FBRWB HPD9 /dev/sdm > [3:0:10:0] disk HP MM0500FBFVQ HPD8 /dev/sdn > [3:0:11:0] disk ATA MM0500GBKAK HPGC /dev/sdo > [3:0:12:0] disk HP EG0900FBVFQ HPDC /dev/sdp > [3:0:13:0] disk HP VO006400JWZJT HP00 /dev/sdq > [3:0:14:0] disk HP VO015360JWZJN HP00 /dev/sdr > [3:0:15:0] enclosu HP D3700 5.04 - > [3:0:16:0] enclosu HP D3700 5.04 - > [3:0:17:0] enclosu HPE Smart Adapter 3.00 - > [3:1:0:0] disk HPE LOGICAL VOLUME 3.00 /dev/sds > [3:2:0:0] storage HPE P408e-p SR Gen10 3.00 - > ----- > [global] > ioengine=libaio > ; rw=randwrite > ; percentage_random=40 > rw=write > size=100g > bs=4k > direct=1 > ramp_time=15 > ; filename=/mnt/fio_test > ; cpus_allowed=0-27 > iodepth=4096 I normally use iodepth circa 40 to 128, but then I normally just do rw=read for performance testing > > [/dev/sdd] > [/dev/sde] > [/dev/sdf] > [/dev/sdg] > [/dev/sdh] > [/dev/sdi] > [/dev/sdj] > [/dev/sdk] > [/dev/sdl] > [/dev/sdm] > [/dev/sdn] > [/dev/sdo] > [/dev/sdp] > [/dev/sdq] > [/dev/sdr]
FYI we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1) (so presumably this happens in RHEL 8 too.)

Controller is MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives configured as five x twelve-drive raid-6, software striped using md, and formatted with xfs.

Test software writes to the array using multiple threads in parallel.

The smartpqi driver would report controller offline within ten minutes or so, with status code 0x6100c.

Changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days), which completed fine, so it looks like that one-line change fixes it.

Would, of course, be helpful if this was back-ported.

—
Roger


> On 3 Feb 2021, at 15:56, Don.Brace@microchip.com wrote:
>
> -----Original Message-----
> From: Martin Wilck [mailto:mwilck@suse.com]
> Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early
>
>>
>> Confirmed my suspicions - it looks like the host is sent more commands
>> than it can handle. We would need many disks to see this issue though,
>> which you have.
>>
>> So for stable kernels, 6eb045e092ef is not in 5.4. Next is 5.10, and
>> I suppose it could be simply fixed by setting .host_tagset in the scsi
>> host template there.
>>
>> Thanks,
>> John
>> --
>> Don: Even though this works for current kernels, what would the chances be
>> of this getting back-ported to 5.9 or even further?
>>
>> Otherwise the original patch smartpqi_fix_host_qdepth_limit would
>> correct this issue for older kernels.
>
> True. However this is 5.12 material, so we shouldn't be bothered by that here.
> For 5.5 up to 5.9, you need a workaround. But I'm unsure whether
> smartpqi_fix_host_qdepth_limit would be the solution.
> You could simply divide can_queue by nr_hw_queues, as suggested before, or even
> simpler, set nr_hw_queues = 1.
>
> How much performance would that cost you?
>
> Don: For my HBA disk tests...
>
> Dividing can_queue / nr_hw_queues is about a 40% drop.
> ~380K - 400K IOPS
> Setting nr_hw_queues = 1 results in a 1.5 X gain in performance.
> ~980K IOPS
> Setting host_tagset = 1
> ~640K IOPS
>
> So, it seems that setting nr_hw_queues = 1 results in the best performance.
>
> Is this expected? Would this also be true for the future?
>
> Thanks,
> Don Brace
>
> Below is my setup.
> ---
> [3:0:0:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sdd
> [3:0:1:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sde
> [3:0:2:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sdf
> [3:0:3:0]    disk    HP       EH0300FBQDD      HPD5  /dev/sdg
> [3:0:4:0]    disk    HP       EG0900FDJYR      HPD4  /dev/sdh
> [3:0:5:0]    disk    HP       EG0300FCVBF      HPD9  /dev/sdi
> [3:0:6:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sdj
> [3:0:7:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sdk
> [3:0:8:0]    disk    HP       EG0900FBLSK      HPD7  /dev/sdl
> [3:0:9:0]    disk    HP       MO0200FBRWB      HPD9  /dev/sdm
> [3:0:10:0]   disk    HP       MM0500FBFVQ      HPD8  /dev/sdn
> [3:0:11:0]   disk    ATA      MM0500GBKAK      HPGC  /dev/sdo
> [3:0:12:0]   disk    HP       EG0900FBVFQ      HPDC  /dev/sdp
> [3:0:13:0]   disk    HP       VO006400JWZJT    HP00  /dev/sdq
> [3:0:14:0]   disk    HP       VO015360JWZJN    HP00  /dev/sdr
> [3:0:15:0]   enclosu HP       D3700            5.04  -
> [3:0:16:0]   enclosu HP       D3700            5.04  -
> [3:0:17:0]   enclosu HPE      Smart Adapter    3.00  -
> [3:1:0:0]    disk    HPE      LOGICAL VOLUME   3.00  /dev/sds
> [3:2:0:0]    storage HPE      P408e-p SR Gen10 3.00  -
> -----
> [global]
> ioengine=libaio
> ; rw=randwrite
> ; percentage_random=40
> rw=write
> size=100g
> bs=4k
> direct=1
> ramp_time=15
> ; filename=/mnt/fio_test
> ; cpus_allowed=0-27
> iodepth=4096
>
> [/dev/sdd]
> [/dev/sde]
> [/dev/sdf]
> [/dev/sdg]
> [/dev/sdh]
> [/dev/sdi]
> [/dev/sdj]
> [/dev/sdk]
> [/dev/sdl]
> [/dev/sdm]
> [/dev/sdn]
> [/dev/sdo]
> [/dev/sdp]
> [/dev/sdq]
> [/dev/sdr]
>
> Distribution kernels would be yet another issue, distros can backport
> host_tagset and get rid of the issue.
>
> Regards
> Martin
On 22/02/2021 14:23, Roger Willcocks wrote:
> FYI we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1) (so presumably this happens in RHEL 8 too.)
>
> Controller is MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives configured as five x twelve-drive raid-6, software striped using md, and formatted with xfs.
>
> Test software writes to the array using multiple threads in parallel.
>
> The smartpqi driver would report controller offline within ten minutes or so, with status code 0x6100c.
>
> Changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days), which completed fine, so it looks like that one-line change fixes it.
>

That just makes the driver single-queue.

As such, since the driver uses blk_mq_unique_tag_to_hwq(), only hw queue #0 will ever be used in the driver.

And then, since the driver still spreads MSI-X interrupt vectors over all CPUs [from pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)], if the CPUs associated with HW queue #0 are offlined (probably just cpu0), there are no CPUs available to service queue #0's interrupt. That's what I think would happen, from a quick glance at the code.

> Would, of course, be helpful if this was back-ported.
>
> —
> Roger
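[For readers following along, the mechanics John refers to look roughly like the sketch below. It is a generic illustration of the blk-mq and PCI helpers involved, not the smartpqi submission path; the example_ functions are made up.]

#include <linux/blk-mq.h>
#include <linux/pci.h>

/*
 * The hw queue a request was allocated on is encoded in its unique tag,
 * so a driver that routes commands via blk_mq_unique_tag_to_hwq() only
 * ever sees queue index 0 once the host is registered with
 * nr_hw_queues = 1.
 */
static u16 example_hwq_for_request(struct request *rq)
{
	u32 unique_tag = blk_mq_unique_tag(rq);

	return blk_mq_unique_tag_to_hwq(unique_tag);
}

/*
 * Managed MSI-X affinity: each vector is bound to a fixed set of CPUs
 * and is shut down by the irq core when all CPUs in that set go
 * offline, which is the queue #0 scenario described above.
 */
static int example_alloc_vectors(struct pci_dev *pdev, unsigned int max_vecs)
{
	return pci_alloc_irq_vectors(pdev, 1, max_vecs,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}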
> On 23 Feb 2021, at 08:57, John Garry <john.garry@huawei.com> wrote:
>
> On 22/02/2021 14:23, Roger Willcocks wrote:
>> FYI we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1) (so presumably this happens in RHEL 8 too.)
>>
>> Controller is MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives configured as five x twelve-drive raid-6, software striped using md, and formatted with xfs.
>>
>> Test software writes to the array using multiple threads in parallel.
>>
>> The smartpqi driver would report controller offline within ten minutes or so, with status code 0x6100c.
>>
>> Changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days), which completed fine, so it looks like that one-line change fixes it.
>
> That just makes the driver single-queue.
>

All I can say is it fixes the problem. Write performance is two or three percent faster than CentOS 6.5 on the same hardware.

> As such, since the driver uses blk_mq_unique_tag_to_hwq(), only hw queue #0 will ever be used in the driver.
>
> And then, since the driver still spreads MSI-X interrupt vectors over all CPUs [from pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)], if the CPUs associated with HW queue #0 are offlined (probably just cpu0), there are no CPUs available to service queue #0's interrupt. That's what I think would happen, from a quick glance at the code.
>

Surely that would be an issue even if it used multiple queues (one of which would be queue #0)?

>
>> Would, of course, be helpful if this was back-ported.
>> —
>> Roger
On 23/02/2021 14:06, Roger Willcocks wrote:
>
>> On 23 Feb 2021, at 08:57, John Garry <john.garry@huawei.com> wrote:
>>
>> On 22/02/2021 14:23, Roger Willcocks wrote:
>>> FYI we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1) (so presumably this happens in RHEL 8 too.)
>>>
>>> Controller is MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives configured as five x twelve-drive raid-6, software striped using md, and formatted with xfs.
>>>
>>> Test software writes to the array using multiple threads in parallel.
>>>
>>> The smartpqi driver would report controller offline within ten minutes or so, with status code 0x6100c.
>>>
>>> Changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days), which completed fine, so it looks like that one-line change fixes it.
>>
>> That just makes the driver single-queue.
>>
>
> All I can say is it fixes the problem. Write performance is two or three percent faster than CentOS 6.5 on the same hardware.
>
>> As such, since the driver uses blk_mq_unique_tag_to_hwq(), only hw queue #0 will ever be used in the driver.
>>
>> And then, since the driver still spreads MSI-X interrupt vectors over all CPUs [from pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)], if the CPUs associated with HW queue #0 are offlined (probably just cpu0), there are no CPUs available to service queue #0's interrupt. That's what I think would happen, from a quick glance at the code.
>>
>
> Surely that would be an issue even if it used multiple queues (one of which would be queue #0)?
>

Well, no. Because there is currently a symmetry between the HW queue contexts in the block layer and the HW queues in the LLDD.

So if hwq0 were mapped to cpu0 only, and cpu0 went offline, the block layer would not send commands on hwq0.

By setting nr_hw_queues = 1, that symmetry breaks: every CPU tries to send on hwq0, but the irq core code will disable hwq0's interrupt when cpu0 is offline - that's because it is a managed interrupt.

That's how it looks to me - I did not check the LLDD code too closely. Please discuss it with Don.

Thanks,
John

>>
>>> Would, of course, be helpful if this was back-ported.
>>> —
>>> Roger
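[The symmetry John describes usually comes from the driver's .map_queues callback, which derives the block layer's CPU-to-hctx map from the same PCI irq affinity used for the MSI-X vectors. The sketch below shows that pattern as it looked in the kernels discussed here; example_get_pci_dev() is a placeholder for however a driver reaches its struct pci_dev, and this is not the literal smartpqi code.]

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <scsi/scsi_host.h>

/* Placeholder: return the controller's struct pci_dev from driver data. */
static struct pci_dev *example_get_pci_dev(struct Scsi_Host *shost);

/*
 * With this mapping in place, a hardware queue only receives I/O from
 * CPUs that its managed interrupt is affine to, so offlining those CPUs
 * also stops new I/O from being sent to that queue.
 */
static int example_map_queues(struct Scsi_Host *shost)
{
	struct blk_mq_queue_map *qmap =
		&shost->tag_set.map[HCTX_TYPE_DEFAULT];

	return blk_mq_pci_map_queues(qmap, example_get_pci_dev(shost), 0);
}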
Dear Roger,

On 22.02.21 at 15:23, Roger Willcocks wrote:

> FYI we have exactly this issue on a machine here running CentOS 8.3
> (kernel 4.18.0-240.1.1) (so presumably this happens in RHEL 8 too.)

What driver version do you use?

> Controller is MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives
> configured as five x twelve-drive raid-6, software striped using md,
> and formatted with xfs.
>
> Test software writes to the array using multiple threads in
> parallel.
>
> The smartpqi driver would report controller offline within ten
> minutes or so, with status code 0x6100c
>
> Changed the driver to set 'nr_hw_queues = 1' and then tested by
> filling the array with random files (which took a couple of days),
> which completed fine, so it looks like that one-line change fixes
> it.
>
> Would, of course, be helpful if this was back-ported.

We only noticed the issue starting with Linux 5.5 (commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq")), so I am curious how this problem can show up in CentOS 8.3 (Linux kernel 4.18.x).

> —
> Roger

Apple Mail mangled your signature delimiter (-- ).

Kind regards,

Paul
On 22.01.21 00:32, Martin Wilck wrote:
> On Thu, 2021-01-21 at 13:05 +0000, John Garry wrote:
>>
>> Confirmed my suspicions - it looks like the host is sent more commands
>> than it can handle. We would need many disks to see this issue though,
>> which you have.
>>
>> So for stable kernels, 6eb045e092ef is not in 5.4. Next is 5.10, and I
>> suppose it could be simply fixed by setting .host_tagset in the scsi host
>> template there.
>
> If it's really just that, it should be easy enough to verify.
>
> @Donald, you'd need to test with a 5.10 kernel, and after reproducing
> the issue, add
>
>     .host_tagset = 1,
>
> to the definition of pqi_driver_template in
> drivers/scsi/smartpqi/smartpqi_init.c.
>
> You don't need a patch to test that, I believe. Would you be able to do
> this test?

Sorry, I had overlooked this request. I reviewed this thread now, because I want to switch our production systems to 5.10 LTS.

I could reproduce the problem with Linux 5.10.22. When setting `host_tagset = 1`, the problem disappeared.

Additionally, we have 5.10.22 with the patch running on two previously affected production systems for over 24 hours now. Statistics suggest that these systems would very likely have triggered the problem in that time frame if the patch didn't work.

So I think this is a working fix which should go to 5.10 stable.

Best
Donald

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 9d0229656681f..be429a7cb1512 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -6571,6 +6571,7 @@ static struct scsi_host_template pqi_driver_template = {
 	.map_queues = pqi_map_queues,
 	.sdev_attrs = pqi_sdev_attrs,
 	.shost_attrs = pqi_shost_attrs,
+	.host_tagset = 1,
 };
 
 static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 2f162603876f..1c452a1c18fd 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data,
 	int *count = data;
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
 
+	/* This pairs with set_bit() in scsi_host_queue_ready() */
+	smp_mb__before_atomic();
 	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state))
 		(*count)++;
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b3f14f05340a..0a9a36c349ee 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1353,8 +1353,12 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 	if (scsi_host_in_recovery(shost))
 		return 0;
 
+	set_bit(SCMD_STATE_INFLIGHT, &cmd->state);
+	/* This pairs with test_bit() in scsi_host_check_in_flight() */
+	smp_mb__after_atomic();
+
 	if (atomic_read(&shost->host_blocked) > 0) {
-		if (scsi_host_busy(shost) > 0)
+		if (scsi_host_busy(shost) > 1)
 			goto starved;
 
 		/*
@@ -1379,8 +1383,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 		spin_unlock_irq(shost->host_lock);
 	}
 
-	__set_bit(SCMD_STATE_INFLIGHT, &cmd->state);
-
 	return 1;
 
 starved: