[v1,0/7] Irq poll to address cpu lockup.

Message ID 1550216430-36612-1-git-send-email-suganath-prabu.subramani@broadcom.com

Message

Suganath Prabu S Feb. 15, 2019, 7:40 a.m. UTC
We have seen CPU lockup issues in the field when a system has a
large logical CPU count (more than 96). SAS3.0 controllers
(Invader series) support at most 96 MSI-X vectors and SAS3.5
controllers (Ventura series) support at most 128 MSI-X vectors.

This may be a generic issue for any PCI device that supports
completion on multiple reply queues. Let me explain it with
respect to the hardware supported by mpt3sas, just to simplify
the problem and the possible changes to handle it. An IT HBA
(mpt3sas) supports multiple reply queues in the completion path.
The driver creates MSI-X vectors for the controller as
min(FW supported reply queues, logical CPUs). If the submitter
is not interrupted via a completion on the same CPU, a loop can
form in the IO path. This behavior can cause hard/soft CPU
lockups, IO timeouts, system sluggishness, etc.
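
As a rough sketch (with assumed function and parameter names, not
the actual mpt3sas code), the vector sizing rule above could look
like this:

#include <linux/kernel.h>
#include <linux/cpumask.h>
#include <linux/pci.h>

/*
 * Illustrative only: cap the number of reply queues / MSI-X
 * vectors at min(FW supported reply queues, online logical CPUs)
 * and let the PCI core spread the vectors across the CPUs.
 * max_fw_reply_queues is an assumed parameter, not a real
 * mpt3sas field.
 */
static int example_alloc_reply_queue_vectors(struct pci_dev *pdev,
					     int max_fw_reply_queues)
{
	int nr_vectors = min_t(int, max_fw_reply_queues,
			       num_online_cpus());

	return pci_alloc_irq_vectors(pdev, 1, nr_vectors,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}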

Example - one CPU (e.g. CPU A) is busy submitting IOs and another
CPU (e.g. CPU B) is busy processing the corresponding reply
descriptors from the reply descriptor queue upon receiving
interrupts from the HBA. If CPU A continuously pumps IOs, then
CPU B (which is executing the ISR) will always see valid reply
descriptors in the reply descriptor queue and will keep
processing them in a loop without ever leaving the ISR handler.

The mpt3sas driver exits the ISR handler only when it finds an
unused reply descriptor in the reply descriptor queue. Since
CPU A keeps sending IOs, CPU B may always see a valid reply
descriptor (posted by HBA firmware after processing an IO) in
the reply descriptor queue. In the worst case the driver never
quits this loop in the ISR handler, and eventually a CPU lockup
is detected by the watchdog.
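
A simplified sketch of that loop (assumed helper names and a
simplified descriptor format, not the actual mpt3sas ISR):

#include <linux/interrupt.h>
#include <linux/types.h>

/* Simplified reply queue state, for illustration only. */
struct example_reply_queue {
	u64 *reply_post;	/* reply descriptor ring (host memory) */
	u32 index;		/* next descriptor to examine */
	u32 queue_depth;
};

static bool reply_descriptor_is_valid(struct example_reply_queue *q)
{
	/* Assume an all-ones descriptor marks an unused slot. */
	return q->reply_post[q->index] != ~0ULL;
}

static void process_one_reply(struct example_reply_queue *q)
{
	/* Complete the IO for this descriptor (elided), mark the
	 * slot unused and advance to the next one. */
	q->reply_post[q->index] = ~0ULL;
	q->index = (q->index + 1) % q->queue_depth;
}

static irqreturn_t example_isr(int irq, void *data)
{
	struct example_reply_queue *q = data;

	/*
	 * Keep consuming descriptors until an unused one is found.
	 * If other CPUs keep refilling the queue, this loop never
	 * exits and the CPU running the ISR eventually locks up.
	 */
	while (reply_descriptor_is_valid(q))
		process_one_reply(q);

	return IRQ_HANDLED;
}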

The behavior described above is not common if "rq_affinity" is
set to 2, or if the affinity_hint is honored by irqbalance with
the "exact" policy.
If rq_affinity is set to 2, the submitter is always interrupted
via a completion on the same CPU.
If irqbalance uses the "exact" policy, the interrupt is delivered
to the submitting CPU.

Problem statement -
If the ratio of CPU count to MSI-X vector (reply descriptor
queue) count is not 1:1, we are still exposed to the issue
explained above, and so far we have had no solution for it.

Soft/hard lockups can occur whenever the CPU count is larger
than the number of MSI-X vectors supported by the device.

If the CPU count to MSI-X vector count ratio is not 1:1 (in
other words, if the ratio is X:1 with X > 1), then neither the
'exact' irqbalance policy nor rq_affinity = 2 helps to avoid CPU
hard/soft lockups. There is no longer a one-to-one mapping
between CPUs and MSI-X vectors; instead one MSI-X interrupt (or
reply descriptor queue) is shared by a group of CPUs, a loop can
form in the IO path within that CPU group, and lockups may be
observed.

For example: consider a system with two NUMA nodes, each node
having four logical CPUs, and assume that the number of MSI-X
vectors enabled on the HBA is two, so the CPU count to MSI-X
vector count ratio is 4:1.
e.g.
MSI-X vector 0 is affine to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA
node 0, and MSI-X vector 1 is affine to CPU 4, CPU 5, CPU 6 &
CPU 7 of NUMA node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3                 --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7                 --> MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB
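
As a side note, a small sketch (assumed names, not mpt3sas code)
of how a driver could report which CPUs each vector ended up
bound to when PCI_IRQ_AFFINITY was used; the output would mirror
the numactl layout above:

#include <linux/cpumask.h>
#include <linux/pci.h>
#include <linux/printk.h>

/* Illustrative helper: print the CPU affinity of each MSI-X vector. */
static void example_dump_vector_affinity(struct pci_dev *pdev, int nvec)
{
	int i;

	for (i = 0; i < nvec; i++) {
		const struct cpumask *mask = pci_irq_get_affinity(pdev, i);

		if (mask)
			pr_info("msix %d -> CPUs %*pbl\n", i,
				cpumask_pr_args(mask));
	}
}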

Assume the user starts an application that uses all the CPUs of
NUMA node 0 for issuing IOs. Only one CPU from the affinity list
(it can be any CPU, since this depends on irqbalance), say CPU 0,
will receive the interrupts from MSI-X vector 0 for all the IOs.
Over time, CPU 0's IO submission share will decrease and its ISR
processing share will increase, as it gets more and more busy
processing interrupts.
Gradually the IO submission percentage on CPU 0 will drop to zero
and its ISR processing percentage will reach 100, because an IO
loop has formed within NUMA node 0: CPU 1, CPU 2 & CPU 3 are
continuously busy submitting heavy IOs, while only CPU 0 is busy
in the ISR path because it always finds a valid reply descriptor
in the reply descriptor queue. Eventually we observe a hard
lockup here.

The likelihood of hard/soft lockups is directly proportional to
the value of X: the higher X is, the higher the chance of
observing CPU lockups.

Solution -

Fix-1
=====
Use the IRQ poll interface defined in irq_poll.c. The mpt3sas
driver will execute the ISR routine in softirq context and will
always quit the loop based on the budget provided by the IRQ
poll interface.

In these scenarios (i.e. where the CPU count to MSI-X vector
count ratio is X:1, with X > 1), the IRQ poll interface avoids
CPU hard lockups because the driver voluntarily exits the reply
queue processing once the budget is consumed.
Note - only one MSI-X vector is busy doing the processing.

Irqstat output -

IRQs / 1 second(s)
IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
  44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
  45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
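
A hedged sketch of wiring up the irq_poll interface from
irq_poll.c for Fix-1 (structure and names are illustrative, not
the exact patch): the hard IRQ handler bounds its own work, and
once a threshold is crossed it hands the remaining work to
softirq-context polling, which yields after each budget.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

#define EXAMPLE_IRQ_POLL_WEIGHT	4		/* budget per poll call */
#define EXAMPLE_ISR_THRESHOLD	(1 << 14)	/* hard-IRQ work limit */

struct example_reply_queue {
	struct irq_poll irqpoll;
};

/* Process up to 'limit' reply descriptors and return how many were
 * completed; the actual descriptor walk is elided in this sketch. */
static int process_replies(struct example_reply_queue *q, int limit)
{
	return 0;
}

static int example_irqpoll_fn(struct irq_poll *iop, int budget)
{
	struct example_reply_queue *q =
		container_of(iop, struct example_reply_queue, irqpoll);
	int done = process_replies(q, budget);

	if (done < budget)
		irq_poll_complete(iop);	/* queue drained, stop polling */

	return done;
}

static irqreturn_t example_isr(int irq, void *data)
{
	struct example_reply_queue *q = data;

	/* Bound the work done in hard-IRQ context; if the threshold is
	 * hit, continue processing from softirq via irq_poll. */
	if (process_replies(q, EXAMPLE_ISR_THRESHOLD) >= EXAMPLE_ISR_THRESHOLD)
		irq_poll_sched(&q->irqpoll);

	return IRQ_HANDLED;
}

/* At setup time (once per reply queue):
 *	irq_poll_init(&q->irqpoll, EXAMPLE_IRQ_POLL_WEIGHT,
 *		      example_irqpoll_fn);
 */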

Fix-2
=====
The driver should round-robin across the reply queues so that
each reply queue is load balanced, i.e. IOs are distributed
equally among all the available reply descriptor post queues.
With this, the load on each reply descriptor post queue is
balanced. This improves performance and also fixes the soft
lockups.

Irqstat output after the driver load balances the reply queues -

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
  44  62871  62871       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
  45  62718  62718       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1
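
A minimal sketch of the round-robin selection (names assumed,
not the exact patch): each new request derives its reply
descriptor post queue from a running counter, so completions are
spread across all enabled MSI-X vectors.

#include <linux/atomic.h>
#include <linux/types.h>

struct example_ioc {
	atomic_t io_cnt;	/* running count of submitted IOs */
	u8 reply_queue_count;	/* number of enabled MSI-X vectors */
};

/* Pick the MSI-X index (reply descriptor post queue) for the next IO. */
static u8 example_get_msix_index(struct example_ioc *ioc)
{
	u32 seq = (u32)atomic_inc_return(&ioc->io_cnt);

	return seq % ioc->reply_queue_count;
}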

In summary,
a CPU that only completes IOs without contributing to IO
submission may cause a CPU lockup. If the CPU count to MSI-X
vector count ratio is X:1 (where X > 1), then by using the irq
poll interface we can avoid the CPU lockups, and by distributing
the interrupts equally among the enabled MSI-X vectors we can
avoid the performance issues.

Patches 3 & 4 implement Fix 1 and Fix 2 explained above, and only
take effect if the CPU count is greater than the number of MSI-X
vectors supported by the FW.

V1 changes:
Added patch 3 to select IRQ_POLL (Kconfig).

Suganath Prabu (7):
  mpt3sas: Fix typo in request_desript_type.
  mpt3sas: simplify interrupt handler.
  mpt3sas: Select IRQ_POLL to avoid build error.
  mpt3sas: Irq poll to avoid CPU hard lockups.
  mpt3sas: Load balance to improve performance and avoid soft lockups.
  mpt3sas: Improve the threshold value and introduce module param.
  mpt3sas: Update mpt3sas driver version to 28.100.00.00

 drivers/scsi/mpt3sas/Kconfig        |   1 +
 drivers/scsi/mpt3sas/mpt3sas_base.c | 178 ++++++++++++++++++++++++++++++------
 drivers/scsi/mpt3sas/mpt3sas_base.h |  22 ++++-
 3 files changed, 170 insertions(+), 31 deletions(-)

Comments

Suganath Prabu S March 14, 2019, 6:53 a.m. UTC | #1
Hi Martin,

Any update on these patches?

Thanks,
Suganath


Martin K. Petersen March 18, 2019, 9:17 p.m. UTC | #2
Suganath,

> We have seen CPU lockup issues in the field when a system has a
> large logical CPU count (more than 96). SAS3.0 controllers
> (Invader series) support at most 96 MSI-X vectors and SAS3.5
> controllers (Ventura series) support at most 128 MSI-X vectors.

Applied to 5.2/scsi-queue. Thanks!
Suganath Prabu S March 19, 2019, 3:22 a.m. UTC | #3
Thanks Martin.

-Suganath

On Tue, Mar 19, 2019 at 2:47 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Suganath,
>
> > We have seen CPU lockup issues in the field when a system has a
> > large logical CPU count (more than 96). SAS3.0 controllers
> > (Invader series) support at most 96 MSI-X vectors and SAS3.5
> > controllers (Ventura series) support at most 128 MSI-X vectors.
>
> Applied to 5.2/scsi-queue. Thanks!
>
> --
> Martin K. Petersen      Oracle Linux Engineering