Message ID | 20190131170951.178676-1-haakon.bugge@oracle.com (mailing list archive)
---|---
State | Superseded
Series | mlx4_ib: Increase the timeout for CM cache
Sorry, I posted this as if it were a net patch. It isn't, hence this
resend with the correct recipients.

Thxs, Håkon

> On 31 Jan 2019, at 18:09, Håkon Bugge <haakon.bugge@oracle.com> wrote:
>
> Using CX-3 virtual functions, either from a bare-metal machine or
> pass-through from a VM, MAD packets are proxied through the PF driver.
>
> Since the VMs have separate name spaces for MAD Transaction Ids
> (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
> in a cache.
>
> Following the RDMA CM protocol, it is clear when an entry has to be
> evicted from the cache. But life is not perfect; remote peers may die
> or be rebooted. Hence, there is a timeout for wiping out a cache
> entry, after which the PF driver assumes the remote peer has gone.
>
> We have experienced an excessive amount of DREQ retries during
> fail-over testing, when running with eight VMs per database server.
>
> The problem has been reproduced in a bare-metal system using one VM
> per physical node. In this environment, running 256 processes in each
> VM, each process uses RDMA CM to create an RC QP between itself and
> all (256) remote processes. All in all, 16K QPs.
>
> When tearing down these 16K QPs, excessive DREQ retries (and
> duplicates) are observed. With some cat/paste/awk wizardry on the
> infiniband_cm sysfs, we observe:
>
> cm_rx_duplicates:
>       dreq: 5007
> cm_rx_msgs:
>       drep: 3838
>       dreq: 13018
>       rep: 8128
>       req: 8256
>       rtu: 8256
> cm_tx_msgs:
>       drep: 8011
>       dreq: 68856
>       rep: 8256
>       req: 8128
>       rtu: 8128
> cm_tx_retries:
>       dreq: 60483
>
> Note that the active/passive side is distributed.
>
> Enabling pr_debug in cm.c gives tons of:
>
> [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
> 1,sl_cm_id: 0xd393089f} is NULL!
>
> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
> tear-down phase of the application is reduced from 113 to 67
> seconds. Retries/duplicates are also significantly reduced:
>
> cm_rx_duplicates:
>       dreq: 7726
> []
> cm_tx_retries:
>       drep: 1
>       dreq: 7779
>
> Increasing the timeout further didn't help, as these duplicates and
> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
> on the systems. By increasing the CMA timeout to 22 (~17 seconds),
> the numbers fell to about one hundred for both of them.
>
> Adjustment of the CMA timeout is _not_ part of this commit.
>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
> drivers/infiniband/hw/mlx4/cm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
> index fedaf8260105..8c79a480f2b7 100644
> --- a/drivers/infiniband/hw/mlx4/cm.c
> +++ b/drivers/infiniband/hw/mlx4/cm.c
> @@ -39,7 +39,7 @@
>
>  #include "mlx4_ib.h"
>
> -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ)
> +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ)
>
>  struct id_map_entry {
>  	struct rb_node node;
> --
> 2.20.1
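For context, CM_CLEANUP_CACHE_TIMEOUT feeds a delayed-work reaper: each
cached TID mapping arms a work item that frees the entry if the CM
protocol never evicts it first. Below is a minimal sketch of that
pattern with the struct simplified for illustration; apart from
id_map_ent_timeout, which matches the upstream driver, the names and
fields here are assumptions, not the actual mlx4_ib source.

	#include <linux/workqueue.h>
	#include <linux/slab.h>

	#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ)

	struct id_map_entry {
		struct delayed_work timeout;
		/* ... rb_node, slave id, remapped TID, ... */
	};

	/* Fires when the remote peer is assumed gone: drop the entry. */
	static void id_map_ent_timeout(struct work_struct *work)
	{
		struct id_map_entry *ent =
			container_of(work, struct id_map_entry, timeout.work);

		kfree(ent);
	}

	/* Arm the eviction timer for a freshly cached entry. */
	static void schedule_cleanup(struct id_map_entry *ent)
	{
		INIT_DELAYED_WORK(&ent->timeout, id_map_ent_timeout);
		schedule_delayed_work(&ent->timeout, CM_CLEANUP_CACHE_TIMEOUT);
	}

A peer that dies mid-handshake never sends the message that would evict
its entry, so this timer is the only thing standing between the cache
and unbounded growth; the patch is purely about how long that timer runs.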
On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
> Using CX-3 virtual functions, either from a bare-metal machine or
> pass-through from a VM, MAD packets are proxied through the PF driver.
>
> [...]
>
> Adjustment of the CMA timeout is _not_ part of this commit.
>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
> drivers/infiniband/hw/mlx4/cm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

Jack? What do you think?
> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>> [...]
>>
>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Jack? What do you think?

I am tempted to send a v2 making this a sysctl tunable. This is because
full-rack testing, using 8 servers each with 8 VMs, only showed a 33%
reduction in the occurrences of "mlx4_ib_multiplex_cm_handler:
id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.

But sure, Jack's opinion matters.

Thxs, Håkon
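To make the idea of such a knob concrete, a hypothetical sketch follows,
written as a module parameter rather than a sysctl for brevity. The
parameter name cm_cache_timeout_secs and the helper are invented for
this example and do not correspond to any posted v2.

	#include <linux/module.h>

	static unsigned int cm_cache_timeout_secs = 30;
	module_param(cm_cache_timeout_secs, uint, 0644);
	MODULE_PARM_DESC(cm_cache_timeout_secs,
			 "Seconds before a stale CM cache entry is wiped (default 30)");

	/* Would replace the fixed CM_CLEANUP_CACHE_TIMEOUT macro. */
	static inline unsigned long cm_cleanup_cache_timeout(void)
	{
		return cm_cache_timeout_secs * HZ;
	}

A parameter with mode 0644 shows up writable under
/sys/module/mlx4_ib/parameters/, so it can be adjusted at run time,
which is the property a tunable is after here.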
> On 6 Feb 2019, at 09:50, Håkon Bugge <haakon.bugge@oracle.com> wrote:
>
>> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> [...]
>>
>> Jack? What do you think?
>
> I am tempted to send a v2 making this a sysctl tunable. [...]
>
> But sure, Jack's opinion matters.

Jack,

A major contributor to the long processing time in the PF driver
proxying QP1 packets is:

  create_pv_resources
     -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
                     NULL, ctx, cq_size, 0);

That is, comp_vector is zero.

Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"),
the zero comp_vector has the intent of letting the mlx4_core driver
select the least used vector.

But, in mlx4_ib_create_cq(), we have:

	pr_info("eq_table: %p\n", dev->eq_table);
	if (dev->eq_table) {
		vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector,
							  ibdev->num_comp_vectors)];
	}

	cq->vector = vector;

and dev->eq_table is NULL, so all the CQs for the proxy QPs get
comp_vector zero.

I have to make some reservations, as this analysis is based on uek4.
I think the code here is the same upstream, but I need to double-check.

Thxs, Håkon
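If the goal is simply to stop every tunnel CQ from landing on vector 0,
one illustrative workaround is to choose the vector explicitly instead
of passing 0. This sketch is an assumption, not a posted patch: it uses
the uek4-era ib_create_cq() signature quoted above, and
create_tunnel_cq() plus the round-robin counter are invented names.

	#include <linux/atomic.h>
	#include <rdma/ib_verbs.h>

	static atomic_t tunnel_cq_vector = ATOMIC_INIT(-1);

	static struct ib_cq *create_tunnel_cq(struct mlx4_ib_demux_pv_ctx *ctx,
					      int cq_size)
	{
		int nvec = ctx->ib_dev->num_comp_vectors;
		/* Round-robin over the device's completion vectors. */
		int vec = (unsigned int)atomic_inc_return(&tunnel_cq_vector) % nvec;

		return ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
				    NULL, ctx, cq_size, vec);
	}

Spreading the proxy CQs would spread their interrupt and completion
handling across EQs, instead of serializing all QP1 proxy traffic on a
single vector as described above.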
On Wed, 6 Feb 2019 16:40:14 +0100 Håkon Bugge <haakon.bugge@oracle.com> wrote:

> Jack,
>
> A major contributor to the long processing time in the PF driver
> proxying QP1 packets is:
>
> [...]
>
> and dev->eq_table is NULL, so all the CQs for the proxy QPs get
> comp_vector zero.
>
> [...]

Hi Hakon and Jason,

I was ill today (bad cold, took antihistamines all day, which knocked
me out). I'll get to this tomorrow.

-Jack
> On 6 Feb 2019, at 19:02, jackm <jackm@dev.mellanox.co.il> wrote:
>
> [...]
>
> Hi Hakon and Jason,
>
> I was ill today (bad cold, took antihistamines all day, which knocked
> me out). I'll get to this tomorrow.

No problem, Jack. I actually see that our uek4 is different in
mlx4_ib_alloc_eqs(), and that may well be the root cause here. Hence, I
have moved the MLs to Bcc and will get back to you tomorrow.

Thxs, Håkon
diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index fedaf8260105..8c79a480f2b7 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -39,7 +39,7 @@

 #include "mlx4_ib.h"

-#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ)
+#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ)

 struct id_map_entry {
 	struct rb_node node;
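Both values are jiffy counts: HZ is the kernel's scheduler tick rate in
ticks per second, so (N * HZ) always encodes N seconds regardless of the
configured tick rate. Worked out for the common settings:

	HZ = 100:   5 * HZ =  500 jiffies,   30 * HZ =  3000 jiffies
	HZ = 250:   5 * HZ = 1250 jiffies,   30 * HZ =  7500 jiffies
	HZ = 1000:  5 * HZ = 5000 jiffies,   30 * HZ = 30000 jiffies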
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VMs have separate name spaces for MAD Transaction Ids (TIDs),
the PF driver has to re-map the TIDs and keep the bookkeeping in a
cache.

Following the RDMA CM protocol, it is clear when an entry has to be
evicted from the cache. But life is not perfect; remote peers may die
or be rebooted. Hence, there is a timeout for wiping out a cache entry,
after which the PF driver assumes the remote peer has gone.

We have experienced an excessive amount of DREQ retries during
fail-over testing, when running with eight VMs per database server.

The problem has been reproduced in a bare-metal system using one VM per
physical node. In this environment, running 256 processes in each VM,
each process uses RDMA CM to create an RC QP between itself and all
(256) remote processes. All in all, 16K QPs.

When tearing down these 16K QPs, excessive DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe:

cm_rx_duplicates:
      dreq: 5007
cm_rx_msgs:
      drep: 3838
      dreq: 13018
      rep: 8128
      req: 8256
      rtu: 8256
cm_tx_msgs:
      drep: 8011
      dreq: 68856
      rep: 8256
      req: 8128
      rtu: 8128
cm_tx_retries:
      dreq: 60483

Note that the active/passive side is distributed.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from 113 to 67 seconds.
Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
      dreq: 7726
[]
cm_tx_retries:
      drep: 1
      dreq: 7779

Increasing the timeout further didn't help, as these duplicates and
retries stem from a too short CMA timeout, which was 20 (~4 seconds) on
the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell to about one hundred for both of them.

Adjustment of the CMA timeout is _not_ part of this commit.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
drivers/infiniband/hw/mlx4/cm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
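A reader's note on the CMA timeout arithmetic above (not part of the
patch): the value is an exponent in the InfiniBand CM timeout encoding,
where the actual wait is 4.096 us * 2^timeout. That reproduces the
figures quoted in the commit message:

	timeout = 20:  4.096 us * 2^20 ~= 4.3 s   ("~4 seconds")
	timeout = 22:  4.096 us * 2^22 ~= 17.2 s  ("~17 seconds")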