Message ID | 1462912922.23006.3.camel@ssi (mailing list archive) |
---|---|
State | Rejected |
On Tue, May 10, 2016 at 01:42:02PM -0700, Ming Lin wrote:
> Here is a bug with mlx5_ib.
>
> commit d603c809ef91fa2d211bde5e95be417847410379
> Author: Eli Cohen <eli@mellanox.com>
> Date: Fri Mar 11 22:58:35 2016 +0200
>
>     IB/mlx5: Fix decision on using MAD_IFC
>
>
> This commit causes below WARN. The "ix" returns -1
>
> 658 void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
> ...
>
> 693         /* Coudn't find default GID location */
> 694         WARN_ON(ix < 0);
> 695
>
>
> WARNING: CPU: 1 PID: 2651 at /home/mlin/linux/drivers/infiniband/core/cache.c:717 ib_cache_gid_set_default_gid+0x2f8/0x340 [ib_core]
>

Can you tell if the link layer you're using is Ethernet or IB?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

I can see this is Ethernet. Let me check that.

On Wed, May 11, 2016 at 12:09:04AM +0300, Eli Cohen wrote:
> On Tue, May 10, 2016 at 01:42:02PM -0700, Ming Lin wrote:
> > Here is a bug with mlx5_ib.
> >
> > commit d603c809ef91fa2d211bde5e95be417847410379
> > Author: Eli Cohen <eli@mellanox.com>
> > Date: Fri Mar 11 22:58:35 2016 +0200
> >
> >     IB/mlx5: Fix decision on using MAD_IFC
> >
> >
> > This commit causes below WARN. The "ix" returns -1
> >
> > 658 void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
> > ...
> >
> > 693         /* Coudn't find default GID location */
> > 694         WARN_ON(ix < 0);
> > 695
> >
> >
> > WARNING: CPU: 1 PID: 2651 at /home/mlin/linux/drivers/infiniband/core/cache.c:717 ib_cache_gid_set_default_gid+0x2f8/0x340 [ib_core]
> >
>
> Can you tell if the link layer you're using is Ethernet or IB?

On 05/10/2016 04:42 PM, Ming Lin wrote:
> Here is a bug with mlx5_ib.
>
> commit d603c809ef91fa2d211bde5e95be417847410379
> Author: Eli Cohen <eli@mellanox.com>
> Date: Fri Mar 11 22:58:35 2016 +0200
>
>     IB/mlx5: Fix decision on using MAD_IFC

I ran into this same bug when testing 4.6-rc. I submitted a patch for
4.6-rc that resolves the oops (but leaves the WARN_ON in place). Once I
updated to the latest official mlx5 firmware on the devices, the issue
went away. So, this can probably be mostly ignored since the oops has
been fixed, and I would suggest updating your firmware.

>
> This commit causes below WARN. The "ix" returns -1
>
> 658 void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
> ...
>
> 693         /* Coudn't find default GID location */
> 694         WARN_ON(ix < 0);
> 695
>
>
> WARNING: CPU: 1 PID: 2651 at /home/mlin/linux/drivers/infiniband/core/cache.c:717 ib_cache_gid_set_default_gid+0x2f8/0x340 [ib_core]
>
> [ 394.725187] CPU: 1 PID: 2651 Comm: modprobe Tainted: G OE 4.6.0-rc3+ #195
> [ 394.734464] Hardware name: Dell Inc. OptiPlex 7010/0YXT71, BIOS A15 08/12/2013
> [ 394.743131] 0000000000000000 ffff88006791b848 ffffffff8132996a 0000000000000000
> [ 394.752045] 0000000000000000 ffff88006791b888 ffffffff8106a7c7 000002cd00000008
> [ 394.761426] 0000000000000000 0000000000000001 ffff880063028780 ffff880060d7c000
> [ 394.770370] Call Trace:
> [ 394.774749] [<ffffffff8132996a>] dump_stack+0x63/0x89
> [ 394.781582] [<ffffffff8106a7c7>] __warn+0xc7/0xf0
> [ 394.788325] [<ffffffff8106a8a8>] warn_slowpath_null+0x18/0x20
> [ 394.795732] [<ffffffffc0860c48>] ib_cache_gid_set_default_gid+0x2f8/0x340 [ib_core]
> [ 394.804556] [<ffffffff8109ef07>] ? pick_next_task_fair+0x367/0x490
> [ 394.811923] [<ffffffff816db9e0>] ? __schedule+0x660/0x770
> [ 394.818487] [<ffffffffc08624ef>] add_netdev_ips+0xaf/0xc0 [ib_core]
> [ 394.825935] [<ffffffffc0862685>] enum_all_gids_of_dev_cb+0x85/0xc0 [ib_core]
> [ 394.834155] [<ffffffffc0861760>] ? rdma_protocol_roce_eth_encap+0x20/0x20 [ib_core]
> [ 394.842993] [<ffffffffc085e642>] ib_enum_roce_netdev+0xe2/0x100 [ib_core]
> [ 394.850959] [<ffffffffc0862600>] ? is_eth_port_of_netdev+0x90/0x90 [ib_core]
> [ 394.859193] [<ffffffffc086281c>] roce_rescan_device+0x1c/0x20 [ib_core]
> [ 394.866981] [<ffffffffc0860d7b>] ib_cache_setup_one+0xeb/0x400 [ib_core]
> [ 394.874851] [<ffffffffc085e299>] ib_register_device+0x2d9/0x500 [ib_core]
> [ 394.882807] [<ffffffffc0979961>] mlx5_ib_add+0xad1/0x1370 [mlx5_ib]
> [ 394.890211] [<ffffffff8108dad8>] ? ttwu_do_activate.constprop.81+0x58/0x60
> [ 394.898212] [<ffffffff81084224>] ? __alloc_workqueue_key+0x1f4/0x540
> [ 394.905696] [<ffffffffc08840ec>] mlx5_add_device+0x3c/0xa0 [mlx5_core]
> [ 394.913340] [<ffffffffc09e3000>] ? 0xffffffffc09e3000
> [ 394.919516] [<ffffffffc08841bc>] mlx5_register_interface+0x6c/0xa0 [mlx5_core]
> [ 394.927858] [<ffffffffc09e3035>] mlx5_ib_init+0x35/0x4b [mlx5_ib]
> [ 394.935059] [<ffffffff81002138>] do_one_initcall+0xc8/0x1f0
> [ 394.941734] [<ffffffff81159690>] ? __vunmap+0x80/0xd0
> [ 394.947875] [<ffffffff8111d04f>] do_init_module+0x56/0x1c8
> [ 394.954450] [<ffffffff810dd2be>] load_module+0x1dae/0x2670
> [ 394.961034] [<ffffffff810da7b0>] ? __symbol_put+0x50/0x50
> [ 394.967543] [<ffffffff810ddd89>] SYSC_finit_module+0xa9/0xd0
> [ 394.974302] [<ffffffff810dddc9>] SyS_finit_module+0x9/0x10
> [ 394.980878] [<ffffffff816df1b6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
> [ 394.988336] ---[ end trace df64015bed03617a ]---
>
> [ 395.007774] BUG: unable to handle kernel paging request at ffffffffffffffe0
>
> [ 395.302076] Call Trace:
> [ 395.305549] [<ffffffff8106a7a0>] ? __warn+0xa0/0xf0
> [ 395.311550] [<ffffffffc0860bd4>] ib_cache_gid_set_default_gid+0x284/0x340 [ib_core]
> [ 395.320335] [<ffffffff816db9e0>] ? __schedule+0x660/0x770
> [ 395.326868] [<ffffffffc08624ef>] add_netdev_ips+0xaf/0xc0 [ib_core]
> [ 395.334268] [<ffffffffc0862685>] enum_all_gids_of_dev_cb+0x85/0xc0 [ib_core]
> [ 395.342452] [<ffffffffc0861760>] ? rdma_protocol_roce_eth_encap+0x20/0x20 [ib_core]
> [ 395.351239] [<ffffffffc085e642>] ib_enum_roce_netdev+0xe2/0x100 [ib_core]
> [ 395.359167] [<ffffffffc0862600>] ? is_eth_port_of_netdev+0x90/0x90 [ib_core]
> [ 395.367353] [<ffffffffc086281c>] roce_rescan_device+0x1c/0x20 [ib_core]
> [ 395.375115] [<ffffffffc0860d7b>] ib_cache_setup_one+0xeb/0x400 [ib_core]
> [ 395.382949] [<ffffffffc085e299>] ib_register_device+0x2d9/0x500 [ib_core]
> [ 395.390869] [<ffffffffc0979961>] mlx5_ib_add+0xad1/0x1370 [mlx5_ib]
> [ 395.398289] [<ffffffff8108dad8>] ? ttwu_do_activate.constprop.81+0x58/0x60
> [ 395.406318] [<ffffffff81084224>] ? __alloc_workqueue_key+0x1f4/0x540
> [ 395.413806] [<ffffffffc08840ec>] mlx5_add_device+0x3c/0xa0 [mlx5_core]
> [ 395.421467] [<ffffffffc09e3000>] ? 0xffffffffc09e3000
> [ 395.427644] [<ffffffffc08841bc>] mlx5_register_interface+0x6c/0xa0 [mlx5_core]
> [ 395.436002] [<ffffffffc09e3035>] mlx5_ib_init+0x35/0x4b [mlx5_ib]
> [ 395.443222] [<ffffffff81002138>] do_one_initcall+0xc8/0x1f0
> [ 395.449938] [<ffffffff81159690>] ? __vunmap+0x80/0xd0
> [ 395.456114] [<ffffffff8111d04f>] do_init_module+0x56/0x1c8
> [ 395.462722] [<ffffffff810dd2be>] load_module+0x1dae/0x2670
> [ 395.469324] [<ffffffff810da7b0>] ? __symbol_put+0x50/0x50
> [ 395.475872] [<ffffffff810ddd89>] SYSC_finit_module+0xa9/0xd0
> [ 395.482656] [<ffffffff810dddc9>] SyS_finit_module+0x9/0x10
> [ 395.489252] [<ffffffff816df1b6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
>
>
> Instead of reverting the commit, I tried to find out the cause.
>
> ib_cache_gid_set_default_gid() calls find_gid()
>
> 249 static int find_gid(struct ib_gid_table *table, const union ib_gid *gid,
> 250                     const struct ib_gid_attr *val, bool default_gid,
> 251                     unsigned long mask, int *pempty)
> 252 {
> 253         int i = 0;
> 254         int found = -1;
> 255         int empty = pempty ? -1 : 0;
> 256
> 257         while (i < table->sz && (found < 0 || empty < 0)) {
>
> find_gid() returns -1 because table->sz is 0.
>
>
> 757 static int _gid_table_setup_one(struct ib_device *ib_dev)
> 758 {
> 759         u8 port;
> 760         struct ib_gid_table **table;
> 761         int err = 0;
> 762
> 763         table = kcalloc(ib_dev->phys_port_cnt, sizeof(*table), GFP_KERNEL);
> 764
> 765         if (!table) {
> 766                 pr_warn("failed to allocate ib gid cache for %s\n",
> 767                         ib_dev->name);
> 768                 return -ENOMEM;
> 769         }
> 770
> 771         for (port = 0; port < ib_dev->phys_port_cnt; port++) {
> 772                 u8 rdma_port = port + rdma_start_port(ib_dev);
> 773
> 774                 table[port] =
> 775                         alloc_gid_table(
> 776                                 ib_dev->port_immutable[rdma_port].gid_tbl_len);
>
> "table" is allocated in alloc_gid_table().
> And debug shows ib_dev->port_immutable[rdma_port].gid_tbl_len is 0.
>
> "gid_tbl_len" is set in mlx5_query_mad_ifc_port()
>
> 498 int mlx5_query_mad_ifc_port(struct ib_device *ibdev, u8 port,
> 499                             struct ib_port_attr *props)
> 500 {
> ...
>
> 537         props->gid_tbl_len = out_mad->data[50];
>
> Debug shows out_mad->data[50] is 0.
>
> So here is the "temporary" patch.
> I just copied it from mlx5_query_hca_port()
>
> diff --git a/drivers/infiniband/hw/mlx5/mad.c b/drivers/infiniband/hw/mlx5/mad.c
> index 1534af1..ef19b5c 100644
> --- a/drivers/infiniband/hw/mlx5/mad.c
> +++ b/drivers/infiniband/hw/mlx5/mad.c
> @@ -534,7 +534,7 @@ int mlx5_query_mad_ifc_port(struct ib_device *ibdev, u8 port,
>  	props->state = out_mad->data[32] & 0xf;
>  	props->phys_state = out_mad->data[33] >> 4;
>  	props->port_cap_flags = be32_to_cpup((__be32 *)(out_mad->data + 20));
> -	props->gid_tbl_len = out_mad->data[50];
> +	props->gid_tbl_len = mlx5_get_gid_table_len(MLX5_CAP_GEN(mdev, gid_table_size));
>  	props->max_msg_sz = 1 << MLX5_CAP_GEN(mdev, log_max_msg);
>  	props->pkey_tbl_len = mdev->port_caps[port - 1].pkey_table_len;
>  	props->bad_pkey_cntr = be16_to_cpup((__be16 *)(out_mad->data + 46));

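To illustrate the diagnosis above (find_gid() returning -1 when table->sz is 0), here is a small userspace sketch of the same loop condition. It is not the kernel code: `struct gid_table_model`, `find_slot_model()` and the use of plain integers as stand-ins for GIDs (0 meaning an empty slot) are hypothetical names and simplifications made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_gid_table; only the size and slots matter. */
struct gid_table_model {
	int sz;            /* number of slots, mirrors table->sz */
	const int *slots;  /* slot contents; 0 marks an empty slot (assumption) */
};

/*
 * Simplified search using the loop condition quoted from find_gid():
 * scan while a match or an empty slot is still missing.
 * Returns the index of the matching slot, or -1 if none was found.
 * 'gid' is expected to be non-zero in this model.
 */
static int find_slot_model(const struct gid_table_model *t, int gid, int *pempty)
{
	int i = 0;
	int found = -1;
	int empty = pempty ? -1 : 0;

	while (i < t->sz && (found < 0 || empty < 0)) {
		if (empty < 0 && t->slots[i] == 0)
			empty = i;
		if (found < 0 && t->slots[i] == gid)
			found = i;
		i++;
	}

	if (pempty)
		*pempty = empty;

	/* With sz == 0 the loop body never runs, so found is still -1. */
	return found;
}
```

With `sz == 0` the function returns -1 and also reports no empty slot, which matches the "ix < 0" condition that triggers the WARN_ON in the report.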
diff --git a/drivers/infiniband/hw/mlx5/mad.c b/drivers/infiniband/hw/mlx5/mad.c
index 1534af1..ef19b5c 100644
--- a/drivers/infiniband/hw/mlx5/mad.c
+++ b/drivers/infiniband/hw/mlx5/mad.c
@@ -534,7 +534,7 @@ int mlx5_query_mad_ifc_port(struct ib_device *ibdev, u8 port,
 	props->state = out_mad->data[32] & 0xf;
 	props->phys_state = out_mad->data[33] >> 4;
 	props->port_cap_flags = be32_to_cpup((__be32 *)(out_mad->data + 20));
-	props->gid_tbl_len = out_mad->data[50];
+	props->gid_tbl_len = mlx5_get_gid_table_len(MLX5_CAP_GEN(mdev, gid_table_size));
 	props->max_msg_sz = 1 << MLX5_CAP_GEN(mdev, log_max_msg);
 	props->pkey_tbl_len = mdev->port_caps[port - 1].pkey_table_len;
 	props->bad_pkey_cntr = be16_to_cpup((__be16 *)(out_mad->data + 46));
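
For readers unfamiliar with the helper used in the patch: it takes the GID table length from the device's general capabilities instead of from byte 50 of the MAD response. A rough userspace model of that decoding is sketched below. The mapping (8 << param entries, with values above 4 treated as invalid) is an assumption based on my reading of the mlx5 driver header and should be checked against the tree being built; `gid_table_len_model()` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of decoding a log-encoded GID table size.
 * Assumption: the capability field 'param' encodes 8 << param entries,
 * and values above 4 are rejected by reporting a length of 0.
 */
static int gid_table_len_model(uint16_t param)
{
	if (param > 4)
		return 0;

	return 8 * (1 << param);
}
```

Under this assumption a capability value of 0 gives 8 entries and a value of 4 gives 128, so a device with a valid capability field would never end up with a zero-length table the way the MAD byte did here.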