diff mbox

[rdma-next] IB/mlx5: Support RAW Ethernet when RoCE is disabled

Message ID 1463317066-29900-1-git-send-email-ogerlitz@mellanox.com (mailing list archive)
State Superseded
Headers show

Commit Message

Or Gerlitz May 15, 2016, 12:57 p.m. UTC
On some environments, such as certain SRIOV VF configurations, RoCE is
not supported for an Ethernet port. Currently, the driver will not
open IB device on that port.

This is problematic, since we do want user-space RAW Ethernet (RAW_PACKET
QPs) functionality to remain in place. For that end, enhance the relevant
driver flows such that we do create a device instance in that case.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 55 +++++++++++++++------------------------
 1 file changed, 21 insertions(+), 34 deletions(-)

Comments

Doug Ledford May 16, 2016, 3:14 p.m. UTC | #1
On 05/15/2016 08:57 AM, Or Gerlitz wrote:
> On some environments, such as certain SRIOV VF configurations, RoCE is
> not supported for an Ethernet port. Currently, the driver will not
> open IB device on that port.
> 
> This is problematic, since we do want user-space RAW Ethernet (RAW_PACKET
> QPs) functionality to remain in place. For that end, enhance the relevant
> driver flows such that we do create a device instance in that case.
> 
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Reviewed-by: Matan Barak <matanb@mellanox.com>

So this device is still going to present itself as a uverbs# device and
libibverbs is still going to try and access it, but when it comes time
for user applications to try and use it they will simply fail all QP
types except for RAW?  Is there no way to distinguish between a device
with RoCE enabled and without?
Matan Barak May 16, 2016, 4:31 p.m. UTC | #2
On 16/05/2016 18:14, Doug Ledford wrote:
> On 05/15/2016 08:57 AM, Or Gerlitz wrote:
>> On some environments, such as certain SRIOV VF configurations, RoCE is
>> not supported for an Ethernet port. Currently, the driver will not
>> open IB device on that port.
>>
>> This is problematic, since we do want user-space RAW Ethernet (RAW_PACKET
>> QPs) functionality to remain in place. For that end, enhance the relevant
>> driver flows such that we do create a device instance in that case.
>>
>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>> Reviewed-by: Matan Barak <matanb@mellanox.com>
>
> So this device is still going to present itself as a uverbs# device and
> libibverbs is still going to try and access it, but when it comes time
> for user applications to try and use it they will simply fail all QP
> types except for RAW?  Is there no way to distinguish between a device
> with RoCE enabled and without?
>
>

It's a bit broken, but you could query gid_tbl_len in ibv_port_attr.
If it's zero, RoCE can't be used anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 16, 2016, 4:46 p.m. UTC | #3
On 05/16/2016 12:31 PM, Matan Barak (External) wrote:
> On 16/05/2016 18:14, Doug Ledford wrote:
>> On 05/15/2016 08:57 AM, Or Gerlitz wrote:
>>> On some environments, such as certain SRIOV VF configurations, RoCE is
>>> not supported for an Ethernet port. Currently, the driver will not
>>> open IB device on that port.
>>>
>>> This is problematic, since we do want user-space RAW Ethernet
>>> (RAW_PACKET
>>> QPs) functionality to remain in place. For that end, enhance the
>>> relevant
>>> driver flows such that we do create a device instance in that case.
>>>
>>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>>> Reviewed-by: Matan Barak <matanb@mellanox.com>
>>
>> So this device is still going to present itself as a uverbs# device and
>> libibverbs is still going to try and access it, but when it comes time
>> for user applications to try and use it they will simply fail all QP
>> types except for RAW?  Is there no way to distinguish between a device
>> with RoCE enabled and without?
>>
>>
> 
> It's a bit broken, but you could query gid_tbl_len in ibv_port_attr.
> If it's zero, RoCE can't be used anyway.

OK, so you want a plain Ethernet device with QPs as a means of enabling
RDMA for Ethernet, but without the full blown Verbs support, just direct
data placement in memory.  That's a reasonable thing to want to do.
However, I'm not sure I agree with the implementation here.  It seems
very hackish.  This setup probably deserves its own type setting in the
RDMA stack to differentiate it (aka, InfiniBand, OPA, RoCE, iWARP,
USNIC, Ethernet + QPs).
Or Gerlitz May 16, 2016, 9:03 p.m. UTC | #4
On Mon, May 16, 2016 at 7:46 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 05/16/2016 12:31 PM, Matan Barak (External) wrote:
>> On 16/05/2016 18:14, Doug Ledford wrote:
>>> On 05/15/2016 08:57 AM, Or Gerlitz wrote:

>>>> On some environments, such as certain SRIOV VF configurations, RoCE is
>>>> not supported for an Ethernet port. Currently, the driver will not
>>>> open IB device on that port.
>>>> This is problematic, since we do want user-space RAW Ethernet
>>>> (RAW_PACKET QPs) functionality to remain in place. For that end,
>>>> enhance the relevant
>>>> driver flows such that we do create a device instance in that case.

>>> So this device is still going to present itself as a uverbs# device and
>>> libibverbs is still going to try and access it, but when it comes time
>>> for user applications to try and use it they will simply fail all QP
>>> types except for RAW?  Is there no way to distinguish between a device
>>> with RoCE enabled and without?

Doug,

To answer your questions: YES there is a way, namely
RDMA_CORE_CAP_PROT_ROCE and RDMA_CORE_PORT_IBA_ROCE - under this
patch, when RoCE isn't supported we do create the device but not
advertizing these caps.

> OK, so you want a plain Ethernet device with QPs as a means of enabling
> RDMA for Ethernet, but without the full blown Verbs support, just direct
> data placement in memory.  That's a reasonable thing to want to do.
> However, I'm not sure I agree with the implementation here.  It seems
> very hackish.  This setup probably deserves its own type setting in the
> RDMA stack to differentiate it (aka, InfiniBand, OPA, RoCE, iWARP,
> USNIC, Ethernet + QPs).

Why? we should be doing things for a reason. There's no entity I know
of in the stack which needs to be operative and is not b/c that new
type you're suggesting was not defined.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 17, 2016, 1:01 a.m. UTC | #5
On 05/16/2016 05:03 PM, Or Gerlitz wrote:
> On Mon, May 16, 2016 at 7:46 PM, Doug Ledford <dledford@redhat.com> wrote:
>> On 05/16/2016 12:31 PM, Matan Barak (External) wrote:
>>> On 16/05/2016 18:14, Doug Ledford wrote:
>>>> On 05/15/2016 08:57 AM, Or Gerlitz wrote:
> 
>>>>> On some environments, such as certain SRIOV VF configurations, RoCE is
>>>>> not supported for an Ethernet port. Currently, the driver will not
>>>>> open IB device on that port.
>>>>> This is problematic, since we do want user-space RAW Ethernet
>>>>> (RAW_PACKET QPs) functionality to remain in place. For that end,
>>>>> enhance the relevant
>>>>> driver flows such that we do create a device instance in that case.
> 
>>>> So this device is still going to present itself as a uverbs# device and
>>>> libibverbs is still going to try and access it, but when it comes time
>>>> for user applications to try and use it they will simply fail all QP
>>>> types except for RAW?  Is there no way to distinguish between a device
>>>> with RoCE enabled and without?
> 
> Doug,
> 
> To answer your questions: YES there is a way, namely
> RDMA_CORE_CAP_PROT_ROCE and RDMA_CORE_PORT_IBA_ROCE - under this
> patch, when RoCE isn't supported we do create the device but not
> advertizing these caps.

I wasn't referring to the kernel.  I was referring to a user space
application.  There is currently no way for a user space app to tell the
difference between this and a full RoCE device until they try to create
something other than a raw Eth QP and it fails.

>> OK, so you want a plain Ethernet device with QPs as a means of enabling
>> RDMA for Ethernet, but without the full blown Verbs support, just direct
>> data placement in memory.  That's a reasonable thing to want to do.
>> However, I'm not sure I agree with the implementation here.  It seems
>> very hackish.  This setup probably deserves its own type setting in the
>> RDMA stack to differentiate it (aka, InfiniBand, OPA, RoCE, iWARP,
>> USNIC, Ethernet + QPs).
> 
> Why? we should be doing things for a reason. There's no entity I know
> of in the stack which needs to be operative and is not b/c that new
> type you're suggesting was not defined.

The reason is to properly support this as a valid device type.  Not full
RDMA, but still with direct data placement in user memory.  That has
value, I'm thinking it should be properly represented.
Or Gerlitz May 17, 2016, 5:31 a.m. UTC | #6
On Tue, May 17, 2016 at 4:01 AM, Doug Ledford <dledford@redhat.com> wrote:
> On 05/16/2016 05:03 PM, Or Gerlitz wrote:

>> To answer your questions: YES there is a way, namely
>> RDMA_CORE_CAP_PROT_ROCE and RDMA_CORE_PORT_IBA_ROCE - under this
>> patch, when RoCE isn't supported we do create the device but not
>> advertizing these caps.

> I wasn't referring to the kernel.  I was referring to a user space
> application.  There is currently no way for a user space app to tell the
> difference between this and a full RoCE device until they try to create
> something other than a raw Eth QP and it fails.

And how a general purpose user-space app should be realizing that UD isn't
supported by the iWARP drivers and RC isn't supported by the Cisco driver?

c4iw_create_qp -->
 if (attrs->qp_type != IB_QPT_RC)
                return ERR_PTR(-EINVAL);

usnic_ib_create_qp -->
if (init_attr->qp_type != IB_QPT_UD)
                return ERR_PTR(-EINVAL);

We need to fix mlx5 to work over a port not supporting RoCE, this patch does
that, in a manner which puts it ~in par with other vendors/transports
limitation.

If you really want, we can add such -EINVAL-ing when RoCE isn't supported
when the user attempts to create IBTA QP types (UC/RD/RC/etc)


>>> OK, so you want a plain Ethernet device with QPs as a means of enabling
>>> RDMA for Ethernet, but without the full blown Verbs support, just direct
>>> data placement in memory.  That's a reasonable thing to want to do.
>>> However, I'm not sure I agree with the implementation here.  It seems
>>> very hackish.  This setup probably deserves its own type setting in the
>>> RDMA stack to differentiate it (aka, InfiniBand, OPA, RoCE, iWARP,
>>> USNIC, Ethernet + QPs).

>> Why? we should be doing things for a reason. There's no entity I know
>> of in the stack which needs to be operative and is not b/c that new
>> type you're suggesting was not defined.

> The reason is to properly support this as a valid device type.  Not full
> RDMA, but still with direct data placement in user memory.  That has
> value, I'm thinking it should be properly represented.

Again, going with you, say we add that CAP, I didn't find a single place
in the kernel IB stack to where it would be used, please suggest a use-case!

I/F code which don't have consumer is not upstreamable. BTW, AFAIR.,we have
some little code like that from the early days for some IB switch HW
drivers which
are not upstream, and whenever we get there this is PITA to change/maintain.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 18, 2016, 3:31 a.m. UTC | #7
On Tue, May 17, 2016 at 8:31 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Tue, May 17, 2016 at 4:01 AM, Doug Ledford <dledford@redhat.com> wrote:
>> On 05/16/2016 05:03 PM, Or Gerlitz wrote:
>
>>> To answer your questions: YES there is a way, namely
>>> RDMA_CORE_CAP_PROT_ROCE and RDMA_CORE_PORT_IBA_ROCE - under this
>>> patch, when RoCE isn't supported we do create the device but not
>>> advertizing these caps.
>
>> I wasn't referring to the kernel.  I was referring to a user space
>> application.  There is currently no way for a user space app to tell the
>> difference between this and a full RoCE device until they try to create
>> something other than a raw Eth QP and it fails.
>
> And how a general purpose user-space app should be realizing that UD isn't
> supported by the iWARP drivers and RC isn't supported by the Cisco driver?
>
> c4iw_create_qp -->
>  if (attrs->qp_type != IB_QPT_RC)
>                 return ERR_PTR(-EINVAL);
>
> usnic_ib_create_qp -->
> if (init_attr->qp_type != IB_QPT_UD)
>                 return ERR_PTR(-EINVAL);
>
> We need to fix mlx5 to work over a port not supporting RoCE, this patch does
> that, in a manner which puts it ~in par with other vendors/transports
> limitation.
>
> If you really want, we can add such -EINVAL-ing when RoCE isn't supported
> when the user attempts to create IBTA QP types (UC/RD/RC/etc)

Doug,

You left me in the air... can you comment here and below on my
responses, we need the driver to be functions on these envs, lets see
what's needed!


>>>> OK, so you want a plain Ethernet device with QPs as a means of enabling
>>>> RDMA for Ethernet, but without the full blown Verbs support, just direct
>>>> data placement in memory.  That's a reasonable thing to want to do.
>>>> However, I'm not sure I agree with the implementation here.  It seems
>>>> very hackish.  This setup probably deserves its own type setting in the
>>>> RDMA stack to differentiate it (aka, InfiniBand, OPA, RoCE, iWARP,
>>>> USNIC, Ethernet + QPs).
>
>>> Why? we should be doing things for a reason. There's no entity I know
>>> of in the stack which needs to be operative and is not b/c that new
>>> type you're suggesting was not defined.
>
>> The reason is to properly support this as a valid device type.  Not full
>> RDMA, but still with direct data placement in user memory.  That has
>> value, I'm thinking it should be properly represented.
>
> Again, going with you, say we add that CAP, I didn't find a single place
> in the kernel IB stack to where it would be used, please suggest a use-case!


Doug, here too, anywhere that you see a use for that new cap?

> I/F code which don't have consumer is not upstreamable. BTW, AFAIR.,we have
> some little code like that from the early days for some IB switch HW
> drivers which
> are not upstream, and whenever we get there this is PITA to change/maintain.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 4cb81f6..965e414 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2195,6 +2195,8 @@  static int mlx5_port_immutable(struct ib_device *ibdev, u8 port_num,
 			       struct ib_port_immutable *immutable)
 {
 	struct ib_port_attr attr;
+	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+	enum rdma_link_layer ll = mlx5_ib_port_link_layer(ibdev, 1);
 	int err;
 
 	err = mlx5_ib_query_port(ibdev, port_num, &attr);
@@ -2204,37 +2206,12 @@  static int mlx5_port_immutable(struct ib_device *ibdev, u8 port_num,
 	immutable->pkey_tbl_len = attr.pkey_tbl_len;
 	immutable->gid_tbl_len = attr.gid_tbl_len;
 	immutable->core_cap_flags = get_core_cap_flags(ibdev);
-	immutable->max_mad_size = IB_MGMT_MAD_SIZE;
+	if ((ll == IB_LINK_LAYER_INFINIBAND) || MLX5_CAP_GEN(dev->mdev, roce))
+		immutable->max_mad_size = IB_MGMT_MAD_SIZE;
 
 	return 0;
 }
 
-static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
-{
-	int err;
-
-	dev->roce.nb.notifier_call = mlx5_netdev_event;
-	err = register_netdevice_notifier(&dev->roce.nb);
-	if (err)
-		return err;
-
-	err = mlx5_nic_vport_enable_roce(dev->mdev);
-	if (err)
-		goto err_unregister_netdevice_notifier;
-
-	return 0;
-
-err_unregister_netdevice_notifier:
-	unregister_netdevice_notifier(&dev->roce.nb);
-	return err;
-}
-
-static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
-{
-	mlx5_nic_vport_disable_roce(dev->mdev);
-	unregister_netdevice_notifier(&dev->roce.nb);
-}
-
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
 	struct mlx5_ib_dev *dev;
@@ -2246,9 +2223,6 @@  static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	port_type_cap = MLX5_CAP_GEN(mdev, port_type);
 	ll = mlx5_port_type_cap_to_rdma_ll(port_type_cap);
 
-	if ((ll == IB_LINK_LAYER_ETHERNET) && !MLX5_CAP_GEN(mdev, roce))
-		return NULL;
-
 	printk_once(KERN_INFO "%s", mlx5_version);
 
 	dev = (struct mlx5_ib_dev *)ib_alloc_device(sizeof(*dev));
@@ -2396,9 +2370,15 @@  static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	mutex_init(&dev->cap_mask_mutex);
 
 	if (ll == IB_LINK_LAYER_ETHERNET) {
-		err = mlx5_enable_roce(dev);
+		dev->roce.nb.notifier_call = mlx5_netdev_event;
+		err = register_netdevice_notifier(&dev->roce.nb);
 		if (err)
 			goto err_dealloc;
+		if (MLX5_CAP_GEN(mdev, roce)) {
+			err = mlx5_nic_vport_enable_roce(dev->mdev);
+			if (err)
+				goto err_unreg_notifier;
+		}
 	}
 
 	err = create_dev_resources(&dev->devr);
@@ -2441,8 +2421,12 @@  err_rsrc:
 	destroy_dev_resources(&dev->devr);
 
 err_disable_roce:
+	if (ll == IB_LINK_LAYER_ETHERNET && MLX5_CAP_GEN(mdev, roce))
+		mlx5_nic_vport_disable_roce(dev->mdev);
+
+err_unreg_notifier:
 	if (ll == IB_LINK_LAYER_ETHERNET)
-		mlx5_disable_roce(dev);
+		unregister_netdevice_notifier(&dev->roce.nb);
 
 err_dealloc:
 	ib_dealloc_device((struct ib_device *)dev);
@@ -2459,8 +2443,11 @@  static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 	destroy_umrc_res(dev);
 	mlx5_ib_odp_remove_one(dev);
 	destroy_dev_resources(&dev->devr);
-	if (ll == IB_LINK_LAYER_ETHERNET)
-		mlx5_disable_roce(dev);
+	if (ll == IB_LINK_LAYER_ETHERNET) {
+		if (MLX5_CAP_GEN(mdev, roce))
+			mlx5_nic_vport_disable_roce(dev->mdev);
+		unregister_netdevice_notifier(&dev->roce.nb);
+	}
 	ib_dealloc_device(&dev->ib_dev);
 }