Message ID: 20211221024858.25938-1-chengyou@linux.alibaba.com
Series: Elastic RDMA Adapter (ERDMA) driver
On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:
> Hello all,
>
> This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
> was released at the Apsara Conference 2021 by Alibaba.
>
> ERDMA enables large-scale RDMA acceleration capability in the Alibaba ECS
> environment, initially offered in the g7re instance. It can significantly
> improve the efficiency of large-scale distributed computing and
> communication, and expands dynamically with the cluster scale of Alibaba
> Cloud.
>
> ERDMA is an RDMA networking adapter based on the Alibaba MOC hardware. It
> works in the VPC network environment (overlay network) and uses the iWarp
> transport protocol. ERDMA supports reliable connection (RC), and supports
> both kernel-space and user-space verbs. We already support HPC/AI
> applications with libfabric, NoF, and some other internal verbs libraries
> such as xrdma, epsl, etc.

We will need to get the erdma provider implementation into rdma-core too,
in order to consider merging it.

> For the ECS instance with RDMA enabled, there are two kinds of devices
> allocated: one for ERDMA, and one for the original netdev (virtio-net).
> They are different PCI devices. The ERDMA driver can discover which
> netdev it is attached to from its PCIe BAR space (by MAC address
> matching).

This is very questionable. The netdev part should be kept in the
drivers/ethernet/... part of the kernel.

Thanks

> Thanks,
> Cheng Xu
>
> Cheng Xu (11):
>   RDMA: Add ERDMA to rdma_driver_id definition
>   RDMA/erdma: Add the hardware related definitions
>   RDMA/erdma: Add main include file
>   RDMA/erdma: Add cmdq implementation
>   RDMA/erdma: Add event queue implementation
>   RDMA/erdma: Add verbs header file
>   RDMA/erdma: Add verbs implementation
>   RDMA/erdma: Add connection management (CM) support
>   RDMA/erdma: Add the erdma module
>   RDMA/erdma: Add the ABI definitions
>   RDMA/erdma: Add driver to kernel build environment
>
>  MAINTAINERS                               |    8 +
>  drivers/infiniband/Kconfig                |    1 +
>  drivers/infiniband/hw/Makefile            |    1 +
>  drivers/infiniband/hw/erdma/Kconfig       |   10 +
>  drivers/infiniband/hw/erdma/Makefile      |    5 +
>  drivers/infiniband/hw/erdma/erdma.h       |  381 +++++
>  drivers/infiniband/hw/erdma/erdma_cm.c    | 1585 +++++++++++++++++++++
>  drivers/infiniband/hw/erdma/erdma_cm.h    |  158 ++
>  drivers/infiniband/hw/erdma/erdma_cmdq.c  |  489 +++++++
>  drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>  drivers/infiniband/hw/erdma/erdma_debug.c |  314 ++++
>  drivers/infiniband/hw/erdma/erdma_debug.h |   18 +
>  drivers/infiniband/hw/erdma/erdma_eq.c    |  346 +++++
>  drivers/infiniband/hw/erdma/erdma_hw.h    |  474 ++++++
>  drivers/infiniband/hw/erdma/erdma_main.c  |  711 +++++++++
>  drivers/infiniband/hw/erdma/erdma_qp.c    |  624 ++++++++
>  drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++
>  drivers/infiniband/hw/erdma/erdma_verbs.h |  366 +++++
>  include/uapi/rdma/erdma-abi.h             |   49 +
>  include/uapi/rdma/ib_user_ioctl_verbs.h   |    1 +
>  20 files changed, 7219 insertions(+)
>  create mode 100644 drivers/infiniband/hw/erdma/Kconfig
>  create mode 100644 drivers/infiniband/hw/erdma/Makefile
>  create mode 100644 drivers/infiniband/hw/erdma/erdma.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
>  create mode 100644 include/uapi/rdma/erdma-abi.h
>
> --
> 2.27.0
On 12/21/21 9:09 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:

<...>

> We will need to get the erdma provider implementation into rdma-core too,
> in order to consider merging it.

Sure, I will submit the erdma userspace provider implementation within 2
days.

>> For the ECS instance with RDMA enabled, there are two kinds of devices
>> allocated: one for ERDMA, and one for the original netdev (virtio-net).
>> They are different PCI devices. The ERDMA driver can discover which
>> netdev it is attached to from its PCIe BAR space (by MAC address
>> matching).
>
> This is very questionable. The netdev part should be kept in the
> drivers/ethernet/... part of the kernel.
>
> Thanks

The net device used in the Alibaba ECS instance is a virtio-net device,
driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
need its own net device, and will be attached to an existing virtio-net
device. The relationship between ibdev and netdev in erdma is similar to
siw/rxe.
On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:

<...>

> > > For the ECS instance with RDMA enabled, there are two kinds of devices
> > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > They are different PCI devices. The ERDMA driver can discover which
> > > netdev it is attached to from its PCIe BAR space (by MAC address
> > > matching).
> >
> > This is very questionable. The netdev part should be kept in the
> > drivers/ethernet/... part of the kernel.
> >
> > Thanks
>
> The net device used in the Alibaba ECS instance is a virtio-net device,
> driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> need its own net device, and will be attached to an existing virtio-net
> device. The relationship between ibdev and netdev in erdma is similar to
> siw/rxe.

siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
through MAC matching.

Thanks
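[Editor's note: for context, the siw/rxe binding model Leon refers to is driven from userspace with the iproute2 `rdma` tool, which issues RDMA_NLDEV_CMD_NEWLINK under the hood. The interface and device names below are illustrative; the commands require root and the respective kernel modules.]

```
# Soft-iWARP: create an ibdev bound to an existing netdev
modprobe siw
rdma link add siw0 type siw netdev eth0

# Soft-RoCE follows the same model
modprobe rdma_rxe
rdma link add rxe0 type rxe netdev eth0

# Inspect the resulting ibdev <-> netdev bindings
rdma link show
```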
On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:

<...>

> siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> through MAC matching.
>
> Thanks

siw, rxe and erdma all avoid implementing a netdev part; this is what I
wanted to express when I said 'similar'. What you mentioned (the bind
mechanism) is one major difference between erdma and siw/rxe. For
siw/rxe, the user can attach an ibdev to any netdev they want, but that
is not true for erdma. When a user buys the erdma service, they must
specify which ENI (elastic network interface) it is to be bound to; this
means the attached erdma device can only be bound to that specific
netdev. Due to the uniqueness of MAC addresses in our ECS instances, we
use the MAC address as the identification, so the driver knows which
netdev it should be bound to.

Thanks,
Cheng Xu
On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:

<...>

> siw, rxe and erdma all avoid implementing a netdev part; this is what I
> wanted to express when I said 'similar'. What you mentioned (the bind
> mechanism) is one major difference between erdma and siw/rxe. For
> siw/rxe, the user can attach an ibdev to any netdev they want, but that
> is not true for erdma. When a user buys the erdma service, they must
> specify which ENI (elastic network interface) it is to be bound to; this
> means the attached erdma device can only be bound to that specific
> netdev. Due to the uniqueness of MAC addresses in our ECS instances, we
> use the MAC address as the identification, so the driver knows which
> netdev it should be bound to.

Nothing prohibits you from implementing this MAC check in
RDMA_NLDEV_CMD_NEWLINK. I personally don't like the idea of bind logic
being performed "magically".

BTW:
1. No module parameters
2. No driver versions

Thanks
On 12/23/21 9:44 PM, Leon Romanovsky wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:

<...>

> Nothing prohibits you from implementing this MAC check in
> RDMA_NLDEV_CMD_NEWLINK. I personally don't like the idea of bind logic
> being performed "magically".

OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
means that erdma would not be ready to use like other RDMA HCAs until the
user configures the link manually, which may not be friendly to them. I'm
not sure whether our current method is acceptable. If you strongly
recommend we use RDMA_NLDEV_CMD_NEWLINK, we will change to it.

> BTW:
> 1. No module parameters
> 2. No driver versions

Will fix them.

Thanks,
Cheng Xu
On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:

<...>

> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
> means that erdma would not be ready to use like other RDMA HCAs until the
> user configures the link manually, which may not be friendly to them. I'm
> not sure whether our current method is acceptable. If you strongly
> recommend we use RDMA_NLDEV_CMD_NEWLINK, we will change to it.

Before you rush to change that logic, could you please explain the
security model of this binding?

As the owner of a VM, I can replace the kernel code with any code I want
and remove your MAC matching (or replace it with something different).
How will you protect against such a flow?

If you don't trust the VM, you should perform the binding in the
hypervisor, and this erdma driver will work out-of-the-box in the VM.

Thanks
On 12/25/21 2:26 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:

<...>

> Before you rush to change that logic, could you please explain the
> security model of this binding?
>
> As the owner of a VM, I can replace the kernel code with any code I want
> and remove your MAC matching (or replace it with something different).
> How will you protect against such a flow?

I think this topic belongs to anti-attack. One principle of anti-attack
in our cloud is that attackers MUST NOT be able to influence any users
but themselves.

Before I answer the question, I want to describe some more details of our
architecture. In our MOC architecture, the virtio-net device (i.e., the
virtio-net back-end) is fully offloaded to the MOC, not implemented in
the host hypervisor. A virtio-net device belongs to a vport, and if it
has a peer erdma device, the erdma device also belongs to that vport. The
protocol headers of the network flows of the virtio-net and erdma devices
must be consistent with the vport configuration (MAC address, IP, etc.),
which is enforced by checking the OVS rules.

Back to the question: we cannot prevent attackers from modifying the
code, making the devices bind wrongly on the front-end, or, in some worse
cases, making the driver send invalid commands to the devices. If the
binding is wrong, the erdma network will simply be unreachable, because
the OVS module in the MOC hardware can detect this situation and drop all
the invalid network packets; this has no influence on other users.

> If you don't trust the VM, you should perform the binding in the
> hypervisor, and this erdma driver will work out-of-the-box in the VM.

As mentioned above, we also have the binding configuration in the
back-end (i.e., the MOC hardware); only when the front-end configuration
is correct can erdma work properly.

Thanks,
Cheng Xu
On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote: > > > On 12/23/21 6:23 PM, Leon Romanovsky wrote: > > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote: > > > > > > > <...> > > > > > > > > > > > > For the ECS instance with RDMA enabled, there are two kinds of devices > > > > > allocated, one for ERDMA, and one for the original netdev (virtio-net). > > > > > They are different PCI deivces. ERDMA driver can get the information about > > > > > which netdev attached to in its PCIe barspace (by MAC address matching). > > > > > > > > This is very questionable. The netdev part should be kept in the > > > > drivers/ethernet/... part of the kernel. > > > > > > > > Thanks > > > > > > The net device used in Alibaba ECS instance is virtio-net device, driven > > > by virtio-pci/virtio-net drivers. ERDMA device does not need its own net > > > device, and will be attached to an existed virtio-net device. The > > > relationship between ibdev and netdev in erdma is similar to siw/rxe. > > > > siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not > > through MAC's matching. > > > > Thanks > > Both siw/rxe/erdma don't need to implement netdev part, this is what I > wanted to express when I said 'similar'. > What you mentioned (the bind mechanism) is one major difference between > erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if > he/she wants, but it is not true for erdma. When user buys the erdma > service, he/she must specify which ENI (elastic network interface) to be > binded, it means that the attached erdma device can only be binded to > the specific netdev. Due to the uniqueness of MAC address in our ECS > instance, we use the MAC address as the identification, then the driver > knows which netdev should be binded to. It really doesn't match our driver binding model to rely on MAC addreses. 
Our standard model would expect that the virtio-net driver would
detect it has RDMA capability and spawn an aux device to link the two
things together.

Using net notifiers to try to link the lifecycles together has been a
mess so far.

Jason
On 1/7/22 10:24 PM, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>
>>>
>>> <...>
>>>
>>>>>>
>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>
>>>>> This is very questionable. The netdev part should be kept in the
>>>>> drivers/ethernet/... part of the kernel.
>>>>>
>>>>> Thanks
>>>>
>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>> device, and will be attached to an existed virtio-net device. The
>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>
>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>> through MAC's matching.
>>>
>>> Thanks
>>
>> Both siw/rxe/erdma don't need to implement netdev part, this is what I
>> wanted to express when I said 'similar'.
>> What you mentioned (the bind mechanism) is one major difference between
>> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
>> he/she wants, but it is not true for erdma. When user buys the erdma
>> service, he/she must specify which ENI (elastic network interface) to be
>> binded, it means that the attached erdma device can only be binded to
>> the specific netdev. Due to the uniqueness of MAC address in our ECS
>> instance, we use the MAC address as the identification, then the driver
>> knows which netdev should be binded to.
>
> It really doesn't match our driver binding model to rely on MAC
> addresses.
>
> Our standard model would expect that the virtio-net driver would
> detect it has RDMA capability and spawn an aux device to link the two
> things together.
>
> Using net notifiers to try to link the lifecycles together has been a
> mess so far.

Thanks for your explanation.

I guess this model requires the netdev and its associated ibdev to
share the same physical hardware (PCI device or platform device)?
ERDMA is a separate PCI device; it is bound to virtio-net only because
the ENIs in our cloud are virtio-net devices. Actually, it can also
work with other types of netdev.

As you and Leon said, using net notifiers is not a good way, so I'm
modifying our bind mechanism to use RDMA_NLDEV_CMD_NEWLINK instead.

Thanks,
Cheng Xu

> Jason