
RDMA/rxe: Use kzalloc() to alloc map_set

Message ID: 20220518043725.771549-1-lizhijian@fujitsu.com (mailing list archive)
State: Changes Requested
Delegated to: Jason Gunthorpe
Series: RDMA/rxe: Use kzalloc() to alloc map_set

Commit Message

Li Zhijian May 18, 2022, 4:37 a.m. UTC
The call chain below allocates map_set without fully initializing it:
rxe_mr_init_fast()
 -> rxe_mr_alloc()
    -> rxe_mr_alloc_map_set()

Uninitialized values inside struct rxe_map_set can cause a kernel panic.

It was noticed that the crashes were triggered by rnbd use cases; they can
be easily reproduced by:
$ while true; do echo "sessname=bla path=ip:<ip> device_path=<device>" > /sys/devices/virtual/rnbd-client/ctl/map_device; done
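
For context, a minimal stand-alone sketch (not the driver code; field names
abridged from struct rxe_map_set, types illustrative) of why the allocator
choice matters: kmalloc() hands back reused slab memory with stale contents,
while kzalloc() returns the same allocation zeroed.

#include <linux/slab.h>
#include <linux/types.h>

/* Abridged stand-in for struct rxe_map_set, for illustration only. */
struct example_map_set {
	u64	iova;
	u64	offset;
	size_t	length;
	int	page_shift;
};

static struct example_map_set *alloc_example_set(void)
{
	struct example_map_set *set;

	/*
	 * kmalloc() leaves every field holding stale slab contents, so a
	 * reader that trusts e.g. page_shift or offset before the set has
	 * been built can compute wild addresses.  kzalloc() returns the
	 * same allocation with every field zeroed:
	 *
	 *	set = kzalloc(sizeof(*set), GFP_KERNEL);
	 */
	set = kmalloc(sizeof(*set), GFP_KERNEL);
	return set;
}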

The backtraces are not always identical.
[1st]----------
[   80.158930] CPU: 0 PID: 11 Comm: ksoftirqd/0 Not tainted 5.18.0-rc1-roce-flush+ #60
[   80.160736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[   80.163579] RIP: 0010:lookup_iova+0x66/0xa0 [rdma_rxe]
[   80.164825] Code: 00 00 00 48 d3 ee 89 32 c3 4c 8b 18 49 8b 3b 48 8b 47 08 48 39 c6 72 38 48 29 c6 45 31 d2 b8 01 00 00 00 48 63 c8 48 c1 e1 04 <48> 8b 4c 0f 08 48 39 f1 77 21 83 c0 01 48 29 ce 3d 00 01 00 00 75
[   80.168935] RSP: 0018:ffffb7ff80063bf0 EFLAGS: 00010246
[   80.170333] RAX: 0000000000000000 RBX: ffff9b9949d86800 RCX: 0000000000000000
[   80.171976] RDX: ffffb7ff80063c00 RSI: 0000000049f6b378 RDI: 002818da00000004
[   80.173606] RBP: 0000000000000120 R08: ffffb7ff80063c08 R09: ffffb7ff80063c04
[   80.176933] R10: 0000000000000002 R11: ffff9b9916f7eef8 R12: ffff9b99488a0038
[   80.178526] R13: ffff9b99488a0038 R14: ffff9b9914fb346a R15: ffff9b990ab27000
[   80.180378] FS:  0000000000000000(0000) GS:ffff9b997dc00000(0000) knlGS:0000000000000000
[   80.182257] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   80.183577] CR2: 00007efc33a98ed0 CR3: 0000000014f32004 CR4: 00000000001706f0
[   80.185210] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.186890] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   80.188517] Call Trace:
[   80.189269]  <TASK>
[   80.189949]  rxe_mr_copy.part.0+0x6f/0x140 [rdma_rxe]
[   80.191173]  rxe_responder+0x12ee/0x1b60 [rdma_rxe]
[   80.192409]  ? rxe_icrc_check+0x7e/0x100 [rdma_rxe]
[   80.193576]  ? rxe_rcv+0x1d0/0x780 [rdma_rxe]
[   80.194668]  ? rxe_icrc_hdr.isra.0+0xf6/0x160 [rdma_rxe]
[   80.195952]  rxe_do_task+0x67/0xb0 [rdma_rxe]
[   80.197081]  rxe_xmit_packet+0xc7/0x210 [rdma_rxe]
[   80.198253]  rxe_requester+0x680/0xee0 [rdma_rxe]
[   80.199439]  ? update_load_avg+0x5f/0x690
[   80.200530]  ? update_load_avg+0x5f/0x690
[   80.213968]  ? rtrs_clt_recv_done+0x1b/0x30 [rtrs_client]

[2nd]----------
[ 5213.049494] RIP: 0010:rxe_mr_copy.part.0+0xa8/0x140 [rdma_rxe]
[ 5213.050978] Code: 00 00 49 c1 e7 04 48 8b 00 4c 8d 2c d0 48 8b 44 24 10 4d 03 7d 00 85 ed 7f 10 eb 6c 89 54 24 0c 49 83 c7 10 31 c0 85 ed 7e 5e <49> 8b 3f 8b 14 24 4c 89 f6 48 01 c7 85 d2 74 06 48 89 fe 4c 89 f7
[ 5213.056463] RSP: 0018:ffffae3580063bf8 EFLAGS: 00010202
[ 5213.057986] RAX: 0000000000018978 RBX: ffff9d7ef7a03600 RCX: 0000000000000008
[ 5213.059797] RDX: 000000000000007c RSI: 000000000000007c RDI: ffff9d7ef7a03600
[ 5213.061720] RBP: 0000000000000120 R08: ffffae3580063c08 R09: ffffae3580063c04
[ 5213.063532] R10: ffff9d7efece0038 R11: ffff9d7ec4b1db00 R12: ffff9d7efece0038
[ 5213.065445] R13: ffff9d7ef4098260 R14: ffff9d7f11e23c6a R15: 4c79500065708144
[ 5213.067264] FS:  0000000000000000(0000) GS:ffff9d7f3dc00000(0000) knlGS:0000000000000000
[ 5213.069442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5213.071004] CR2: 00007fce47276c60 CR3: 0000000003f66004 CR4: 00000000001706f0
[ 5213.072827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5213.074484] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5213.076292] Call Trace:
[ 5213.077027]  <TASK>
[ 5213.077718]  rxe_responder+0x12ee/0x1b60 [rdma_rxe]
[ 5213.079019]  ? rxe_icrc_check+0x7e/0x100 [rdma_rxe]
[ 5213.080380]  ? rxe_rcv+0x1d0/0x780 [rdma_rxe]
[ 5213.081708]  ? rxe_icrc_hdr.isra.0+0xf6/0x160 [rdma_rxe]
[ 5213.082990]  rxe_do_task+0x67/0xb0 [rdma_rxe]
[ 5213.084030]  rxe_xmit_packet+0xc7/0x210 [rdma_rxe]
[ 5213.085156]  rxe_requester+0x680/0xee0 [rdma_rxe]
[ 5213.088258]  ? update_load_avg+0x5f/0x690
[ 5213.089381]  ? update_load_avg+0x5f/0x690
[ 5213.090446]  ? rtrs_clt_recv_done+0x1b/0x30 [rtrs_client]
[ 5213.092087]  rxe_do_task+0x67/0xb0 [rdma_rxe]
[ 5213.093125]  tasklet_action_common.constprop.0+0x92/0xc0
[ 5213.094366]  __do_softirq+0xe1/0x2d8
[ 5213.095287]  run_ksoftirqd+0x21/0x30
[ 5213.096456]  smpboot_thread_fn+0x183/0x220
[ 5213.097519]  ? sort_range+0x20/0x20
[ 5213.098761]  kthread+0xe2/0x110
[ 5213.099638]  ? kthread_complete_and_exit+0x20/0x20
[ 5213.100948]  ret_from_fork+0x22/0x30

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe May 20, 2022, 2:45 p.m. UTC | #1
On Wed, May 18, 2022 at 12:37:25PM +0800, Li Zhijian wrote:
> Below call chains will alloc map_set without fully initializing map_set.
> rxe_mr_init_fast()
>  -> rxe_mr_alloc()
>     -> rxe_mr_alloc_map_set()
> 
> Uninitialized values inside struct rxe_map_set are possible to cause
> kernel panic.

If the value is uninitialized then why is 0 an OK value?

Would be happier to know the exact value that is not initialized

Jason
Li Zhijian May 23, 2022, 2:02 p.m. UTC | #2
on 2022/5/20 22:45, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 12:37:25PM +0800, Li Zhijian wrote:
>> Below call chains will alloc map_set without fully initializing map_set.
>> rxe_mr_init_fast()
>>   -> rxe_mr_alloc()
>>      -> rxe_mr_alloc_map_set()
>>
>> Uninitialized values inside struct rxe_map_set are possible to cause
>> kernel panic.
> If the value is uninitialized then why is 0 an OK value?
>
> Would be happier to know the exact value that is not initialized

Well, good question. After rethinking this issue, it seems this patch
doesn't address the root cause, though it makes the crash disappear to
some extent.

I'm still working on the root cause :)

Thanks

Zhijian


>
> Jason
Li Zhijian May 24, 2022, 3:59 a.m. UTC | #3
Hi Jason & Bob
CC Guoqing

@Guoqing, It may correlate with your previous bug report: https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/


It's observed that reusing the same MR in the rnbd server triggers the code
path below:
  -> rxe_mr_init_fast()
  |-> alloc map_set() # map_set is uninitialized
  |...-> rxe_map_mr_sg() # build the map_set
      |-> rxe_mr_set_page()
  |...-> rxe_reg_fast_mr() # mr->state changes from FREE to VALID, meaning
                           # host memory can be accessed (e.g. rxe_mr_copy)
  |...-> rxe_invalidate_mr() # mr->state changes from VALID to FREE
  |...-> rxe_reg_fast_mr() # mr->state changes from FREE to VALID again,
                           # but the map_set is not rebuilt
  |...-> rxe_mr_copy() # kernel crash due to accessing the wild addresses
                       # looked up from the stale map_set

I drafted a patch (below) for it, but I wonder whether it's rxe's
responsibility to do such checking. Any comments are very welcome.
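
For reference, a minimal ULP-side sketch of the reuse pattern described
above, written against the standard kernel verbs API (the helper name,
access flags and completion handling are illustrative, not taken from
rnbd):

#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/*
 * Reuse pattern that hits the path above: the MR is invalidated and
 * re-registered, but ib_map_mr_sg() is not called again, so rxe swaps in
 * a map_set that was never rebuilt for this round.
 */
static int reuse_mr_without_remap(struct ib_qp *qp, struct ib_mr *mr,
				  struct scatterlist *sgl, int nents)
{
	struct ib_send_wr inv_wr = {
		.opcode = IB_WR_LOCAL_INV,
		.send_flags = IB_SEND_SIGNALED,
		.ex.invalidate_rkey = mr->rkey,
	};
	struct ib_reg_wr reg_wr = {
		.wr.opcode = IB_WR_REG_MR,
		.wr.send_flags = IB_SEND_SIGNALED,
		.mr = mr,
		.key = mr->rkey,
		.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE,
	};
	int n, ret;

	/* First use: build the map_set, then register it. */
	n = ib_map_mr_sg(mr, sgl, nents, NULL, PAGE_SIZE);
	if (n != nents)
		return n < 0 ? n : -EINVAL;
	ret = ib_post_send(qp, &reg_wr.wr, NULL);
	if (ret)
		return ret;
	/* ... RDMA traffic against mr->rkey, wait for completions ... */

	/*
	 * Invalidate, then re-register WITHOUT a new ib_map_mr_sg(): the
	 * second IB_WR_REG_MR moves mr->state back to VALID even though no
	 * fresh mapping was built.
	 */
	ret = ib_post_send(qp, &inv_wr, NULL);
	if (ret)
		return ret;
	return ib_post_send(qp, &reg_wr.wr, NULL);
}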


 From e9d0bd821f07f5e049027f07b3ce9dc283624201 Mon Sep 17 00:00:00 2001
From: Li Zhijian <lizhijian@fujitsu.com>
Date: Tue, 24 May 2022 10:56:19 +0800
Subject: [PATCH] RDMA/rxe: check map_set valid when handle IB_WR_REG_MR
                                                                                                                                                                                                                                        
It's observed that a same MR in rnbd server will trigger below code
path:
  -> rxe_mr_init_fast()
  |-> alloc map_set() # map_set is uninitialized
  |...-> rxe_map_mr_sg() # build the map_set
      |-> rxe_mr_set_page()
  |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
                           # we can access host memory(such rxe_mr_copy)
  |...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
  |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
                           # but map_set was not built again
  |...-> rxe_mr_copy() # kernel crash due to access wild addresses
                       # that lookup from the map_set
                                                                                                                                                                                                                                        
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
  drivers/infiniband/sw/rxe/rxe_mr.c    | 9 +++++++++
  drivers/infiniband/sw/rxe/rxe_verbs.c | 1 +
  drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
  3 files changed, 11 insertions(+)
                                                                                                                                                                                                                                        
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 787c7dadc14f..09673d559c06 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -90,6 +90,7 @@ static int rxe_mr_alloc_map_set(int num_map, struct rxe_map_set **setp)
         if (!set->map)
                 goto err_free_set;
                                                                                                                                                                                                                                        
+       set->valid = false;
         for (i = 0; i < num_map; i++) {
                 set->map[i] = kmalloc(sizeof(struct rxe_map), GFP_KERNEL);
                 if (!set->map[i])
@@ -216,6 +217,7 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
         }
                                                                                                                                                                                                                                        
         set = mr->cur_map_set;
+       set->valid = true;
         set->page_shift = PAGE_SHIFT;
         set->page_mask = PAGE_SIZE - 1;
  
@@ -643,6 +645,7 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
         }
  
         mr->state = RXE_MR_STATE_FREE;
+       mr->cur_map_set->valid = mr->next_map_set->valid = false;
         ret = 0;
  
  err_drop_ref:
@@ -679,12 +682,18 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
                 return -EINVAL;
         }
  
+       if (!mr->next_map_set->valid) {
+               pr_warn("%s: map set is not valid\n", __func__);
+               return -EINVAL;
+       }
+
         mr->access = access;
         mr->lkey = (mr->lkey & ~0xff) | key;
         mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0;
         mr->state = RXE_MR_STATE_VALID;
  
         set = mr->cur_map_set;
+       set->valid = false;
         mr->cur_map_set = mr->next_map_set;
         mr->cur_map_set->iova = wqe->wr.wr.reg.mr->iova;
         mr->next_map_set = set;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 58e4412b1d16..4b7ae2d1d921 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -992,6 +992,7 @@ static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
         set->page_shift = ilog2(ibmr->page_size);
         set->page_mask = ibmr->page_size - 1;
         set->offset = set->iova & set->page_mask;
+       set->valid = true;
  
         return n;
  }
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 86068d70cd95..2edf31aab7e1 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -289,6 +289,7 @@ struct rxe_map {
  
  struct rxe_map_set {
         struct rxe_map          **map;
+       bool                    valid;
         u64                     va;
         u64                     iova;
         size_t                  length;
Haris Iqbal May 24, 2022, 10:56 a.m. UTC | #4
On Tue, May 24, 2022 at 6:00 AM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
> Hi Jason & Bob
> CC Guoqing
>
> @Guoqing, It may correlate with your previous bug report: https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
>
>
> It's observed that a same MR in rnbd server will trigger below code
> path:
>   -> rxe_mr_init_fast()
>   |-> alloc map_set() # map_set is uninitialized
>   |...-> rxe_map_mr_sg() # build the map_set
>       |-> rxe_mr_set_page()
>   |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
>                            # we can access host memory(such rxe_mr_copy)
>   |...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
>   |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
>                            # but map_set was not built again
>   |...-> rxe_mr_copy() # kernel crash due to access wild addresses
>                        # that lookup from the map_set
>
> I draft a patch like below for it, but i wonder if it's rxe's responsibility to do such checking.
> Any comments are very welcome.
>
>
>  From e9d0bd821f07f5e049027f07b3ce9dc283624201 Mon Sep 17 00:00:00 2001
> From: Li Zhijian <lizhijian@fujitsu.com>
> Date: Tue, 24 May 2022 10:56:19 +0800
> Subject: [PATCH] RDMA/rxe: check map_set valid when handle IB_WR_REG_MR
>
> It's observed that a same MR in rnbd server will trigger below code
> path:
>   -> rxe_mr_init_fast()
>   |-> alloc map_set() # map_set is uninitialized
>   |...-> rxe_map_mr_sg() # build the map_set
>       |-> rxe_mr_set_page()
>   |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
>                            # we can access host memory(such rxe_mr_copy)
>   |...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
>   |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
>                            # but map_set was not built again
>   |...-> rxe_mr_copy() # kernel crash due to access wild addresses
>                        # that lookup from the map_set
>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_mr.c    | 9 +++++++++
>   drivers/infiniband/sw/rxe/rxe_verbs.c | 1 +
>   drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
>   3 files changed, 11 insertions(+)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 787c7dadc14f..09673d559c06 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -90,6 +90,7 @@ static int rxe_mr_alloc_map_set(int num_map, struct rxe_map_set **setp)
>          if (!set->map)
>                  goto err_free_set;
>
> +       set->valid = false;
>          for (i = 0; i < num_map; i++) {
>                  set->map[i] = kmalloc(sizeof(struct rxe_map), GFP_KERNEL);
>                  if (!set->map[i])
> @@ -216,6 +217,7 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>          }
>
>          set = mr->cur_map_set;
> +       set->valid = true;
>          set->page_shift = PAGE_SHIFT;
>          set->page_mask = PAGE_SIZE - 1;
>
> @@ -643,6 +645,7 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
>          }
>
>          mr->state = RXE_MR_STATE_FREE;
> +       mr->cur_map_set->valid = mr->next_map_set->valid = false;
>          ret = 0;
>
>   err_drop_ref:
> @@ -679,12 +682,18 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>                  return -EINVAL;
>          }
>
> +       if (!mr->next_map_set->valid) {
> +               pr_warn("%s: map set is not valid\n", __func__);
> +               return -EINVAL;
> +       }
> +
>          mr->access = access;
>          mr->lkey = (mr->lkey & ~0xff) | key;
>          mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0;
>          mr->state = RXE_MR_STATE_VALID;
>
>          set = mr->cur_map_set;
> +       set->valid = false;
>          mr->cur_map_set = mr->next_map_set;
>          mr->cur_map_set->iova = wqe->wr.wr.reg.mr->iova;
>          mr->next_map_set = set;
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 58e4412b1d16..4b7ae2d1d921 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -992,6 +992,7 @@ static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
>          set->page_shift = ilog2(ibmr->page_size);
>          set->page_mask = ibmr->page_size - 1;
>          set->offset = set->iova & set->page_mask;
> +       set->valid = true;
>
>          return n;
>   }
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index 86068d70cd95..2edf31aab7e1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -289,6 +289,7 @@ struct rxe_map {
>
>   struct rxe_map_set {
>          struct rxe_map          **map;
> +       bool                    valid;
>          u64                     va;
>          u64                     iova;
>          size_t                  length;
> --
> 2.31.1

Thanks for posting the description and the patch.

We have been facing the exact same issue (only with rxe), and we also
realized that to get around it we have to call ib_map_mr_sg() before
every IB_WR_REG_MR wr, even if we are reusing the MR and simply
invalidating and re-validating it.
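
For illustration, a minimal sketch of that workaround, assuming the usual
kernel verbs flow (the helper name and error handling are ours, not
rnbd's):

#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/*
 * Workaround sketch: always rebuild the page mapping before posting the
 * IB_WR_REG_MR, even when reusing an already-allocated MR.
 */
static int remap_and_reregister(struct ib_qp *qp, struct ib_mr *mr,
				struct scatterlist *sgl, int nents)
{
	struct ib_reg_wr reg_wr = {
		.wr.opcode = IB_WR_REG_MR,
		.wr.send_flags = IB_SEND_SIGNALED,
		.mr = mr,
		.key = mr->rkey,
		.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE,
	};
	int n;

	/*
	 * Rebuilding the page list here is what makes rxe swap in a freshly
	 * built map_set on the following IB_WR_REG_MR.
	 */
	n = ib_map_mr_sg(mr, sgl, nents, NULL, PAGE_SIZE);
	if (n != nents)
		return n < 0 ? n : -EINVAL;

	return ib_post_send(qp, &reg_wr.wr, NULL);
}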

In reference to this, we have 2 questions.

1) This change was made in the following commit.

commit 647bf13ce944f20f7402f281578423a952274e4a
Author: Bob Pearson <rpearsonhpe@gmail.com>
Date:   Tue Sep 14 11:42:06 2021 -0500

    RDMA/rxe: Create duplicate mapping tables for FMRs

    For fast memory regions create duplicate mapping tables so ib_map_mr_sg()
    can build a new mapping table which is then swapped into place
    synchronously with the execution of an IB_WR_REG_MR work request.

    Currently the rxe driver uses the same table for receiving RDMA operations
    and for building new tables in preparation for reusing the MR. This
    exposes users to potentially incorrect results.

    Link: https://lore.kernel.org/r/20210914164206.19768-5-rpearsonhpe@gmail.com
    Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

We tried to understand what potentially incorrect results that commit
message refers to, but were not able to. If someone could shed light on
the scenario where a single mapping table can lead to issues, that would
be great.

2)
We wanted to confirm that, with the above patch, the use case where we
reuse the MR by simply invalidating and re-validating it (IB_WR_REG_MR
wr) is a correct one, and that there is no requirement that
ib_map_mr_sg() be called every time regardless.

(We were planning to send this in the coming days, but wanted the other
discussions to wrap up first. Since the patch got posted and the
discussion started, we thought it better to put this out now.)

Regards

>
>
> On 23/05/2022 22:02, Li, Zhijian wrote:
> >
> > on 2022/5/20 22:45, Jason Gunthorpe wrote:
> >> On Wed, May 18, 2022 at 12:37:25PM +0800, Li Zhijian wrote:
> >>> Below call chains will alloc map_set without fully initializing map_set.
> >>> rxe_mr_init_fast()
> >>>   -> rxe_mr_alloc()
> >>>      -> rxe_mr_alloc_map_set()
> >>>
> >>> Uninitialized values inside struct rxe_map_set are possible to cause
> >>> kernel panic.
> >> If the value is uninitialized then why is 0 an OK value?
> >>
> >> Would be happier to know the exact value that is not initialized
> >
> > Well, good question. After re-think of this issue, it seems this patch wasn't the root cause though it made the crash disappear in some extent.
> >
> > I'm still working on the root cause :)
> >
> > Thanks
> >
> > Zhijian
> >
> >
> >>
> >> Jason
> >
> >
Li Zhijian May 25, 2022, 1:31 a.m. UTC | #5
On 24/05/2022 18:56, Haris Iqbal wrote:
> On Tue, May 24, 2022 at 6:00 AM lizhijian@fujitsu.com
> <lizhijian@fujitsu.com> wrote:
>> Hi Jason & Bob
>> CC Guoqing
>>
>> @Guoqing, It may correlate with your previous bug report: https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
>>
>>
>> It's observed that a same MR in rnbd server will trigger below code
>> path:
>>    -> rxe_mr_init_fast()
>>    |-> alloc map_set() # map_set is uninitialized
>>    |...-> rxe_map_mr_sg() # build the map_set
>>        |-> rxe_mr_set_page()
>>    |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
>>                             # we can access host memory(such rxe_mr_copy)
>>    |...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
>>    |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
>>                             # but map_set was not built again
>>    |...-> rxe_mr_copy() # kernel crash due to access wild addresses
>>                         # that lookup from the map_set
>>
>> I draft a patch like below for it, but i wonder if it's rxe's responsibility to do such checking.
>> Any comments are very welcome.
>>
> Thanks for posting the description and the patch.
>
> We have been facing the exact same issue (only with rxe),

> and we also
> realized that to get around this we will have to call ib_map_mr_sg()
> before every IB_WR_REG_MR wr; even if we are reusing the MR and simply
> invalidating and re-validating them.
Yes, this workaround should work, but it is expensive.
It seems Bob has started a new thread to discuss FMRs:
https://www.spinics.net/lists/linux-rdma/msg110836.html

Thanks
Zhijian

>
> In reference to this, we have 2 questions.
>
> 1) This change was made in the following commit.
>
> commit 647bf13ce944f20f7402f281578423a952274e4a
> Author: Bob Pearson <rpearsonhpe@gmail.com>
> Date:   Tue Sep 14 11:42:06 2021 -0500
>
>      RDMA/rxe: Create duplicate mapping tables for FMRs
>
>      For fast memory regions create duplicate mapping tables so ib_map_mr_sg()
>      can build a new mapping table which is then swapped into place
>      synchronously with the execution of an IB_WR_REG_MR work request.
>
>      Currently the rxe driver uses the same table for receiving RDMA operations
>      and for building new tables in preparation for reusing the MR. This
>      exposes users to potentially incorrect results.
>
>      Link: https://lore.kernel.org/r/20210914164206.19768-5-rpearsonhpe@gmail.com
>      Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> We tried to understand what potential incorrect result that commit
> message talks about, but were not able to. If someone can through
> light into this scenario where single mapping table can result into
> issues, it would be great.
>
> 2)
> We wanted to confirm that, with the above patch, it clearly means that
> the use-case where we reuse the MR, by simply invalidating and
> re-validating (IB_WR_REG_MR wr) is a correct one.
> And there is no requirement saying that ib_map_mr_sg() needs to be
> done everytime regardless.
>
> (We were planning to send this in the coming days, but wanted other
> discussions to get over. Since the patch got posted and discussion
> started, we thought better to put this out.)
>
> Regards
>
>>
>> On 23/05/2022 22:02, Li, Zhijian wrote:
>>> on 2022/5/20 22:45, Jason Gunthorpe wrote:
>>>> On Wed, May 18, 2022 at 12:37:25PM +0800, Li Zhijian wrote:
>>>>> Below call chains will alloc map_set without fully initializing map_set.
>>>>> rxe_mr_init_fast()
>>>>>    -> rxe_mr_alloc()
>>>>>       -> rxe_mr_alloc_map_set()
>>>>>
>>>>> Uninitialized values inside struct rxe_map_set are possible to cause
>>>>> kernel panic.
>>>> If the value is uninitialized then why is 0 an OK value?
>>>>
>>>> Would be happier to know the exact value that is not initialized
>>> Well, good question. After re-think of this issue, it seems this patch wasn't the root cause though it made the crash disappear in some extent.
>>>
>>> I'm still working on the root cause :)
>>>
>>> Thanks
>>>
>>> Zhijian
>>>
>>>
>>>> Jason
>>>
Guoqing Jiang May 25, 2022, 4:11 a.m. UTC | #6
On 5/25/22 9:31 AM, lizhijian@fujitsu.com wrote:
>
> On 24/05/2022 18:56, Haris Iqbal wrote:
>> On Tue, May 24, 2022 at 6:00 AM lizhijian@fujitsu.com
>> <lizhijian@fujitsu.com> wrote:
>>> Hi Jason & Bob
>>> CC Guoqing
>>>
>>> @Guoqing, It may correlate with your previous bug report: https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
>>>
>>>
>>> It's observed that a same MR in rnbd server will trigger below code
>>> path:
>>>     -> rxe_mr_init_fast()
>>>     |-> alloc map_set() # map_set is uninitialized
>>>     |...-> rxe_map_mr_sg() # build the map_set
>>>         |-> rxe_mr_set_page()
>>>     |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
>>>                              # we can access host memory(such rxe_mr_copy)
>>>     |...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
>>>     |...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
>>>                              # but map_set was not built again
>>>     |...-> rxe_mr_copy() # kernel crash due to access wild addresses
>>>                          # that lookup from the map_set

Yes, it could be a similar issue, though I didn't get a kernel crash;
it was FMR related.

https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/#m5dc6898375cedf17fea13ccebf595aac0454c841

> Yes, this workaround should work but expensive.
> It seems Bob has started a new thread to discuss the FMRs in https://www.spinics.net/lists/linux-rdma/msg110836.html

Will give it a try, thanks for the link.

Thanks,
Guoqing

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 60a31b718774..bfd2d9db3deb 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -81,7 +81,7 @@  static int rxe_mr_alloc_map_set(int num_map, struct rxe_map_set **setp)
 	int i;
 	struct rxe_map_set *set;
 
-	set = kmalloc(sizeof(*set), GFP_KERNEL);
+	set = kzalloc(sizeof(*set), GFP_KERNEL);
 	if (!set)
 		goto err_out;
 
@@ -90,7 +90,7 @@  static int rxe_mr_alloc_map_set(int num_map, struct rxe_map_set **setp)
 		goto err_free_set;
 
 	for (i = 0; i < num_map; i++) {
-		set->map[i] = kmalloc(sizeof(struct rxe_map), GFP_KERNEL);
+		set->map[i] = kzalloc(sizeof(struct rxe_map), GFP_KERNEL);
 		if (!set->map[i])
 			goto err_free_map;
 	}