Message ID | 20221215101439.3644683-2-matsuda-daisuke@fujitsu.com (mailing list archive) |
---|---|
State | Superseded |
Series | [1/2] RDMA/rxe: Fix inaccurate constants in rxe_type_info |
Good catch, I hit it as well with the following test:

$ while true; do ./bin/run_tests.py --dev rxe_enp3s0 --gid 1 2>&1 ; done

After running for a while, it throws:

ERROR: test_atomic_cmp_and_swap (tests.test_atomic.AtomicTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lizhijian/rdma-core/tests/test_atomic.py", line 110, in test_atomic_cmp_and_swap
    u.atomic_traffic(**self.traffic_args, send_op=e.IBV_WR_ATOMIC_CMP_AND_SWP)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 1077, in atomic_traffic
    poll_cq(client.cq)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 604, in poll_cq
    raise PyverbsRDMAError(f'Completion status is {wc_status_to_str(wcs[0].status)}')
pyverbs.pyverbs_error.PyverbsRDMAError: Completion status is Remote access error

======================================================================
ERROR: test_atomic_fetch_and_add (tests.test_atomic.AtomicTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lizhijian/rdma-core/tests/test_atomic.py", line 116, in test_atomic_fetch_and_add
    u.atomic_traffic(**self.traffic_args,
  File "/home/lizhijian/rdma-core/tests/utils.py", line 1077, in atomic_traffic
    poll_cq(client.cq)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 604, in poll_cq
    raise PyverbsRDMAError(f'Completion status is {wc_status_to_str(wcs[0].status)}')
pyverbs.pyverbs_error.PyverbsRDMAError: Completion status is Remote access error

======================================================================
ERROR: test_mr_rereg_access_bad_flow (tests.test_mr.MRTest)
Test that cover rereg MR's access with this flow:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lizhijian/rdma-core/tests/test_mr.py", line 129, in test_mr_rereg_access_bad_flow
    u.rdma_traffic(**self.traffic_args, send_op=e.IBV_WR_RDMA_WRITE)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 1031, in rdma_traffic
    poll_cq(client.cq)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 604, in poll_cq
    raise PyverbsRDMAError(f'Completion status is {wc_status_to_str(wcs[0].status)}')
pyverbs.pyverbs_error.PyverbsRDMAError: Completion status is Remote access error

======================================================================
ERROR: test_qp_ex_rc_atomic_cmp_swp (tests.test_qpex.QpExTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lizhijian/rdma-core/tests/test_qpex.py", line 341, in test_qp_ex_rc_atomic_cmp_swp
    u.atomic_traffic(client, server, self.iters, self.gid_index,
  File "/home/lizhijian/rdma-core/tests/utils.py", line 1077, in atomic_traffic
    poll_cq(client.cq)
  File "/home/lizhijian/rdma-core/tests/utils.py", line 604, in poll_cq
    raise PyverbsRDMAError(f'Completion status is {wc_status_to_str(wcs[0].status)}')
pyverbs.pyverbs_error.PyverbsRDMAError: Completion status is Remote access error

======================================================================
ERROR: test_qp_ex_rc_atomic_fetch_add (tests.test_qpex.QpExTestCase)
----------------------------------------------------------------------

After digging into the source, I believe it is the same issue as this one.

BTW, I believe I ran this test before, but I didn't get this error until v6.1+.

Thanks
Zhijian

On 15/12/2022 18:14, Daisuke Matsuda wrote:
> If you create MRs more than 0x10000 times after loading the module,
> responder starts to reply NAKs for RDMA/Atomic operations because of rkey
> violation detected in check_rkey(). The root cause is that rkeys are
> incremented each time a new MR is created and the value overflows into the
> range reserved for MWs.
>
> Fixes: 0994a1bcd5f7 ("RDMA/rxe: Bump up default maximum values used via uverbs")
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_param.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index a754fc902e3d..a3d31bd45895 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -98,10 +98,10 @@ enum rxe_device_param {
>   RXE_MAX_SRQ = DEFAULT_MAX_VALUE - RXE_MIN_SRQ_INDEX,
>
>   RXE_MIN_MR_INDEX = 0x00000001,
> - RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE,
> - RXE_MAX_MR = DEFAULT_MAX_VALUE - RXE_MIN_MR_INDEX,
> - RXE_MIN_MW_INDEX = 0x00010001,
> - RXE_MAX_MW_INDEX = 0x00020000,
> + RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE >> 1,
> + RXE_MAX_MR = 0x00001000,
> + RXE_MIN_MW_INDEX = (DEFAULT_MAX_VALUE >> 1) + 1,
> + RXE_MAX_MW_INDEX = DEFAULT_MAX_VALUE,
>   RXE_MAX_MW = 0x00001000,
>
>   RXE_MAX_PKT_PER_ACK = 64,

On 15/12/2022 18:14, Daisuke Matsuda wrote:
> If you create MRs more than 0x10000 times after loading the module,
> responder starts to reply NAKs for RDMA/Atomic operations because of rkey
> violation detected in check_rkey(). The root cause is that rkeys are
> incremented each time a new MR is created and the value overflows into the
> range reserved for MWs.
>
> Fixes: 0994a1bcd5f7 ("RDMA/rxe: Bump up default maximum values used via uverbs")
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_param.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index a754fc902e3d..a3d31bd45895 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -98,10 +98,10 @@ enum rxe_device_param {
>   RXE_MAX_SRQ = DEFAULT_MAX_VALUE - RXE_MIN_SRQ_INDEX,
>
>   RXE_MIN_MR_INDEX = 0x00000001,
> - RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE,
> - RXE_MAX_MR = DEFAULT_MAX_VALUE - RXE_MIN_MR_INDEX,
> - RXE_MIN_MW_INDEX = 0x00010001,
> - RXE_MAX_MW_INDEX = 0x00020000,
> + RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE >> 1,
> + RXE_MAX_MR = 0x00001000,

May I know why RXE_MAX_MR isn't (RXE_MAX_MR_INDEX - RXE_MIN_MR_INDEX)?
0x00001000 is much less than the previous value.

> + RXE_MIN_MW_INDEX = (DEFAULT_MAX_VALUE >> 1) + 1,
> + RXE_MAX_MW_INDEX = DEFAULT_MAX_VALUE,
>   RXE_MAX_MW = 0x00001000,
>
>   RXE_MAX_PKT_PER_ACK = 64,

On Sat, Dec 17, 2022 7:10 PM Li, Zhijian wrote:
>
> On 15/12/2022 18:14, Daisuke Matsuda wrote:
> > If you create MRs more than 0x10000 times after loading the module,
> > responder starts to reply NAKs for RDMA/Atomic operations because of rkey
> > violation detected in check_rkey(). The root cause is that rkeys are
> > incremented each time a new MR is created and the value overflows into the
> > range reserved for MWs.
> >
> > Fixes: 0994a1bcd5f7 ("RDMA/rxe: Bump up default maximum values used via uverbs")
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >  drivers/infiniband/sw/rxe/rxe_param.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> > index a754fc902e3d..a3d31bd45895 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_param.h
> > +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> > @@ -98,10 +98,10 @@ enum rxe_device_param {
> >   RXE_MAX_SRQ = DEFAULT_MAX_VALUE - RXE_MIN_SRQ_INDEX,
> >
> >   RXE_MIN_MR_INDEX = 0x00000001,
> > - RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE,
> > - RXE_MAX_MR = DEFAULT_MAX_VALUE - RXE_MIN_MR_INDEX,
> > - RXE_MIN_MW_INDEX = 0x00010001,
> > - RXE_MAX_MW_INDEX = 0x00020000,
> > + RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE >> 1,
> > + RXE_MAX_MR = 0x00001000,
>
> May I know why RXE_MAX_MR isn't (RXE_MAX_MR_INDEX - RXE_MIN_MR_INDEX)?
> 0x00001000 is much less than the previous value.

I just thought nobody would use that many MRs at the same time, so I set it
to match RXE_MAX_MW, but I see there was a reason to make it this large.
Cf. https://lore.kernel.org/all/20210927191907.GA1582097@nvidia.com/

I shall change this and submit v2.
Perhaps I should also change RXE_MAX_MW.

Daisuke

>
> > + RXE_MIN_MW_INDEX = (DEFAULT_MAX_VALUE >> 1) + 1,
> > + RXE_MAX_MW_INDEX = DEFAULT_MAX_VALUE,
> >   RXE_MAX_MW = 0x00001000,
> >
> >   RXE_MAX_PKT_PER_ACK = 64,

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index a754fc902e3d..a3d31bd45895 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -98,10 +98,10 @@ enum rxe_device_param {
  RXE_MAX_SRQ = DEFAULT_MAX_VALUE - RXE_MIN_SRQ_INDEX,

  RXE_MIN_MR_INDEX = 0x00000001,
- RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE,
- RXE_MAX_MR = DEFAULT_MAX_VALUE - RXE_MIN_MR_INDEX,
- RXE_MIN_MW_INDEX = 0x00010001,
- RXE_MAX_MW_INDEX = 0x00020000,
+ RXE_MAX_MR_INDEX = DEFAULT_MAX_VALUE >> 1,
+ RXE_MAX_MR = 0x00001000,
+ RXE_MIN_MW_INDEX = (DEFAULT_MAX_VALUE >> 1) + 1,
+ RXE_MAX_MW_INDEX = DEFAULT_MAX_VALUE,
  RXE_MAX_MW = 0x00001000,

  RXE_MAX_PKT_PER_ACK = 64,

If you create MRs more than 0x10000 times after loading the module,
responder starts to reply NAKs for RDMA/Atomic operations because of rkey
violation detected in check_rkey(). The root cause is that rkeys are
incremented each time a new MR is created and the value overflows into the
range reserved for MWs.

Fixes: 0994a1bcd5f7 ("RDMA/rxe: Bump up default maximum values used via uverbs")
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_param.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)