Message ID: 20231013011803.70474-1-yanjun.zhu@intel.com (mailing list archive)
State:      Changes Requested
Series:     [1/1] RDMA/rxe: Fix blktests srp lead kernel panic with 64k page size
On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> The page_size of mr is set in infiniband core originally. In the commit
> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> page_size is also set. Sometimes this will cause a conflict.

I appreciate your prompt action, but I do not think this commit deals with
the root cause. I agree that the problem lies in the rxe driver, but what is
wrong with assigning the actual page size to ibmr.page_size?

IMO, the problem comes from the device attribute of the rxe driver, which is
used in the ulp/srp layer to calculate the page_size.
=====
static int srp_add_one(struct ib_device *device)
{
	struct srp_device *srp_dev;
	struct ib_device_attr *attr = &device->attrs;
<...>
	/*
	 * Use the smallest page size supported by the HCA, down to a
	 * minimum of 4096 bytes. We're unlikely to build large sglists
	 * out of smaller entries.
	 */
	mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
	srp_dev->mr_page_size = 1 << mr_page_shift;
=====
On initialization of the srp driver, mr_page_size is calculated here.
Note that the device attribute is used to calculate the value of the page
shift when the device is trying to use a page size larger than 4096. Since Yi
specified CONFIG_ARM64_64K_PAGES, the system naturally met the condition.

=====
static int srp_map_finish_fr(struct srp_map_state *state,
			     struct srp_request *req,
			     struct srp_rdma_ch *ch, int sg_nents,
			     unsigned int *sg_offset_p)
{
	struct srp_target_port *target = ch->target;
	struct srp_device *dev = target->srp_host->srp_dev;
<...>
	n = ib_map_mr_sg(desc->mr, state->sg, sg_nents, sg_offset_p,
			 dev->mr_page_size);
=====
After that, mr_page_size is presumably passed to the ib_core layer.
=====
int ib_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
		 unsigned int *sg_offset, unsigned int page_size)
{
	if (unlikely(!mr->device->ops.map_mr_sg))
		return -EOPNOTSUPP;

	mr->page_size = page_size;

	return mr->device->ops.map_mr_sg(mr, sg, sg_nents, sg_offset);
}
EXPORT_SYMBOL(ib_map_mr_sg);
=====
Consequently, the page size calculated in the srp driver is set to
ibmr.page_size.

Coming back to rxe, the device attribute is set here:
=====
rxe.c
<...>
/* initialize rxe device parameters */
static void rxe_init_device_param(struct rxe_dev *rxe)
{
	rxe->max_inline_data = RXE_MAX_INLINE_DATA;

	rxe->attr.vendor_id = RXE_VENDOR_ID;
	rxe->attr.max_mr_size = RXE_MAX_MR_SIZE;
	rxe->attr.page_size_cap = RXE_PAGE_SIZE_CAP;
	rxe->attr.max_qp = RXE_MAX_QP;
---
rxe_param.h
<...>
/* default/initial rxe device parameter settings */
enum rxe_device_param {
	RXE_MAX_MR_SIZE = -1ull,
	RXE_PAGE_SIZE_CAP = 0xfffff000,
	RXE_MAX_QP_WR = DEFAULT_MAX_VALUE,
=====
rxe_init_device_param() sets the attributes in rxe_dev->attr, which is later
copied to ib_device->attrs in setup_device()@core/device.c.
See that the lower bound of the page size cap is hardcoded to 4096 bytes. I
suspect this led to an incorrect page_size being set in ibmr.page_size,
resulting in the kernel crash.

I think the rxe driver is made to be able to handle arbitrary page sizes.
Probably, we can just modify RXE_PAGE_SIZE_CAP to fix the issue.
What do you guys think?

Thanks,
Daisuke Matsuda
On 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) wrote:
> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> The page_size of mr is set in infiniband core originally. In the commit
>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>> page_size is also set. Sometimes this will cause a conflict.
>
> I appreciate your prompt action, but I do not think this commit deals with
> the root cause. I agree that the problem lies in the rxe driver, but what
> is wrong with assigning the actual page size to ibmr.page_size?

Please check the source code. ibmr.page_size is assigned in infiniband/core,
and then it is assigned again in rxe. When the two are different, the problem
will occur.

Please add logs to infiniband/core and rxe to check them.

Zhu Yanjun
On Friday, October 13, 2023 9:29 PM Zhu Yanjun wrote:
> Please check the source code. ibmr.page_size is assigned in
> infiniband/core, and then it is assigned again in rxe. When the two are
> different, the problem will occur.

In the first place, the two must always be equal.
Is there any situation where there are two different page sizes for an MR?
I think I have explained that the value assigned in the core layer is wrong
when PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.

As I have no environment to test the kernel panic, it remains speculative,
but I have provided enough information for everyone to rebut my argument.
It is possible I am wrong. I hope someone will tell me specifically where
my guess is wrong, or Yi would kindly take the trouble to verify it.

Thanks,
Daisuke Matsuda
On Fri, Oct 13, 2023 at 9:01 AM Daisuke Matsuda (Fujitsu)
<matsuda-daisuke@fujitsu.com> wrote:
> In the first place, the two must always be equal.
> Is there any situation where there are two different page sizes for an MR?
> I think I have explained that the value assigned in the core layer is wrong
> when PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.
>
> As I have no environment to test the kernel panic, it remains speculative,
> but I have provided enough information for everyone to rebut my argument.
> It is possible I am wrong. I hope someone will tell me specifically where
> my guess is wrong, or Yi would kindly take the trouble to verify it.

I made a quick test.

With Zhu's patch, the problem is fixed.
With your patch, the problem still exists, and other problems occur. I do
not know why. I will continue to make tests.
On Fri, Oct 13, 2023 10:44 PM Rain River <rain.1986.08.12@gmail.com> wrote:
> I made a quick test.
>
> With Zhu's patch, the problem is fixed.
> With your patch, the problem still exists, and other problems occur. I do
> not know why. I will continue to make tests.

Thank you for your time and work.
I will try to find out why there were two different page sizes from
different perspectives. This may take a while because I am busy with other
projects now.
If anybody needs the driver without the crash issue right now, I do not
object to partially undoing "RDMA/rxe: Cleanup page variables in rxe_mr.c"
as Zhu suggests, but that should be an interim measure and not a final
solution.

Thanks,
Daisuke
On 2023/10/16 14:07, Daisuke Matsuda (Fujitsu) wrote:
> Thank you for your time and work.
> I will try to find out why there were two different page sizes from
> different perspectives. This may take a while because I am busy with other
> projects now. If anybody needs the driver without the crash issue right
> now, I do not object to partially undoing "RDMA/rxe: Cleanup page
> variables in rxe_mr.c" as Zhu suggests, but that should be an interim
> measure and not a final solution.

Sorry, it is late to reply. This is a difficult and complicated problem
because it involves rdma and block/blktests.

This commit is based on the fact that the commit 325a7eb85199 ("RDMA/rxe:
Cleanup page variables in rxe_mr.c") causes this problem, so I delved into
that commit. Its core part is the change to ibmr.page_size. From this, I
checked the source code and found that this page_size is also set in
infiniband/core, so I tried to remove the change to ibmr.page_size.

I will continue to delve into this problem and the source code to find the
root cause that you also agree with.

Zhu Yanjun
CC Bart

On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
> IMO, the problem comes from the device attribute of the rxe driver, which
> is used in the ulp/srp layer to calculate the page_size.
> =====
> static int srp_add_one(struct ib_device *device)
> {
> 	struct srp_device *srp_dev;
> 	struct ib_device_attr *attr = &device->attrs;
> <...>
> 	/*
> 	 * Use the smallest page size supported by the HCA, down to a
> 	 * minimum of 4096 bytes. We're unlikely to build large sglists
> 	 * out of smaller entries.
> 	 */
> 	mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);

That sheds light on it for me.
RXE provides attr.page_size_cap (RXE_PAGE_SIZE_CAP), which means it can
support 4K-2G page sizes, so I think a more accurate logic would be:

	if (device supports PAGE_SIZE)
		use PAGE_SIZE
	else if (device supports a 4096 page_size)	/* fall back to 4096 */
		use 4096
	else
		...

> 	srp_dev->mr_page_size = 1 << mr_page_shift;
> =====
> On initialization of the srp driver, mr_page_size is calculated here.
Adding Bart. Hi Bart,

Yi reported a crash[1] when PAGE_SIZE is 64K.
[1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/

The root cause is unknown so far, but I notice that SRP over RXE always uses
an MR with page_size 4K = max(4K, min(device supported page sizes)), even
though the device supports a 64K page size.
* The RXE device supports 4K ~ 2G page sizes.

4024         /*
4025          * Use the smallest page size supported by the HCA, down to a
4026          * minimum of 4096 bytes. We're unlikely to build large sglists
4027          * out of smaller entries.
4028          */
4029         mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
4030         srp_dev->mr_page_size = 1 << mr_page_shift;
4031         srp_dev->mr_page_mask = ~((u64) srp_dev->mr_page_size - 1);
4032         max_pages_per_mr = attr->max_mr_size;

I wonder whether we can use a PAGE_SIZE MR when the device supports it.
BTW, rtrs seems to have the same code. @rtrs

Thanks
Zhijian
On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: > CC Bart > > On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: > > On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: > >> From: Zhu Yanjun<yanjun.zhu@linux.dev> > >> > >> The page_size of mr is set in infiniband core originally. In the commit > >> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the > >> page_size is also set. Sometime this will cause conflict. > > I appreciate your prompt action, but I do not think this commit deals with > > the root cause. I agree that the problem lies in rxe driver, but what is wrong > > with assigning actual page size to ibmr.page_size? > > > > IMO, the problem comes from the device attribute of rxe driver, which is used > > in ulp/srp layer to calculate the page_size. > > ===== > > static int srp_add_one(struct ib_device *device) > > { > > struct srp_device *srp_dev; > > struct ib_device_attr *attr = &device->attrs; > > <...> > > /* > > * Use the smallest page size supported by the HCA, down to a > > * minimum of 4096 bytes. We're unlikely to build large sglists > > * out of smaller entries. > > */ > > mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); > > > You light me up. > RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size That doesn't seem right even in concept. I think the multi-size support in the new xarray code does not work right, just looking at it makes me think it does not work right. It looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will explode because kmap_local_page() does only 4K. If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? Jason
On 10/19/23 23:54, Zhijian Li (Fujitsu) wrote: > Add and Hi Bart, > > Yi reported a crash[1] when PAGE_SIZE is 64K > [1]https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/ > > The root cause is unknown so far, but I notice that SRP over RXE always uses MR with page_size > 4K = MAX(4K, min(device_support_page_size)) even though the device supports 64K page size. > * RXE device support 4K ~ 2G page size > > > 4024 /* > 4025 * Use the smallest page size supported by the HCA, down to a > 4026 * minimum of 4096 bytes. We're unlikely to build large sglists > 4027 * out of smaller entries. > 4028 */ > 4029 mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); > 4030 srp_dev->mr_page_size = 1 << mr_page_shift; > 4031 srp_dev->mr_page_mask = ~((u64) srp_dev->mr_page_size - 1); > 4032 max_pages_per_mr = attr->max_mr_size; > > > I doubt if we can use PAGE_SIZE MR if the device supports it. Hi Zhijian, Thank you for having Cc-ed me. How about changing "12" in the above code into PAGE_SHIFT? Thanks, Bart.
On 21/10/2023 00:21, Bart Van Assche wrote:
> On 10/19/23 23:54, Zhijian Li (Fujitsu) wrote:
>> Add and Hi Bart,
>>
>> Yi reported a crash[1] when PAGE_SIZE is 64K
>> [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>>
>> The root cause is unknown so far, but I notice that SRP over RXE always uses MR with page_size
>> 4K = MAX(4K, min(device_support_page_size)) even though the device supports 64K page size.
>> * RXE device support 4K ~ 2G page size
>>
>> 4024 /*
>> 4025  * Use the smallest page size supported by the HCA, down to a
>> 4026  * minimum of 4096 bytes. We're unlikely to build large sglists
>> 4027  * out of smaller entries.
>> 4028  */
>> 4029 mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
>> 4030 srp_dev->mr_page_size = 1 << mr_page_shift;
>> 4031 srp_dev->mr_page_mask = ~((u64) srp_dev->mr_page_size - 1);
>> 4032 max_pages_per_mr = attr->max_mr_size;
>>
>> I doubt if we can use PAGE_SIZE MR if the device supports it.
>
> Hi Zhijian,
>
> Thank you for having Cc-ed me. How about changing "12" in the above code
> into PAGE_SHIFT?

Yeah, after changing it to PAGE_SHIFT, it works fine and the crash is gone in my aarch64 environment.

Thanks
Zhijian

>
> Thanks,
>
> Bart.
On 20/10/2023 22:01, Jason Gunthorpe wrote: > On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: >> CC Bart >> >> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: >>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: >>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> >>>> >>>> The page_size of mr is set in infiniband core originally. In the commit >>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the >>>> page_size is also set. Sometime this will cause conflict. >>> I appreciate your prompt action, but I do not think this commit deals with >>> the root cause. I agree that the problem lies in rxe driver, but what is wrong >>> with assigning actual page size to ibmr.page_size? >>> >>> IMO, the problem comes from the device attribute of rxe driver, which is used >>> in ulp/srp layer to calculate the page_size. >>> ===== >>> static int srp_add_one(struct ib_device *device) >>> { >>> struct srp_device *srp_dev; >>> struct ib_device_attr *attr = &device->attrs; >>> <...> >>> /* >>> * Use the smallest page size supported by the HCA, down to a >>> * minimum of 4096 bytes. We're unlikely to build large sglists >>> * out of smaller entries. >>> */ >>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >> >> >> You light me up. >> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size > > That doesn't seem right even in concept.> > I think the multi-size support in the new xarray code does not work > right, just looking at it makes me think it does not work right. It > looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will > explode because kmap_local_page() does only 4K. > > If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? > Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). 
>>> import ctypes >>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>> hex(65536) '0x10000' >>> libc.ffs(0x10000) - 1 16 >>> 1 << 16 65536 so mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; So I think Daisuke's patch should work as well. https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 > Jason
On 2023/10/23 11:52, Zhijian Li (Fujitsu) wrote:
>
>
> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>> CC Bart
>>>
>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>
>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>> page_size is also set. Sometime this will cause conflict.
>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>> with assigning actual page size to ibmr.page_size?
>>>>
>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>> in ulp/srp layer to calculate the page_size.
>>>> =====
>>>> static int srp_add_one(struct ib_device *device)
>>>> {
>>>> 	struct srp_device *srp_dev;
>>>> 	struct ib_device_attr *attr = &device->attrs;
>>>> <...>
>>>> 	/*
>>>> 	 * Use the smallest page size supported by the HCA, down to a
>>>> 	 * minimum of 4096 bytes. We're unlikely to build large sglists
>>>> 	 * out of smaller entries.
>>>> 	 */
>>>> 	mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
>>>
>>>
>>> You light me up.
>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>
>> That doesn't seem right even in concept.
>>
>> I think the multi-size support in the new xarray code does not work
>> right, just looking at it makes me think it does not work right. It
>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>> explode because kmap_local_page() does only 4K.
>>
>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work?
>>
>
> Yeah, this should work (even though I only verified hardcoding mr_page_shift to PAGE_SHIFT).
> >>> import ctypes
> >>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
> >>> hex(65536)
> '0x10000'
> >>> libc.ffs(0x10000) - 1
> 16
> >>> 1 << 16
> 65536
>
> so
> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>
>
> So I think Daisuke's patch should work as well.
>
> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1

About this patch, please run the full tests. Please ensure that it will not introduce any potential problems. As discussed, this patch may have side effects.

Zhu Yanjun

>
>
>> Jason
On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu) <lizhijian@fujitsu.com> wrote: > > > > On 20/10/2023 22:01, Jason Gunthorpe wrote: > > On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: > >> CC Bart > >> > >> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: > >>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: > >>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> > >>>> > >>>> The page_size of mr is set in infiniband core originally. In the commit > >>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the > >>>> page_size is also set. Sometime this will cause conflict. > >>> I appreciate your prompt action, but I do not think this commit deals with > >>> the root cause. I agree that the problem lies in rxe driver, but what is wrong > >>> with assigning actual page size to ibmr.page_size? > >>> > >>> IMO, the problem comes from the device attribute of rxe driver, which is used > >>> in ulp/srp layer to calculate the page_size. > >>> ===== > >>> static int srp_add_one(struct ib_device *device) > >>> { > >>> struct srp_device *srp_dev; > >>> struct ib_device_attr *attr = &device->attrs; > >>> <...> > >>> /* > >>> * Use the smallest page size supported by the HCA, down to a > >>> * minimum of 4096 bytes. We're unlikely to build large sglists > >>> * out of smaller entries. > >>> */ > >>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); > >> > >> > >> You light me up. > >> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size > > > > That doesn't seem right even in concept.> > > I think the multi-size support in the new xarray code does not work > > right, just looking at it makes me think it does not work right. It > > looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will > > explode because kmap_local_page() does only 4K. > > > > If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? 
> > > > Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). Hi Zhijian Did you try blktests nvme/rdma use_rxe on your environment, it still failed on my side. # use_rxe=1 nvme_trtype=rdma ./check nvme/003 nvme/003 (test if we're sending keep-alives to a discovery controller) [failed] runtime 12.179s ... 11.941s --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 05:52:27.882759168 -0400 @@ -1,3 +1,3 @@ Running nvme/003 -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) Test complete [ 7033.431910] rdma_rxe: loaded [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 [ 7033.510969] infiniband enP2p1s0v0_rxe: set active [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 [ 7033.549301] loop0: detected capacity change from 0 to 2097152 [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770 [ 7046.317861] rdma_rxe: unloaded > > >>> import ctypes > >>> libc = ctypes.cdll.LoadLibrary('libc.so.6') > >>> hex(65536) > '0x10000' > >>> libc.ffs(0x10000) - 1 > 16 > >>> 1 << 16 > 65536 > > so > mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; > > > So I think Daisuke's patch should work as well. > > https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 > > > > Jason -- Best Regards, Yi Zhang
On 23/10/2023 18:45, Yi Zhang wrote: > On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu) > <lizhijian@fujitsu.com> wrote: >> >> >> >> On 20/10/2023 22:01, Jason Gunthorpe wrote: >>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: >>>> CC Bart >>>> >>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: >>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: >>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> >>>>>> >>>>>> The page_size of mr is set in infiniband core originally. In the commit >>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the >>>>>> page_size is also set. Sometime this will cause conflict. >>>>> I appreciate your prompt action, but I do not think this commit deals with >>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong >>>>> with assigning actual page size to ibmr.page_size? >>>>> >>>>> IMO, the problem comes from the device attribute of rxe driver, which is used >>>>> in ulp/srp layer to calculate the page_size. >>>>> ===== >>>>> static int srp_add_one(struct ib_device *device) >>>>> { >>>>> struct srp_device *srp_dev; >>>>> struct ib_device_attr *attr = &device->attrs; >>>>> <...> >>>>> /* >>>>> * Use the smallest page size supported by the HCA, down to a >>>>> * minimum of 4096 bytes. We're unlikely to build large sglists >>>>> * out of smaller entries. >>>>> */ >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >>>> >>>> >>>> You light me up. >>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size >>> >>> That doesn't seem right even in concept.> >>> I think the multi-size support in the new xarray code does not work >>> right, just looking at it makes me think it does not work right. It >>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will >>> explode because kmap_local_page() does only 4K. >>> >>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? 
>>> >> >> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). > > Hi Zhijian > > Did you try blktests nvme/rdma use_rxe on your environment, it still > failed on my side. > Thanks for your testing. I just did that, it also failed in my environment. After a glance debug, I found there are other places still registered 4K MR. I will dig into it later. Thanks Zhijian > # use_rxe=1 nvme_trtype=rdma ./check nvme/003 > nvme/003 (test if we're sending keep-alives to a discovery controller) [failed] > runtime 12.179s ... 11.941s > --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 > +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 > 05:52:27.882759168 -0400 > @@ -1,3 +1,3 @@ > Running nvme/003 > -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) > +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) > Test complete > > [ 7033.431910] rdma_rxe: loaded > [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 > [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 > [ 7033.510969] infiniband enP2p1s0v0_rxe: set active > [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 > [ 7033.549301] loop0: detected capacity change from 0 to 2097152 > [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) > [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 > [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 > [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770 > [ 7046.317861] rdma_rxe: unloaded > > >> >>>>> import ctypes >>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>>>> hex(65536) >> '0x10000' >>>>> libc.ffs(0x10000) - 1 >> 16 >>>>> 1 << 16 >> 65536 >> >> so >> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; >> >> >> So I think Daisuke's patch should work as well. 
>> >> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 >> >> >>> Jason > > > > -- > Best Regards, > Yi Zhang >
On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote: > > > On 23/10/2023 18:45, Yi Zhang wrote: >> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu) >> <lizhijian@fujitsu.com> wrote: >>> >>> >>> >>> On 20/10/2023 22:01, Jason Gunthorpe wrote: >>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: >>>>> CC Bart >>>>> >>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: >>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: >>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> >>>>>>> >>>>>>> The page_size of mr is set in infiniband core originally. In the commit >>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the >>>>>>> page_size is also set. Sometime this will cause conflict. >>>>>> I appreciate your prompt action, but I do not think this commit deals with >>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong >>>>>> with assigning actual page size to ibmr.page_size? >>>>>> >>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used >>>>>> in ulp/srp layer to calculate the page_size. >>>>>> ===== >>>>>> static int srp_add_one(struct ib_device *device) >>>>>> { >>>>>> struct srp_device *srp_dev; >>>>>> struct ib_device_attr *attr = &device->attrs; >>>>>> <...> >>>>>> /* >>>>>> * Use the smallest page size supported by the HCA, down to a >>>>>> * minimum of 4096 bytes. We're unlikely to build large sglists >>>>>> * out of smaller entries. >>>>>> */ >>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >>>>> >>>>> >>>>> You light me up. >>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size >>>> >>>> That doesn't seem right even in concept.> >>>> I think the multi-size support in the new xarray code does not work >>>> right, just looking at it makes me think it does not work right. It >>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will >>>> explode because kmap_local_page() does only 4K. 
>>>>
>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work?
>>>>
>>>
>>> Yeah, this should work (even though I only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>
>> Hi Zhijian
>>
>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>> failed on my side.
>>
>
> Thanks for your testing.
> I just did that, it also failed in my environment. After a glance debug, I found there are
> other places still registered 4K MR. I will dig into it later.

nvme intends to register a 4K MR, but it should work IMO; at least RXE should handle it correctly.

1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
1294 		struct nvme_rdma_request *req, struct nvme_command *c,
1295 		int count)
1296 {
1297 	struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
1298 	int nr;
1299
1300 	req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
1301 	if (WARN_ON_ONCE(!req->mr))
1302 		return -EAGAIN;
1303
1304 	/*
1305 	 * Align the MR to a 4K page size to match the ctrl page size and
1306 	 * the block virtual boundary.
1307 	 */
1308 	nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
1309 			SZ_4K);

Anyway, I will go ahead. If you have any thoughts, please let me know.

>
> Thanks
> Zhijian
>
>
>
>
>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003
>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>> runtime 12.179s ...
11.941s >> --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 >> +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 >> 05:52:27.882759168 -0400 >> @@ -1,3 +1,3 @@ >> Running nvme/003 >> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) >> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) >> Test complete >> >> [ 7033.431910] rdma_rxe: loaded >> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 >> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 >> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active >> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 >> [ 7033.549301] loop0: detected capacity change from 0 to 2097152 >> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) >> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 >> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 >> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770 >> [ 7046.317861] rdma_rxe: unloaded >> >> >>> >>>>>> import ctypes >>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>>>>> hex(65536) >>> '0x10000' >>>>>> libc.ffs(0x10000) - 1 >>> 16 >>>>>> 1 << 16 >>> 65536 >>> >>> so >>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; >>> >>> >>> So I think Daisuke's patch should work as well. >>> >>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 >>> >>> >>>> Jason >> >> >> >> -- >> Best Regards, >> Yi Zhang
The root cause is that rxe_set_page() goes wrong when mr.page_size != PAGE_SIZE: it stores only the *page in the xarray, so the sub-page offset gets lost.

For example,

store process:
	page_size = 0x1000;
	PAGE_SIZE = 0x10000;
	va0 = 0xffff000020651000;
	page_offset = 0 = va & (page_size - 1);
	page = va_to_page(va);
	xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);

load process:
	page = xa_load(&mr->page_list, index);
	page_va = kmap_local_page(page) --> it must be a PAGE_SIZE-aligned value, assume it is 0xffff000020650000
	va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;

Obviously, *va0 != va1*; the page_offset got lost.

How to fix:
- revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
- don't allow a ULP to register an MR with page_size != PAGE_SIZE?

Thanks
Zhijian

On 24/10/2023 17:13, Li Zhijian wrote:
>
>
> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>
>>
>> On 23/10/2023 18:45, Yi Zhang wrote:
>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>> <lizhijian@fujitsu.com> wrote:
>>>>
>>>>
>>>>
>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>>>> CC Bart
>>>>>>
>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>>>
>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>>>> with assigning actual page size to ibmr.page_size?
>>>>>>>
>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>>>> in ulp/srp layer to calculate the page_size.
>>>>>>> ===== >>>>>>> static int srp_add_one(struct ib_device *device) >>>>>>> { >>>>>>> struct srp_device *srp_dev; >>>>>>> struct ib_device_attr *attr = &device->attrs; >>>>>>> <...> >>>>>>> /* >>>>>>> * Use the smallest page size supported by the HCA, down to a >>>>>>> * minimum of 4096 bytes. We're unlikely to build large sglists >>>>>>> * out of smaller entries. >>>>>>> */ >>>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >>>>>> >>>>>> >>>>>> You light me up. >>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size >>>>> >>>>> That doesn't seem right even in concept.> >>>>> I think the multi-size support in the new xarray code does not work >>>>> right, just looking at it makes me think it does not work right. It >>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will >>>>> explode because kmap_local_page() does only 4K. >>>>> >>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? >>>>> >>>> >>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). >>> >>> Hi Zhijian >>> >>> Did you try blktests nvme/rdma use_rxe on your environment, it still >>> failed on my side. >>> >> >> Thanks for your testing. >> I just did that, it also failed in my environment. After a glance debug, I found there are >> other places still registered 4K MR. I will dig into it later. > > > nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly. 
> > > 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue, > 1294 struct nvme_rdma_request *req, struct nvme_command *c, > 1295 int count) > 1296 { > 1297 struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl; > 1298 int nr; > 1299 > 1300 req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs); > 1301 if (WARN_ON_ONCE(!req->mr)) > 1302 return -EAGAIN; > 1303 > 1304 /* > 1305 * Align the MR to a 4K page size to match the ctrl page size and > 1306 * the block virtual boundary. > 1307 */ > 1308 nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL, > 1309 SZ_4K); > > > Anyway, i will go ahead. if you have any thought, please let me know. > > > >> >> Thanks >> Zhijian >> >> >> >> >>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003 >>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed] >>> runtime 12.179s ... 11.941s >>> --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 >>> +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 >>> 05:52:27.882759168 -0400 >>> @@ -1,3 +1,3 @@ >>> Running nvme/003 >>> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) >>> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) >>> Test complete >>> >>> [ 7033.431910] rdma_rxe: loaded >>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 >>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 >>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active >>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 >>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152 >>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) >>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 >>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 >>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770 >>> [ 7046.317861] rdma_rxe: unloaded >>> 
>>> >>>> >>>>>>> import ctypes >>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>>>>>> hex(65536) >>>> '0x10000' >>>>>>> libc.ffs(0x10000) - 1 >>>> 16 >>>>>>> 1 << 16 >>>> 65536 >>>> >>>> so >>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; >>>> >>>> >>>> So I think Daisuke's patch should work as well. >>>> >>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 >>>> >>>> >>>>> Jason >>> >>> >>> >>> -- >>> Best Regards, >>> Yi Zhang
On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
> The root cause is that
>
> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> So the offset will get lost.
>
> For example,
> store process:
> page_size = 0x1000;
> PAGE_SIZE = 0x10000;
> va0 = 0xffff000020651000;
> page_offset = 0 = va & (page_size - 1);
> page = va_to_page(va);
> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>
> load_process:
> page = xa_load(&mr->page_list, index);
> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>
> Obviously, *va0 != va1*, page_offset get lost.
>
>
> How to fix:
> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> - don't allow ulp registering mr.page_size != PAGE_SIZE ?

Let's do the second one, please. Most devices only support PAGE_SIZE anyhow.

Jason
On 2023/10/26 19:42, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>> The root cause is that
>>
>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>> So the offset will get lost.
>>
>> For example,
>> store process:
>> page_size = 0x1000;
>> PAGE_SIZE = 0x10000;
>> va0 = 0xffff000020651000;
>> page_offset = 0 = va & (page_size - 1);
>> page = va_to_page(va);
>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>
>> load_process:
>> page = xa_load(&mr->page_list, index);
>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>
>> Obviously, *va0 != va1*, page_offset get lost.
>>
>>
>> How to fix:
>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>
> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.

Normally, page_size is PAGE_SIZE or the size of the whole compound page (in the latest kernel versions, the size of the folio). When a compound page or folio is taken into account, page_size is not equal to PAGE_SIZE.

If a ULP uses compound pages or folios, a similar problem will occur again. Since this problem involves both RDMA and block, it will cost a lot of time and effort to debug. If we fix it now, it will save a lot of time and effort for future ULPs that use folios. Because folios have a lot of benefits, sooner or later folios or a similar MMU technology will be widely used in the kernel, and at that point this problem would appear again.

We can do something for the future folio ^_^

Zhu Yanjun

>
> Jason
On 10/26/23 02:05, Zhijian Li (Fujitsu) wrote: > The root cause is that > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. > So the offset will get lost. > > For example, > store process: > page_size = 0x1000; > PAGE_SIZE = 0x10000; > va0 = 0xffff000020651000; > page_offset = 0 = va & (page_size - 1); > page = va_to_page(va); > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); > > load_process: > page = xa_load(&mr->page_list, index); > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; > > Obviously, *va0 != va1*, page_offset get lost. > > > How to fix: > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") > - don't allow ulp registering mr.page_size != PAGE_SIZE ? Thank you Zhijian for this root-cause analysis. Hardware RDMA adapters may not support the host page size (PAGE_SIZE) so disallowing mr.page_size != PAGE_SIZE would be wrong. If the rxe driver only supports mr.page_size == PAGE_SIZE, does this mean that RXE_PAGE_SIZE_CAP should be changed from 0xfffff000 into PAGE_SHIFT? Thanks, Bart.
On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote: > On 10/26/23 02:05, Zhijian Li (Fujitsu) wrote: > > The root cause is that > > > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. > > So the offset will get lost. > > > > For example, > > store process: > > page_size = 0x1000; > > PAGE_SIZE = 0x10000; > > va0 = 0xffff000020651000; > > page_offset = 0 = va & (page_size - 1); > > page = va_to_page(va); > > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); > > > > load_process: > > page = xa_load(&mr->page_list, index); > > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 > > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; > > > > Obviously, *va0 != va1*, page_offset get lost. > > > > > > How to fix: > > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") > > - don't allow ulp registering mr.page_size != PAGE_SIZE ? > > Thank you Zhijian for this root-cause analysis. > > Hardware RDMA adapters may not support the host page size (PAGE_SIZE) > so disallowing mr.page_size != PAGE_SIZE would be wrong. PAGE_SIZE must always be a valid way to construct a MR. We have code in all the DMA drivers to break PAGE_SIZE chunks to something smaller when building MRs. rxe is missing this now with the xarray stuff since we can't encode the struct page offset and a struct page in a single xarray entry. It would have to go back to encoding pfns and doing pfn to page. > If the rxe driver only supports mr.page_size == PAGE_SIZE, does this > mean that RXE_PAGE_SIZE_CAP should be changed from > 0xfffff000 into PAGE_SHIFT? Yes Jason
On 10/26/23 06:43, Jason Gunthorpe wrote: > On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote: >> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this >> mean that RXE_PAGE_SIZE_CAP should be changed from >> 0xfffff000 into PAGE_SHIFT? > > Yes Bob, do you plan to convert the above change into a patch or do you perhaps expect me to do that? Thanks, Bart.
On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote: > 在 2023/10/26 19:42, Jason Gunthorpe 写道: > > On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote: > > > The root cause is that > > > > > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. > > > So the offset will get lost. > > > > > > For example, > > > store process: > > > page_size = 0x1000; > > > PAGE_SIZE = 0x10000; > > > va0 = 0xffff000020651000; > > > page_offset = 0 = va & (page_size - 1); > > > page = va_to_page(va); > > > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); > > > > > > load_process: > > > page = xa_load(&mr->page_list, index); > > > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 > > > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; > > > > > > Obviously, *va0 != va1*, page_offset get lost. > > > > > > > > > How to fix: > > > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") > > > - don't allow ulp registering mr.page_size != PAGE_SIZE ? > > > > Lets do the second one please. Most devices only support PAGE_SIZE anyhow. > > Normally page_size is PAGE_SIZE or the size of the whole compound page (in > the latest kernel version, it is the size of folio). When compound page or > folio is taken into account, the page_size is not equal to > PAGE_SIZE. folios are always multiples of PAGE_SIZE. rxe splits everything into PAGE_SIZE units in the xarray. > If the ULP uses the compound page or folio, the similar problem will occur > again. No, it won't. We never store folios in the xarray. Jason
On Fri, Oct 27, 2023 6:48 AM Bart Van Assche wrote:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
> > On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
> >> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
> >> mean that RXE_PAGE_SIZE_CAP should be changed from
> >> 0xfffff000 into PAGE_SHIFT?

Some drivers pad the upper bits with 1, and others with 0. Is there any rule? I think both are practically allowed, but we should try to eliminate the inconsistency.

> >
> > Yes
>
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?

It looks like Bob has not been involved in this thread.

Thanks,
Daisuke Matsuda

>
> Thanks,
>
> Bart.
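On the padding question: page_size_cap is a bitmask in which bit n set means a 2**n page size is supported, and srp's mr_page_shift takes the lowest set bit, clamped to 4K. A sketch of that calculation (ffs() reimplemented here because this is userspace Python, mirroring the srp_add_one() line quoted earlier in the thread):

```python
def ffs(x):
    """1-based index of the lowest set bit, like libc ffs(); 0 if x == 0."""
    return (x & -x).bit_length()

def srp_mr_page_shift(page_size_cap):
    # mirrors srp_add_one(): max(12, ffs(attr->page_size_cap) - 1)
    return max(12, ffs(page_size_cap) - 1)

print(srp_mr_page_shift(0xfffff000))  # 12 -> 4K MR pages from the current rxe cap
print(srp_mr_page_shift(1 << 16))     # 16 -> 64K MR pages if the cap is PAGE_SIZE only
```

Since ffs() only looks at the lowest set bit, padding the upper bits with 1 versus 0 does not change this particular computation; it only matters to consumers that look at the high bits.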
On 2023/10/27 7:23, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
>> On 2023/10/26 19:42, Jason Gunthorpe wrote:
>>> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>>>> The root cause is that
>>>>
>>>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>>> So the offset will get lost.
>>>>
>>>> For example,
>>>> store process:
>>>> page_size = 0x1000;
>>>> PAGE_SIZE = 0x10000;
>>>> va0 = 0xffff000020651000;
>>>> page_offset = 0 = va & (page_size - 1);
>>>> page = va_to_page(va);
>>>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>>>
>>>> load_process:
>>>> page = xa_load(&mr->page_list, index);
>>>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>>>
>>>> Obviously, *va0 != va1*, page_offset get lost.
>>>>
>>>>
>>>> How to fix:
>>>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
>> Normally page_size is PAGE_SIZE or the size of the whole compound page (in
>> the latest kernel version, it is the size of folio). When compound page or
>> folio is taken into account, the page_size is not equal to
>> PAGE_SIZE.
> folios are always multiples of PAGE_SIZE. rxe splits everything into
> PAGE_SIZE units in the xarray.

Sure, thanks. A folio is made up of multiple base pages, so its size is a multiple of PAGE_SIZE. This page size is set in the infiniband core and in rxe. Hopefully no problem will occur when a folio or compound page is used in the ULP.

Zhu Yanjun

>
>> If the ULP uses the compound page or folio, the similar problem will occur
>> again.
> No, it won't. We never store folios in the xarray.
>
> Jason
On 2023/10/27 5:47, Bart Van Assche wrote:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
>> On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
>>> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
>>> mean that RXE_PAGE_SIZE_CAP should be changed from
>>> 0xfffff000 into PAGE_SHIFT?
>>
>> Yes
>
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?

Zhijian has done a lot of work on this problem, and he found the root cause. Perhaps he should file a patch for it?

Zhu Yanjun

>
> Thanks,
>
> Bart.
>
On 2023/10/27 7:23, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
>> On 2023/10/26 19:42, Jason Gunthorpe wrote:
>>> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>>>> The root cause is that
>>>>
>>>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>>> So the offset will get lost.
>>>>
>>>> For example,
>>>> store process:
>>>> page_size = 0x1000;
>>>> PAGE_SIZE = 0x10000;
>>>> va0 = 0xffff000020651000;
>>>> page_offset = 0 = va & (page_size - 1);
>>>> page = va_to_page(va);
>>>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>>>
>>>> load_process:
>>>> page = xa_load(&mr->page_list, index);
>>>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>>>
>>>> Obviously, *va0 != va1*, page_offset get lost.
>>>>
>>>>
>>>> How to fix:
>>>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
>> Normally page_size is PAGE_SIZE or the size of the whole compound page (in
>> the latest kernel version, it is the size of folio). When compound page or
>> folio is taken into account, the page_size is not equal to
>> PAGE_SIZE.
> folios are always multiples of PAGE_SIZE. rxe splits everything into
> PAGE_SIZE units in the xarray.
>
>> If the ULP uses the compound page or folio, the similar problem will occur
>> again.
> No, it won't. We never store folios in the xarray.

Sure.

I mean that if a folio is used in the ULP, the page size is set to a multiple of PAGE_SIZE, while in rxe the page size is set to PAGE_SIZE. So the page size in the ULP differs from the page size in rxe. I am not sure whether a similar problem can still occur; I hope it will not, even with folios.

Zhu Yanjun

>
> Jason
On 27/10/2023 05:47, Bart Van Assche wrote:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
>> On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
>>> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
>>> mean that RXE_PAGE_SIZE_CAP should be changed from
>>> 0xfffff000 into PAGE_SHIFT?
>>
>> Yes
>
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?

I just cooked up the patches; they fix the SRP problem, but nvme still fails.

>
> Thanks,
>
> Bart.
>
On Fri, Oct 27, 2023 at 12:01:47PM +0800, Zhu Yanjun wrote: > > 在 2023/10/27 7:23, Jason Gunthorpe 写道: > > On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote: > > > 在 2023/10/26 19:42, Jason Gunthorpe 写道: > > > > On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote: > > > > > The root cause is that > > > > > > > > > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. > > > > > So the offset will get lost. > > > > > > > > > > For example, > > > > > store process: > > > > > page_size = 0x1000; > > > > > PAGE_SIZE = 0x10000; > > > > > va0 = 0xffff000020651000; > > > > > page_offset = 0 = va & (page_size - 1); > > > > > page = va_to_page(va); > > > > > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); > > > > > > > > > > load_process: > > > > > page = xa_load(&mr->page_list, index); > > > > > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 > > > > > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; > > > > > > > > > > Obviously, *va0 != va1*, page_offset get lost. > > > > > > > > > > > > > > > How to fix: > > > > > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") > > > > > - don't allow ulp registering mr.page_size != PAGE_SIZE ? > > > > Lets do the second one please. Most devices only support PAGE_SIZE anyhow. > > > Normally page_size is PAGE_SIZE or the size of the whole compound page (in > > > the latest kernel version, it is the size of folio). When compound page or > > > folio is taken into account, the page_size is not equal to > > > PAGE_SIZE. > > folios are always multiples of PAGE_SIZE. rxe splits everything into > > PAGE_SIZE units in the xarray. > > > > > If the ULP uses the compound page or folio, the similar problem will occur > > > again. > > No, it won't. We never store folios in the xarray. > > Sure. 
> > I mean, in ULP, if folio is used, the page size is set to multiple > PAGE_SIZE, but in RXE, the page size is set to PAGE_SIZE. > > So the page size in ULP is different with the page size in RXE. There is no such thing as a "page size" in the ULP. rxe is the thing that keeps things in PAGE_SIZE units, and it should be fragmenting whatever the ulp gives into that. The ULP must simply give virtually contiguous runs of memory that are PAGE_SIZE aligned Jason
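Jason's description — rxe fragments whatever virtually contiguous, PAGE_SIZE-aligned run the ULP hands it into PAGE_SIZE units — can be sketched as follows (an illustrative userspace model, not driver code; `fragment()` is a made-up helper standing in for the xarray population loop):

```python
PAGE_SIZE = 0x10000   # 64K pages, as with CONFIG_ARM64_64K_PAGES

def fragment(start, length):
    # the run the ULP supplies must be PAGE_SIZE aligned and contiguous
    assert start % PAGE_SIZE == 0 and length % PAGE_SIZE == 0
    return [start + off for off in range(0, length, PAGE_SIZE)]

# a 256K folio-backed run becomes four PAGE_SIZE entries
print([hex(e) for e in fragment(0x20000, 4 * PAGE_SIZE)])
# ['0x20000', '0x30000', '0x40000', '0x50000']
```

A folio-sized run is just a longer contiguous input here, which is why folios do not reintroduce the offset problem: every stored unit is still exactly PAGE_SIZE.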
On 2023/10/26 17:05, Zhijian Li (Fujitsu) wrote: > The root cause is that > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. > So the offset will get lost. It is an interesting root cause. page_size is 0x1000, PAGE_SIZE is 0x10000. Zhu Yanjun > > For example, > store process: > page_size = 0x1000; > PAGE_SIZE = 0x10000; > va0 = 0xffff000020651000; > page_offset = 0 = va & (page_size - 1); > page = va_to_page(va); > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); > > load_process: > page = xa_load(&mr->page_list, index); > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; > > Obviously, *va0 != va1*, page_offset get lost. > > > How to fix: > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") > - don't allow ulp registering mr.page_size != PAGE_SIZE ? > > > > Thanks > Zhijian > > > On 24/10/2023 17:13, Li Zhijian wrote: >> >> >> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote: >>> >>> >>> On 23/10/2023 18:45, Yi Zhang wrote: >>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu) >>>> <lizhijian@fujitsu.com> wrote: >>>>> >>>>> >>>>> >>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote: >>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: >>>>>>> CC Bart >>>>>>> >>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: >>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: >>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> >>>>>>>>> >>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit >>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the >>>>>>>>> page_size is also set. Sometime this will cause conflict. >>>>>>>> I appreciate your prompt action, but I do not think this commit deals with >>>>>>>> the root cause.
I agree that the problem lies in rxe driver, but what is wrong >>>>>>>> with assigning actual page size to ibmr.page_size? >>>>>>>> >>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used >>>>>>>> in ulp/srp layer to calculate the page_size. >>>>>>>> ===== >>>>>>>> static int srp_add_one(struct ib_device *device) >>>>>>>> { >>>>>>>> struct srp_device *srp_dev; >>>>>>>> struct ib_device_attr *attr = &device->attrs; >>>>>>>> <...> >>>>>>>> /* >>>>>>>> * Use the smallest page size supported by the HCA, down to a >>>>>>>> * minimum of 4096 bytes. We're unlikely to build large sglists >>>>>>>> * out of smaller entries. >>>>>>>> */ >>>>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >>>>>>> >>>>>>> >>>>>>> You light me up. >>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size >>>>>> >>>>>> That doesn't seem right even in concept.> >>>>>> I think the multi-size support in the new xarray code does not work >>>>>> right, just looking at it makes me think it does not work right. It >>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will >>>>>> explode because kmap_local_page() does only 4K. >>>>>> >>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? >>>>>> >>>>> >>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). >>>> >>>> Hi Zhijian >>>> >>>> Did you try blktests nvme/rdma use_rxe on your environment, it still >>>> failed on my side. >>>> >>> >>> Thanks for your testing. >>> I just did that, it also failed in my environment. After a glance debug, I found there are >>> other places still registered 4K MR. I will dig into it later. >> >> >> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly. 
>> >> >> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue, >> 1294 struct nvme_rdma_request *req, struct nvme_command *c, >> 1295 int count) >> 1296 { >> 1297 struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl; >> 1298 int nr; >> 1299 >> 1300 req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs); >> 1301 if (WARN_ON_ONCE(!req->mr)) >> 1302 return -EAGAIN; >> 1303 >> 1304 /* >> 1305 * Align the MR to a 4K page size to match the ctrl page size and >> 1306 * the block virtual boundary. >> 1307 */ >> 1308 nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL, >> 1309 SZ_4K); >> >> >> Anyway, i will go ahead. if you have any thought, please let me know. >> >> >> >>> >>> Thanks >>> Zhijian >>> >>> >>> >>> >>>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003 >>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed] >>>> runtime 12.179s ... 11.941s >>>> --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 >>>> +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 >>>> 05:52:27.882759168 -0400 >>>> @@ -1,3 +1,3 @@ >>>> Running nvme/003 >>>> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) >>>> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) >>>> Test complete >>>> >>>> [ 7033.431910] rdma_rxe: loaded >>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 >>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 >>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active >>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 >>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152 >>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) >>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 >>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 >>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 
ret=16770 >>>> [ 7046.317861] rdma_rxe: unloaded >>>> >>>> >>>>> >>>>>>>> import ctypes >>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>>>>>>> hex(65536) >>>>> '0x10000' >>>>>>>> libc.ffs(0x10000) - 1 >>>>> 16 >>>>>>>> 1 << 16 >>>>> 65536 >>>>> >>>>> so >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; >>>>> >>>>> >>>>> So I think Daisuke's patch should work as well. >>>>> >>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 >>>>> >>>>> >>>>>> Jason >>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Yi Zhang
On Tue, Oct 31, 2023 at 3:53 PM Greg Sword <gregsword0@gmail.com> wrote: > > > > On Tue, Oct 31, 2023 at 2:03 PM Greg Sword <gregsword0@gmail.com> wrote: >> >> >> >> >> On Tue, Oct 31, 2023 at 9:40 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: >>> >>> On 2023/10/26 17:05, Zhijian Li (Fujitsu) wrote: >>> > The root cause is that >>> > >>> > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray. >>> > So the offset will get lost. >>> >>> It is interesting root cause. >>> page_size is 0x1000, PAGE_SIZE 0x10000. >>> >>> Zhu Yanjun >>> >>> > >>> > For example, >>> > store process: >>> > page_size = 0x1000; >>> > PAGE_SIZE = 0x10000; >> >> >> page_size is 0x1000, PAGE_SIZE is 0x10000. >> >> Is it correct? page_size should be PAGE_SIZE. >> That is, page_size should be greater than PAGE_SIZE. >> It is not reasonable now. >> >> The root cause is to find out why page_size is set to 4K (0x1000). On an ARM64 host with CONFIG_ARM64_64K_PAGES enabled, >> the page_size should not be 4K, it should be 64K. >> >> So Zhijian's commit seems not to be correct. > > linux-rdma not reached. Resending it again. > Because rxe splits everything into PAGE_SIZE units in the xarray, it should only support the PAGE_SIZE page size. With CONFIG_ARM64_64K_PAGES, > PAGE_SIZE is 64K. > > But from the NVMe side, the page size is 4K bytes, not 64K. > > In the function ib_map_mr_sg, the page size is fixed at SZ_4K. So the root cause is not in rxe. Before rxe is called, the capability of rxe should be tested. > > static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue, > struct nvme_rdma_request *req, struct nvme_command *c, > int count) > { > struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl; > int nr; > > req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs); > if (WARN_ON_ONCE(!req->mr)) > return -EAGAIN; > > /* > * Align the MR to a 4K page size to match the ctrl page size and > * the block virtual boundary.
> */ > nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL, > SZ_4K); > if (unlikely(nr < count)) { > ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr); > req->mr = NULL; > if (nr < 0) > return nr; > return -EINVAL; > } > > ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey)); > >> >> >> >>> >>> > va0 = 0xffff000020651000; >>> > page_offset = 0 = va & (page_size - 1); >>> > page = va_to_page(va); >>> > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL); >>> > >>> > load_process: >>> > page = xa_load(&mr->page_list, index); >>> > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000 >>> > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000; >>> > >>> > Obviously, *va0 != va1*, page_offset get lost. >>> > >>> > >>> > How to fix: >>> > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") >>> > - don't allow ulp registering mr.page_size != PAGE_SIZE ? >>> > >>> > >>> > >>> > Thanks >>> > Zhijian >>> > >>> > >>> > On 24/10/2023 17:13, Li Zhijian wrote: >>> >> >>> >> >>> >> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote: >>> >>> >>> >>> >>> >>> On 23/10/2023 18:45, Yi Zhang wrote: >>> >>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu) >>> >>>> <lizhijian@fujitsu.com> wrote: >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote: >>> >>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote: >>> >>>>>>> CC Bart >>> >>>>>>> >>> >>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote: >>> >>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote: >>> >>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev> >>> >>>>>>>>> >>> >>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit >>> >>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the >>> >>>>>>>>> page_size is also set. Sometime this will cause conflict. 
>>> >>>>>>>> I appreciate your prompt action, but I do not think this commit deals with >>> >>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong >>> >>>>>>>> with assigning actual page size to ibmr.page_size? >>> >>>>>>>> >>> >>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used >>> >>>>>>>> in ulp/srp layer to calculate the page_size. >>> >>>>>>>> ===== >>> >>>>>>>> static int srp_add_one(struct ib_device *device) >>> >>>>>>>> { >>> >>>>>>>> struct srp_device *srp_dev; >>> >>>>>>>> struct ib_device_attr *attr = &device->attrs; >>> >>>>>>>> <...> >>> >>>>>>>> /* >>> >>>>>>>> * Use the smallest page size supported by the HCA, down to a >>> >>>>>>>> * minimum of 4096 bytes. We're unlikely to build large sglists >>> >>>>>>>> * out of smaller entries. >>> >>>>>>>> */ >>> >>>>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> You light me up. >>> >>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size >>> >>>>>> >>> >>>>>> That doesn't seem right even in concept.> >>> >>>>>> I think the multi-size support in the new xarray code does not work >>> >>>>>> right, just looking at it makes me think it does not work right. It >>> >>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will >>> >>>>>> explode because kmap_local_page() does only 4K. >>> >>>>>> >>> >>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work? >>> >>>>>> >>> >>>>> >>> >>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT). >>> >>>> >>> >>>> Hi Zhijian >>> >>>> >>> >>>> Did you try blktests nvme/rdma use_rxe on your environment, it still >>> >>>> failed on my side. >>> >>>> >>> >>> >>> >>> Thanks for your testing. >>> >>> I just did that, it also failed in my environment. After a glance debug, I found there are >>> >>> other places still registered 4K MR. I will dig into it later. 
>>> >> >>> >> >>> >> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly. >>> >> >>> >> >>> >> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue, >>> >> 1294 struct nvme_rdma_request *req, struct nvme_command *c, >>> >> 1295 int count) >>> >> 1296 { >>> >> 1297 struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl; >>> >> 1298 int nr; >>> >> 1299 >>> >> 1300 req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs); >>> >> 1301 if (WARN_ON_ONCE(!req->mr)) >>> >> 1302 return -EAGAIN; >>> >> 1303 >>> >> 1304 /* >>> >> 1305 * Align the MR to a 4K page size to match the ctrl page size and >>> >> 1306 * the block virtual boundary. >>> >> 1307 */ >>> >> 1308 nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL, >>> >> 1309 SZ_4K); >>> >> >>> >> >>> >> Anyway, i will go ahead. if you have any thought, please let me know. >>> >> >>> >> >>> >> >>> >>> >>> >>> Thanks >>> >>> Zhijian >>> >>> >>> >>> >>> >>> >>> >>> >>> >>>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003 >>> >>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed] >>> >>>> runtime 12.179s ... 
11.941s >>> >>>> --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400 >>> >>>> +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23 >>> >>>> 05:52:27.882759168 -0400 >>> >>>> @@ -1,3 +1,3 @@ >>> >>>> Running nvme/003 >>> >>>> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s) >>> >>>> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s) >>> >>>> Test complete >>> >>>> >>> >>>> [ 7033.431910] rdma_rxe: loaded >>> >>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15 >>> >>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024 >>> >>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active >>> >>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0 >>> >>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152 >>> >>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>> >>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420) >>> >>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808 >>> >>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535 >>> >>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770 >>> >>>> [ 7046.317861] rdma_rxe: unloaded >>> >>>> >>> >>>> >>> >>>>> >>> >>>>>>>> import ctypes >>> >>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6') >>> >>>>>>>> hex(65536) >>> >>>>> '0x10000' >>> >>>>>>>> libc.ffs(0x10000) - 1 >>> >>>>> 16 >>> >>>>>>>> 1 << 16 >>> >>>>> 65536 >>> >>>>> >>> >>>>> so >>> >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16; >>> >>>>> >>> >>>>> >>> >>>>> So I think Daisuke's patch should work as well. >>> >>>>> >>> >>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1 >>> >>>>> >>> >>>>> >>> >>>>>> Jason >>> >>>> >>> >>>> >>> >>>> >>> >>>> -- >>> >>>> Best Regards, >>> >>>> Yi Zhang >>>
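Greg's point is that nvme-rdma always asks for 4K MR pages via ib_map_mr_sg(..., SZ_4K), so the device capability should be checked before mapping. A hedged sketch of such a check (`supports_page_size()` is a made-up helper for illustration, not an existing ib_core function; page_size_cap bit n means 2**n is supported):

```python
SZ_4K = 0x1000

def supports_page_size(page_size_cap, page_size):
    # the cap bit for a power-of-two page size is the size itself
    return bool(page_size_cap & page_size)

assert supports_page_size(0xfffff000, SZ_4K)   # current rxe cap claims 4K support
assert not supports_page_size(1 << 16, SZ_4K)  # a 64K-only cap does not
```

This is why fixing RXE_PAGE_SIZE_CAP alone makes nvme's hardcoded 4K request visibly unsupported instead of silently corrupting offsets.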
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index f54042e9aeb2..dc9d31bfb689 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -58,7 +58,6 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->rkey = mr->ibmr.rkey = key;
 
 	mr->access = access;
-	mr->ibmr.page_size = PAGE_SIZE;
 	mr->page_mask = PAGE_MASK;
 	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
@@ -79,7 +78,7 @@ static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
 
 static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
 {
-	return iova & (mr_page_size(mr) - 1);
+	return iova & (PAGE_SIZE - 1);
 }
 
 static bool is_pmem_page(struct page *pg)
@@ -232,7 +231,7 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	unsigned int page_size = mr_page_size(mr);
+	unsigned int page_size = ibmr->page_size;
 
 	mr->nbuf = 0;
 	mr->page_shift = ilog2(page_size);
@@ -256,8 +255,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
 		if (!page)
 			return -EFAULT;
 
-		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 		va = kmap_local_page(page);
 		if (dir == RXE_FROM_MR_OBJ)
 			memcpy(addr, va + page_offset, bytes);
@@ -450,8 +448,7 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
 		if (!page)
 			return -EFAULT;
-		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 		va = kmap_local_page(page);
 		arch_wb_cache_pmem(va + page_offset, bytes);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index ccb9d19ffe8a..9fcaa974e079 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -319,11 +319,6 @@ struct rxe_mr {
 	struct xarray		page_list;
 };
 
-static inline unsigned int mr_page_size(struct rxe_mr *mr)
-{
-	return mr ? mr->ibmr.page_size : PAGE_SIZE;
-}
-
 enum rxe_mw_state {
 	RXE_MW_STATE_INVALID	= RXE_MR_STATE_INVALID,
 	RXE_MW_STATE_FREE	= RXE_MR_STATE_FREE,
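With the patch above, rxe_mr_iova_to_page_offset() computes the offset with PAGE_SIZE, matching the granularity at which pages are stored in the xarray, so the store/load round trip from Zhijian's example becomes lossless. A userspace arithmetic sketch of the patched behavior:

```python
PAGE_SIZE = 0x10000               # 64K kernel pages
va0 = 0xffff000020651000          # the address from Zhijian's example

page_frame = va0 & ~(PAGE_SIZE - 1)   # what the stored struct page identifies
page_offset = va0 & (PAGE_SIZE - 1)   # patched rxe_mr_iova_to_page_offset()
assert page_frame + page_offset == va0
```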