
[1/1] RDMA/rxe: Fix blktests srp lead kernel panic with 64k page size

Message ID 20231013011803.70474-1-yanjun.zhu@intel.com (mailing list archive)
State Changes Requested

Commit Message

Zhu Yanjun Oct. 13, 2023, 1:18 a.m. UTC
From: Zhu Yanjun <yanjun.zhu@linux.dev>

The page_size of the MR is originally set in the InfiniBand core. Since
commit 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), it
is also set in the rxe driver. Sometimes the two settings conflict.

Fixes: 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    | 11 ++++-------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  5 -----
 2 files changed, 4 insertions(+), 12 deletions(-)

Comments

Daisuke Matsuda (Fujitsu) Oct. 13, 2023, 12:01 p.m. UTC | #1
On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> The page_size of mr is set in infiniband core originally. In the commit
> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> page_size is also set. Sometime this will cause conflict.

I appreciate your prompt action, but I do not think this commit deals with
the root cause. I agree that the problem lies in the rxe driver, but what is
wrong with assigning the actual page size to ibmr.page_size?

IMO, the problem comes from the device attribute of the rxe driver, which is
used in the ulp/srp layer to calculate the page_size.
=====
static int srp_add_one(struct ib_device *device)
{
        struct srp_device *srp_dev;
        struct ib_device_attr *attr = &device->attrs;
<...>
        /*
         * Use the smallest page size supported by the HCA, down to a
         * minimum of 4096 bytes. We're unlikely to build large sglists
         * out of smaller entries.
         */
        mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
        srp_dev->mr_page_size   = 1 << mr_page_shift;
=====
On initialization of the srp driver, mr_page_size is calculated here.
Note that the device attribute is used to calculate the page shift when the
system page size is larger than 4096. Since Yi specified
CONFIG_ARM64_64K_PAGES, the system naturally met that condition.

=====
static int srp_map_finish_fr(struct srp_map_state *state,
                             struct srp_request *req,
                             struct srp_rdma_ch *ch, int sg_nents,
                             unsigned int *sg_offset_p)
{
        struct srp_target_port *target = ch->target;
        struct srp_device *dev = target->srp_host->srp_dev;
<...>
        n = ib_map_mr_sg(desc->mr, state->sg, sg_nents, sg_offset_p,
                         dev->mr_page_size);
=====
After that, mr_page_size is presumably passed down to the ib_core layer.

=====
int ib_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
                 unsigned int *sg_offset, unsigned int page_size)
{
        if (unlikely(!mr->device->ops.map_mr_sg))
                return -EOPNOTSUPP;

        mr->page_size = page_size;

        return mr->device->ops.map_mr_sg(mr, sg, sg_nents, sg_offset);
}
EXPORT_SYMBOL(ib_map_mr_sg);
=====
Consequently, the page size calculated in srp driver is set to ibmr.page_size.

Coming back to rxe, the device attribute is set here:
===== 
rxe.c
<...>
/* initialize rxe device parameters */
static void rxe_init_device_param(struct rxe_dev *rxe)
{
        rxe->max_inline_data                    = RXE_MAX_INLINE_DATA;

        rxe->attr.vendor_id                     = RXE_VENDOR_ID;
        rxe->attr.max_mr_size                   = RXE_MAX_MR_SIZE;
        rxe->attr.page_size_cap                 = RXE_PAGE_SIZE_CAP;
        rxe->attr.max_qp                        = RXE_MAX_QP;
---
rxe_param.h
<...>
/* default/initial rxe device parameter settings */
enum rxe_device_param {
        RXE_MAX_MR_SIZE                 = -1ull,
        RXE_PAGE_SIZE_CAP               = 0xfffff000,
        RXE_MAX_QP_WR                   = DEFAULT_MAX_VALUE,
=====
rxe_init_device_param() sets the attributes in rxe_dev->attr, which is later
copied to ib_device->attrs in setup_device()@core/device.c.
See that the smallest page size advertised by the cap is hardcoded to 4096
bytes. I suspect this led to an incorrect page_size being set in
ibmr.page_size, resulting in the kernel crash.

I think the rxe driver was designed to handle arbitrary page sizes.
Probably we can just modify RXE_PAGE_SIZE_CAP to fix the issue.
What do you guys think?

Thanks,
Daisuke Matsuda
Zhu Yanjun Oct. 13, 2023, 12:28 p.m. UTC | #2
在 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) 写道:
> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> The page_size of mr is set in infiniband core originally. In the commit
>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>> page_size is also set. Sometime this will cause conflict.
> 
> I appreciate your prompt action, but I do not think this commit deals with
> the root cause. I agree that the problem lies in rxe driver, but what is wrong
> with assigning actual page size to ibmr.page_size?

Please check the source code. ibmr.page_size is assigned in
infiniband/core, and then it is assigned again in rxe.
When the two values differ, the problem occurs.
Please add logs to infiniband/core and rxe to check them.

Zhu Yanjun

Daisuke Matsuda (Fujitsu) Oct. 13, 2023, 1:01 p.m. UTC | #3
On Friday, October 13, 2023 9:29 PM Zhu Yanjun:
> 在 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) 写道:
> > On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> >> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> >>
> >> The page_size of mr is set in infiniband core originally. In the commit
> >> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> >> page_size is also set. Sometime this will cause conflict.
> >
> > I appreciate your prompt action, but I do not think this commit deals with
> > the root cause. I agree that the problem lies in rxe driver, but what is wrong
> > with assigning actual page size to ibmr.page_size?
> 
> Please check the source code. ibmr.page_size is assigned in
> infiniband/core. And then it is assigned in rxe.
> When the 2 are different, the problem will occur.

In the first place, the two must always be equal.
Is there any situation where there are two different page sizes for an MR?
I think I have explained that the value assigned in the core layer is wrong
when PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.

As I have no environment to reproduce the kernel panic, this remains
speculative, but I have provided enough information for everyone to rebut
my argument. It is possible I am wrong. I hope someone will tell me
specifically where my guess is wrong, or Yi will kindly take the trouble
to verify it.

Thanks,
Daisuke Matsuda

Rain River Oct. 13, 2023, 1:44 p.m. UTC | #4
On Fri, Oct 13, 2023 at 9:01 AM Daisuke Matsuda (Fujitsu)
<matsuda-daisuke@fujitsu.com> wrote:
>
> On Friday, October 13, 2023 9:29 PM Zhu Yanjun:
> > 在 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) 写道:
> > > On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> > >> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > >>
> > >> The page_size of mr is set in infiniband core originally. In the commit
> > >> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> > >> page_size is also set. Sometime this will cause conflict.
> > >
> > > I appreciate your prompt action, but I do not think this commit deals with
> > > the root cause. I agree that the problem lies in rxe driver, but what is wrong
> > > with assigning actual page size to ibmr.page_size?
> >
> > Please check the source code. ibmr.page_size is assigned in
> > infiniband/core. And then it is assigned in rxe.
> > When the 2 are different, the problem will occur.
>
> In the first place, the two must always be equal.
> Is there any situation there are two different page sizes for a MR?
> I think I have explained the value assigned in the core layer is wrong
> when the PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.
>
> As I have no environment to test the kernel panic, it remains speculative,
> but I have provided enough information for everyone to rebut my argument.
> It is possible I am wrong. I hope someone will tell me specifically where
> my guess is wrong, or Yi would kindly take the trouble to verify it.

I made a quick test.

With Zhu's patch, the problem is fixed.
With your patch, the problem still exists, and other problems occur. I do
not know why.
I will continue testing.

Daisuke Matsuda (Fujitsu) Oct. 16, 2023, 6:07 a.m. UTC | #5
On Fri, Oct 13, 2023 10:44 PM Rain River <rain.1986.08.12@gmail.com> wrote:
> > On Friday, October 13, 2023 9:29 PM Zhu Yanjun:
> > > 在 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) 写道:
> > > > On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> > > >> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > > >>
> > > >> The page_size of mr is set in infiniband core originally. In the commit
> > > >> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> > > >> page_size is also set. Sometime this will cause conflict.
> > > >
> > > > I appreciate your prompt action, but I do not think this commit deals with
> > > > the root cause. I agree that the problem lies in rxe driver, but what is wrong
> > > > with assigning actual page size to ibmr.page_size?
> > >
> > > Please check the source code. ibmr.page_size is assigned in
> > > infiniband/core. And then it is assigned in rxe.
> > > When the 2 are different, the problem will occur.
> >
> > In the first place, the two must always be equal.
> > Is there any situation there are two different page sizes for a MR?
> > I think I have explained the value assigned in the core layer is wrong
> > when the PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.
> >
> > As I have no environment to test the kernel panic, it remains speculative,
> > but I have provided enough information for everyone to rebut my argument.
> > It is possible I am wrong. I hope someone will tell me specifically where
> > my guess is wrong, or Yi would kindly take the trouble to verify it.
> 
> I made a quick test.
> 
> With Zhu's patch, the problem fixed.
> With your patch, this problem exists. And other problems occur. I do
> not know why.
> Will continue to make tests.

Thank you for your time and work.
I will try to find out, from different perspectives, why there were two
different page sizes. This may take a while because I am busy with other
projects now. If anybody needs the driver without the crash issue right
now, I do not object to partially undoing "RDMA/rxe: Cleanup page
variables in rxe_mr.c" as Zhu suggests, but that should be an interim
measure and not a final solution.

Thanks,
Daisuke

Zhu Yanjun Oct. 18, 2023, 8:34 a.m. UTC | #6
在 2023/10/16 14:07, Daisuke Matsuda (Fujitsu) 写道:
> On Fri, Oct 13, 2023 10:44 PM Rain River <rain.1986.08.12@gmail.com> wrote:
>>> On Friday, October 13, 2023 9:29 PM Zhu Yanjun:
>>>> 在 2023/10/13 20:01, Daisuke Matsuda (Fujitsu) 写道:
>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>
>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>
>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>> with assigning actual page size to ibmr.page_size?
>>>>
>>>> Please check the source code. ibmr.page_size is assigned in
>>>> infiniband/core. And then it is assigned in rxe.
>>>> When the 2 are different, the problem will occur.
>>>
>>> In the first place, the two must always be equal.
>>> Is there any situation there are two different page sizes for a MR?
>>> I think I have explained the value assigned in the core layer is wrong
>>> when the PAGE_SIZE is bigger than 4k, and that is why they are inconsistent.
>>>
>>> As I have no environment to test the kernel panic, it remains speculative,
>>> but I have provided enough information for everyone to rebut my argument.
>>> It is possible I am wrong. I hope someone will tell me specifically where
>>> my guess is wrong, or Yi would kindly take the trouble to verify it.
>>
>> I made a quick test.
>>
>> With Zhu's patch, the problem fixed.
>> With your patch, this problem exists. And other problems occur. I do
>> not know why.
>> Will continue to make tests.
> 
> Thank you for your time and work.
> I will try to find out why there were two different page sizes from
> different perspectives. This may take a while because I am busy
> with other projects now. If anybody need the driver without the crash
> issue right now, I do not object to partially undoing "RDMA/rxe: Cleanup
> page variables in rxe_mr.c" as Zhu suggests, but that should be interim
> and not a final solution.

Sorry for the late reply.

This is a difficult and complicated problem because it involves both rdma
and block/blktests. This patch is based on the fact that commit
325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c") causes the
problem.

So I delved into that commit. Its core change is setting ibmr.page_size in
the driver. From there, I checked the source code and found that this
page_size is also set in infiniband/core, so I tried removing the driver's
assignment of ibmr.page_size.

I will continue digging into this problem and the source code to find the
root cause, one that you also agree with.

Zhu Yanjun
Zhijian Li (Fujitsu) Oct. 20, 2023, 3:47 a.m. UTC | #7
CC Bart

On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>
>> The page_size of mr is set in infiniband core originally. In the commit
>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>> page_size is also set. Sometime this will cause conflict.
> I appreciate your prompt action, but I do not think this commit deals with
> the root cause. I agree that the problem lies in rxe driver, but what is wrong
> with assigning actual page size to ibmr.page_size?
> 
> IMO, the problem comes from the device attribute of rxe driver, which is used
> in ulp/srp layer to calculate the page_size.
> =====
> static int srp_add_one(struct ib_device *device)
> {
>          struct srp_device *srp_dev;
>          struct ib_device_attr *attr = &device->attrs;
> <...>
>          /*
>           * Use the smallest page size supported by the HCA, down to a
>           * minimum of 4096 bytes. We're unlikely to build large sglists
>           * out of smaller entries.
>           */
>          mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);


You lit me up.
RXE advertises attr.page_size_cap (RXE_PAGE_SIZE_CAP), which means it claims to support 4K-2G page sizes.

So I think a more accurate logic would be:

if (device supports PAGE_SIZE)
    use PAGE_SIZE
else if (device supports a 4096 page_size)  // fall back to 4096
    use 4096
else
    ...




>          srp_dev->mr_page_size   = 1 << mr_page_shift;
> =====
> On initialization of srp driver, mr_page_size is calculated here.
> Note that the device attribute is used to calculate the value of page shift
> when the device is trying to use a page size larger than 4096. Since Yi specified
> CONFIG_ARM64_64K_PAGES, the system naturally met the condition.
Zhijian Li (Fujitsu) Oct. 20, 2023, 6:54 a.m. UTC | #8
Add and Hi Bart,

Yi reported a crash[1] when PAGE_SIZE is 64K
[1]https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/

The root cause is unknown so far, but I notice that SRP over RXE always uses an MR with page_size
4K = MAX(4K, min(device-supported page sizes)), even though the device supports a 64K page size.
* The RXE device claims to support 4K ~ 2G page sizes


4024         /*
4025          * Use the smallest page size supported by the HCA, down to a
4026          * minimum of 4096 bytes. We're unlikely to build large sglists
4027          * out of smaller entries.
4028          */
4029         mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
4030         srp_dev->mr_page_size   = 1 << mr_page_shift;
4031         srp_dev->mr_page_mask   = ~((u64) srp_dev->mr_page_size - 1);
4032         max_pages_per_mr        = attr->max_mr_size;


I wonder if we can use a PAGE_SIZE MR when the device supports it.


BTW, rtrs seems to have the same code.  @rtrs

Thanks
Zhijian

On 20/10/2023 11:47, Zhijian Li (Fujitsu) wrote:
> CC Bart
> 
> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>
>>> The page_size of mr is set in infiniband core originally. In the commit
>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>> page_size is also set. Sometime this will cause conflict.
>> I appreciate your prompt action, but I do not think this commit deals with
>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>> with assigning actual page size to ibmr.page_size?
>>
>> IMO, the problem comes from the device attribute of rxe driver, which is used
>> in ulp/srp layer to calculate the page_size.
>> =====
>> static int srp_add_one(struct ib_device *device)
>> {
>>           struct srp_device *srp_dev;
>>           struct ib_device_attr *attr = &device->attrs;
>> <...>
>>           /*
>>            * Use the smallest page size supported by the HCA, down to a
>>            * minimum of 4096 bytes. We're unlikely to build large sglists
>>            * out of smaller entries.
>>            */
>>           mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
> 
> 
> You light me up.
> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
> 
> so i think more accurate logic should be:
> 
> if (device supports PAGE_SIZE)
>      use PAGE_SIZE
> else if (device support 4096 page_size)  // fallback to 4096
>      use 4096 etc...
> else
>      ...
> 
> 
> 
> 
>>           srp_dev->mr_page_size   = 1 << mr_page_shift;
>> =====
>> On initialization of srp driver, mr_page_size is calculated here.
>> Note that the device attribute is used to calculate the value of page shift
>> when the device is trying to use a page size larger than 4096. Since Yi specified
>> CONFIG_ARM64_64K_PAGES, the system naturally met the condition.
Jason Gunthorpe Oct. 20, 2023, 2:01 p.m. UTC | #9
On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
> CC Bart
> 
> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
> > On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> >> From: Zhu Yanjun<yanjun.zhu@linux.dev>
> >>
> >> The page_size of mr is set in infiniband core originally. In the commit
> >> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> >> page_size is also set. Sometime this will cause conflict.
> > I appreciate your prompt action, but I do not think this commit deals with
> > the root cause. I agree that the problem lies in rxe driver, but what is wrong
> > with assigning actual page size to ibmr.page_size?
> > 
> > IMO, the problem comes from the device attribute of rxe driver, which is used
> > in ulp/srp layer to calculate the page_size.
> > =====
> > static int srp_add_one(struct ib_device *device)
> > {
> >          struct srp_device *srp_dev;
> >          struct ib_device_attr *attr = &device->attrs;
> > <...>
> >          /*
> >           * Use the smallest page size supported by the HCA, down to a
> >           * minimum of 4096 bytes. We're unlikely to build large sglists
> >           * out of smaller entries.
> >           */
> >          mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
> 
> 
> You light me up.
> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size

That doesn't seem right even in concept.

I think the multi-size support in the new xarray code does not work
right, just looking at it makes me think it does not work right. It
looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
explode because kmap_local_page() does only 4K.

If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work?

Jason
Bart Van Assche Oct. 20, 2023, 4:21 p.m. UTC | #10
On 10/19/23 23:54, Zhijian Li (Fujitsu) wrote:
> Add and Hi Bart,
> 
> Yi reported a crash[1] when PAGE_SIZE is 64K
> [1]https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
> 
> The root cause is unknown so far, but I notice that SRP over RXE always uses MR with page_size
> 4K = MAX(4K, min(device_support_page_size)) even though the device supports 64K page size.
> * RXE device support 4K ~ 2G page size
> 
> 
> 4024         /*
> 4025          * Use the smallest page size supported by the HCA, down to a
> 4026          * minimum of 4096 bytes. We're unlikely to build large sglists
> 4027          * out of smaller entries.
> 4028          */
> 4029         mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
> 4030         srp_dev->mr_page_size   = 1 << mr_page_shift;
> 4031         srp_dev->mr_page_mask   = ~((u64) srp_dev->mr_page_size - 1);
> 4032         max_pages_per_mr        = attr->max_mr_size;
> 
> 
> I doubt if we can use PAGE_SIZE MR if the device supports it.

Hi Zhijian,

Thank you for having Cc-ed me. How about changing "12" in the above code
into PAGE_SHIFT?

Thanks,

Bart.
Zhijian Li (Fujitsu) Oct. 23, 2023, 12:58 a.m. UTC | #11
On 21/10/2023 00:21, Bart Van Assche wrote:
> On 10/19/23 23:54, Zhijian Li (Fujitsu) wrote:
>> Add and Hi Bart,
>>
>> Yi reported a crash[1] when PAGE_SIZE is 64K
>> [1]https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>>
>> The root cause is unknown so far, but I notice that SRP over RXE always uses MR with page_size
>> 4K = MAX(4K, min(device_support_page_size)) even though the device supports 64K page size.
>> * RXE device support 4K ~ 2G page size
>>
>>
>> 4024         /*
>> 4025          * Use the smallest page size supported by the HCA, down to a
>> 4026          * minimum of 4096 bytes. We're unlikely to build large sglists
>> 4027          * out of smaller entries.
>> 4028          */
>> 4029         mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>> 4030         srp_dev->mr_page_size   = 1 << mr_page_shift;
>> 4031         srp_dev->mr_page_mask   = ~((u64) srp_dev->mr_page_size - 1);
>> 4032         max_pages_per_mr        = attr->max_mr_size;
>>
>>
>> I doubt if we can use PAGE_SIZE MR if the device supports it.
> 
> Hi Zhijian,
> 
> Thank you for having Cc-ed me. How about changing "12" in the above code
> into PAGE_SHIFT?


Yeah, after changing it to PAGE_SHIFT, it works fine and the crash is gone in my aarch64 environment.


Thanks
Zhijian



> 
> Thanks,
> 
> Bart.
Zhijian Li (Fujitsu) Oct. 23, 2023, 3:52 a.m. UTC | #12
On 20/10/2023 22:01, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>> CC Bart
>>
>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>
>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>> page_size is also set. Sometime this will cause conflict.
>>> I appreciate your prompt action, but I do not think this commit deals with
>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>> with assigning actual page size to ibmr.page_size?
>>>
>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>> in ulp/srp layer to calculate the page_size.
>>> =====
>>> static int srp_add_one(struct ib_device *device)
>>> {
>>>           struct srp_device *srp_dev;
>>>           struct ib_device_attr *attr = &device->attrs;
>>> <...>
>>>           /*
>>>            * Use the smallest page size supported by the HCA, down to a
>>>            * minimum of 4096 bytes. We're unlikely to build large sglists
>>>            * out of smaller entries.
>>>            */
>>>           mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>
>>
>> You light me up.
>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
> 
> That doesn't seem right even in concept.
>
> I think the multi-size support in the new xarray code does not work
> right, just looking at it makes me think it does not work right. It
> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
> explode because kmap_local_page() does only 4K.
> 
> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
> 

Yeah, this should work (even though I only verified it by hardcoding mr_page_shift to PAGE_SHIFT).

>>> import ctypes
>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>> hex(65536)
'0x10000'
>>> libc.ffs(0x10000) - 1
16
>>> 1 << 16
65536

so
mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;


So I think Daisuke's patch should work as well.

https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1


> Jason
Zhu Yanjun Oct. 23, 2023, 6:08 a.m. UTC | #13
On 2023/10/23 11:52, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>> CC Bart
>>>
>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>
>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>> page_size is also set. Sometime this will cause conflict.
>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>> with assigning actual page size to ibmr.page_size?
>>>>
>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>> in ulp/srp layer to calculate the page_size.
>>>> =====
>>>> static int srp_add_one(struct ib_device *device)
>>>> {
>>>>            struct srp_device *srp_dev;
>>>>            struct ib_device_attr *attr = &device->attrs;
>>>> <...>
>>>>            /*
>>>>             * Use the smallest page size supported by the HCA, down to a
>>>>             * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>             * out of smaller entries.
>>>>             */
>>>>            mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>>
>>>
>>> You light me up.
>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>
>> That doesn't seem right even in concept.
>>
>> I think the multi-size support in the new xarray code does not work
>> right, just looking at it makes me think it does not work right. It
>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>> explode because kmap_local_page() does only 4K.
>>
>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>
> 
> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
> 
>>>> import ctypes
>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>> hex(65536)
> '0x10000'
>>>> libc.ffs(0x10000) - 1
> 16
>>>> 1 << 16
> 65536
> 
> so
> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
> 
> 
> So I think Daisuke's patch should work as well.
> 
> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1

About this patch, please run full tests. Please make sure that it does not 
introduce any potential problems. From the discussion, this patch may have 
side effects.

Zhu Yanjun

> 
> 
>> Jason
Yi Zhang Oct. 23, 2023, 10:45 a.m. UTC | #14
On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
<lizhijian@fujitsu.com> wrote:
>
>
>
> On 20/10/2023 22:01, Jason Gunthorpe wrote:
> > On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
> >> CC Bart
> >>
> >> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
> >>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
> >>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
> >>>>
> >>>> The page_size of mr is set in infiniband core originally. In the commit
> >>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
> >>>> page_size is also set. Sometime this will cause conflict.
> >>> I appreciate your prompt action, but I do not think this commit deals with
> >>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
> >>> with assigning actual page size to ibmr.page_size?
> >>>
> >>> IMO, the problem comes from the device attribute of rxe driver, which is used
> >>> in ulp/srp layer to calculate the page_size.
> >>> =====
> >>> static int srp_add_one(struct ib_device *device)
> >>> {
> >>>           struct srp_device *srp_dev;
> >>>           struct ib_device_attr *attr = &device->attrs;
> >>> <...>
> >>>           /*
> >>>            * Use the smallest page size supported by the HCA, down to a
> >>>            * minimum of 4096 bytes. We're unlikely to build large sglists
> >>>            * out of smaller entries.
> >>>            */
> >>>           mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
> >>
> >>
> >> You light me up.
> >> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
> >
> > That doesn't seem right even in concept.
> >
> > I think the multi-size support in the new xarray code does not work
> > right, just looking at it makes me think it does not work right. It
> > looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
> > explode because kmap_local_page() does only 4K.
> >
> > If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
> >
>
> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).

Hi Zhijian

Did you try blktests nvme/rdma use_rxe in your environment? It still
fails on my side.

# use_rxe=1 nvme_trtype=rdma  ./check nvme/003
nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
    runtime  12.179s  ...  11.941s
    --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
    +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
05:52:27.882759168 -0400
    @@ -1,3 +1,3 @@
     Running nvme/003
    -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
    +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
     Test complete

[ 7033.431910] rdma_rxe: loaded
[ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
[ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
[ 7033.510969] infiniband enP2p1s0v0_rxe: set active
[ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
[ 7033.549301] loop0: detected capacity change from 0 to 2097152
[ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
[ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
[ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
[ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
[ 7046.317861] rdma_rxe: unloaded


>
> >>> import ctypes
> >>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
> >>> hex(65536)
> '0x10000'
> >>> libc.ffs(0x10000) - 1
> 16
> >>> 1 << 16
> 65536
>
> so
> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>
>
> So I think Daisuke's patch should work as well.
>
> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>
>
> > Jason



--
Best Regards,
  Yi Zhang
Zhijian Li (Fujitsu) Oct. 24, 2023, 8:15 a.m. UTC | #15
On 23/10/2023 18:45, Yi Zhang wrote:
> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com> wrote:
>>
>>
>>
>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>> CC Bart
>>>>
>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>>
>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>> with assigning actual page size to ibmr.page_size?
>>>>>
>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>> in ulp/srp layer to calculate the page_size.
>>>>> =====
>>>>> static int srp_add_one(struct ib_device *device)
>>>>> {
>>>>>            struct srp_device *srp_dev;
>>>>>            struct ib_device_attr *attr = &device->attrs;
>>>>> <...>
>>>>>            /*
>>>>>             * Use the smallest page size supported by the HCA, down to a
>>>>>             * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>>             * out of smaller entries.
>>>>>             */
>>>>>            mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>>>
>>>>
>>>> You light me up.
>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>>
>>> That doesn't seem right even in concept.
>>>
>>> I think the multi-size support in the new xarray code does not work
>>> right, just looking at it makes me think it does not work right. It
>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>> explode because kmap_local_page() does only 4K.
>>>
>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>>
>>
>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
> 
> Hi Zhijian
> 
> Did you try blktests nvme/rdma use_rxe on your environment, it still
> failed on my side.
> 

Thanks for your testing.
I just did that; it also failed in my environment. After a quick debug, I found there are
other places that still register a 4K MR. I will dig into it later.

Thanks
Zhijian




> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>      runtime  12.179s  ...  11.941s
>      --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>      +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
> 05:52:27.882759168 -0400
>      @@ -1,3 +1,3 @@
>       Running nvme/003
>      -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>      +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>       Test complete
> 
> [ 7033.431910] rdma_rxe: loaded
> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
> [ 7046.317861] rdma_rxe: unloaded
> 
> 
>>
>>>>> import ctypes
>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>>> hex(65536)
>> '0x10000'
>>>>> libc.ffs(0x10000) - 1
>> 16
>>>>> 1 << 16
>> 65536
>>
>> so
>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>
>>
>> So I think Daisuke's patch should work as well.
>>
>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>
>>
>>> Jason
> 
> 
> 
> --
> Best Regards,
>    Yi Zhang
>
Zhijian Li (Fujitsu) Oct. 24, 2023, 9:13 a.m. UTC | #16
On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 23/10/2023 18:45, Yi Zhang wrote:
>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>> <lizhijian@fujitsu.com> wrote:
>>>
>>>
>>>
>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>>> CC Bart
>>>>>
>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>>>
>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>>> with assigning actual page size to ibmr.page_size?
>>>>>>
>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>>> in ulp/srp layer to calculate the page_size.
>>>>>> =====
>>>>>> static int srp_add_one(struct ib_device *device)
>>>>>> {
>>>>>>             struct srp_device *srp_dev;
>>>>>>             struct ib_device_attr *attr = &device->attrs;
>>>>>> <...>
>>>>>>             /*
>>>>>>              * Use the smallest page size supported by the HCA, down to a
>>>>>>              * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>>>              * out of smaller entries.
>>>>>>              */
>>>>>>             mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>>>>
>>>>>
>>>>> You light me up.
>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>>>
>>>> That doesn't seem right even in concept.
>>>>
>>>> I think the multi-size support in the new xarray code does not work
>>>> right, just looking at it makes me think it does not work right. It
>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>>> explode because kmap_local_page() does only 4K.
>>>>
>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>>>
>>>
>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>
>> Hi Zhijian
>>
>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>> failed on my side.
>>
> 
> Thanks for your testing.
> I just did that, it also failed in my environment. After a glance debug, I found there are
> other places still registered 4K MR. I will dig into it later.


nvme intends to register a 4K MR, but that should work IMO; at least RXE should handle it correctly.


1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
1294                 struct nvme_rdma_request *req, struct nvme_command *c,
1295                 int count)
1296 {
1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
1298         int nr;
1299
1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
1301         if (WARN_ON_ONCE(!req->mr))
1302                 return -EAGAIN;
1303
1304         /*
1305          * Align the MR to a 4K page size to match the ctrl page size and
1306          * the block virtual boundary.
1307          */
1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
1309                           SZ_4K);


Anyway, I will go ahead. If you have any thoughts, please let me know.



> 
> Thanks
> Zhijian
> 
> 
> 
> 
>> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>       runtime  12.179s  ...  11.941s
>>       --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>       +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>> 05:52:27.882759168 -0400
>>       @@ -1,3 +1,3 @@
>>        Running nvme/003
>>       -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>       +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>        Test complete
>>
>> [ 7033.431910] rdma_rxe: loaded
>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>> [ 7046.317861] rdma_rxe: unloaded
>>
>>
>>>
>>>>>> import ctypes
>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>>>> hex(65536)
>>> '0x10000'
>>>>>> libc.ffs(0x10000) - 1
>>> 16
>>>>>> 1 << 16
>>> 65536
>>>
>>> so
>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>>
>>>
>>> So I think Daisuke's patch should work as well.
>>>
>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>>
>>>
>>>> Jason
>>
>>
>>
>> --
>> Best Regards,
>>     Yi Zhang
Zhijian Li (Fujitsu) Oct. 26, 2023, 9:05 a.m. UTC | #17
The root cause is that

rxe_set_page() goes wrong when mr.page_size != PAGE_SIZE, because it only stores the page pointer in the xarray, so the in-page offset gets lost.

For example,
store process:
page_size = 0x1000;
PAGE_SIZE = 0x10000;
va0 = 0xffff000020651000;
page_offset = 0 = va & (page_size - 1);
page = va_to_page(va);
xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);

load process:
page = xa_load(&mr->page_list, index);
page_va = kmap_local_page(page); --> this is always a PAGE_SIZE-aligned value, assume it is 0xffff000020650000
va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;

Obviously, *va0 != va1*; the page_offset got lost.


How to fix:
- revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
- don't allow ULPs to register an MR with mr.page_size != PAGE_SIZE?



Thanks
Zhijian


On 24/10/2023 17:13, Li Zhijian wrote:
> 
> 
> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>
>>
>> On 23/10/2023 18:45, Yi Zhang wrote:
>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>> <lizhijian@fujitsu.com> wrote:
>>>>
>>>>
>>>>
>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>>>> CC Bart
>>>>>>
>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>>>>
>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>>>> with assigning actual page size to ibmr.page_size?
>>>>>>>
>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>>>> in ulp/srp layer to calculate the page_size.
>>>>>>> =====
>>>>>>> static int srp_add_one(struct ib_device *device)
>>>>>>> {
>>>>>>>             struct srp_device *srp_dev;
>>>>>>>             struct ib_device_attr *attr = &device->attrs;
>>>>>>> <...>
>>>>>>>             /*
>>>>>>>              * Use the smallest page size supported by the HCA, down to a
>>>>>>>              * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>>>>              * out of smaller entries.
>>>>>>>              */
>>>>>>>             mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>>>>>
>>>>>>
>>>>>> You light me up.
>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>>>>
>>>>> That doesn't seem right even in concept.
>>>>>
>>>>> I think the multi-size support in the new xarray code does not work
>>>>> right, just looking at it makes me think it does not work right. It
>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>>>> explode because kmap_local_page() does only 4K.
>>>>>
>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>>>>
>>>>
>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>>
>>> Hi Zhijian
>>>
>>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>>> failed on my side.
>>>
>>
>> Thanks for your testing.
>> I just did that, it also failed in my environment. After a glance debug, I found there are
>> other places still registered 4K MR. I will dig into it later.
> 
> 
> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly.
> 
> 
> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
> 1294                 struct nvme_rdma_request *req, struct nvme_command *c,
> 1295                 int count)
> 1296 {
> 1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
> 1298         int nr;
> 1299
> 1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
> 1301         if (WARN_ON_ONCE(!req->mr))
> 1302                 return -EAGAIN;
> 1303
> 1304         /*
> 1305          * Align the MR to a 4K page size to match the ctrl page size and
> 1306          * the block virtual boundary.
> 1307          */
> 1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
> 1309                           SZ_4K);
> 
> 
> Anyway, i will go ahead. if you have any thought, please let me know.
> 
> 
> 
>>
>> Thanks
>> Zhijian
>>
>>
>>
>>
>>> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>>       runtime  12.179s  ...  11.941s
>>>       --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>>       +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>>> 05:52:27.882759168 -0400
>>>       @@ -1,3 +1,3 @@
>>>        Running nvme/003
>>>       -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>>       +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>>        Test complete
>>>
>>> [ 7033.431910] rdma_rxe: loaded
>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>> [ 7046.317861] rdma_rxe: unloaded
>>>
>>>
>>>>
>>>>>>> import ctypes
>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>>>>> hex(65536)
>>>> '0x10000'
>>>>>>> libc.ffs(0x10000) - 1
>>>> 16
>>>>>>> 1 << 16
>>>> 65536
>>>>
>>>> so
>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>>>
>>>>
>>>> So I think Daisuke's patch should work as well.
>>>>
>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>>>
>>>>
>>>>> Jason
>>>
>>>
>>>
>>> -- 
>>> Best Regards,
>>>     Yi Zhang
Jason Gunthorpe Oct. 26, 2023, 11:42 a.m. UTC | #18
On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
> The root cause is that
> 
> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> So the offset will get lost.
> 
> For example,
> store process:
> page_size = 0x1000;
> PAGE_SIZE = 0x10000;
> va0 = 0xffff000020651000;
> page_offset = 0 = va & (page_size - 1);
> page = va_to_page(va);
> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> 
> load_process:
> page = xa_load(&mr->page_list, index);
> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> 
> Obviously, *va0 != va1*, page_offset get lost.
> 
> 
> How to fix:
> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> - don't allow ulp registering mr.page_size != PAGE_SIZE ?

Lets do the second one please. Most devices only support PAGE_SIZE anyhow.

Jason
Zhu Yanjun Oct. 26, 2023, 12:59 p.m. UTC | #19
在 2023/10/26 19:42, Jason Gunthorpe 写道:
> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>> The root cause is that
>>
>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>> So the offset will get lost.
>>
>> For example,
>> store process:
>> page_size = 0x1000;
>> PAGE_SIZE = 0x10000;
>> va0 = 0xffff000020651000;
>> page_offset = 0 = va & (page_size - 1);
>> page = va_to_page(va);
>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>
>> load_process:
>> page = xa_load(&mr->page_list, index);
>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>
>> Obviously, *va0 != va1*, page_offset get lost.
>>
>>
>> How to fix:
>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
> 
> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.

Normally page_size is PAGE_SIZE or the size of a whole compound page (in 
the latest kernel versions, the size of a folio). When a compound page 
or folio is taken into account, the page_size is not equal to 
PAGE_SIZE.

If a ULP uses a compound page or folio, a similar problem will occur 
again. Since this problem involves both RDMA and the block layer, it 
costs a lot of time and effort to debug. If we fix it now, it will save 
a lot of time and effort for future ULPs that use folios.

Because folios have a lot of benefits, sooner or later folios or other 
similar MM technology will be widespread in the kernel. At that time, 
this problem will appear again. We can do something for the future folio ^_^

Zhu Yanjun

> 
> Jason
Bart Van Assche Oct. 26, 2023, 1:28 p.m. UTC | #20
On 10/26/23 02:05, Zhijian Li (Fujitsu) wrote:
> The root cause is that
> 
> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> So the offset will get lost.
> 
> For example,
> store process:
> page_size = 0x1000;
> PAGE_SIZE = 0x10000;
> va0 = 0xffff000020651000;
> page_offset = 0 = va & (page_size - 1);
> page = va_to_page(va);
> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> 
> load_process:
> page = xa_load(&mr->page_list, index);
> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> 
> Obviously, *va0 != va1*, page_offset get lost.
> 
> 
> How to fix:
> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> - don't allow ulp registering mr.page_size != PAGE_SIZE ?

Thank you Zhijian for this root-cause analysis.

Hardware RDMA adapters may not support the host page size (PAGE_SIZE)
so disallowing mr.page_size != PAGE_SIZE would be wrong.

If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
mean that RXE_PAGE_SIZE_CAP should be changed from
0xfffff000 into PAGE_SHIFT?

Thanks,

Bart.
Jason Gunthorpe Oct. 26, 2023, 1:43 p.m. UTC | #21
On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
> On 10/26/23 02:05, Zhijian Li (Fujitsu) wrote:
> > The root cause is that
> > 
> > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> > So the offset will get lost.
> > 
> > For example,
> > store process:
> > page_size = 0x1000;
> > PAGE_SIZE = 0x10000;
> > va0 = 0xffff000020651000;
> > page_offset = 0 = va & (page_size - 1);
> > page = va_to_page(va);
> > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> > 
> > load_process:
> > page = xa_load(&mr->page_list, index);
> > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> > 
> > Obviously, *va0 != va1*, page_offset get lost.
> > 
> > 
> > How to fix:
> > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> > - don't allow ulp registering mr.page_size != PAGE_SIZE ?
> 
> Thank you Zhijian for this root-cause analysis.
> 
> Hardware RDMA adapters may not support the host page size (PAGE_SIZE)
> so disallowing mr.page_size != PAGE_SIZE would be wrong.

PAGE_SIZE must always be a valid way to construct a MR. We have code
in all the DMA drivers to break PAGE_SIZE chunks to something smaller
when building MRs.

rxe is missing this now with the xarray stuff since we can't encode
the struct page offset and a struct page in a single xarray entry. It
would have to go back to encoding pfns and doing pfn to page.

> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
> mean that RXE_PAGE_SIZE_CAP should be changed from
> 0xfffff000 into PAGE_SHIFT?

Yes

Jason
Bart Van Assche Oct. 26, 2023, 9:47 p.m. UTC | #22
On 10/26/23 06:43, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
>> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
>> mean that RXE_PAGE_SIZE_CAP should be changed from
>> 0xfffff000 into PAGE_SHIFT?
> 
> Yes

Bob, do you plan to convert the above change into a patch or do you
perhaps expect me to do that?

Thanks,

Bart.
Jason Gunthorpe Oct. 26, 2023, 11:23 p.m. UTC | #23
On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
> 在 2023/10/26 19:42, Jason Gunthorpe 写道:
> > On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
> > > The root cause is that
> > > 
> > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> > > So the offset will get lost.
> > > 
> > > For example,
> > > store process:
> > > page_size = 0x1000;
> > > PAGE_SIZE = 0x10000;
> > > va0 = 0xffff000020651000;
> > > page_offset = 0 = va & (page_size - 1);
> > > page = va_to_page(va);
> > > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> > > 
> > > load_process:
> > > page = xa_load(&mr->page_list, index);
> > > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> > > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> > > 
> > > Obviously, *va0 != va1*, page_offset get lost.
> > > 
> > > 
> > > How to fix:
> > > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> > > - don't allow ulp registering mr.page_size != PAGE_SIZE ?
> > 
> > Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
> 
> Normally page_size is PAGE_SIZE or the size of the whole compound page (in
> the latest kernel version, it is the size of folio). When compound page or
> folio is taken into account, the page_size is not equal to
> PAGE_SIZE.

folios are always multiples of PAGE_SIZE. rxe splits everything into
PAGE_SIZE units in the xarray.

> If the ULP uses the compound page or folio, the similar problem will occur
> again. 

No, it won't. We never store folios in the xarray.

Jason
Daisuke Matsuda (Fujitsu) Oct. 27, 2023, 1:26 a.m. UTC | #24
On Fri, Oct 27, 2023 6:48 AM Bart Van Assche wrote:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
> > On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
> >> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
> >> mean that RXE_PAGE_SIZE_CAP should be changed from
> >> 0xfffff000 into PAGE_SHIFT?

Some drivers pad upper bits with 1, and others with 0. Is there any rule?
I think both are practically allowed, but we should try to eliminate inconsistency.

> >
> > Yes
> 
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?

It looks Bob has not been involved in this thread.

Thanks,
Daisuke Matsuda

> 
> Thanks,
> 
> Bart.
Zhu Yanjun Oct. 27, 2023, 1:36 a.m. UTC | #25
在 2023/10/27 7:23, Jason Gunthorpe 写道:
> On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
>> 在 2023/10/26 19:42, Jason Gunthorpe 写道:
>>> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>>>> The root cause is that
>>>>
>>>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>>> So the offset will get lost.
>>>>
>>>> For example,
>>>> store process:
>>>> page_size = 0x1000;
>>>> PAGE_SIZE = 0x10000;
>>>> va0 = 0xffff000020651000;
>>>> page_offset = 0 = va & (page_size - 1);
>>>> page = va_to_page(va);
>>>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>>>
>>>> load_process:
>>>> page = xa_load(&mr->page_list, index);
>>>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>>>
>>>> Obviously, *va0 != va1*, page_offset get lost.
>>>>
>>>>
>>>> How to fix:
>>>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
>> Normally page_size is PAGE_SIZE or the size of the whole compound page (in
>> the latest kernel version, it is the size of folio). When compound page or
>> folio is taken into account, the page_size is not equal to
>> PAGE_SIZE.
> folios are always multiples of PAGE_SIZE. rxe splits everything into
> PAGE_SIZE units in the xarray.

Sure. Thanks. A folio consists of multiple base pages, so its size is a 
multiple of PAGE_SIZE.

This page size is set in the infiniband core and in rxe.

Hopefully no problem will occur when a folio or compound page is used in the ULP.

Zhu Yanjun

>
>> If the ULP uses the compound page or folio, the similar problem will occur
>> again.
> No, it won't. We never store folios in the xarray.
>
> Jason
Zhu Yanjun Oct. 27, 2023, 1:39 a.m. UTC | #26
在 2023/10/27 5:47, Bart Van Assche 写道:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
>> On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
>>> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
>>> mean that RXE_PAGE_SIZE_CAP should be changed from
>>> 0xfffff000 into PAGE_SHIFT?
>>
>> Yes
>
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?


Zhijian has done a lot of work on this problem and found the root cause.

Perhaps Zhijian should file a patch for this problem?

Zhu Yanjun


>
> Thanks,
>
> Bart.
>
Zhu Yanjun Oct. 27, 2023, 4:01 a.m. UTC | #27
在 2023/10/27 7:23, Jason Gunthorpe 写道:
> On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
>> 在 2023/10/26 19:42, Jason Gunthorpe 写道:
>>> On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
>>>> The root cause is that
>>>>
>>>> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>>> So the offset will get lost.
>>>>
>>>> For example,
>>>> store process:
>>>> page_size = 0x1000;
>>>> PAGE_SIZE = 0x10000;
>>>> va0 = 0xffff000020651000;
>>>> page_offset = 0 = va & (page_size - 1);
>>>> page = va_to_page(va);
>>>> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>>>
>>>> load_process:
>>>> page = xa_load(&mr->page_list, index);
>>>> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>>> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>>>
>>>> Obviously, *va0 != va1*, page_offset get lost.
>>>>
>>>>
>>>> How to fix:
>>>> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>>> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
>> Normally page_size is PAGE_SIZE or the size of the whole compound page (in
>> the latest kernel version, it is the size of folio). When compound page or
>> folio is taken into account, the page_size is not equal to
>> PAGE_SIZE.
> folios are always multiples of PAGE_SIZE. rxe splits everything into
> PAGE_SIZE units in the xarray.
>
>> If the ULP uses the compound page or folio, the similar problem will occur
>> again.
> No, it won't. We never store folios in the xarray.

Sure.

I mean, if the ULP uses a folio, the page size is set to a multiple of 
PAGE_SIZE, but in RXE the page size is set to PAGE_SIZE.

So the page size in the ULP is different from the page size in RXE.

I am not sure whether this similar problem still occurs or not.

I hope this problem will not occur even with folios.

Zhu Yanjun

>
> Jason
Zhijian Li (Fujitsu) Oct. 27, 2023, 5:43 a.m. UTC | #28
On 27/10/2023 05:47, Bart Van Assche wrote:
> On 10/26/23 06:43, Jason Gunthorpe wrote:
>> On Thu, Oct 26, 2023 at 06:28:37AM -0700, Bart Van Assche wrote:
>>> If the rxe driver only supports mr.page_size == PAGE_SIZE, does this
>>> mean that RXE_PAGE_SIZE_CAP should be changed from
>>> 0xfffff000 into PAGE_SHIFT?
>>
>> Yes
> 
> Bob, do you plan to convert the above change into a patch or do you
> perhaps expect me to do that?

I just cooked the patches; they fix the SRP problem, but NVMe still fails.





> 
> Thanks,
> 
> Bart.
>
Jason Gunthorpe Oct. 27, 2023, 11:51 a.m. UTC | #29
On Fri, Oct 27, 2023 at 12:01:47PM +0800, Zhu Yanjun wrote:
> 
> 在 2023/10/27 7:23, Jason Gunthorpe 写道:
> > On Thu, Oct 26, 2023 at 08:59:34PM +0800, Zhu Yanjun wrote:
> > > 在 2023/10/26 19:42, Jason Gunthorpe 写道:
> > > > On Thu, Oct 26, 2023 at 09:05:52AM +0000, Zhijian Li (Fujitsu) wrote:
> > > > > The root cause is that
> > > > > 
> > > > > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> > > > > So the offset will get lost.
> > > > > 
> > > > > For example,
> > > > > store process:
> > > > > page_size = 0x1000;
> > > > > PAGE_SIZE = 0x10000;
> > > > > va0 = 0xffff000020651000;
> > > > > page_offset = 0 = va & (page_size - 1);
> > > > > page = va_to_page(va);
> > > > > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> > > > > 
> > > > > load_process:
> > > > > page = xa_load(&mr->page_list, index);
> > > > > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> > > > > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> > > > > 
> > > > > Obviously, *va0 != va1*, page_offset get lost.
> > > > > 
> > > > > 
> > > > > How to fix:
> > > > > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> > > > > - don't allow ulp registering mr.page_size != PAGE_SIZE ?
> > > > Lets do the second one please. Most devices only support PAGE_SIZE anyhow.
> > > Normally page_size is PAGE_SIZE or the size of the whole compound page (in
> > > the latest kernel version, it is the size of folio). When compound page or
> > > folio is taken into account, the page_size is not equal to
> > > PAGE_SIZE.
> > folios are always multiples of PAGE_SIZE. rxe splits everything into
> > PAGE_SIZE units in the xarray.
> > 
> > > If the ULP uses the compound page or folio, the similar problem will occur
> > > again.
> > No, it won't. We never store folios in the xarray.
> 
> Sure.
> 
> I mean, in ULP, if folio is used, the page size is set to multiple
> PAGE_SIZE, but in RXE, the page size is set to PAGE_SIZE.
> 
> So the page size in ULP is different with the page size in RXE.

There is no such thing as a "page size" in the ULP. rxe is the thing
that keeps things in PAGE_SIZE units, and it should be fragmenting
whatever the ULP gives into that. The ULP must simply give virtually
contiguous runs of memory that are PAGE_SIZE aligned.

Jason
Zhu Yanjun Oct. 31, 2023, 1:36 a.m. UTC | #30
在 2023/10/26 17:05, Zhijian Li (Fujitsu) 写道:
> The root cause is that
> 
> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> So the offset will get lost.

It is an interesting root cause: page_size is 0x1000 while PAGE_SIZE is 0x10000.

Zhu Yanjun

> 
> For example,
> store process:
> page_size = 0x1000;
> PAGE_SIZE = 0x10000;
> va0 = 0xffff000020651000;
> page_offset = 0 = va & (page_size - 1);
> page = va_to_page(va);
> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
> 
> load_process:
> page = xa_load(&mr->page_list, index);
> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
> 
> Obviously, *va0 != va1*, page_offset get lost.
> 
> 
> How to fix:
> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
> 
> 
> 
> Thanks
> Zhijian
> 
> 
> On 24/10/2023 17:13, Li Zhijian wrote:
>>
>>
>> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>>
>>>
>>> On 23/10/2023 18:45, Yi Zhang wrote:
>>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>>> <lizhijian@fujitsu.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>>>>> CC Bart
>>>>>>>
>>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>>>>>
>>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>>>>> with assigning actual page size to ibmr.page_size?
>>>>>>>>
>>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>>>>> in ulp/srp layer to calculate the page_size.
>>>>>>>> =====
>>>>>>>> static int srp_add_one(struct ib_device *device)
>>>>>>>> {
>>>>>>>>              struct srp_device *srp_dev;
>>>>>>>>              struct ib_device_attr *attr = &device->attrs;
>>>>>>>> <...>
>>>>>>>>              /*
>>>>>>>>               * Use the smallest page size supported by the HCA, down to a
>>>>>>>>               * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>>>>>               * out of smaller entries.
>>>>>>>>               */
>>>>>>>>              mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>>>>>>
>>>>>>>
>>>>>>> You light me up.
>>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>>>>>
>>>>>> That doesn't seem right even in concept.>
>>>>>> I think the multi-size support in the new xarray code does not work
>>>>>> right, just looking at it makes me think it does not work right. It
>>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>>>>> explode because kmap_local_page() does only 4K.
>>>>>>
>>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>>>>>
>>>>>
>>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>>>
>>>> Hi Zhijian
>>>>
>>>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>>>> failed on my side.
>>>>
>>>
>>> Thanks for your testing.
>>> I just did that, it also failed in my environment. After a glance debug, I found there are
>>> other places still registered 4K MR. I will dig into it later.
>>
>>
>> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly.
>>
>>
>> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>> 1294                 struct nvme_rdma_request *req, struct nvme_command *c,
>> 1295                 int count)
>> 1296 {
>> 1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>> 1298         int nr;
>> 1299
>> 1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>> 1301         if (WARN_ON_ONCE(!req->mr))
>> 1302                 return -EAGAIN;
>> 1303
>> 1304         /*
>> 1305          * Align the MR to a 4K page size to match the ctrl page size and
>> 1306          * the block virtual boundary.
>> 1307          */
>> 1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>> 1309                           SZ_4K);
>>
>>
>> Anyway, i will go ahead. if you have any thought, please let me know.
>>
>>
>>
>>>
>>> Thanks
>>> Zhijian
>>>
>>>
>>>
>>>
>>>> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
>>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>>>        runtime  12.179s  ...  11.941s
>>>>        --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>>>        +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>>>> 05:52:27.882759168 -0400
>>>>        @@ -1,3 +1,3 @@
>>>>         Running nvme/003
>>>>        -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>>>        +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>>>         Test complete
>>>>
>>>> [ 7033.431910] rdma_rxe: loaded
>>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>>> [ 7046.317861] rdma_rxe: unloaded
>>>>
>>>>
>>>>>
>>>>>>>> import ctypes
>>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>>>>>> hex(65536)
>>>>> '0x10000'
>>>>>>>> libc.ffs(0x10000) - 1
>>>>> 16
>>>>>>>> 1 << 16
>>>>> 65536
>>>>>
>>>>> so
>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>>>>
>>>>>
>>>>> So I think Daisuke's patch should work as well.
>>>>>
>>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>>>>
>>>>>
>>>>>> Jason
>>>>
>>>>
>>>>
>>>> -- 
>>>> Best Regards,
>>>>      Yi Zhang
Greg Sword Oct. 31, 2023, 8:14 a.m. UTC | #31
On Tue, Oct 31, 2023 at 3:53 PM Greg Sword <gregsword0@gmail.com> wrote:
>
>
>
> On Tue, Oct 31, 2023 at 2:03 PM Greg Sword <gregsword0@gmail.com> wrote:
>>
>>
>>
>>
>> On Tue, Oct 31, 2023 at 9:40 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>>
>>> 在 2023/10/26 17:05, Zhijian Li (Fujitsu) 写道:
>>> > The root cause is that
>>> >
>>> > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>> > So the offset will get lost.
>>>
>>> It is interesting root cause.
>>> page_size is 0x1000, PAGE_SIZE 0x10000.
>>>
>>> Zhu Yanjun
>>>
>>> >
>>> > For example,
>>> > store process:
>>> > page_size = 0x1000;
>>> > PAGE_SIZE = 0x10000;
>>
>>
>> page_size is 0x1000, PAGE_SIZE is 0x10000.
>>
>> Is it correct? page_size should be PAGE_SIZE.
>> That is, page_size should be greater than PAGE_SIZE.
>> It is not reasonable now.
>>
>> The root cause is to find out why page_size is set to 4K (0x1000). On an ARM64 host with CONFIG_ARM64_64K_PAGES enabled,
>> the page_size should not be 4K; it should be 64K.
>>
>> So Zhijian's commit seems not to be correct.
>
>

This did not reach linux-rdma. Resending it again.

> Because rxe splits everything into PAGE_SIZE units in the xarray, it should only support the PAGE_SIZE page size. With CONFIG_ARM64_64K_PAGES,
> PAGE_SIZE is 64K.
>
> But on the NVMe side, the page size is 4K, not 64K.
>
> In the function ib_map_mr_sg, the page size is fixed to SZ_4K. So the root cause is not in rxe. Before rxe is called, the capability of rxe should be checked.
>
> static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>                 struct nvme_rdma_request *req, struct nvme_command *c,
>                 int count)
> {
>         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>         int nr;
>
>         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>         if (WARN_ON_ONCE(!req->mr))
>                 return -EAGAIN;
>
>         /*
>          * Align the MR to a 4K page size to match the ctrl page size and
>          * the block virtual boundary.
>          */
>         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>                           SZ_4K);
>         if (unlikely(nr < count)) {
>                 ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>                 req->mr = NULL;
>                 if (nr < 0)
>                         return nr;
>                 return -EINVAL;
>         }
>
>         ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>
>>
>>
>>
>>>
>>> > va0 = 0xffff000020651000;
>>> > page_offset = 0 = va & (page_size - 1);
>>> > page = va_to_page(va);
>>> > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>> >
>>> > load_process:
>>> > page = xa_load(&mr->page_list, index);
>>> > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>> > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>> >
>>> > Obviously, *va0 != va1*, page_offset get lost.
>>> >
>>> >
>>> > How to fix:
>>> > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>> > - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> >
>>> >
>>> >
>>> > Thanks
>>> > Zhijian
>>> >
>>> >
>>> > On 24/10/2023 17:13, Li Zhijian wrote:
>>> >>
>>> >>
>>> >> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>> >>>
>>> >>>
>>> >>> On 23/10/2023 18:45, Yi Zhang wrote:
>>> >>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>> >>>> <lizhijian@fujitsu.com> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>> >>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>> >>>>>>> CC Bart
>>> >>>>>>>
>>> >>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>> >>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>> >>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>> >>>>>>>>>
>>> >>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>> >>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>> >>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>> >>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>> >>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>> >>>>>>>> with assigning actual page size to ibmr.page_size?
>>> >>>>>>>>
>>> >>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>> >>>>>>>> in ulp/srp layer to calculate the page_size.
>>> >>>>>>>> =====
>>> >>>>>>>> static int srp_add_one(struct ib_device *device)
>>> >>>>>>>> {
>>> >>>>>>>>              struct srp_device *srp_dev;
>>> >>>>>>>>              struct ib_device_attr *attr = &device->attrs;
>>> >>>>>>>> <...>
>>> >>>>>>>>              /*
>>> >>>>>>>>               * Use the smallest page size supported by the HCA, down to a
>>> >>>>>>>>               * minimum of 4096 bytes. We're unlikely to build large sglists
>>> >>>>>>>>               * out of smaller entries.
>>> >>>>>>>>               */
>>> >>>>>>>>              mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> You light me up.
>>> >>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>> >>>>>>
>>> >>>>>> That doesn't seem right even in concept.>
>>> >>>>>> I think the multi-size support in the new xarray code does not work
>>> >>>>>> right, just looking at it makes me think it does not work right. It
>>> >>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>> >>>>>> explode because kmap_local_page() does only 4K.
>>> >>>>>>
>>> >>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>> >>>>>>
>>> >>>>>
>>> >>>>> Yeah, this should work (even though I only verified it by hardcoding mr_page_shift to PAGE_SHIFT).
>>> >>>>
>>> >>>> Hi Zhijian
>>> >>>>
>>> >>>> Did you try blktests nvme/rdma use_rxe in your environment? It still
>>> >>>> fails on my side.
>>> >>>>
>>> >>>
>>> >>> Thanks for your testing.
>>> >>> I just did that; it also failed in my environment. After a quick debug, I found
>>> >>> there are other places that still register a 4K MR. I will dig into it later.
>>> >>
>>> >>
>>> >> nvme intends to register a 4K MR, but it should work IMO; at least RXE has handled it correctly.
>>> >>
>>> >>
>>> >> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>>> >> 1294                 struct nvme_rdma_request *req, struct nvme_command *c,
>>> >> 1295                 int count)
>>> >> 1296 {
>>> >> 1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> >> 1298         int nr;
>>> >> 1299
>>> >> 1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>>> >> 1301         if (WARN_ON_ONCE(!req->mr))
>>> >> 1302                 return -EAGAIN;
>>> >> 1303
>>> >> 1304         /*
>>> >> 1305          * Align the MR to a 4K page size to match the ctrl page size and
>>> >> 1306          * the block virtual boundary.
>>> >> 1307          */
>>> >> 1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>>> >> 1309                           SZ_4K);
>>> >>
>>> >>
>>> >> Anyway, I will go ahead. If you have any thoughts, please let me know.
>>> >>
>>> >>
>>> >>
>>> >>>
>>> >>> Thanks
>>> >>> Zhijian
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
>>> >>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>> >>>>        runtime  12.179s  ...  11.941s
>>> >>>>        --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>> >>>>        +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>>> >>>> 05:52:27.882759168 -0400
>>> >>>>        @@ -1,3 +1,3 @@
>>> >>>>         Running nvme/003
>>> >>>>        -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>> >>>>        +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>> >>>>         Test complete
>>> >>>>
>>> >>>> [ 7033.431910] rdma_rxe: loaded
>>> >>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>> >>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>> >>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>> >>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>> >>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>> >>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> >>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>> >>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>> >>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>> >>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>> >>>> [ 7046.317861] rdma_rxe: unloaded
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>>>>> import ctypes
>>> >>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>> >>>>>>>> hex(65536)
>>> >>>>> '0x10000'
>>> >>>>>>>> libc.ffs(0x10000) - 1
>>> >>>>> 16
>>> >>>>>>>> 1 << 16
>>> >>>>> 65536
>>> >>>>>
>>> >>>>> so
>>> >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>> >>>>>
>>> >>>>>
>>> >>>>> So I think Daisuke's patch should work as well.
>>> >>>>>
>>> >>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>> >>>>>
>>> >>>>>
>>> >>>>>> Jason
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Best Regards,
>>> >>>>      Yi Zhang
>>>

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index f54042e9aeb2..dc9d31bfb689 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -58,7 +58,6 @@  static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->rkey = mr->ibmr.rkey = key;
 
 	mr->access = access;
-	mr->ibmr.page_size = PAGE_SIZE;
 	mr->page_mask = PAGE_MASK;
 	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
@@ -79,7 +78,7 @@  static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
 
 static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
 {
-	return iova & (mr_page_size(mr) - 1);
+	return iova & (PAGE_SIZE - 1);
 }
 
 static bool is_pmem_page(struct page *pg)
@@ -232,7 +231,7 @@  int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	unsigned int page_size = mr_page_size(mr);
+	unsigned int page_size = ibmr->page_size;
 
 	mr->nbuf = 0;
 	mr->page_shift = ilog2(page_size);
@@ -256,8 +255,7 @@  static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
 		if (!page)
 			return -EFAULT;
 
-		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 		va = kmap_local_page(page);
 		if (dir == RXE_FROM_MR_OBJ)
 			memcpy(addr, va + page_offset, bytes);
@@ -450,8 +448,7 @@  int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
 		if (!page)
 			return -EFAULT;
-		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 
 		va = kmap_local_page(page);
 		arch_wb_cache_pmem(va + page_offset, bytes);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index ccb9d19ffe8a..9fcaa974e079 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -319,11 +319,6 @@  struct rxe_mr {
 	struct xarray		page_list;
 };
 
-static inline unsigned int mr_page_size(struct rxe_mr *mr)
-{
-	return mr ? mr->ibmr.page_size : PAGE_SIZE;
-}
-
 enum rxe_mw_state {
 	RXE_MW_STATE_INVALID	= RXE_MR_STATE_INVALID,
 	RXE_MW_STATE_FREE	= RXE_MR_STATE_FREE,