[v3,3/3] nvmet-rdma: support 16K inline data

Message ID	55c82aa5730c05017a58641ad9550b6c0f0e16b2.1527618402.git.swise@opengridcomputing.com (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <linux-rdma-owner@kernel.org> Message-Id: <55c82aa5730c05017a58641ad9550b6c0f0e16b2.1527618402.git.swise@opengridcomputing.com> In-Reply-To: <cover.1527618402.git.swise@opengridcomputing.com> References: <cover.1527618402.git.swise@opengridcomputing.com> From: Steve Wise <swise@opengridcomputing.com> Date: Tue, 29 May 2018 11:25:58 -0700 Subject: [PATCH v3 3/3] nvmet-rdma: support 16K inline data To: axboe@kernel.dk, hch@lst.de, keith.busch@intel.com, sagi@grimberg.me, linux-nvme@lists.infradead.org Cc: parav@mellanox.com, maxg@mellanox.com, linux-rdma@vger.kernel.org Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk

Steve Wise May 29, 2018, 6:25 p.m. UTC

Add a new configfs port attribute, called inline_data_size,
to allow configuring the size of inline data for a given port.
The maximum size allowed is still enforced by nvmet-rdma with
NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to max(16KB,
PAGE_SIZE).  And the default size, if not specified via configfs,
is still PAGE_SIZE.  This preserves the existing behavior, but allows
larger inline sizes.

Also support a configuration where inline_data_size is 0, which disables
using inline data.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/nvme/target/admin-cmd.c |  4 ++--
 drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
 drivers/nvme/target/core.c      |  4 ++++
 drivers/nvme/target/discovery.c |  2 +-
 drivers/nvme/target/nvmet.h     |  2 +-
 drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-------------
 6 files changed, 70 insertions(+), 18 deletions(-)

Christoph Lameter (Ampere) May 30, 2018, 3:49 p.m. UTC | #1

On Tue, 29 May 2018, Steve Wise wrote:

> @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>  	c->sge[0].length = sizeof(*c->nvme_cmd);
>  	c->sge[0].lkey = ndev->pd->local_dma_lkey;
>
> -	if (!admin) {
> +	if (!admin && inline_data_size) {
>  		c->inline_page = alloc_pages(GFP_KERNEL,
> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> +				get_order(inline_data_size));

Now we do higher order allocations here. This means that the allocation
can fail if system memory is highly fragmented. And the allocations can no
longer be satisfied from the per cpu caches. So allocation performance
will drop.

>  		if (!c->inline_page)
>  			goto out_unmap_cmd;

Maybe think about some sort of fallback to vmalloc or so if this
alloc fails?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise May 30, 2018, 4:46 p.m. UTC | #2

On 5/30/2018 10:49 AM, Christopher Lameter wrote:
> On Tue, 29 May 2018, Steve Wise wrote:
>
>> @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>>  	c->sge[0].length = sizeof(*c->nvme_cmd);
>>  	c->sge[0].lkey = ndev->pd->local_dma_lkey;
>>
>> -	if (!admin) {
>> +	if (!admin && inline_data_size) {
>>  		c->inline_page = alloc_pages(GFP_KERNEL,
>> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
>> +				get_order(inline_data_size));
> Now we do higher order allocations here. This means that the allocation
> can fail if system memory is highly fragmented. And the allocations can no
> longer be satisfied from the per cpu caches. So allocation performance
> will drop.

Yes.

>>  		if (!c->inline_page)
>>  			goto out_unmap_cmd;
> Maybe think about some sort of fallback to vmalloc or so if this
> alloc fails?

The memory needs to be physically contiguous and will be mapped for DMA,
so vmalloc() won't work. 

I could complicate the design and allocate a scatter gather table for
this memory, and then register it into a single RDMA MR.   That would
allow allocating non-contiguous pages.  But is that complication worth
it here?

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) May 30, 2018, 5:02 p.m. UTC | #3

On Wed, 30 May 2018, Steve Wise wrote:

> > Now we do higher order allocations here. This means that the allocation
> > can fail if system memory is highly fragmented. And the allocations can no
> > longer be satisfied from the per cpu caches. So allocation performance
> > will drop.
>
> Yes.

Maybe do some test with fragmented memory?

> >>  		if (!c->inline_page)
> >>  			goto out_unmap_cmd;
> > Maybe think about some sort of fallback to vmalloc or so if this
> > alloc fails?
>
> The memory needs to be physically contiguous and will be mapped for DMA,
> so vmalloc() won't work. 
>
> I could complicate the design and allocate a scatter gather table for
> this memory, and then register it into a single RDMA MR.   That would
> allow allocating non-contiguous pages.  But is that complication worth
> it here?

Not sure..... Just saw the higher alloc and I have seen what other
subsystems went through when dealing with larger order allocations

Sagi Grimberg May 30, 2018, 9:45 p.m. UTC | #4

>> @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>>   	c->sge[0].length = sizeof(*c->nvme_cmd);
>>   	c->sge[0].lkey = ndev->pd->local_dma_lkey;
>>
>> -	if (!admin) {
>> +	if (!admin && inline_data_size) {
>>   		c->inline_page = alloc_pages(GFP_KERNEL,
>> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
>> +				get_order(inline_data_size));
> 
> Now we do higher order allocations here. This means that the allocation
> can fail if system memory is highly fragmented. And the allocations can no
> longer be satisfied from the per cpu caches. So allocation performance
> will drop.

That was my first thought as well. I'm not too keen on having
higher-order allocations on this, not at all. nvmet-rdma will
allocate a whole bunch of those. I think we should try to be
good citizens. I don't think its too complicated to do is it?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise May 30, 2018, 9:52 p.m. UTC | #5

On 5/30/2018 4:45 PM, Sagi Grimberg wrote:
>
>>> @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct
>>> nvmet_rdma_device *ndev,
>>>       c->sge[0].length = sizeof(*c->nvme_cmd);
>>>       c->sge[0].lkey = ndev->pd->local_dma_lkey;
>>>
>>> -    if (!admin) {
>>> +    if (!admin && inline_data_size) {
>>>           c->inline_page = alloc_pages(GFP_KERNEL,
>>> -                get_order(NVMET_RDMA_INLINE_DATA_SIZE));
>>> +                get_order(inline_data_size));
>>
>> Now we do higher order allocations here. This means that the allocation
>> can fail if system memory is highly fragmented. And the allocations
>> can no
>> longer be satisfied from the per cpu caches. So allocation performance
>> will drop.
>
> That was my first thought as well. I'm not too keen on having
> higher-order allocations on this, not at all. nvmet-rdma will
> allocate a whole bunch of those. I think we should try to be
> good citizens. I don't think its too complicated to do is it?

Its ugly because registering an MR for the inline pages requires a
connected QP to use the REG_MR WR.  Maybe I'll just split the needed
pages across the remaining recv sge entries available?  IE if the device
supports 5 recv sges, then 4 can be used for inline, and thus 4
non-contiguous pages could be used.  cxgb4, however, only support 4 recv
sges, so it would only support 12K of inline with this implementation. 
And perhaps there are rdma devices with even fewer recv sges?

Do you have any other idea on how to avoid higher-order allocations?

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg May 30, 2018, 10:13 p.m. UTC | #6

> Its ugly because registering an MR for the inline pages requires a
> connected QP to use the REG_MR WR.

No, why MR?

> Maybe I'll just split the needed
> pages across the remaining recv sge entries available?  IE if the device
> supports 5 recv sges, then 4 can be used for inline, and thus 4
> non-contiguous pages could be used.

That's what I had in mind...

> cxgb4, however, only support 4 recv
> sges, so it would only support 12K of inline with this implementation.

:(

> And perhaps there are rdma devices with even fewer recv sges?
> 
> Do you have any other idea on how to avoid higher-order allocations?

Not really. I guess that we should be able to recv the user
inline_data_size. So maybe it would be ok if the we'd do that only if
we must?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise May 30, 2018, 10:26 p.m. UTC | #7

On 5/30/2018 5:13 PM, Sagi Grimberg wrote:
>
>> Its ugly because registering an MR for the inline pages requires a
>> connected QP to use the REG_MR WR.
>
> No, why MR?
>
>> Maybe I'll just split the needed
>> pages across the remaining recv sge entries available?  IE if the device
>> supports 5 recv sges, then 4 can be used for inline, and thus 4
>> non-contiguous pages could be used.
>
> That's what I had in mind...
>
>> cxgb4, however, only support 4 recv
>> sges, so it would only support 12K of inline with this implementation.
>
> :(
>
>> And perhaps there are rdma devices with even fewer recv sges?
>>
>> Do you have any other idea on how to avoid higher-order allocations?
>
> Not really. I guess that we should be able to recv the user
> inline_data_size. So maybe it would be ok if the we'd do that only if
> we must?

I'll explore this idea.   I also thought about just reducing the inline
data size based on the number of sges available.  But I think the inline
data size has already been advertised to the host side at the time the
target is allocating these pages. :(

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Max Gurtovoy June 3, 2018, 8:39 a.m. UTC | #8

On 5/29/2018 9:25 PM, Steve Wise wrote:
> Add a new configfs port attribute, called inline_data_size,
> to allow configuring the size of inline data for a given port.
> The maximum size allowed is still enforced by nvmet-rdma with
> NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to max(16KB,
> PAGE_SIZE).  And the default size, if not specified via configfs,
> is still PAGE_SIZE.  This preserves the existing behavior, but allows
> larger inline sizes.
> 
> Also support a configuration where inline_data_size is 0, which disables
> using inline data.
> 
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
>   drivers/nvme/target/admin-cmd.c |  4 ++--
>   drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
>   drivers/nvme/target/core.c      |  4 ++++
>   drivers/nvme/target/discovery.c |  2 +-
>   drivers/nvme/target/nvmet.h     |  2 +-
>   drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-------------
>   6 files changed, 70 insertions(+), 18 deletions(-)

snip..

> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 52e0c5d..2f0b08e 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -33,9 +33,10 @@
>   #include "nvmet.h"
>   
>   /*
> - * We allow up to a page of inline data to go with the SQE
> + * We allow at least 1 page, and up to 16KB of inline data to go with the SQE
>    */
> -#define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
> +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE	PAGE_SIZE
> +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int, SZ_16K, PAGE_SIZE)

why not use SZ_16K ? why we need to mention the PAGE_SIZE ?

>   
>   struct nvmet_rdma_cmd {
>   	struct ib_sge		sge[2];
> @@ -116,6 +117,7 @@ struct nvmet_rdma_device {
>   	size_t			srq_size;
>   	struct kref		ref;
>   	struct list_head	entry;
> +	int			inline_data_size;
>   };
>   
>   static bool nvmet_rdma_use_srq;
> @@ -187,6 +189,8 @@ static inline bool nvmet_rdma_need_data_out(struct nvmet_rdma_rsp *rsp)
>   static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>   			struct nvmet_rdma_cmd *c, bool admin)
>   {
> +	int inline_data_size = ndev->inline_data_size;
> +
>   	/* NVMe command / RDMA RECV */
>   	c->nvme_cmd = kmalloc(sizeof(*c->nvme_cmd), GFP_KERNEL);
>   	if (!c->nvme_cmd)
> @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>   	c->sge[0].length = sizeof(*c->nvme_cmd);
>   	c->sge[0].lkey = ndev->pd->local_dma_lkey;
>   
> -	if (!admin) {
> +	if (!admin && inline_data_size) {
>   		c->inline_page = alloc_pages(GFP_KERNEL,
> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> +				get_order(inline_data_size));
>   		if (!c->inline_page)
>   			goto out_unmap_cmd;
>   		c->sge[1].addr = ib_dma_map_page(ndev->device,
> -				c->inline_page, 0, NVMET_RDMA_INLINE_DATA_SIZE,
> +				c->inline_page, 0, inline_data_size,
>   				DMA_FROM_DEVICE);
>   		if (ib_dma_mapping_error(ndev->device, c->sge[1].addr))
>   			goto out_free_inline_page;
> -		c->sge[1].length = NVMET_RDMA_INLINE_DATA_SIZE;
> +		c->sge[1].length = inline_data_size;
>   		c->sge[1].lkey = ndev->pd->local_dma_lkey;
>   	}
>   
> @@ -225,7 +229,7 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>   out_free_inline_page:
>   	if (!admin) {
>   		__free_pages(c->inline_page,
> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> +				get_order(inline_data_size));
>   	}
>   out_unmap_cmd:
>   	ib_dma_unmap_single(ndev->device, c->sge[0].addr,
> @@ -240,11 +244,13 @@ static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
>   static void nvmet_rdma_free_cmd(struct nvmet_rdma_device *ndev,
>   		struct nvmet_rdma_cmd *c, bool admin)
>   {
> -	if (!admin) {
> +	int inline_data_size = ndev->inline_data_size;
> +
> +	if (!admin && inline_data_size) {
>   		ib_dma_unmap_page(ndev->device, c->sge[1].addr,
> -				NVMET_RDMA_INLINE_DATA_SIZE, DMA_FROM_DEVICE);
> +				inline_data_size, DMA_FROM_DEVICE);
>   		__free_pages(c->inline_page,
> -				get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> +				get_order(inline_data_size));
>   	}
>   	ib_dma_unmap_single(ndev->device, c->sge[0].addr,
>   				sizeof(*c->nvme_cmd), DMA_FROM_DEVICE);
> @@ -544,7 +550,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
>   	if (!nvme_is_write(rsp->req.cmd))
>   		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
>   
> -	if (off + len > NVMET_RDMA_INLINE_DATA_SIZE) {
> +	if (off + len > rsp->queue->dev->inline_data_size) {
>   		pr_err("invalid inline data offset!\n");
>   		return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
>   	}
> @@ -793,6 +799,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>   static struct nvmet_rdma_device *
>   nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
>   {
> +	struct nvmet_port *port = cm_id->context;
>   	struct nvmet_rdma_device *ndev;
>   	int ret;
>   
> @@ -807,6 +814,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
>   	if (!ndev)
>   		goto out_err;
>   
> +	ndev->inline_data_size = port->inline_data_size;
>   	ndev->device = cm_id->device;
>   	kref_init(&ndev->ref);
>   
> @@ -1379,6 +1387,15 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
>   		return -EINVAL;
>   	}
>   
> +	if (port->inline_data_size < 0) {
> +		port->inline_data_size = NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE;
> +	} else if (port->inline_data_size > NVMET_RDMA_MAX_INLINE_DATA_SIZE) {
> +		pr_err("invalid inline_data_size %d (max supported is %u)\n",
> +			port->inline_data_size,
> +			NVMET_RDMA_MAX_INLINE_DATA_SIZE);
> +		return -EINVAL;
> +	}
> +
>   	ret = inet_pton_with_scope(&init_net, af, port->disc_addr.traddr,
>   			port->disc_addr.trsvcid, &addr);
>   	if (ret) {
> @@ -1418,8 +1435,9 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
>   		goto out_destroy_id;
>   	}
>   
> -	pr_info("enabling port %d (%pISpcs)\n",
> -		le16_to_cpu(port->disc_addr.portid), (struct sockaddr *)&addr);
> +	pr_info("enabling port %d (%pISpcs) inline_data_size %d\n",
> +		le16_to_cpu(port->disc_addr.portid), (struct sockaddr *)&addr,
> +		port->inline_data_size);
>   	port->priv = cm_id;
>   	return 0;
>   
> @@ -1456,7 +1474,6 @@ static void nvmet_rdma_disc_port_addr(struct nvmet_req *req,
>   static const struct nvmet_fabrics_ops nvmet_rdma_ops = {
>   	.owner			= THIS_MODULE,
>   	.type			= NVMF_TRTYPE_RDMA,
> -	.sqe_inline_size	= NVMET_RDMA_INLINE_DATA_SIZE,
>   	.msdbd			= 1,
>   	.has_keyed_sgls		= 1,
>   	.add_port		= nvmet_rdma_add_port,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise June 3, 2018, 6:25 p.m. UTC | #9

> -----Original Message-----
> From: Max Gurtovoy <maxg@mellanox.com>
> Sent: Sunday, June 3, 2018 3:40 AM
> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
> nvme@lists.infradead.org
> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
> 
> 
> 
> On 5/29/2018 9:25 PM, Steve Wise wrote:
> > Add a new configfs port attribute, called inline_data_size,
> > to allow configuring the size of inline data for a given port.
> > The maximum size allowed is still enforced by nvmet-rdma with
> > NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to max(16KB,
> > PAGE_SIZE).  And the default size, if not specified via configfs,
> > is still PAGE_SIZE.  This preserves the existing behavior, but allows
> > larger inline sizes.
> >
> > Also support a configuration where inline_data_size is 0, which disables
> > using inline data.
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> > ---
> >   drivers/nvme/target/admin-cmd.c |  4 ++--
> >   drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
> >   drivers/nvme/target/core.c      |  4 ++++
> >   drivers/nvme/target/discovery.c |  2 +-
> >   drivers/nvme/target/nvmet.h     |  2 +-
> >   drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-----
> --------
> >   6 files changed, 70 insertions(+), 18 deletions(-)
> 
> snip..
> 
> >
> > diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> > index 52e0c5d..2f0b08e 100644
> > --- a/drivers/nvme/target/rdma.c
> > +++ b/drivers/nvme/target/rdma.c
> > @@ -33,9 +33,10 @@
> >   #include "nvmet.h"
> >
> >   /*
> > - * We allow up to a page of inline data to go with the SQE
> > + * We allow at least 1 page, and up to 16KB of inline data to go with
the
> SQE
> >    */
> > -#define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
> > +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE	PAGE_SIZE
> > +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int,
> SZ_16K, PAGE_SIZE)
> 
> why not use SZ_16K ? why we need to mention the PAGE_SIZE ?
>

The idea is to allow at least 1 page.  So for, say, a 64K page system, we'll
allow 64K since we're allocating a page minimum for the buffer.

 
> >
> >   struct nvmet_rdma_cmd {
> >   	struct ib_sge		sge[2];
> > @@ -116,6 +117,7 @@ struct nvmet_rdma_device {
> >   	size_t			srq_size;
> >   	struct kref		ref;
> >   	struct list_head	entry;
> > +	int			inline_data_size;
> >   };
> >
> >   static bool nvmet_rdma_use_srq;
> > @@ -187,6 +189,8 @@ static inline bool
> nvmet_rdma_need_data_out(struct nvmet_rdma_rsp *rsp)
> >   static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
> >   			struct nvmet_rdma_cmd *c, bool admin)
> >   {
> > +	int inline_data_size = ndev->inline_data_size;
> > +
> >   	/* NVMe command / RDMA RECV */
> >   	c->nvme_cmd = kmalloc(sizeof(*c->nvme_cmd), GFP_KERNEL);
> >   	if (!c->nvme_cmd)
> > @@ -200,17 +204,17 @@ static int nvmet_rdma_alloc_cmd(struct
> nvmet_rdma_device *ndev,
> >   	c->sge[0].length = sizeof(*c->nvme_cmd);
> >   	c->sge[0].lkey = ndev->pd->local_dma_lkey;
> >
> > -	if (!admin) {
> > +	if (!admin && inline_data_size) {
> >   		c->inline_page = alloc_pages(GFP_KERNEL,
> > -
> 	get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> > +				get_order(inline_data_size));
> >   		if (!c->inline_page)
> >   			goto out_unmap_cmd;
> >   		c->sge[1].addr = ib_dma_map_page(ndev->device,
> > -				c->inline_page, 0,
> NVMET_RDMA_INLINE_DATA_SIZE,
> > +				c->inline_page, 0, inline_data_size,
> >   				DMA_FROM_DEVICE);
> >   		if (ib_dma_mapping_error(ndev->device, c->sge[1].addr))
> >   			goto out_free_inline_page;
> > -		c->sge[1].length = NVMET_RDMA_INLINE_DATA_SIZE;
> > +		c->sge[1].length = inline_data_size;
> >   		c->sge[1].lkey = ndev->pd->local_dma_lkey;
> >   	}
> >
> > @@ -225,7 +229,7 @@ static int nvmet_rdma_alloc_cmd(struct
> nvmet_rdma_device *ndev,
> >   out_free_inline_page:
> >   	if (!admin) {
> >   		__free_pages(c->inline_page,
> > -
> 	get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> > +				get_order(inline_data_size));
> >   	}
> >   out_unmap_cmd:
> >   	ib_dma_unmap_single(ndev->device, c->sge[0].addr,
> > @@ -240,11 +244,13 @@ static int nvmet_rdma_alloc_cmd(struct
> nvmet_rdma_device *ndev,
> >   static void nvmet_rdma_free_cmd(struct nvmet_rdma_device *ndev,
> >   		struct nvmet_rdma_cmd *c, bool admin)
> >   {
> > -	if (!admin) {
> > +	int inline_data_size = ndev->inline_data_size;
> > +
> > +	if (!admin && inline_data_size) {
> >   		ib_dma_unmap_page(ndev->device, c->sge[1].addr,
> > -				NVMET_RDMA_INLINE_DATA_SIZE,
> DMA_FROM_DEVICE);
> > +				inline_data_size, DMA_FROM_DEVICE);
> >   		__free_pages(c->inline_page,
> > -
> 	get_order(NVMET_RDMA_INLINE_DATA_SIZE));
> > +				get_order(inline_data_size));
> >   	}
> >   	ib_dma_unmap_single(ndev->device, c->sge[0].addr,
> >   				sizeof(*c->nvme_cmd),
> DMA_FROM_DEVICE);
> > @@ -544,7 +550,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct
> nvmet_rdma_rsp *rsp)
> >   	if (!nvme_is_write(rsp->req.cmd))
> >   		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
> >
> > -	if (off + len > NVMET_RDMA_INLINE_DATA_SIZE) {
> > +	if (off + len > rsp->queue->dev->inline_data_size) {
> >   		pr_err("invalid inline data offset!\n");
> >   		return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
> >   	}
> > @@ -793,6 +799,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
> >   static struct nvmet_rdma_device *
> >   nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
> >   {
> > +	struct nvmet_port *port = cm_id->context;
> >   	struct nvmet_rdma_device *ndev;
> >   	int ret;
> >
> > @@ -807,6 +814,7 @@ static void nvmet_rdma_free_dev(struct kref *ref)
> >   	if (!ndev)
> >   		goto out_err;
> >
> > +	ndev->inline_data_size = port->inline_data_size;
> >   	ndev->device = cm_id->device;
> >   	kref_init(&ndev->ref);
> >
> > @@ -1379,6 +1387,15 @@ static int nvmet_rdma_add_port(struct
> nvmet_port *port)
> >   		return -EINVAL;
> >   	}
> >
> > +	if (port->inline_data_size < 0) {
> > +		port->inline_data_size =
> NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE;
> > +	} else if (port->inline_data_size >
> NVMET_RDMA_MAX_INLINE_DATA_SIZE) {
> > +		pr_err("invalid inline_data_size %d (max supported is
> %u)\n",
> > +			port->inline_data_size,
> > +			NVMET_RDMA_MAX_INLINE_DATA_SIZE);
> > +		return -EINVAL;
> > +	}
> > +
> >   	ret = inet_pton_with_scope(&init_net, af, port->disc_addr.traddr,
> >   			port->disc_addr.trsvcid, &addr);
> >   	if (ret) {
> > @@ -1418,8 +1435,9 @@ static int nvmet_rdma_add_port(struct
> nvmet_port *port)
> >   		goto out_destroy_id;
> >   	}
> >
> > -	pr_info("enabling port %d (%pISpcs)\n",
> > -		le16_to_cpu(port->disc_addr.portid), (struct sockaddr
> *)&addr);
> > +	pr_info("enabling port %d (%pISpcs) inline_data_size %d\n",
> > +		le16_to_cpu(port->disc_addr.portid), (struct sockaddr
> *)&addr,
> > +		port->inline_data_size);
> >   	port->priv = cm_id;
> >   	return 0;
> >
> > @@ -1456,7 +1474,6 @@ static void nvmet_rdma_disc_port_addr(struct
> nvmet_req *req,
> >   static const struct nvmet_fabrics_ops nvmet_rdma_ops = {
> >   	.owner			= THIS_MODULE,
> >   	.type			= NVMF_TRTYPE_RDMA,
> > -	.sqe_inline_size	= NVMET_RDMA_INLINE_DATA_SIZE,
> >   	.msdbd			= 1,
> >   	.has_keyed_sgls		= 1,
> >   	.add_port		= nvmet_rdma_add_port,
> >

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Max Gurtovoy June 4, 2018, 1:58 p.m. UTC | #10

On 6/3/2018 9:25 PM, Steve Wise wrote:
> 
> 
>> -----Original Message-----
>> From: Max Gurtovoy <maxg@mellanox.com>
>> Sent: Sunday, June 3, 2018 3:40 AM
>> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
>> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
>> nvme@lists.infradead.org
>> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
>> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
>>
>>
>>
>> On 5/29/2018 9:25 PM, Steve Wise wrote:
>>> Add a new configfs port attribute, called inline_data_size,
>>> to allow configuring the size of inline data for a given port.
>>> The maximum size allowed is still enforced by nvmet-rdma with
>>> NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to max(16KB,
>>> PAGE_SIZE).  And the default size, if not specified via configfs,
>>> is still PAGE_SIZE.  This preserves the existing behavior, but allows
>>> larger inline sizes.
>>>
>>> Also support a configuration where inline_data_size is 0, which disables
>>> using inline data.
>>>
>>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
>>> ---
>>>    drivers/nvme/target/admin-cmd.c |  4 ++--
>>>    drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
>>>    drivers/nvme/target/core.c      |  4 ++++
>>>    drivers/nvme/target/discovery.c |  2 +-
>>>    drivers/nvme/target/nvmet.h     |  2 +-
>>>    drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-----
>> --------
>>>    6 files changed, 70 insertions(+), 18 deletions(-)
>>
>> snip..
>>
>>>
>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>> index 52e0c5d..2f0b08e 100644
>>> --- a/drivers/nvme/target/rdma.c
>>> +++ b/drivers/nvme/target/rdma.c
>>> @@ -33,9 +33,10 @@
>>>    #include "nvmet.h"
>>>
>>>    /*
>>> - * We allow up to a page of inline data to go with the SQE
>>> + * We allow at least 1 page, and up to 16KB of inline data to go with
> the
>> SQE
>>>     */
>>> -#define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
>>> +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE	PAGE_SIZE
>>> +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int,
>> SZ_16K, PAGE_SIZE)
>>
>> why not use SZ_16K ? why we need to mention the PAGE_SIZE ?
>>
> 
> The idea is to allow at least 1 page.  So for, say, a 64K page system, we'll
> allow 64K since we're allocating a page minimum for the buffer.

IMO you want to support upto 16K inline data and not upto 64k (also in 
PowerPC system)...

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise June 4, 2018, 2:18 p.m. UTC | #11

> -----Original Message-----
> From: Max Gurtovoy <maxg@mellanox.com>
> Sent: Monday, June 4, 2018 8:58 AM
> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
> nvme@lists.infradead.org
> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
> 
> 
> 
> On 6/3/2018 9:25 PM, Steve Wise wrote:
> >
> >
> >> -----Original Message-----
> >> From: Max Gurtovoy <maxg@mellanox.com>
> >> Sent: Sunday, June 3, 2018 3:40 AM
> >> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
> >> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
> >> nvme@lists.infradead.org
> >> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
> >> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
> >>
> >>
> >>
> >> On 5/29/2018 9:25 PM, Steve Wise wrote:
> >>> Add a new configfs port attribute, called inline_data_size,
> >>> to allow configuring the size of inline data for a given port.
> >>> The maximum size allowed is still enforced by nvmet-rdma with
> >>> NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to
> max(16KB,
> >>> PAGE_SIZE).  And the default size, if not specified via configfs,
> >>> is still PAGE_SIZE.  This preserves the existing behavior, but allows
> >>> larger inline sizes.
> >>>
> >>> Also support a configuration where inline_data_size is 0, which
disables
> >>> using inline data.
> >>>
> >>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> >>> ---
> >>>    drivers/nvme/target/admin-cmd.c |  4 ++--
> >>>    drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
> >>>    drivers/nvme/target/core.c      |  4 ++++
> >>>    drivers/nvme/target/discovery.c |  2 +-
> >>>    drivers/nvme/target/nvmet.h     |  2 +-
> >>>    drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-
> ----
> >> --------
> >>>    6 files changed, 70 insertions(+), 18 deletions(-)
> >>
> >> snip..
> >>
> >>>
> >>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> >>> index 52e0c5d..2f0b08e 100644
> >>> --- a/drivers/nvme/target/rdma.c
> >>> +++ b/drivers/nvme/target/rdma.c
> >>> @@ -33,9 +33,10 @@
> >>>    #include "nvmet.h"
> >>>
> >>>    /*
> >>> - * We allow up to a page of inline data to go with the SQE
> >>> + * We allow at least 1 page, and up to 16KB of inline data to go with
> > the
> >> SQE
> >>>     */
> >>> -#define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
> >>> +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE	PAGE_SIZE
> >>> +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int,
> >> SZ_16K, PAGE_SIZE)
> >>
> >> why not use SZ_16K ? why we need to mention the PAGE_SIZE ?
> >>
> >
> > The idea is to allow at least 1 page.  So for, say, a 64K page system,
we'll
> > allow 64K since we're allocating a page minimum for the buffer.
> 
> IMO you want to support upto 16K inline data and not upto 64k (also in
> PowerPC system)...

Why?



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Max Gurtovoy June 5, 2018, 8:52 a.m. UTC | #12

On 6/4/2018 5:18 PM, Steve Wise wrote:
> 
> 
>> -----Original Message-----
>> From: Max Gurtovoy <maxg@mellanox.com>
>> Sent: Monday, June 4, 2018 8:58 AM
>> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
>> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
>> nvme@lists.infradead.org
>> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
>> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
>>
>>
>>
>> On 6/3/2018 9:25 PM, Steve Wise wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Max Gurtovoy <maxg@mellanox.com>
>>>> Sent: Sunday, June 3, 2018 3:40 AM
>>>> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
>>>> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
>>>> nvme@lists.infradead.org
>>>> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
>>>> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
>>>>
>>>>
>>>>
>>>> On 5/29/2018 9:25 PM, Steve Wise wrote:
>>>>> Add a new configfs port attribute, called inline_data_size,
>>>>> to allow configuring the size of inline data for a given port.
>>>>> The maximum size allowed is still enforced by nvmet-rdma with
>>>>> NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to
>> max(16KB,
>>>>> PAGE_SIZE).  And the default size, if not specified via configfs,
>>>>> is still PAGE_SIZE.  This preserves the existing behavior, but allows
>>>>> larger inline sizes.
>>>>>
>>>>> Also support a configuration where inline_data_size is 0, which
> disables
>>>>> using inline data.
>>>>>
>>>>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
>>>>> ---
>>>>>     drivers/nvme/target/admin-cmd.c |  4 ++--
>>>>>     drivers/nvme/target/configfs.c  | 31 ++++++++++++++++++++++++++++
>>>>>     drivers/nvme/target/core.c      |  4 ++++
>>>>>     drivers/nvme/target/discovery.c |  2 +-
>>>>>     drivers/nvme/target/nvmet.h     |  2 +-
>>>>>     drivers/nvme/target/rdma.c      | 45 ++++++++++++++++++++++++++++-
>> ----
>>>> --------
>>>>>     6 files changed, 70 insertions(+), 18 deletions(-)
>>>>
>>>> snip..
>>>>
>>>>>
>>>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>>>> index 52e0c5d..2f0b08e 100644
>>>>> --- a/drivers/nvme/target/rdma.c
>>>>> +++ b/drivers/nvme/target/rdma.c
>>>>> @@ -33,9 +33,10 @@
>>>>>     #include "nvmet.h"
>>>>>
>>>>>     /*
>>>>> - * We allow up to a page of inline data to go with the SQE
>>>>> + * We allow at least 1 page, and up to 16KB of inline data to go with
>>> the
>>>> SQE
>>>>>      */
>>>>> -#define NVMET_RDMA_INLINE_DATA_SIZE	PAGE_SIZE
>>>>> +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE	PAGE_SIZE
>>>>> +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE		max_t(int,
>>>> SZ_16K, PAGE_SIZE)
>>>>
>>>> why not use SZ_16K ? why we need to mention the PAGE_SIZE ?
>>>>
>>>
>>> The idea is to allow at least 1 page.  So for, say, a 64K page system,
> we'll
>>> allow 64K since we're allocating a page minimum for the buffer.
>>
>> IMO you want to support upto 16K inline data and not upto 64k (also in
>> PowerPC system)...
> 
> Why?
> 

Ok we can use PAGE_SIZE from resource and perf perspective, but please 
change the subject since it is confusing. I guess the subject can be 
similar to the one you used in the initiator.

I think we should consider clamping to NVMET_RDMA_MAX_INLINE_DATA_SIZE 
in case port->inline_data_size > NVMET_RDMA_MAX_INLINE_DATA_SIZE instead 
of failing the operation (and add an info print).

-Max.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Steve Wise June 5, 2018, 2:28 p.m. UTC | #13

On 6/5/2018 3:52 AM, Max Gurtovoy wrote:
>
>
> On 6/4/2018 5:18 PM, Steve Wise wrote:
>>
>>
>>> -----Original Message-----
>>> From: Max Gurtovoy <maxg@mellanox.com>
>>> Sent: Monday, June 4, 2018 8:58 AM
>>> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
>>> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
>>> nvme@lists.infradead.org
>>> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
>>> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
>>>
>>>
>>>
>>> On 6/3/2018 9:25 PM, Steve Wise wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Max Gurtovoy <maxg@mellanox.com>
>>>>> Sent: Sunday, June 3, 2018 3:40 AM
>>>>> To: Steve Wise <swise@opengridcomputing.com>; axboe@kernel.dk;
>>>>> hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
>>>>> nvme@lists.infradead.org
>>>>> Cc: parav@mellanox.com; linux-rdma@vger.kernel.org
>>>>> Subject: Re: [PATCH v3 3/3] nvmet-rdma: support 16K inline data
>>>>>
>>>>>
>>>>>
>>>>> On 5/29/2018 9:25 PM, Steve Wise wrote:
>>>>>> Add a new configfs port attribute, called inline_data_size,
>>>>>> to allow configuring the size of inline data for a given port.
>>>>>> The maximum size allowed is still enforced by nvmet-rdma with
>>>>>> NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is increased to
>>> max(16KB,
>>>>>> PAGE_SIZE).  And the default size, if not specified via configfs,
>>>>>> is still PAGE_SIZE.  This preserves the existing behavior, but
>>>>>> allows
>>>>>> larger inline sizes.
>>>>>>
>>>>>> Also support a configuration where inline_data_size is 0, which
>> disables
>>>>>> using inline data.
>>>>>>
>>>>>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
>>>>>> ---
>>>>>>     drivers/nvme/target/admin-cmd.c |  4 ++--
>>>>>>     drivers/nvme/target/configfs.c  | 31
>>>>>> ++++++++++++++++++++++++++++
>>>>>>     drivers/nvme/target/core.c      |  4 ++++
>>>>>>     drivers/nvme/target/discovery.c |  2 +-
>>>>>>     drivers/nvme/target/nvmet.h     |  2 +-
>>>>>>     drivers/nvme/target/rdma.c      | 45
>>>>>> ++++++++++++++++++++++++++++-
>>> ----
>>>>> --------
>>>>>>     6 files changed, 70 insertions(+), 18 deletions(-)
>>>>>
>>>>> snip..
>>>>>
>>>>>>
>>>>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>>>>> index 52e0c5d..2f0b08e 100644
>>>>>> --- a/drivers/nvme/target/rdma.c
>>>>>> +++ b/drivers/nvme/target/rdma.c
>>>>>> @@ -33,9 +33,10 @@
>>>>>>     #include "nvmet.h"
>>>>>>
>>>>>>     /*
>>>>>> - * We allow up to a page of inline data to go with the SQE
>>>>>> + * We allow at least 1 page, and up to 16KB of inline data to go
>>>>>> with
>>>> the
>>>>> SQE
>>>>>>      */
>>>>>> -#define NVMET_RDMA_INLINE_DATA_SIZE    PAGE_SIZE
>>>>>> +#define NVMET_RDMA_DEFAULT_INLINE_DATA_SIZE    PAGE_SIZE
>>>>>> +#define NVMET_RDMA_MAX_INLINE_DATA_SIZE        max_t(int,
>>>>> SZ_16K, PAGE_SIZE)
>>>>>
>>>>> why not use SZ_16K ? why we need to mention the PAGE_SIZE ?
>>>>>
>>>>
>>>> The idea is to allow at least 1 page.  So for, say, a 64K page system,
>> we'll
>>>> allow 64K since we're allocating a page minimum for the buffer.
>>>
>>> IMO you want to support upto 16K inline data and not upto 64k (also in
>>> PowerPC system)...
>>
>> Why?
>>
>
> Ok we can use PAGE_SIZE from resource and perf perspective, but please
> change the subject since it is confusing. I guess the subject can be
> similar to the one you used in the initiator.
>
> I think we should consider clamping to NVMET_RDMA_MAX_INLINE_DATA_SIZE
> in case port->inline_data_size > NVMET_RDMA_MAX_INLINE_DATA_SIZE
> instead of failing the operation (and add an info print).
>

This sounds reasonable.  In my yet-to-be sent new version of this, I
have similar logic if the rdma device doesn't have enough sge entries to
support the configured size.  In that case, the inline data size is
clamped to the max that can be supported by the device.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v3,3/3] nvmet-rdma: support 16K inline data

Commit Message

Comments

Patch