nvme: fix handling single range discard request

Message ID	20230303231345.119652-1-ming.lei@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@vger.kernel.org> From: Ming Lei <ming.lei@redhat.com> To: Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>, linux-nvme@lists.infradead.org Cc: linux-block@vger.kernel.org, Ming Lei <ming.lei@redhat.com> Subject: [PATCH] nvme: fix handling single range discard request Date: Sat, 4 Mar 2023 07:13:45 +0800 Message-Id: <20230303231345.119652-1-ming.lei@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	nvme: fix handling single range discard request \| expand nvme: fix handling single range discard request

Ming Lei March 3, 2023, 11:13 p.m. UTC

When investigating one customer report on warning in nvme_setup_discard,
we observed the controller(nvme/tcp) actually exposes
queue_max_discard_segments(req->q) == 1.

Obviously the current code can't handle this situation, since contiguity
merge like normal RW request is taken.

Fix the issue by building range from request sector/nr_sectors directly.

Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

Hannes Reinecke March 4, 2023, 8 a.m. UTC | #1

On 3/4/23 00:13, Ming Lei wrote:
> When investigating one customer report on warning in nvme_setup_discard,
> we observed the controller(nvme/tcp) actually exposes
> queue_max_discard_segments(req->q) == 1.
> 
> Obviously the current code can't handle this situation, since contiguity
> merge like normal RW request is taken.
> 
> Fix the issue by building range from request sector/nr_sectors directly.
> 
> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
>   1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c2730b116dc6..d4be525f8100 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>   		range = page_address(ns->ctrl->discard_page);
>   	}
>   
> -	__rq_for_each_bio(bio, req) {
> -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> -
> -		if (n < segments) {
> -			range[n].cattr = cpu_to_le32(0);
> -			range[n].nlb = cpu_to_le32(nlb);
> -			range[n].slba = cpu_to_le64(slba);
> +	if (queue_max_discard_segments(req->q) == 1) {
> +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> +
> +		range[0].cattr = cpu_to_le32(0);
> +		range[0].nlb = cpu_to_le32(nlb);
> +		range[0].slba = cpu_to_le64(slba);
> +		n = 1;
> +	} else { > +		__rq_for_each_bio(bio, req) {
> +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> +
> +			if (n < segments) {
> +				range[n].cattr = cpu_to_le32(0);
> +				range[n].nlb = cpu_to_le32(nlb);
> +				range[n].slba = cpu_to_le64(slba);
> +			}
> +			n++;
>   		}
> -		n++;
>   	}
>   
>   	if (WARN_ON_ONCE(n != segments)) {
Now _that_ is odd.
Looks like 'req' is not formatted according to the 'max_discard_sectors' 
setting.
But if that's the case, then this 'fix' would fail whenever 
'max_discard_sectors' < 'max_hw_sectors', right?
Shouldn't we rather modify the merge algorithm to check for 
max_discard_sectors for DISCARD requests, such that we never _have_
mis-matched requests and this patch would be pointless?

Cheers,

Hannes

Ming Lei March 4, 2023, 10:22 a.m. UTC | #2

On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote:
> On 3/4/23 00:13, Ming Lei wrote:
> > When investigating one customer report on warning in nvme_setup_discard,
> > we observed the controller(nvme/tcp) actually exposes
> > queue_max_discard_segments(req->q) == 1.
> > 
> > Obviously the current code can't handle this situation, since contiguity
> > merge like normal RW request is taken.
> > 
> > Fix the issue by building range from request sector/nr_sectors directly.
> > 
> > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
> >   1 file changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index c2730b116dc6..d4be525f8100 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> >   		range = page_address(ns->ctrl->discard_page);
> >   	}
> > -	__rq_for_each_bio(bio, req) {
> > -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > -
> > -		if (n < segments) {
> > -			range[n].cattr = cpu_to_le32(0);
> > -			range[n].nlb = cpu_to_le32(nlb);
> > -			range[n].slba = cpu_to_le64(slba);
> > +	if (queue_max_discard_segments(req->q) == 1) {
> > +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> > +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> > +
> > +		range[0].cattr = cpu_to_le32(0);
> > +		range[0].nlb = cpu_to_le32(nlb);
> > +		range[0].slba = cpu_to_le64(slba);
> > +		n = 1;
> > +	} else { > +		__rq_for_each_bio(bio, req) {
> > +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > +
> > +			if (n < segments) {
> > +				range[n].cattr = cpu_to_le32(0);
> > +				range[n].nlb = cpu_to_le32(nlb);
> > +				range[n].slba = cpu_to_le64(slba);
> > +			}
> > +			n++;
> >   		}
> > -		n++;
> >   	}
> >   	if (WARN_ON_ONCE(n != segments)) {
> Now _that_ is odd.
> Looks like 'req' is not formatted according to the 'max_discard_sectors'
> setting.
> But if that's the case, then this 'fix' would fail whenever
> 'max_discard_sectors' < 'max_hw_sectors', right?

No, it isn't the case.

> Shouldn't we rather modify the merge algorithm to check for
> max_discard_sectors for DISCARD requests, such that we never _have_
> mis-matched requests and this patch would be pointless?

But it is related with discard merge.

If queue_max_discard_segments() is 1, block layer merges discard
request/bios just like normal RW IO.

However, if queue_max_discard_segments() is > 1, block layer simply
'merges' all bios into one request, no matter if the LBA is adjacent
or not, and treat each bio as one discard segment, that is called
multi range discard too.

That is the reason why we have to treat queue_max_discard_segments() == 1
specially, and you can see same handling in virtio_blk.


Thanks,
Ming

Hannes Reinecke March 4, 2023, 11:14 a.m. UTC | #3

On 3/4/23 11:22, Ming Lei wrote:
> On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote:
>> On 3/4/23 00:13, Ming Lei wrote:
>>> When investigating one customer report on warning in nvme_setup_discard,
>>> we observed the controller(nvme/tcp) actually exposes
>>> queue_max_discard_segments(req->q) == 1.
>>>
>>> Obviously the current code can't handle this situation, since contiguity
>>> merge like normal RW request is taken.
>>>
>>> Fix the issue by building range from request sector/nr_sectors directly.
>>>
>>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
>>>    1 file changed, 19 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index c2730b116dc6..d4be525f8100 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>>>    		range = page_address(ns->ctrl->discard_page);
>>>    	}
>>> -	__rq_for_each_bio(bio, req) {
>>> -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>> -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>> -
>>> -		if (n < segments) {
>>> -			range[n].cattr = cpu_to_le32(0);
>>> -			range[n].nlb = cpu_to_le32(nlb);
>>> -			range[n].slba = cpu_to_le64(slba);
>>> +	if (queue_max_discard_segments(req->q) == 1) {
>>> +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
>>> +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
>>> +
>>> +		range[0].cattr = cpu_to_le32(0);
>>> +		range[0].nlb = cpu_to_le32(nlb);
>>> +		range[0].slba = cpu_to_le64(slba);
>>> +		n = 1;
>>> +	} else { > +		__rq_for_each_bio(bio, req) {
>>> +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>> +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>> +
>>> +			if (n < segments) {
>>> +				range[n].cattr = cpu_to_le32(0);
>>> +				range[n].nlb = cpu_to_le32(nlb);
>>> +				range[n].slba = cpu_to_le64(slba);
>>> +			}
>>> +			n++;
>>>    		}
>>> -		n++;
>>>    	}
>>>    	if (WARN_ON_ONCE(n != segments)) {
>> Now _that_ is odd.
>> Looks like 'req' is not formatted according to the 'max_discard_sectors'
>> setting.
>> But if that's the case, then this 'fix' would fail whenever
>> 'max_discard_sectors' < 'max_hw_sectors', right?
> 
> No, it isn't the case.
> 
>> Shouldn't we rather modify the merge algorithm to check for
>> max_discard_sectors for DISCARD requests, such that we never _have_
>> mis-matched requests and this patch would be pointless?
> 
> But it is related with discard merge.
> 
> If queue_max_discard_segments() is 1, block layer merges discard
> request/bios just like normal RW IO.
> 
> However, if queue_max_discard_segments() is > 1, block layer simply
> 'merges' all bios into one request, no matter if the LBA is adjacent
> or not, and treat each bio as one discard segment, that is called
> multi range discard too.
> 
But wouldn't the number of bios be subject to 
'queue_max_discard_segment', too?
What guarantees we're not overflowing that for multi-segment discard merge?

Cheers,

Hannes

Ming Lei March 4, 2023, 12:02 p.m. UTC | #4

On Sat, Mar 04, 2023 at 12:14:30PM +0100, Hannes Reinecke wrote:
> On 3/4/23 11:22, Ming Lei wrote:
> > On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote:
> > > On 3/4/23 00:13, Ming Lei wrote:
> > > > When investigating one customer report on warning in nvme_setup_discard,
> > > > we observed the controller(nvme/tcp) actually exposes
> > > > queue_max_discard_segments(req->q) == 1.
> > > > 
> > > > Obviously the current code can't handle this situation, since contiguity
> > > > merge like normal RW request is taken.
> > > > 
> > > > Fix the issue by building range from request sector/nr_sectors directly.
> > > > 
> > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >    drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
> > > >    1 file changed, 19 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index c2730b116dc6..d4be525f8100 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> > > >    		range = page_address(ns->ctrl->discard_page);
> > > >    	}
> > > > -	__rq_for_each_bio(bio, req) {
> > > > -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > -
> > > > -		if (n < segments) {
> > > > -			range[n].cattr = cpu_to_le32(0);
> > > > -			range[n].nlb = cpu_to_le32(nlb);
> > > > -			range[n].slba = cpu_to_le64(slba);
> > > > +	if (queue_max_discard_segments(req->q) == 1) {
> > > > +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> > > > +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> > > > +
> > > > +		range[0].cattr = cpu_to_le32(0);
> > > > +		range[0].nlb = cpu_to_le32(nlb);
> > > > +		range[0].slba = cpu_to_le64(slba);
> > > > +		n = 1;
> > > > +	} else { > +		__rq_for_each_bio(bio, req) {
> > > > +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > +
> > > > +			if (n < segments) {
> > > > +				range[n].cattr = cpu_to_le32(0);
> > > > +				range[n].nlb = cpu_to_le32(nlb);
> > > > +				range[n].slba = cpu_to_le64(slba);
> > > > +			}
> > > > +			n++;
> > > >    		}
> > > > -		n++;
> > > >    	}
> > > >    	if (WARN_ON_ONCE(n != segments)) {
> > > Now _that_ is odd.
> > > Looks like 'req' is not formatted according to the 'max_discard_sectors'
> > > setting.
> > > But if that's the case, then this 'fix' would fail whenever
> > > 'max_discard_sectors' < 'max_hw_sectors', right?
> > 
> > No, it isn't the case.
> > 
> > > Shouldn't we rather modify the merge algorithm to check for
> > > max_discard_sectors for DISCARD requests, such that we never _have_
> > > mis-matched requests and this patch would be pointless?
> > 
> > But it is related with discard merge.
> > 
> > If queue_max_discard_segments() is 1, block layer merges discard
> > request/bios just like normal RW IO.
> > 
> > However, if queue_max_discard_segments() is > 1, block layer simply
> > 'merges' all bios into one request, no matter if the LBA is adjacent
> > or not, and treat each bio as one discard segment, that is called
> > multi range discard too.
> > 
> But wouldn't the number of bios be subject to 'queue_max_discard_segment',
> too?
> What guarantees we're not overflowing that for multi-segment discard merge?

block layer merge code makes sure that the max discard segment limit
is respected, please see:

	req_attempt_discard_merge()
	bio_attempt_discard_merge()
	blk_recalc_rq_segments()

Thanks,
Ming

Sagi Grimberg March 6, 2023, 2:21 p.m. UTC | #5

On 3/4/23 01:13, Ming Lei wrote:
> When investigating one customer report on warning in nvme_setup_discard,
> we observed the controller(nvme/tcp) actually exposes
> queue_max_discard_segments(req->q) == 1.
> 
> Obviously the current code can't handle this situation, since contiguity
> merge like normal RW request is taken.
> 
> Fix the issue by building range from request sector/nr_sectors directly.
> 
> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
>   1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c2730b116dc6..d4be525f8100 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>   		range = page_address(ns->ctrl->discard_page);
>   	}
>   
> -	__rq_for_each_bio(bio, req) {
> -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> -
> -		if (n < segments) {
> -			range[n].cattr = cpu_to_le32(0);
> -			range[n].nlb = cpu_to_le32(nlb);
> -			range[n].slba = cpu_to_le64(slba);
> +	if (queue_max_discard_segments(req->q) == 1) {
> +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> +
> +		range[0].cattr = cpu_to_le32(0);
> +		range[0].nlb = cpu_to_le32(nlb);
> +		range[0].slba = cpu_to_le64(slba);
> +		n = 1;
> +	} else {
> +		__rq_for_each_bio(bio, req) {
> +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> +
> +			if (n < segments) {
> +				range[n].cattr = cpu_to_le32(0);
> +				range[n].nlb = cpu_to_le32(nlb);
> +				range[n].slba = cpu_to_le64(slba);
> +			}
> +			n++;
>   		}
> -		n++;
>   	}
>   
>   	if (WARN_ON_ONCE(n != segments)) {


Maybe just set segments to min(blk_rq_nr_discard_segments(req), 
queue_max_discard_segments(req->q)) and let the existing code do
its thing?

Ming Lei March 6, 2023, 9:49 p.m. UTC | #6

On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote:
> 
> 
> On 3/4/23 01:13, Ming Lei wrote:
> > When investigating one customer report on warning in nvme_setup_discard,
> > we observed the controller(nvme/tcp) actually exposes
> > queue_max_discard_segments(req->q) == 1.
> > 
> > Obviously the current code can't handle this situation, since contiguity
> > merge like normal RW request is taken.
> > 
> > Fix the issue by building range from request sector/nr_sectors directly.
> > 
> > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
> >   1 file changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index c2730b116dc6..d4be525f8100 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> >   		range = page_address(ns->ctrl->discard_page);
> >   	}
> > -	__rq_for_each_bio(bio, req) {
> > -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > -
> > -		if (n < segments) {
> > -			range[n].cattr = cpu_to_le32(0);
> > -			range[n].nlb = cpu_to_le32(nlb);
> > -			range[n].slba = cpu_to_le64(slba);
> > +	if (queue_max_discard_segments(req->q) == 1) {
> > +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> > +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> > +
> > +		range[0].cattr = cpu_to_le32(0);
> > +		range[0].nlb = cpu_to_le32(nlb);
> > +		range[0].slba = cpu_to_le64(slba);
> > +		n = 1;
> > +	} else {
> > +		__rq_for_each_bio(bio, req) {
> > +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > +
> > +			if (n < segments) {
> > +				range[n].cattr = cpu_to_le32(0);
> > +				range[n].nlb = cpu_to_le32(nlb);
> > +				range[n].slba = cpu_to_le64(slba);
> > +			}
> > +			n++;
> >   		}
> > -		n++;
> >   	}
> >   	if (WARN_ON_ONCE(n != segments)) {
> 
> 
> Maybe just set segments to min(blk_rq_nr_discard_segments(req),
> queue_max_discard_segments(req->q)) and let the existing code do
> its thing?

What is the existing code for applying min()?

For block layer merge code, it has to cover two kinds of discard merge:

- the traditional single range discard for most of devices
- multi range discard merge for nvme and virtio-blk which takes same
fix

For driver side, it has to do similar handling if both single-range and
multi-range discard are supported:

- each bio is one discard range for multi-range discard
- the whole request(may include more than 1 bios) is the single discard
range for single-range discard


Thanks, 
Ming

Sagi Grimberg March 7, 2023, 11:39 a.m. UTC | #7

On 3/6/23 23:49, Ming Lei wrote:
> On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote:
>>
>>
>> On 3/4/23 01:13, Ming Lei wrote:
>>> When investigating one customer report on warning in nvme_setup_discard,
>>> we observed the controller(nvme/tcp) actually exposes
>>> queue_max_discard_segments(req->q) == 1.
>>>
>>> Obviously the current code can't handle this situation, since contiguity
>>> merge like normal RW request is taken.
>>>
>>> Fix the issue by building range from request sector/nr_sectors directly.
>>>
>>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
>>>    1 file changed, 19 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index c2730b116dc6..d4be525f8100 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>>>    		range = page_address(ns->ctrl->discard_page);
>>>    	}
>>> -	__rq_for_each_bio(bio, req) {
>>> -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>> -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>> -
>>> -		if (n < segments) {
>>> -			range[n].cattr = cpu_to_le32(0);
>>> -			range[n].nlb = cpu_to_le32(nlb);
>>> -			range[n].slba = cpu_to_le64(slba);
>>> +	if (queue_max_discard_segments(req->q) == 1) {
>>> +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
>>> +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
>>> +
>>> +		range[0].cattr = cpu_to_le32(0);
>>> +		range[0].nlb = cpu_to_le32(nlb);
>>> +		range[0].slba = cpu_to_le64(slba);
>>> +		n = 1;
>>> +	} else {
>>> +		__rq_for_each_bio(bio, req) {
>>> +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>> +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>> +
>>> +			if (n < segments) {
>>> +				range[n].cattr = cpu_to_le32(0);
>>> +				range[n].nlb = cpu_to_le32(nlb);
>>> +				range[n].slba = cpu_to_le64(slba);
>>> +			}
>>> +			n++;
>>>    		}
>>> -		n++;
>>>    	}
>>>    	if (WARN_ON_ONCE(n != segments)) {
>>
>>
>> Maybe just set segments to min(blk_rq_nr_discard_segments(req),
>> queue_max_discard_segments(req->q)) and let the existing code do
>> its thing?
> 
> What is the existing code for applying min()?

Was referring to this:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3345f866178e..dbc402587431 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct 
nvme_ns *ns, struct request *req,
                 range = page_address(ns->ctrl->discard_page);
         }

+       segments = min(segments, queue_max_discard_segments(req->q));
         __rq_for_each_bio(bio, req) {
                 u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
                 u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
--

Ming Lei March 7, 2023, 12:14 p.m. UTC | #8

On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote:
> 
> 
> On 3/6/23 23:49, Ming Lei wrote:
> > On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 3/4/23 01:13, Ming Lei wrote:
> > > > When investigating one customer report on warning in nvme_setup_discard,
> > > > we observed the controller(nvme/tcp) actually exposes
> > > > queue_max_discard_segments(req->q) == 1.
> > > > 
> > > > Obviously the current code can't handle this situation, since contiguity
> > > > merge like normal RW request is taken.
> > > > 
> > > > Fix the issue by building range from request sector/nr_sectors directly.
> > > > 
> > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >    drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
> > > >    1 file changed, 19 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index c2730b116dc6..d4be525f8100 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> > > >    		range = page_address(ns->ctrl->discard_page);
> > > >    	}
> > > > -	__rq_for_each_bio(bio, req) {
> > > > -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > -
> > > > -		if (n < segments) {
> > > > -			range[n].cattr = cpu_to_le32(0);
> > > > -			range[n].nlb = cpu_to_le32(nlb);
> > > > -			range[n].slba = cpu_to_le64(slba);
> > > > +	if (queue_max_discard_segments(req->q) == 1) {
> > > > +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> > > > +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> > > > +
> > > > +		range[0].cattr = cpu_to_le32(0);
> > > > +		range[0].nlb = cpu_to_le32(nlb);
> > > > +		range[0].slba = cpu_to_le64(slba);
> > > > +		n = 1;
> > > > +	} else {
> > > > +		__rq_for_each_bio(bio, req) {
> > > > +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > +
> > > > +			if (n < segments) {
> > > > +				range[n].cattr = cpu_to_le32(0);
> > > > +				range[n].nlb = cpu_to_le32(nlb);
> > > > +				range[n].slba = cpu_to_le64(slba);
> > > > +			}
> > > > +			n++;
> > > >    		}
> > > > -		n++;
> > > >    	}
> > > >    	if (WARN_ON_ONCE(n != segments)) {
> > > 
> > > 
> > > Maybe just set segments to min(blk_rq_nr_discard_segments(req),
> > > queue_max_discard_segments(req->q)) and let the existing code do
> > > its thing?
> > 
> > What is the existing code for applying min()?
> 
> Was referring to this:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3345f866178e..dbc402587431 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> *ns, struct request *req,
>                 range = page_address(ns->ctrl->discard_page);
>         }
> 
> +       segments = min(segments, queue_max_discard_segments(req->q));

That can't work.

In case of queue_max_discard_segments(req->q) == 1, the request still
can have more than one bios since the normal merge is taken for discard
IOs.


Thanks,
Ming

Sagi Grimberg March 7, 2023, 12:31 p.m. UTC | #9

On 3/7/23 14:14, Ming Lei wrote:
> On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote:
>>
>>
>> On 3/6/23 23:49, Ming Lei wrote:
>>> On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 3/4/23 01:13, Ming Lei wrote:
>>>>> When investigating one customer report on warning in nvme_setup_discard,
>>>>> we observed the controller(nvme/tcp) actually exposes
>>>>> queue_max_discard_segments(req->q) == 1.
>>>>>
>>>>> Obviously the current code can't handle this situation, since contiguity
>>>>> merge like normal RW request is taken.
>>>>>
>>>>> Fix the issue by building range from request sector/nr_sectors directly.
>>>>>
>>>>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>>     drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
>>>>>     1 file changed, 19 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index c2730b116dc6..d4be525f8100 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>>>>>     		range = page_address(ns->ctrl->discard_page);
>>>>>     	}
>>>>> -	__rq_for_each_bio(bio, req) {
>>>>> -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>>>> -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>>>> -
>>>>> -		if (n < segments) {
>>>>> -			range[n].cattr = cpu_to_le32(0);
>>>>> -			range[n].nlb = cpu_to_le32(nlb);
>>>>> -			range[n].slba = cpu_to_le64(slba);
>>>>> +	if (queue_max_discard_segments(req->q) == 1) {
>>>>> +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
>>>>> +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
>>>>> +
>>>>> +		range[0].cattr = cpu_to_le32(0);
>>>>> +		range[0].nlb = cpu_to_le32(nlb);
>>>>> +		range[0].slba = cpu_to_le64(slba);
>>>>> +		n = 1;
>>>>> +	} else {
>>>>> +		__rq_for_each_bio(bio, req) {
>>>>> +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
>>>>> +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
>>>>> +
>>>>> +			if (n < segments) {
>>>>> +				range[n].cattr = cpu_to_le32(0);
>>>>> +				range[n].nlb = cpu_to_le32(nlb);
>>>>> +				range[n].slba = cpu_to_le64(slba);
>>>>> +			}
>>>>> +			n++;
>>>>>     		}
>>>>> -		n++;
>>>>>     	}
>>>>>     	if (WARN_ON_ONCE(n != segments)) {
>>>>
>>>>
>>>> Maybe just set segments to min(blk_rq_nr_discard_segments(req),
>>>> queue_max_discard_segments(req->q)) and let the existing code do
>>>> its thing?
>>>
>>> What is the existing code for applying min()?
>>
>> Was referring to this:
>> --
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 3345f866178e..dbc402587431 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
>> *ns, struct request *req,
>>                  range = page_address(ns->ctrl->discard_page);
>>          }
>>
>> +       segments = min(segments, queue_max_discard_segments(req->q));
> 
> That can't work.
> 
> In case of queue_max_discard_segments(req->q) == 1, the request still
> can have more than one bios since the normal merge is taken for discard
> IOs.

Ah, I see, the bios are contiguous though right?
We could add a contiguity check in the loop and conditionally
increment n, but maybe that would probably be more complicated...

Ming Lei March 7, 2023, 2:24 p.m. UTC | #10

On Tue, Mar 07, 2023 at 02:31:48PM +0200, Sagi Grimberg wrote:
> 
> 
> On 3/7/23 14:14, Ming Lei wrote:
> > On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 3/6/23 23:49, Ming Lei wrote:
> > > > On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote:
> > > > > 
> > > > > 
> > > > > On 3/4/23 01:13, Ming Lei wrote:
> > > > > > When investigating one customer report on warning in nvme_setup_discard,
> > > > > > we observed the controller(nvme/tcp) actually exposes
> > > > > > queue_max_discard_segments(req->q) == 1.
> > > > > > 
> > > > > > Obviously the current code can't handle this situation, since contiguity
> > > > > > merge like normal RW request is taken.
> > > > > > 
> > > > > > Fix the issue by building range from request sector/nr_sectors directly.
> > > > > > 
> > > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > >     drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
> > > > > >     1 file changed, 19 insertions(+), 9 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > > > index c2730b116dc6..d4be525f8100 100644
> > > > > > --- a/drivers/nvme/host/core.c
> > > > > > +++ b/drivers/nvme/host/core.c
> > > > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> > > > > >     		range = page_address(ns->ctrl->discard_page);
> > > > > >     	}
> > > > > > -	__rq_for_each_bio(bio, req) {
> > > > > > -		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > > > -		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > > > -
> > > > > > -		if (n < segments) {
> > > > > > -			range[n].cattr = cpu_to_le32(0);
> > > > > > -			range[n].nlb = cpu_to_le32(nlb);
> > > > > > -			range[n].slba = cpu_to_le64(slba);
> > > > > > +	if (queue_max_discard_segments(req->q) == 1) {
> > > > > > +		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
> > > > > > +		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
> > > > > > +
> > > > > > +		range[0].cattr = cpu_to_le32(0);
> > > > > > +		range[0].nlb = cpu_to_le32(nlb);
> > > > > > +		range[0].slba = cpu_to_le64(slba);
> > > > > > +		n = 1;
> > > > > > +	} else {
> > > > > > +		__rq_for_each_bio(bio, req) {
> > > > > > +			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
> > > > > > +			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
> > > > > > +
> > > > > > +			if (n < segments) {
> > > > > > +				range[n].cattr = cpu_to_le32(0);
> > > > > > +				range[n].nlb = cpu_to_le32(nlb);
> > > > > > +				range[n].slba = cpu_to_le64(slba);
> > > > > > +			}
> > > > > > +			n++;
> > > > > >     		}
> > > > > > -		n++;
> > > > > >     	}
> > > > > >     	if (WARN_ON_ONCE(n != segments)) {
> > > > > 
> > > > > 
> > > > > Maybe just set segments to min(blk_rq_nr_discard_segments(req),
> > > > > queue_max_discard_segments(req->q)) and let the existing code do
> > > > > its thing?
> > > > 
> > > > What is the existing code for applying min()?
> > > 
> > > Was referring to this:
> > > --
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 3345f866178e..dbc402587431 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
> > > *ns, struct request *req,
> > >                  range = page_address(ns->ctrl->discard_page);
> > >          }
> > > 
> > > +       segments = min(segments, queue_max_discard_segments(req->q));
> > 
> > That can't work.
> > 
> > In case of queue_max_discard_segments(req->q) == 1, the request still
> > can have more than one bios since the normal merge is taken for discard
> > IOs.
> 
> Ah, I see, the bios are contiguous though right?

Yes, the merge is just like normal RW.

> We could add a contiguity check in the loop and conditionally
> increment n, but maybe that would probably be more complicated...

That is more complicated than this patch, and the same pattern
has been applied on virtio-blk.


Thanks,
Ming

Chaitanya Kulkarni March 8, 2023, 5:36 a.m. UTC | #11

>>> Was referring to this:
>>> -- 
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 3345f866178e..dbc402587431 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct 
>>> nvme_ns
>>> *ns, struct request *req,
>>>                  range = page_address(ns->ctrl->discard_page);
>>>          }
>>>
>>> +       segments = min(segments, queue_max_discard_segments(req->q));
>>
>> That can't work.
>>
>> In case of queue_max_discard_segments(req->q) == 1, the request still
>> can have more than one bios since the normal merge is taken for discard
>> IOs.
> 
> Ah, I see, the bios are contiguous though right?
> We could add a contiguity check in the loop and conditionally
> increment n, but maybe that would probably be more complicated...

It will be great if we can avoid above mentioned complicated pattern...

-ck

Chaitanya Kulkarni March 8, 2023, 5:38 a.m. UTC | #12

On 3/3/2023 3:13 PM, Ming Lei wrote:
> When investigating one customer report on warning in nvme_setup_discard,
> we observed the controller(nvme/tcp) actually exposes
> queue_max_discard_segments(req->q) == 1.
> 
> Obviously the current code can't handle this situation, since contiguity
> merge like normal RW request is taken.
> 
> Fix the issue by building range from request sector/nr_sectors directly.
> 
> Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

Chaitanya Kulkarni March 8, 2023, 5:42 a.m. UTC | #13

>>>> Was referring to this:
>>>> --
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 3345f866178e..dbc402587431 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns
>>>> *ns, struct request *req,
>>>>                   range = page_address(ns->ctrl->discard_page);
>>>>           }
>>>>
>>>> +       segments = min(segments, queue_max_discard_segments(req->q));
>>>
>>> That can't work.
>>>
>>> In case of queue_max_discard_segments(req->q) == 1, the request still
>>> can have more than one bios since the normal merge is taken for discard
>>> IOs.
>>
>> Ah, I see, the bios are contiguous though right?
> 
> Yes, the merge is just like normal RW.
> 
>> We could add a contiguity check in the loop and conditionally
>> increment n, but maybe that would probably be more complicated...
> 
> That is more complicated than this patch, and the same pattern
> has been applied on virtio-blk.
>

I'd very much keep the pattern in virio-blk in nvme that is easier
to read and simple than conditional increment of the n, unless there
is a strong reason for not doing that, which I failed to
understand ...

-ck

Christoph Hellwig March 9, 2023, 9:41 a.m. UTC | #14

Thanks,

applied to nvme-6.3.

nvme: fix handling single range discard request

Commit Message

Comments

Patch