Message ID | 20230303231345.119652-1-ming.lei@redhat.com (mailing list archive)
State      | New, archived
Series     | nvme: fix handling single range discard request
On 3/4/23 00:13, Ming Lei wrote: > When investigating one customer report on warning in nvme_setup_discard, > we observed the controller(nvme/tcp) actually exposes > queue_max_discard_segments(req->q) == 1. > > Obviously the current code can't handle this situation, since contiguity > merge like normal RW request is taken. > > Fix the issue by building range from request sector/nr_sectors directly. > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index c2730b116dc6..d4be525f8100 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > range = page_address(ns->ctrl->discard_page); > } > > - __rq_for_each_bio(bio, req) { > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > - > - if (n < segments) { > - range[n].cattr = cpu_to_le32(0); > - range[n].nlb = cpu_to_le32(nlb); > - range[n].slba = cpu_to_le64(slba); > + if (queue_max_discard_segments(req->q) == 1) { > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > + > + range[0].cattr = cpu_to_le32(0); > + range[0].nlb = cpu_to_le32(nlb); > + range[0].slba = cpu_to_le64(slba); > + n = 1; > + } else { > + __rq_for_each_bio(bio, req) { > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > + > + if (n < segments) { > + range[n].cattr = cpu_to_le32(0); > + range[n].nlb = cpu_to_le32(nlb); > + range[n].slba = cpu_to_le64(slba); > + } > + n++; > } > - n++; > } > > if (WARN_ON_ONCE(n != segments)) { Now _that_ is odd. Looks like 'req' is not formatted according to the 'max_discard_sectors' setting. But if that's the case, then this 'fix' would fail whenever 'max_discard_sectors' < 'max_hw_sectors', right? Shouldn't we rather modify the merge algorithm to check for max_discard_sectors for DISCARD requests, such that we never _have_ mis-matched requests and this patch would be pointless? Cheers, Hannes
On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote: > On 3/4/23 00:13, Ming Lei wrote: > > When investigating one customer report on warning in nvme_setup_discard, > > we observed the controller(nvme/tcp) actually exposes > > queue_max_discard_segments(req->q) == 1. > > > > Obviously the current code can't handle this situation, since contiguity > > merge like normal RW request is taken. > > > > Fix the issue by building range from request sector/nr_sectors directly. > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > index c2730b116dc6..d4be525f8100 100644 > > --- a/drivers/nvme/host/core.c > > +++ b/drivers/nvme/host/core.c > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > > range = page_address(ns->ctrl->discard_page); > > } > > - __rq_for_each_bio(bio, req) { > > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > - > > - if (n < segments) { > > - range[n].cattr = cpu_to_le32(0); > > - range[n].nlb = cpu_to_le32(nlb); > > - range[n].slba = cpu_to_le64(slba); > > + if (queue_max_discard_segments(req->q) == 1) { > > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > > + > > + range[0].cattr = cpu_to_le32(0); > > + range[0].nlb = cpu_to_le32(nlb); > > + range[0].slba = cpu_to_le64(slba); > > + n = 1; > > + } else { > + __rq_for_each_bio(bio, req) { > > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > + > > + if (n < segments) { > > + range[n].cattr = cpu_to_le32(0); > > + range[n].nlb = cpu_to_le32(nlb); > > + range[n].slba = cpu_to_le64(slba); > > + } > > + n++; > > } > > - n++; > > } > > if (WARN_ON_ONCE(n != segments)) { > Now _that_ is odd. > Looks like 'req' is not formatted according to the 'max_discard_sectors' > setting. > But if that's the case, then this 'fix' would fail whenever > 'max_discard_sectors' < 'max_hw_sectors', right? No, it isn't the case. > Shouldn't we rather modify the merge algorithm to check for > max_discard_sectors for DISCARD requests, such that we never _have_ > mis-matched requests and this patch would be pointless? But it is related with discard merge. If queue_max_discard_segments() is 1, block layer merges discard request/bios just like normal RW IO. However, if queue_max_discard_segments() is > 1, block layer simply 'merges' all bios into one request, no matter if the LBA is adjacent or not, and treat each bio as one discard segment, that is called multi range discard too. That is the reason why we have to treat queue_max_discard_segments() == 1 specially, and you can see same handling in virtio_blk. Thanks, Ming
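[Editorial note] For readers unfamiliar with this split, the decision Ming describes lives in the block layer's merge path. Below is a simplified sketch of the gate, paraphrased from block/blk-merge.c and include/linux/blkdev.h rather than copied verbatim; the helper name here is my own, and it roughly corresponds to what the upstream blk_discard_mergable() check does (exact placement varies across kernel versions).

	/*
	 * Paraphrased sketch, not verbatim kernel code: discard bios are only
	 * "merged" as independent segments when the queue advertises more
	 * than one discard segment; otherwise they go through the normal
	 * contiguity-based RW merge.
	 */
	static inline bool discard_is_multi_range(struct request *req)
	{
		return req_op(req) == REQ_OP_DISCARD &&
		       queue_max_discard_segments(req->q) > 1;
	}

So a queue with queue_max_discard_segments() == 1 never takes the multi-range path: its discard requests are built exactly like reads/writes, as one contiguous LBA range possibly made up of several adjacent bios.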
On 3/4/23 11:22, Ming Lei wrote: > On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote: >> On 3/4/23 00:13, Ming Lei wrote: >>> When investigating one customer report on warning in nvme_setup_discard, >>> we observed the controller(nvme/tcp) actually exposes >>> queue_max_discard_segments(req->q) == 1. >>> >>> Obviously the current code can't handle this situation, since contiguity >>> merge like normal RW request is taken. >>> >>> Fix the issue by building range from request sector/nr_sectors directly. >>> >>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests") >>> Signed-off-by: Ming Lei <ming.lei@redhat.com> >>> --- >>> drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- >>> 1 file changed, 19 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index c2730b116dc6..d4be525f8100 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, >>> range = page_address(ns->ctrl->discard_page); >>> } >>> - __rq_for_each_bio(bio, req) { >>> - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>> - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>> - >>> - if (n < segments) { >>> - range[n].cattr = cpu_to_le32(0); >>> - range[n].nlb = cpu_to_le32(nlb); >>> - range[n].slba = cpu_to_le64(slba); >>> + if (queue_max_discard_segments(req->q) == 1) { >>> + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); >>> + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); >>> + >>> + range[0].cattr = cpu_to_le32(0); >>> + range[0].nlb = cpu_to_le32(nlb); >>> + range[0].slba = cpu_to_le64(slba); >>> + n = 1; >>> + } else { > + __rq_for_each_bio(bio, req) { >>> + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>> + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>> + >>> + if (n < segments) { >>> + range[n].cattr = cpu_to_le32(0); >>> + range[n].nlb = cpu_to_le32(nlb); >>> + range[n].slba = cpu_to_le64(slba); >>> + } >>> + n++; >>> } >>> - n++; >>> } >>> if (WARN_ON_ONCE(n != segments)) { >> Now _that_ is odd. >> Looks like 'req' is not formatted according to the 'max_discard_sectors' >> setting. >> But if that's the case, then this 'fix' would fail whenever >> 'max_discard_sectors' < 'max_hw_sectors', right? > > No, it isn't the case. > >> Shouldn't we rather modify the merge algorithm to check for >> max_discard_sectors for DISCARD requests, such that we never _have_ >> mis-matched requests and this patch would be pointless? > > But it is related with discard merge. > > If queue_max_discard_segments() is 1, block layer merges discard > request/bios just like normal RW IO. > > However, if queue_max_discard_segments() is > 1, block layer simply > 'merges' all bios into one request, no matter if the LBA is adjacent > or not, and treat each bio as one discard segment, that is called > multi range discard too. > But wouldn't the number of bios be subject to 'queue_max_discard_segment', too? What guarantees we're not overflowing that for multi-segment discard merge? Cheers, Hannes
On Sat, Mar 04, 2023 at 12:14:30PM +0100, Hannes Reinecke wrote: > On 3/4/23 11:22, Ming Lei wrote: > > On Sat, Mar 04, 2023 at 09:00:28AM +0100, Hannes Reinecke wrote: > > > On 3/4/23 00:13, Ming Lei wrote: > > > > When investigating one customer report on warning in nvme_setup_discard, > > > > we observed the controller(nvme/tcp) actually exposes > > > > queue_max_discard_segments(req->q) == 1. > > > > > > > > Obviously the current code can't handle this situation, since contiguity > > > > merge like normal RW request is taken. > > > > > > > > Fix the issue by building range from request sector/nr_sectors directly. > > > > > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > > > --- > > > > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > > > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > > > index c2730b116dc6..d4be525f8100 100644 > > > > --- a/drivers/nvme/host/core.c > > > > +++ b/drivers/nvme/host/core.c > > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > > > > range = page_address(ns->ctrl->discard_page); > > > > } > > > > - __rq_for_each_bio(bio, req) { > > > > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > - > > > > - if (n < segments) { > > > > - range[n].cattr = cpu_to_le32(0); > > > > - range[n].nlb = cpu_to_le32(nlb); > > > > - range[n].slba = cpu_to_le64(slba); > > > > + if (queue_max_discard_segments(req->q) == 1) { > > > > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > > > > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > > > > + > > > > + range[0].cattr = cpu_to_le32(0); > > > > + range[0].nlb = cpu_to_le32(nlb); > > > > + range[0].slba = cpu_to_le64(slba); > > > > + n = 1; > > > > + } else { > + __rq_for_each_bio(bio, req) { > > > > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > + > > > > + if (n < segments) { > > > > + range[n].cattr = cpu_to_le32(0); > > > > + range[n].nlb = cpu_to_le32(nlb); > > > > + range[n].slba = cpu_to_le64(slba); > > > > + } > > > > + n++; > > > > } > > > > - n++; > > > > } > > > > if (WARN_ON_ONCE(n != segments)) { > > > Now _that_ is odd. > > > Looks like 'req' is not formatted according to the 'max_discard_sectors' > > > setting. > > > But if that's the case, then this 'fix' would fail whenever > > > 'max_discard_sectors' < 'max_hw_sectors', right? > > > > No, it isn't the case. > > > > > Shouldn't we rather modify the merge algorithm to check for > > > max_discard_sectors for DISCARD requests, such that we never _have_ > > > mis-matched requests and this patch would be pointless? > > > > But it is related with discard merge. > > > > If queue_max_discard_segments() is 1, block layer merges discard > > request/bios just like normal RW IO. > > > > However, if queue_max_discard_segments() is > 1, block layer simply > > 'merges' all bios into one request, no matter if the LBA is adjacent > > or not, and treat each bio as one discard segment, that is called > > multi range discard too. > > > But wouldn't the number of bios be subject to 'queue_max_discard_segment', > too? > What guarantees we're not overflowing that for multi-segment discard merge? 
The block layer merge code makes sure that the max discard segment limit is respected; please see req_attempt_discard_merge(), bio_attempt_discard_merge() and blk_recalc_rq_segments(). Thanks, Ming
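[Editorial note] A condensed sketch of the check Ming points at, abridged from bio_attempt_discard_merge() in block/blk-merge.c; the function name below is altered, and error paths, accounting and statistics updates are omitted, so treat it as an approximation rather than the current upstream code.

	/* Abridged sketch: refuse the merge once the request already carries
	 * queue_max_discard_segments() segments, or when adding the bio would
	 * exceed the discard size limit. */
	static bool attempt_discard_merge_sketch(struct request_queue *q,
						 struct request *req, struct bio *bio)
	{
		unsigned short segments = blk_rq_nr_discard_segments(req);

		if (segments >= queue_max_discard_segments(q))
			return false;	/* segment limit reached, start a new request */
		if (blk_rq_sectors(req) + bio_sectors(bio) >
		    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
			return false;	/* would exceed max_discard_sectors */

		/* append: the bio becomes one more independent discard segment */
		req->biotail->bi_next = bio;
		req->biotail = bio;
		req->__data_len += bio->bi_iter.bi_size;
		req->nr_phys_segments = segments + 1;
		return true;
	}

req_attempt_discard_merge() applies the same segment-count check when two requests are merged, which is why a request on the multi-range path can never end up with more bios than the advertised discard segment limit.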
On 3/4/23 01:13, Ming Lei wrote: > When investigating one customer report on warning in nvme_setup_discard, > we observed the controller(nvme/tcp) actually exposes > queue_max_discard_segments(req->q) == 1. > > Obviously the current code can't handle this situation, since contiguity > merge like normal RW request is taken. > > Fix the issue by building range from request sector/nr_sectors directly. > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index c2730b116dc6..d4be525f8100 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > range = page_address(ns->ctrl->discard_page); > } > > - __rq_for_each_bio(bio, req) { > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > - > - if (n < segments) { > - range[n].cattr = cpu_to_le32(0); > - range[n].nlb = cpu_to_le32(nlb); > - range[n].slba = cpu_to_le64(slba); > + if (queue_max_discard_segments(req->q) == 1) { > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > + > + range[0].cattr = cpu_to_le32(0); > + range[0].nlb = cpu_to_le32(nlb); > + range[0].slba = cpu_to_le64(slba); > + n = 1; > + } else { > + __rq_for_each_bio(bio, req) { > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > + > + if (n < segments) { > + range[n].cattr = cpu_to_le32(0); > + range[n].nlb = cpu_to_le32(nlb); > + range[n].slba = cpu_to_le64(slba); > + } > + n++; > } > - n++; > } > > if (WARN_ON_ONCE(n != segments)) { Maybe just set segments to min(blk_rq_nr_discard_segments(req), queue_max_discard_segments(req->q)) and let the existing code do its thing?
On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote: > > > On 3/4/23 01:13, Ming Lei wrote: > > When investigating one customer report on warning in nvme_setup_discard, > > we observed the controller(nvme/tcp) actually exposes > > queue_max_discard_segments(req->q) == 1. > > > > Obviously the current code can't handle this situation, since contiguity > > merge like normal RW request is taken. > > > > Fix the issue by building range from request sector/nr_sectors directly. > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > index c2730b116dc6..d4be525f8100 100644 > > --- a/drivers/nvme/host/core.c > > +++ b/drivers/nvme/host/core.c > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > > range = page_address(ns->ctrl->discard_page); > > } > > - __rq_for_each_bio(bio, req) { > > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > - > > - if (n < segments) { > > - range[n].cattr = cpu_to_le32(0); > > - range[n].nlb = cpu_to_le32(nlb); > > - range[n].slba = cpu_to_le64(slba); > > + if (queue_max_discard_segments(req->q) == 1) { > > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > > + > > + range[0].cattr = cpu_to_le32(0); > > + range[0].nlb = cpu_to_le32(nlb); > > + range[0].slba = cpu_to_le64(slba); > > + n = 1; > > + } else { > > + __rq_for_each_bio(bio, req) { > > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > + > > + if (n < segments) { > > + range[n].cattr = cpu_to_le32(0); > > + range[n].nlb = cpu_to_le32(nlb); > > + range[n].slba = cpu_to_le64(slba); > > + } > > + n++; > > } > > - n++; > > } > > if (WARN_ON_ONCE(n != segments)) { > > > Maybe just set segments to min(blk_rq_nr_discard_segments(req), > queue_max_discard_segments(req->q)) and let the existing code do > its thing? What is the existing code for applying min()? For block layer merge code, it has to cover two kinds of discard merge: - the traditional single range discard for most of devices - multi range discard merge for nvme and virtio-blk which takes same fix For driver side, it has to do similar handling if both single-range and multi-range discard are supported: - each bio is one discard range for multi-range discard - the whole request(may include more than 1 bios) is the single discard range for single-range discard Thanks, Ming
On 3/6/23 23:49, Ming Lei wrote: > On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote: >> >> >> On 3/4/23 01:13, Ming Lei wrote: >>> When investigating one customer report on warning in nvme_setup_discard, >>> we observed the controller(nvme/tcp) actually exposes >>> queue_max_discard_segments(req->q) == 1. >>> >>> Obviously the current code can't handle this situation, since contiguity >>> merge like normal RW request is taken. >>> >>> Fix the issue by building range from request sector/nr_sectors directly. >>> >>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests") >>> Signed-off-by: Ming Lei <ming.lei@redhat.com> >>> --- >>> drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- >>> 1 file changed, 19 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index c2730b116dc6..d4be525f8100 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, >>> range = page_address(ns->ctrl->discard_page); >>> } >>> - __rq_for_each_bio(bio, req) { >>> - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>> - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>> - >>> - if (n < segments) { >>> - range[n].cattr = cpu_to_le32(0); >>> - range[n].nlb = cpu_to_le32(nlb); >>> - range[n].slba = cpu_to_le64(slba); >>> + if (queue_max_discard_segments(req->q) == 1) { >>> + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); >>> + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); >>> + >>> + range[0].cattr = cpu_to_le32(0); >>> + range[0].nlb = cpu_to_le32(nlb); >>> + range[0].slba = cpu_to_le64(slba); >>> + n = 1; >>> + } else { >>> + __rq_for_each_bio(bio, req) { >>> + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>> + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>> + >>> + if (n < segments) { >>> + range[n].cattr = cpu_to_le32(0); >>> + range[n].nlb = cpu_to_le32(nlb); >>> + range[n].slba = cpu_to_le64(slba); >>> + } >>> + n++; >>> } >>> - n++; >>> } >>> if (WARN_ON_ONCE(n != segments)) { >> >> >> Maybe just set segments to min(blk_rq_nr_discard_segments(req), >> queue_max_discard_segments(req->q)) and let the existing code do >> its thing? > > What is the existing code for applying min()? Was referring to this: -- diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3345f866178e..dbc402587431 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, range = page_address(ns->ctrl->discard_page); } + segments = min(segments, queue_max_discard_segments(req->q)); __rq_for_each_bio(bio, req) { u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; --
On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote: > > > On 3/6/23 23:49, Ming Lei wrote: > > On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote: > > > > > > > > > On 3/4/23 01:13, Ming Lei wrote: > > > > When investigating one customer report on warning in nvme_setup_discard, > > > > we observed the controller(nvme/tcp) actually exposes > > > > queue_max_discard_segments(req->q) == 1. > > > > > > > > Obviously the current code can't handle this situation, since contiguity > > > > merge like normal RW request is taken. > > > > > > > > Fix the issue by building range from request sector/nr_sectors directly. > > > > > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > > > --- > > > > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > > > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > > > index c2730b116dc6..d4be525f8100 100644 > > > > --- a/drivers/nvme/host/core.c > > > > +++ b/drivers/nvme/host/core.c > > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > > > > range = page_address(ns->ctrl->discard_page); > > > > } > > > > - __rq_for_each_bio(bio, req) { > > > > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > - > > > > - if (n < segments) { > > > > - range[n].cattr = cpu_to_le32(0); > > > > - range[n].nlb = cpu_to_le32(nlb); > > > > - range[n].slba = cpu_to_le64(slba); > > > > + if (queue_max_discard_segments(req->q) == 1) { > > > > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > > > > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > > > > + > > > > + range[0].cattr = cpu_to_le32(0); > > > > + range[0].nlb = cpu_to_le32(nlb); > > > > + range[0].slba = cpu_to_le64(slba); > > > > + n = 1; > > > > + } else { > > > > + __rq_for_each_bio(bio, req) { > > > > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > + > > > > + if (n < segments) { > > > > + range[n].cattr = cpu_to_le32(0); > > > > + range[n].nlb = cpu_to_le32(nlb); > > > > + range[n].slba = cpu_to_le64(slba); > > > > + } > > > > + n++; > > > > } > > > > - n++; > > > > } > > > > if (WARN_ON_ONCE(n != segments)) { > > > > > > > > > Maybe just set segments to min(blk_rq_nr_discard_segments(req), > > > queue_max_discard_segments(req->q)) and let the existing code do > > > its thing? > > > > What is the existing code for applying min()? > > Was referring to this: > -- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 3345f866178e..dbc402587431 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns > *ns, struct request *req, > range = page_address(ns->ctrl->discard_page); > } > > + segments = min(segments, queue_max_discard_segments(req->q)); That can't work. In case of queue_max_discard_segments(req->q) == 1, the request still can have more than one bios since the normal merge is taken for discard IOs. Thanks, Ming
On 3/7/23 14:14, Ming Lei wrote: > On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote: >> >> >> On 3/6/23 23:49, Ming Lei wrote: >>> On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote: >>>> >>>> >>>> On 3/4/23 01:13, Ming Lei wrote: >>>>> When investigating one customer report on warning in nvme_setup_discard, >>>>> we observed the controller(nvme/tcp) actually exposes >>>>> queue_max_discard_segments(req->q) == 1. >>>>> >>>>> Obviously the current code can't handle this situation, since contiguity >>>>> merge like normal RW request is taken. >>>>> >>>>> Fix the issue by building range from request sector/nr_sectors directly. >>>>> >>>>> Fixes: b35ba01ea697 ("nvme: support ranged discard requests") >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com> >>>>> --- >>>>> drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- >>>>> 1 file changed, 19 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>>>> index c2730b116dc6..d4be525f8100 100644 >>>>> --- a/drivers/nvme/host/core.c >>>>> +++ b/drivers/nvme/host/core.c >>>>> @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, >>>>> range = page_address(ns->ctrl->discard_page); >>>>> } >>>>> - __rq_for_each_bio(bio, req) { >>>>> - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>>>> - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>>>> - >>>>> - if (n < segments) { >>>>> - range[n].cattr = cpu_to_le32(0); >>>>> - range[n].nlb = cpu_to_le32(nlb); >>>>> - range[n].slba = cpu_to_le64(slba); >>>>> + if (queue_max_discard_segments(req->q) == 1) { >>>>> + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); >>>>> + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); >>>>> + >>>>> + range[0].cattr = cpu_to_le32(0); >>>>> + range[0].nlb = cpu_to_le32(nlb); >>>>> + range[0].slba = cpu_to_le64(slba); >>>>> + n = 1; >>>>> + } else { >>>>> + __rq_for_each_bio(bio, req) { >>>>> + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); >>>>> + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; >>>>> + >>>>> + if (n < segments) { >>>>> + range[n].cattr = cpu_to_le32(0); >>>>> + range[n].nlb = cpu_to_le32(nlb); >>>>> + range[n].slba = cpu_to_le64(slba); >>>>> + } >>>>> + n++; >>>>> } >>>>> - n++; >>>>> } >>>>> if (WARN_ON_ONCE(n != segments)) { >>>> >>>> >>>> Maybe just set segments to min(blk_rq_nr_discard_segments(req), >>>> queue_max_discard_segments(req->q)) and let the existing code do >>>> its thing? >>> >>> What is the existing code for applying min()? >> >> Was referring to this: >> -- >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 3345f866178e..dbc402587431 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns >> *ns, struct request *req, >> range = page_address(ns->ctrl->discard_page); >> } >> >> + segments = min(segments, queue_max_discard_segments(req->q)); > > That can't work. > > In case of queue_max_discard_segments(req->q) == 1, the request still > can have more than one bios since the normal merge is taken for discard > IOs. Ah, I see, the bios are contiguous though right? We could add a contiguity check in the loop and conditionally increment n, but maybe that would probably be more complicated...
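[Editorial note] For illustration only, a rough sketch of what that conditional increment might look like inside nvme_setup_discard(). This is hypothetical code, never posted as a patch; it is shown only to make visible the extra state (the previous range's end) and the merge-or-append branch per bio.

	/* Hypothetical contiguity check; not taken from any posted patch. */
	u64 prev_end = 0;

	__rq_for_each_bio(bio, req) {
		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;

		if (n && slba == prev_end && n <= segments) {
			/* adjacent to the previous bio: grow the current range */
			le32_add_cpu(&range[n - 1].nlb, nlb);
		} else {
			if (n < segments) {
				range[n].cattr = cpu_to_le32(0);
				range[n].nlb = cpu_to_le32(nlb);
				range[n].slba = cpu_to_le64(slba);
			}
			n++;
		}
		prev_end = slba + nlb;
	}

Compared with the applied patch, this adds per-bio state and a second write path into range[], which is the extra complexity the thread decides against below.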
On Tue, Mar 07, 2023 at 02:31:48PM +0200, Sagi Grimberg wrote: > > > On 3/7/23 14:14, Ming Lei wrote: > > On Tue, Mar 07, 2023 at 01:39:27PM +0200, Sagi Grimberg wrote: > > > > > > > > > On 3/6/23 23:49, Ming Lei wrote: > > > > On Mon, Mar 06, 2023 at 04:21:08PM +0200, Sagi Grimberg wrote: > > > > > > > > > > > > > > > On 3/4/23 01:13, Ming Lei wrote: > > > > > > When investigating one customer report on warning in nvme_setup_discard, > > > > > > we observed the controller(nvme/tcp) actually exposes > > > > > > queue_max_discard_segments(req->q) == 1. > > > > > > > > > > > > Obviously the current code can't handle this situation, since contiguity > > > > > > merge like normal RW request is taken. > > > > > > > > > > > > Fix the issue by building range from request sector/nr_sectors directly. > > > > > > > > > > > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > > > > > --- > > > > > > drivers/nvme/host/core.c | 28 +++++++++++++++++++--------- > > > > > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > > > > > > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > > > > > index c2730b116dc6..d4be525f8100 100644 > > > > > > --- a/drivers/nvme/host/core.c > > > > > > +++ b/drivers/nvme/host/core.c > > > > > > @@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, > > > > > > range = page_address(ns->ctrl->discard_page); > > > > > > } > > > > > > - __rq_for_each_bio(bio, req) { > > > > > > - u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > > > - u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > > > - > > > > > > - if (n < segments) { > > > > > > - range[n].cattr = cpu_to_le32(0); > > > > > > - range[n].nlb = cpu_to_le32(nlb); > > > > > > - range[n].slba = cpu_to_le64(slba); > > > > > > + if (queue_max_discard_segments(req->q) == 1) { > > > > > > + u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req)); > > > > > > + u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9); > > > > > > + > > > > > > + range[0].cattr = cpu_to_le32(0); > > > > > > + range[0].nlb = cpu_to_le32(nlb); > > > > > > + range[0].slba = cpu_to_le64(slba); > > > > > > + n = 1; > > > > > > + } else { > > > > > > + __rq_for_each_bio(bio, req) { > > > > > > + u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector); > > > > > > + u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift; > > > > > > + > > > > > > + if (n < segments) { > > > > > > + range[n].cattr = cpu_to_le32(0); > > > > > > + range[n].nlb = cpu_to_le32(nlb); > > > > > > + range[n].slba = cpu_to_le64(slba); > > > > > > + } > > > > > > + n++; > > > > > > } > > > > > > - n++; > > > > > > } > > > > > > if (WARN_ON_ONCE(n != segments)) { > > > > > > > > > > > > > > > Maybe just set segments to min(blk_rq_nr_discard_segments(req), > > > > > queue_max_discard_segments(req->q)) and let the existing code do > > > > > its thing? > > > > > > > > What is the existing code for applying min()? > > > > > > Was referring to this: > > > -- > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > > index 3345f866178e..dbc402587431 100644 > > > --- a/drivers/nvme/host/core.c > > > +++ b/drivers/nvme/host/core.c > > > @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns > > > *ns, struct request *req, > > > range = page_address(ns->ctrl->discard_page); > > > } > > > > > > + segments = min(segments, queue_max_discard_segments(req->q)); > > > > That can't work. 
> > > > In case of queue_max_discard_segments(req->q) == 1, the request still > > can have more than one bios since the normal merge is taken for discard > > IOs. > > Ah, I see, the bios are contiguous though right? Yes, the merge is just like normal RW. > We could add a contiguity check in the loop and conditionally > increment n, but maybe that would probably be more complicated... That is more complicated than this patch, and the same pattern has been applied on virtio-blk. Thanks, Ming
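[Editorial note] The virtio-blk pattern Ming refers to looks roughly like the following, abridged from my recollection of virtblk_setup_discard_write_zeroes() in drivers/block/virtio_blk.c; allocation and error handling are omitted and field names may be slightly off, so read it as a sketch of the shape the NVMe patch copies, not as authoritative driver code.

	/* Abridged recollection of the virtio-blk discard setup, illustrating
	 * the single-range vs. multi-range split the NVMe patch mirrors. */
	if (queue_max_discard_segments(req->q) == 1) {
		/* device takes one range per command: describe the whole request */
		range[0].flags = cpu_to_le32(flags);
		range[0].num_sectors = cpu_to_le32(blk_rq_sectors(req));
		range[0].sector = cpu_to_le64(blk_rq_pos(req));
		n = 1;
	} else {
		/* multi-range device: every bio is its own, possibly disjoint, range */
		__rq_for_each_bio(bio, req) {
			range[n].flags = cpu_to_le32(flags);
			range[n].num_sectors = cpu_to_le32(bio->bi_iter.bi_size >> SECTOR_SHIFT);
			range[n].sector = cpu_to_le64(bio->bi_iter.bi_sector);
			n++;
		}
	}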
>>> Was referring to this: >>> -- >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index 3345f866178e..dbc402587431 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct >>> nvme_ns >>> *ns, struct request *req, >>> range = page_address(ns->ctrl->discard_page); >>> } >>> >>> + segments = min(segments, queue_max_discard_segments(req->q)); >> >> That can't work. >> >> In case of queue_max_discard_segments(req->q) == 1, the request still >> can have more than one bios since the normal merge is taken for discard >> IOs. > > Ah, I see, the bios are contiguous though right? > We could add a contiguity check in the loop and conditionally > increment n, but maybe that would probably be more complicated... It would be great if we can avoid the above-mentioned complicated pattern... -ck
On 3/3/2023 3:13 PM, Ming Lei wrote: > When investigating one customer report on warning in nvme_setup_discard, > we observed the controller(nvme/tcp) actually exposes > queue_max_discard_segments(req->q) == 1. > > Obviously the current code can't handle this situation, since contiguity > merge like normal RW request is taken. > > Fix the issue by building range from request sector/nr_sectors directly. > > Fixes: b35ba01ea697 ("nvme: support ranged discard requests") > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
>>>> Was referring to this: >>>> -- >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>>> index 3345f866178e..dbc402587431 100644 >>>> --- a/drivers/nvme/host/core.c >>>> +++ b/drivers/nvme/host/core.c >>>> @@ -781,6 +781,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns >>>> *ns, struct request *req, >>>> range = page_address(ns->ctrl->discard_page); >>>> } >>>> >>>> + segments = min(segments, queue_max_discard_segments(req->q)); >>> >>> That can't work. >>> >>> In case of queue_max_discard_segments(req->q) == 1, the request still >>> can have more than one bios since the normal merge is taken for discard >>> IOs. >> >> Ah, I see, the bios are contiguous though right? > > Yes, the merge is just like normal RW. > >> We could add a contiguity check in the loop and conditionally >> increment n, but maybe that would probably be more complicated... > > That is more complicated than this patch, and the same pattern > has been applied on virtio-blk. > I'd very much prefer keeping the virtio-blk pattern in nvme; it is easier to read and simpler than a conditional increment of n, unless there is a strong reason for not doing that, which I fail to see ... -ck
Thanks, applied to nvme-6.3.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2730b116dc6..d4be525f8100 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -781,16 +781,26 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		range = page_address(ns->ctrl->discard_page);
 	}
 
-	__rq_for_each_bio(bio, req) {
-		u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
-		u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
-
-		if (n < segments) {
-			range[n].cattr = cpu_to_le32(0);
-			range[n].nlb = cpu_to_le32(nlb);
-			range[n].slba = cpu_to_le64(slba);
+	if (queue_max_discard_segments(req->q) == 1) {
+		u64 slba = nvme_sect_to_lba(ns, blk_rq_pos(req));
+		u32 nlb = blk_rq_sectors(req) >> (ns->lba_shift - 9);
+
+		range[0].cattr = cpu_to_le32(0);
+		range[0].nlb = cpu_to_le32(nlb);
+		range[0].slba = cpu_to_le64(slba);
+		n = 1;
+	} else {
+		__rq_for_each_bio(bio, req) {
+			u64 slba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+			u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
+
+			if (n < segments) {
+				range[n].cattr = cpu_to_le32(0);
+				range[n].nlb = cpu_to_le32(nlb);
+				range[n].slba = cpu_to_le64(slba);
+			}
+			n++;
 		}
-		n++;
 	}
 
 	if (WARN_ON_ONCE(n != segments)) {
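[Editorial note] One detail of the new single-range branch worth spelling out: blk_rq_pos() and blk_rq_sectors() are in 512-byte units, so the number of logical blocks is obtained by shifting by (lba_shift - 9). A small worked example with assumed values (not taken from the customer report):

	/* Assumed example: namespace formatted with 4096-byte LBAs (lba_shift = 12),
	 * discard request covering 512 KiB. */
	u32 req_sectors = 1024;                    /* 1024 * 512 B = 512 KiB */
	u32 lba_shift   = 12;                      /* 4 KiB logical blocks   */
	u32 nlb = req_sectors >> (lba_shift - 9);  /* 1024 >> 3 = 128 LBAs   */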
While investigating a customer report of a warning in nvme_setup_discard,
we observed that the controller (nvme/tcp) actually exposes
queue_max_discard_segments(req->q) == 1.

The current code can't handle this situation, since it assumes the
contiguity-based merge used for normal RW requests.

Fix the issue by building the single range from the request's
sector/nr_sectors directly.

Fixes: b35ba01ea697 ("nvme: support ranged discard requests")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)