Message ID: CACVXFVNQQNd8_3ti4AAhZB4KN0uJEN8sdmB+xsmpykJaCLfebA@mail.gmail.com (mailing list archive)
On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
> Firstly we didn't split one single bio vector before bio splitting.
>
> Secondly, current bio split still doesn't support splitting one single
> bvec into two; it just makes the two bios share the original bvec
> table. Please see bio_split(), which calls bio_clone_fast() to do
> that, and the bvec table is immutable at that time.

You are saying we can't split a bio in the middle of a vector?
bvec_iter_advance() says we can split anywhere.

> I understand your motivation in the two patches: before bio splitting,
> we didn't do sg merge for nvme because of the NO_SG_MERGE flag, which
> has been ignored since bio splitting was introduced. So could you
> check whether nvme performance can be recovered by putting NO_SG_MERGE
> back in blk_bio_segment_split()? The change should be simple, like the
> attached patch.

The nvme driver is okay with taking physically merged pages. It splits
them into PRPs accordingly, and it's faster to DMA map physically
contiguous memory just because there are fewer segments to iterate, so
NVMe would prefer to let them coalesce.

> IMO, splitting is quite cheap,

Splitting is absolutely not cheap if your media is fast enough. I
measure every % of increased CPU instructions as the same % decrease in
IOPS with 3D XPoint.

But this patch was to split on stripe boundaries, which is an even
worse penalty if we get the split wrong.

> +	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
>
>  	bio_for_each_segment(bv, bio, iter) {
> -		if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
> +		if (no_sg_merge)
> +			goto new_segment;

Bad idea for NVMe. We need to split on SG gaps, which you've skipped,
or you will hit a BUG_ON in the NVMe driver.
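For readers following the thread: NVMe's PRP list requires every middle segment to start and end on the device's virtual boundary (the page size), so the block layer must split a bio wherever adjacent bvecs leave a gap against that boundary. A minimal standalone sketch of that check, modeled loosely on the kernel's bvec_gap_to_prev() with an assumed 4K boundary mask (the real mask comes from queue_virt_boundary()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 4K virtual boundary for illustration; the kernel reads the
 * real value from queue_virt_boundary(q). */
#define VIRT_BOUNDARY_MASK 0xfffUL

struct bvec {
	uint32_t bv_offset;
	uint32_t bv_len;
};

/* Two adjacent bvecs leave an "SG gap" unless the first ends exactly
 * on the boundary and the second starts exactly on it. */
static bool sg_gap(const struct bvec *prev, const struct bvec *next)
{
	return ((prev->bv_offset + prev->bv_len) | next->bv_offset) &
	       VIRT_BOUNDARY_MASK;
}
```

The no_sg_merge shortcut in the quoted hunk jumps straight to new_segment without ever running a check like this, which is why Keith predicts a BUG_ON in the NVMe driver.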
On Wed, Jan 6, 2016 at 1:51 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
>> Firstly we didn't split one single bio vector before bio splitting.
>>
>> Secondly, current bio split still doesn't support splitting one single
>> bvec into two; it just makes the two bios share the original bvec
>> table. Please see bio_split(), which calls bio_clone_fast() to do
>> that, and the bvec table is immutable at that time.
>
> You are saying we can't split a bio in the middle of a vector?

I mean the current block stack may not be ready for that.

> bvec_iter_advance() says we can split anywhere.

bvec_iter_advance() can split anywhere, but the split bios may cross
the same bvec, which may cause trouble; for example,
BIOVEC_PHYS_MERGEABLE() may not work well, and
blk_phys_contig_segment() too.

Not to mention your patch doesn't update
front_seg_size/back_seg_size/seg_size correctly.

That is why I said we should be careful about splitting a bvec: it has
never been used or supported before.

>> I understand your motivation in the two patches: before bio splitting,
>> we didn't do sg merge for nvme because of the NO_SG_MERGE flag, which
>> has been ignored since bio splitting was introduced. So could you
>> check whether nvme performance can be recovered by putting NO_SG_MERGE
>> back in blk_bio_segment_split()? The change should be simple, like the
>> attached patch.
>
> The nvme driver is okay with taking physically merged pages. It splits
> them into PRPs accordingly, and it's faster to DMA map physically
> contiguous memory just because there are fewer segments to iterate, so
> NVMe would prefer to let them coalesce.

If so, NO_SG_MERGE should be removed now.

>> IMO, splitting is quite cheap,
>
> Splitting is absolutely not cheap if your media is fast enough. I
> measure every % of increased CPU instructions as the same % decrease in
> IOPS with 3D XPoint.

Could you share your test case? Last time I used null_blk to observe
the performance difference between NO_SG_MERGE and non-NO_SG_MERGE, and
basically no difference was observed.

> But this patch was to split on stripe boundaries, which is an even
> worse penalty if we get the split wrong.
>
>> +	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
>>
>>  	bio_for_each_segment(bv, bio, iter) {
>> -		if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
>> +		if (no_sg_merge)
>> +			goto new_segment;
>
> Bad idea for NVMe. We need to split on SG gaps, which you've skipped,
> or you will hit a BUG_ON in the NVMe driver.

I don't object to the idea of splitting on SG gaps; I just mean we
should make sure the implementation is correct.

Thanks,
Ming Lei
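Ming's "shared bvec table" point is easiest to see in miniature: bio_split() clones the parent via bio_clone_fast(), so both halves point at the same bvec array and differ only in their iterators. A standalone mock-up (field names mirror the kernel's; this is an illustration, not kernel code):

```c
#include <stdio.h>
#include <stdint.h>

struct bvec { uint32_t bv_len; };
struct bvec_iter { uint32_t bi_size, bi_idx, bi_bvec_done; };
struct bio { struct bvec *bi_io_vec; struct bvec_iter bi_iter; };

int main(void)
{
	struct bvec table[] = { { 4096 }, { 4096 } };
	struct bio orig = { table, { 8192, 0, 0 } };

	/* "Split" at 6K, i.e. in the middle of the second bvec: both
	 * halves keep pointing at the same immutable table; only the
	 * iterators record where each half starts and how much remains. */
	struct bio front = orig, back = orig;
	front.bi_iter.bi_size = 6144;
	back.bi_iter = (struct bvec_iter){ 2048, 1, 2048 };

	printf("tables shared: %d\n", front.bi_io_vec == back.bi_io_vec);
	return 0;
}
```

The question in the rest of the thread is whether code that walks `back` can cope with starting 2048 bytes into a bvec.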
On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Wed, Jan 6, 2016 at 1:51 PM, Keith Busch <keith.busch@intel.com> wrote:
>> On Wed, Jan 06, 2016 at 10:17:51AM +0800, Ming Lei wrote:
>>> Firstly we didn't split one single bio vector before bio splitting.
>>>
>>> Secondly, current bio split still doesn't support splitting one single
>>> bvec into two; it just makes the two bios share the original bvec
>>> table. Please see bio_split(), which calls bio_clone_fast() to do
>>> that, and the bvec table is immutable at that time.
>>
>> You are saying we can't split a bio in the middle of a vector?
>
> I mean the current block stack may not be ready for that.
>
>> bvec_iter_advance() says we can split anywhere.
>
> bvec_iter_advance() can split anywhere, but the split bios may cross
> the same bvec, which may cause trouble; for example,
> BIOVEC_PHYS_MERGEABLE() may not work well, and
> blk_phys_contig_segment() too.

blk_rq_map_sg() isn't ready for splitting in the middle of a bvec
either.

Thanks,
On Wed, Jan 06, 2016 at 02:53:22PM +0800, Ming Lei wrote:
> On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> >
> > bvec_iter_advance() can split anywhere, but the split bios may cross
> > the same bvec, which may cause trouble,

Cross? One ends where the other starts. Could you please explain what's
wrong?

> > for example, BIOVEC_PHYS_MERGEABLE() may not work well, and
> > blk_phys_contig_segment() too.

Could you please explain why they may not work well? I have no idea
what concern you have with BIOVEC_PHYS_MERGEABLE in this case.

The only place blk_phys_contig_segment is called from is
ll_merge_requests, and the first req of the split bio wouldn't even go
through here because it's marked REQ_NOMERGE.

> blk_rq_map_sg() isn't ready for splitting in the middle of a bvec
> either.

Could you please explain? It just uses the bio iterator, which I'm
pretty sure is not broken.
On Wed, Jan 06, 2016 at 02:29:16PM +0800, Ming Lei wrote:
> Not to mention your patch doesn't update
> front_seg_size/back_seg_size/seg_size correctly.
>
> That is why I said we should be careful about splitting a bvec: it has
> never been used or supported before.

Ok, that's an easy fix, but it's not really a functional difference.
The request is not mergeable after this, which is the only use for back
and front seg sizes.

> > The nvme driver is okay with taking physically merged pages. It splits
> > them into PRPs accordingly, and it's faster to DMA map physically
> > contiguous memory just because there are fewer segments to iterate, so
> > NVMe would prefer to let them coalesce.
>
> If so, NO_SG_MERGE should be removed now.

Sounds good.

> Could you share your test case? Last time I used null_blk to observe
> the performance difference between NO_SG_MERGE and non-NO_SG_MERGE, and
> basically no difference was observed.

null_blk doesn't do anything, so it's probably difficult to measure
that. Use a driver that DMA maps the data, uses MMIO to submit to h/w,
and takes an interrupt. You can measure all of these when you create
50% more commands than necessary with unneeded splits.
On Wed, Jan 6, 2016 at 11:18 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 02:53:22PM +0800, Ming Lei wrote:
>> On Wed, Jan 6, 2016 at 2:29 PM, Ming Lei <tom.leiming@gmail.com> wrote:
>> >
>> > bvec_iter_advance() can split anywhere, but the split bios may cross
>> > the same bvec, which may cause trouble,
>
> Cross? One ends where the other starts. Could you please explain what's
> wrong?

Suppose a bio is split into bio1 and bio2; then the last bvec of bio1
is the same as the first bvec of bio2.

>> > for example, BIOVEC_PHYS_MERGEABLE() may not work well, and
>> > blk_phys_contig_segment() too.
>
> Could you please explain why they may not work well? I have no idea
> what concern you have with BIOVEC_PHYS_MERGEABLE in this case.
>
> The only place blk_phys_contig_segment is called from is
> ll_merge_requests, and the first req of the split bio wouldn't even go
> through here because it's marked REQ_NOMERGE.

OK, looks like I forgot about this change.

>> blk_rq_map_sg() isn't ready for splitting in the middle of a bvec
>> either.
>
> Could you please explain? It just uses the bio iterator, which I'm
> pretty sure is not broken.

Please see the first line of __blk_segment_map_sg(), in which only one
whole bvec is handled; a partial bvec can't be figured out there.

Thinking about it further, drivers often use bv.bv_len directly in the
iterator, for example:

	bio_for_each_segment(bvec, bio, iter)
		memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
		       addr + offset, bvec.bv_len);

So your patch will break these drivers, won't it?
On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
> Please see the first line of __blk_segment_map_sg(), in which only one
> whole bvec is handled; a partial bvec can't be figured out there.
>
> Thinking about it further, drivers often use bv.bv_len directly in the
> iterator, for example:
>
>	bio_for_each_segment(bvec, bio, iter)
>		memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
>		       addr + offset, bvec.bv_len);
>
> So your patch will break these drivers, won't it?

CC'ing Kent in hopes he will clarify what happens on a split.

The bio_advance() code comments say it's handled:

"
 * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
 * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
 * be updated on the last bvec as well.
"

I admit I'm having a hard time seeing where bv_len and bv_offset are
updated in this path. It was obviously handled after 054bdf646e, then
changed with 4550dd6c6b.

If I follow correctly, 4550dd6c6b will implicitly update the bvec's
offset and length during the split here, since bio_iter_iovec resets
the bvec's length and offset:

---
#define __bio_for_each_segment(bvl, bio, iter, start)			\
	for (iter = (start);						\
	     (iter).bi_size &&						\
		((bvl = bio_iter_iovec((bio), (iter))), 1);		\
	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
--
On Thu, Jan 7, 2016 at 12:21 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
>> Please see the first line of __blk_segment_map_sg(), in which only one
>> whole bvec is handled; a partial bvec can't be figured out there.
>>
>> Thinking about it further, drivers often use bv.bv_len directly in the
>> iterator, for example:
>>
>>	bio_for_each_segment(bvec, bio, iter)
>>		memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
>>		       addr + offset, bvec.bv_len);
>>
>> So your patch will break these drivers, won't it?
>
> CC'ing Kent in hopes he will clarify what happens on a split.
>
> The bio_advance() code comments say it's handled:
>
> "
>  * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
>  * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
>  * be updated on the last bvec as well.
> "

One exception is that both can be reset in bio_clone_bioset().

> I admit I'm having a hard time seeing where bv_len and bv_offset are
> updated in this path. It was obviously handled after 054bdf646e, then
> changed with 4550dd6c6b.
>
> If I follow correctly, 4550dd6c6b will implicitly update the bvec's
> offset and length during the split here, since bio_iter_iovec resets
> the bvec's length and offset:
> ---
> #define __bio_for_each_segment(bvl, bio, iter, start)			\
>	for (iter = (start);						\
>	     (iter).bi_size &&						\
>		((bvl = bio_iter_iovec((bio), (iter))), 1);		\
>	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
> --
On Wed, Jan 06, 2016 at 04:21:17PM +0000, Keith Busch wrote:
> On Wed, Jan 06, 2016 at 11:43:45PM +0800, Ming Lei wrote:
> > Please see the first line of __blk_segment_map_sg(), in which only one
> > whole bvec is handled; a partial bvec can't be figured out there.
> >
> > Thinking about it further, drivers often use bv.bv_len directly in the
> > iterator, for example:
> >
> >	bio_for_each_segment(bvec, bio, iter)
> >		memcpy(page_address(bvec.bv_page) + bvec.bv_offset,
> >		       addr + offset, bvec.bv_len);
> >
> > So your patch will break these drivers, won't it?
>
> CC'ing Kent in hopes he will clarify what happens on a split.
>
> The bio_advance() code comments say it's handled:
>
> "
>  * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
>  * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
>  * be updated on the last bvec as well.
> "
>
> I admit I'm having a hard time seeing where bv_len and bv_offset are
> updated in this path. It was obviously handled after 054bdf646e, then
> changed with 4550dd6c6b.
>
> If I follow correctly, 4550dd6c6b will implicitly update the bvec's
> offset and length during the split here, since bio_iter_iovec resets
> the bvec's length and offset:
> ---
> #define __bio_for_each_segment(bvl, bio, iter, start)			\
>	for (iter = (start);						\
>	     (iter).bi_size &&						\
>		((bvl = bio_iter_iovec((bio), (iter))), 1);		\
>	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
> --

Yes, splitting in the middle of a bvec is perfectly fine. The reason
bio_for_each_segment takes a struct bvec and not a struct bvec * is
that it's computing what bv_len should be (taking the min of bv_len and
bi_size, roughly).

See include/linux/bio.h:

bio_for_each_segment()
bio_iter_iovec()
bvec_iter_bvec()
bvec_iter_len()

which does the actual bv_len computation.
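To make Kent's answer concrete, here is a standalone mock-up of that clamping (a sketch mirroring what bvec_iter_len()/bvec_iter_offset() compute, not the kernel code itself). Each half of a bio split mid-bvec sees only its own share of the shared bvec, so driver loops that use bv.bv_len directly, like the memcpy example above, keep working:

```c
#include <stdio.h>
#include <stdint.h>

struct bvec { uint32_t bv_len, bv_offset; };
struct bvec_iter { uint32_t bi_size, bi_idx, bi_bvec_done; };

/* min(bytes left in the iterator, bytes left in the current bvec) --
 * the same clamping bvec_iter_len() does in include/linux/bio.h. */
static uint32_t iter_len(const struct bvec *bv, struct bvec_iter it)
{
	uint32_t left = bv[it.bi_idx].bv_len - it.bi_bvec_done;
	return it.bi_size < left ? it.bi_size : left;
}

/* bv_offset is likewise shifted by how far the iterator has advanced
 * into the current bvec (cf. bvec_iter_offset()). */
static uint32_t iter_offset(const struct bvec *bv, struct bvec_iter it)
{
	return bv[it.bi_idx].bv_offset + it.bi_bvec_done;
}

int main(void)
{
	struct bvec table[] = { { .bv_len = 8192, .bv_offset = 0 } };
	/* A bio split 4K into this single 8K bvec: */
	struct bvec_iter front = { .bi_size = 4096, .bi_idx = 0, .bi_bvec_done = 0 };
	struct bvec_iter back  = { .bi_size = 4096, .bi_idx = 0, .bi_bvec_done = 4096 };

	printf("front: len=%u off=%u\n", iter_len(table, front), iter_offset(table, front));
	printf("back:  len=%u off=%u\n", iter_len(table, back), iter_offset(table, back));
	return 0;
}
```

Both halves report len=4096, with the back half's offset shifted by 4096, which is exactly why splitting mid-bvec is safe for iterator-based consumers.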
On Thu, Jan 7, 2016 at 6:46 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> On Wed, Jan 06, 2016 at 04:21:17PM +0000, Keith Busch wrote:
...
> Yes, splitting in the middle of a bvec is perfectly fine. The reason
> bio_for_each_segment takes a struct bvec and not a struct bvec * is
> that it's computing what bv_len should be (taking the min of bv_len
> and bi_size, roughly).
>
> See include/linux/bio.h:
>
> bio_for_each_segment()
> bio_iter_iovec()
> bvec_iter_bvec()
> bvec_iter_len()
>
> which does the actual bv_len computation.

Looks like my understanding was wrong and Keith's patch is correct;
sorry for the noise.
diff --git a/block/blk-merge.c b/block/blk-merge.c
index e73846a..64fbbba 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -79,9 +79,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	unsigned front_seg_size = bio->bi_seg_front_size;
 	bool do_split = true;
 	struct bio *new = NULL;
+	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
 
 	bio_for_each_segment(bv, bio, iter) {
-		if (sectors + (bv.bv_len >> 9) > blk_max_size_offset(q, bio->bi_iter.bi_sector))
+		if (no_sg_merge)
+			goto new_segment;
+
+		if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
 			goto split;
 
 		/*
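For context, the no_sg_merge branch above only fires for queues flagged NO_SG_MERGE; in that era blk-mq set the flag at queue init unless the driver passed BLK_MQ_F_SG_MERGE. A hypothetical driver-side sketch of opting in (the function name and call site are illustrative, not from the patch):

```c
/* Hypothetical init path: opt this queue out of SG merging so that
 * blk_bio_segment_split() takes the no_sg_merge shortcut above. */
static void mydrv_setup_queue(struct request_queue *q)
{
	queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
}
```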