Message ID | 20240530142417.146696-2-ofir.gal@volumez.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/4] net: introduce helper sendpages_ok() | expand |
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On 30/05/2024 17:24, Ofir Gal wrote: > Network drivers are using sendpage_ok() to check the first page of an > iterator in order to disable MSG_SPLICE_PAGES. The iterator can > represent list of contiguous pages. > > When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used, > it requires all pages in the iterator to be sendable. Therefore it needs > to check that each page is sendable. > > The patch introduces a helper sendpages_ok(), it returns true if all the > contiguous pages are sendable. > > Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use > this helper to check whether the page list is OK. If the helper does not > return true, the driver should remove MSG_SPLICE_PAGES flag. > > Signed-off-by: Ofir Gal <ofir.gal@volumez.com> > --- > include/linux/net.h | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/include/linux/net.h b/include/linux/net.h > index 688320b79fcc..b33bdc3e2031 100644 > --- a/include/linux/net.h > +++ b/include/linux/net.h > @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page) > return !PageSlab(page) && page_count(page) >= 1; > } > > +/* > + * Check sendpage_ok on contiguous pages. > + */ > +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset) > +{ > + unsigned int pagecount; > + size_t page_offset; > + int k; > + > + page = page + offset / PAGE_SIZE; > + page_offset = offset % PAGE_SIZE; lets not modify the input page variable. p = page + offset >> PAGE_SHIFT; poffset = offset & PAGE_MASK; > + pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE); > + > + for (k = 0; k < pagecount; k++) > + if (!sendpage_ok(page + k)) > + return false; perhaps instead of doing a costly DIV_ROUND_UP for every network send we can do: count = 0; while (count < len) { if (!sendpage_ok(p)) return false; page++; count += PAGE_SIZE; } And we can lose page_offset. It can be done in a number of ways, but we should be able to do it without the DIV_ROUND_UP... I still don't understand how a page in the middle of a contiguous range ends up coming from the slab while others don't. Ofir, can you please check which condition in sendpage_ok actually fails?
On 31/05/2024 11:51, Sagi Grimberg wrote: > > > On 30/05/2024 17:24, Ofir Gal wrote: >> Network drivers are using sendpage_ok() to check the first page of an >> iterator in order to disable MSG_SPLICE_PAGES. The iterator can >> represent list of contiguous pages. >> >> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used, >> it requires all pages in the iterator to be sendable. Therefore it needs >> to check that each page is sendable. >> >> The patch introduces a helper sendpages_ok(), it returns true if all the >> contiguous pages are sendable. >> >> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use >> this helper to check whether the page list is OK. If the helper does not >> return true, the driver should remove MSG_SPLICE_PAGES flag. >> >> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> >> --- >> include/linux/net.h | 20 ++++++++++++++++++++ >> 1 file changed, 20 insertions(+) >> >> diff --git a/include/linux/net.h b/include/linux/net.h >> index 688320b79fcc..b33bdc3e2031 100644 >> --- a/include/linux/net.h >> +++ b/include/linux/net.h >> @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page) >> return !PageSlab(page) && page_count(page) >= 1; >> } >> +/* >> + * Check sendpage_ok on contiguous pages. >> + */ >> +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset) >> +{ >> + unsigned int pagecount; >> + size_t page_offset; >> + int k; >> + >> + page = page + offset / PAGE_SIZE; >> + page_offset = offset % PAGE_SIZE; > > lets not modify the input page variable. > > p = page + offset >> PAGE_SHIFT; > poffset = offset & PAGE_MASK; Ok, will be applied in the next patch set. >> + pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE); >> + >> + for (k = 0; k < pagecount; k++) >> + if (!sendpage_ok(page + k)) >> + return false; > > perhaps instead of doing a costly DIV_ROUND_UP for every network send we can do: > > count = 0; > while (count < len) { > if (!sendpage_ok(p)) > return false; > page++; > count += PAGE_SIZE; > } > > And we can lose page_offset. > > It can be done in a number of ways, but we should be able to do it > without the DIV_ROUND_UP... Ok, will be applied in the next patch set. > I still don't understand how a page in the middle of a contiguous range ends > up coming from the slab while others don't. I haven't investigate the origin of the IO yet. I suspect the first 2 pages are the superblocks of the raid (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap. > Ofir, can you please check which condition in sendpage_ok actually fails? It failed because the page has slab, page count is 1. Sorry for not clarifying this. "skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1" ^ The print I used: pr_info( "!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n", (void *)page, page_to_pfn(page), page_address(page), !!PageSlab(page), page_count(page) );
>> I still don't understand how a page in the middle of a contiguous range ends >> up coming from the slab while others don't. > I haven't investigate the origin of the IO > yet. I suspect the first 2 pages are the superblocks of the raid > (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap. Well, if these indeed are different origins and just *happen* to be a mixture of slab originated pages and non-slab pages combined together in a single bio of a bvec entry, I'd suspect that it would be more beneficial to split the bvec (essentially not allow bio_add_page to append the page to tail bvec depending on a queue limit (similar to how we handle sg gaps). > >> Ofir, can you please check which condition in sendpage_ok actually fails? > It failed because the page has slab, page count is 1. Sorry for not > clarifying this. > > "skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1" > ^ > The print I used: > pr_info( > "!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n", > (void *)page, > page_to_pfn(page), > page_address(page), > !!PageSlab(page), > page_count(page) > ); > >
On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote: >>> I still don't understand how a page in the middle of a contiguous range ends >>> up coming from the slab while others don't. >> I haven't investigate the origin of the IO >> yet. I suspect the first 2 pages are the superblocks of the raid >> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap. > > Well, if these indeed are different origins and just *happen* to be a > mixture > of slab originated pages and non-slab pages combined together in a single > bio of a bvec entry, > I'd suspect that it would be more beneficial to split the bvec (essentially > not allow bio_add_page > to append the page to tail bvec depending on a queue limit (similar to how > we handle sg gaps). So you want to add a PageSlab check to bvec_try_merge_page? That sounds fairly expensive..
On 04/06/2024 7:27, Christoph Hellwig wrote: > On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote: >>>> I still don't understand how a page in the middle of a contiguous range ends >>>> up coming from the slab while others don't. >>> I haven't investigate the origin of the IO >>> yet. I suspect the first 2 pages are the superblocks of the raid >>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap. >> Well, if these indeed are different origins and just *happen* to be a >> mixture >> of slab originated pages and non-slab pages combined together in a single >> bio of a bvec entry, >> I'd suspect that it would be more beneficial to split the bvec (essentially >> not allow bio_add_page >> to append the page to tail bvec depending on a queue limit (similar to how >> we handle sg gaps). > So you want to add a PageSlab check to bvec_try_merge_page? That sounds > fairly expensive.. > The check needs to happen somewhere apparently, and given that it will be gated by a queue flag only request queues that actually needed will suffer, but they will suffer anyways...
On 04/06/2024 11:24, Sagi Grimberg wrote: > > > On 04/06/2024 7:27, Christoph Hellwig wrote: >> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote: >>>>> I still don't understand how a page in the middle of a contiguous >>>>> range ends >>>>> up coming from the slab while others don't. >>>> I haven't investigate the origin of the IO >>>> yet. I suspect the first 2 pages are the superblocks of the raid >>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the >>>> bitmap. >>> Well, if these indeed are different origins and just *happen* to be a >>> mixture >>> of slab originated pages and non-slab pages combined together in a >>> single >>> bio of a bvec entry, >>> I'd suspect that it would be more beneficial to split the bvec >>> (essentially >>> not allow bio_add_page >>> to append the page to tail bvec depending on a queue limit (similar >>> to how >>> we handle sg gaps). >> So you want to add a PageSlab check to bvec_try_merge_page? That sounds >> fairly expensive.. >> > > The check needs to happen somewhere apparently, and given that it will > be gated by a queue flag > only request queues that actually needed will suffer, but they will > suffer anyways... Something like the untested patch below: -- diff --git a/block/bio.c b/block/bio.c index 53f608028c78..e55a4184c0e6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/blk-crypto.h> #include <linux/xarray.h> +#include <linux/net.h> #include <trace/events/block.h> #include "blk.h" @@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv, return false; if (len > queue_max_segment_size(q) - bv->bv_len) return false; + if (q->limits.splice_pages && + sendpage_ok(bv->bv_page) ^ sendpage_ok(page)) + return false; return bvec_try_merge_page(bv, page, len, offset, same_page); } diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index a7e820840cf7..82e2719acb9c 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl, lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; lim->max_segment_size = UINT_MAX; lim->dma_alignment = 3; + lim->splice_pages = ctrl->splice_pages; } static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id, diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 3f3e26849b61..d9818330e236 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -398,6 +398,7 @@ struct nvme_ctrl { enum nvme_ctrl_type cntrltype; enum nvme_dctype dctype; + bool splice_pages }; static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 02076b8cb4d8..618b8f20206a 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) if (error) goto out_stop_queue; + /* + * we want to prevent contig pages with conflicting splice-ability with + * respect to the network transmission + */ + ctrl->splice_pages = true; + nvme_unquiesce_admin_queue(ctrl); error = nvme_init_ctrl_finish(ctrl, false); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 69c4f113db42..ec657ddad2a4 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -331,6 +331,12 @@ struct queue_limits { * due to possible offsets. */ unsigned int dma_alignment; + + /* + * Drivers that use MSG_SPLICE_PAGES to send the bvec over the network, + * will need either bvec entry contig pages spliceable or not. + */ + bool splice_pages; }; typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx, -- What I now see is that we will check PageSlab twice (bvec last index and append page) and skb_splice_from_iter checks it again... How many times check we check this :) Would be great if the network stack can just check it once and fallback to page copy...
On 04/06/2024 16:01, Sagi Grimberg wrote: > > > On 04/06/2024 11:24, Sagi Grimberg wrote: >> >> >> On 04/06/2024 7:27, Christoph Hellwig wrote: >>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote: >>>>>> I still don't understand how a page in the middle of a contiguous range ends >>>>>> up coming from the slab while others don't. >>>>> I haven't investigate the origin of the IO >>>>> yet. I suspect the first 2 pages are the superblocks of the raid >>>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap. >>>> Well, if these indeed are different origins and just *happen* to be a >>>> mixture >>>> of slab originated pages and non-slab pages combined together in a single >>>> bio of a bvec entry, >>>> I'd suspect that it would be more beneficial to split the bvec (essentially >>>> not allow bio_add_page >>>> to append the page to tail bvec depending on a queue limit (similar to how >>>> we handle sg gaps). I have investigated the origin of the IO. It's a bug in the md-bitmap code. It's a single IO that __write_sb_page() sends, it rounds up the io size to the optimal io size but doesn't check that the final size exceeds the amount of pages it allocated. The slab pages aren't allocated by the md-bitmap, they are pages that happens to be after the allocated pages. I'm applying a patch to the md subsystem asap. I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write. offset: 16, size: 262144. pfn: 0x53ee ### __write_sb_page before md_super_write. logging pages ### pfn: 0x53ee, slab: 0 pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... There is only 1 page allocated for bitmap but __write_sb_page() tries to write 64 pages. >>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds >>> fairly expensive.. >>> >> >> The check needs to happen somewhere apparently, and given that it will be gated by a queue flag >> only request queues that actually needed will suffer, but they will suffer anyways... > What I now see is that we will check PageSlab twice (bvec last index and append page) > and skb_splice_from_iter checks it again... How many times check we check this :) > > Would be great if the network stack can just check it once and fallback to page copy... Nevertheless I think a check in the network stack or when merging bio's is still necessary.
On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote: > The slab pages aren't allocated by the md-bitmap, they are pages that > happens to be after the allocated pages. I'm applying a patch to the md > subsystem asap. Similar cases could happen by other means as well. E.g. if you write to the last blocks in an XFS allocation group while also writing something to superblock copy at the beginnig of the next one and the two writes get merged. Probably not easy to reproduce but entirely possible. Just as as lot of other scenarious could happen due to merges.
On 06/06/2024 16:08, Christoph Hellwig wrote: > On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote: >> The slab pages aren't allocated by the md-bitmap, they are pages that >> happens to be after the allocated pages. I'm applying a patch to the md >> subsystem asap. > Similar cases could happen by other means as well. E.g. if you write > to the last blocks in an XFS allocation group while also writing something > to superblock copy at the beginnig of the next one and the two writes get > merged. Probably not easy to reproduce but entirely possible. Just as > as lot of other scenarious could happen due to merges. I agree. Do we want to proceed with my patch or would we prefer a more efficient solution?
On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote: > I agree. Do we want to proceed with my patch or would we prefer a more > efficient solution? We'll need something for sure. I don't like the idea of hacking special cases that call into the networking code into the core block layer, so I think it'll have to be something more like your original patch.
On 06/06/2024 16:52, Christoph Hellwig wrote: > On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote: >> I agree. Do we want to proceed with my patch or would we prefer a more >> efficient solution? > We'll need something for sure. I don't like the idea of hacking > special cases that call into the networking code into the core block > layer, so I think it'll have to be something more like your original > patch. Ok, I'm sending v3 in case we need it.
diff --git a/include/linux/net.h b/include/linux/net.h index 688320b79fcc..b33bdc3e2031 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page) return !PageSlab(page) && page_count(page) >= 1; } +/* + * Check sendpage_ok on contiguous pages. + */ +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset) +{ + unsigned int pagecount; + size_t page_offset; + int k; + + page = page + offset / PAGE_SIZE; + page_offset = offset % PAGE_SIZE; + pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE); + + for (k = 0; k < pagecount; k++) + if (!sendpage_ok(page + k)) + return false; + + return true; +} + int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t len); int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
Network drivers are using sendpage_ok() to check the first page of an iterator in order to disable MSG_SPLICE_PAGES. The iterator can represent list of contiguous pages. When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used, it requires all pages in the iterator to be sendable. Therefore it needs to check that each page is sendable. The patch introduces a helper sendpages_ok(), it returns true if all the contiguous pages are sendable. Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use this helper to check whether the page list is OK. If the helper does not return true, the driver should remove MSG_SPLICE_PAGES flag. Signed-off-by: Ofir Gal <ofir.gal@volumez.com> --- include/linux/net.h | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)