Message ID: 20250211005646.222452-1-kbusch@meta.com (mailing list archive)
Series: ublk zero-copy support
On Mon, Feb 10, 2025 at 04:56:40PM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Previous version was discussed here:
>
> https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
>
> The same ublksrv reference code in that link was used to test the kernel
> side changes.
>
> Before listing what has changed, I want to mention what is the same: the
> reliance on the ring ctx lock to serialize the register ahead of any
> use. I'm not ignoring the feedback; I just don't have a solid answer
> right now, and want to progress on the other fronts in the meantime.

It is explained in the following links:

https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/

- The node kbuffer is registered in the ublk uring_cmd's ->issue(), but
  looked up in the RW_FIXED op's ->prep(), and ->prep() is always called
  before ->issue() when the two are submitted in the same
  io_uring_enter(), so you need to move io_rsrc_node_lookup() and the
  buffer import from RW_FIXED's ->prep() to ->issue() first.

- Secondly, ->issue() order is only respected by IO_LINK, and io_uring
  can't provide such a guarantee without using IO_LINK; Pavel explained
  it in the following link:

  https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/

There are also other examples: if the register-buffer op sits in one
link chain and the consumer op isn't in that chain, the consumer op can
still be issued before the register_buffer is issued.

Thanks,
Ming
On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> It is explained in the following links:
>
> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
>
> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>   ->issue() when the two are submitted in same io_uring_enter(), so you
>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
>   ->prep() to ->issue() first.

I don't think that's accurate, at least in practice. In a normal flow,
we'll have this sequence:

  io_submit_sqes
    io_submit_sqe (uring_cmd ublk register)
      io_init_req
        ->prep()
      io_queue_sqe
        ->issue()
    io_submit_sqe (read/write_fixed)
      io_init_req
        ->prep()
      io_queue_sqe
        ->issue()

The first SQE is handled in its entirety before even looking at the
subsequent SQE. Since the register is first, the read/write_fixed's prep
will have a valid index. Testing this patch series appears to show this
reliably works.

> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>   can't provide such guarantee without using IO_LINK:
>
>   Pavel explained it in the following link:
>
>   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
>
> There are also other examples, such as, register buffer stays in one
> link chain, and the consumer OP isn't in this chain, the consumer OP
> can still be issued before issuing register_buffer.

Yep, I got that. Linking is just something I was hoping to avoid. I
understand there are conditions that can break the normal flow I'm
relying on regarding the ordering. This hasn't appeared to be a problem
in practice, but I agree this needs to be handled.
On 2/12/25 15:28, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
>> It is explained in the following links:
>>
>> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
>>
>> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>>   ->issue() when the two are submitted in same io_uring_enter(), so you
>>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
>>   ->prep() to ->issue() first.
>
> I don't think that's accurate, at least in practice. In a normal flow,
> we'll have this sequence:
>
>   io_submit_sqes
>     io_submit_sqe (uring_cmd ublk register)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>     io_submit_sqe (read/write_fixed)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>
> The first SQE is handled in its entirety before even looking at the
> subsequent SQE. Since the register is first, then the read/write_fixed's
> prep will have a valid index. Testing this patch series appears to show
> this reliably works.

Ming describes how it works for links. This one is indeed how non-links
are normally executed. Though I'd repeat, it's an implementation detail
and not a part of the uapi. Interestingly, Keith, you sent some patches
changing the ordering here quite a while ago, just as an example of how
it can change.

>> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>>   can't provide such guarantee without using IO_LINK:
>>
>>   Pavel explained it in the following link:
>>
>>   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
>>
>> There are also other examples, such as, register buffer stays in one
>> link chain, and the consumer OP isn't in this chain, the consumer OP
>> can still be issued before issuing register_buffer.
>
> Yep, I got that. Linking is just something I was hoping to avoid. I
> understand there are conditions that can break the normal flow I'm
> relying on regarding the ordering. This hasn't appeared to be a problem
> in practice, but I agree this needs to be handled.
On Wed, Feb 12, 2025 at 04:06:58PM +0000, Pavel Begunkov wrote:
> On 2/12/25 15:28, Keith Busch wrote:
> > On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> > > It is explained in the following links:
> > >
> > > https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
> > >
> > > - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
> > >   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
> > >   ->issue() when the two are submitted in same io_uring_enter(), so you
> > >   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
> > >   ->prep() to ->issue() first.
> >
> > I don't think that's accurate, at least in practice. In a normal flow,
> > we'll have this sequence:
> >
> >   io_submit_sqes
> >     io_submit_sqe (uring_cmd ublk register)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >     io_submit_sqe (read/write_fixed)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >
> > The first SQE is handled in its entirety before even looking at the
> > subsequent SQE. Since the register is first, then the read/write_fixed's
> > prep will have a valid index. Testing this patch series appears to show
> > this reliably works.
>
> Ming describes how it works for links. This one is indeed how
> non links are normally executed. Though I'd repeat it's an
> implementation detail and not a part of the uapi. Interestingly,
> Keith, you sent some patches changing the ordering here quite a
> while ago, just as an example of how it can change.

My fault, I should have provided the link or the async background.

> > > - secondly, ->issue() order is only respected by IO_LINK, and io_uring
> > >   can't provide such guarantee without using IO_LINK:
> > >
> > >   Pavel explained it in the following link:
> > >
> > >   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
> > >
> > > There are also other examples, such as, register buffer stays in one
> > > link chain, and the consumer OP isn't in this chain, the consumer OP
> > > can still be issued before issuing register_buffer.
> >
> > Yep, I got that. Linking is just something I was hoping to avoid. I
> > understand there are conditions that can break the normal flow I'm
> > relying on regarding the ordering. This hasn't appeared to be a problem
> > in practice, but I agree this needs to be handled.

LINK/ASYNC needs to be supported, and sometimes they are useful:

- IO_LINK is the only way of respecting IO order; io_uring only supports
  non-linked, or all-linked, submissions in one batch.

- ASYNC can sometimes avoid calling ->issue() twice unnecessarily, if
  you know in advance that the OP can't be handled in an async way;
  maybe not a problem for ublk uring_cmd, but it is helpful for some FS
  writes (un-allocated writes).

Thanks,
Ming
Hi,

> -----Original Message-----
> From: Keith Busch <kbusch@meta.com>
> Sent: Tuesday, February 11, 2025 8:57 AM
> To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> linux-block@vger.kernel.org; io-uring@vger.kernel.org
> Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> Subject: [PATCHv2 0/6] ublk zero-copy support
>
> From: Keith Busch <kbusch@kernel.org>
>
> Previous version was discussed here:
>
> https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
>
> The same ublksrv reference code in that link was used to test the
> kernel side changes.
>
> Before listing what has changed, I want to mention what is the same:
> the reliance on the ring ctx lock to serialize the register ahead of
> any use. I'm not ignoring the feedback; I just don't have a solid
> answer right now, and want to progress on the other fronts in the
> meantime.
>
> Here's what's different from the previous:
>
> - Introduced an optional 'release' callback when the resource node is
>   no longer referenced. The callback addresses any buggy applications
>   that may complete their request and unregister their index while IO
>   is in flight. This obviates any need to take extra page references
>   since it prevents the request from completing.
>
> - Removed peeking into the io_cache element size and instead use a
>   more intuitive bvec segment count limit to decide if we're caching
>   the imu (suggested by Pavel).
>
> - Dropped the const request changes; it's not needed.

I tested this patch set. When I use null as the device, the test
results are like your v1: when the bs is 4k, there is a slight
improvement; when the bs is 64k, there is a significant improvement.
However, when I used loop as the device, I found that there was no
improvement, whether using 4k or 64k. As follows:

ublk add -t loop -f ./ublk-loop.img
ublk add -t loop -f ./ublk-loop-zerocopy.img

fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)

So, is this patch set optimized only for null-type devices? Or if I've
missed any key information, please let me know.

---
Li Zetao
On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> I tested this patch set. When I use null as the device, the test
> results are like your v1. When the bs is 4k, there is a slight
> improvement; when the bs is 64k, there is a significant improvement.
> However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
>
> ublk add -t loop -f ./ublk-loop.img
> ublk add -t loop -f ./ublk-loop-zerocopy.img
>
> fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
>
> fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
>
> So, this patch set is optimized for null type devices? Or if I've
> missed any key information, please let me know.

What do you get if you run your fio job directly on your ublk-loop.img
file? Throughput should improve until you've saturated the backend
device. Once you hit that point, the primary benefits of zero-copy come
from decreased memory and CPU utilization.
On Thu, Feb 13, 2025 at 11:24 PM lizetao <lizetao1@huawei.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Keith Busch <kbusch@meta.com>
> > Sent: Tuesday, February 11, 2025 8:57 AM
> > To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> > linux-block@vger.kernel.org; io-uring@vger.kernel.org
> > Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> > Subject: [PATCHv2 0/6] ublk zero-copy support
> >
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Previous version was discussed here:
> >
> > https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
> >
> > The same ublksrv reference code in that link was used to test the
> > kernel side changes.
> >
> > Before listing what has changed, I want to mention what is the same:
> > the reliance on the ring ctx lock to serialize the register ahead of
> > any use. I'm not ignoring the feedback; I just don't have a solid
> > answer right now, and want to progress on the other fronts in the
> > meantime.
> >
> > Here's what's different from the previous:
> >
> > - Introduced an optional 'release' callback when the resource node is
> >   no longer referenced. The callback addresses any buggy applications
> >   that may complete their request and unregister their index while IO
> >   is in flight. This obviates any need to take extra page references
> >   since it prevents the request from completing.
> >
> > - Removed peeking into the io_cache element size and instead use a
> >   more intuitive bvec segment count limit to decide if we're caching
> >   the imu (suggested by Pavel).
> >
> > - Dropped the const request changes; it's not needed.
>
> I tested this patch set. When I use null as the device, the test results
> are like your v1. When the bs is 4k, there is a slight improvement; when
> the bs is 64k, there is a significant improvement.

Yes, the improvement is usually more obvious with a big IO size (>= 64K).

> However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
>
> ublk add -t loop -f ./ublk-loop.img
> ublk add -t loop -f ./ublk-loop-zerocopy.img
>
> fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
>
> fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
>
> So, this patch set is optimized for null type devices? Or if I've missed
> any key information, please let me know.

Latency may have decreased a bit.

System resources can't be saturated at a queue depth of one; please run
the same test with a high queue depth, per Keith's suggestion:

  --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

Also, if you set up the backing file as a ramfs image, the improvement
should be pretty obvious; I observed IOPS doubled this way.

Thanks,
Ming
Hi,

> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Friday, February 14, 2025 12:07 AM
> To: lizetao <lizetao1@huawei.com>
> Cc: Keith Busch <kbusch@meta.com>; io-uring@vger.kernel.org;
> axboe@kernel.dk
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
>
> On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> > I tested this patch set. When I use null as the device, the test
> > results are like your v1. When the bs is 4k, there is a slight
> > improvement; when the bs is 64k, there is a significant improvement.
> > However, when I used loop as the device, I found that there was no
> > improvement, whether using 4k or 64k. As follow:
> >
> > ublk add -t loop -f ./ublk-loop.img
> > ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> > fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> > fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> > So, this patch set is optimized for null type devices? Or if I've
> > missed any key information, please let me know.
>
> What do you get if you run your fio job directly on your ublk-loop.img file?

I tested it directly on ublk-loop.img, and the result is as follows:

fio -filename=./ublk-loop.img -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=1005, BW=126MiB/s (132MB/s)(1258MiB/10009msec)

From the test results, this seems to be limited by the performance of
the file system where ./ublk-loop.img is located.

> Throughput should improve until you've saturated the backend device.
> Once you hit that point, the primary benefits of zero-copy come from
> decreased memory and CPU utilization.

---
Li Zetao
Hi,

> -----Original Message-----
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Friday, February 14, 2025 10:41 AM
> To: lizetao <lizetao1@huawei.com>
> Cc: Keith Busch <kbusch@meta.com>; io-uring@vger.kernel.org;
> axboe@kernel.dk; Ming Lei <ming.lei@redhat.com>
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
>
> On Thu, Feb 13, 2025 at 11:24 PM lizetao <lizetao1@huawei.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Keith Busch <kbusch@meta.com>
> > > Sent: Tuesday, February 11, 2025 8:57 AM
> > > To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> > > linux-block@vger.kernel.org; io-uring@vger.kernel.org
> > > Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> > > Subject: [PATCHv2 0/6] ublk zero-copy support
> > >
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > Previous version was discussed here:
> > >
> > > https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
> > >
> > > The same ublksrv reference code in that link was used to test the
> > > kernel side changes.
> > >
> > > Before listing what has changed, I want to mention what is the same:
> > > the reliance on the ring ctx lock to serialize the register ahead of
> > > any use. I'm not ignoring the feedback; I just don't have a solid
> > > answer right now, and want to progress on the other fronts in the
> > > meantime.
> > >
> > > Here's what's different from the previous:
> > >
> > > - Introduced an optional 'release' callback when the resource node is
> > >   no longer referenced. The callback addresses any buggy applications
> > >   that may complete their request and unregister their index while IO
> > >   is in flight. This obviates any need to take extra page references
> > >   since it prevents the request from completing.
> > >
> > > - Removed peeking into the io_cache element size and instead use a
> > >   more intuitive bvec segment count limit to decide if we're caching
> > >   the imu (suggested by Pavel).
> > >
> > > - Dropped the const request changes; it's not needed.
> >
> > I tested this patch set. When I use null as the device, the test results
> > are like your v1. When the bs is 4k, there is a slight improvement; when
> > the bs is 64k, there is a significant improvement.
>
> Yes, the improvement is usually more obvious with a big IO size (>= 64K).
>
> > However, when I used loop as the device, I found that there was no
> > improvement, whether using 4k or 64k. As follow:
> >
> > ublk add -t loop -f ./ublk-loop.img
> > ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> > fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> > fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> > So, this patch set is optimized for null type devices? Or if I've missed
> > any key information, please let me know.
>
> Latency may have decreased a bit.
>
> System resources can't be saturated at a queue depth of one; please run
> the same test with a high queue depth, per Keith's suggestion:
>
> --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

I tested it with these settings, but the result is similar to iodepth=1:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=2182, BW=136MiB/s (143MB/s)(1440MiB/10558msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=2174, BW=136MiB/s (143MB/s)(1438MiB/10580msec)

So I believe this is limited by the performance of the file system where
./ublk-loop.img is located.

> Also if you set up the backing file as a ramfs image, the improvement
> should be pretty obvious; I observed IOPS doubled this way.

This is true; I tested it in /tmp/ and saw a large improvement. The
results are as follows:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=95.8k, BW=5985MiB/s (6276MB/s)(58.5GiB/10014msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=170k, BW=10.4GiB/s (11.1GB/s)(80.0GiB/7721msec)

So this test result is in line with expectations.

---
Li Zetao
From: Keith Busch <kbusch@kernel.org>

Previous version was discussed here:

https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/

The same ublksrv reference code in that link was used to test the kernel
side changes.

Before listing what has changed, I want to mention what is the same: the
reliance on the ring ctx lock to serialize the register ahead of any
use. I'm not ignoring the feedback; I just don't have a solid answer
right now, and want to progress on the other fronts in the meantime.

Here's what's different from the previous:

- Introduced an optional 'release' callback when the resource node is
  no longer referenced. The callback addresses any buggy applications
  that may complete their request and unregister their index while IO
  is in flight. This obviates any need to take extra page references
  since it prevents the request from completing.

- Removed peeking into the io_cache element size and instead use a
  more intuitive bvec segment count limit to decide if we're caching
  the imu (suggested by Pavel).

- Dropped the const request changes; it's not needed.

- Merged up to latest block uring branch

Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  io_uring: create resource release callback
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 145 ++++++++++++++-----
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  28 ++--
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 252 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |  11 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 13 files changed, 355 insertions(+), 113 deletions(-)