Message ID | 20230215004122.28917-1-xiaoguang.wang@linux.alibaba.com (mailing list archive) |
---|---|
Series | Add io_uring & ebpf based methods to implement zero-copy for ublk |
On 2023/2/15 08:41, Xiaoguang Wang wrote:
> Normally, userspace block device implementations need to copy data between
> the kernel block layer's io requests and the userspace block device's
> daemon. For example, ublk and tcmu both have similar logic. This copy
> consumes cpu resources, especially for large io.
>
> There are methods that try to reduce this cpu overhead, so that userspace
> block device io performance can be improved further. These methods
> include: 1) use special hardware to do the memory copy, but it seems not
> all architectures have such hardware; 2) software methods, such as
> mmap'ing the kernel block layer's io request data into the userspace
> daemon [1], but this has page table map/unmap and tlb flush overhead,
> security issues, etc., and it may only be friendly to large io.
>
> Add a new program type BPF_PROG_TYPE_UBLK for ublk, which is a generic
> framework for implementing block device logic from userspace. Typical
> userspace block device implementations need to copy data between the kernel
> block layer's io requests and the userspace block device's daemon,
> which consumes cpu resources, especially for large io.
>
> To solve this problem, I'd propose a new method, which combines the
> respective advantages of io_uring and ebpf. Add a new program type
> BPF_PROG_TYPE_UBLK for ublk; the userspace block device daemon process
> should register an ebpf prog. This bpf prog will use the bpf helper offered
> by the ublk bpf prog type to submit io requests on behalf of the daemon
> process.
> Currently there is only one helper:
>     u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
>                 struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
>
> This helper uses io_uring to submit io requests, so we need to make
> io_uring able to submit a sqe located in the kernel (some of the code ideas
> come from Pavel's patchset [2], but Pavel's patches still need sqe->buf to
> come from a userspace address). The bpf prog initializes the sqes, but does
> not need to initialize the sqes' buf field; sqe->buf will come from the
> kernel block layer io requests in some form. See patch 2 for more.
>
> In the example of the ublk loop target, we can easily implement the logic
> below in an ebpf prog:
>   1. The userspace daemon registers an ebpf prog and passes two backend file
> fds in an ebpf map structure.
>   2. For kernel io requests against the first half of the userspace device,
> the ebpf prog prepares an io_uring sqe, which will submit io against the
> first backend file fd, and the sqe's buffer comes from the kernel io
> requests. Kernel io requests against the second half of the userspace
> device have similar logic, only the sqe's fd will be the second backend
> file fd.
>   3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog will
> be executed and will complete the kernel io requests.
>
> That means, by using ebpf, we can implement various userspace logic in the
> kernel.
>
> From the above example, we can see that this method has at least 3
> advantages:
>   1. It removes the memory copy between the kernel block layer and the
> userspace daemon completely.
>   2. It saves memory. The userspace daemon doesn't need to maintain memory
> to issue and complete io requests, and uses the kernel block layer io
> requests' memory directly.
>   3. We may reduce the number of round trips between the kernel and the
> userspace daemon, and so may reduce kernel & userspace context switch
> overhead.
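
To make step 2 of the loop target example concrete, here is a minimal userspace sketch of the routing decision only. It is not code from the patchset: struct backend_choice and route_request() are hypothetical names, and the sketch assumes 512-byte sectors, a request that does not straddle the midpoint, and that the second backend file starts at offset 0 for the second half of the device.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumption: 512-byte sectors */

/* Hypothetical result type: which backend fd to use and where to write. */
struct backend_choice {
	int fd;			/* backend file fd the sqe would target */
	uint64_t file_off;	/* byte offset within that backend file */
};

/*
 * Pick the backend for a request starting at start_sector on a device of
 * dev_sectors sectors. First half goes to fd_first, second half to
 * fd_second. Assumes the request does not cross the midpoint.
 */
static struct backend_choice route_request(uint64_t start_sector,
					   uint64_t dev_sectors,
					   int fd_first, int fd_second)
{
	uint64_t half = dev_sectors / 2;
	struct backend_choice c;

	assert(start_sector < dev_sectors);

	if (start_sector < half) {
		c.fd = fd_first;
		c.file_off = start_sector << SECTOR_SHIFT;
	} else {
		c.fd = fd_second;
		/* assumption: second file holds the second half from offset 0 */
		c.file_off = (start_sector - half) << SECTOR_SHIFT;
	}
	return c;
}
```

In the proposed design the equivalent decision would be made inside the ebpf prog, with the two fds read from the ebpf map, and the chosen fd and offset placed into the sqe passed to bpf_ublk_queue_sqe().
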
>
> Test:
> Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
>
> fio job file:
>   [global]
>   direct=1
>   filename=/dev/ublkb0
>   time_based
>   runtime=60
>   numjobs=1
>   cpus_allowed=1
>
>   [rand-read-4k]
>   bs=512K
>   iodepth=16
>   ioengine=libaio
>   rw=randwrite
>   stonewall
>
>
> Without this patch:
>   WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
>   ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by top.
>
> With this patch:
>   WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
>   ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by top.
>
> From the above tests, bandwidth is unchanged while the daemon's cpu
> utilization drops markedly, so this method reduces the cpu copy overhead.
>
>
> TODO:
> I must say this patchset is just an RFC for the design.
>
> 1) Currently, in this patchset, I just make the ublk ebpf prog submit io
> requests using io_uring in the kernel; cqe events still need to be handled
> in the userspace daemon. Once we later succeed in making io_uring handle
> cqes in the kernel, the ublk ebpf prog can implement io entirely in the
> kernel.
>
> 2) The ublk driver needs to work better with ebpf. Currently I have added
> some hack code to support ebpf in the ublk driver, and it can only support
> write requests.
>
> 3) I have not done much testing yet; I will run liburing/ublk/blktests
> later.
>
> Any review and suggestions are welcome, thanks.
>
> [1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com/
> [2] https://lore.kernel.org/all/cover.1621424513.git.asml.silence@gmail.com/
>
>
> Xiaoguang Wang (3):
>   bpf: add UBLK program type
>   io_uring: enable io_uring to submit sqes located in kernel
>   ublk_drv: add ebpf support
>
>  drivers/block/ublk_drv.c       | 228 ++++++++++++++++++++++++++++++++-
>  include/linux/bpf_types.h      |   2 +
>  include/linux/io_uring.h       |  13 ++
>  include/linux/io_uring_types.h |   8 +-
>  include/uapi/linux/bpf.h       |   2 +
>  include/uapi/linux/ublk_cmd.h  |  11 ++
>  io_uring/io_uring.c            |  59 ++++++++-
>  io_uring/rsrc.c                |  15 +++
>  io_uring/rsrc.h                |   3 +
>  io_uring/rw.c                  |   7 +
>  kernel/bpf/syscall.c           |   1 +
>  kernel/bpf/verifier.c          |   9 +-
>  scripts/bpf_doc.py             |   4 +
>  tools/include/uapi/linux/bpf.h |   9 ++
>  tools/lib/bpf/libbpf.c         |   2 +
>  15 files changed, 366 insertions(+), 7 deletions(-)
>

Hi,

Here is the perf report output of the ublk daemon (loop target):

+   57.96%     4.03%  ublk           liburing.so.2.2    [.] _io_uring_get_cqe
+   53.94%     0.00%  ublk           [kernel.vmlinux]   [k] entry_SYSCALL_64
+   53.94%     0.65%  ublk           [kernel.vmlinux]   [k] do_syscall_64
+   48.37%     1.18%  ublk           [kernel.vmlinux]   [k] __do_sys_io_uring_enter
+   42.92%     1.72%  ublk           [kernel.vmlinux]   [k] io_cqring_wait
+   35.17%     0.06%  ublk           [kernel.vmlinux]   [k] task_work_run
+   34.75%     0.53%  ublk           [kernel.vmlinux]   [k] io_run_task_work_sig
+   33.45%     0.00%  ublk           [kernel.vmlinux]   [k] ublk_bpf_io_submit_fn
+   33.16%     0.06%  ublk           bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog  [k] bpf_prog_3bdc6181a3c616fb_ublk_io_sub
+   32.68%     0.00%  iou-wrk-18583  [unknown]          [k] 0000000000000000
+   32.68%     0.00%  iou-wrk-18583  [unknown]          [k] 0x00007efe920b1040
+   32.68%     0.00%  iou-wrk-18583  [kernel.vmlinux]   [k] ret_from_fork
+   32.68%     0.47%  iou-wrk-18583  [kernel.vmlinux]   [k] io_wqe_worker
+   30.61%     0.00%  ublk           [kernel.vmlinux]   [k] io_submit_sqe
+   30.31%     0.06%  ublk           [kernel.vmlinux]   [k] io_issue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]   [k] bpf_ublk_queue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]   [k] io_uring_submit_sqe
+   27.18%     0.00%  ublk           [kernel.vmlinux]   [k] io_write
+   27.18%     0.00%  ublk           [xfs]              [k] xfs_file_write_iter

The call stack is:

-   57.96%     4.03%  ublk  liburing.so.2.2  [.] _io_uring_get_cqe
   - 53.94% _io_uring_get_cqe
        entry_SYSCALL_64
      - do_syscall_64
         - 48.37% __do_sys_io_uring_enter
            - 42.92% io_cqring_wait
               - 34.75% io_run_task_work_sig
                  - task_work_run
                     - 32.50% ublk_bpf_io_submit_fn
                        - 32.21% bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog
                           - 27.12% bpf_ublk_queue_sqe
                              - io_uring_submit_sqe
                                 - 26.64% io_submit_sqe
                                    - 26.35% io_issue_sqe
                                       - io_write
                                            xfs_file_write_iter

Here, the "io_submit" ebpf prog is run in task_work of the ublk daemon
process, after the io_uring_enter() syscall. In this ebpf prog, a sqe is
built and submitted. All information about the blk-mq request is stored
in a "ctx". Then io_uring can write to the backing file
(xfs_file_write_iter).

Here is the call stack from the perf report output of fio:

-    5.04%     0.18%  fio  [kernel.vmlinux]  [k] ublk_queue_rq
   - 4.86% ublk_queue_rq
      - 3.67% bpf_prog_b8456549dbe40c37_ublk_io_prep_prog
         - 3.10% bpf_trace_printk
              2.83% _raw_spin_unlock_irqrestore
      - 0.70% task_work_add
         - try_to_wake_up
              _raw_spin_unlock_irqrestore

Here, the "io_prep" ebpf prog is run in ublk_queue_rq(), in the context of
the fio process. In this ebpf prog, qid, tag, nr_sectors, start_sector,
op and flags are stored in one "ctx". Then we add a task_work to the ublk
daemon process.

Regards,
Zhang
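
The two stages described above can be modelled in plain userspace C as a rough sketch. Everything here is hypothetical: struct ublk_ctx_model only reuses the field names listed above (types and layout are assumptions), struct sqe_model is a simplified stand-in for the handful of sqe fields involved rather than the real struct io_uring_sqe, and the user_data encoding is an invented example. No buffer address is set, since in the proposal the buffer comes from the kernel io request.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ctx: field names from the description, types assumed. */
struct ublk_ctx_model {
	uint16_t qid;
	uint16_t tag;
	uint8_t  op;
	uint32_t flags;
	uint32_t nr_sectors;
	uint64_t start_sector;
};

/* Simplified stand-in for the sqe fields that matter in this sketch. */
struct sqe_model {
	uint8_t  opcode;
	int32_t  fd;
	uint64_t off;		/* byte offset in the backing file */
	uint32_t len;		/* byte length of the io */
	uint64_t user_data;
};

enum { MODEL_OP_READ = 0, MODEL_OP_WRITE = 1 };	/* placeholder opcodes */

/* Stage 1 ("io_prep", queue_rq context): record the request in the ctx. */
static void io_prep_model(struct ublk_ctx_model *ctx, uint16_t qid,
			  uint16_t tag, uint8_t op, uint32_t flags,
			  uint64_t start_sector, uint32_t nr_sectors)
{
	ctx->qid = qid;
	ctx->tag = tag;
	ctx->op = op;
	ctx->flags = flags;
	ctx->start_sector = start_sector;
	ctx->nr_sectors = nr_sectors;
}

/* Stage 2 ("io_submit", daemon task_work): build the sqe from the ctx. */
static void io_submit_model(const struct ublk_ctx_model *ctx, int backend_fd,
			    struct sqe_model *sqe)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = ctx->op;
	sqe->fd = backend_fd;
	sqe->off = ctx->start_sector << 9;	/* assumes 512-byte sectors */
	sqe->len = ctx->nr_sectors << 9;
	/* invented encoding so a completion could be matched to qid/tag */
	sqe->user_data = ((uint64_t)ctx->qid << 16) | ctx->tag;
}
```

For a 512K write at sector 2048 on queue 0, tag 5, stage 2 in this model yields off=1048576 and len=524288. The split mirrors the description above: only the small ctx crosses from the submitting process to the daemon's task_work, never the data.
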