mbox series

[V4,0/2] ublk: add io_uring based userspace block driver

Message ID 20220711022024.217163-1-ming.lei@redhat.com (mailing list archive)
Headers show
Series ublk: add io_uring based userspace block driver | expand

Message

Ming Lei July 11, 2022, 2:20 a.m. UTC
Hello Guys,

ublk driver is one kernel driver for implementing generic userspace block
device/driver, which delivers io request from ublk block device(/dev/ublkbN) into
ublk server[1] which is the userspace part of ublk for communicating
with ublk driver and handling specific io logic by its target module.

Another thing ublk driver handles is to copy data between user space buffer
and request/bio's pages, or take zero copy if mm is ready for support it in
future. ublk driver doesn't handle any IO logic of the specific driver, so
it is small/simple, and all io logics are done by the target code in ublkserver.

The above two are main jobs done by ublk driver.

ublk driver can help to move IO logic into userspace, in which the
development work is easier/more effective than doing in kernel, such as,
ublk-loop takes < 200 lines of loop specific code to get basically same 
function with kernel loop block driver, meantime the performance is
is even better than kernel loop with same setting. ublksrv[1] provide built-in
test for comparing both by running "make test T=loop", for example, see
the test result running on VM which is over my lattop(root disk is
nvme/device mapper/xfs):

	[root@ktest-36 ubdsrv]#make -s -C /root/git/ubdsrv/tests run T=loop/001 R=10
	running loop/001
		fio (ublk/loop(/root/git/ubdsrv/tests/tmp/ublk_loop_VqbMA), libaio, bs 4k, dio, hw queues:1)...
		randwrite: jobs 1, iops 32572
		randread: jobs 1, iops 143052
		rw: jobs 1, iops read 29919 write 29964
	
	[root@ktest-36 ubdsrv]# make test T=loop/003
	make -s -C /root/git/ubdsrv/tests run T=loop/003 R=10
	running loop/003
		fio (kernel_loop/kloop(/root/git/ubdsrv/tests/tmp/ublk_loop_ZIVnG), libaio, bs 4k, dio, hw queues:1)...
		randwrite: jobs 1, iops 27436
		randread: jobs 1, iops 95273
		rw: jobs 1, iops read 22542 write 22543 


Another example is high performance qcow2 support[2], which could be built with
ublk framework more easily than doing it inside kernel.

Also there are more people who express interests on userspace block driver[3],
Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
requirement from Google. Ziyang Zhang from Alibaba said they "plan to
replace TCMU by UBD as a new choice" because UBD can get better throughput than
TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
storage service for providing storage to containers.

It is io_uring based: io request is delivered to userspace via new added
io_uring command which has been proved as very efficient for making nvme
passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
shared/mmap buffer is used for sharing io descriptor to userspace, the
buffer is readonly for userspace, each IO just takes 24bytes so far.
It is suggested to use io_uring in userspace(target part of ublk server)
to handle IO request too. And it is still easy for ublkserver to support
io handling by non-io_uring, and this work isn't done yet, but can be
supported easily with help o eventfd.

This way is efficient since no extra io command copy is required, no sleep
is needed in transferring io command to userspace. Meantime the communication
protocol is simple and efficient, one single command of
UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
command result in one trip. IO handling is often batched after single
io_uring_enter() returns, both IO requests from ublk server target and
IO commands could be handled as a whole batch.

And the patch by patch change can be found in the following
tree:

https://github.com/ming1/linux/tree/my_for-5.20-ubd-devel_v4

ublk server repo(master branch):

	https://github.com/ming1/ubdsrv

Any comments are welcome!

Since V3:
- address Gabriel Krisman Bertazi's comments on V3: add userspace data
  validation before handling command, remove warning, ...
- remove UBLK_IO_COMMIT_REQ command as suggested by Zixiang and Gabriel Krisman Bertazi
- fix one request double free when running abort
- rewrite/cleanup ublk_copy_pages(), then this handling becomes very
  clean
- add one command of UBLK_IO_REFETCH_REQ for allowing ublk_drv to build
  as module

Since V2:
- fix one big performance problem:
	https://github.com/ming1/linux/commit/3c9fd476951759858cc548dee4cedc074194d0b0
- rename as ublk, as suggested by Gabriel Krisman Bertazi 
- lots of cleanup & code improvement & bugfix, see details in git
  hisotry


Since V1:

Remove RFC now because ublk driver codes gets lots of cleanup, enhancement and
bug fixes since V1:

- cleanup uapi: remove ublk specific error code,  switch to linux error code,
remove one command op, remove one field from cmd_desc

- add monitor mechanism to handle ubq_daemon being killed, ublksrv[1]
  includes builtin tests for covering heavy IO with deleting ublk / killing
  ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
  and the abort/stop mechanism is simple

- fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
  MQ ublk-loop devices(test/scratch)

- improve batching submission as suggested by Jens

- improve handling for starting device, replace random wait/poll with
completion

- all kinds of cleanup, bug fix,..

Ming Lei (2):
  ublk: add io_uring based userspace block driver
  ublk_drv: add UBLK_IO_REFETCH_REQ for supporting to build as module

 drivers/block/Kconfig         |    6 +
 drivers/block/Makefile        |    2 +
 drivers/block/ublk_drv.c      | 1701 +++++++++++++++++++++++++++++++++
 include/uapi/linux/ublk_cmd.h |  173 ++++
 4 files changed, 1882 insertions(+)
 create mode 100644 drivers/block/ublk_drv.c
 create mode 100644 include/uapi/linux/ublk_cmd.h

Comments

Xiaoguang Wang July 11, 2022, 11:58 a.m. UTC | #1
Hello Ming,

First thanks for this great work, I think ublk will be a powerful
replacement for tcmu. I read your v3 ublk kernel and user-space
codes, and have some ublk design questions here:

1) As I said before, currently ublk still needs user-space backstore
to allocate io buffer in advance, for example, UBLK_IO_FETCH_REQ
command needs to set ublk_io->addr when starting device. Currently,
some of our internal business may create hundreds or thousands of
tcmu devices in one host, when switching to ublk, it'll need user-space
backstore to allocate lots of memory in advance, which will waste memory
in a host with swap off. Also user-space backstore may use advanced
network components, they may maintain internal memory pool, which
can be used as io buffer.

So I'd like to suggest that at least we add a new flag and keep GET_DATA
command. When used, FETCH command does not need to pass io
buffer addr, and backstore needs to send a GET_DATA command with
user addr for write request.

Support high flexibility, let's user decides what's best for them.

2) complicated ublk user-space
First forgive me  I think current ublk user-space codes looks somewhat
complicated:
  1. UBLK_CMD_START_DEV and io handler worker need to be
in different task context, because UBLK_CMD_START_DEV needs
to wait the number of queue depth of sqes to be submitted in advance.

  2. mixed getting ublk command and target io handle in one io_uring instance
I'm not sure it's a good design, see ublksrv_handle_cqe(), which contains
many flag handle work and is_target_io() check, I think the data flow is not
that clear for me at least

  3. helper like tcmulib_get_next_command()
I wonder whether current ublk user-space can offer similar helper which
will return a set of io commands to backstore easily.

I'd like to suggest:
1. When starting ublk dev, pass io_uring fd for every queue, then in
blk-mq->queue_rq(), it'll generate one cqe for every coming request,
not need to issue fetch sqes command in advance, kernel codes would
simplify a bit,  UBLK_IO_FLAG_ACTIVE may be discarded. And helper
like returning a set of io command would be added easily. Note these
io_uring fd would be just used for notifying io command generated.

2. We use another io_uring fd per queue to handle GET_DATA or
COMMIT_REQ command. Indeed, if we can support synchronous ioctl
interface to do GET_DATA and COMMIT_REQ, we may make libublk
really simple.


Here I'd like to describe how we use tcmu. A main thread call
tcmulib_get_next_command() to get a set of io commands, then
it dispatches them to user-space io wokers. Take write requests as
example, io worker use ioctl(2) to get data from bios, and send
data to distributed fs, finally call ioctl(2) to commit req. Multiple
io workers can run concurrently. Since GET_DATA(write request)
or COMMIT_REQ(read request) mainly do memcpy work, one
io_uring instance will just do these jobs sequentially, which may
not take advantage of multi-cpu.

So finally, I would suggest ublk or libulk just offer basic interface:
add/start/delete dev interface, get io commands sescriptor, get io
data, commit io helper. Let's user-space target make decisions,
for example, whether to use eventfd.

Thanks for your patience


Regards,
Xiaoguang Wang

> Hello Guys,
>
> ublk driver is one kernel driver for implementing generic userspace block
> device/driver, which delivers io request from ublk block device(/dev/ublkbN) into
> ublk server[1] which is the userspace part of ublk for communicating
> with ublk driver and handling specific io logic by its target module.
>
> Another thing ublk driver handles is to copy data between user space buffer
> and request/bio's pages, or take zero copy if mm is ready for support it in
> future. ublk driver doesn't handle any IO logic of the specific driver, so
> it is small/simple, and all io logics are done by the target code in ublkserver.
>
> The above two are main jobs done by ublk driver.
>
> ublk driver can help to move IO logic into userspace, in which the
> development work is easier/more effective than doing in kernel, such as,
> ublk-loop takes < 200 lines of loop specific code to get basically same 
> function with kernel loop block driver, meantime the performance is
> is even better than kernel loop with same setting. ublksrv[1] provide built-in
> test for comparing both by running "make test T=loop", for example, see
> the test result running on VM which is over my lattop(root disk is
> nvme/device mapper/xfs):
>
> 	[root@ktest-36 ubdsrv]#make -s -C /root/git/ubdsrv/tests run T=loop/001 R=10
> 	running loop/001
> 		fio (ublk/loop(/root/git/ubdsrv/tests/tmp/ublk_loop_VqbMA), libaio, bs 4k, dio, hw queues:1)...
> 		randwrite: jobs 1, iops 32572
> 		randread: jobs 1, iops 143052
> 		rw: jobs 1, iops read 29919 write 29964
> 	
> 	[root@ktest-36 ubdsrv]# make test T=loop/003
> 	make -s -C /root/git/ubdsrv/tests run T=loop/003 R=10
> 	running loop/003
> 		fio (kernel_loop/kloop(/root/git/ubdsrv/tests/tmp/ublk_loop_ZIVnG), libaio, bs 4k, dio, hw queues:1)...
> 		randwrite: jobs 1, iops 27436
> 		randread: jobs 1, iops 95273
> 		rw: jobs 1, iops read 22542 write 22543 
>
>
> Another example is high performance qcow2 support[2], which could be built with
> ublk framework more easily than doing it inside kernel.
>
> Also there are more people who express interests on userspace block driver[3],
> Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
> requirement from Google. Ziyang Zhang from Alibaba said they "plan to
> replace TCMU by UBD as a new choice" because UBD can get better throughput than
> TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
> storage service for providing storage to containers.
>
> It is io_uring based: io request is delivered to userspace via new added
> io_uring command which has been proved as very efficient for making nvme
> passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
> shared/mmap buffer is used for sharing io descriptor to userspace, the
> buffer is readonly for userspace, each IO just takes 24bytes so far.
> It is suggested to use io_uring in userspace(target part of ublk server)
> to handle IO request too. And it is still easy for ublkserver to support
> io handling by non-io_uring, and this work isn't done yet, but can be
> supported easily with help o eventfd.
>
> This way is efficient since no extra io command copy is required, no sleep
> is needed in transferring io command to userspace. Meantime the communication
> protocol is simple and efficient, one single command of
> UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
> command result in one trip. IO handling is often batched after single
> io_uring_enter() returns, both IO requests from ublk server target and
> IO commands could be handled as a whole batch.
>
> And the patch by patch change can be found in the following
> tree:
>
> https://github.com/ming1/linux/tree/my_for-5.20-ubd-devel_v4
>
> ublk server repo(master branch):
>
> 	https://github.com/ming1/ubdsrv
>
> Any comments are welcome!
>
> Since V3:
> - address Gabriel Krisman Bertazi's comments on V3: add userspace data
>   validation before handling command, remove warning, ...
> - remove UBLK_IO_COMMIT_REQ command as suggested by Zixiang and Gabriel Krisman Bertazi
> - fix one request double free when running abort
> - rewrite/cleanup ublk_copy_pages(), then this handling becomes very
>   clean
> - add one command of UBLK_IO_REFETCH_REQ for allowing ublk_drv to build
>   as module
>
> Since V2:
> - fix one big performance problem:
> 	https://github.com/ming1/linux/commit/3c9fd476951759858cc548dee4cedc074194d0b0
> - rename as ublk, as suggested by Gabriel Krisman Bertazi 
> - lots of cleanup & code improvement & bugfix, see details in git
>   hisotry
>
>
> Since V1:
>
> Remove RFC now because ublk driver codes gets lots of cleanup, enhancement and
> bug fixes since V1:
>
> - cleanup uapi: remove ublk specific error code,  switch to linux error code,
> remove one command op, remove one field from cmd_desc
>
> - add monitor mechanism to handle ubq_daemon being killed, ublksrv[1]
>   includes builtin tests for covering heavy IO with deleting ublk / killing
>   ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
>   and the abort/stop mechanism is simple
>
> - fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
>   MQ ublk-loop devices(test/scratch)
>
> - improve batching submission as suggested by Jens
>
> - improve handling for starting device, replace random wait/poll with
> completion
>
> - all kinds of cleanup, bug fix,..
>
> Ming Lei (2):
>   ublk: add io_uring based userspace block driver
>   ublk_drv: add UBLK_IO_REFETCH_REQ for supporting to build as module
>
>  drivers/block/Kconfig         |    6 +
>  drivers/block/Makefile        |    2 +
>  drivers/block/ublk_drv.c      | 1701 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/ublk_cmd.h |  173 ++++
>  4 files changed, 1882 insertions(+)
>  create mode 100644 drivers/block/ublk_drv.c
>  create mode 100644 include/uapi/linux/ublk_cmd.h
>
Ming Lei July 11, 2022, 2:03 p.m. UTC | #2
Hi Xiaoguang,

On Mon, Jul 11, 2022 at 07:32:19PM +0800, Xiaoguang Wang wrote:
> Hello Ming,
> 
> First thanks for this great work, I think ublk will be a powerful
> replacement for tcmu. I read your v3 ublk kernel and user-space
> codes, and have some ublk design questions here:
> 
> 1) As I said before, currently ublk still needs user-space backstore
> to allocate io buffer in advance, for example, UBLK_IO_FETCH_REQ
> command needs to set ublk_io->addr when starting device. Currently,
> some of our internal business may create hundreds or thousands of
> tcmu devices in one host, when switching to ublk, it'll need user-space
> backstore to allocate lots of memory in advance, which will waste memory
> in a host with swap off. Also user-space backstore may use advanced
> network components, they may maintain internal memory pool, which
> can be used as io buffer.
> 
> So I'd like to suggest that at least we add a new flag and keep GET_DATA
> command. When used, FETCH command does not need to pass io
> buffer addr, and backstore needs to send a GET_DATA command with
> user addr for write request.
> 
> Support high flexibility, let's user decides what's best for them.

Please take a look at v4 patches or cover letter at least before asking
this question.

V4 adds one new command of REFETCH for supporting to build ublk driver
as module, you can allocate buffer when receiving REFETCH command
in userspace target code by adding one pair of callbacks.

Also the latest ublkserver adds callback for target code to pre-allocate
buffer, then if you have pre-allocated io buffer, the buffer can be passed
to driver via FETCH command during setting up queue.

Actually I have implemented pinning page during the whole io lifetime,
then the pre-allocated io buffers can be reclaimed without needing
swapout by kernel when io is completed:

https://github.com/ming1/linux/commits/ubd-master

So the preallocation is just allocation on virtual memory space, and
the pages are pinned actually when io is handled. After io handling is
done, kernel can reclaim pages at will without needing swapout on
these io pages.

> 
> 2) complicated ublk user-space
> First forgive me :) I think current ublk user-space codes looks somewhat
> complicated:

Please just look at libublksrv code, and example of demo_null.c &
demo_event.c.

Zixiang mentioned that it basically did almost everything what you
requested before:

https://lore.kernel.org/linux-block/532eb193-06cf-96a3-684f-cc7e16db5ddf@linux.alibaba.com/

Maybe you two need to talk a bit.

>   1. UBLK_CMD_START_DEV and io handler worker need to be
> in different task context, because UBLK_CMD_START_DEV needs
> to wait the number of queue depth of sqes to be submitted in advance.

Of course we have to wait until all IO commands are issued to driver,
since block IO can come to /dev/ublkbN after UBLK_CMD_START_DEV returns,
and /dev/ublkbN is exposed to userspace in running UBLK_CMD_START_DEV.

What is the matter of this kind of handling?

Also with libublksrv, you can do everything just in single task context,
see:

https://github.com/ming1/ubdsrv/blob/master/demo_null.c

> 
>   2. mixed getting ublk command and target io handle in one io_uring instance
> I'm not sure it's a good design, see ublksrv_handle_cqe(), which contains

io_uring is supposed to be bound with context, and serves all IOs
issued from this context. That is exactly typical AIO use pattern,
please look at example of t/io_uring.c in fio project, which can accept
lots of files in command line, then handle IOs to all these files in one
single io_uring context. Here /dev/ublkcN is just one file, we handle
IOs to other files and /dev/ublkcN in single io_uring/context, then
all of them can be handled at batching, then each single syscall can
handle more IOs, that is one important reason why io_uring performs so well.

> many flag handle work and is_target_io() check, I think the data flow is not
> that clear for me at least :)

/* 
 * this flag is set if we need to tell ublk driver to fetch io req only,
 * just needed in setup stage.  
 */
#define UBLKSRV_NEED_FETCH_RQ		(1UL << 0)

/* when one io is handled, we set this flag for committing io result */
#define UBLKSRV_NEED_COMMIT_RQ_COMP	(1UL << 1)

/*
 * this flag is set in case the command slot is free to issue new command;
 * cleared when io command is issued to driver.
 */
#define UBLKSRV_IO_FREE			(1UL << 2)

/*
 * added in v4, set in case UBLK_IO_RES_REFETCH is returned from driver,
 * so REFETCH command is issued to driver
 */
#define UBLKSRV_NEED_REFETCH_RQ		(1UL << 3)

Note, the flags are just for handling io commands.

> 
>   3. helper like tcmulib_get_next_command()
> I wonder whether current ublk user-space can offer similar helper which
> will return a set of io commands to backstore easily.

No, io command is supposed to use by libublksrv internal use, and target
code should _not_ deal with any io command.

The target code should just focus on implementing ->handle_io_async() in
which one new io command is received from driver, same with
->target_io_done() which is called when one target io is completed by
io_uring.

If target code doesn't use io_uring to handle io, please refer to
example of demo_event.c, in which ->handle_event() is required for
supporting to handle io in another contexts by either io_uring or libaio
or whatever. ->handle_event() is called when io_uring(for issuing io
command) is waken up by eventfd, which is triggered by target code
itself(two eventfd APIs).

> 
> I'd like to suggest:
> 1. When starting ublk dev, pass io_uring fd for every queue, then in
> blk-mq->queue_rq(), it'll generate one cqe for every coming request,
> not need to issue fetch sqes command in advance, kernel codes would

Why do you think it isn't good to issue fetch sqes in advance? It costs
nothing, meantime userspace can get io request pretty fast.

Actually you are suggesting one fundamental change to io_uring given
the current io_uring use model is that userspace issues io via sqe, and
kernel(io_uring) completes io via cqe, and sqe and cqe are in two rings
actually.

That current io_uring doesn't support to complete cqe to userspace without
issuing any sqe, also not see any benefit we can get in this way. If you
have, please explain it in details.


> simplify a bit,  UBLK_IO_FLAG_ACTIVE may be discarded. And helper
> like returning a set of io command would be added easily. Note these
> io_uring fd would be just used for notifying io command generated.
> 
> 2. We use another io_uring fd per queue to handle GET_DATA or
> COMMIT_REQ command. Indeed, if we can support synchronous ioctl
> interface to do GET_DATA and COMMIT_REQ, we may make libublk
> really simple.

IMO that won't be good idea. One important reason why io_uring is so
efficient is that batching issue/completion in single syscall. And using
ioctl to handle io can be too slow.

> 
> 
> Here I'd like to describe how we use tcmu. A main thread call
> tcmulib_get_next_command() to get a set of io commands, then
> it dispatches them to user-space io wokers. Take write requests as
> example, io worker use ioctl(2) to get data from bios, and send
> data to distributed fs, finally call ioctl(2) to commit req. Multiple

Hammm, not mentioning pthread communication, it takes at least 3 syscalls
for handling one io, how can you expect this way to work efficiently?

With ublk, usually we handle dozens or even more than hundred of IOs in
single io_uring_enter() syscall.

> io workers can run concurrently. Since GET_DATA(write request)
> or COMMIT_REQ(read request) mainly do memcpy work, one
> io_uring instance will just do these jobs sequentially, which may
> not take advantage of multi-cpu.

IMO you can implement target code to handle io in other pthreads against
current libublksrv design, see demo_event.c. Or if you think it is still
enough, please just share with us what the problem is. Without
understanding the problem, I won't try to figure out any solution or
change.

Again, the goal of ublk aims at implementing high performance & generic
userspace user space block driver.



Thanks,
Ming
Xiaoguang Wang July 12, 2022, 8:44 a.m. UTC | #3
Hello Ming,

> Hi Xiaoguang,
>
> On Mon, Jul 11, 2022 at 07:32:19PM +0800, Xiaoguang Wang wrote:
> Please take a look at v4 patches or cover letter at least before asking
> this question.
Yeah, I should be, really sorry.
>
> V4 adds one new command of REFETCH for supporting to build ublk driver
> as module, you can allocate buffer when receiving REFETCH command
> in userspace target code by adding one pair of callbacks.
>
> Also the latest ublkserver adds callback for target code to pre-allocate
> buffer, then if you have pre-allocated io buffer, the buffer can be passed
> to driver via FETCH command during setting up queue.
Now my concern about io buffer management has gone, thanks.

>
> Actually I have implemented pinning page during the whole io lifetime,
> then the pre-allocated io buffers can be reclaimed without needing
> swapout by kernel when io is completed:
>
> https://github.com/ming1/linux/commits/ubd-master
>
> So the preallocation is just allocation on virtual memory space, and
> the pages are pinned actually when io is handled. After io handling is
> done, kernel can reclaim pages at will without needing swapout on
> these io pages.
OK, I'll learn codes later.

>
>> 2) complicated ublk user-space
>> First forgive me :) I think current ublk user-space codes looks somewhat
>> complicated:
> Please just look at libublksrv code, and example of demo_null.c &
> demo_event.c.
OK.

>
>
> Of course we have to wait until all IO commands are issued to driver,
> since block IO can come to /dev/ublkbN after UBLK_CMD_START_DEV returns,
> and /dev/ublkbN is exposed to userspace in running UBLK_CMD_START_DEV.
>
> What is the matter of this kind of handling?
>
> Also with libublksrv, you can do everything just in single task context,
> see:
>
> https://github.com/ming1/ubdsrv/blob/master/demo_null.c
No, indeed I don't mean that there are something wrong with your
implementation. I just try to see whether I can simplify it a bit.

If we adopt to pass one io_uring fd per queue when starting device,
blk-mq's queue_rq() will get corresponding io_uring file for this queue and
use it to generate cqes directly to notify new io commands inserted,
then UBLK_CMD_START_DEV doesn't need to wait, and can be in the
same thread with ublksrv_queue_init or ublksrv_process_io.
Seems that for demo_null.c, they are still in different threads.

For current io_uring implementation, one sqe may generate one or more
cqes, but indeed, we can generate cqes without submitting sqes, just
fill one event to io_uring ctx.

Just suggestions :)
>
>>   2. mixed getting ublk command and target io handle in one io_uring instance
>> I'm not sure it's a good design, see ublksrv_handle_cqe(), which contains
> io_uring is supposed to be bound with context, and serves all IOs
> issued from this context. That is exactly typical AIO use pattern,
> please look at example of t/io_uring.c in fio project, which can accept
> lots of files in command line, then handle IOs to all these files in one
> single io_uring context. Here /dev/ublkcN is just one file, we handle
> IOs to other files and /dev/ublkcN in single io_uring/context, then
> all of them can be handled at batching, then each single syscall can
> handle more IOs, that is one important reason why io_uring performs so well.
Yeah, I understand that you're doing your best to improve ublk performance,
and I'm a early developer of io_uring and know how it works :)

It maybe just because of my poor design poor taste, I think put
io command descriptors acquire and io command handling together
seem not decouple well.
>
>> many flag handle work and is_target_io() check, I think the data flow is not
>> that clear for me at least :)
> /* 
>  * this flag is set if we need to tell ublk driver to fetch io req only,
>  * just needed in setup stage.  
>  */
> #define UBLKSRV_NEED_FETCH_RQ		(1UL << 0)
>
> /* when one io is handled, we set this flag for committing io result */
> #define UBLKSRV_NEED_COMMIT_RQ_COMP	(1UL << 1)
>
> /*
>  * this flag is set in case the command slot is free to issue new command;
>  * cleared when io command is issued to driver.
>  */
> #define UBLKSRV_IO_FREE			(1UL << 2)
>
> /*
>  * added in v4, set in case UBLK_IO_RES_REFETCH is returned from driver,
>  * so REFETCH command is issued to driver
>  */
> #define UBLKSRV_NEED_REFETCH_RQ		(1UL << 3)
>
> Note, the flags are just for handling io commands.
>
>>   3. helper like tcmulib_get_next_command()
>> I wonder whether current ublk user-space can offer similar helper which
>> will return a set of io commands to backstore easily.
> No, io command is supposed to use by libublksrv internal use, and target
> code should _not_ deal with any io command.
Seems different from design ideas of tcmu.

>
> The target code should just focus on implementing ->handle_io_async() in
> which one new io command is received from driver, same with
> ->target_io_done() which is called when one target io is completed by
> io_uring.
>
> If target code doesn't use io_uring to handle io, please refer to
> example of demo_event.c, in which ->handle_event() is required for
> supporting to handle io in another contexts by either io_uring or libaio
> or whatever. ->handle_event() is called when io_uring(for issuing io
> command) is waken up by eventfd, which is triggered by target code
> itself(two eventfd APIs).
OK.

>
>> I'd like to suggest:
>> 1. When starting ublk dev, pass io_uring fd for every queue, then in
>> blk-mq->queue_rq(), it'll generate one cqe for every coming request,
>> not need to issue fetch sqes command in advance, kernel codes would
> Why do you think it isn't good to issue fetch sqes in advance? It costs
> nothing, meantime userspace can get io request pretty fast.
>
> Actually you are suggesting one fundamental change to io_uring given
> the current io_uring use model is that userspace issues io via sqe, and
> kernel(io_uring) completes io via cqe, and sqe and cqe are in two rings
> actually.
>
> That current io_uring doesn't support to complete cqe to userspace without
> issuing any sqe, also not see any benefit we can get in this way. If you
> have, please explain it in details.
Hard to say it's one fundamental change, io_uring can easily add such
a helper which generates cqes but needs not to submit sqes, which contains
  allocate one cqe, with user_data, res
  io_commit_cqring(ctx);

As I said before, there maybe such benefits:
1. may decouple io command descriptor acquire and io command handling well.
At least helper like tcmulib_get_next_command maybe added easily. I'm not sure, some
applications based on tcmu previously may need this helper.

2. UBLK_CMD_START_DEV won't need to wait another thread context to submit
number of queue depth of sqes firstly, but I admit that it's not a big issue.

>
>
>> simplify a bit,  UBLK_IO_FLAG_ACTIVE may be discarded. And helper
>> like returning a set of io command would be added easily. Note these
>> io_uring fd would be just used for notifying io command generated.
>>
>> 2. We use another io_uring fd per queue to handle GET_DATA or
>> COMMIT_REQ command. Indeed, if we can support synchronous ioctl
>> interface to do GET_DATA and COMMIT_REQ, we may make libublk
>> really simple.
> IMO that won't be good idea. One important reason why io_uring is so
> efficient is that batching issue/completion in single syscall. And using
> ioctl to handle io can be too slow.
>
>>
>> Here I'd like to describe how we use tcmu. A main thread call
>> tcmulib_get_next_command() to get a set of io commands, then
>> it dispatches them to user-space io wokers. Take write requests as
>> example, io worker use ioctl(2) to get data from bios, and send
>> data to distributed fs, finally call ioctl(2) to commit req. Multiple
> Hammm, not mentioning pthread communication, it takes at least 3 syscalls
> for handling one io, how can you expect this way to work efficiently?
I admit batch will be good, and syscalls userspace and kernel context switch
introduce overhead. But for big io requests, batch in one context is not good. In
the example of read requests, if io request is big, say 1MB, io_uring will do
commit req sequentially, which indeed mainly do memcpy work. But if users
can choose to issue multiple ioctls which do commit req concurrently, I think
user may get higher io throughput.

And in this case, user may not care userspace and kernel context switch overhead at all.

Or to put it another way, should libublk offer synchronous programming interface ?

>
> With ublk, usually we handle dozens or even more than hundred of IOs in
> single io_uring_enter() syscall.
>
>> io workers can run concurrently. Since GET_DATA(write request)
>> or COMMIT_REQ(read request) mainly do memcpy work, one
>> io_uring instance will just do these jobs sequentially, which may
>> not take advantage of multi-cpu.
> IMO you can implement target code to handle io in other pthreads against
> current libublksrv design, see demo_event.c. Or if you think it is still
> enough, please just share with us what the problem is. Without
> understanding the problem, I won't try to figure out any solution or
> change.
I need to read your ublk userspace codes carefully, if I made
some noises, sorry.
>
> Again, the goal of ublk aims at implementing high performance & generic
> userspace user space block driver.
Yeah, sure, thanks for this work again.

Regards,
Xiaoguang Wang
>
>
>
> Thanks,
> Ming
Ming Lei July 12, 2022, 11:30 a.m. UTC | #4
On Tue, Jul 12, 2022 at 04:44:03PM +0800, Xiaoguang Wang wrote:
> Hello Ming,
> 
> > Hi Xiaoguang,
> >
> > On Mon, Jul 11, 2022 at 07:32:19PM +0800, Xiaoguang Wang wrote:
> > Please take a look at v4 patches or cover letter at least before asking
> > this question.
> Yeah, I should be, really sorry.
> >
> > V4 adds one new command of REFETCH for supporting to build ublk driver
> > as module, you can allocate buffer when receiving REFETCH command
> > in userspace target code by adding one pair of callbacks.
> >
> > Also the latest ublkserver adds callback for target code to pre-allocate
> > buffer, then if you have pre-allocated io buffer, the buffer can be passed
> > to driver via FETCH command during setting up queue.
> Now my concern about io buffer management has gone, thanks.
> 
> >
> > Actually I have implemented pinning page during the whole io lifetime,
> > then the pre-allocated io buffers can be reclaimed without needing
> > swapout by kernel when io is completed:
> >
> > https://github.com/ming1/linux/commits/ubd-master
> >
> > So the preallocation is just allocation on virtual memory space, and
> > the pages are pinned actually when io is handled. After io handling is
> > done, kernel can reclaim pages at will without needing swapout on
> > these io pages.
> OK, I'll learn codes later.
> 
> >
> >> 2) complicated ublk user-space
> >> First forgive me :) I think current ublk user-space codes looks somewhat
> >> complicated:
> > Please just look at libublksrv code, and example of demo_null.c &
> > demo_event.c.
> OK.
> 
> >
> >
> > Of course we have to wait until all IO commands are issued to driver,
> > since block IO can come to /dev/ublkbN after UBLK_CMD_START_DEV returns,
> > and /dev/ublkbN is exposed to userspace in running UBLK_CMD_START_DEV.
> >
> > What is the matter of this kind of handling?
> >
> > Also with libublksrv, you can do everything just in single task context,
> > see:
> >
> > https://github.com/ming1/ubdsrv/blob/master/demo_null.c
> No, indeed I don't mean that there are something wrong with your
> implementation. I just try to see whether I can simplify it a bit.
> 
> If we adopt to pass one io_uring fd per queue when starting device,
> blk-mq's queue_rq() will get corresponding io_uring file for this queue and
> use it to generate cqes directly to notify new io commands inserted,
> then UBLK_CMD_START_DEV doesn't need to wait, and can be in the
> same thread with ublksrv_queue_init or ublksrv_process_io.
> Seems that for demo_null.c, they are still in different threads.
> 
> For current io_uring implementation, one sqe may generate one or more
> cqes, but indeed, we can generate cqes without submitting sqes, just
> fill one event to io_uring ctx.
> Just suggestions :)

I don't have any interest in so unusual usage of io_uring, especially it
needs fundamental change in io_uring.

Compared with kernel side change, removing waiting until queue setup in
userspace side simplifies nothing.

> >
> >>   2. mixed getting ublk command and target io handle in one io_uring instance
> >> I'm not sure it's a good design, see ublksrv_handle_cqe(), which contains
> > io_uring is supposed to be bound with context, and serves all IOs
> > issued from this context. That is exactly typical AIO use pattern,
> > please look at example of t/io_uring.c in fio project, which can accept
> > lots of files in command line, then handle IOs to all these files in one
> > single io_uring context. Here /dev/ublkcN is just one file, we handle
> > IOs to other files and /dev/ublkcN in single io_uring/context, then
> > all of them can be handled at batching, then each single syscall can
> > handle more IOs, that is one important reason why io_uring performs so well.
> Yeah, I understand that you're doing your best to improve ublk performance,
> and I'm a early developer of io_uring and know how it works :)
> 
> It maybe just because of my poor design poor taste, I think put
> io command descriptors acquire and io command handling together
> seem not decouple well.
> >
> >> many flag handle work and is_target_io() check, I think the data flow is not
> >> that clear for me at least :)
> > /* 
> >  * this flag is set if we need to tell ublk driver to fetch io req only,
> >  * just needed in setup stage.  
> >  */
> > #define UBLKSRV_NEED_FETCH_RQ		(1UL << 0)
> >
> > /* when one io is handled, we set this flag for committing io result */
> > #define UBLKSRV_NEED_COMMIT_RQ_COMP	(1UL << 1)
> >
> > /*
> >  * this flag is set in case the command slot is free to issue new command;
> >  * cleared when io command is issued to driver.
> >  */
> > #define UBLKSRV_IO_FREE			(1UL << 2)
> >
> > /*
> >  * added in v4, set in case UBLK_IO_RES_REFETCH is returned from driver,
> >  * so REFETCH command is issued to driver
> >  */
> > #define UBLKSRV_NEED_REFETCH_RQ		(1UL << 3)
> >
> > Note, the flags are just for handling io commands.
> >
> >>   3. helper like tcmulib_get_next_command()
> >> I wonder whether current ublk user-space can offer similar helper which
> >> will return a set of io commands to backstore easily.
> > No, io command is supposed to use by libublksrv internal use, and target
> > code should _not_ deal with any io command.
> Seems different from design ideas of tcmu.
> 
> >
> > The target code should just focus on implementing ->handle_io_async() in
> > which one new io command is received from driver, same with
> > ->target_io_done() which is called when one target io is completed by
> > io_uring.
> >
> > If target code doesn't use io_uring to handle io, please refer to
> > example of demo_event.c, in which ->handle_event() is required for
> > supporting to handle io in another contexts by either io_uring or libaio
> > or whatever. ->handle_event() is called when io_uring(for issuing io
> > command) is waken up by eventfd, which is triggered by target code
> > itself(two eventfd APIs).
> OK.
> 
> >
> >> I'd like to suggest:
> >> 1. When starting ublk dev, pass io_uring fd for every queue, then in
> >> blk-mq->queue_rq(), it'll generate one cqe for every coming request,
> >> not need to issue fetch sqes command in advance, kernel codes would
> > Why do you think it isn't good to issue fetch sqes in advance? It costs
> > nothing, meantime userspace can get io request pretty fast.
> >
> > Actually you are suggesting one fundamental change to io_uring given
> > the current io_uring use model is that userspace issues io via sqe, and
> > kernel(io_uring) completes io via cqe, and sqe and cqe are in two rings
> > actually.
> >
> > That current io_uring doesn't support to complete cqe to userspace without
> > issuing any sqe, also not see any benefit we can get in this way. If you
> > have, please explain it in details.
> Hard to say it's one fundamental change, io_uring can easily add such
> a helper which generates cqes but needs not to submit sqes, which contains
>   allocate one cqe, with user_data, res
>   io_commit_cqring(ctx);

You still need io_kiocb/io_uring_cmd and both are allocated in
submission code path, since 'io_ring_ctx' is private for
io_uring.

And ublk server still needs one command to send io result to ublk driver,
that said this way saves nothing for us cause ublk driver uses single
command for both fetching io request and committing io result.

Easy to say than done, you may try to write a patch to verify your
ideas, especially no one uses io_ring in this way.

> 
> As I said before, there maybe such benefits:
> 1. may decouple io command descriptor acquire and io command handling well.
> At least helper like tcmulib_get_next_command maybe added easily. I'm not sure, some
> applications based on tcmu previously may need this helper.

Why do we need similar helper of tcmulib_get_next_command()? for what
purpose? In current design of ublk server, it should be enough for target
code to focus on ->handle_io_async(), ->target_io_done() in case of handling
target io via ubq io_uring, ->handle_event() in case of handling target io in
other contexts.

ublk server is one io_uring application, and interfaces provided
are based on io_uring design/interfaces. I guess TCMU isn't based on
io_uring, so it may have original style interface. You may ask
TCMU guys if it is possible to reuse current interface for supporting
io_uring with expected performance.

> 
> 2. UBLK_CMD_START_DEV won't need to wait another thread context to submit
> number of queue depth of sqes firstly, but I admit that it's not a big issue.

Yeah, it simplifies nothing, compared with fundamental io_uring change
and ublk driver side change.

> 
> >
> >
> >> simplify a bit,  UBLK_IO_FLAG_ACTIVE may be discarded. And helper
> >> like returning a set of io command would be added easily. Note these
> >> io_uring fd would be just used for notifying io command generated.
> >>
> >> 2. We use another io_uring fd per queue to handle GET_DATA or
> >> COMMIT_REQ command. Indeed, if we can support synchronous ioctl
> >> interface to do GET_DATA and COMMIT_REQ, we may make libublk
> >> really simple.
> > IMO that won't be good idea. One important reason why io_uring is so
> > efficient is that batching issue/completion in single syscall. And using
> > ioctl to handle io can be too slow.
> >
> >>
> >> Here I'd like to describe how we use tcmu. A main thread call
> >> tcmulib_get_next_command() to get a set of io commands, then
> >> it dispatches them to user-space io wokers. Take write requests as
> >> example, io worker use ioctl(2) to get data from bios, and send
> >> data to distributed fs, finally call ioctl(2) to commit req. Multiple
> > Hammm, not mentioning pthread communication, it takes at least 3 syscalls
> > for handling one io, how can you expect this way to work efficiently?
> I admit batch will be good, and syscalls userspace and kernel context switch
> introduce overhead. But for big io requests, batch in one context is not good. In
> the example of read requests, if io request is big, say 1MB, io_uring will do
> commit req sequentially, which indeed mainly do memcpy work. But if users
> can choose to issue multiple ioctls which do commit req concurrently, I think
> user may get higher io throughput.

If BS is big, single job could saturate devices bw easily, multiple jobs won't
make a difference, will it? Not mention sync cost among multiple jobs.

Not mention current ublk server does support multiple io jobs.

> 
> And in this case, user may not care userspace and kernel context switch overhead at all.
> 
> Or to put it another way, should libublk offer synchronous programming interface ?

It depends what the sync interface is.

You can call sync read()/write() for target io directly in ->handdle_io_async(),
or call them in other pthread(s) just by telling ubq after these sync ios are done
with the eventfd API.

demo_null.c is actually one such example.

> 
> >
> > With ublk, usually we handle dozens or even more than hundred of IOs in
> > single io_uring_enter() syscall.
> >
> >> io workers can run concurrently. Since GET_DATA(write request)
> >> or COMMIT_REQ(read request) mainly do memcpy work, one
> >> io_uring instance will just do these jobs sequentially, which may
> >> not take advantage of multi-cpu.
> > IMO you can implement target code to handle io in other pthreads against
> > current libublksrv design, see demo_event.c. Or if you think it is still
> > enough, please just share with us what the problem is. Without
> > understanding the problem, I won't try to figure out any solution or
> > change.
> I need to read your ublk userspace codes carefully, if I made

One small suggestion, you may start with the two builtin examples, both
are simple & clean, and the big one is only 300+ LOC, really complicated?


Thanks.
Ming
Xiaoguang Wang July 12, 2022, 3:16 p.m. UTC | #5
hi,

>>>
>>> If we adopt to pass one io_uring fd per queue when starting device,
>>> blk-mq's queue_rq() will get corresponding io_uring file for this queue and
>>> use it to generate cqes directly to notify new io commands inserted,
>>> then UBLK_CMD_START_DEV doesn't need to wait, and can be in the
>>> same thread with ublksrv_queue_init or ublksrv_process_io.
>>> Seems that for demo_null.c, they are still in different threads.
>>>
>>> For current io_uring implementation, one sqe may generate one or more
>>> cqes, but indeed, we can generate cqes without submitting sqes, just
>>> fill one event to io_uring ctx.
>>> Just suggestions :)
> I don't have any interest in so unusual usage of io_uring, especially it
> needs fundamental change in io_uring.
Got it, I see your point now and will respect that.

>
> Compared with kernel side change, removing waiting until queue setup in
> userspace side simplifies nothing.
>
> You still need io_kiocb/io_uring_cmd and both are allocated in
> submission code path, since 'io_ring_ctx' is private for
> io_uring.
>
> And ublk server still needs one command to send io result to ublk driver,
> that said this way saves nothing for us cause ublk driver uses single
> command for both fetching io request and committing io result.
>
> Easy to say than done, you may try to write a patch to verify your
> ideas, especially no one uses io_ring in this way.
I have done a hack change in io_uring:
iff --git a/fs/io_uring.c b/fs/io_uring.c
index 5ff2cdb425bc..e6696319148e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11958,6 +11958,21 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
        return 0;
 }

+static u64 tst_count;
+
+static void io_gen_cqe_direct(struct file *file)
+{
+       struct io_ring_ctx *ctx;
+       ctx = file->private_data;
+
+       printk(KERN_ERR "tst_count %llu\n", tst_count);
+       spin_lock(&ctx->completion_lock);
+       io_fill_cqe_aux(ctx, tst_count++, 0, 0);
+       io_commit_cqring(ctx);
+       spin_unlock(&ctx->completion_lock);
+       io_cqring_ev_posted(ctx);
+}
+
 SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
                u32, min_complete, u32, flags, const void __user *, argp,
                size_t, argsz)
@@ -12005,6 +12020,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
        if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED))
                goto out;

+       io_gen_cqe_direct(f.file);
        /*
         * For SQ polling, the thread will do all submissions and completions.
         * Just return the requested submit count, and wake the thread if

And in user-space:
ret = io_uring_queue_init(QD, &ring, 0);

for (i = 0; i < 100; i++) {
    io_uring_submit_and_wait(&ring, 0);
    io_uring_wait_cqe(&ring, &cqe);
    printf("lege user_data %llu\n", cqe->user_data);
    io_uring_cqe_seen(&ring, cqe);
}

Now user-space app will get cqes continually, by it does not submit any sqes.
Indeed, io_fill_cqe_aux() has been used somewhere in kernel io_uring, for example,
sent one cqe from one io_uring instance to another instance.

I had planned to use io_gen_cqe_direct() in ublk's queue_rq to notify io requests
inserted. But now I'm fine that dropping this idea.
>
>> As I said before, there maybe such benefits:
>> 1. may decouple io command descriptor acquire and io command handling well.
>> At least helper like tcmulib_get_next_command maybe added easily. I'm not sure, some
>> applications based on tcmu previously may need this helper.
> Why do we need similar helper of tcmulib_get_next_command()? for what
> purpose? In current design of ublk server, it should be enough for target
> code to focus on ->handle_io_async(), ->target_io_done() in case of handling
> target io via ubq io_uring, ->handle_event() in case of handling target io in
> other contexts.
I'll have a deep think about it. Initially I tried to make libublk offer
similar programming interface to libtcmu, so apps can switch to ublk
easily, but indeed they maybe totally different things.

Sorry for noises again.
>
> ublk server is one io_uring application, and interfaces provided
> are based on io_uring design/interfaces. I guess TCMU isn't based on
> io_uring, so it may have original style interface. You may ask
> TCMU guys if it is possible to reuse current interface for supporting
> io_uring with expected performance.
>
>> 2. UBLK_CMD_START_DEV won't need to wait another thread context to submit
>> number of queue depth of sqes firstly, but I admit that it's not a big issue.
> Yeah, it simplifies nothing, compared with fundamental io_uring change
> and ublk driver side change.
>
>>
>> I admit batch will be good, and syscalls userspace and kernel context switch
>> introduce overhead. But for big io requests, batch in one context is not good. In
>> the example of read requests, if io request is big, say 1MB, io_uring will do
>> commit req sequentially, which indeed mainly do memcpy work. But if users
>> can choose to issue multiple ioctls which do commit req concurrently, I think
>> user may get higher io throughput.
> If BS is big, single job could saturate devices bw easily, multiple jobs won't
> make a difference, will it? Not mention sync cost among multiple jobs.n
No, I don't consider device scenes, at least for us, our target will visit distributed
file systems, which can offer very high io throughput, far greater than normal device.
>
> Not mention current ublk server does support multiple io jobs.
Yeah, I see that. But for read request, commit req commands will still be done
in ubq->ubq_daemon task context, and commit req commands mainly do memcpy work.

Regards,
Xiaoguang Wang

>
>> And in this case, user may not care userspace and kernel context switch overhead at all.
>>
>> Or to put it another way, should libublk offer synchronous programming interface ?
> It depends what the sync interface is.
>
> You can call sync read()/write() for target io directly in ->handdle_io_async(),
> or call them in other pthread(s) just by telling ubq after these sync ios are done
> with the eventfd API.
>
> demo_null.c is actually one such example.
>
>>> With ublk, usually we handle dozens or even more than hundred of IOs in
>>> single io_uring_enter() syscall.
>>>
>>>> io workers can run concurrently. Since GET_DATA(write request)
>>>> or COMMIT_REQ(read request) mainly do memcpy work, one
>>>> io_uring instance will just do these jobs sequentially, which may
>>>> not take advantage of multi-cpu.
>>> IMO you can implement target code to handle io in other pthreads against
>>> current libublksrv design, see demo_event.c. Or if you think it is still
>>> enough, please just share with us what the problem is. Without
>>> understanding the problem, I won't try to figure out any solution or
>>> change.
>> I need to read your ublk userspace codes carefully, if I made
> One small suggestion, you may start with the two builtin examples, both
> are simple & clean, and the big one is only 300+ LOC, really complicated?
>
>
> Thanks.
> Ming