Message ID | cover.1644302411.git.elena.ufimtseva@oracle.com
---|---
Series | ioregionfd introduction
On Mon, Feb 07, 2022 at 11:22:14PM -0800, Elena Ufimtseva wrote:
> This patchset is an RFC version for the ioregionfd implementation
> in QEMU. The kernel patches are to be posted with some fixes as a v4.

Hi Elena,
I will review this on Monday.

Thanks!
Stefan
On Mon, Feb 07, 2022 at 11:22:14PM -0800, Elena Ufimtseva wrote:
> This patchset is an RFC version for the ioregionfd implementation
> in QEMU. The kernel patches are to be posted with some fixes as a v4.
>
> For this implementation version 3 of the posted kernel patches was used:
> https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
>
> The future version will include support for vfio/libvfio-user.
> Please refer to the design discussion here proposed by Stefan:
> https://lore.kernel.org/all/YXpb1f3KicZxj1oj@stefanha-x1.localdomain/T/
>
> The vfio-user version needed some bug-fixing and it was decided to send
> this for multiprocess first.
>
> The ioregionfd is currently configured through the command line and each
> ioregionfd is represented by an object. This allows for easy parsing and
> does not require device/remote object command line option modifications.
>
> The following command line can be used to specify ioregionfd:
> <snip>
> '-object', 'x-remote-object,id=robj1,devid=lsi0,fd='+str(remote.fileno()),\
> '-object', 'ioregionfd-object,id=ioreg2,devid=lsi0,iofd='+str(iord.fileno())+',bar=1',\
> '-object', 'ioregionfd-object,id=ioreg3,devid=lsi0,iofd='+str(iord.fileno())+',bar=2',\

Explicit configuration of ioregionfd-object is okay for early
prototyping, but what is the plan for integrating this? I guess
x-remote-object would query the remote device to find out which
ioregionfds need to be registered and the user wouldn't need to specify
ioregionfds on the command-line?

> </snip>
>
> Proxy side of ioregionfd in this version uses only one file descriptor:
> <snip>
> '-device', 'x-pci-proxy-dev,id=lsi0,fd='+str(proxy.fileno())+',ioregfd='+str(iowr.fileno()), \
> </snip>

This raises the question of the ioregionfd file descriptor lifecycle. In
the end I think it shouldn't be specified on the command-line. Instead
the remote device should create it and pass it to QEMU over the
mpqemu/remote fd?

> This is done for the RFC version and my thought was that the next
> version will be for vfio-user, so I have not dedicated much effort to
> these command line options.
>
> The multiprocess messaging protocol was extended to support inquiries
> by the proxy about whether a device has any ioregionfds.
> This RFC implements inquiries by the proxy about the type of BAR
> (ioregionfd or not) and the type of it (memory/io).
>
> Currently there are a few limitations in this version of ioregionfd:
> - one ioregionfd per BAR, only the full BAR size is supported;
> - one file descriptor per device for all of its ioregionfds;
> - each remote device runs the fd handler for all its BARs in one IOThread;
> - the proxy supports only one fd.
>
> Some of these limitations will be dropped in a future version.
> This RFC is to acquire feedback/suggestions from the community
> on the general approach.
>
> The quick performance test was done for the remote lsi device with
> ioregionfd and without, for both mem BARs (1 and 2), with the help
> of the fio tool:
>
> Random R/W:
>
>                 read IOPS   read BW     write IOPS   write BW
> no ioregionfd   889         3559KiB/s   890          3561KiB/s
> ioregionfd      938         3756KiB/s   939          3757KiB/s

This is extremely slow, even for random I/O. How does this compare to
QEMU running the LSI device without multi-process mode?

> Sequential Read and Sequential Write:
>
>                 Sequential read          Sequential write
>                 read IOPS   read BW      write IOPS   write BW
> no ioregionfd   367k        1434MiB/s    76k          297MiB/s
> ioregionfd      374k        1459MiB/s    77.3k        302MiB/s

It's normal for read and write IOPS to differ, but the read IOPS are
very high. I wonder if caching and read-ahead are hiding the LSI
device's actual performance here.

What are the fio and QEMU command-lines?

In order to benchmark ioregionfd it's best to run a benchmark where the
bottleneck is MMIO/PIO dispatch. Otherwise we're looking at some other
bottleneck (e.g. physical disk I/O performance) and the MMIO/PIO
dispatch cost doesn't affect IOPS significantly.

I suggest trying --blockdev null-co,size=64G,id=null0 as the disk
instead of a file or host block device. The fio block size should be 4k
to minimize the amount of time spent on I/O buffer contents, and
iodepth=1 because batching multiple requests with iodepth > 1 hides the
MMIO/PIO dispatch bottleneck.

Stefan
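As a minimal sketch of this suggestion (not part of the posted series; the
node name null0 and the device IDs lsi1/drive1 are illustrative), the disk
options for the LSI controller could look roughly like this, with the boot
disk and the rest of the command line left unchanged:

    # Back the LSI-attached disk with the null-co driver so fio measures
    # MMIO/PIO dispatch cost instead of real disk I/O.
    -blockdev driver=null-co,node-name=null0,size=64G \
    -device lsi53c895a,id=lsi1 \
    -device scsi-hd,id=drive1,drive=null0,bus=lsi1.0,scsi-id=0

Requests to null-co complete immediately without touching a real disk, so
the remaining per-request cost is dominated by the MMIO/PIO dispatch path
being compared.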
On Mon, Feb 14, 2022 at 02:52:29PM +0000, Stefan Hajnoczi wrote:
> On Mon, Feb 07, 2022 at 11:22:14PM -0800, Elena Ufimtseva wrote:
> > This patchset is an RFC version for the ioregionfd implementation
> > in QEMU. The kernel patches are to be posted with some fixes as a v4.
> >
> > For this implementation version 3 of the posted kernel patches was used:
> > https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
> >
> > The future version will include support for vfio/libvfio-user.
> > Please refer to the design discussion here proposed by Stefan:
> > https://lore.kernel.org/all/YXpb1f3KicZxj1oj@stefanha-x1.localdomain/T/
> >
> > The vfio-user version needed some bug-fixing and it was decided to send
> > this for multiprocess first.
> >
> > The ioregionfd is currently configured through the command line and each
> > ioregionfd is represented by an object. This allows for easy parsing and
> > does not require device/remote object command line option modifications.
> >
> > The following command line can be used to specify ioregionfd:
> > <snip>
> > '-object', 'x-remote-object,id=robj1,devid=lsi0,fd='+str(remote.fileno()),\
> > '-object', 'ioregionfd-object,id=ioreg2,devid=lsi0,iofd='+str(iord.fileno())+',bar=1',\
> > '-object', 'ioregionfd-object,id=ioreg3,devid=lsi0,iofd='+str(iord.fileno())+',bar=2',\

Hi Stefan

Thank you for taking a look!

> Explicit configuration of ioregionfd-object is okay for early
> prototyping, but what is the plan for integrating this? I guess
> x-remote-object would query the remote device to find out which
> ioregionfds need to be registered and the user wouldn't need to specify
> ioregionfds on the command-line?

Yes, this can be done. For some reason I thought that the user would be
able to configure the number/size of the regions to be configured as
ioregionfds.

> > </snip>
> >
> > Proxy side of ioregionfd in this version uses only one file descriptor:
> > <snip>
> > '-device', 'x-pci-proxy-dev,id=lsi0,fd='+str(proxy.fileno())+',ioregfd='+str(iowr.fileno()), \
> > </snip>
>
> This raises the question of the ioregionfd file descriptor lifecycle. In
> the end I think it shouldn't be specified on the command-line. Instead
> the remote device should create it and pass it to QEMU over the
> mpqemu/remote fd?

Yes, this will be the same as vfio-user does.

> > This is done for the RFC version and my thought was that the next
> > version will be for vfio-user, so I have not dedicated much effort to
> > these command line options.
> >
> > The multiprocess messaging protocol was extended to support inquiries
> > by the proxy about whether a device has any ioregionfds.
> > This RFC implements inquiries by the proxy about the type of BAR
> > (ioregionfd or not) and the type of it (memory/io).
> >
> > Currently there are a few limitations in this version of ioregionfd:
> > - one ioregionfd per BAR, only the full BAR size is supported;
> > - one file descriptor per device for all of its ioregionfds;
> > - each remote device runs the fd handler for all its BARs in one IOThread;
> > - the proxy supports only one fd.
> >
> > Some of these limitations will be dropped in a future version.
> > This RFC is to acquire feedback/suggestions from the community
> > on the general approach.
> >
> > The quick performance test was done for the remote lsi device with
> > ioregionfd and without, for both mem BARs (1 and 2), with the help
> > of the fio tool:
> >
> > Random R/W:
> >
> >                 read IOPS   read BW     write IOPS   write BW
> > no ioregionfd   889         3559KiB/s   890          3561KiB/s
> > ioregionfd      938         3756KiB/s   939          3757KiB/s
>
> This is extremely slow, even for random I/O. How does this compare to
> QEMU running the LSI device without multi-process mode?

These tests had iodepth=256. I have changed this to 1 and tested
without multiprocess, with multiprocess, and multiprocess with both mmio
regions as ioregionfds:

                          read IOPS   read BW (KiB/s)   write IOPS   write BW (KiB/s)
no multiprocess           89          358               90           360
multiprocess              138         556               139          557
multiprocess ioregionfd   174         698               173          693

The fio config for randomrw:

[global]
bs=4K
iodepth=1
direct=0
ioengine=libaio
group_reporting
time_based
runtime=240
numjobs=1
name=raw-randreadwrite
rw=randrw
size=8G
[job1]
filename=/fio/randomrw

And the QEMU command line for non-multiprocess:

/usr/local/bin/qemu-system-x86_64 -name "OL7.4" -machine q35,accel=kvm \
    -smp sockets=1,cores=2,threads=2 -m 2048 \
    -hda /home/homedir/ol7u9boot.img -boot d -vnc :0 \
    -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios \
    -device lsi53c895a,id=lsi1 \
    -drive id=drive_image1,if=none,file=/home/homedir/10gb.qcow2 \
    -device scsi-hd,id=drive1,drive=drive_image1,bus=lsi1.0,scsi-id=0

QEMU command line for multiprocess:

remote_cmd = [ PROC_QEMU, \
        '-machine', 'x-remote', \
        '-device', 'lsi53c895a,id=lsi0', \
        '-drive', 'id=drive_image1,file=/home/homedir/10gb.qcow2', \
        '-device', 'scsi-hd,id=drive2,drive=drive_image1,bus=lsi0.0,' \
                   'scsi-id=0', \
        '-nographic', \
        '-monitor', 'unix:/home/homedir/rem-sock,server,nowait', \
        '-object', 'x-remote-object,id=robj1,devid=lsi0,fd='+str(remote.fileno()),\
        '-object', 'ioregionfd-object,id=ioreg2,devid=lsi0,iofd='+str(iord.fileno())+',bar=1,',\
        '-object', 'ioregionfd-object,id=ioreg3,devid=lsi0,iofd='+str(iord.fileno())+',bar=2',\
        ]
proxy_cmd = [ PROC_QEMU, \
        '-D', '/tmp/qemu-debug-log', \
        '-name', 'OL7.4', \
        '-machine', 'pc,accel=kvm', \
        '-smp', 'sockets=1,cores=2,threads=2', \
        '-m', '2048', \
        '-object', 'memory-backend-memfd,id=sysmem-file,size=2G', \
        '-numa', 'node,memdev=sysmem-file', \
        '-hda', '/home/homedir/ol7u9boot.img', \
        '-boot', 'd', \
        '-vnc', ':0', \
        '-device', 'x-pci-proxy-dev,id=lsi0,fd='+str(proxy.fileno())+',ioregfd='+str(iowr.fileno()), \
        '-monitor', 'unix:/home/homedir/qemu-sock,server,nowait', \
        '-netdev', 'tap,id=mynet0,ifname=tap0,script=no,downscript=no', \
        '-device', 'e1000,netdev=mynet0,mac=52:55:00:d1:55:01', \
        ]

For the test without ioregionfds, the ioregionfd-object lines are
commented out.

I am doing more testing as I see some inconsistent results.

> > Sequential Read and Sequential Write:
> >
> >                 Sequential read          Sequential write
> >                 read IOPS   read BW      write IOPS   write BW
> > no ioregionfd   367k        1434MiB/s    76k          297MiB/s
> > ioregionfd      374k        1459MiB/s    77.3k        302MiB/s
>
> It's normal for read and write IOPS to differ, but the read IOPS are
> very high. I wonder if caching and read-ahead are hiding the LSI
> device's actual performance here.
>
> What are the fio and QEMU command-lines?
>
> In order to benchmark ioregionfd it's best to run a benchmark where the
> bottleneck is MMIO/PIO dispatch. Otherwise we're looking at some other
> bottleneck (e.g. physical disk I/O performance) and the MMIO/PIO
> dispatch cost doesn't affect IOPS significantly.
>
> I suggest trying --blockdev null-co,size=64G,id=null0 as the disk
> instead of a file or host block device. The fio block size should be 4k
> to minimize the amount of time spent on I/O buffer contents, and
> iodepth=1 because batching multiple requests with iodepth > 1 hides the
> MMIO/PIO dispatch bottleneck.

The queue depth in the tests above was 256; I will try what you have
suggested. The block size is 4k. I am also looking at some other system
issue that can interfere with the test, and will be running the test on
a fresh install and with the settings you mentioned above.

Thank you!

> Stefan
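For reference, the job file above corresponds to roughly the following
single fio invocation (a sketch, not from the original posting; the
parameters are taken from the job file as posted), which can be easier to
tweak while iterating on the settings under discussion:

    # Command-line form of the randomrw job file above.
    fio --name=job1 --filename=/fio/randomrw --size=8G \
        --rw=randrw --bs=4K --iodepth=1 --direct=0 --ioengine=libaio \
        --numjobs=1 --time_based --runtime=240 --group_reporting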
On Tue, Feb 15, 2022 at 10:16:04AM -0800, Elena wrote:
> On Mon, Feb 14, 2022 at 02:52:29PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Feb 07, 2022 at 11:22:14PM -0800, Elena Ufimtseva wrote:
> > > This patchset is an RFC version for the ioregionfd implementation
> > > in QEMU. The kernel patches are to be posted with some fixes as a v4.
> > >
> > > For this implementation version 3 of the posted kernel patches was used:
> > > https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
> > >
> > > The future version will include support for vfio/libvfio-user.
> > > Please refer to the design discussion here proposed by Stefan:
> > > https://lore.kernel.org/all/YXpb1f3KicZxj1oj@stefanha-x1.localdomain/T/
> > >
> > > The vfio-user version needed some bug-fixing and it was decided to send
> > > this for multiprocess first.
> > >
> > > The ioregionfd is currently configured through the command line and each
> > > ioregionfd is represented by an object. This allows for easy parsing and
> > > does not require device/remote object command line option modifications.
> > >
> > > The following command line can be used to specify ioregionfd:
> > > <snip>
> > > '-object', 'x-remote-object,id=robj1,devid=lsi0,fd='+str(remote.fileno()),\
> > > '-object', 'ioregionfd-object,id=ioreg2,devid=lsi0,iofd='+str(iord.fileno())+',bar=1',\
> > > '-object', 'ioregionfd-object,id=ioreg3,devid=lsi0,iofd='+str(iord.fileno())+',bar=2',\
>
> Hi Stefan
>
> Thank you for taking a look!
>
> > Explicit configuration of ioregionfd-object is okay for early
> > prototyping, but what is the plan for integrating this? I guess
> > x-remote-object would query the remote device to find out which
> > ioregionfds need to be registered and the user wouldn't need to specify
> > ioregionfds on the command-line?
>
> Yes, this can be done. For some reason I thought that the user would be
> able to configure the number/size of the regions to be configured as
> ioregionfds.
>
> > > </snip>
> > >
> > > Proxy side of ioregionfd in this version uses only one file descriptor:
> > > <snip>
> > > '-device', 'x-pci-proxy-dev,id=lsi0,fd='+str(proxy.fileno())+',ioregfd='+str(iowr.fileno()), \
> > > </snip>
> >
> > This raises the question of the ioregionfd file descriptor lifecycle. In
> > the end I think it shouldn't be specified on the command-line. Instead
> > the remote device should create it and pass it to QEMU over the
> > mpqemu/remote fd?
>
> Yes, this will be the same as vfio-user does.
>
> > > This is done for the RFC version and my thought was that the next
> > > version will be for vfio-user, so I have not dedicated much effort to
> > > these command line options.
> > >
> > > The multiprocess messaging protocol was extended to support inquiries
> > > by the proxy about whether a device has any ioregionfds.
> > > This RFC implements inquiries by the proxy about the type of BAR
> > > (ioregionfd or not) and the type of it (memory/io).
> > >
> > > Currently there are a few limitations in this version of ioregionfd:
> > > - one ioregionfd per BAR, only the full BAR size is supported;
> > > - one file descriptor per device for all of its ioregionfds;
> > > - each remote device runs the fd handler for all its BARs in one IOThread;
> > > - the proxy supports only one fd.
> > >
> > > Some of these limitations will be dropped in a future version.
> > > This RFC is to acquire feedback/suggestions from the community
> > > on the general approach.
> > >
> > > The quick performance test was done for the remote lsi device with
> > > ioregionfd and without, for both mem BARs (1 and 2), with the help
> > > of the fio tool:
> > >
> > > Random R/W:
> > >
> > >                 read IOPS   read BW     write IOPS   write BW
> > > no ioregionfd   889         3559KiB/s   890          3561KiB/s
> > > ioregionfd      938         3756KiB/s   939          3757KiB/s
> >
> > This is extremely slow, even for random I/O. How does this compare to
> > QEMU running the LSI device without multi-process mode?
>
> These tests had iodepth=256. I have changed this to 1 and tested
> without multiprocess, with multiprocess, and multiprocess with both mmio
> regions as ioregionfds:
>
>                           read IOPS   read BW (KiB/s)   write IOPS   write BW (KiB/s)
> no multiprocess           89          358               90           360
> multiprocess              138         556               139          557
> multiprocess ioregionfd   174         698               173          693
>
> The fio config for randomrw:
>
> [global]
> bs=4K
> iodepth=1
> direct=0

Please set direct=1 so the guest page cache does not affect the I/O
pattern. The host --drive option also needs cache.direct=on to avoid
host page cache effects.

The reason for benchmarking with direct=1 is to ensure that every I/O
request submitted by fio is forwarded to the underlying disk. Otherwise
the benchmark may be comparing guest page cache or host page cache hits,
which do not involve the disk.

Page cache read-ahead and write-behind may involve large block sizes and
therefore change the I/O pattern specified on the fio command-line. This
interferes with the benchmark and is another reason to use direct=1.

> ioengine=libaio
> group_reporting
> time_based
> runtime=240
> numjobs=1
> name=raw-randreadwrite
> rw=randrw
> size=8G
> [job1]
> filename=/fio/randomrw
>
> And the QEMU command line for non-multiprocess:
>
> /usr/local/bin/qemu-system-x86_64 -name "OL7.4" -machine q35,accel=kvm \
>     -smp sockets=1,cores=2,threads=2 -m 2048 \
>     -hda /home/homedir/ol7u9boot.img -boot d -vnc :0 \
>     -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios \
>     -device lsi53c895a,id=lsi1 \
>     -drive id=drive_image1,if=none,file=/home/homedir/10gb.qcow2 \
>     -device scsi-hd,id=drive1,drive=drive_image1,bus=lsi1.0,scsi-id=0
>
> QEMU command line for multiprocess:
>
> remote_cmd = [ PROC_QEMU, \
>         '-machine', 'x-remote', \
>         '-device', 'lsi53c895a,id=lsi0', \
>         '-drive', 'id=drive_image1,file=/home/homedir/10gb.qcow2', \
>         '-device', 'scsi-hd,id=drive2,drive=drive_image1,bus=lsi0.0,' \
>                    'scsi-id=0', \
>         '-nographic', \
>         '-monitor', 'unix:/home/homedir/rem-sock,server,nowait', \
>         '-object', 'x-remote-object,id=robj1,devid=lsi0,fd='+str(remote.fileno()),\
>         '-object', 'ioregionfd-object,id=ioreg2,devid=lsi0,iofd='+str(iord.fileno())+',bar=1,',\
>         '-object', 'ioregionfd-object,id=ioreg3,devid=lsi0,iofd='+str(iord.fileno())+',bar=2',\
>         ]
> proxy_cmd = [ PROC_QEMU, \
>         '-D', '/tmp/qemu-debug-log', \
>         '-name', 'OL7.4', \
>         '-machine', 'pc,accel=kvm', \
>         '-smp', 'sockets=1,cores=2,threads=2', \
>         '-m', '2048', \
>         '-object', 'memory-backend-memfd,id=sysmem-file,size=2G', \
>         '-numa', 'node,memdev=sysmem-file', \
>         '-hda', '/home/homedir/ol7u9boot.img', \
>         '-boot', 'd', \
>         '-vnc', ':0', \
>         '-device', 'x-pci-proxy-dev,id=lsi0,fd='+str(proxy.fileno())+',ioregfd='+str(iowr.fileno()), \
>         '-monitor', 'unix:/home/homedir/qemu-sock,server,nowait', \
>         '-netdev', 'tap,id=mynet0,ifname=tap0,script=no,downscript=no', \
>         '-device', 'e1000,netdev=mynet0,mac=52:55:00:d1:55:01', \
>         ]
>
> For the test without ioregionfds, the ioregionfd-object lines are
> commented out.
>
> I am doing more testing as I see some inconsistent results.

Thanks for the benchmark details!

Stefan
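Applying both suggestions to the command lines already posted in this
thread would look roughly like the following (a sketch; only the changed
pieces are shown, everything else stays as above):

    # Guest side: add direct=1 to the fio job so every request bypasses
    # the guest page cache.
    fio --name=job1 --filename=/fio/randomrw --size=8G \
        --rw=randrw --bs=4K --iodepth=1 --direct=1 --ioengine=libaio \
        --time_based --runtime=240

    # Host side: add cache.direct=on to the existing -drive option in the
    # remote process command line so the host page cache is bypassed too.
    '-drive', 'id=drive_image1,file=/home/homedir/10gb.qcow2,cache.direct=on',

With both caches out of the picture, the 4k/iodepth=1 results should track
the MMIO/PIO dispatch cost that the ioregionfd comparison is meant to
measure.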