Message ID: cover.1571905346.git.jag.raman@oracle.com
Series: Initial support of multi-process qemu
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/

Hi,

This series failed the docker-quick@centos7 build test. Please find the
testing commands and their output below. If you have Docker installed, you
can probably reproduce it locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    iotest-qcow2: 268
Failures: 071 099 120 184 186
Failed 5 of 109 iotests
make: *** [check-tests/check-block.sh] Error 1
make: *** Waiting for unfinished jobs....
  TEST    check-qtest-aarch64: tests/qos-test
Traceback (most recent call last):
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=7fcd537d324f4e2997dd5a6e772bce59', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-k9w5q7fm/src/docker-src.2019-10-24-21.54.49.17993:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=7fcd537d324f4e2997dd5a6e772bce59
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-k9w5q7fm/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    13m26.478s
user    0m8.947s

The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/

Hi,

This series failed the docker-mingw@fedora build test. Please find the
testing commands and their output below. If you have Docker installed, you
can probably reproduce it locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      util/aio-wait.o
  CC      util/thread-pool.o
Warning, treated as error:
/tmp/qemu-test/src/docs/devel/index.rst:13:toctree contains reference to nonexisting document 'multi-process'
  CC      util/qemu-timer.o
  CC      util/main-loop.o
make: *** [Makefile:1003: docs/devel/index.html] Error 2
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=f573792346ea4c8cb0e6066fdd1f1ddf', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-pe_jt46w/src/docker-src.2019-10-24-22.05.58.3272:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=f573792346ea4c8cb0e6066fdd1f1ddf
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-pe_jt46w/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m19.037s
user    0m7.875s

The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/

Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [RFC v4 PATCH 00/49] Initial support of multi-process qemu
Type: series
Message-id: cover.1571905346.git.jag.raman@oracle.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
75d605b multi-process: add configure and usage information
1724569 multi-process: add the concept description to docs/devel/qemu-multiprocess
15328b7 multi-process: Enable support for multiple devices in remote
5cf79267 multi-process/mig: Restore the VMSD in remote process
97c15c9 multi-process/mig: Synchronize runstate of remote process
1b67b17 multi-process/mig: refactor runstate_check into common file
55e4b53 multi-process/mig: Load VMSD in the proxy object
7e1b819 multi-process/mig: Send VMSD of remote to the Proxy object
399f9bf multi-process/mig: Enable VMSD save in the Proxy object
645fc27 multi-process/mig: build migration module in the remote process
f0265a7 multi-process: prevent duplicate memory initialization in remote
a0faf9e multi-process/mon: Initialize QMP module for remote processes
58310d9 multi-process/mon: Refactor monitor/chardev functions out of vl.c
ae4fd02 multi-process/mon: enable QMP module support in the remote process
d7e2191 multi-process/mon: stub functions to enable QMP module for remote process
2c67c85 multi-process/mon: choose HMP commands based on target
cb5ab3e multi-process: perform device reset in the remote process
3214409 multi-process: Use separate MMIO communication channel
c34c47f multi-process: handle heartbeat messages in remote process
ccd9230 multi-process: send heartbeat messages to remote
f4991d6 multi-process: add parse_cmdline in remote process
a90d6c6 multi-process: add remote options parser
d616c9d multi-process: add remote option
45c18dd multi-process: refractor vl.c code to re-use in remote
57b6105 multi-process: Introduce build flags to separate remote process code
1295409 multi-process: add processing of remote drive and device command line
e876f72 multi-process: remote: add create_done condition
effa3f0 multi-process: remote: use fd for socket from parent process
de69c89 multi-process: remote: add setup_devices and setup_drive msg processing
afc658d multi-process: add qdev_proxy_add to create proxy devices
70e0f47 multi-process: configure remote side devices
701a141 multi-process: create IOHUB object to handle irq
613c372 multi-process: Synchronize remote memory
f5183a9 multi-process: Add LSI device proxy object
7778ec0 multi-process: PCI BAR read/write handling for proxy & remote endpoints
736d74d mutli-process: build remote command line args
650f3d1 multi-process: introduce proxy object
9374d1a multi-process: remote process initialization
1965b18 multi-process: setup memory manager for remote device
719f823 multi-process: setup a machine object for remote device process
59bdda6 multi-process: setup PCI host bridge for remote device
139cdee multi-process: add functions to synchronize proxy and remote endpoints
1b3c1f0 multi-process: define mpqemu-link object
ac130ca multi-process: build system for remote device process
93774c4 multi-process: Add config option for multi-process QEMU
7e5f9b2 multi-process: Add stub functions to facilate build of multi-process
73740e6 multi-process: add a command line option for debug file
59f2f7d multi-process: util: Add qemu_thread_cancel() to cancel running thread
18b29a1 multi-process: memory: alloc RAM from file at offset

=== OUTPUT BEGIN ===
1/49 Checking commit 18b29a1306b2 (multi-process: memory: alloc RAM from file at offset)
2/49 Checking commit 59f2f7d24d20 (multi-process: util: Add qemu_thread_cancel() to cancel running thread)
3/49 Checking commit 73740e6eff21 (multi-process: add a command line option for debug file)
4/49 Checking commit 7e5f9b26d48c (multi-process: Add stub functions to facilate build of multi-process)
ERROR: suspect code indent for conditional statements (4, 4)
#137: FILE: accel/stubs/tcg-stub.c:109:
+    while (1) {
+    }

WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#151: new file mode 100644

total: 1 errors, 1 warnings, 376 lines checked

Patch 4/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

5/49 Checking commit 93774c4a2c28 (multi-process: Add config option for multi-process QEMU)
6/49 Checking commit ac130ca326fe (multi-process: build system for remote device process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#286: new file mode 100644

total: 0 errors, 1 warnings, 244 lines checked

Patch 6/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

7/49 Checking commit 1b3c1f0e0495 (multi-process: define mpqemu-link object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#20: new file mode 100644

WARNING: line over 80 characters
#171: FILE: include/io/mpqemu-link.h:147:
+void mpqemu_link_set_callback(MPQemuLinkState *s, mpqemu_link_callback callback);

WARNING: line over 80 characters
#397: FILE: io/mpqemu-link.c:207:
+            qemu_log_mask(LOG_REMOTE_DEBUG, "%s: Max FDs exceeded\n", __func__);

total: 0 errors, 3 warnings, 464 lines checked

Patch 7/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

8/49 Checking commit 139cdee6cbf6 (multi-process: add functions to synchronize proxy and remote endpoints)
9/49 Checking commit 59bdda61a183 (multi-process: setup PCI host bridge for remote device)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#31: new file mode 100644

total: 0 errors, 1 warnings, 153 lines checked

Patch 9/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

10/49 Checking commit 719f82346cb6 (multi-process: setup a machine object for remote device process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#51: new file mode 100644

total: 0 errors, 1 warnings, 207 lines checked

Patch 10/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

11/49 Checking commit 1965b18cc69e (multi-process: setup memory manager for remote device)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81: new file mode 100644

total: 0 errors, 1 warnings, 182 lines checked

Patch 11/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

12/49 Checking commit 9374d1ad3f30 (multi-process: remote process initialization)
13/49 Checking commit 650f3d106d82 (multi-process: introduce proxy object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#34: new file mode 100644

total: 0 errors, 1 warnings, 379 lines checked

Patch 13/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

14/49 Checking commit 736d74d6623b (mutli-process: build remote command line args)
WARNING: line over 80 characters
#141: FILE: hw/proxy/qemu-proxy.c:243:
+static void init_proxy(PCIDevice *dev, char *command, bool need_spawn, Error **errp)

WARNING: line over 80 characters
#167: FILE: include/hw/proxy/qemu-proxy.h:66:
+    void (*init_proxy) (PCIDevice *dev, char *command, bool need_spawn, Error **errp);

total: 0 errors, 2 warnings, 146 lines checked

Patch 14/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

15/49 Checking commit 7778ec04e0e6 (multi-process: PCI BAR read/write handling for proxy & remote endpoints)
16/49 Checking commit f5183a9f01b5 (multi-process: Add LSI device proxy object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#24: new file mode 100644

total: 0 errors, 1 warnings, 135 lines checked

Patch 16/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

17/49 Checking commit 613c372a8221 (multi-process: Synchronize remote memory)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#32: new file mode 100644

total: 0 errors, 1 warnings, 344 lines checked

Patch 17/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

18/49 Checking commit 701a141ec1a3 (multi-process: create IOHUB object to handle irq)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#202: new file mode 100644

total: 0 errors, 1 warnings, 435 lines checked

Patch 18/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

19/49 Checking commit 70e0f47f0652 (multi-process: configure remote side devices)
20/49 Checking commit afc658d34c34 (multi-process: add qdev_proxy_add to create proxy devices)
21/49 Checking commit de69c899ae12 (multi-process: remote: add setup_devices and setup_drive msg processing)
22/49 Checking commit effa3f06a627 (multi-process: remote: use fd for socket from parent process)
23/49 Checking commit e876f72db9ae (multi-process: remote: add create_done condition)
24/49 Checking commit 129540993d75 (multi-process: add processing of remote drive and device command line)
WARNING: Block comments use a leading /* on a separate line
#41: FILE: vl.c:1149:
+        dev = qdev_remote_add(opts, false /* this is drive */, errp);

WARNING: Block comments use a leading /* on a separate line
#87: FILE: vl.c:2246:
+        dev = qdev_remote_add(opts, true /* this is device */, errp);

WARNING: line over 80 characters
#121: FILE: vl.c:4418:
+     * need PCI host initialized. As a TODO: could defer init of PCIProxyDev instead.

total: 0 errors, 3 warnings, 117 lines checked

Patch 24/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

25/49 Checking commit 57b61051819d (multi-process: Introduce build flags to separate remote process code)
26/49 Checking commit 45c18ddb523b (multi-process: refractor vl.c code to re-use in remote)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#35: new file mode 100644

WARNING: Block comments use a leading /* on a separate line
#157: FILE: vl-parse.c:118:
+        dev = qdev_remote_add(opts, false /* this is drive */, errp);

WARNING: Block comments use a leading /* on a separate line
#172: FILE: vl-parse.c:133:
+        dev = qdev_remote_add(opts, true /* this is device */, errp);

total: 0 errors, 3 warnings, 412 lines checked

Patch 26/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

27/49 Checking commit d616c9d52b3e (multi-process: add remote option)
28/49 Checking commit a90d6c6eebff (multi-process: add remote options parser)
WARNING: Block comments use a leading /* on a separate line
#37: FILE: vl.c:300:
+    { /* end of list */ }

total: 0 errors, 1 warnings, 147 lines checked

Patch 28/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

29/49 Checking commit f4991d61c243 (multi-process: add parse_cmdline in remote process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#59: new file mode 100644

ERROR: line over 90 characters
#151: FILE: remote/remote-opts.c:88:
+            error_report("Option not supported for this target, %x arch_mask, %x arch_type",

total: 1 errors, 1 warnings, 180 lines checked

Patch 29/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

30/49 Checking commit ccd92304b77d (multi-process: send heartbeat messages to remote)
31/49 Checking commit c34c47ff9925 (multi-process: handle heartbeat messages in remote process)
32/49 Checking commit 32144092e7eb (multi-process: Use separate MMIO communication channel)
33/49 Checking commit cb5ab3eb7d63 (multi-process: perform device reset in the remote process)
34/49 Checking commit 2c67c85818aa (multi-process/mon: choose HMP commands based on target)
35/49 Checking commit d7e21910a6c4 (multi-process/mon: stub functions to enable QMP module for remote process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#108: new file mode 100644

total: 0 errors, 1 warnings, 722 lines checked

Patch 35/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

36/49 Checking commit ae4fd022fa7e (multi-process/mon: enable QMP module support in the remote process)
37/49 Checking commit 58310d9c635b (multi-process/mon: Refactor monitor/chardev functions out of vl.c)
38/49 Checking commit a0faf9ea04b0 (multi-process/mon: Initialize QMP module for remote processes)
39/49 Checking commit f0265a76742b (multi-process: prevent duplicate memory initialization in remote)
40/49 Checking commit 645fc2738aec (multi-process/mig: build migration module in the remote process)
41/49 Checking commit 399f9bf4ba56 (multi-process/mig: Enable VMSD save in the Proxy object)
ERROR: suspect code indent for conditional statements (4, 4)
#126: FILE: hw/proxy/qemu-proxy.c:449:
+    while (*((volatile uint64_t *)&pdev->migsize) < size) {
+    }

total: 1 errors, 0 warnings, 173 lines checked

Patch 41/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

42/49 Checking commit 7e1b8192f562 (multi-process/mig: Send VMSD of remote to the Proxy object)
43/49 Checking commit 55e4b53ec008 (multi-process/mig: Load VMSD in the proxy object)
44/49 Checking commit 1b67b1722798 (multi-process/mig: refactor runstate_check into common file)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#48: new file mode 100644

total: 0 errors, 1 warnings, 92 lines checked

Patch 44/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

45/49 Checking commit 97c15c99f41f (multi-process/mig: Synchronize runstate of remote process)
46/49 Checking commit 5cf792672396 (multi-process/mig: Restore the VMSD in remote process)
ERROR: that open brace { should be on the previous line
#118: FILE: remote/remote-main.c:527:
+    if (process_start_mig_in(msg))
+    {

total: 1 errors, 0 warnings, 103 lines checked

Patch 46/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

47/49 Checking commit 15328b7aae2d (multi-process: Enable support for multiple devices in remote)
48/49 Checking commit 1724569b4d71 (multi-process: add the concept description to docs/devel/qemu-multiprocess)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#22: new file mode 100644

total: 0 errors, 1 warnings, 1106 lines checked

Patch 48/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.

49/49 Checking commit 75d605b31739 (multi-process: add configure and usage information)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: new file mode 100644

total: 0 errors, 1 warnings, 86 lines checked

Patch 49/49 has style problems, please review.
If any of these errors are false positives report them to the maintainer,
see CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1

The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
On Thu, Oct 24, 2019 at 05:08:41AM -0400, Jagannathan Raman wrote:
> Started with the presentation in October 2017 made by Marc-Andre (Red Hat)
> and Konrad Wilk (Oracle) [1], and continued by Jag's BoF at KVM Forum 2018,
> the multi-process project is now a prototype and presented in this patchset.
> John & Elena will present the status of this project in KVM Forum 2019.
>
> This first series enables the emulation of lsi53c895a in a separate process.
>
> We posted the Proof Of Concept patches [2] before the BoF session in 2018.
> Subsequently, we posted RFC v1 [3], RFC v2 [4] and RFC v3 [5] of this series.
>
> We want to present version 4 of this series, which incorporates the feedback
> we received for v3 & adds support for live migrating the remote process.
>
> Following people contributed to this patchset:
>
> John G Johnson <john.g.johnson@oracle.com>
> Jagannathan Raman <jag.raman@oracle.com>
> Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Kanth Ghatraju <kanth.ghatraju@oracle.com>
>
> For full concept writeup about QEMU disaggregation refer to
> docs/devel/qemu-multiprocess.rst. Please refer to
> docs/qemu-multiprocess.txt for usage information.
>
> We are planning on making the following improvements in the future:
> - Performance improvements
> - Libvirt support
> - Enforcement of security policies
> - blockdev support
>
> We welcome all your ideas, concerns, and questions for this patchset.
>
> Thank you!

I've wrapped up my review for v4. There is more to review in detail, but
I've posted enough comments that I'd like to see the next revision before
investing more time.

The main topics:

1. It's possible to have just one proxy device per bus type (PCI, USB,
   etc). The proxy device instance can be initialized by communicating
   with the remote process to inquire about its memory regions,
   interrupts, etc. This removes the need to hardcode that information
   into per-device proxy objects, which is tedious and can get
   out-of-sync with the real device emulation code. This is becoming
   similar to doing VFIO or muser over a socket...

2. Security and code quality. Missing input validation and resource
   leaks don't inspire confidence :(. Please run scripts/checkpatch.pl
   on the code.

Stefan
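As a concrete illustration of topic 1, here is a minimal C sketch of the
kind of data a generic proxy could fetch from the remote process at init
time instead of hardcoding it. Every name below (ProxyDeviceInfo,
ProxyRegionInfo, the field layout, the notion of a GET_DEVICE_INFO reply)
is invented for the example; none of it is from this series or from VFIO.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PROXY_MAX_REGIONS 6 /* one slot per PCI BAR; illustrative */

typedef struct ProxyRegionInfo {
    uint64_t size;    /* BAR size in bytes */
    uint8_t  bar;     /* BAR index 0-5 */
    uint8_t  is_mmio; /* 1 = MMIO, 0 = I/O port */
} ProxyRegionInfo;

typedef struct ProxyDeviceInfo {
    uint32_t nr_regions;
    uint32_t nr_irqs;
    ProxyRegionInfo regions[PROXY_MAX_REGIONS];
} ProxyDeviceInfo;

int main(void)
{
    /* Pretend this arrived as the reply to a GET_DEVICE_INFO request
     * sent over the proxy<->remote socket during proxy realize. */
    ProxyDeviceInfo info = {
        .nr_regions = 1,
        .nr_irqs = 1,
        .regions = { { .size = 0x100, .bar = 0, .is_mmio = 1 } },
    };

    /* A generic proxy would register BARs/IRQs from this data rather
     * than from a per-device proxy object like proxy-lsi53c895a. */
    printf("remote exposes %u region(s) and %u irq(s)\n",
           info.nr_regions, info.nr_irqs);
    printf("BAR%u: size 0x%" PRIx64 " (%s)\n",
           (unsigned)info.regions[0].bar, info.regions[0].size,
           info.regions[0].is_mmio ? "mmio" : "pio");
    return 0;
}

The design point is that only the remote process needs device-specific
knowledge; the proxy stays device-agnostic, which is exactly what makes
this resemble VFIO or muser over a socket.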
Hi,

We would like to give a short update to the community about the
multi-process project.

Firstly, we appreciate the feedback and all the productive discussions we
had at KVM Forum 2019. As an outcome of the conference, we have switched
gears and are investigating ways of using the muser framework in our
project. At this moment we are working on the evaluation and a first
prototype of qemu-multiprocess based on the muser framework. We first
heard about it at the conference from the presentation given by Thanos
Makatos and Swapnil Ingle from Nutanix. Their presentation is available at
https://static.sched.com/hosted_files/kvmforum2019/3b/muser.pdf along with
a github link to the source repo.

After the conversation we had with a group of people including Felipe
Franciosi, Stefan Hajnoczi, Daniel Berrangé, Konrad Wilk, Peter Maydell,
John Johnson and a few others (apologies if some names are missing), we
have gathered important answers on how to move forward with
qemu-multiprocess.

At this moment we are working on the first stage of the project with the
help of the Nutanix developers. The questions we have gathered so far will
be addressed with the muser and Qemu developers after we finish the first
stage and make sure we understand what it will take for us to move onto
the next stage.

We will also incorporate the relevant review from Stefan that he provided
on series 4 of the patchset. Thank you, Stefan.

If anyone has any further suggestions or questions about the status,
please reply to this email.

Thank you
JJ, Jag & Elena

On Thu, Oct 24, 2019 at 05:08:41AM -0400, Jagannathan Raman wrote:
> Started with the presentation in October 2017 made by Marc-Andre (Red Hat)
> and Konrad Wilk (Oracle) [1], and continued by Jag's BoF at KVM Forum 2018,
> the multi-process project is now a prototype and presented in this patchset.
> John & Elena will present the status of this project in KVM Forum 2019.
>
> This first series enables the emulation of lsi53c895a in a separate process.
>
> We posted the Proof Of Concept patches [2] before the BoF session in 2018.
> Subsequently, we posted RFC v1 [3], RFC v2 [4] and RFC v3 [5] of this series.
>
> We want to present version 4 of this series, which incorporates the feedback
> we received for v3 & adds support for live migrating the remote process.
>
> Following people contributed to this patchset:
>
> John G Johnson <john.g.johnson@oracle.com>
> Jagannathan Raman <jag.raman@oracle.com>
> Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Kanth Ghatraju <kanth.ghatraju@oracle.com>
>
> For full concept writeup about QEMU disaggregation refer to
> docs/devel/qemu-multiprocess.rst. Please refer to
> docs/qemu-multiprocess.txt for usage information.
>
> We are planning on making the following improvements in the future:
> - Performance improvements
> - Libvirt support
> - Enforcement of security policies
> - blockdev support
>
> We welcome all your ideas, concerns, and questions for this patchset.
>
> Thank you!
>
> [1]: http://events17.linuxfoundation.org/sites/events/files/slides/KVM%20FORUM%20multi-process.pdf
> [1]: https://www.youtube.com/watch?v=Kq1-coHh7lg
> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg566538.html
> [3]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg602285.html
> [4]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg624877.html
> [5]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg642000.html
>
> Elena Ufimtseva (22):
>   multi-process: add a command line option for debug file
>   multi-process: introduce proxy object
>   mutli-process: build remote command line args
>   multi-process: configure remote side devices
>   multi-process: add qdev_proxy_add to create proxy devices
>   multi-process: remote: add setup_devices and setup_drive msg processing
>   multi-process: remote: use fd for socket from parent process
>   multi-process: remote: add create_done condition
>   multi-process: add processing of remote drive and device command line
>   multi-process: refractor vl.c code to re-use in remote
>   multi-process: add remote option
>   multi-process: add remote options parser
>   multi-process: add parse_cmdline in remote process
>   multi-process: send heartbeat messages to remote
>   multi-process: handle heartbeat messages in remote process
>   multi-process/mon: choose HMP commands based on target
>   multi-process/mig: Load VMSD in the proxy object
>   multi-process/mig: refactor runstate_check into common file
>   multi-process/mig: Synchronize runstate of remote process
>   multi-process/mig: Restore the VMSD in remote process
>   multi-process: Enable support for multiple devices in remote
>   multi-process: add configure and usage information
>
> Jagannathan Raman (26):
>   multi-process: memory: alloc RAM from file at offset
>   multi-process: util: Add qemu_thread_cancel() to cancel running thread
>   multi-process: Add stub functions to facilate build of multi-process
>   multi-process: Add config option for multi-process QEMU
>   multi-process: build system for remote device process
>   multi-process: define mpqemu-link object
>   multi-process: add functions to synchronize proxy and remote endpoints
>   multi-process: setup PCI host bridge for remote device
>   multi-process: setup a machine object for remote device process
>   multi-process: setup memory manager for remote device
>   multi-process: remote process initialization
>   multi-process: PCI BAR read/write handling for proxy & remote endpoints
>   multi-process: Add LSI device proxy object
>   multi-process: Synchronize remote memory
>   multi-process: create IOHUB object to handle irq
>   multi-process: Introduce build flags to separate remote process code
>   multi-process: Use separate MMIO communication channel
>   multi-process: perform device reset in the remote process
>   multi-process/mon: stub functions to enable QMP module for remote process
>   multi-process/mon: enable QMP module support in the remote process
>   multi-process/mon: Refactor monitor/chardev functions out of vl.c
>   multi-process/mon: Initialize QMP module for remote processes
>   multi-process: prevent duplicate memory initialization in remote
>   multi-process/mig: build migration module in the remote process
>   multi-process/mig: Enable VMSD save in the Proxy object
>   multi-process/mig: Send VMSD of remote to the Proxy object
>
> John G Johnson (1):
>   multi-process: add the concept description to docs/devel/qemu-multiprocess
>
>  Makefile | 2 +
>  Makefile.objs | 39 ++
>  Makefile.target | 94 ++-
>  accel/stubs/kvm-stub.c | 5 +
>  accel/stubs/tcg-stub.c | 106 ++++
>  backends/Makefile.objs | 2 +
>  block/Makefile.objs | 2 +
>  chardev/char.c | 14 +
>  configure | 15 +
>  docs/devel/index.rst | 1 +
>  docs/devel/qemu-multiprocess.rst | 1102 +++++++++++++++++++++++++++++++++++
>  docs/qemu-multiprocess.txt | 86 +++
>  exec.c | 14 +-
>  hmp-commands-info.hx | 10 +
>  hmp-commands.hx | 25 +-
>  hw/Makefile.objs | 9 +
>  hw/block/Makefile.objs | 2 +
>  hw/core/Makefile.objs | 17 +
>  hw/nvram/Makefile.objs | 2 +
>  hw/pci/Makefile.objs | 4 +
>  hw/proxy/Makefile.objs | 1 +
>  hw/proxy/memory-sync.c | 226 +++++++
>  hw/proxy/proxy-lsi53c895a.c | 97 +++
>  hw/proxy/qemu-proxy.c | 807 +++++++++++++++++++++++++
>  hw/scsi/Makefile.objs | 2 +
>  include/chardev/char.h | 1 +
>  include/exec/address-spaces.h | 2 +
>  include/exec/ram_addr.h | 2 +-
>  include/hw/pci/pci_ids.h | 3 +
>  include/hw/proxy/memory-sync.h | 51 ++
>  include/hw/proxy/proxy-lsi53c895a.h | 42 ++
>  include/hw/proxy/qemu-proxy.h | 125 ++++
>  include/hw/qdev-core.h | 2 +
>  include/io/mpqemu-link.h | 214 +++++++
>  include/monitor/monitor.h | 2 +
>  include/monitor/qdev.h | 25 +
>  include/qemu-common.h | 8 +
>  include/qemu/log.h | 1 +
>  include/qemu/mmap-alloc.h | 3 +-
>  include/qemu/thread.h | 1 +
>  include/remote/iohub.h | 63 ++
>  include/remote/machine.h | 48 ++
>  include/remote/memory.h | 34 ++
>  include/remote/pcihost.h | 59 ++
>  include/sysemu/runstate.h | 3 +
>  io/Makefile.objs | 2 +
>  io/mpqemu-link.c | 351 +++++++++++
>  memory.c | 2 +-
>  migration/Makefile.objs | 12 +
>  migration/savevm.c | 63 ++
>  migration/savevm.h | 3 +
>  monitor/Makefile.objs | 3 +
>  monitor/misc.c | 84 +--
>  monitor/monitor-internal.h | 38 ++
>  monitor/monitor.c | 83 ++-
>  net/Makefile.objs | 2 +
>  qapi/Makefile.objs | 2 +
>  qdev-monitor.c | 270 ++++++++-
>  qemu-options.hx | 21 +
>  qom/Makefile.objs | 4 +
>  remote/Makefile.objs | 6 +
>  remote/iohub.c | 159 +++++
>  remote/machine.c | 133 +++++
>  remote/memory.c | 99 ++++
>  remote/pcihost.c | 85 +++
>  remote/remote-main.c | 633 ++++++++++++++++++++
>  remote/remote-opts.c | 131 +++++
>  remote/remote-opts.h | 31 +
>  replay/Makefile.objs | 2 +-
>  rules.mak | 2 +-
>  runstate.c | 41 ++
>  scripts/hxtool | 44 +-
>  stubs/audio.c | 12 +
>  stubs/gdbstub.c | 21 +
>  stubs/machine-init-done.c | 4 +
>  stubs/migration.c | 211 +++++++
>  stubs/monitor.c | 72 +++
>  stubs/net-stub.c | 121 ++++
>  stubs/qapi-misc.c | 43 ++
>  stubs/qapi-target.c | 49 ++
>  stubs/replay.c | 26 +
>  stubs/runstate-check.c | 3 +
>  stubs/ui-stub.c | 130 +++++
>  stubs/vl-stub.c | 193 ++++++
>  stubs/vmstate.c | 20 +
>  stubs/xen-mapcache.c | 22 +
>  ui/Makefile.objs | 2 +
>  util/log.c | 2 +
>  util/mmap-alloc.c | 7 +-
>  util/oslib-posix.c | 2 +-
>  util/qemu-thread-posix.c | 10 +
>  vl-parse.c | 161 +++++
>  vl.c | 310 ++++------
>  vl.h | 54 ++
>  94 files changed, 6908 insertions(+), 246 deletions(-)
>  create mode 100644 docs/devel/qemu-multiprocess.rst
>  create mode 100644 docs/qemu-multiprocess.txt
>  create mode 100644 hw/proxy/Makefile.objs
>  create mode 100644 hw/proxy/memory-sync.c
>  create mode 100644 hw/proxy/proxy-lsi53c895a.c
>  create mode 100644 hw/proxy/qemu-proxy.c
>  create mode 100644 include/hw/proxy/memory-sync.h
>  create mode 100644 include/hw/proxy/proxy-lsi53c895a.h
>  create mode 100644 include/hw/proxy/qemu-proxy.h
>  create mode 100644 include/io/mpqemu-link.h
>  create mode 100644 include/remote/iohub.h
>  create mode 100644 include/remote/machine.h
>  create mode 100644 include/remote/memory.h
>  create mode 100644 include/remote/pcihost.h
>  create mode 100644 io/mpqemu-link.c
>  create mode 100644 remote/Makefile.objs
>  create mode 100644 remote/iohub.c
>  create mode 100644 remote/machine.c
>  create mode 100644 remote/memory.c
>  create mode 100644 remote/pcihost.c
>  create mode 100644 remote/remote-main.c
>  create mode 100644 remote/remote-opts.c
>  create mode 100644 remote/remote-opts.h
>  create mode 100644 runstate.c
>  mode change 100644 => 100755 scripts/hxtool
>  create mode 100644 stubs/audio.c
>  create mode 100644 stubs/migration.c
>  create mode 100644 stubs/net-stub.c
>  create mode 100644 stubs/qapi-misc.c
>  create mode 100644 stubs/qapi-target.c
>  create mode 100644 stubs/ui-stub.c
>  create mode 100644 stubs/vl-stub.c
>  create mode 100644 stubs/xen-mapcache.c
>  create mode 100644 vl-parse.c
>  create mode 100644 vl.h
>
> --
> 1.8.3.1
>
On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
> At this moment we are working on the first stage of the project with the
> help of the Nutanix developers. The questions we have gathered so far
> will be addressed with the muser and Qemu developers after we finish the
> first stage and make sure we understand what it will take for us to move
> onto the next stage.
>
> We will also incorporate the relevant review from Stefan that he
> provided on series 4 of the patchset. Thank you, Stefan.
>
> If anyone has any further suggestions or questions about the status,
> please reply to this email.

Hi Elena,
At KVM Forum we discussed spending 1 or 2 weeks trying out muser. A few
weeks have passed and from your email it sounds like this "next stage"
might be a lot of work.

Is there a work-in-progress muser patch series you can post to start the
discussion early? That way we can avoid reviewers like myself asking you
to make changes after you have invested a lot of time.

It's good that you are in touch with the muser developers (via private
discussion? I haven't seen much activity on #muser IRC).

Stefan
On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
> > At this moment we are working on the first stage of the project with the
> > help of the Nutanix developers. The questions we have gathered so far
> > will be addressed with the muser and Qemu developers after we finish the
> > first stage and make sure we understand what it will take for us to move
> > onto the next stage.
> >
> > We will also incorporate the relevant review from Stefan that he
> > provided on series 4 of the patchset. Thank you, Stefan.
> >
> > If anyone has any further suggestions or questions about the status,
> > please reply to this email.
>
> Hi Elena,
> At KVM Forum we discussed spending 1 or 2 weeks trying out muser. A few
> weeks have passed and from your email it sounds like this "next stage"
> might be a lot of work.

Hi Stefan,

Perhaps we were not too clear about our work in the previous email. Our
assumption was that the question that came from KVM Forum was whether
muser can be used to achieve the same as what we have now. We should have
answered a clear yes to this question. We have not yet discovered major
road blocks.

At the moment, we are mostly engaged in learning the code and discussing
the design, plus some coding to answer specific questions. We understand
that the best way to make progress is to work with the upstream community
from the early stages; we agree with this and will present the proposal
shortly for discussion.

> Is there a work-in-progress muser patch series you can post to start the
> discussion early? That way we can avoid reviewers like myself asking
> you to make changes after you have invested a lot of time.

Absolutely, that is our plan. At the moment we do not have the patches
ready for review. We have set up an internal milestone and will be
sending that early version as a tarball after we have it completed.
Would a meeting also be something that could help us stay on the same
page?

> It's good that you are in touch with the muser developers (via private
> discussion? I haven't seen much activity on #muser IRC).

We use IRC (I know Jag got some answers there) and github for issues (one
of which was addressed). We are hoping to get the conversation going over
email.

JJ, Jag and Elena

> Stefan
Heya,

> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>
> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>> On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
>>> At this moment we are working on the first stage of the project with the
>>> help of the Nutanix developers. The questions we have gathered so far
>>> will be addressed with the muser and Qemu developers after we finish the
>>> first stage and make sure we understand what it will take for us to
>>> move onto the next stage.
>>>
>>> We will also incorporate the relevant review from Stefan that he
>>> provided on series 4 of the patchset. Thank you, Stefan.
>>>
>>> If anyone has any further suggestions or questions about the status,
>>> please reply to this email.
>>
>> Hi Elena,
>> At KVM Forum we discussed spending 1 or 2 weeks trying out muser. A few
>> weeks have passed and from your email it sounds like this "next stage"
>> might be a lot of work.
>>
>
> Hi Stefan
>
> Perhaps we were not too clear about our work in the previous email. Our
> assumption was that the question that came from KVM Forum was whether
> muser can be used to achieve the same as what we have now. We should
> have answered a clear yes to this question. We have not yet discovered
> major road blocks.
> At the moment, we are mostly engaged in learning the code and discussing
> the design, plus some coding to answer specific questions.
> We understand that the best way to make progress is to work with the
> upstream community from the early stages; we agree with this and will
> present the proposal shortly for discussion.
>
>> Is there a work-in-progress muser patch series you can post to start the
>> discussion early? That way we can avoid reviewers like myself asking
>> you to make changes after you have invested a lot of time.
>>
>
> Absolutely, that is our plan. At the moment we do not have the patches
> ready for review. We have set up an internal milestone and will be
> sending that early version as a tarball after we have it completed.
> Would a meeting also be something that could help us stay on the same
> page?

Please loop us in if you do set up a meeting.

>
>> It's good that you are in touch with the muser developers (via private
>> discussion? I haven't seen much activity on #muser IRC).
>>
>
> We use IRC (I know Jag got some answers there) and github for issues
> (one of which was addressed).

I thought there was only the one. Let us know if you run into any other
bugs. We are looking forward to hearing about people's experience and
addressing issues that come with uses we didn't foresee or test.

Cheers,
Felipe

> We are hoping to get the conversation going over
> email.
>
> JJ, Jag and Elena
>> Stefan
On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >> Is there a work-in-progress muser patch series you can post to start the
> >> discussion early? That way we can avoid reviewers like myself asking
> >> you to make changes after you have invested a lot of time.
> >>
> >
> > Absolutely, that is our plan. At the moment we do not have the patches
> > ready for review. We have set up an internal milestone and will be
> > sending that early version as a tarball after we have it completed.
> > Would a meeting also be something that could help us stay on the same
> > page?
>
> Please loop us in if you do set up a meeting.

There is a bi-weekly KVM Community Call that we can use for phone
discussions:

https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

Or we can schedule a one-off call at any time :).

Questions I've seen when discussing muser with people have been:

1. Can unprivileged containers create muser devices? If not, this is a
   blocker for use cases that want to avoid root privileges entirely.

2. Does muser need to be in the kernel (e.g. slower to develop/ship,
   security reasons)? A similar library could be implemented in
   userspace along the lines of the vhost-user protocol. Although VMMs
   would then need to use a new libmuser-client library instead of
   reusing their VFIO code to access the device.

3. Should this feature be Linux-only? vhost-user can be implemented on
   non-Linux OSes...

Stefan
> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>>> Is there a work-in-progress muser patch series you can post to start the
>>>> discussion early? That way we can avoid reviewers like myself asking
>>>> you to make changes after you have invested a lot of time.
>>>>
>>>
>>> Absolutely, that is our plan. At the moment we do not have the patches
>>> ready for review. We have set up an internal milestone and will be
>>> sending that early version as a tarball after we have it completed.
>>> Would a meeting also be something that could help us stay on the same
>>> page?
>>
>> Please loop us in if you do set up a meeting.
>
> Sounds good either way, whenever it's needed.
>
> Questions I've seen when discussing muser with people have been:
>
> 1. Can unprivileged containers create muser devices? If not, this is a
>    blocker for use cases that want to avoid root privileges entirely.

Yes you can. Muser device creation follows the same process as general
mdev device creation (ie. you write to a sysfs path). That creates an
entry in /dev/vfio and the control plane can further drop privileges
there (set selinux contexts, &c.)

> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>    security reasons)? A similar library could be implemented in
>    userspace along the lines of the vhost-user protocol. Although VMMs
>    would then need to use a new libmuser-client library instead of
>    reusing their VFIO code to access the device.

Doing it in userspace was the flow we proposed back in last year's KVM
Forum (Edinburgh), but it got turned down. That's why we pursued the
kernel approach, which turned out to have some advantages:
- No changes needed to Qemu
- No Qemu needed at all for userspace drivers
- Device emulation process restart is trivial
  (it therefore makes device code upgrades much easier)

Having said that, nothing stops us from enhancing libmuser to talk
directly to Qemu (for the Qemu case). I envision at least two ways of
doing that:
- Hooking up libmuser with Qemu directly (eg. over a unix socket)
- Hooking Qemu with CUSE and implementing the muser.ko interface

For the latter, libmuser would talk to a character device just like it
talks to the vfio character device. We "just" need to implement that
backend in Qemu. :)

> 3. Should this feature be Linux-only? vhost-user can be implemented on
>    non-Linux OSes...

The userspace approach discussed above certainly can be more portable.
Currently, muser depends on MDEV+VFIO and that's where the restriction
comes from.

F.

> Stefan
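For readers unfamiliar with mdev, the creation step Felipe describes
amounts to writing a UUID into a sysfs "create" node. A minimal C sketch
follows; the exact sysfs path and the "muser-1" type name are assumptions
made for illustration (check the muser documentation for the real layout):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* ASSUMPTION: this path mirrors the generic mdev pattern; it has
     * not been verified against the muser tree. */
    const char *create_path =
        "/sys/class/muser/muser/mdev_supported_types/muser-1/create";
    const char *uuid = "00000000-0000-0000-0000-000000000001";

    int fd = open(create_path, O_WRONLY);
    if (fd < 0) {
        perror("open"); /* this is the step that needs privileges today */
        return 1;
    }
    if (write(fd, uuid, strlen(uuid)) < 0) {
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);

    /* On success a VFIO group node appears under /dev/vfio/, whose
     * ownership and labels the control plane can then restrict. */
    printf("created mdev instance %s\n", uuid);
    return 0;
}

Note that the open() is what fails for an unprivileged caller, which is
exactly the setup step Stefan's question 1 is probing at.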
On 17/12/19 23:57, Felipe Franciosi wrote:
> Doing it in userspace was the flow we proposed back in last year's KVM
> Forum (Edinburgh), but it got turned down.

I think the time since then has shown that essentially the cat is out of
the bag. I didn't really like the idea of devices outside QEMU---and I
still don't---but if something like "VFIO over AF_UNIX" turns out to be
the cleanest way to implement multi-process QEMU device models, I am not
going to pull an RMS and block that from happening. Assuming I could even
do so!

Paolo
On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > Questions I've seen when discussing muser with people have been:
> >
> > 1. Can unprivileged containers create muser devices? If not, this is a
> >    blocker for use cases that want to avoid root privileges entirely.
>
> Yes you can. Muser device creation follows the same process as general
> mdev device creation (ie. you write to a sysfs path). That creates an
> entry in /dev/vfio and the control plane can further drop privileges
> there (set selinux contexts, &c.)

In this case there is still a privileged step during setup. What about
completely unprivileged scenarios like a regular user without root or a
rootless container?

> > 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >    security reasons)? A similar library could be implemented in
> >    userspace along the lines of the vhost-user protocol. Although VMMs
> >    would then need to use a new libmuser-client library instead of
> >    reusing their VFIO code to access the device.
>
> Doing it in userspace was the flow we proposed back in last year's KVM
> Forum (Edinburgh), but it got turned down. That's why we pursued the
> kernel approach, which turned out to have some advantages:
> - No changes needed to Qemu
> - No Qemu needed at all for userspace drivers
> - Device emulation process restart is trivial
>   (it therefore makes device code upgrades much easier)
>
> Having said that, nothing stops us from enhancing libmuser to talk
> directly to Qemu (for the Qemu case). I envision at least two ways of
> doing that:
> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> - Hooking Qemu with CUSE and implementing the muser.ko interface
>
> For the latter, libmuser would talk to a character device just like it
> talks to the vfio character device. We "just" need to implement that
> backend in Qemu. :)

What about:

* libmuser's API stays mostly unchanged but the library speaks a
  VFIO-over-UNIX domain sockets protocol instead of talking to
  mdev/vfio in the host kernel.

* VMMs can implement this protocol directly for POSIX-portable and
  unprivileged operation.

* A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
  still take advantage of libmuser devices.

Assuming this is feasible, would you lose any important
features/advantages of the muser.ko approach? I don't know enough about
VFIO to identify any blockers or obvious performance problems.

Regarding recovery, it seems straightforward to keep state in a tmpfs
file that can be reopened when the device is restarted. I don't think
kernel code is necessary?

Stefan
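To illustrate the recovery idea in Stefan's last paragraph, here is a
minimal sketch assuming POSIX shared memory (tmpfs-backed on Linux; link
with -lrt on older glibc). The state file name and its layout are invented
for the example; it only demonstrates that a restarted emulator can reopen
and remap its previous state without kernel help:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE 4096 /* arbitrary size for the example */

int main(void)
{
    /* O_CREAT without O_EXCL: the first start creates the state file,
     * while a restarted emulator silently reattaches to the old one. */
    int fd = shm_open("/demo-device-state", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, STATE_SIZE) < 0) {
        perror("shm state file");
        return 1;
    }

    char *state = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (state == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    if (state[0] == '\0') {
        strcpy(state, "initialized"); /* fresh file is zero-filled */
    } else {
        printf("recovered state: %s\n", state); /* after a restart */
    }

    strcpy(state, "running"); /* persists across process exit */
    munmap(state, STATE_SIZE);
    close(fd);
    return 0;
}

Running the program twice shows the second instance recovering the state
left by the first, which is the crash/upgrade scenario being debated.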
Hello,

(I've added Jim and Ben from the SPDK team to the thread.)

> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>> Questions I've seen when discussing muser with people have been:
>>>
>>> 1. Can unprivileged containers create muser devices? If not, this is a
>>>    blocker for use cases that want to avoid root privileges entirely.
>>
>> Yes you can. Muser device creation follows the same process as general
>> mdev device creation (ie. you write to a sysfs path). That creates an
>> entry in /dev/vfio and the control plane can further drop privileges
>> there (set selinux contexts, &c.)
>
> In this case there is still a privileged step during setup. What about
> completely unprivileged scenarios like a regular user without root or a
> rootless container?

Oh, I see what you are saying. I suppose we need to investigate adjusting
the privileges of the sysfs path correctly beforehand to allow devices to
be created by non-root users. The credentials used on creation should be
reflected on the vfio endpoint (ie. /dev/vfio/<group>).

I need to look into that and get back to you.

>>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>>>    security reasons)? A similar library could be implemented in
>>>    userspace along the lines of the vhost-user protocol. Although VMMs
>>>    would then need to use a new libmuser-client library instead of
>>>    reusing their VFIO code to access the device.
>>
>> Doing it in userspace was the flow we proposed back in last year's KVM
>> Forum (Edinburgh), but it got turned down. That's why we pursued the
>> kernel approach, which turned out to have some advantages:
>> - No changes needed to Qemu
>> - No Qemu needed at all for userspace drivers
>> - Device emulation process restart is trivial
>>   (it therefore makes device code upgrades much easier)
>>
>> Having said that, nothing stops us from enhancing libmuser to talk
>> directly to Qemu (for the Qemu case). I envision at least two ways of
>> doing that:
>> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
>> - Hooking Qemu with CUSE and implementing the muser.ko interface
>>
>> For the latter, libmuser would talk to a character device just like it
>> talks to the vfio character device. We "just" need to implement that
>> backend in Qemu. :)
>
> What about:
> * libmuser's API stays mostly unchanged but the library speaks a
>   VFIO-over-UNIX domain sockets protocol instead of talking to
>   mdev/vfio in the host kernel.

As I said above, there are advantages to the kernel model. The key one is
transparent device emulation restarts. Today, muser.ko keeps the "device
memory" internally in a prefix tree. Upon restart, a new device emulator
can recover state (eg. from a state file in /dev/shm or similar) and remap
the same memory that is already configured to the guest via Qemu. We have
a pending work item for muser.ko to also keep the eventfds so we can
recover those, too.

Another advantage is working with any userspace driver and not requiring a
VMM at all.

If done entirely in userspace, the device emulator needs to allocate the
device memory somewhere that remains accessible (eg. tmpfs), with the
difference that now we may be talking about non-trivial amounts of memory.
Also, that may not be the kind of content you want lingering around the
filesystem (for the same reasons Qemu unlinks memory files from
/dev/hugepages after mmap'ing it).

That's why I'd prefer to rephrase what you said to "in addition" instead
of "instead".

> * VMMs can implement this protocol directly for POSIX-portable and
>   unprivileged operation.
> * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
>   still take advantage of libmuser devices.

I'm happy with that. We need to think the credential aspect through to
ensure nodes can be created in the right places with the right privileges.

> Assuming this is feasible, would you lose any important
> features/advantages of the muser.ko approach? I don't know enough about
> VFIO to identify any blockers or obvious performance problems.

That's what I elaborated above: the fact that muser.ko can keep various
metadata (and other resources) about the device in the kernel and grant it
back to userspace as needed. There are ways around it, but it requires
some orchestration with tmpfs and the VMM (only so much can be kept in
tmpfs; the eventfds need to be retransmitted from the machine emulator on
request).

Restarting is a critical aspect of this. One key use case for the project
is to be able to emulate various devices from one process (for polling).
That must be able to restart for upgrades or recovery.

> Regarding recovery, it seems straightforward to keep state in a tmpfs
> file that can be reopened when the device is restarted. I don't think
> kernel code is necessary?

It adds a dependency, but isn't a show stopper. If we can work through the
permission issues, making sure the VMM can reconnect and retransmit
eventfds and other state, then it should be ok.

To be clear: I'm very happy to have a userspace-only option for this, I
just don't want to ditch the kernel module (yet, anyway). :)

F.

> Stefan
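The eventfd retransmission Felipe mentions is plain SCM_RIGHTS file
descriptor passing over a UNIX socket. The following self-contained sketch
(illustrative only; not code from muser or this series) sends an eventfd
across a socketpair the way a VMM could hand interrupt eventfds back to a
restarted device emulator:

#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_fd(int sock, int fd)
{
    char data = 'F'; /* at least one byte of real data must accompany it */
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS; /* kernel duplicates the fd for the peer */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

int main(void)
{
    int sv[2];
    int efd;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    efd = eventfd(0, 0); /* stands in for an irqfd/ioeventfd */
    if (efd < 0 || send_fd(sv[0], efd) < 0) {
        perror("send eventfd");
        return 1;
    }

    printf("eventfd %d handed over the UNIX socket\n", efd);
    close(efd);
    close(sv[0]);
    close(sv[1]);
    return 0;
}

A restarted emulator would perform the mirror-image recvmsg() and read the
descriptor out of CMSG_DATA(), which is why a userspace-only design needs
the VMM to stay reachable for this handshake.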
On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > > > > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > >>>> Is there a work-in-progress muser patch series you can post to start the > >>>> discussion early? That way we can avoid reviewers like myself asking > >>>> you to make changes after you have invested a lot of time. > >>>> > >>> > >>> Absolutely, that is our plan. At the moment we do not have the patches > >>> ready for the review. We have setup internally a milestone and will be > >>> sending that early version as a tarball after we have it completed. > >>> Would be also a meeting something that could help us to stay on the same > >>> page? > >> > >> Please loop us in if you so set up a meeting. > > > > There is a bi-weekly KVM Community Call that we can use for phone > > discussions: > > > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ > > > > Or we can schedule a one-off call at any time :). > > Sounds good either way, whenever it's needed. > > > > > Questions I've seen when discussing muser with people have been: > > > > 1. Can unprivileged containers create muser devices? If not, this is a > > blocker for use cases that want to avoid root privileges entirely. > > Yes you can. Muser device creation follows the same process as general > mdev device creation (ie. you write to a sysfs path). That creates an > entry in /dev/vfio and the control plane can further drop privileges > there (set selinux contexts, &c.) This isn't what I'd describe / consider as unprivileged, as AFAICT although QEMU can use it unprivileged, this still requires a privileged management process to do the setup in sysfs. I think it is desirable to be able support a fully unprivileged model where there is nothing requiring elevated privileges, neither libvirtd or QEMU. I think this basically ends up at the same requirement as support for non-Linux hosts. We have to assume that some desirable deployment scenarios will not be able to use Linux kernel features, either because they lack privileges, or are simply non-Linux hosts. > > 2. Does muser need to be in the kernel (e.g. slower to develop/ship, > > security reasons)? A similar library could be implemented in > > userspace along the lines of the vhost-user protocol. Although VMMs > > would then need to use a new libmuser-client library instead of > > reusing their VFIO code to access the device. > > Doing it in userspace was the flow we proposed back in last year's KVM > Forum (Edinburgh), but it got turned down. That's why we procured the > kernel approach, which turned out to have some advantages: > - No changes needed to Qemu > - No Qemu needed at all for userspace drivers > - Device emulation process restart is trivial > (it therefore makes device code upgrades much easier) > > Having said that, nothing stops us from enhancing libmuser to talk > directly to Qemu (for the Qemu case). I envision at least two ways of > doing that: > - Hooking up libmuser with Qemu directly (eg. over a unix socket) A UNIX socket, or localhost TCP socket, sounds most appealing from a a portability POV. 
> - Hooking Qemu with CUSE and implementing the muser.ko interface Perhaps I'm misunderstanding, but wouldn't a CUSE interface still have issues with something needing to be privileged to do the initial setup, and also still lack OS portability? > For the latter, libmuser would talk to a character device just like it > talks to the vfio character device. We "just" need to implement that > backend in Qemu. :) > > > > > 3. Should this feature be Linux-only? vhost-user can be implemented on > > non-Linux OSes... > > The userspace approach discussed above certainly can be more portable. > Currently, muser depends on MDEV+VFIO and that's where the restriction > comes from. Regards, Daniel
On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: > Hello, > > (I've added Jim and Ben from the SPDK team to the thread.) > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > >>> Questions I've seen when discussing muser with people have been: > >>> > >>> 1. Can unprivileged containers create muser devices? If not, this is a > >>> blocker for use cases that want to avoid root privileges entirely. > >> > >> Yes you can. Muser device creation follows the same process as general > >> mdev device creation (ie. you write to a sysfs path). That creates an > >> entry in /dev/vfio and the control plane can further drop privileges > >> there (set selinux contexts, &c.) > > > > In this case there is still a privileged step during setup. What about > > completely unprivileged scenarios like a regular user without root or a > > rootless container? > > Oh, I see what you are saying. I suppose we need to investigate > adjusting the privileges of the sysfs path correctly beforehand to > allow devices to be created by non-root users. The credentials used on > creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>). > > I need to look into that and get back to you. > > > > >>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship, > >>> security reasons)? A similar library could be implemented in > >>> userspace along the lines of the vhost-user protocol. Although VMMs > >>> would then need to use a new libmuser-client library instead of > >>> reusing their VFIO code to access the device. > >> > >> Doing it in userspace was the flow we proposed back in last year's KVM > >> Forum (Edinburgh), but it got turned down. That's why we procured the > >> kernel approach, which turned out to have some advantages: > >> - No changes needed to Qemu > >> - No Qemu needed at all for userspace drivers > >> - Device emulation process restart is trivial > >> (it therefore makes device code upgrades much easier) > >> > >> Having said that, nothing stops us from enhancing libmuser to talk > >> directly to Qemu (for the Qemu case). I envision at least two ways of > >> doing that: > >> - Hooking up libmuser with Qemu directly (eg. over a unix socket) > >> - Hooking Qemu with CUSE and implementing the muser.ko interface > >> > >> For the latter, libmuser would talk to a character device just like it > >> talks to the vfio character device. We "just" need to implement that > >> backend in Qemu. :) > > > > What about: > > * libmuser's API stays mostly unchanged but the library speaks a > > VFIO-over-UNIX domain sockets protocol instead of talking to > > mdev/vfio in the host kernel. > > As I said above, there are advantages to the kernel model. The key one > is transparent device emulation restarts. Today, muser.ko keeps the > "device memory" internally in a prefix tree. Upon restart, a new > device emulator can recover state (eg. from a state file in /dev/shm > or similar) and remap the same memory that is already configured to > the guest via Qemu. We have a pending work item for muser.ko to also > keep the eventfds so we can recover those, too.
Another advantage is > working with any userspace driver and not requiring a VMM at all. > > If done entirely in userspace, the device emulator needs to allocate > the device memory somewhere that remains accessible (eg. tmpfs), with > the difference that now we may be talking about non-trivial amounts of > memory. Also, that may not be the kind of content you want lingering > around the filesystem (for the same reasons Qemu unlinks memory files > from /dev/hugepages after mmap'ing it). > > That's why I'd prefer to rephrase what you said to "in addition" > instead of "instead". > > > * VMMs can implement this protocol directly for POSIX-portable and > > unprivileged operation. > > * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can > > still take advantage of libmuser devices. > > I'm happy with that. > We need to think the credential aspect throughout to ensure nodes can > be created in the right places with the right privileges. > > > > > Assuming this is feasible, would you lose any important > > features/advantages of the muser.ko approach? I don't know enough about > > VFIO to identify any blocker or obvious performance problems. > > That's what I elaborated above. The fact that muser.ko can keep > various metadata (and other resources) about the device in the kernel > and grant it back to userspace as needed. There are ways around it, > but it requires some orchestration with tmpfs and the VMM (only so > much can be kept in tmpfs; the eventfds need to be retransmitted from > the machine emulator on request). > > Restarting is a critical aspect of this. One key use case for the > project is to be able to emulate various devices from one process (for > polling). That must be able to restart for upgrades or recovery. > > > > > Regarding recovery, it seems straightforward to keep state in a tmpfs > > file that can be reopened when the device is restarted. I don't think > > kernel code is necessary? > > It adds a dependency, but isn't a show stopper. If we can work through > permission issues, making sure the VMM can reconnect and retransmit > eventfds and other state, then it should be ok. > > To be clear: I'm very happy to have a userspace-only option for this, > I just don't want to ditch the kernel module (yet, anyway). :) If it doesn't create too large of a burden to support both, then I think it is very desirable. IIUC, this is saying a kernel based solution is the optimized/optimal solution, and a userspace UNIX socket based option is the generic "works everywhere" fallback solution. Regards, Daniel
On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote: > On 17/12/19 23:57, Felipe Franciosi wrote: > > Doing it in userspace was the flow we proposed back in last year's KVM > > Forum (Edinburgh), but it got turned down. > > I think the time since then has shown that essentially the cat is out of > the bag. I didn't really like the idea of devices outside QEMU---and I > still don't---but if something like "VFIO over AF_UNIX" turns out to be > the cleanest way to implement multi-process QEMU device models, I am not > going to pull an RMS and block that from happening. Assuming I could > even do so! There are a range of approaches that will influence how out-of-process devices can be licensed and distributed. A VFIO-over-UNIX domain sockets approach means a stable API so that any license (including proprietary) is possible. Another approach is a QEMU-centric unstable protocol. I'll call this the qdev-over-UNIX domain sockets approach. Maintaining an out-of-tree device is expensive and ugly since the protocol changes between QEMU versions in ways that are incompatible and undetectable. On top of that, the initialization protocol message could include the QEMU version string that the device was compiled against. If the version string doesn't match then QEMU will refuse to talk to the device. Distributing a single device executable that works with many QEMUs (e.g. CentOS, Ubuntu) and versions becomes difficult. I want to mention that we have the option of doing this if there are strong concerns about out-of-tree devices. It does have downsides: 1. Inability to share devices with other VMMs. 2. Probably won't replace vhost-user due to the out-of-tree limitations. 3. Can still be circumvented by a motivated device author. Stefan
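For illustration, the version gating mentioned above could be as simple as a first protocol message carrying the compile-time version string; the message layout and names below are hypothetical:

  /* Sketch: refuse to talk to a device built against a different QEMU
   * version. The message format is made up for illustration. */
  #include <stdio.h>
  #include <string.h>

  #define BUILD_VERSION "4.2.0"          /* assumed compile-time version */

  struct hello_msg {
      char version[32];                  /* peer's compile-time version */
  };

  static int check_hello(const struct hello_msg *m)
  {
      if (strncmp(m->version, BUILD_VERSION, sizeof(m->version)) != 0) {
          fprintf(stderr, "device built for QEMU %.32s, refusing\n",
                  m->version);
          return -1;
      }
      return 0;
  }

  int main(void)
  {
      struct hello_msg m = { .version = "4.1.0" };   /* mismatching peer */
      return check_hello(&m) == 0 ? 0 : 1;
  }

Such an exact-match check is what makes the protocol deliberately unstable: any rebuild against a different QEMU version invalidates the device binary.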
On 12/19/2019 7:33 AM, Felipe Franciosi wrote: > Hello, > > (I've added Jim and Ben from the SPDK team to the thread.) > >> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: >>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: >>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: >>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: >>>> Questions I've seen when discussing muser with people have been: >>>> >>>> 1. Can unprivileged containers create muser devices? If not, this is a >>>> blocker for use cases that want to avoid root privileges entirely. >>> >>> Yes you can. Muser device creation follows the same process as general >>> mdev device creation (ie. you write to a sysfs path). That creates an >>> entry in /dev/vfio and the control plane can further drop privileges >>> there (set selinux contexts, &c.) >> >> In this case there is still a privileged step during setup. What about >> completely unprivileged scenarios like a regular user without root or a >> rootless container? > > Oh, I see what you are saying. I suppose we need to investigate > adjusting the privileges of the sysfs path correctly beforehand to > allow devices to be created by non-root users. The credentials used on > creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>). > > I need to look into that and get back to you. As a prerequisite to using the "vfio-pci" device in QEMU, the user assigns the PCI device on the host bus to the VFIO kernel driver by writing to "/sys/bus/pci/drivers/vfio-pci/new_id" and "/sys/bus/pci/drivers/vfio-pci/bind". I believe a privileged control plane is required to perform these prerequisite steps. Therefore, I wonder how rootless containers or unprivileged users currently go about using a VFIO device with QEMU/KVM. Thanks! -- Jag > >> >>>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship, >>>> security reasons)? A similar library could be implemented in >>>> userspace along the lines of the vhost-user protocol. Although VMMs >>>> would then need to use a new libmuser-client library instead of >>>> reusing their VFIO code to access the device. >>> >>> Doing it in userspace was the flow we proposed back in last year's KVM >>> Forum (Edinburgh), but it got turned down. That's why we procured the >>> kernel approach, which turned out to have some advantages: >>> - No changes needed to Qemu >>> - No Qemu needed at all for userspace drivers >>> - Device emulation process restart is trivial >>> (it therefore makes device code upgrades much easier) >>> >>> Having said that, nothing stops us from enhancing libmuser to talk >>> directly to Qemu (for the Qemu case). I envision at least two ways of >>> doing that: >>> - Hooking up libmuser with Qemu directly (eg. over a unix socket) >>> - Hooking Qemu with CUSE and implementing the muser.ko interface >>> >>> For the latter, libmuser would talk to a character device just like it >>> talks to the vfio character device. We "just" need to implement that >>> backend in Qemu. :) >> >> What about: >> * libmuser's API stays mostly unchanged but the library speaks a >> VFIO-over-UNIX domain sockets protocol instead of talking to >> mdev/vfio in the host kernel. > > As I said above, there are advantages to the kernel model. The key one > is transparent device emulation restarts.
Today, muser.ko keeps the > "device memory" internally in a prefix tree. Upon restart, a new > device emulator can recover state (eg. from a state file in /dev/shm > or similar) and remap the same memory that is already configured to > the guest via Qemu. We have a pending work item for muser.ko to also > keep the eventfds so we can recover those, too. Another advantage is > working with any userspace driver and not requiring a VMM at all. > > If done entirely in userspace, the device emulator needs to allocate > the device memory somewhere that remains accessible (eg. tmpfs), with > the difference that now we may be talking about non-trivial amounts of > memory. Also, that may not be the kind of content you want lingering > around the filesystem (for the same reasons Qemu unlinks memory files > from /dev/hugepages after mmap'ing it). > > That's why I'd prefer to rephrase what you said to "in addition" > instead of "instead". > >> * VMMs can implement this protocol directly for POSIX-portable and >> unprivileged operation. >> * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can >> still take advantage of libmuser devices. > > I'm happy with that. > We need to think the credential aspect throughout to ensure nodes can > be created in the right places with the right privileges. > >> >> Assuming this is feasible, would you lose any important >> features/advantages of the muser.ko approach? I don't know enough about >> VFIO to identify any blocker or obvious performance problems. > > That's what I elaborated above. The fact that muser.ko can keep > various metadata (and other resources) about the device in the kernel > and grant it back to userspace as needed. There are ways around it, > but it requires some orchestration with tmpfs and the VMM (only so > much can be kept in tmpfs; the eventfds need to be retransmitted from > the machine emulator on request). > > Restarting is a critical aspect of this. One key use case for the > project is to be able to emulate various devices from one process (for > polling). That must be able to restart for upgrades or recovery. > >> >> Regarding recovery, it seems straightforward to keep state in a tmpfs >> file that can be reopened when the device is restarted. I don't think >> kernel code is necessary? > > It adds a dependency, but isn't a show stopper. If we can work through > permission issues, making sure the VMM can reconnect and retransmit > eventfds and other state, then it should be ok. > > To be clear: I'm very happy to have a userspace-only option for this, > I just don't want to ditch the kernel module (yet, anyway). :) > > F. > >> >> Stefan >
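For concreteness, the privileged setup Jag describes (assigning a PCI device to the vfio-pci driver) boils down to a couple of sysfs writes; a rough sketch follows, where the 1234:5678 vendor/device ID and the 0000:01:00.0 address are placeholders and the program has to run with enough privilege to write to those sysfs files:

  /* Sketch: bind a host PCI device to vfio-pci via the sysfs paths
   * named above. IDs and the device address are placeholders. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);
      if (fd < 0) {
          return -1;
      }
      ssize_t n = write(fd, val, strlen(val));
      close(fd);
      return n < 0 ? -1 : 0;
  }

  int main(void)
  {
      /* Writing new_id makes vfio-pci claim matching unbound devices;
       * the explicit bind covers a device that is already known to the
       * driver but currently unbound. */
      if (write_str("/sys/bus/pci/drivers/vfio-pci/new_id", "1234 5678") < 0)
          perror("new_id");
      if (write_str("/sys/bus/pci/drivers/vfio-pci/bind", "0000:01:00.0") < 0)
          perror("bind");
      return 0;
  }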
On Thu, Dec 19, 2019 at 12:50:21PM +0000, Daniel P. Berrangé wrote: > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > > > > > > > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > >>>> Is there a work-in-progress muser patch series you can post to start the > > >>>> discussion early? That way we can avoid reviewers like myself asking > > >>>> you to make changes after you have invested a lot of time. > > >>>> > > >>> > > >>> Absolutely, that is our plan. At the moment we do not have the patches > > >>> ready for the review. We have setup internally a milestone and will be > > >>> sending that early version as a tarball after we have it completed. > > >>> Would be also a meeting something that could help us to stay on the same > > >>> page? > > >> > > >> Please loop us in if you so set up a meeting. > > > > > > There is a bi-weekly KVM Community Call that we can use for phone > > > discussions: > > > > > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ > > > > > > Or we can schedule a one-off call at any time :). > > > > Sounds good either way, whenever it's needed. > > > > > > > > Questions I've seen when discussing muser with people have been: > > > > > > 1. Can unprivileged containers create muser devices? If not, this is a > > > blocker for use cases that want to avoid root privileges entirely. > > > > Yes you can. Muser device creation follows the same process as general > > mdev device creation (ie. you write to a sysfs path). That creates an > > entry in /dev/vfio and the control plane can further drop privileges > > there (set selinux contexts, &c.) > > This isn't what I'd describe / consider as unprivileged, as AFAICT > although QEMU can use it unprivileged, this still requires a privileged > management process to do the setup in sysfs. > > I think it is desirable to be able support a fully unprivileged > model where there is nothing requiring elevated privileges, neither > libvirtd or QEMU. > > I think this basically ends up at the same requirement as support > for non-Linux hosts. We have to assume that some desirable deployment > scenarios will not be able to use Linux kernel features, either because > they lack privileges, or are simply non-Linux hosts. > > > > 2. Does muser need to be in the kernel (e.g. slower to develop/ship, > > > security reasons)? A similar library could be implemented in > > > userspace along the lines of the vhost-user protocol. Although VMMs > > > would then need to use a new libmuser-client library instead of > > > reusing their VFIO code to access the device. > > > > Doing it in userspace was the flow we proposed back in last year's KVM > > Forum (Edinburgh), but it got turned down. That's why we procured the > > kernel approach, which turned out to have some advantages: > > - No changes needed to Qemu > > - No Qemu needed at all for userspace drivers > > - Device emulation process restart is trivial > > (it therefore makes device code upgrades much easier) > > > > Having said that, nothing stops us from enhancing libmuser to talk > > directly to Qemu (for the Qemu case). I envision at least two ways of > > doing that: > > - Hooking up libmuser with Qemu directly (eg. 
over a unix socket) > > A UNIX socket, or localhost TCP socket, sounds most appealing from a > a portability POV. Felipe reminded me on IRC that muser needs FD passing, so a TCP localhost socket is not an option. So a UNIX socket would give us portability to any platform, except for Windows. It is not the end of the world to lack Windows support. Regards, Daniel
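For reference, the fd passing in question is done with SCM_RIGHTS control messages, which only UNIX domain sockets support (this is what rules out TCP); a minimal send-side sketch:

  /* Sketch: send one file descriptor (e.g. an eventfd or a guest
   * memory fd) over a connected UNIX domain socket. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  static int send_fd(int sock, int fd)
  {
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
      c->cmsg_level = SOL_SOCKET;
      c->cmsg_type = SCM_RIGHTS;    /* kernel dups the fd into the peer */
      c->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(c), &fd, sizeof(int));
      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }

  int main(void)
  {
      int sv[2];
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
          return 1;
      }
      return send_fd(sv[0], 0) == 0 ? 0 : 1;   /* demo: pass stdin */
  }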
On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: > On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > To be clear: I'm very happy to have a userspace-only option for this, > > I just don't want to ditch the kernel module (yet, anyway). :) > > If it doesn't create too large of a burden to support both, then I think > it is very desirable. IIUC, this is saying a kernel based solution as the > optimized/optimal solution, and userspace UNIX socket based option as the > generic "works everywhere" fallback solution. I'm slightly in favor of the kernel implementation because it keeps us better aligned with VFIO. That means solving problems in one place only and less reinventing the wheel. Knowing that a userspace implementation is possible is a plus though. Maybe that option will become attractive in the future and someone will develop it. In fact, a userspace implementation may be a cool Google Summer of Code project idea that I'd like to co-mentor. Stefan
On 20/12/19 10:47, Stefan Hajnoczi wrote: >> If it doesn't create too large of a burden to support both, then I think >> it is very desirable. IIUC, this is saying a kernel based solution as the >> optimized/optimal solution, and userspace UNIX socket based option as the >> generic "works everywhere" fallback solution. > I'm slightly in favor of the kernel implementation because it keeps us > better aligned with VFIO. That means solving problems in one place only > and less reinventing the wheel. I think there are anyway going to be some differences with VFIO. For example, currently VFIO requires pinning user memory. Is that a limitation for muser too? If so, that would be a big disadvantage; if not, however, management tools need to learn that muser devices unlike other VFIO devices do not prevent overcommit. Paolo
On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote: > On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: > > On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: > > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > > > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > > To be clear: I'm very happy to have a userspace-only option for this, > > > I just don't want to ditch the kernel module (yet, anyway). :) > > > > If it doesn't create too large of a burden to support both, then I think > > it is very desirable. IIUC, this is saying a kernel based solution as the > > optimized/optimal solution, and userspace UNIX socket based option as the > > generic "works everywhere" fallback solution. > > I'm slightly in favor of the kernel implementation because it keeps us > better aligned with VFIO. That means solving problems in one place only > and less reinventing the wheel. > > Knowing that a userspace implementation is possible is a plus though. > Maybe that option will become attractive in the future and someone will > develop it. In fact, a userspace implementation may be a cool Google > Summer of Code project idea that I'd like to co-mentor. If it is technically viable as an approach, then I think we should be treating a fully unprivileged muser-over-UNIX socket as a higher priority than just "maybe a GSoC student will want to do it". Libvirt is getting a strong message from the KubeVirt project that they want to be running both libvirtd and QEMU fully unprivileged. This allows their containers to be unprivileged. Anything that requires privileges means jumping through extra hoops writing custom code in KubeVirt to do things outside libvirt in side-loaded privileged containers, and this limits where those features can be used. Regards, Daniel
> On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 20/12/19 10:47, Stefan Hajnoczi wrote: >>> If it doesn't create too large of a burden to support both, then I think >>> it is very desirable. IIUC, this is saying a kernel based solution as the >>> optimized/optimal solution, and userspace UNIX socket based option as the >>> generic "works everywhere" fallback solution. >> I'm slightly in favor of the kernel implementation because it keeps us >> better aligned with VFIO. That means solving problems in one place only >> and less reinventing the wheel. > > I think there are anyway going to be some differences with VFIO. > > For example, currently VFIO requires pinning user memory. Is that a > limitation for muser too? If so, that would be a big disadvantage; if > not, however, management tools need to learn that muser devices unlike > other VFIO devices do not prevent overcommit. More or less. We pin them today, but I think we don't really have to. I created an issue to look into it: https://github.com/nutanix/muser/issues/28 In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that has been ballooned out, then we would release it. The reason we keep it pinned is to support libmuser restarts. IIRC, VFIO doesn't need to pin pages for mdev devices (that's the job of the mdev driver on the other end via vfio_pin_pages()). It only keeps the DMA entries in a RB tree. If my understanding is right, then we can probably just keep the map Qemu registered (without holding the pages) and call vfio_pin_pages() on demand when libmuser restarts. For context, this is how the DMA memory registration works today: 1) Qemu calls ioctl(vfio_fd, IOMMU_MAP_DMA, &vm_map); 2) The iommu driver notifies muser.ko 3) Muser.ko pins the pages (in get_dma_map(), called from below) (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711) 4) Muser.ko notifies libmuser about the memory registration (The iommu driver context goes to sleep, hence the pinning) 5) Libmuser wakes up and calls mmap() on muser.ko 6) Muser.ko inserts the VM memory in libmuser's context (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543) 7) Libmuser tells muser.ko that it's done 8) Muser.ko iommu callback context that was sleeping wakes up 9) Muser.ko places the memory in a "dma_list" for the mudev and returns. We could potentially modify the last step to unpin and keep only what we need for a future call to vfio_pin_pages(), but I need to check if that works. Cheers, Felipe > > Paolo >
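For reference, step 1 of the flow above is the standard VFIO type1 DMA map call from <linux/vfio.h>; roughly, on the VMM side it looks like the sketch below (the container fd and addresses are placeholders, and error handling is omitted):

  /* Sketch: tell the VFIO container that guest RAM mapped at 'vaddr'
   * in the VMM's address space is visible to the device at 'iova'.
   * With muser, this is the call that ends up in muser.ko (steps 2-9). */
  #include <linux/vfio.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>

  static int map_guest_ram(int container_fd, void *vaddr,
                           uint64_t iova, uint64_t size)
  {
      struct vfio_iommu_type1_dma_map map;

      memset(&map, 0, sizeof(map));
      map.argsz = sizeof(map);
      map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
      map.vaddr = (uint64_t)(uintptr_t)vaddr;
      map.iova  = iova;
      map.size  = size;
      return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }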
On Fri, 20 Dec 2019 14:14:33 +0000 Felipe Franciosi <felipe@nutanix.com> wrote: > > On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > On 20/12/19 10:47, Stefan Hajnoczi wrote: > >>> If it doesn't create too large of a burden to support both, then I think > >>> it is very desirable. IIUC, this is saying a kernel based solution as the > >>> optimized/optimal solution, and userspace UNIX socket based option as the > >>> generic "works everywhere" fallback solution. > >> I'm slightly in favor of the kernel implementation because it keeps us > >> better aligned with VFIO. That means solving problems in one place only > >> and less reinventing the wheel. > > > > I think there are anyway going to be some differences with VFIO. > > > > For example, currently VFIO requires pinning user memory. Is that a > > limitation for muser too? If so, that would be a big disadvantage; if > > not, however, management tools need to learn that muser devices unlike > > other VFIO devices do not prevent overcommit. > > More or less. We pin them today, but I think we don't really have to. > I created an issue to look into it: > https://github.com/nutanix/muser/issues/28 > > In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that > has been ballooned out, then we would release it. That's exactly the problem with ballooning and vfio, it doesn't unmap memory, it just zaps it out of the VM, to be demand faulted back in later. It's very vCPU-centric. Memory hotplug is the only case where we'll see a memory region get unmapped. > The reason we keep it pinned is to support libmuser restarts. IIRC, > VFIO doesn't need to pin pages for mdev devices (that's the job of the > mdev driver on the other end via vfio_pin_pages()). It only keeps the > DMA entries in a RB tree. > > If my understanding is right, then we can probably just keep the map > Qemu registered (without holding the pages) and call vfio_pin_pages() > on demand when libmuser restarts. > > For context, this is how the DMA memory registration works today: > > 1) Qemu calls ioctl(vfio_fd, IOMMU_MAP_DMA, &vm_map); > > 2) The iommu driver notifies muser.ko > > 3) Muser.ko pins the pages (in get_dma_map(), called from below) > (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711) Yikes, it pins every page?? vfio_pin_pages() intends for the vendor driver to be much smarter than this :-\ Thanks, Alex > 4) Muser.ko notifies libmuser about the memory registration > (The iommu driver context goes to sleep, hence the pinning) > > 5) Libmuser wakes up and calls mmap() on muser.ko > > 6) Muser.ko inserts the VM memory in libmuser's context > (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543) > > 7) Libmuser tells muser.ko that it's done > > 8) Muser.ko iommu callback context that was sleeping wakes up > > 9) Muser.ko places the memory in a "dma_list" for the mudev and returns. > > We could potentially modify the last step to unpin and keep only what > we need for a future call to vfio_pin_pages(), but I need to check if > that works. > > Cheers, > Felipe > > > > > Paolo > > > >
Heya, > On Dec 20, 2019, at 3:25 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > > On Fri, 20 Dec 2019 14:14:33 +0000 > Felipe Franciosi <felipe@nutanix.com> wrote: > >>> On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: >>> >>> On 20/12/19 10:47, Stefan Hajnoczi wrote: >>>>> If it doesn't create too large of a burden to support both, then I think >>>>> it is very desirable. IIUC, this is saying a kernel based solution as the >>>>> optimized/optimal solution, and userspace UNIX socket based option as the >>>>> generic "works everywhere" fallback solution. >>>> I'm slightly in favor of the kernel implementation because it keeps us >>>> better aligned with VFIO. That means solving problems in one place only >>>> and less reinventing the wheel. >>> >>> I think there are anyway going to be some differences with VFIO. >>> >>> For example, currently VFIO requires pinning user memory. Is that a >>> limitation for muser too? If so, that would be a big disadvantage; if >>> not, however, management tools need to learn that muser devices unlike >>> other VFIO devices do not prevent overcommit. >> >> More or less. We pin them today, but I think we don't really have to. >> I created an issue to look into it: >> https://github.com/nutanix/muser/issues/28 >> >> In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that >> has been ballooned out, then we would release it. > > That's exactly the problem with ballooning and vfio, it doesn't unmap > memory, it just zaps it out of the VM, to be demand faulted back in > later. It's very vCPU-centric. Memory hotplug is the only case where > we'll see a memory region get unmapped. > >> The reason we keep it pinned is to support libmuser restarts. IIRC, >> VFIO doesn't need to pin pages for mdev devices (that's the job of the >> mdev driver on the other end via vfio_pin_pages()). It only keeps the >> DMA entries in a RB tree. >> >> If my understanding is right, then we can probably just keep the map >> Qemu registered (without holding the pages) and call vfio_pin_pages() >> on demand when libmuser restarts. >> >> For context, this is how the DMA memory registration works today: >> >> 1) Qemu calls ioctl(vfio_fd, IOMMU_MAP_DMA, &vm_map); >> >> 2) The iommu driver notifies muser.ko >> >> 3) Muser.ko pins the pages (in get_dma_map(), called from below) >> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711) > > Yikes, it pins every page?? vfio_pin_pages() intends for the vendor > driver to be much smarter than this :-\ Thanks, We can't afford a kernel round trip every time we need to translate GPAs, so that's how we solved it. There's an action item to pin in groups of 512 (which is the limit we saw in vfio_pin_pages()). Can you elaborate on the problems of the approach and whether there's something better we can do? F. > > Alex > >> 4) Muser.ko notifies libmuser about the memory registration >> (The iommu driver context goes to sleep, hence the pinning) >> >> 5) Libmuser wakes up and calls mmap() on muser.ko >> >> 6) Muser.ko inserts the VM memory in libmuser's context >> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543) >> >> 7) Libmuser tells muser.ko that it's done >> >> 8) Muser.ko iommu callback context that was sleeping wakes up >> >> 9) Muser.ko places the memory in a "dma_list" for the mudev and returns. >> >> We could potentially modify the last step to unpin and keep only what >> we need for a future call to vfio_pin_pages(), but I need to check if >> that works.
>> >> Cheers, >> Felipe >> >>> >>> Paolo
> On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote: >> On 17/12/19 23:57, Felipe Franciosi wrote: >>> Doing it in userspace was the flow we proposed back in last year's KVM >>> Forum (Edinburgh), but it got turned down. >> >> I think the time since then has shown that essentially the cat is out of >> the bag. I didn't really like the idea of devices outside QEMU---and I >> still don't---but if something like "VFIO over AF_UNIX" turns out to be >> the cleanest way to implement multi-process QEMU device models, I am not >> going to pull an RMS and block that from happening. Assuming I could >> even do so! > > There are a range of approaches that will influence how out-of-process > devices can be licensed and distributed. > > A VFIO-over-UNIX domain sockets approach means a stable API so that any > license (including proprietary) is possible. > > Another approach is a QEMU-centric unstable protocol. I'll call this > the qdev-over-UNIX domain sockets approach. Maintaining an out-of-tree > device is expensive and ugly since the protocol changes between QEMU > versions in ways that are incompatible and undetectable. > > On top of that, the initialization protocol message could include the > QEMU version string that the device was compiled against. If the > version string doesn't match then QEMU will refuse to talk to the > device. > This is very similar to our multi-process QEMU implementation before we looked into using muser. The differences are: We use one object per emulated device type in QEMU rather than having a single VFIO type that can masquerade as any PCI device. We don’t pin guest memory; we pass the QEMU file descriptors used to create guest memory to the emulation program, and it mmap()s them itself. (ala vhost-user). JJ > Distributing a single device executable that works with many QEMUs (e.g. > CentOS, Ubuntu) and versions becomes difficult. > > I want to mention that we have the option of doing this if there are > strong concerns about out-of-tree devices. It does have downsides: > 1. Inability to share devices with other VMMs. > 2. Probably won't replace vhost-user due to the out-of-tree limitations. > 3. Can still be circumvented by a motivated device author. > > Stefan
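A sketch of the receiving side of that scheme: the emulation process gets the guest-memory fd over the socket and maps it, with no pinning involved. The region's size and mmap offset are assumed to have been communicated separately (as vhost-user does):

  /* Sketch: receive a guest-memory fd over a UNIX domain socket and
   * mmap() it into the emulation process. */
  #include <stddef.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  static void *recv_and_map(int sock, size_t size, off_t offset)
  {
      char byte;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      if (recvmsg(sock, &msg, 0) != 1) {
          return NULL;
      }
      struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
      if (!c || c->cmsg_type != SCM_RIGHTS) {
          return NULL;
      }
      int memfd;
      memcpy(&memfd, CMSG_DATA(c), sizeof(int));
      /* No pinning: the emulator shares the very pages QEMU allocated. */
      return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                  memfd, offset);
  }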
On Fri, Dec 20, 2019 at 09:15:40AM -0800, John G Johnson wrote: > > On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote: > >> On 17/12/19 23:57, Felipe Franciosi wrote: > We don’t pin guest memory; we pass the QEMU file descriptors used to create > guest memory to the emulation program, and it mmap()s them itself. (ala > vhost-user). Does muser really require pinning? If yes, then it seems like a limitation that can be removed in the future. Stefan
On Fri, Dec 20, 2019 at 09:15:40AM -0800, John G Johnson wrote: > > On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote: > >> On 17/12/19 23:57, Felipe Franciosi wrote: > We don’t pin guest memory; we pass the QEMU file descriptors used to create > guest memory to the emulation program, and it mmap()s them itself. (ala > vhost-user). Please ignore my reply. I just saw pinning was discussed in another sub-thread. Felipe posted this URL for tracking the issue: https://github.com/nutanix/muser/issues/28 Stefan
On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote: > On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote: > > On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: > > > On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: > > > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > > > > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > > > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > > > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > > > To be clear: I'm very happy to have a userspace-only option for this, > > > > I just don't want to ditch the kernel module (yet, anyway). :) > > > > > > If it doesn't create too large of a burden to support both, then I think > > > it is very desirable. IIUC, this is saying a kernel based solution as the > > > optimized/optimal solution, and userspace UNIX socket based option as the > > > generic "works everywhere" fallback solution. > > > > I'm slightly in favor of the kernel implementation because it keeps us > > better aligned with VFIO. That means solving problems in one place only > > and less reinventing the wheel. > > > > Knowing that a userspace implementation is possible is a plus though. > > Maybe that option will become attractive in the future and someone will > > develop it. In fact, a userspace implementation may be a cool Google > > Summer of Code project idea that I'd like to co-mentor. > > If it is technically viable as an approach, then I think we should be > treating a fully unprivileged muser-over-UNIX socket as a higher priority > than just "maybe a GSoC student will want todo it". > > Libvirt is getting strong message from KubeVirt project that they want to > be running both libvirtd and QEMU fully unprivileged. This allows their > containers to be unprivileged. Anything that requires privileges requires > jumping through extra hoops writing custom code in KubeVirt to do things > outside libvirt in side loaded privileged containers and this limits how > where those features can be used. Okay this makes sense. There needs to be a consensus on whether to go with a qdev-over-socket approach that is QEMU-specific and strongly discourages third-party device distribution or a muser-over-socket approach that offers a stable API for VMM interoperability and third-party device distribution. Interoperability between VMMs and also DPDK/SPDK is important because they form today's open source virtualization community. No one project or codebase covers all use cases or interesting developments. If we are short-sighted and prevent collaboration then we'll become isolated. On the other hand, I'm personally opposed to proprietary vendors that contribute very little to open source. We make that easier by offering a stable API for third-party devices. A stable API discourages open source contributions while allowing proprietary vendors to benefit from the work that the open source community is doing. One way to choose a position is to balance up the open source vs proprietary applications of a stable API. At this point in time I think the DPDK/SPDK and rust-vmm communities bring enough to the table that it's worth fostering collaboration through a stable API. 
The benefit of having the stable API is large enough that the disadvantage of making life easier for proprietary vendors can be accepted. This is just a more elaborate explanation for the "the cat is out of the bag" comments that have already been made on licensing. Does anyone still disagree or want to discuss further? If there is agreement that a stable API is okay then I think the practical way to do this is to first merge a cleaned-up version of multi-process QEMU as an unstable experimental API. Once it's being tested and used we can write a protocol specification and publish it as a stable interface when the spec has addressed most use cases. Does this sound good? Stefan
> On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote: >> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote: >>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: >>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: >>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: >>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: >>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: >>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: >>>>> To be clear: I'm very happy to have a userspace-only option for this, >>>>> I just don't want to ditch the kernel module (yet, anyway). :) >>>> >>>> If it doesn't create too large of a burden to support both, then I think >>>> it is very desirable. IIUC, this is saying a kernel based solution as the >>>> optimized/optimal solution, and userspace UNIX socket based option as the >>>> generic "works everywhere" fallback solution. >>> >>> I'm slightly in favor of the kernel implementation because it keeps us >>> better aligned with VFIO. That means solving problems in one place only >>> and less reinventing the wheel. >>> >>> Knowing that a userspace implementation is possible is a plus though. >>> Maybe that option will become attractive in the future and someone will >>> develop it. In fact, a userspace implementation may be a cool Google >>> Summer of Code project idea that I'd like to co-mentor. >> >> If it is technically viable as an approach, then I think we should be >> treating a fully unprivileged muser-over-UNIX socket as a higher priority >> than just "maybe a GSoC student will want todo it". >> >> Libvirt is getting strong message from KubeVirt project that they want to >> be running both libvirtd and QEMU fully unprivileged. This allows their >> containers to be unprivileged. Anything that requires privileges requires >> jumping through extra hoops writing custom code in KubeVirt to do things >> outside libvirt in side loaded privileged containers and this limits how >> where those features can be used. > > Okay this makes sense. > > There needs to be a consensus on whether to go with a qdev-over-socket > approach that is QEMU-specific and strongly discourages third-party > device distribution or a muser-over-socket approach that offers a stable > API for VMM interoperability and third-party device distribution. The reason I dislike yet another offloading protocol (ie. there is vhost, there is vfio, and then there would be qdev-over-socket) is that we keep reinventing the wheel. I very much prefer picking something solid (eg. VFIO) and keep investing in it. > Interoperability between VMMs and also DPDK/SPDK is important because > they form today's open source virtualization community. No one project > or codebase covers all use cases or interesting developments. If we are > short-sighted and prevent collaboration then we'll become isolated. > > On the other hand, I'm personally opposed to proprietary vendors that > contribute very little to open source. We make that easier by offering > a stable API for third-party devices.
A stable API discourages open > source contributions while allowing proprietary vendors to benefit from > the work that the open source community is doing. I appreciate the concern. However, my opinion is that vendors cannot be stopped by providing them with unstable APIs. There are plenty of examples where projects were forked and maintained separately to keep certain things under control, and that is bad for everyone. The community doesn't get contributions back, and vendors have extra pain to maintain the forks. Furthermore, service vendors will always get away with murder by copying whatever they like and using it however they please (since they are not sharing the software). I would rather look at examples like KVM. It's a relatively stable API with several proprietary users. Nevertheless, we see loads of contributions to it (perhaps less than we would want, but plenty). > > One way to choose a position is to balance up the open source vs > proprietary applications of a stable API. At this point in time I think > the DPDK/SPDK and rust-vmm communities bring enough to the table that > it's worth fostering collaboration through a stable API. The benefit of > having the stable API is large enough that the disadvantage of making > life easier for proprietary vendors can be accepted. I agree with you as per the reasoning above. > > This is just a more elaborate explanation for the "the cat is out of the > bag" comments that have already been made on licensing. Does anyone > still disagree or want to discuss further? > > If there is agreement that a stable API is okay then I think the > practical way to do this is to first merge a cleaned-up version of > multi-process QEMU as an unstable experimental API. Once it's being > tested and used we can write a protocol specification and publish it as > a stable interface when the spec has addressed most use cases. > > Does this sound good? In that case, wouldn't it be preferable to revive our proposal from Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO code to "common" and added a "user" backend underneath it, similar to how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and added vhost-user-scsi. It was PCI-centric, but it doesn't have to be. The other side can be implemented in libmuser for facilitating things. I even recall highlighting that vhost-user could be moved underneath that later, greatly simplifying lots of other Qemu code. F. > > Stefan
On Tue, Dec 17, 2019 at 04:33:16PM +0000, Stefan Hajnoczi wrote: > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > >> Is there a work-in-progress muser patch series you can post to start the > > >> discussion early? That way we can avoid reviewers like myself asking > > >> you to make changes after you have invested a lot of time. > > >> > > > > > > Absolutely, that is our plan. At the moment we do not have the patches > > > ready for the review. We have setup internally a milestone and will be > > > sending that early version as a tarball after we have it completed. > > > Would be also a meeting something that could help us to stay on the same > > > page? > > > > Please loop us in if you so set up a meeting. > Hi Stefan, and happy New Year to everyone! > There is a bi-weekly KVM Community Call that we can use for phone > discussions: > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ > Our team would like to join the call on Jan 14 and maybe talk over a few things. Felipe, will you and your team be joining as well? > Or we can schedule a one-off call at any time :). > Awesome! Thank you, we will for sure use this opportunity. Elena > Questions I've seen when discussing muser with people have been: > > 1. Can unprivileged containers create muser devices? If not, this is a > blocker for use cases that want to avoid root privileges entirely. > > 2. Does muser need to be in the kernel (e.g. slower to develop/ship, > security reasons)? A similar library could be implemented in > userspace along the lines of the vhost-user protocol. Although VMMs > would then need to use a new libmuser-client library instead of > reusing their VFIO code to access the device. > > 3. Should this feature be Linux-only? vhost-user can be implemented on > non-Linux OSes... > > Stefan
Hi. On Thu, Jan 2, 2020 at 3:03 PM Felipe Franciosi <felipe@nutanix.com> wrote: > The reason I dislike yet another offloading protocol (ie. there is > vhost, there is vfio, and then there would be qdev-over-socket) is > that we keep reinventing the wheel. I very much prefer picking > something solid (eg. VFIO) and keep investing on it. I don't have a lot of experience with VFIO, so I can't tell if it's really solid for the user-space case. Alex W could probably discuss that. > In that case, wouldn't it be preferable to revive our proposal from > Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO > code to "common" and added a "user" backend underneath it, similar to > how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and > added vhost-user-scsi. It was centric on PCI, but it doesn't have to > be. The other side can be implemented in libmuser for facilitating things. Same idea back in KVM Forum 2017 (briefly mentioned at the end of our talk with Conrad). The PoC is still around: https://github.com/elmarco/qemu/tree/wip/vfio-user/contrib/libvfio-user > I even recall highlighting that vhost-user could be moved underneath > that later, greatly simplifying lots of other Qemu code. That would eventually be an option, but vhost-user is already quite complicated. We could try to split it up somehow for the non-virtio parts. cheers
On Thu, Jan 02, 2020 at 08:01:36AM -0800, Elena Ufimtseva wrote: > On Tue, Dec 17, 2019 at 04:33:16PM +0000, Stefan Hajnoczi wrote: > > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > > > > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > > > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > > > >> Is there a work-in-progress muser patch series you can post to start the > > > >> discussion early? That way we can avoid reviewers like myself asking > > > >> you to make changes after you have invested a lot of time. > > > >> > > > > > > > > Absolutely, that is our plan. At the moment we do not have the patches > > > > ready for the review. We have setup internally a milestone and will be > > > > sending that early version as a tarball after we have it completed. > > > > Would be also a meeting something that could help us to stay on the same > > > > page? > > > > > > Please loop us in if you so set up a meeting. > > > > Hi Stefan > > And happy New Year to everyone! > > > There is a bi-weekly KVM Community Call that we can use for phone > > discussions: > > > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ > > > > Our team would like to join the call on Jan 14 and maybe talk over few things. > Felipe, will you and your team be joining as well? Great, I'll be there. Stefan
On Thu, Jan 02, 2020 at 11:03:22AM +0000, Felipe Franciosi wrote: > > On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote: > >> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote: > >>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: > >>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: > >>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: > >>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: > >>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > >>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: > >>>>> To be clear: I'm very happy to have a userspace-only option for this, > >>>>> I just don't want to ditch the kernel module (yet, anyway). :) > >>>> > >>>> If it doesn't create too large of a burden to support both, then I think > >>>> it is very desirable. IIUC, this is saying a kernel based solution as the > >>>> optimized/optimal solution, and userspace UNIX socket based option as the > >>>> generic "works everywhere" fallback solution. > >>> > >>> I'm slightly in favor of the kernel implementation because it keeps us > >>> better aligned with VFIO. That means solving problems in one place only > >>> and less reinventing the wheel. > >>> > >>> Knowing that a userspace implementation is possible is a plus though. > >>> Maybe that option will become attractive in the future and someone will > >>> develop it. In fact, a userspace implementation may be a cool Google > >>> Summer of Code project idea that I'd like to co-mentor. > >> > >> If it is technically viable as an approach, then I think we should be > >> treating a fully unprivileged muser-over-UNIX socket as a higher priority > >> than just "maybe a GSoC student will want todo it". > >> > >> Libvirt is getting strong message from KubeVirt project that they want to > >> be running both libvirtd and QEMU fully unprivileged. This allows their > >> containers to be unprivileged. Anything that requires privileges requires > >> jumping through extra hoops writing custom code in KubeVirt to do things > >> outside libvirt in side loaded privileged containers and this limits how > >> where those features can be used. > > > > Okay this makes sense. > > > > There needs to be a consensus on whether to go with a qdev-over-socket > > approach that is QEMU-specific and strongly discourages third-party > > device distribution or a muser-over-socket approach that offers a stable > > API for VMM interoperability and third-party device distribution. > > The reason I dislike yet another offloading protocol (ie. there is > vhost, there is vfio, and then there would be qdev-over-socket) is > that we keep reinventing the wheel. I very much prefer picking > something solid (eg. VFIO) and keep investing on it. I like the idea of sticking close to VFIO too. The first step is figuring out whether VFIO can be mapped to a UNIX domain socket protocol and how many non-VFIO protocol messages are required. Hopefully that extra non-VFIO stuff isn't too large. If implementations can use the kernel uapi vfio header files then we're on track for compatibility with VFIO.
> > This is just a more elaborate explanation for the "the cat is out of the > > bag" comments that have already been made on licensing. Does anyone > > still disagree or want to discuss further? > > > > If there is agreement that a stable API is okay then I think the > > practical way to do this is to first merge a cleaned-up version of > > multi-process QEMU as an unstable experimental API. Once it's being > > tested and used we can write a protocol specification and publish it as > > a stable interface when the spec has addressed most use cases. > > > > Does this sound good? > > In that case, wouldn't it be preferable to revive our proposal from > Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO > code to "common" and added a "user" backend underneath it, similar to > how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and > added vhost-user-scsi. It was centric on PCI, but it doesn't have to > be. The other side can be implemented in libmuser for facilitating things. That sounds good. Stefan
On Thu, Jan 02, 2020 at 10:55:46PM +0400, Marc-André Lureau wrote: > On Thu, Jan 2, 2020 at 3:03 PM Felipe Franciosi <felipe@nutanix.com> wrote: > > I even recall highlighting that vhost-user could be moved underneath > > that later, greatly simplifying lots of other Qemu code. > > That would eventually be an option, but vhost-user is already quite > complicated. We could try to split it up somehow for the non-virtio > parts. I hope we can deprecate vhost-user. New out-of-process devices should just implement VFIO-over-socket with virtio-pci. This way out-of-process devices are full VIRTIO devices and it's cleaner than having the vhost-user protocol that exposes a subset of VIRTIO. Stefan
> On Jan 3, 2020, at 7:59 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Thu, Jan 02, 2020 at 11:03:22AM +0000, Felipe Franciosi wrote: >>> On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>> On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote: >>>> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote: >>>>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote: >>>>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote: >>>>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote: >>>>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote: >>>>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: >>>>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote: >>>>>>> To be clear: I'm very happy to have a userspace-only option for this, >>>>>>> I just don't want to ditch the kernel module (yet, anyway). :) >>>>>> >>>>>> If it doesn't create too large of a burden to support both, then I think >>>>>> it is very desirable. IIUC, this is saying a kernel based solution as the >>>>>> optimized/optimal solution, and userspace UNIX socket based option as the >>>>>> generic "works everywhere" fallback solution. >>>>> >>>>> I'm slightly in favor of the kernel implementation because it keeps us >>>>> better aligned with VFIO. That means solving problems in one place only >>>>> and less reinventing the wheel. >>>>> >>>>> Knowing that a userspace implementation is possible is a plus though. >>>>> Maybe that option will become attractive in the future and someone will >>>>> develop it. In fact, a userspace implementation may be a cool Google >>>>> Summer of Code project idea that I'd like to co-mentor. >>>> >>>> If it is technically viable as an approach, then I think we should be >>>> treating a fully unprivileged muser-over-UNIX socket as a higher priority >>>> than just "maybe a GSoC student will want todo it". >>>> >>>> Libvirt is getting strong message from KubeVirt project that they want to >>>> be running both libvirtd and QEMU fully unprivileged. This allows their >>>> containers to be unprivileged. Anything that requires privileges requires >>>> jumping through extra hoops writing custom code in KubeVirt to do things >>>> outside libvirt in side loaded privileged containers and this limits how >>>> where those features can be used. >>> >>> Okay this makes sense. >>> >>> There needs to be a consensus on whether to go with a qdev-over-socket >>> approach that is QEMU-specific and strongly discourages third-party >>> device distribution or a muser-over-socket approach that offers a stable >>> API for VMM interoperability and third-party device distribution. >> >> The reason I dislike yet another offloading protocol (ie. there is >> vhost, there is vfio, and then there would be qdev-over-socket) is >> that we keep reinventing the wheel. I very much prefer picking >> something solid (eg. VFIO) and keep investing on it. > > I like the idea of sticking close to VFIO too. The first step is > figuring out whether VFIO can be mapped to a UNIX domain socket protocol > and many non-VFIO protocol messages are required. Hopefully that extra > non-VFIO stuff isn't too large. > I looked at this and think we could map VFIO commands over a UNIX socket without a lot of difficulty. 
	We'd have to use SCM messages to pass file descriptors from
the QEMU process to the emulation process for certain operations, but
that shouldn't be a big problem. Here are the mission mode operations:

configuration

	VFIO defines a number of configuration ioctl()s that we could
turn into messages, but if we make the protocol specific to PCI, then
all of the information they transmit (e.g., device regions and
interrupts) can be discovered by parsing the device's PCI config space.
A lot of the current VFIO code that parses config space could be
re-used to do this.

MMIO

	VFIO uses reads and writes on the VFIO file descriptor to
perform MMIO on the device; the read/write offset encodes the VFIO
region and the offset of the access within it (the VFIO regions
correspond to PCI BARs). These would have to be changed into messages
sent to the emulation process that carry the VFIO region and offset
(and, for writes, the data). One possible message layout is sketched
after this mail.

interrupts

	VFIO creates eventfds that are sent to the kernel driver so it
can inject interrupts into a guest. We would have to send these
eventfds over the socket to the emulation process using SCM messages
(see the second sketch after this mail). The emulation process could
then trigger an interrupt by writing to the corresponding eventfd.

DMA

	This is one place where I might diverge from VFIO. It uses an
ioctl() to tell the kernel driver what areas of guest memory the device
can address; the driver then pins that memory so it can be programmed
into a HW IOMMU. We could avoid pinning guest memory by adopting the
vhost-user idea of sending the emulation process the file descriptors
QEMU used to create guest memory, and having it mmap() the guest memory
itself. IOMMUs are handled by having the emulation process request
device-DMA-address-to-guest-PA translations from QEMU.

> If implementations can use the kernel uapi vfio header files then we're
> on track for compatibility with VFIO.
>
>>> This is just a more elaborate explanation for the "the cat is out of the
>>> bag" comments that have already been made on licensing. Does anyone
>>> still disagree or want to discuss further?
>>>
>>> If there is agreement that a stable API is okay then I think the
>>> practical way to do this is to first merge a cleaned-up version of
>>> multi-process QEMU as an unstable experimental API. Once it's being
>>> tested and used we can write a protocol specification and publish it as
>>> a stable interface when the spec has addressed most use cases.
>>>
>>> Does this sound good?
>>
>> In that case, wouldn't it be preferable to revive our proposal from
>> Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
>> code to "common" and added a "user" backend underneath it, similar to
>> how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
>> added vhost-user-scsi. It was centered on PCI, but it doesn't have to
>> be. The other side can be implemented in libmuser to facilitate things.
>
> That sounds good.

	The emulation program API could be based on the current
libmuser API or the libvfio-user API; the protocol itself wouldn't
care which is chosen. Our multi-process QEMU project would have to
change how devices are specified, moving them from the QEMU command
line to the emulation process command line.

						JJ
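[The MMIO description above maps naturally onto a small request
structure. The following is only a sketch under the assumptions stated
in the thread (a PCI-specific protocol with region indices
corresponding to BARs); the message-type names, field names, and layout
are illustrative, not a defined wire format.]

    #include <stdint.h>

    /*
     * Illustrative MMIO request: region + offset replace the
     * file-offset encoding used by the VFIO character device.
     * A real protocol would additionally define packing,
     * endianness, and versioning.
     */
    struct mmio_request {
        uint32_t type;    /* hypothetical REGION_READ or REGION_WRITE */
        uint32_t region;  /* VFIO region index (maps to a PCI BAR) */
        uint64_t offset;  /* offset of the access within the region */
        uint32_t count;   /* access size in bytes (1, 2, 4, or 8) */
        uint8_t  data[];  /* write payload; empty on reads */
    };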
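[For the interrupt eventfds, the SCM_RIGHTS mechanics referred to above
look roughly like the sketch below. eventfd() and the cmsg plumbing are
standard Linux/POSIX APIs; the assumption that QEMU creates the fd and
sends an interrupt index alongside it is illustrative.]

    #include <stdint.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <sys/socket.h>

    /*
     * Create an eventfd for one interrupt and pass it to the
     * emulation process over the UNIX socket.  The emulation process
     * triggers the interrupt by write()ing to its copy; QEMU keeps
     * the returned fd to feed into its interrupt injection path.
     */
    static int send_irq_eventfd(int sock, uint32_t irq_index)
    {
        int efd = eventfd(0, 0);
        struct iovec iov = { .iov_base = &irq_index,
                             .iov_len = sizeof(irq_index) };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;   /* keeps the cmsg buffer aligned */
        } u;
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = u.buf,
                              .msg_controllen = sizeof(u.buf) };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        if (efd < 0) {
            return -1;
        }

        /* The fd travels in the SCM_RIGHTS ancillary data. */
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &efd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : efd;
    }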
* John G Johnson (john.g.johnson@oracle.com) wrote:

<snip>

> DMA
>
> 	This is one place where I might diverge from VFIO. It uses an
> ioctl() to tell the kernel driver what areas of guest memory the device
> can address; the driver then pins that memory so it can be programmed
> into a HW IOMMU. We could avoid pinning guest memory by adopting the
> vhost-user idea of sending the emulation process the file descriptors
> QEMU used to create guest memory, and having it mmap() the guest memory
> itself. IOMMUs are handled by having the emulation process request
> device-DMA-address-to-guest-PA translations from QEMU.

The interface in vhost-user for passing these memory fds is a bit
hairy, so it would be great if there were something better for
multi-process (one possible per-region message is sketched after this
mail). Some things to think about:

  a) vhost-user filters the mappings so that areas of memory not backed
     by an fd aren't passed to the client; this filters out some of the
     device-specific RAM blocks that aren't really normal RAM.

  b) Hugepages are tricky, especially on a PC, where the 0-1MB area is
     broken up into chunks and you're trying to mmap 2MB chunks into
     the client.

  c) Postcopy with vhost-user was pretty tricky as well; there needs to
     be some coordination with QEMU to handle pages that are missing.

  d) Some RAM mappings can change (mostly not the ones sent to the
     client), but watch out: these changes can happen at unexpected
     times.

Dave

<snip>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
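[For comparison, vhost-user describes each fd-backed region by guest
physical address, size, user address, and an offset into the mmap'd fd.
Below is a sketch of a comparable per-region message for this protocol;
the struct and field names are assumptions, not an existing format.]

    #include <stdint.h>

    /*
     * One message per fd-backed RAM block; blocks with no backing fd
     * are simply never advertised (point (a) above).  mmap_offset
     * matters for point (b): several guest-physical chunks may live
     * at different offsets within a single hugepage-backed fd.
     */
    struct guest_mem_region {
        uint64_t guest_phys_addr;  /* guest PA where the region starts */
        uint64_t memory_size;      /* region length in bytes */
        uint64_t mmap_offset;      /* offset into the fd to mmap() */
        /* the fd arrives in SCM_RIGHTS ancillary data, not the payload */
    };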
> > 3) Muser.ko pins the pages (in get_dma_map(), called from below)
> > (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)
>
> Yikes, it pins every page??  vfio_pin_pages() intends for the vendor
> driver to be much smarter than this :-\  Thanks,

We no longer have to pin pages at all. Instead we grab the fd backing
the VMA and inject it into libmuser, then ask libmuser to mmap() that
file (a minimal sketch of the receiving side follows below). This also
solves a few other problems and is far simpler to implement.
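[On the receiving side, "grab the fd and mmap() it" amounts to
something like the sketch below. recv_one_fd() is a hypothetical helper
wrapping recvmsg()/SCM_RIGHTS, and the region fields follow the earlier
per-region sketch; none of this is the actual libmuser API.]

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical helper that receives one SCM_RIGHTS fd; not shown. */
    int recv_one_fd(int sock);

    /*
     * Map a guest RAM region from a received fd instead of having a
     * kernel module pin its pages.  MAP_SHARED makes device stores
     * visible to the guest.
     */
    void *map_guest_region(int sock, uint64_t memory_size,
                           uint64_t mmap_offset)
    {
        int fd = recv_one_fd(sock);
        void *base;

        if (fd < 0) {
            return NULL;
        }
        base = mmap(NULL, memory_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, (off_t)mmap_offset);
        return base == MAP_FAILED ? NULL : base;
    }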