[RFC,v4,00/49] Initial support of multi-process qemu

Message ID: cover.1571905346.git.jag.raman@oracle.com

Message

Jag Raman Oct. 24, 2019, 9:08 a.m. UTC
The multi-process project started with the October 2017 presentation by
Marc-Andre (Red Hat) and Konrad Wilk (Oracle) [1], continued with Jag's BoF
at KVM Forum 2018, and has now reached the prototype presented in this
patchset. John & Elena will present the status of this project at KVM Forum 2019.

This first series enables the emulation of the lsi53c895a SCSI controller in a
separate process.

We posted the Proof of Concept patches [2] before the BoF session in 2018.
Subsequently, we posted RFC v1 [3], RFC v2 [4], and RFC v3 [5] of this series.

We now present version 4 of this series, which incorporates the feedback
we received on v3 and adds support for live-migrating the remote process.

The following people contributed to this patchset:

John G Johnson <john.g.johnson@oracle.com>
Jagannathan Raman <jag.raman@oracle.com>
Elena Ufimtseva <elena.ufimtseva@oracle.com>
Kanth Ghatraju <kanth.ghatraju@oracle.com>

For a full write-up of the QEMU disaggregation concept, refer to
docs/devel/qemu-multiprocess.rst. Please refer to
docs/qemu-multiprocess.txt for usage information.

We are planning on making the following improvements in the future:
 - Performance improvements
 - Libvirt support
 - Enforcement of security policies
 - blockdev support

We welcome all your ideas, concerns, and questions about this patchset.

Thank you!

[1]: http://events17.linuxfoundation.org/sites/events/files/slides/KVM%20FORUM%20multi-process.pdf (slides)
[1]: https://www.youtube.com/watch?v=Kq1-coHh7lg (video)
[2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg566538.html
[3]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg602285.html
[4]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg624877.html
[5]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg642000.html

Elena Ufimtseva (22):
  multi-process: add a command line option for debug file
  multi-process: introduce proxy object
  mutli-process: build remote command line args
  multi-process: configure remote side devices
  multi-process: add qdev_proxy_add to create proxy devices
  multi-process: remote: add setup_devices and setup_drive msg
    processing
  multi-process: remote: use fd for socket from parent process
  multi-process: remote: add create_done condition
  multi-process: add processing of remote drive and device command line
  multi-process: refractor vl.c code to re-use in remote
  multi-process: add remote option
  multi-process: add remote options parser
  multi-process: add parse_cmdline in remote process
  multi-process: send heartbeat messages to remote
  multi-process: handle heartbeat messages in remote process
  multi-process/mon: choose HMP commands based on target
  multi-process/mig: Load VMSD in the proxy object
  multi-process/mig: refactor runstate_check into common file
  multi-process/mig: Synchronize runstate of remote process
  multi-process/mig: Restore the VMSD in remote process
  multi-process: Enable support for multiple devices in remote
  multi-process: add configure and usage information

Jagannathan Raman (26):
  multi-process: memory: alloc RAM from file at offset
  multi-process: util: Add qemu_thread_cancel() to cancel running thread
  multi-process: Add stub functions to facilate build of multi-process
  multi-process: Add config option for multi-process QEMU
  multi-process: build system for remote device process
  multi-process: define mpqemu-link object
  multi-process: add functions to synchronize proxy and remote endpoints
  multi-process: setup PCI host bridge for remote device
  multi-process: setup a machine object for remote device process
  multi-process: setup memory manager for remote device
  multi-process: remote process initialization
  multi-process: PCI BAR read/write handling for proxy & remote
    endpoints
  multi-process: Add LSI device proxy object
  multi-process: Synchronize remote memory
  multi-process: create IOHUB object to handle irq
  multi-process: Introduce build flags to separate remote process code
  multi-process: Use separate MMIO communication channel
  multi-process: perform device reset in the remote process
  multi-process/mon: stub functions to enable QMP module for remote
    process
  multi-process/mon: enable QMP module support in the remote process
  multi-process/mon: Refactor monitor/chardev functions out of vl.c
  multi-process/mon: Initialize QMP module for remote processes
  multi-process: prevent duplicate memory initialization in remote
  multi-process/mig: build migration module in the remote process
  multi-process/mig: Enable VMSD save in the Proxy object
  multi-process/mig: Send VMSD of remote to the Proxy object

John G Johnson (1):
  multi-process: add the concept description to
    docs/devel/qemu-multiprocess

 Makefile                            |    2 +
 Makefile.objs                       |   39 ++
 Makefile.target                     |   94 ++-
 accel/stubs/kvm-stub.c              |    5 +
 accel/stubs/tcg-stub.c              |  106 ++++
 backends/Makefile.objs              |    2 +
 block/Makefile.objs                 |    2 +
 chardev/char.c                      |   14 +
 configure                           |   15 +
 docs/devel/index.rst                |    1 +
 docs/devel/qemu-multiprocess.rst    | 1102 +++++++++++++++++++++++++++++++++++
 docs/qemu-multiprocess.txt          |   86 +++
 exec.c                              |   14 +-
 hmp-commands-info.hx                |   10 +
 hmp-commands.hx                     |   25 +-
 hw/Makefile.objs                    |    9 +
 hw/block/Makefile.objs              |    2 +
 hw/core/Makefile.objs               |   17 +
 hw/nvram/Makefile.objs              |    2 +
 hw/pci/Makefile.objs                |    4 +
 hw/proxy/Makefile.objs              |    1 +
 hw/proxy/memory-sync.c              |  226 +++++++
 hw/proxy/proxy-lsi53c895a.c         |   97 +++
 hw/proxy/qemu-proxy.c               |  807 +++++++++++++++++++++++++
 hw/scsi/Makefile.objs               |    2 +
 include/chardev/char.h              |    1 +
 include/exec/address-spaces.h       |    2 +
 include/exec/ram_addr.h             |    2 +-
 include/hw/pci/pci_ids.h            |    3 +
 include/hw/proxy/memory-sync.h      |   51 ++
 include/hw/proxy/proxy-lsi53c895a.h |   42 ++
 include/hw/proxy/qemu-proxy.h       |  125 ++++
 include/hw/qdev-core.h              |    2 +
 include/io/mpqemu-link.h            |  214 +++++++
 include/monitor/monitor.h           |    2 +
 include/monitor/qdev.h              |   25 +
 include/qemu-common.h               |    8 +
 include/qemu/log.h                  |    1 +
 include/qemu/mmap-alloc.h           |    3 +-
 include/qemu/thread.h               |    1 +
 include/remote/iohub.h              |   63 ++
 include/remote/machine.h            |   48 ++
 include/remote/memory.h             |   34 ++
 include/remote/pcihost.h            |   59 ++
 include/sysemu/runstate.h           |    3 +
 io/Makefile.objs                    |    2 +
 io/mpqemu-link.c                    |  351 +++++++++++
 memory.c                            |    2 +-
 migration/Makefile.objs             |   12 +
 migration/savevm.c                  |   63 ++
 migration/savevm.h                  |    3 +
 monitor/Makefile.objs               |    3 +
 monitor/misc.c                      |   84 +--
 monitor/monitor-internal.h          |   38 ++
 monitor/monitor.c                   |   83 ++-
 net/Makefile.objs                   |    2 +
 qapi/Makefile.objs                  |    2 +
 qdev-monitor.c                      |  270 ++++++++-
 qemu-options.hx                     |   21 +
 qom/Makefile.objs                   |    4 +
 remote/Makefile.objs                |    6 +
 remote/iohub.c                      |  159 +++++
 remote/machine.c                    |  133 +++++
 remote/memory.c                     |   99 ++++
 remote/pcihost.c                    |   85 +++
 remote/remote-main.c                |  633 ++++++++++++++++++++
 remote/remote-opts.c                |  131 +++++
 remote/remote-opts.h                |   31 +
 replay/Makefile.objs                |    2 +-
 rules.mak                           |    2 +-
 runstate.c                          |   41 ++
 scripts/hxtool                      |   44 +-
 stubs/audio.c                       |   12 +
 stubs/gdbstub.c                     |   21 +
 stubs/machine-init-done.c           |    4 +
 stubs/migration.c                   |  211 +++++++
 stubs/monitor.c                     |   72 +++
 stubs/net-stub.c                    |  121 ++++
 stubs/qapi-misc.c                   |   43 ++
 stubs/qapi-target.c                 |   49 ++
 stubs/replay.c                      |   26 +
 stubs/runstate-check.c              |    3 +
 stubs/ui-stub.c                     |  130 +++++
 stubs/vl-stub.c                     |  193 ++++++
 stubs/vmstate.c                     |   20 +
 stubs/xen-mapcache.c                |   22 +
 ui/Makefile.objs                    |    2 +
 util/log.c                          |    2 +
 util/mmap-alloc.c                   |    7 +-
 util/oslib-posix.c                  |    2 +-
 util/qemu-thread-posix.c            |   10 +
 vl-parse.c                          |  161 +++++
 vl.c                                |  310 ++++------
 vl.h                                |   54 ++
 94 files changed, 6908 insertions(+), 246 deletions(-)
 create mode 100644 docs/devel/qemu-multiprocess.rst
 create mode 100644 docs/qemu-multiprocess.txt
 create mode 100644 hw/proxy/Makefile.objs
 create mode 100644 hw/proxy/memory-sync.c
 create mode 100644 hw/proxy/proxy-lsi53c895a.c
 create mode 100644 hw/proxy/qemu-proxy.c
 create mode 100644 include/hw/proxy/memory-sync.h
 create mode 100644 include/hw/proxy/proxy-lsi53c895a.h
 create mode 100644 include/hw/proxy/qemu-proxy.h
 create mode 100644 include/io/mpqemu-link.h
 create mode 100644 include/remote/iohub.h
 create mode 100644 include/remote/machine.h
 create mode 100644 include/remote/memory.h
 create mode 100644 include/remote/pcihost.h
 create mode 100644 io/mpqemu-link.c
 create mode 100644 remote/Makefile.objs
 create mode 100644 remote/iohub.c
 create mode 100644 remote/machine.c
 create mode 100644 remote/memory.c
 create mode 100644 remote/pcihost.c
 create mode 100644 remote/remote-main.c
 create mode 100644 remote/remote-opts.c
 create mode 100644 remote/remote-opts.h
 create mode 100644 runstate.c
 mode change 100644 => 100755 scripts/hxtool
 create mode 100644 stubs/audio.c
 create mode 100644 stubs/migration.c
 create mode 100644 stubs/net-stub.c
 create mode 100644 stubs/qapi-misc.c
 create mode 100644 stubs/qapi-target.c
 create mode 100644 stubs/ui-stub.c
 create mode 100644 stubs/vl-stub.c
 create mode 100644 stubs/xen-mapcache.c
 create mode 100644 vl-parse.c
 create mode 100644 vl.h

Comments

no-reply@patchew.org Oct. 25, 2019, 2:08 a.m. UTC | #1
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    iotest-qcow2: 268
Failures: 071 099 120 184 186
Failed 5 of 109 iotests
make: *** [check-tests/check-block.sh] Error 1
make: *** Waiting for unfinished jobs....
  TEST    check-qtest-aarch64: tests/qos-test
Traceback (most recent call last):
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=7fcd537d324f4e2997dd5a6e772bce59', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-k9w5q7fm/src/docker-src.2019-10-24-21.54.49.17993:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=7fcd537d324f4e2997dd5a6e772bce59
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-k9w5q7fm/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    13m26.478s
user    0m8.947s


The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
no-reply@patchew.org Oct. 25, 2019, 2:08 a.m. UTC | #2
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      util/aio-wait.o
  CC      util/thread-pool.o

Warning, treated as error:
/tmp/qemu-test/src/docs/devel/index.rst:13:toctree contains reference to nonexisting document 'multi-process'
  CC      util/qemu-timer.o
  CC      util/main-loop.o
make: *** [Makefile:1003: docs/devel/index.html] Error 2
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=f573792346ea4c8cb0e6066fdd1f1ddf', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-pe_jt46w/src/docker-src.2019-10-24-22.05.58.3272:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=f573792346ea4c8cb0e6066fdd1f1ddf
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-pe_jt46w/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m19.037s
user    0m7.875s


The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
no-reply@patchew.org Oct. 25, 2019, 2:10 a.m. UTC | #3
Patchew URL: https://patchew.org/QEMU/cover.1571905346.git.jag.raman@oracle.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [RFC v4 PATCH 00/49] Initial support of multi-process qemu
Type: series
Message-id: cover.1571905346.git.jag.raman@oracle.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
75d605b multi-process: add configure and usage information
1724569 multi-process: add the concept description to docs/devel/qemu-multiprocess
15328b7 multi-process: Enable support for multiple devices in remote
5cf79267 multi-process/mig: Restore the VMSD in remote process
97c15c9 multi-process/mig: Synchronize runstate of remote process
1b67b17 multi-process/mig: refactor runstate_check into common file
55e4b53 multi-process/mig: Load VMSD in the proxy object
7e1b819 multi-process/mig: Send VMSD of remote to the Proxy object
399f9bf multi-process/mig: Enable VMSD save in the Proxy object
645fc27 multi-process/mig: build migration module in the remote process
f0265a7 multi-process: prevent duplicate memory initialization in remote
a0faf9e multi-process/mon: Initialize QMP module for remote processes
58310d9 multi-process/mon: Refactor monitor/chardev functions out of vl.c
ae4fd02 multi-process/mon: enable QMP module support in the remote process
d7e2191 multi-process/mon: stub functions to enable QMP module for remote process
2c67c85 multi-process/mon: choose HMP commands based on target
cb5ab3e multi-process: perform device reset in the remote process
3214409 multi-process: Use separate MMIO communication channel
c34c47f multi-process: handle heartbeat messages in remote process
ccd9230 multi-process: send heartbeat messages to remote
f4991d6 multi-process: add parse_cmdline in remote process
a90d6c6 multi-process: add remote options parser
d616c9d multi-process: add remote option
45c18dd multi-process: refractor vl.c code to re-use in remote
57b6105 multi-process: Introduce build flags to separate remote process code
1295409 multi-process: add processing of remote drive and device command line
e876f72 multi-process: remote: add create_done condition
effa3f0 multi-process: remote: use fd for socket from parent process
de69c89 multi-process: remote: add setup_devices and setup_drive msg processing
afc658d multi-process: add qdev_proxy_add to create proxy devices
70e0f47 multi-process: configure remote side devices
701a141 multi-process: create IOHUB object to handle irq
613c372 multi-process: Synchronize remote memory
f5183a9 multi-process: Add LSI device proxy object
7778ec0 multi-process: PCI BAR read/write handling for proxy & remote endpoints
736d74d mutli-process: build remote command line args
650f3d1 multi-process: introduce proxy object
9374d1a multi-process: remote process initialization
1965b18 multi-process: setup memory manager for remote device
719f823 multi-process: setup a machine object for remote device process
59bdda6 multi-process: setup PCI host bridge for remote device
139cdee multi-process: add functions to synchronize proxy and remote endpoints
1b3c1f0 multi-process: define mpqemu-link object
ac130ca multi-process: build system for remote device process
93774c4 multi-process: Add config option for multi-process QEMU
7e5f9b2 multi-process: Add stub functions to facilate build of multi-process
73740e6 multi-process: add a command line option for debug file
59f2f7d multi-process: util: Add qemu_thread_cancel() to cancel running thread
18b29a1 multi-process: memory: alloc RAM from file at offset

=== OUTPUT BEGIN ===
1/49 Checking commit 18b29a1306b2 (multi-process: memory: alloc RAM from file at offset)
2/49 Checking commit 59f2f7d24d20 (multi-process: util: Add qemu_thread_cancel() to cancel running thread)
3/49 Checking commit 73740e6eff21 (multi-process: add a command line option for debug file)
4/49 Checking commit 7e5f9b26d48c (multi-process: Add stub functions to facilate build of multi-process)
ERROR: suspect code indent for conditional statements (4, 4)
#137: FILE: accel/stubs/tcg-stub.c:109:
+    while (1) {
+    }

WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#151: 
new file mode 100644

total: 1 errors, 1 warnings, 376 lines checked

Patch 4/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/49 Checking commit 93774c4a2c28 (multi-process: Add config option for multi-process QEMU)
6/49 Checking commit ac130ca326fe (multi-process: build system for remote device process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#286: 
new file mode 100644

total: 0 errors, 1 warnings, 244 lines checked

Patch 6/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
7/49 Checking commit 1b3c1f0e0495 (multi-process: define mpqemu-link object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#20: 
new file mode 100644

WARNING: line over 80 characters
#171: FILE: include/io/mpqemu-link.h:147:
+void mpqemu_link_set_callback(MPQemuLinkState *s, mpqemu_link_callback callback);

WARNING: line over 80 characters
#397: FILE: io/mpqemu-link.c:207:
+                qemu_log_mask(LOG_REMOTE_DEBUG, "%s: Max FDs exceeded\n", __func__);

total: 0 errors, 3 warnings, 464 lines checked

Patch 7/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/49 Checking commit 139cdee6cbf6 (multi-process: add functions to synchronize proxy and remote endpoints)
9/49 Checking commit 59bdda61a183 (multi-process: setup PCI host bridge for remote device)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#31: 
new file mode 100644

total: 0 errors, 1 warnings, 153 lines checked

Patch 9/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
10/49 Checking commit 719f82346cb6 (multi-process: setup a machine object for remote device process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#51: 
new file mode 100644

total: 0 errors, 1 warnings, 207 lines checked

Patch 10/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
11/49 Checking commit 1965b18cc69e (multi-process: setup memory manager for remote device)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81: 
new file mode 100644

total: 0 errors, 1 warnings, 182 lines checked

Patch 11/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
12/49 Checking commit 9374d1ad3f30 (multi-process: remote process initialization)
13/49 Checking commit 650f3d106d82 (multi-process: introduce proxy object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#34: 
new file mode 100644

total: 0 errors, 1 warnings, 379 lines checked

Patch 13/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
14/49 Checking commit 736d74d6623b (mutli-process: build remote command line args)
WARNING: line over 80 characters
#141: FILE: hw/proxy/qemu-proxy.c:243:
+static void init_proxy(PCIDevice *dev, char *command, bool need_spawn, Error **errp)

WARNING: line over 80 characters
#167: FILE: include/hw/proxy/qemu-proxy.h:66:
+    void (*init_proxy) (PCIDevice *dev, char *command, bool need_spawn, Error **errp);

total: 0 errors, 2 warnings, 146 lines checked

Patch 14/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
15/49 Checking commit 7778ec04e0e6 (multi-process: PCI BAR read/write handling for proxy & remote endpoints)
16/49 Checking commit f5183a9f01b5 (multi-process: Add LSI device proxy object)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#24: 
new file mode 100644

total: 0 errors, 1 warnings, 135 lines checked

Patch 16/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
17/49 Checking commit 613c372a8221 (multi-process: Synchronize remote memory)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#32: 
new file mode 100644

total: 0 errors, 1 warnings, 344 lines checked

Patch 17/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
18/49 Checking commit 701a141ec1a3 (multi-process: create IOHUB object to handle irq)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#202: 
new file mode 100644

total: 0 errors, 1 warnings, 435 lines checked

Patch 18/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
19/49 Checking commit 70e0f47f0652 (multi-process: configure remote side devices)
20/49 Checking commit afc658d34c34 (multi-process: add qdev_proxy_add to create proxy devices)
21/49 Checking commit de69c899ae12 (multi-process: remote: add setup_devices and setup_drive msg processing)
22/49 Checking commit effa3f06a627 (multi-process: remote: use fd for socket from parent process)
23/49 Checking commit e876f72db9ae (multi-process: remote: add create_done condition)
24/49 Checking commit 129540993d75 (multi-process: add processing of remote drive and device command line)
WARNING: Block comments use a leading /* on a separate line
#41: FILE: vl.c:1149:
+    dev = qdev_remote_add(opts, false /* this is drive */, errp);

WARNING: Block comments use a leading /* on a separate line
#87: FILE: vl.c:2246:
+    dev = qdev_remote_add(opts, true /* this is device */, errp);

WARNING: line over 80 characters
#121: FILE: vl.c:4418:
+     * need PCI host initialized. As a TODO: could defer init of PCIProxyDev instead.

total: 0 errors, 3 warnings, 117 lines checked

Patch 24/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
25/49 Checking commit 57b61051819d (multi-process: Introduce build flags to separate remote process code)
26/49 Checking commit 45c18ddb523b (multi-process: refractor vl.c code to re-use in remote)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#35: 
new file mode 100644

WARNING: Block comments use a leading /* on a separate line
#157: FILE: vl-parse.c:118:
+    dev = qdev_remote_add(opts, false /* this is drive */, errp);

WARNING: Block comments use a leading /* on a separate line
#172: FILE: vl-parse.c:133:
+    dev = qdev_remote_add(opts, true /* this is device */, errp);

total: 0 errors, 3 warnings, 412 lines checked

Patch 26/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
27/49 Checking commit d616c9d52b3e (multi-process: add remote option)
28/49 Checking commit a90d6c6eebff (multi-process: add remote options parser)
WARNING: Block comments use a leading /* on a separate line
#37: FILE: vl.c:300:
+        { /* end of list */ }

total: 0 errors, 1 warnings, 147 lines checked

Patch 28/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
29/49 Checking commit f4991d61c243 (multi-process: add parse_cmdline in remote process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#59: 
new file mode 100644

ERROR: line over 90 characters
#151: FILE: remote/remote-opts.c:88:
+                error_report("Option not supported for this target, %x arch_mask, %x arch_type",

total: 1 errors, 1 warnings, 180 lines checked

Patch 29/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

30/49 Checking commit ccd92304b77d (multi-process: send heartbeat messages to remote)
31/49 Checking commit c34c47ff9925 (multi-process: handle heartbeat messages in remote process)
32/49 Checking commit 32144092e7eb (multi-process: Use separate MMIO communication channel)
33/49 Checking commit cb5ab3eb7d63 (multi-process: perform device reset in the remote process)
34/49 Checking commit 2c67c85818aa (multi-process/mon: choose HMP commands based on target)
35/49 Checking commit d7e21910a6c4 (multi-process/mon: stub functions to enable QMP module for remote process)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#108: 
new file mode 100644

total: 0 errors, 1 warnings, 722 lines checked

Patch 35/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
36/49 Checking commit ae4fd022fa7e (multi-process/mon: enable QMP module support in the remote process)
37/49 Checking commit 58310d9c635b (multi-process/mon: Refactor monitor/chardev functions out of vl.c)
38/49 Checking commit a0faf9ea04b0 (multi-process/mon: Initialize QMP module for remote processes)
39/49 Checking commit f0265a76742b (multi-process: prevent duplicate memory initialization in remote)
40/49 Checking commit 645fc2738aec (multi-process/mig: build migration module in the remote process)
41/49 Checking commit 399f9bf4ba56 (multi-process/mig: Enable VMSD save in the Proxy object)
ERROR: suspect code indent for conditional statements (4, 4)
#126: FILE: hw/proxy/qemu-proxy.c:449:
+    while (*((volatile uint64_t *)&pdev->migsize) < size) {
+    }

total: 1 errors, 0 warnings, 173 lines checked

Patch 41/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

42/49 Checking commit 7e1b8192f562 (multi-process/mig: Send VMSD of remote to the Proxy object)
43/49 Checking commit 55e4b53ec008 (multi-process/mig: Load VMSD in the proxy object)
44/49 Checking commit 1b67b1722798 (multi-process/mig: refactor runstate_check into common file)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#48: 
new file mode 100644

total: 0 errors, 1 warnings, 92 lines checked

Patch 44/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
45/49 Checking commit 97c15c99f41f (multi-process/mig: Synchronize runstate of remote process)
46/49 Checking commit 5cf792672396 (multi-process/mig: Restore the VMSD in remote process)
ERROR: that open brace { should be on the previous line
#118: FILE: remote/remote-main.c:527:
+        if (process_start_mig_in(msg))
+        {

total: 1 errors, 0 warnings, 103 lines checked

Patch 46/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

47/49 Checking commit 15328b7aae2d (multi-process: Enable support for multiple devices in remote)
48/49 Checking commit 1724569b4d71 (multi-process: add the concept description to docs/devel/qemu-multiprocess)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#22: 
new file mode 100644

total: 0 errors, 1 warnings, 1106 lines checked

Patch 48/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
49/49 Checking commit 75d605b31739 (multi-process: add configure and usage information)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: 
new file mode 100644

total: 0 errors, 1 warnings, 86 lines checked

Patch 49/49 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/cover.1571905346.git.jag.raman@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Stefan Hajnoczi Nov. 21, 2019, 12:46 p.m. UTC | #4
On Thu, Oct 24, 2019 at 05:08:41AM -0400, Jagannathan Raman wrote:
> [...]

I've wrapped up my review of v4.  There is more to review in detail, but I've
posted enough comments that I'd like to see the next revision before
investing more time.

The main topics:

1. It's possible to have just one proxy device per bus type (PCI, USB,
   etc).  The proxy device instance can be initialized by communicating
   with the remote process to inquire about its memory regions,
   interrupts, etc.  This removes the need to hardcode that information
   into per-device proxy objects, which is tedious and can get
   out-of-sync with the real device emulation code.

   This is becoming similar to doing VFIO or muser over a socket... (a
   rough sketch of such an inquiry interface follows below, after point 2).

2. Security and code quality.  Missing input validation and resource
   leaks don't inspire confidence :(.

   Please run scripts/checkpatch.pl on the code.
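
To make point 1 concrete, here is a minimal sketch of the kind of
init-time inquiry interface I mean.  The message and struct names below
are hypothetical, not something this series defines:

    #include <stdint.h>

    /* Hypothetical inquiry a generic proxy could send to the remote
     * process at init time, instead of hardcoding the device details
     * in a per-device proxy object. */
    typedef enum {
        DEV_INQ_PCI_INFO,       /* ask the remote for BAR and IRQ layout */
    } DevInquiryType;

    /* Hypothetical reply describing the remote device's resources. */
    typedef struct {
        uint64_t bar_size[6];   /* size of each PCI BAR, 0 if unused */
        uint8_t  bar_is_io[6];  /* 1 = port I/O, 0 = memory-mapped */
        uint8_t  nr_irqs;       /* interrupt vectors the device exposes */
    } DevPCIInfo;

The proxy would then create its memory regions and interrupts from the
reply instead of compiling that knowledge into a per-device object.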

Stefan
Elena Ufimtseva Dec. 10, 2019, 6:47 a.m. UTC | #5
Hi

We would like to give a short update to the community about the multi-process project.

Firstly, we appreciate the feedback and all the productive discussions we had
at KVM Forum 2019.
As an outcome of the conference, we have switched gears and are investigating
ways of using the muser framework in our project.

At this moment we are working on the evaluation and a first prototype
of qemu-multiprocess based on the muser framework.
We first heard about it at the conference in the presentation given by
Thanos Makatos and Swapnil Ingle from Nutanix.
Their presentation is available at
https://static.sched.com/hosted_files/kvmforum2019/3b/muser.pdf
along with a GitHub link to the source repo.
After the conversation we had with a group of people including Felipe Franciosi,
Stefan Hajnoczi, Daniel Berrangé, Konrad Wilk, Peter Maydell, John Johnson, and
a few others (apologies if some names are missing), we have gathered important
answers on how to move forward with qemu-multiprocess.

At this moment we are working on the first stage of the project with the help
of the Nutanix developers.
The questions we have gathered so far will be addressed with the muser
and QEMU developers after we finish the first stage and make sure we understand
what it will take for us to move on to the next stage.

We will also incorporate the relevant review feedback that Stefan provided
on v4 of the patchset. Thank you, Stefan.

If anyone has any further suggestions or questions about the status,
please reply to this email.

Thank you

JJ, Jag & Elena

Stefan Hajnoczi Dec. 13, 2019, 10:41 a.m. UTC | #6
On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
> At this moment we are working on the first stage of the project with the help
> of the Nutanix developers.
> The questions we have gathered so far will be addressed with the muser
> and QEMU developers after we finish the first stage and make sure we understand
> what it will take for us to move on to the next stage.
> 
> We will also incorporate the relevant review feedback that Stefan provided
> on v4 of the patchset. Thank you, Stefan.
> 
> If anyone has any further suggestions or questions about the status,
> please reply to this email.

Hi Elena,
At KVM Forum we discussed spending 1 or 2 weeks trying out muser.  A few
weeks have passed and from your email it sounds like this "next stage"
might be a lot of work.

Is there a work-in-progress muser patch series you can post to start the
discussion early?  That way we can avoid reviewers like myself asking
you to make changes after you have invested a lot of time.

It's good that you are in touch with the muser developers (via private
discussion?  I haven't seen much activity on #muser IRC).

Stefan
Elena Ufimtseva Dec. 16, 2019, 7:46 p.m. UTC | #7
On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
> > At this moment we are working on the first stage of the project with the help
> > of the Nutanix developers.
> > The questions we have gathered so far will be addressed with the muser
> > and QEMU developers after we finish the first stage and make sure we understand
> > what it will take for us to move on to the next stage.
> > 
> > We will also incorporate the relevant review feedback that Stefan provided
> > on v4 of the patchset. Thank you, Stefan.
> > 
> > If anyone has any further suggestions or questions about the status,
> > please reply to this email.
> 
> Hi Elena,
> At KVM Forum we discussed spending 1 or 2 weeks trying out muser.  A few
> weeks have passed and from your email it sounds like this "next stage"
> might be a lot of work.
>

Hi Stefan

Perhaps we were not too clear about our work in the previous email.
Our assumption was that the question that came from KVM Forum was
whether muser can be used to achieve the same as what we have now.
We should have answered clearly yes to this question.  We have not yet
discovered any major roadblocks.
At the moment, we are mostly engaged in learning the code and discussing
the design, plus some coding to answer the specific questions.
We understand that the best way to make progress is to work with the
upstream community in the early stages, and we will present
the proposal shortly for discussion.
 
> Is there a work-in-progress muser patch series you can post to start the
> discussion early?  That way we can avoid reviewers like myself asking
> you to make changes after you have invested a lot of time.
>

Absolutely, that is our plan. At the moment we do not have the patches
ready for review. We have set up a milestone internally and will be
sending that early version as a tarball once it is completed.
Would a meeting also be something that could help us stay on the same
page?
 
> It's good that you are in touch with the muser developers (via private
> discussion?  I haven't seen much activity on #muser IRC).
>

We use IRC (I know Jag got some answers there) and GitHub for issues
(one of which was addressed). We are hoping to get the conversation going over
email.

JJ, Jag and Elena 
> Stefan
Felipe Franciosi Dec. 16, 2019, 7:57 p.m. UTC | #8
Heya,

> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> 
> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>> On Mon, Dec 09, 2019 at 10:47:17PM -0800, Elena Ufimtseva wrote:
>>> At this moment we are working on the first stage of the project with the help
>>> of the Nutanix developers.
>>> The questions we have gathered so far will be addressed with the muser
>>> and QEMU developers after we finish the first stage and make sure we understand
>>> what it will take for us to move on to the next stage.
>>> 
>>> We will also incorporate the relevant review feedback that Stefan provided
>>> on v4 of the patchset. Thank you, Stefan.
>>> 
>>> If anyone has any further suggestions or questions about the status,
>>> please reply to this email.
>> 
>> Hi Elena,
>> At KVM Forum we discussed spending 1 or 2 weeks trying out muser.  A few
>> weeks have passed and from your email it sounds like this "next stage"
>> might be a lot of work.
>> 
> 
> Hi Stefan
> 
> Perhaps we were not too clear about our work in the previous email.
> Our assumption was that the question that came from KVM Forum was
> whether muser can be used to achieve the same as what we have now.
> We should have answered clearly yes to this question.  We have not yet
> discovered any major roadblocks.
> At the moment, we are mostly engaged in learning the code and discussing
> the design, plus some coding to answer the specific questions.
> We understand that the best way to make progress is to work with the
> upstream community in the early stages, and we will present
> the proposal shortly for discussion.
> 
>> Is there a work-in-progress muser patch series you can post to start the
>> discussion early?  That way we can avoid reviewers like myself asking
>> you to make changes after you have invested a lot of time.
>> 
> 
> Absolutely, that is our plan. At the moment we do not have the patches
> ready for review. We have set up a milestone internally and will be
> sending that early version as a tarball once it is completed.
> Would a meeting also be something that could help us stay on the same
> page?

Please loop us in if you do set up a meeting.

> 
>> It's good that you are in touch with the muser developers (via private
>> discussion?  I haven't seen much activity on #muser IRC).
>> 
> 
> We use IRC (I know Jag got some answers there) and GitHub for issues
> (one of which was addressed).

I thought there was only the one. Let us know if you run into any other bugs. We are looking forward to hearing about people’s experience and addressing issues that come with uses we didn’t foresee or test.

Cheers,
Felipe

> We are hoping to get the conversation going over
> email.
> 
> JJ, Jag and Elena 
>> Stefan
> 
>
Stefan Hajnoczi Dec. 17, 2019, 4:33 p.m. UTC | #9
On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >> Is there a work-in-progress muser patch series you can post to start the
> >> discussion early?  That way we can avoid reviewers like myself asking
> >> you to make changes after you have invested a lot of time.
> >> 
> > 
> > Absolutely, that is our plan. At the moment we do not have the patches
> > ready for review. We have set up a milestone internally and will be
> > sending that early version as a tarball once it is completed.
> > Would a meeting also be something that could help us stay on the same
> > page?
> 
> Please loop us in if you do set up a meeting.

There is a bi-weekly KVM Community Call that we can use for phone
discussions:

  https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

Or we can schedule a one-off call at any time :).

Questions I've seen when discussing muser with people have been:

1. Can unprivileged containers create muser devices?  If not, this is a
   blocker for use cases that want to avoid root privileges entirely.

2. Does muser need to be in the kernel (e.g. slower to develop/ship,
   security reasons)?  A similar library could be implemented in
   userspace along the lines of the vhost-user protocol.  Although VMMs
   would then need to use a new libmuser-client library instead of
   reusing their VFIO code to access the device.

3. Should this feature be Linux-only?  vhost-user can be implemented on
   non-Linux OSes...

Stefan
Felipe Franciosi Dec. 17, 2019, 10:57 p.m. UTC | #10
> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>>> Is there a work-in-progress muser patch series you can post to start the
>>>> discussion early?  That way we can avoid reviewers like myself asking
>>>> you to make changes after you have invested a lot of time.
>>>> 
>>> 
>>> Absolutely, that is our plan. At the moment we do not have the patches
>>> ready for review. We have set up a milestone internally and will be
>>> sending that early version as a tarball once it is completed.
>>> Would a meeting also be something that could help us stay on the same
>>> page?
>> 
>> Please loop us in if you so set up a meeting.
> 
> There is a bi-weekly KVM Community Call that we can use for phone
> discussions:
> 
>  https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> 
> Or we can schedule a one-off call at any time :).

Sounds good either way, whenever it's needed.

> 
> Questions I've seen when discussing muser with people have been:
> 
> 1. Can unprivileged containers create muser devices?  If not, this is a
>   blocker for use cases that want to avoid root privileges entirely.

Yes you can. Muser device creation follows the same process as general
mdev device creation (ie. you write to a sysfs path). That creates an
entry in /dev/vfio and the control plane can further drop privileges
there (set selinux contexts, &c.)
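
For example, the creation step can be as small as this sketch (the
parent device path and mdev type name are illustrative here; the real
ones depend on how the muser parent device is registered):

    #include <stdio.h>

    /* Sketch: create an mdev instance by writing a UUID to the chosen
     * mdev type's "create" attribute in sysfs.  On success the kernel
     * surfaces a matching /dev/vfio entry for the new device. */
    int main(void)
    {
        const char *path = "/sys/class/mdev_bus/muser/"
                           "mdev_supported_types/muser-1/create";
        const char *uuid = "00000000-0000-0000-0000-000000000001";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "%s\n", uuid);
        return fclose(f) == 0 ? 0 : 1;
    }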

> 
> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>   security reasons)?  A similar library could be implemented in
>   userspace along the lines of the vhost-user protocol.  Although VMMs
>   would then need to use a new libmuser-client library instead of
>   reusing their VFIO code to access the device.

Doing it in userspace was the flow we proposed back in last year's KVM
Forum (Edinburgh), but it got turned down. That's why we pursued the
kernel approach, which turned out to have some advantages:
- No changes needed to Qemu
- No Qemu needed at all for userspace drivers
- Device emulation process restart is trivial
  (it therefore makes device code upgrades much easier)

Having said that, nothing stops us from enhancing libmuser to talk
directly to Qemu (for the Qemu case). I envision at least two ways of
doing that:
- Hooking up libmuser with Qemu directly (eg. over a unix socket)
- Hooking Qemu with CUSE and implementing the muser.ko interface

For the latter, libmuser would talk to a character device just like it
talks to the vfio character device. We "just" need to implement that
backend in Qemu. :)

> 
> 3. Should this feature be Linux-only?  vhost-user can be implemented on
>   non-Linux OSes...

The userspace approach discussed above certainly can be more portable.
Currently, muser depends on MDEV+VFIO and that's where the restriction
comes from.

F.

> 
> Stefan
Paolo Bonzini Dec. 18, 2019, midnight UTC | #11
On 17/12/19 23:57, Felipe Franciosi wrote:
> Doing it in userspace was the flow we proposed back in last year's KVM
> Forum (Edinburgh), but it got turned down.

I think the time since then has shown that essentially the cat is out of
the bag.  I didn't really like the idea of devices outside QEMU---and I
still don't---but if something like "VFIO over AF_UNIX" turns out to be
the cleanest way to implement multi-process QEMU device models, I am not
going to pull an RMS and block that from happening.  Assuming I could
even do so!

Paolo
Stefan Hajnoczi Dec. 19, 2019, 11:55 a.m. UTC | #12
On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > Questions I've seen when discussing muser with people have been:
> > 
> > 1. Can unprivileged containers create muser devices?  If not, this is a
> >   blocker for use cases that want to avoid root privileges entirely.
> 
> Yes you can. Muser device creation follows the same process as general
> mdev device creation (ie. you write to a sysfs path). That creates an
> entry in /dev/vfio and the control plane can further drop privileges
> there (set selinux contexts, &c.)

In this case there is still a privileged step during setup.  What about
completely unprivileged scenarios like a regular user without root or a
rootless container?

> > 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >   security reasons)?  A similar library could be implemented in
> >   userspace along the lines of the vhost-user protocol.  Although VMMs
> >   would then need to use a new libmuser-client library instead of
> >   reusing their VFIO code to access the device.
> 
> Doing it in userspace was the flow we proposed back in last year's KVM
> Forum (Edinburgh), but it got turned down. That's why we pursued the
> kernel approach, which turned out to have some advantages:
> - No changes needed to Qemu
> - No Qemu needed at all for userspace drivers
> - Device emulation process restart is trivial
>   (it therefore makes device code upgrades much easier)
> 
> Having said that, nothing stops us from enhancing libmuser to talk
> directly to Qemu (for the Qemu case). I envision at least two ways of
> doing that:
> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> - Hooking Qemu with CUSE and implementing the muser.ko interface
> 
> For the latter, libmuser would talk to a character device just like it
> talks to the vfio character device. We "just" need to implement that
> backend in Qemu. :)

What about:
 * libmuser's API stays mostly unchanged but the library speaks a
   VFIO-over-UNIX domain sockets protocol instead of talking to
   mdev/vfio in the host kernel.
 * VMMs can implement this protocol directly for POSIX-portable and
   unprivileged operation.
 * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
   still take advantage of libmuser devices.

Assuming this is feasible, would you lose any important
features/advantages of the muser.ko approach?  I don't know enough about
VFIO to identify any blocker or obvious performance problems.

Regarding recovery, it seems straightforward to keep state in a tmpfs
file that can be reopened when the device is restarted.  I don't think
kernel code is necessary?
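
A minimal sketch of that, assuming a fixed-size state blob and a
made-up object name:

    /* Sketch: device state lives in a tmpfs-backed shared memory
     * object; a restarted emulator simply reattaches to it.
     * "/muser-dev0" and the 1 MiB size are illustrative. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define STATE_SIZE (1UL << 20)

    static void *open_state(void)
    {
        /* O_CREAT without O_EXCL: create on first start, reopen on restart */
        int fd = shm_open("/muser-dev0", O_RDWR | O_CREAT, 0600);
        void *p;

        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, STATE_SIZE) < 0) {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }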

Stefan
Felipe Franciosi Dec. 19, 2019, 12:33 p.m. UTC | #13
Hello,

(I've added Jim and Ben from the SPDK team to the thread.)

> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>> Questions I've seen when discussing muser with people have been:
>>> 
>>> 1. Can unprivileged containers create muser devices?  If not, this is a
>>>  blocker for use cases that want to avoid root privileges entirely.
>> 
>> Yes you can. Muser device creation follows the same process as general
>> mdev device creation (ie. you write to a sysfs path). That creates an
>> entry in /dev/vfio and the control plane can further drop privileges
>> there (set selinux contexts, &c.)
> 
> In this case there is still a privileged step during setup.  What about
> completely unprivileged scenarios like a regular user without root or a
> rootless container?

Oh, I see what you are saying. I suppose we need to investigate
adjusting the privileges of the sysfs path correctly beforehand to
allow devices to be created by non-root users. The credentials used on
creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).

I need to look into that and get back to you.

> 
>>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>>>  security reasons)?  A similar library could be implemented in
>>>  userspace along the lines of the vhost-user protocol.  Although VMMs
>>>  would then need to use a new libmuser-client library instead of
>>>  reusing their VFIO code to access the device.
>> 
>> Doing it in userspace was the flow we proposed back in last year's KVM
>> Forum (Edinburgh), but it got turned down. That's why we pursued the
>> kernel approach, which turned out to have some advantages:
>> - No changes needed to Qemu
>> - No Qemu needed at all for userspace drivers
>> - Device emulation process restart is trivial
>>  (it therefore makes device code upgrades much easier)
>> 
>> Having said that, nothing stops us from enhancing libmuser to talk
>> directly to Qemu (for the Qemu case). I envision at least two ways of
>> doing that:
>> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
>> - Hooking Qemu with CUSE and implementing the muser.ko interface
>> 
>> For the latter, libmuser would talk to a character device just like it
>> talks to the vfio character device. We "just" need to implement that
>> backend in Qemu. :)
> 
> What about:
> * libmuser's API stays mostly unchanged but the library speaks a
>   VFIO-over-UNIX domain sockets protocol instead of talking to
>   mdev/vfio in the host kernel.

As I said above, there are advantages to the kernel model. The key one
is transparent device emulation restarts. Today, muser.ko keeps the
"device memory" internally in a prefix tree. Upon restart, a new
device emulator can recover state (eg. from a state file in /dev/shm
or similar) and remap the same memory that is already configured to
the guest via Qemu. We have a pending work item for muser.ko to also
keep the eventfds so we can recover those, too. Another advantage is
working with any userspace driver and not requiring a VMM at all.

If done entirely in userspace, the device emulator needs to allocate
the device memory somewhere that remains accessible (eg. tmpfs), with
the difference that now we may be talking about non-trivial amounts of
memory. Also, that may not be the kind of content you want lingering
around the filesystem (for the same reasons Qemu unlinks memory files
from /dev/hugepages after mmap'ing them).
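
That unlink-after-mmap pattern looks roughly like this (path and size
illustrative):

    /* Sketch: map a backing file, then unlink it so nothing lingers
     * in the filesystem; the pages keep backing the mapping. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_and_unlink(const char *path, size_t size)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        void *p;

        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, size) < 0) {
            close(fd);
            unlink(path);
            return NULL;
        }
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        unlink(path);    /* directory entry gone; mapping still valid */
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }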

That's why I'd prefer to rephrase what you said to "in addition"
instead of "instead".

> * VMMs can implement this protocol directly for POSIX-portable and
>   unprivileged operation.
> * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
>   still take advantage of libmuser devices.

I'm happy with that.
We need to think the credential aspect through to ensure nodes can
be created in the right places with the right privileges.

> 
> Assuming this is feasible, would you lose any important
> features/advantages of the muser.ko approach?  I don't know enough about
> VFIO to identify any blocker or obvious performance problems.

That's what I elaborated on above: muser.ko can keep
various metadata (and other resources) about the device in the kernel
and grant it back to userspace as needed. There are ways around it,
but it requires some orchestration with tmpfs and the VMM (only so
much can be kept in tmpfs; the eventfds need to be retransmitted from
the machine emulator on request).

Restarting is a critical aspect of this. One key use case for the
project is to be able to emulate various devices from one process (for
polling). That must be able to restart for upgrades or recovery.

> 
> Regarding recovery, it seems straightforward to keep state in a tmpfs
> file that can be reopened when the device is restarted.  I don't think
> kernel code is necessary?

It adds a dependency, but isn't a show stopper. If we can work through
permission issues, making sure the VMM can reconnect and retransmit
eventfds and other state, then it should be ok.

To be clear: I'm very happy to have a userspace-only option for this,
I just don't want to ditch the kernel module (yet, anyway). :)

F.

> 
> Stefan
Daniel P. Berrangé Dec. 19, 2019, 12:50 p.m. UTC | #14
On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> 
> 
> > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >>>> Is there a work-in-progress muser patch series you can post to start the
> >>>> discussion early?  That way we can avoid reviewers like myself asking
> >>>> you to make changes after you have invested a lot of time.
> >>>> 
> >>> 
> >>> Absolutely, that is our plan. At the moment we do not have the patches
> >>> ready for review. We have set up an internal milestone and will be
> >>> sending that early version as a tarball after we have it completed.
> >>> Would a meeting also be something that could help us stay on the same
> >>> page?
> >> 
> >> Please loop us in if you do set up a meeting.
> > 
> > There is a bi-weekly KVM Community Call that we can use for phone
> > discussions:
> > 
> >  https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> > 
> > Or we can schedule a one-off call at any time :).
> 
> Sounds good either way, whenever it's needed.
> 
> > 
> > Questions I've seen when discussing muser with people have been:
> > 
> > 1. Can unprivileged containers create muser devices?  If not, this is a
> >   blocker for use cases that want to avoid root privileges entirely.
> 
> Yes you can. Muser device creation follows the same process as general
> mdev device creation (ie. you write to a sysfs path). That creates an
> entry in /dev/vfio and the control plane can further drop privileges
> there (set selinux contexts, &c.)

This isn't what I'd describe / consider as unprivileged, as AFAICT
although QEMU can use it unprivileged, this still requires a privileged
management process to do the setup in sysfs.

I think it is desirable to be able to support a fully unprivileged
model where there is nothing requiring elevated privileges, neither
libvirtd nor QEMU.

I think this basically ends up at the same requirement as support
for non-Linux hosts. We have to assume that some desirable deployment
scenarios will not be able to use Linux kernel features, either because
they lack privileges or because they involve non-Linux hosts.

> > 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >   security reasons)?  A similar library could be implemented in
> >   userspace along the lines of the vhost-user protocol.  Although VMMs
> >   would then need to use a new libmuser-client library instead of
> >   reusing their VFIO code to access the device.
> 
> Doing it in userspace was the flow we proposed back in last year's KVM
> > Forum (Edinburgh), but it got turned down. That's why we pursued the
> kernel approach, which turned out to have some advantages:
> - No changes needed to Qemu
> - No Qemu needed at all for userspace drivers
> - Device emulation process restart is trivial
>   (it therefore makes device code upgrades much easier)
> 
> Having said that, nothing stops us from enhancing libmuser to talk
> directly to Qemu (for the Qemu case). I envision at least two ways of
> doing that:
> - Hooking up libmuser with Qemu directly (eg. over a unix socket)

A UNIX socket, or localhost TCP socket, sounds most appealing from a
portability POV.

> - Hooking Qemu with CUSE and implementing the muser.ko interface

Perhaps I'm misunderstanding, but wouldn't a CUSE interface
still have issues with something needing to be privileged to
do the initial setup, and also still lack OS portability?

> For the latter, libmuser would talk to a character device just like it
> talks to the vfio character device. We "just" need to implement that
> backend in Qemu. :)
> 
> > 
> > 3. Should this feature be Linux-only?  vhost-user can be implemented on
> >   non-Linux OSes...
> 
> The userspace approach discussed above certainly can be more portable.
> Currently, muser depends on MDEV+VFIO and that's where the restriction
> comes from.

Regards,
Daniel
Daniel P. Berrangé Dec. 19, 2019, 12:55 p.m. UTC | #15
On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> Hello,
> 
> (I've added Jim and Ben from the SPDK team to the thread.)
> 
> > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > 
> > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >>> Questions I've seen when discussing muser with people have been:
> >>> 
> >>> 1. Can unprivileged containers create muser devices?  If not, this is a
> >>>  blocker for use cases that want to avoid root privileges entirely.
> >> 
> >> Yes you can. Muser device creation follows the same process as general
> >> mdev device creation (ie. you write to a sysfs path). That creates an
> >> entry in /dev/vfio and the control plane can further drop privileges
> >> there (set selinux contexts, &c.)
> > 
> > In this case there is still a privileged step during setup.  What about
> > completely unprivileged scenarios like a regular user without root or a
> > rootless container?
> 
> Oh, I see what you are saying. I suppose we need to investigate
> adjusting the privileges of the sysfs path correctly beforehand to
> allow devices to be created by non-root users. The credentials used on
> creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).
> 
> I need to look into that and get back to you.
> 
> > 
> >>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >>>  security reasons)?  A similar library could be implemented in
> >>>  userspace along the lines of the vhost-user protocol.  Although VMMs
> >>>  would then need to use a new libmuser-client library instead of
> >>>  reusing their VFIO code to access the device.
> >> 
> >> Doing it in userspace was the flow we proposed back in last year's KVM
> >> Forum (Edinburgh), but it got turned down. That's why we pursued the
> >> kernel approach, which turned out to have some advantages:
> >> - No changes needed to Qemu
> >> - No Qemu needed at all for userspace drivers
> >> - Device emulation process restart is trivial
> >>  (it therefore makes device code upgrades much easier)
> >> 
> >> Having said that, nothing stops us from enhancing libmuser to talk
> >> directly to Qemu (for the Qemu case). I envision at least two ways of
> >> doing that:
> >> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> >> - Hooking Qemu with CUSE and implementing the muser.ko interface
> >> 
> >> For the latter, libmuser would talk to a character device just like it
> >> talks to the vfio character device. We "just" need to implement that
> >> backend in Qemu. :)
> > 
> > What about:
> > * libmuser's API stays mostly unchanged but the library speaks a
> >   VFIO-over-UNIX domain sockets protocol instead of talking to
> >   mdev/vfio in the host kernel.
> 
> As I said above, there are advantages to the kernel model. The key one
> is transparent device emulation restarts. Today, muser.ko keeps the
> "device memory" internally in a prefix tree. Upon restart, a new
> device emulator can recover state (eg. from a state file in /dev/shm
> or similar) and remap the same memory that is already configured to
> the guest via Qemu. We have a pending work item for muser.ko to also
> keep the eventfds so we can recover those, too. Another advantage is
> working with any userspace driver and not requiring a VMM at all.
> 
> If done entirely in userspace, the device emulator needs to allocate
> the device memory somewhere that remains accessible (eg. tmpfs), with
> the difference that now we may be talking about non-trivial amounts of
> memory. Also, that may not be the kind of content you want lingering
> around the filesystem (for the same reasons Qemu unlinks memory files
> from /dev/hugepages after mmap'ing them).
> 
> That's why I'd prefer to rephrase what you said to "in addition"
> instead of "instead".
> 
> > * VMMs can implement this protocol directly for POSIX-portable and
> >   unprivileged operation.
> > * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
> >   still take advantage of libmuser devices.
> 
> I'm happy with that.
> We need to think the credential aspect through to ensure nodes can
> be created in the right places with the right privileges.
> 
> > 
> > Assuming this is feasible, would you lose any important
> > features/advantages of the muser.ko approach?  I don't know enough about
> > VFIO to identify any blocker or obvious performance problems.
> 
> That's what I elaborated on above: muser.ko can keep
> various metadata (and other resources) about the device in the kernel
> and grant it back to userspace as needed. There are ways around it,
> but it requires some orchestration with tmpfs and the VMM (only so
> much can be kept in tmpfs; the eventfds need to be retransmitted from
> the machine emulator on request).
> 
> Restarting is a critical aspect of this. One key use case for the
> project is to be able to emulate various devices from one process (for
> polling). That must be able to restart for upgrades or recovery.
> 
> > 
> > Regarding recovery, it seems straightforward to keep state in a tmpfs
> > file that can be reopened when the device is restarted.  I don't think
> > kernel code is necessary?
> 
> It adds a dependency, but isn't a show stopper. If we can work through
> permission issues, making sure the VMM can reconnect and retransmit
> eventfds and other state, then it should be ok.
> 
> To be clear: I'm very happy to have a userspace-only option for this,
> I just don't want to ditch the kernel module (yet, anyway). :)

If it doesn't create too large of a burden to support both, then I think
it is very desirable. IIUC, this is saying a kernel based solution is the
optimized/optimal solution, and a userspace UNIX socket based option is the
generic "works everywhere" fallback solution.



Regards,
Daniel
Stefan Hajnoczi Dec. 19, 2019, 1:36 p.m. UTC | #16
On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote:
> On 17/12/19 23:57, Felipe Franciosi wrote:
> > Doing it in userspace was the flow we proposed back in last year's KVM
> > Forum (Edinburgh), but it got turned down.
> 
> I think the time since then has shown that essentially the cat is out of
> the bag.  I didn't really like the idea of devices outside QEMU---and I
> still don't---but if something like "VFIO over AF_UNIX" turns out to be
> the cleanest way to implement multi-process QEMU device models, I am not
> going to pull an RMS and block that from happening.  Assuming I could
> even do so!

There are a range of approaches that will influence how out-of-process
devices can be licensed and distributed.

A VFIO-over-UNIX domain sockets approach means a stable API so that any
license (including proprietary) is possible.

Another approach is a QEMU-centric unstable protocol.  I'll call this
the qdev-over-UNIX domain sockets approach.  Maintaining an out-of-tree
device is expensive and ugly since the protocol changes between QEMU
versions in ways that are incompatible and undetectable.

On top of that, the initialization protocol message could include the
QEMU version string that the device was compiled against.  If the
version string doesn't match then QEMU will refuse to talk to the
device.
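
As a sketch, with an invented message layout and assuming the build-time
version string is available as QEMU_VERSION:

    /* Sketch of the proposed handshake: refuse devices compiled
     * against a different QEMU.  struct init_msg is invented. */
    #include <string.h>

    struct init_msg {
        char qemu_version[64];   /* sent by the device at connect time */
    };

    static int check_device_version(const struct init_msg *msg)
    {
        if (strncmp(msg->qemu_version, QEMU_VERSION,
                    sizeof(msg->qemu_version)) != 0) {
            return -1;   /* mismatch: refuse to talk to the device */
        }
        return 0;
    }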

Distributing a single device executable that works with many QEMUs (e.g.
CentOS, Ubuntu) and versions becomes difficult.

I want to mention that we have the option of doing this if there are
strong concerns about out-of-tree devices.  It does have downsides:
1. Inability to share devices with other VMMs.
2. Probably won't replace vhost-user due to the out-of-tree limitations.
3. Can still be circumvented by a motivated device author.

Stefan
Jag Raman Dec. 19, 2019, 4:40 p.m. UTC | #17
On 12/19/2019 7:33 AM, Felipe Franciosi wrote:
> Hello,
> 
> (I've added Jim and Ben from the SPDK team to the thread.)
> 
>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>>> Questions I've seen when discussing muser with people have been:
>>>>
>>>> 1. Can unprivileged containers create muser devices?  If not, this is a
>>>>   blocker for use cases that want to avoid root privileges entirely.
>>>
>>> Yes you can. Muser device creation follows the same process as general
>>> mdev device creation (ie. you write to a sysfs path). That creates an
>>> entry in /dev/vfio and the control plane can further drop privileges
>>> there (set selinux contexts, &c.)
>>
>> In this case there is still a privileged step during setup.  What about
>> completely unprivileged scenarios like a regular user without root or a
>> rootless container?
> 
> Oh, I see what you are saying. I suppose we need to investigate
> adjusting the privileges of the sysfs path correctly beforehand to
> allow devices to be created by non-root users. The credentials used on
> creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).
> 
> I need to look into that and get back to you.

As a prerequisite to using the "vfio-pci" device in QEMU, the user
assigns the PCI device on the host bus to the VFIO kernel driver by
writing to "/sys/bus/pci/drivers/vfio-pci/new_id" and
"/sys/bus/pci/drivers/vfio-pci/bind".

I believe a privileged control plane is required to perform these
prerequisite steps. Therefore, I wonder how rootless containers or
unprivileged users currently go about using a VFIO device with QEMU/KVM.
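
For reference, those prerequisite steps are plain sysfs writes, but they
fail with EACCES for unprivileged users (the 1af4:1000 ID and the BDF
below are just examples):

    /* Sketch of the setup described above: register a vendor:device
     * ID with vfio-pci so it claims matching devices. */
    #include <stdio.h>

    static int sysfs_write(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        /* new_id makes vfio-pci claim all matching devices */
        return sysfs_write("/sys/bus/pci/drivers/vfio-pci/new_id",
                           "1af4 1000") < 0;
        /* alternatively, bind one device explicitly:
         * sysfs_write("/sys/bus/pci/drivers/vfio-pci/bind", "0000:01:00.0");
         */
    }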

Thanks!
--
Jag

> 
>>
>>>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>>>>   security reasons)?  A similar library could be implemented in
>>>>   userspace along the lines of the vhost-user protocol.  Although VMMs
>>>>   would then need to use a new libmuser-client library instead of
>>>>   reusing their VFIO code to access the device.
>>>
>>> Doing it in userspace was the flow we proposed back in last year's KVM
>>> Forum (Edinburgh), but it got turned down. That's why we pursued the
>>> kernel approach, which turned out to have some advantages:
>>> - No changes needed to Qemu
>>> - No Qemu needed at all for userspace drivers
>>> - Device emulation process restart is trivial
>>>   (it therefore makes device code upgrades much easier)
>>>
>>> Having said that, nothing stops us from enhancing libmuser to talk
>>> directly to Qemu (for the Qemu case). I envision at least two ways of
>>> doing that:
>>> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
>>> - Hooking Qemu with CUSE and implementing the muser.ko interface
>>>
>>> For the latter, libmuser would talk to a character device just like it
>>> talks to the vfio character device. We "just" need to implement that
>>> backend in Qemu. :)
>>
>> What about:
>> * libmuser's API stays mostly unchanged but the library speaks a
>>    VFIO-over-UNIX domain sockets protocol instead of talking to
>>    mdev/vfio in the host kernel.
> 
> As I said above, there are advantages to the kernel model. The key one
> is transparent device emulation restarts. Today, muser.ko keeps the
> "device memory" internally in a prefix tree. Upon restart, a new
> device emulator can recover state (eg. from a state file in /dev/shm
> or similar) and remap the same memory that is already configured to
> the guest via Qemu. We have a pending work item for muser.ko to also
> keep the eventfds so we can recover those, too. Another advantage is
> working with any userspace driver and not requiring a VMM at all.
> 
> If done entirely in userspace, the device emulator needs to allocate
> the device memory somewhere that remains accessible (eg. tmpfs), with
> the difference that now we may be talking about non-trivial amounts of
> memory. Also, that may not be the kind of content you want lingering
> around the filesystem (for the same reasons Qemu unlinks memory files
> from /dev/hugepages after mmap'ing them).
> 
> That's why I'd prefer to rephrase what you said to "in addition"
> instead of "instead".
> 
>> * VMMs can implement this protocol directly for POSIX-portable and
>>    unprivileged operation.
>> * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
>>    still take advantage of libmuser devices.
> 
> I'm happy with that.
> We need to think the credential aspect through to ensure nodes can
> be created in the right places with the right privileges.
> 
>>
>> Assuming this is feasible, would you lose any important
>> features/advantages of the muser.ko approach?  I don't know enough about
>> VFIO to identify any blocker or obvious performance problems.
> 
> That's what I elaborated on above: muser.ko can keep
> various metadata (and other resources) about the device in the kernel
> and grant it back to userspace as needed. There are ways around it,
> but it requires some orchestration with tmpfs and the VMM (only so
> much can be kept in tmpfs; the eventfds need to be retransmitted from
> the machine emulator on request).
> 
> Restarting is a critical aspect of this. One key use case for the
> project is to be able to emulate various devices from one process (for
> polling). That must be able to restart for upgrades or recovery.
> 
>>
>> Regarding recovery, it seems straightforward to keep state in a tmpfs
>> file that can be reopened when the device is restarted.  I don't think
>> kernel code is necessary?
> 
> It adds a dependency, but isn't a show stopper. If we can work through
> permission issues, making sure the VMM can reconnect and retransmit
> eventfds and other state, then it should be ok.
> 
> To be clear: I'm very happy to have a userspace-only option for this,
> I just don't want to ditch the kernel module (yet, anyway). :)
> 
> F.
> 
>>
>> Stefan
>
Daniel P. Berrangé Dec. 19, 2019, 4:46 p.m. UTC | #18
On Thu, Dec 19, 2019 at 12:50:21PM +0000, Daniel P. Berrangé wrote:
> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > 
> > 
> > > On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > 
> > > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > >>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > >>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > >>>> Is there a work-in-progress muser patch series you can post to start the
> > >>>> discussion early?  That way we can avoid reviewers like myself asking
> > >>>> you to make changes after you have invested a lot of time.
> > >>>> 
> > >>> 
> > >>> Absolutely, that is our plan. At the moment we do not have the patches
> > >>> ready for review. We have set up an internal milestone and will be
> > >>> sending that early version as a tarball after we have it completed.
> > >>> Would a meeting also be something that could help us stay on the same
> > >>> page?
> > >> 
> > >> Please loop us in if you do set up a meeting.
> > > 
> > > There is a bi-weekly KVM Community Call that we can use for phone
> > > discussions:
> > > 
> > >  https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> > > 
> > > Or we can schedule a one-off call at any time :).
> > 
> > Sounds good either way, whenever it's needed.
> > 
> > > 
> > > Questions I've seen when discussing muser with people have been:
> > > 
> > > 1. Can unprivileged containers create muser devices?  If not, this is a
> > >   blocker for use cases that want to avoid root privileges entirely.
> > 
> > Yes you can. Muser device creation follows the same process as general
> > mdev device creation (ie. you write to a sysfs path). That creates an
> > entry in /dev/vfio and the control plane can further drop privileges
> > there (set selinux contexts, &c.)
> 
> This isn't what I'd describe / consider as unprivileged, as AFAICT
> although QEMU can use it unprivileged, this still requires a privileged
> management process to do the setup in sysfs.
> 
> I think it is desirable to be able to support a fully unprivileged
> model where there is nothing requiring elevated privileges, neither
> libvirtd nor QEMU.
> 
> I think this basically ends up at the same requirement as support
> for non-Linux hosts. We have to assume that some desirable deployment
> scenarios will not be able to use Linux kernel features, either because
> they lack privileges or because they involve non-Linux hosts.
> 
> > > 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> > >   security reasons)?  A similar library could be implemented in
> > >   userspace along the lines of the vhost-user protocol.  Although VMMs
> > >   would then need to use a new libmuser-client library instead of
> > >   reusing their VFIO code to access the device.
> > 
> > Doing it in userspace was the flow we proposed back in last year's KVM
> > Forum (Edinburgh), but it got turned down. That's why we pursued the
> > kernel approach, which turned out to have some advantages:
> > - No changes needed to Qemu
> > - No Qemu needed at all for userspace drivers
> > - Device emulation process restart is trivial
> >   (it therefore makes device code upgrades much easier)
> > 
> > Having said that, nothing stops us from enhancing libmuser to talk
> > directly to Qemu (for the Qemu case). I envision at least two ways of
> > doing that:
> > - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> 
> A UNIX socket, or localhost TCP socket, sounds most appealing from a
> portability POV.

Felipe reminded me on IRC that muser needs FD passing, so a TCP
localhost socket is not an option.  So a UNIX socket would give us
portability to any platform, except for Windows. It is not the
end of the world to lack Windows support.
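
For reference, the FD passing in question is the standard SCM_RIGHTS
ancillary-data mechanism, which only AF_UNIX supports.  The send side,
sketched:

    /* Sketch: pass a file descriptor over a UNIX domain socket.
     * This is the part a TCP socket cannot do. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }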

Regards,
Daniel
Stefan Hajnoczi Dec. 20, 2019, 9:47 a.m. UTC | #19
On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > To be clear: I'm very happy to have a userspace-only option for this,
> > I just don't want to ditch the kernel module (yet, anyway). :)
> 
> If it doesn't create too large of a burden to support both, then I think
> it is very desirable. IIUC, this is saying a kernel based solution is the
> optimized/optimal solution, and a userspace UNIX socket based option is the
> generic "works everywhere" fallback solution.

I'm slightly in favor of the kernel implementation because it keeps us
better aligned with VFIO.  That means solving problems in one place only
and less reinventing the wheel.

Knowing that a userspace implementation is possible is a plus though.
Maybe that option will become attractive in the future and someone will
develop it.  In fact, a userspace implementation may be a cool Google
Summer of Code project idea that I'd like to co-mentor.

Stefan
Paolo Bonzini Dec. 20, 2019, 9:50 a.m. UTC | #20
On 20/12/19 10:47, Stefan Hajnoczi wrote:
>> If it doesn't create too large of a burden to support both, then I think
>> it is very desirable. IIUC, this is saying a kernel based solution is the
>> optimized/optimal solution, and a userspace UNIX socket based option is the
>> generic "works everywhere" fallback solution.
> I'm slightly in favor of the kernel implementation because it keeps us
> better aligned with VFIO.  That means solving problems in one place only
> and less reinventing the wheel.

I think there are anyway going to be some differences with VFIO.

For example, currently VFIO requires pinning user memory.  Is that a
limitation for muser too?  If so, that would be a big disadvantage; if
not, however, management tools need to learn that muser devices unlike
other VFIO devices do not prevent overcommit.

Paolo
Daniel P. Berrangé Dec. 20, 2019, 10:22 a.m. UTC | #21
On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote:
> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
> > On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > > To be clear: I'm very happy to have a userspace-only option for this,
> > > I just don't want to ditch the kernel module (yet, anyway). :)
> > 
> > If it doesn't create too large of a burden to support both, then I think
> > it is very desirable. IIUC, this is saying a kernel based solution is the
> > optimized/optimal solution, and a userspace UNIX socket based option is the
> > generic "works everywhere" fallback solution.
> 
> I'm slightly in favor of the kernel implementation because it keeps us
> better aligned with VFIO.  That means solving problems in one place only
> and less reinventing the wheel.
> 
> Knowing that a userspace implementation is possible is a plus though.
> Maybe that option will become attractive in the future and someone will
> develop it.  In fact, a userspace implementation may be a cool Google
> Summer of Code project idea that I'd like to co-mentor.

If it is technically viable as an approach, then I think we should be
treating a fully unprivileged muser-over-UNIX socket as a higher priority
than just "maybe a GSoC student will want to do it".

Libvirt is getting a strong message from the KubeVirt project that they
want to be running both libvirtd and QEMU fully unprivileged. This allows
their containers to be unprivileged. Anything that requires privileges
means jumping through extra hoops, writing custom code in KubeVirt to do
things outside libvirt in side-loaded privileged containers, and this
limits where those features can be used.

Regards,
Daniel
Felipe Franciosi Dec. 20, 2019, 2:14 p.m. UTC | #22
> On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 20/12/19 10:47, Stefan Hajnoczi wrote:
>>> If it doesn't create too large of a burden to support both, then I think
>>> it is very desirable. IIUC, this is saying a kernel based solution is the
>>> optimized/optimal solution, and a userspace UNIX socket based option is the
>>> generic "works everywhere" fallback solution.
>> I'm slightly in favor of the kernel implementation because it keeps us
>> better aligned with VFIO.  That means solving problems in one place only
>> and less reinventing the wheel.
> 
> I think there are anyway going to be some differences with VFIO.
> 
> For example, currently VFIO requires pinning user memory.  Is that a
> limitation for muser too?  If so, that would be a big disadvantage; if
> not, however, management tools need to learn that muser devices unlike
> other VFIO devices do not prevent overcommit.

More or less. We pin them today, but I think we don't really have to.
I created an issue to look into it:
https://github.com/nutanix/muser/issues/28

In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that
has been ballooned out, then we would release it.

The reason we keep it pinned is to support libmuser restarts. IIRC,
VFIO doesn't need to pin pages for mdev devices (that's the job of the
mdev driver on the other end via vfio_pin_pages()). It only keeps the
DMA entries in an RB tree.

If my understanding is right, then we can probably just keep the map
Qemu registered (without holding the pages) and call vfio_pin_pages()
on demand when libmuser restarts.

For context, this is how the DMA memory registration works today (step 1
is sketched in code after the list):

1) Qemu calls ioctl(vfio_fd, VFIO_IOMMU_MAP_DMA, &vm_map);

2) The iommu driver notifies muser.ko

3) Muser.ko pins the pages (in get_dma_map(), called from below)
(https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)

4) Muser.ko notifies libmuser about the memory registration
(The iommu driver context goes to sleep, hence the pinning)

5) Libmuser wakes up and calls mmap() on muser.ko

6) Muser.ko inserts the VM memory in libmuser's context
(https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543)

7) Libmuser tells muser.ko that it's done

8) Muser.ko iommu callback context that was sleeping wakes up

9) Muser.ko places the memory in a "dma_list" for the mudev and returns.
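
A sketch of step 1 (addresses and sizes illustrative):

    /* Sketch: QEMU maps a guest RAM region for DMA through the VFIO
     * container fd, which kicks off steps 2-9 on the muser.ko side. */
    #include <stdint.h>
    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    static int map_guest_ram(int container_fd, void *vaddr,
                             uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)vaddr,
            .iova  = iova,
            .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }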

We could potentially modify the last step to unpin and keep only what
we need for a future call to vfio_pin_pages(), but I need to check if
that works.

Cheers,
Felipe

> 
> Paolo
>
Alex Williamson Dec. 20, 2019, 3:25 p.m. UTC | #23
On Fri, 20 Dec 2019 14:14:33 +0000
Felipe Franciosi <felipe@nutanix.com> wrote:

> > On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > On 20/12/19 10:47, Stefan Hajnoczi wrote:  
> >>> If it doesn't create too large of a burden to support both, then I think
> >>> it is very desirable. IIUC, this is saying a kernel based solution is the
> >>> optimized/optimal solution, and a userspace UNIX socket based option is the
> >>> generic "works everywhere" fallback solution.
> >> I'm slightly in favor of the kernel implementation because it keeps us
> >> better aligned with VFIO.  That means solving problems in one place only
> >> and less reinventing the wheel.  
> > 
> > I think there are anyway going to be some differences with VFIO.
> > 
> > For example, currently VFIO requires pinning user memory.  Is that a
> > limitation for muser too?  If so, that would be a big disadvantage; if
> > not, however, management tools need to learn that muser devices unlike
> > other VFIO devices do not prevent overcommit.  
> 
> More or less. We pin them today, but I think we don't really have to.
> I created an issue to look into it:
> https://github.com/nutanix/muser/issues/28
> 
> In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that
> has been ballooned out, then we would release it.

That's exactly the problem with ballooning and vfio, it doesn't unmap
memory, it just zaps it out of the VM, to be demand faulted back in
later.  It's very vCPU-centric.  Memory hotplug is the only case where
we'll see a memory region get unmapped.
 
> The reason we keep it pinned is to support libmuser restarts. IIRC,
> VFIO doesn't need to pin pages for mdev devices (that's the job of the
> mdev driver on the other end via vfio_pin_pages()). It only keeps the
> DMA entries in an RB tree.
> 
> If my understanding is right, then we can probably just keep the map
> Qemu registered (without holding the pages) and call vfio_pin_pages()
> on demand when libmuser restarts.
> 
> For context, this is how the DMA memory registration works today:
> 
> 1) Qemu calls ioctl(vfio_fd, VFIO_IOMMU_MAP_DMA, &vm_map);
> 
> 2) The iommu driver notifies muser.ko
> 
> 3) Muser.ko pins the pages (in get_dma_map(), called from below)
> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)

Yikes, it pins every page??  vfio_pin_pages() intends for the vendor
driver to be much smarter than this :-\  Thanks,

Alex
 
> 4) Muser.ko notifies libmuser about the memory registration
> (The iommu driver context goes to sleep, hence the pinning)
> 
> 5) Libmuser wakes up and calls mmap() on muser.ko
> 
> 6) Muser.ko inserts the VM memory in libmuser's context
> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543)
> 
> 7) Libmuser tells muser.ko that it's done
> 
> 8) Muser.ko iommu callback context that was sleeping wakes up
> 
> 9) Muser.ko places the memory in a "dma_list" for the mudev and returns.
> 
> We could potentially modify the last step to unpin and keep only what
> we need for a future call to vfio_pin_pages(), but I need to check if
> that works.
> 
> Cheers,
> Felipe
> 
> > 
> > Paolo
> >   
> 
>
Felipe Franciosi Dec. 20, 2019, 4 p.m. UTC | #24
Heya,

> On Dec 20, 2019, at 3:25 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Fri, 20 Dec 2019 14:14:33 +0000
> Felipe Franciosi <felipe@nutanix.com> wrote:
> 
>>> On Dec 20, 2019, at 9:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> 
>>> On 20/12/19 10:47, Stefan Hajnoczi wrote:  
>>>>> If it doesn't create too large of a burden to support both, then I think
>>>>> it is very desirable. IIUC, this is saying a kernel based solution is the
>>>>> optimized/optimal solution, and a userspace UNIX socket based option is the
>>>>> generic "works everywhere" fallback solution.
>>>> I'm slightly in favor of the kernel implementation because it keeps us
>>>> better aligned with VFIO.  That means solving problems in one place only
>>>> and less reinventing the wheel.  
>>> 
>>> I think there are anyway going to be some differences with VFIO.
>>> 
>>> For example, currently VFIO requires pinning user memory.  Is that a
>>> limitation for muser too?  If so, that would be a big disadvantage; if
>>> not, however, management tools need to learn that muser devices unlike
>>> other VFIO devices do not prevent overcommit.  
>> 
>> More or less. We pin them today, but I think we don't really have to.
>> I created an issue to look into it:
>> https://github.com/nutanix/muser/issues/28 
>> 
>> In any case, if Qemu is ballooning and calls UNMAP_DMA for memory that
>> has been ballooned out, then we would release it.
> 
> That's exactly the problem with ballooning and vfio, it doesn't unmap
> memory, it just zaps it out of the VM, to be demand faulted back in
> later.  It's very vCPU-centric.  Memory hotplug is the only case where
> we'll see a memory region get unmapped.
> 
>> The reason we keep it pinned is to support libmuser restarts. IIRC,
>> VFIO doesn't need to pin pages for mdev devices (that's the job of the
>> mdev driver on the other end via vfio_pin_pages()). It only keeps the
>> DMA entries in an RB tree.
>> 
>> If my understanding is right, then we can probably just keep the map
>> Qemu registered (without holding the pages) and call vfio_pin_pages()
>> on demand when libmuser restarts.
>> 
>> For context, this is how the DMA memory registration works today:
>> 
>> 1) Qemu calls ioctl(vfio_fd, VFIO_IOMMU_MAP_DMA, &vm_map);
>> 
>> 2) The iommu driver notifies muser.ko
>> 
>> 3) Muser.ko pins the pages (in get_dma_map(), called from below)
>> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)
> 
> Yikes, it pins every page??  vfio_pin_pages() intends for the vendor
> driver to be much smarter than this :-\  Thanks,

We can't afford a kernel round trip every time we need to translate
GPAs, so that's how we solved it. There's an action item to pin in
groups of 512 (which is the limit we saw in vfio_pin_pages()). Can you
elaborate on the problems of the approach and whether there's
something better we can do?
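
For reference, batching at that limit would look something like this on
the muser.ko side (a sketch; it assumes the vfio_pin_pages() prototype of
contemporary kernels, where 512 matches VFIO_PIN_PAGES_MAX_ENTRIES with
4K pages):

    /* Kernel-side sketch: pin a user PFN range in batches instead of
     * one huge call. */
    #include <linux/device.h>
    #include <linux/errno.h>
    #include <linux/iommu.h>
    #include <linux/kernel.h>
    #include <linux/vfio.h>

    #define PIN_BATCH 512

    static int pin_range(struct device *dev, unsigned long *user_pfn,
                         unsigned long *phys_pfn, long npage)
    {
        long done = 0;

        while (done < npage) {
            int batch = min_t(long, npage - done, PIN_BATCH);
            int ret = vfio_pin_pages(dev, user_pfn + done, batch,
                                     IOMMU_READ | IOMMU_WRITE,
                                     phys_pfn + done);

            if (ret <= 0)
                return ret ? ret : -EFAULT;
            done += ret;
        }
        return 0;
    }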

F.

> 
> Alex
> 
>> 4) Muser.ko notifies libmuser about the memory registration
>> (The iommu driver context goes to sleep, hence the pinning)
>> 
>> 5) Libmuser wakes up and calls mmap() on muser.ko
>> 
>> 6) Muser.ko inserts the VM memory in libmuser's context
>> (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L543)
>> 
>> 7) Libmuser tells muser.ko that it's done
>> 
>> 8) Muser.ko iommu callback context that was sleeping wakes up
>> 
>> 9) Muser.ko places the memory in a "dma_list" for the mudev and returns.
>> 
>> We could potentially modify the last step to unpin and keep only what
>> we need for a future call to vfio_pin_pages(), but I need to check if
>> that works.
>> 
>> Cheers,
>> Felipe
>> 
>>> 
>>> Paolo
John Johnson Dec. 20, 2019, 5:15 p.m. UTC | #25
> On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote:
>> On 17/12/19 23:57, Felipe Franciosi wrote:
>>> Doing it in userspace was the flow we proposed back in last year's KVM
>>> Forum (Edinburgh), but it got turned down.
>> 
>> I think the time since then has shown that essentially the cat is out of
>> the bag.  I didn't really like the idea of devices outside QEMU---and I
>> still don't---but if something like "VFIO over AF_UNIX" turns out to be
>> the cleanest way to implement multi-process QEMU device models, I am not
>> going to pull an RMS and block that from happening.  Assuming I could
>> even do so!
> 
> There are a range of approaches that will influence how out-of-process
> devices can be licensed and distributed.
> 
> A VFIO-over-UNIX domain sockets approach means a stable API so that any
> license (including proprietary) is possible.
> 
> Another approach is a QEMU-centric unstable protocol.  I'll call this
> the qdev-over-UNIX domain sockets approach.  Maintaining an out-of-tree
> device is expensive and ugly since the protocol changes between QEMU
> versions in ways that are incompatible and undetectable.
> 
> On top of that, the initialization protocol message could include the
> QEMU version string that the device was compiled against.  If the
> version string doesn't match then QEMU will refuse to talk to the
> device.
> 

	This is very similar to our multi-process QEMU implementation before
we looked into using muser.  The differences are:

We use one object per emulated device type in QEMU rather than having a single
VFIO type that can masquerade as any PCI device.

We don’t pin guest memory; we pass the QEMU file descriptors used to create
guest memory to the emulation program, and it mmap()s them itself (ala
vhost-user).
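
The receive side of that flow, sketched (region size and offset would
come from the protocol messages, much as vhost-user does it):

    /* Sketch: the emulation program receives a guest-memory fd over
     * the UNIX socket and mmap()s it; no page pinning involved. */
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    static void *recv_and_map(int sock, size_t size, off_t offset)
    {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;
        void *p;
        int fd;

        if (recvmsg(sock, &msg, 0) != 1)
            return NULL;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
            return NULL;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
        return p == MAP_FAILED ? NULL : p;
    }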

								JJ



> Distributing a single device executable that works with many QEMUs (e.g.
> CentOS, Ubuntu) and versions becomes difficult.
> 
> I want to mention that we have the option of doing this if there are
> strong concerns about out-of-tree devices.  It does have downsides:
> 1. Inability to share devices with other VMMs.
> 2. Probably won't replace vhost-user due to the out-of-tree limitations.
> 3. Can still be circumvented by a motivated device author.
> 
> Stefan
Stefan Hajnoczi Jan. 2, 2020, 10 a.m. UTC | #26
On Fri, Dec 20, 2019 at 09:15:40AM -0800, John G Johnson wrote:
> > On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote:
> >> On 17/12/19 23:57, Felipe Franciosi wrote:
> We don’t pin guest memory; we pass the QEMU file descriptors used to create
> guest memory to the emulation program, and it mmap()s them itself (ala
> vhost-user).

Does muser really require pinning?  If yes, then it seems like a
limitation that can be removed in the future.

Stefan
Stefan Hajnoczi Jan. 2, 2020, 10:04 a.m. UTC | #27
On Fri, Dec 20, 2019 at 09:15:40AM -0800, John G Johnson wrote:
> > On Dec 19, 2019, at 5:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Wed, Dec 18, 2019 at 01:00:55AM +0100, Paolo Bonzini wrote:
> >> On 17/12/19 23:57, Felipe Franciosi wrote:
> We don’t pin guest memory; we pass the QEMU file descriptors used to create
> guest memory to the emulation program, and it mmap()s them itself (ala
> vhost-user).

Please ignore my reply.  I just saw pinning was discussed in another
sub-thread.  Felipe posted this URL for tracking the issue:
https://github.com/nutanix/muser/issues/28

Stefan
Stefan Hajnoczi Jan. 2, 2020, 10:42 a.m. UTC | #28
On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote:
> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote:
> > On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
> > > On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> > > > > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> > > > >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > > > >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > > > >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > > > To be clear: I'm very happy to have a userspace-only option for this,
> > > > I just don't want to ditch the kernel module (yet, anyway). :)
> > > 
> > > If it doesn't create too large of a burden to support both, then I think
> > > it is very desirable. IIUC, this is saying a kernel based solution is the
> > > optimized/optimal solution, and a userspace UNIX socket based option is the
> > > generic "works everywhere" fallback solution.
> > 
> > I'm slightly in favor of the kernel implementation because it keeps us
> > better aligned with VFIO.  That means solving problems in one place only
> > and less reinventing the wheel.
> > 
> > Knowing that a userspace implementation is possible is a plus though.
> > Maybe that option will become attractive in the future and someone will
> > develop it.  In fact, a userspace implementation may be a cool Google
> > Summer of Code project idea that I'd like to co-mentor.
> 
> If it is technically viable as an approach, then I think we should be
> treating a fully unprivileged muser-over-UNIX socket as a higher priority
> than just "maybe a GSoC student will want to do it".
> 
> Libvirt is getting a strong message from the KubeVirt project that they
> want to be running both libvirtd and QEMU fully unprivileged. This allows
> their containers to be unprivileged. Anything that requires privileges
> means jumping through extra hoops, writing custom code in KubeVirt to do
> things outside libvirt in side-loaded privileged containers, and this
> limits where those features can be used.

Okay this makes sense.

There needs to be a consensus on whether to go with a qdev-over-socket
approach that is QEMU-specific and strongly discourages third-party
device distribution or a muser-over-socket approach that offers a stable
API for VMM interoperability and third-party device distribution.

Interoperability between VMMs and also DPDK/SPDK is important because
they form today's open source virtualization community.  No one project
or codebase covers all use cases or interesting developments.  If we are
short-sighted and prevent collaboration then we'll become isolated.

On the other hand, I'm personally opposed to proprietary vendors that
contribute very little to open source.  We make that easier by offering
a stable API for third-party devices.  A stable API discourages open
source contributions while allowing proprietary vendors to benefit from
the work that the open source community is doing.

One way to choose a position is to balance up the open source vs
proprietary applications of a stable API.  At this point in time I think
the DPDK/SPDK and rust-vmm communities bring enough to the table that
it's worth fostering collaboration through a stable API.  The benefit of
having the stable API is large enough that the disadvantage of making
life easier for proprietary vendors can be accepted.

This is just a more elaborate explanation for the "the cat is out of the
bag" comments that have already been made on licensing.  Does anyone
still disagree or want to discuss further?

If there is agreement that a stable API is okay then I think the
practical way to do this is to first merge a cleaned-up version of
multi-process QEMU as an unstable experimental API.  Once it's being
tested and used we can write a protocol specification and publish it as
a stable interface when the spec has addressed most use cases.

Does this sound good?

Stefan
Felipe Franciosi Jan. 2, 2020, 11:03 a.m. UTC | #29
> On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote:
>> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote:
>>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
>>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
>>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
>>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>>>> To be clear: I'm very happy to have a userspace-only option for this,
>>>>> I just don't want to ditch the kernel module (yet, anyway). :)
>>>> 
>>>> If it doesn't create too large of a burden to support both, then I think
>>>> it is very desirable. IIUC, this is saying a kernel based solution is the
>>>> optimized/optimal solution, and a userspace UNIX socket based option is the
>>>> generic "works everywhere" fallback solution.
>>> 
>>> I'm slightly in favor of the kernel implementation because it keeps us
>>> better aligned with VFIO.  That means solving problems in one place only
>>> and less reinventing the wheel.
>>> 
>>> Knowing that a userspace implementation is possible is a plus though.
>>> Maybe that option will become attractive in the future and someone will
>>> develop it.  In fact, a userspace implementation may be a cool Google
>>> Summer of Code project idea that I'd like to co-mentor.
>> 
>> If it is technically viable as an approach, then I think we should be
>> treating a fully unprivileged muser-over-UNIX socket as a higher priority
>> than just "maybe a GSoC student will want to do it".
>> 
>> Libvirt is getting a strong message from the KubeVirt project that they
>> want to be running both libvirtd and QEMU fully unprivileged. This allows
>> their containers to be unprivileged. Anything that requires privileges
>> means jumping through extra hoops, writing custom code in KubeVirt to do
>> things outside libvirt in side-loaded privileged containers, and this
>> limits how and where those features can be used.
> 
> Okay this makes sense.
> 
> There needs to be a consensus on whether to go with a qdev-over-socket
> approach that is QEMU-specific and strongly discourages third-party
> device distribution or a muser-over-socket approach that offers a stable
> API for VMM interoperability and third-party device distribution.

The reason I dislike yet another offloading protocol (i.e. there is
vhost, there is vfio, and then there would be qdev-over-socket) is
that we keep reinventing the wheel. I very much prefer picking
something solid (e.g. VFIO) and continuing to invest in it.

> Interoperability between VMMs and also DPDK/SPDK is important because
> they form today's open source virtualization community.  No one project
> or codebase covers all use cases or interesting developments.  If we are
> short-sighted and prevent collaboration then we'll become isolated.
> 
> On the other hand, I'm personally opposed to proprietary vendors that
> contribute very little to open source.  We make life easier for them by
> offering a stable API for third-party devices.  A stable API discourages
> open source contributions while allowing proprietary vendors to benefit
> from the work that the open source community is doing.

I appreciate the concern. However, my opinion is that vendors cannot
be stopped by providing them with unstable APIs. There are plenty of
examples where projects were forked and maintained separately to keep
certain things under control, and that is bad for everyone: the
community doesn't get contributions back, and vendors have the extra
pain of maintaining the forks. Furthermore, service vendors will always
get away with murder by copying whatever they like and using it however
they please (since they are not sharing the software).

I would rather look at examples like KVM. It's a relatively stable API
with several proprietary users. Nevertheless, we see loads of
contributions to it (perhaps less than we would want, but plenty).

> 
> One way to choose a position is to weigh up the open source vs
> proprietary applications of a stable API.  At this point in time I think
> the DPDK/SPDK and rust-vmm communities bring enough to the table that
> it's worth fostering collaboration through a stable API.  The benefit of
> having the stable API is large enough that the disadvantage of making
> life easier for proprietary vendors can be accepted.

I agree with you as per reasoning above.

> 
> This is just a more elaborate explanation for the "the cat is out of the
> bag" comments that have already been made on licensing.  Does anyone
> still disagree or want to discuss further?
> 
> If there is agreement that a stable API is okay then I think the
> practical way to do this is to first merge a cleaned-up version of
> multi-process QEMU as an unstable experimental API.  Once it's being
> tested and used we can write a protocol specification and publish it as
> a stable interface when the spec has addressed most use cases.
> 
> Does this sound good?

In that case, wouldn't it be preferable to revive our proposal from
Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
code to "common" and added a "user" backend underneath it, similar to
how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
added vhost-user-scsi. It was PCI-centric, but it doesn't have to be.
The other side can be implemented in libmuser to facilitate things.

I even recall highlighting that vhost-user could be moved underneath
that later, greatly simplifying lots of other Qemu code.

F.


> 
> Stefan
Elena Ufimtseva Jan. 2, 2020, 4:01 p.m. UTC | #30
On Tue, Dec 17, 2019 at 04:33:16PM +0000, Stefan Hajnoczi wrote:
> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > >> Is there a work-in-progress muser patch series you can post to start the
> > >> discussion early?  That way we can avoid reviewers like myself asking
> > >> you to make changes after you have invested a lot of time.
> > >> 
> > > 
> > > Absolutely, that is our plan. At the moment we do not have the patches
> > > ready for review. We have set up an internal milestone and will be
> > > sending that early version as a tarball after we have it completed.
> > > Would a meeting also be something that could help us stay on the same
> > > page?
> > 
> > Please loop us in if you do set up a meeting.
>

Hi Stefan

And happy New Year to everyone!

> There is a bi-weekly KVM Community Call that we can use for phone
> discussions:
> 
>   https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>

Our team would like to join the call on Jan 14 and maybe talk over a few things.
Felipe, will you and your team be joining as well?

> Or we can schedule a one-off call at any time :).
>

Awesome! Thank you, we will definitely make use of this opportunity.

Elena

> Questions I've seen when discussing muser with people have been:
> 
> 1. Can unprivileged containers create muser devices?  If not, this is a
>    blocker for use cases that want to avoid root privileges entirely.
> 
> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
>    security reasons)?  A similar library could be implemented in
>    userspace along the lines of the vhost-user protocol.  Although VMMs
>    would then need to use a new libmuser-client library instead of
>    reusing their VFIO code to access the device.
> 
> 3. Should this feature be Linux-only?  vhost-user can be implemented on
>    non-Linux OSes...
> 
> Stefan
Marc-André Lureau Jan. 2, 2020, 6:55 p.m. UTC | #31
Hi

On Thu, Jan 2, 2020 at 3:03 PM Felipe Franciosi <felipe@nutanix.com> wrote:
> The reason I dislike yet another offloading protocol (ie. there is
> vhost, there is vfio, and then there would be qdev-over-socket) is
> that we keep reinventing the wheel. I very much prefer picking
> something solid (eg. VFIO) and keep investing on it.

I don't have a lot of experience with VFIO, so I can't tell if it's
really solid for the user-space case. Alex W could probably discuss
that.

> In that case, wouldn't it be preferable to revive our proposal from
> Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
> code to "common" and added a "user" backend underneath it, similar to
> how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
> added vhost-user-scsi. It was PCI-centric, but it doesn't have to be.
> The other side can be implemented in libmuser to facilitate things.

The same idea came up back at KVM Forum 2017 (briefly mentioned at the
end of our talk with Konrad).

The PoC is still around:
https://github.com/elmarco/qemu/tree/wip/vfio-user/contrib/libvfio-user

> I even recall highlighting that vhost-user could be moved underneath
> that later, greatly simplifying lots of other Qemu code.

That would eventually be an option, but vhost-user is already quite
complicated. We could try to split it up somehow for the non-virtio
parts.

cheers
Stefan Hajnoczi Jan. 3, 2020, 3 p.m. UTC | #32
On Thu, Jan 02, 2020 at 08:01:36AM -0800, Elena Ufimtseva wrote:
> On Tue, Dec 17, 2019 at 04:33:16PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> > > > On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> > > > On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> > > >> Is there a work-in-progress muser patch series you can post to start the
> > > >> discussion early?  That way we can avoid reviewers like myself asking
> > > >> you to make changes after you have invested a lot of time.
> > > >> 
> > > > 
> > > > Absolutely, that is our plan. At the moment we do not have the patches
> > > > ready for review. We have set up an internal milestone and will be
> > > > sending that early version as a tarball after we have it completed.
> > > > Would a meeting also be something that could help us stay on the same
> > > > page?
> > > 
> > > Please loop us in if you do set up a meeting.
> >
> 
> Hi Stefan
> 
> And happy New Year to everyone!
> 
> > There is a bi-weekly KVM Community Call that we can use for phone
> > discussions:
> > 
> >   https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> >
> 
> Our team would like to join the call on Jan 14 and maybe talk over a few things.
> Felipe, will you and your team be joining as well?

Great, I'll be there.

Stefan
Stefan Hajnoczi Jan. 3, 2020, 3:59 p.m. UTC | #33
On Thu, Jan 02, 2020 at 11:03:22AM +0000, Felipe Franciosi wrote:
> > On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote:
> >> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote:
> >>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
> >>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> >>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> >>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
> >>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >>>>> To be clear: I'm very happy to have a userspace-only option for this,
> >>>>> I just don't want to ditch the kernel module (yet, anyway). :)
> >>>> 
> >>>> If it doesn't create too large a burden to support both, then I think
> >>>> it is very desirable. IIUC, this positions a kernel-based solution as
> >>>> the optimized/optimal path, and a userspace UNIX-socket-based option as
> >>>> the generic "works everywhere" fallback.
> >>> 
> >>> I'm slightly in favor of the kernel implementation because it keeps us
> >>> better aligned with VFIO.  That means solving problems in one place only
> >>> and less reinventing the wheel.
> >>> 
> >>> Knowing that a userspace implementation is possible is a plus though.
> >>> Maybe that option will become attractive in the future and someone will
> >>> develop it.  In fact, a userspace implementation may be a cool Google
> >>> Summer of Code project idea that I'd like to co-mentor.
> >> 
> >> If it is technically viable as an approach, then I think we should be
> >> treating a fully unprivileged muser-over-UNIX socket as a higher priority
> >> than just "maybe a GSoC student will want to do it".
> >> 
> >> Libvirt is getting a strong message from the KubeVirt project that they
> >> want to be running both libvirtd and QEMU fully unprivileged. This allows
> >> their containers to be unprivileged. Anything that requires privileges
> >> means jumping through extra hoops, writing custom code in KubeVirt to do
> >> things outside libvirt in side-loaded privileged containers, and this
> >> limits how and where those features can be used.
> > 
> > Okay this makes sense.
> > 
> > There needs to be a consensus on whether to go with a qdev-over-socket
> > approach that is QEMU-specific and strongly discourages third-party
> > device distribution or a muser-over-socket approach that offers a stable
> > API for VMM interoperability and third-party device distribution.
> 
> The reason I dislike yet another offloading protocol (i.e. there is
> vhost, there is vfio, and then there would be qdev-over-socket) is
> that we keep reinventing the wheel. I very much prefer picking
> something solid (e.g. VFIO) and continuing to invest in it.

I like the idea of sticking close to VFIO too.  The first step is
figuring out whether VFIO can be mapped to a UNIX domain socket protocol
and how many non-VFIO protocol messages are required.  Hopefully that
extra non-VFIO stuff isn't too large.

If implementations can use the kernel uapi vfio header files then we're
on track for compatibility with VFIO.
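
For illustration, here is a rough sketch of how a socket message could
embed the unmodified uapi type; the framing header and names below are
invented, only struct vfio_region_info comes from <linux/vfio.h>:

#include <linux/vfio.h>
#include <stdint.h>

/* Hypothetical framing, not an existing protocol definition. */
typedef struct {
    uint32_t command;    /* e.g. a socket analogue of
                          * VFIO_DEVICE_GET_REGION_INFO */
    uint32_t len;        /* payload bytes that follow this header */
} MPMsgHdr;

typedef struct {
    MPMsgHdr hdr;
    struct vfio_region_info info;    /* unmodified kernel uapi struct */
} MPRegionInfoMsg;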

> > This is just a more elaborate explanation for the "the cat is out of the
> > bag" comments that have already been made on licensing.  Does anyone
> > still disagree or want to discuss further?
> > 
> > If there is agreement that a stable API is okay then I think the
> > practical way to do this is to first merge a cleaned-up version of
> > multi-process QEMU as an unstable experimental API.  Once it's being
> > tested and used we can write a protocol specification and publish it as
> > a stable interface when the spec has addressed most use cases.
> > 
> > Does this sound good?
> 
> In that case, wouldn't it be preferable to revive our proposal from
> Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
> code to "common" and added a "user" backend underneath it, similar to
> how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
> added vhost-user-scsi. It was PCI-centric, but it doesn't have to be.
> The other side can be implemented in libmuser to facilitate things.

That sounds good.

Stefan
Stefan Hajnoczi Jan. 8, 2020, 4:31 p.m. UTC | #34
On Thu, Jan 02, 2020 at 10:55:46PM +0400, Marc-André Lureau wrote:
> On Thu, Jan 2, 2020 at 3:03 PM Felipe Franciosi <felipe@nutanix.com> wrote:
> > I even recall highlighting that vhost-user could be moved underneath
> > that later, greatly simplifying lots of other Qemu code.
> 
> That would eventually be an option, but vhost-user is already quite
> complicated. We could try to split it up somehow for the non-virtio
> parts.

I hope we can deprecate vhost-user.  New out-of-process devices should
just implement VFIO-over-socket with virtio-pci.  This way
out-of-process devices are full VIRTIO devices, and it's cleaner than
keeping the vhost-user protocol, which exposes only a subset of VIRTIO.

Stefan
John Johnson Jan. 14, 2020, 1:56 a.m. UTC | #35
> On Jan 3, 2020, at 7:59 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Thu, Jan 02, 2020 at 11:03:22AM +0000, Felipe Franciosi wrote:
>>> On Jan 2, 2020, at 10:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Fri, Dec 20, 2019 at 10:22:37AM +0000, Daniel P. Berrangé wrote:
>>>> On Fri, Dec 20, 2019 at 09:47:12AM +0000, Stefan Hajnoczi wrote:
>>>>> On Thu, Dec 19, 2019 at 12:55:04PM +0000, Daniel P. Berrangé wrote:
>>>>>> On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
>>>>>>>> On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>> On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
>>>>>>>>>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
>>>>>>>>>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote:
>>>>>>>>>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
>>>>>>> To be clear: I'm very happy to have a userspace-only option for this,
>>>>>>> I just don't want to ditch the kernel module (yet, anyway). :)
>>>>>> 
>>>>>> If it doesn't create too large a burden to support both, then I think
>>>>>> it is very desirable. IIUC, this positions a kernel-based solution as
>>>>>> the optimized/optimal path, and a userspace UNIX-socket-based option as
>>>>>> the generic "works everywhere" fallback.
>>>>> 
>>>>> I'm slightly in favor of the kernel implementation because it keeps us
>>>>> better aligned with VFIO.  That means solving problems in one place only
>>>>> and less reinventing the wheel.
>>>>> 
>>>>> Knowing that a userspace implementation is possible is a plus though.
>>>>> Maybe that option will become attractive in the future and someone will
>>>>> develop it.  In fact, a userspace implementation may be a cool Google
>>>>> Summer of Code project idea that I'd like to co-mentor.
>>>> 
>>>> If it is technically viable as an approach, then I think we should be
>>>> treating a fully unprivileged muser-over-UNIX socket as a higher priority
>>>> than just "maybe a GSoC student will want to do it".
>>>> 
>>>> Libvirt is getting a strong message from the KubeVirt project that they
>>>> want to be running both libvirtd and QEMU fully unprivileged. This allows
>>>> their containers to be unprivileged. Anything that requires privileges
>>>> means jumping through extra hoops, writing custom code in KubeVirt to do
>>>> things outside libvirt in side-loaded privileged containers, and this
>>>> limits how and where those features can be used.
>>> 
>>> Okay this makes sense.
>>> 
>>> There needs to be a consensus on whether to go with a qdev-over-socket
>>> approach that is QEMU-specific and strongly discourages third-party
>>> device distribution or a muser-over-socket approach that offers a stable
>>> API for VMM interoperability and third-party device distribution.
>> 
>> The reason I dislike yet another offloading protocol (i.e. there is
>> vhost, there is vfio, and then there would be qdev-over-socket) is
>> that we keep reinventing the wheel. I very much prefer picking
>> something solid (e.g. VFIO) and continuing to invest in it.
> 
> I like the idea of sticking close to VFIO too.  The first step is
> figuring out whether VFIO can be mapped to a UNIX domain socket protocol
> and how many non-VFIO protocol messages are required.  Hopefully that
> extra non-VFIO stuff isn't too large.
> 


	I looked at this and I think we could map VFIO commands over a
UNIX socket without much difficulty.  We'd have to use SCM
messages to pass file descriptors from the QEMU process to the
emulation process for certain operations, but that shouldn't be
a big problem.  Here are the mission-mode operations:

configuration

	VFIO defines a number of configuration ioctl()s that we could
turn into messages, but if we make the protocol specific to PCI, then
all of the information they transmit (e.g., device regions and
interrupts) can be discovered by parsing the device's PCI config
space.  A lot of the current VFIO code that parses config space could
be re-used to do this.
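
As a concrete example, region sizes can be recovered with the classic
BAR-sizing sequence.  This is only a sketch: cfg_read32()/cfg_write32()
are invented stand-ins for config accesses that would travel over the
socket (backed by a local array here just so the sketch compiles):

#include <stdint.h>
#include <string.h>

static uint8_t cfg_space[256];    /* stand-in for the remote config space */

static uint32_t cfg_read32(uint16_t off)
{
    uint32_t v;
    memcpy(&v, &cfg_space[off], sizeof(v));
    return v;
}

static void cfg_write32(uint16_t off, uint32_t val)
{
    memcpy(&cfg_space[off], &val, sizeof(val));
}

/* Classic BAR sizing: save, write all-ones, read back, restore. */
static uint64_t bar_size(int bar)    /* bar: 0..5, 32-bit BARs only */
{
    uint16_t off = 0x10 + bar * 4;   /* BARs live at config 0x10..0x24 */
    uint32_t saved = cfg_read32(off);
    uint32_t mask;

    cfg_write32(off, 0xffffffff);
    mask = cfg_read32(off);
    cfg_write32(off, saved);

    if (mask & 1) {                  /* I/O space BAR */
        return (~(mask & ~0x3u) + 1) & 0xffff;
    }
    return ~(mask & ~0xfu) + 1;      /* 32-bit memory BAR */
}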

MMIO

	VFIO uses reads and writes on the VFIO file descriptor to
perform MMIOs to the device.  The read/write offset encodes the VFIO
region and the offset of the MMIO (the VFIO regions correspond to PCI
BARs).  These would have to be changed to send messages that include
the VFIO region and offset (and, for writes, the data) to the
emulation process.
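
Purely as an illustration, an MMIO request on the wire could look
something like this (the layout and names are made up, not a proposal
of record):

#include <stdint.h>

enum {
    MP_CMD_REGION_READ  = 1,    /* guest read from a device region */
    MP_CMD_REGION_WRITE = 2,    /* guest write to a device region  */
};

typedef struct {
    uint16_t command;    /* MP_CMD_REGION_READ or MP_CMD_REGION_WRITE */
    uint16_t region;     /* VFIO region index (maps to a PCI BAR)     */
    uint32_t size;       /* access size in bytes: 1, 2, 4 or 8        */
    uint64_t offset;     /* offset of the access within the region    */
    uint64_t data;       /* payload for writes, result for reads      */
} MPIORequest;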

interrupts

	VFIO creates eventfds that are sent to the kernel driver so it
can inject interrupts into a guest.  We would have to send these
eventfds over the socket to the emulation process using SCM messages.
The emulation process could then trigger interrupts by writing to the
eventfd.
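
A minimal sketch of the sending side, assuming 'sock' is the connected
AF_UNIX socket and 'irqfd' is an eventfd set up for one interrupt
vector:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_irqfd(int sock, int irqfd)
{
    char payload = 'I';    /* SCM_RIGHTS needs at least one data byte */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;    /* forces correct alignment of buf */
    } u = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &irqfd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}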

DMA

	This is one place where I might diverge from VFIO.  It uses an
ioctl to tell the kernel driver what areas of guest memory the device
can address.  The driver then pins that memory so it can be programmed
into a HW IOMMU.  We could avoid pinning guest memory by adopting
the vhost-user idea of sending the file descriptors QEMU used to
create guest memory to the emulation process, and having it mmap()
guest memory itself.  IOMMUs are handled by having the emulation
process request device-DMA-to-guest-PA translations from QEMU.
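
On the emulation side that could look roughly like the sketch below;
the table format and field names are invented for illustration, and
the fds would arrive via SCM_RIGHTS as described above:

#include <stdint.h>
#include <sys/mman.h>

typedef struct {
    uint64_t guest_phys_addr;   /* GPA where this RAM region starts */
    uint64_t size;              /* length of the region             */
    uint64_t fd_offset;         /* offset within the passed fd      */
    int      fd;                /* received over the socket         */
    void    *hva;               /* where we mapped it locally       */
} GuestRAMRegion;

/* Map one region of guest RAM into the emulation process. */
static int map_guest_ram(GuestRAMRegion *r)
{
    r->hva = mmap(NULL, r->size, PROT_READ | PROT_WRITE,
                  MAP_SHARED, r->fd, r->fd_offset);
    return r->hva == MAP_FAILED ? -1 : 0;
}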



> If implementations can use the kernel uapi vfio header files then we're
> on track for compatibility with VFIO.
> 
>>> This is just a more elaborate explanation for the "the cat is out of the
>>> bag" comments that have already been made on licensing.  Does anyone
>>> still disagree or want to discuss further?
>>> 
>>> If there is agreement that a stable API is okay then I think the
>>> practical way to do this is to first merge a cleaned-up version of
>>> multi-process QEMU as an unstable experimental API.  Once it's being
>>> tested and used we can write a protocol specification and publish it as
>>> a stable interface when the spec has addressed most use cases.
>>> 
>>> Does this sound good?
>> 
>> In that case, wouldn't it be preferable to revive our proposal from
>> Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
>> code to "common" and added a "user" backend underneath it, similar to
>> how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
>> added vhost-user-scsi. It was PCI-centric, but it doesn't have to be.
>> The other side can be implemented in libmuser to facilitate things.
> 
> That sounds good.
> 

       The emulation program API could be based on the current
libmuser API or the libvfio-user API.  The protocol itself wouldn’t
care which is chosen.  Our multi-process QEMU project would have to
move device specification from the QEMU command line to the
emulation process command line.

							JJ
Dr. David Alan Gilbert Jan. 17, 2020, 5:25 p.m. UTC | #36
* John G Johnson (john.g.johnson@oracle.com) wrote:

<snip>

> DMA
> 
> 	This is one place where I might diverge from VFIO.  It uses an
> ioctl to tell the kernel driver what areas of guest memory the device
> can address.  The driver then pins that memory so it can be programmed
> into a HW IOMMU.  We could avoid pinning guest memory by adopting
> the vhost-user idea of sending the file descriptors QEMU used to
> create guest memory to the emulation process, and having it mmap()
> guest memory itself.  IOMMUs are handled by having the emulation
> process request device-DMA-to-guest-PA translations from QEMU.

The interface in vhost-user to pass these memory fds is a bit hairy,
so it would be great if there were something better for multi-process.

Some things to think about:
  a) vhost-user filters it so that areas of memory not backed by an fd
aren't passed to the client; this filters out some of the device-specific
RAM blocks that aren't really normal RAM (a sketch of this filtering
follows the list).

  b) Hugepages are tricky, especially on a PC where the 0-1MB area is
broken up into chunks and you're trying to mmap 2MB chunks into the
client.

  c) Postcopy with vhost-user was pretty tricky as well; there needs
to be some coordination with QEMU to handle pages that are missing.

  d) Some RAM mappings can change (mostly not the ones sent to the
client), but just watch out: these changes can happen at unexpected
times.
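
For (a), the sender-side filtering could look like this sketch (the
types and names are invented for illustration; in QEMU the fd would
come from the RAM block backing each region):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t gpa, size, fd_offset;
    int fd;    /* -1 if the region has no backing fd */
} RAMEntry;

/* Keep only fd-backed regions, as vhost-user does in point (a). */
static size_t filter_fd_backed(const RAMEntry *in, size_t n, RAMEntry *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i].fd >= 0) {
            out[kept++] = in[i];
        }
    }
    return kept;
}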

Dave


> 
> 
> > If implementations can use the kernel uapi vfio header files then we're
> > on track for compatibility with VFIO.
> > 
> >>> This is just a more elaborate explanation for the "the cat is out of the
> >>> bag" comments that have already been made on licensing.  Does anyone
> >>> still disagree or want to discuss further?
> >>> 
> >>> If there is agreement that a stable API is okay then I think the
> >>> practical way to do this is to first merge a cleaned-up version of
> >>> multi-process QEMU as an unstable experimental API.  Once it's being
> >>> tested and used we can write a protocol specification and publish it as
> >>> a stable interface when the spec has addressed most use cases.
> >>> 
> >>> Does this sound good?
> >> 
> >> In that case, wouldn't it be preferable to revive our proposal from
> >> Edinburgh (KVM Forum 2018)? Our prototypes moved more of the Qemu VFIO
> >> code to "common" and added a "user" backend underneath it, similar to
> >> how vhost-user-scsi moved some of vhost-scsi to vhost-scsi-common and
> >> added vhost-user-scsi. It was PCI-centric, but it doesn't have to be.
> >> The other side can be implemented in libmuser to facilitate things.
> > 
> > That sounds good.
> > 
> 
>        The emulation program API could be based on the current
> libmuser API or the libvfio-user API.  The protocol itself wouldn’t
> care which is chosen.  Our multi-process QEMU project would have to
> move device specification from the QEMU command line to the
> emulation process command line.
> 
> 							JJ
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Thanos Makatos Feb. 25, 2020, 9:16 a.m. UTC | #37
> > 3) Muser.ko pins the pages (in get_dma_map(), called from below)
> > (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)
> 
> Yikes, it pins every page??  vfio_pin_pages() intends for the vendor
> driver to be much smarter than this :-\  Thanks,

We no longer have to pin pages at all. Instead we grab the fd backing the
VMA, inject it into libmuser, and then request it to mmap that file. This
also solves a few other problems and is far simpler to implement.
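
Roughly like the following hypothetical sketch (not the actual muser.ko
code; note it only works if called from the context of the process that
will use the fd, since fd_install() targets current):

#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Hand the caller an fd for the file backing a VMA, so userspace can
 * mmap() the same backing file instead of the kernel pinning pages. */
static int vma_to_fd(struct vm_area_struct *vma)
{
    struct file *file = vma->vm_file;
    int fd;

    if (!file)    /* anonymous memory: no backing file to share */
        return -EINVAL;

    fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0)
        return fd;

    get_file(file);         /* take a reference before installing */
    fd_install(fd, file);   /* fd becomes valid in current's table */
    return fd;
}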