new file mode 100644
@@ -0,0 +1,627 @@
+Disaggregating QEMU
+
+This document describes implementation details of multi-process
+qemu.
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel’s VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A disaggregated QEMU
+
+A disaggregated QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only the the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in disaggregated mode, it would
+be a control point that executes the processes needed to support the VM
+being created, but have no direct interfaces to the VM. During VM
+execution, it would still provide the user interface to hot-plug devices
+or live migrate the VM.
+
+A first step in creating a disaggregated QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation. i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+
+Disaggregating IO services
+
+Disaggregating IO services is a good place to begin QEMU disaggregating
+for a couple of reasons. One is the sheer number of IO devices QEMU can
+emulate provides a large surface of interfaces which could potentially
+be exploited, and, indeed, have been a source of exploits in the past.
+Another is the modular nature of QEMU device emulation code provides
+interface points where the QEMU functions that perform device emulation
+can be separated from the QEMU functions that manage the emulation of
+guest CPU instructions.
+
+QEMU device emulation
+
+QEMU uses a object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects
+are instantiated by name when used by the guest VM. For example, the
+code to emulate a device named “foo” is always present in QEMU, but its
+instantiation code is only run when the device is included in the target
+VM. (e.g., via the QEMU command line as _-device foo_)
+
+The object model is hierarchical, so device emulation code names its
+parent object (such as “pci-device” for a PCI device) and QEMU will
+instantiate a parent object before calling the device’s instantiation
+code.
+
+Current separation models
+
+In order to separate the device emulation code from the CPU emulation
+code, the device object code must run in a different process. There are
+a couple of existing QEMU features that can run emulation code
+separately from the main QEMU process. These are examined below.
+
+vhost user model
+
+Virtio guest device drivers can be connected to vhost user applications
+in order to perform their IO operations. This model uses special virtio
+device drivers in the guest and vhost user device objects in QEMU, but
+once the QEMU vhost user code has configured the vhost user application,
+mission-mode IO is performed by the application. The vhost user
+application is a daemon process that can be contacted via a known UNIX
+domain socket.
+
+vhost socket
+
+As mentioned above, one of the tasks of the vhost device object within
+QEMU is to contact the vhost application and send it configuration
+information about this device instance. As part of the configuration
+process, the application can also be sent other file descriptors over
+the socket, which then can be used by the vhost user application in
+various ways, some of which are described below.
+
+vhost MMIO store acceleration
+
+VMs are often run using HW virtualization features via the KVM kernel
+driver. This driver allows QEMU to accelerate the emulation of guest CPU
+instructions by running the guest in a virtual HW mode. When the guest
+executes instructions that cannot be executed by virtual HW mode,
+execution returns to the KVM driver so it can inform QEMU to emulate the
+instructions in SW.
+
+One of the events that can cause a return to QEMU is when a guest device
+driver accesses an IO location. QEMU then dispatches the memory
+operation to the corresponding QEMU device object. In the case of a
+vhost user device, the memory operation would need to be sent over a
+socket to the vhost application. This path is accelerated by the QEMU
+virtio code by setting up an eventfd file descriptor that the vhost
+application can directly receive MMIO store notifications from the KVM
+driver, instead of needing them to be sent to the QEMU process first.
+
+vhost interrupt acceleration
+
+Another optimization used by the vhost application is the ability to
+directly inject interrupts into the VM via the KVM driver, again,
+bypassing the need to send the interrupt back to the QEMU process first.
+The QEMU virtio setup code configures the KVM driver with an eventfd
+that triggers the device interrupt in the guest when the eventfd is
+written. This irqfd file descriptor is then passed to the vhost user
+application program.
+
+vhost access to guest memory
+
+The vhost application is also allowed to directly access guest memory,
+instead of needing to send the data as messages to QEMU. This is also
+done with file descriptors sent to the vhost user application by QEMU.
+These descriptors can be passed to mmap() by the vhost application to
+map the guest address space into the vhost application.
+
+IOMMUs introduce another level of complexity, since the address given to
+the guest virtio device to DMA to or from is not a guest physical
+address. This case is handled by having vhost code within QEMU register
+as a listener for IOMMU mapping changes. The vhost application maintains
+a cache of IOMMMU translations: sending translation requests back to
+QEMU on cache misses, and in turn receiving flush requests from QEMU
+when mappings are purged.
+
+applicability to device separation
+
+Much of the vhost model can be re-used by separated device emulation. In
+particular, the ideas of using a socket between QEMU and the device
+emulation application, using a file descriptor to inject interrupts into
+the VM via KVM, and allowing the application to mmap() the guest should
+be re used.
+
+There are, however, some notable differences between how a vhost
+application works and the needs of separated device emulation. The most
+basic is that vhost uses custom virtio device drivers which always
+trigger IO with MMIO stores. A separated device emulation model must
+work with existing IO device models and guest device drivers. MMIO loads
+break vhost store acceleration since they are synchronous - guest
+progress cannot continue until the load has been emulated. By contrast,
+stores are asynchronous, the guest can continue after the store event
+has been sent to the vhost application.
+
+Another difference is that in the vhost user model, a single daemon can
+support multiple QEMU instances. This is contrary to the security regime
+desired, in which the emulation application should only be allowed to
+access the files or devices the VM it’s running on behalf of can access.
+
+
+New separation model based on proxy objects
+
+A different model based on proxy objects in the QEMU program
+communicating with remote emulation programs could provide separation
+while minimizing the changes needed to the device emulation code. The
+rest of this section is a discussion of how a proxy object model would
+work.
+
+
+Remote emulation processes
+
+The remote emulation process will run the QEMU object hierarchy without
+modification. The device emulation objects will be also be based on the
+QEMU code, because for anything but the simplest device, it would not be
+a tractable to re-implement both the object model and the many device
+backends that QEMU has.
+
+The processes will communicate with the QEMU process over UNIX domain
+sockets. The processes can be executed either as standalone processes,
+or be executed by QEMU. In both cases, the host backends the emulation
+processes will provide are specified on its command line, as they would
+be for QEMU. For example:
+
+ disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
+ -blockdev driver=qcow2,node-name=drive0,file=file0
+
+would indicate process _disk-proc_ uses a qcow2 emulated disk named
+_file0_ as its backend.
+
+Emulation processes may emulate more than one guest controller. A common
+configuration might be to put all controllers of the same device class
+(e.g., disk, network, etc.) in a single process, so that all backends of
+the same type can be managed by a single QMP monitor.
+
+communication with QEMU
+
+Remote emulation processes will recognize a _-socket_ argument that
+specifies the path of a UNIX domain socket used to communicate with the
+QEMU process. If no _-socket_ argument is present, the process will use
+file descriptor 0 to communicate with QEMU. For example,
+
+ disk-proc -socket /tmp/disk0-sock <backend list>
+
+will communicate with QEMU using the socket path _/tmp/disk0-sock_.
+
+remote process QMP monitor
+
+Remote emulation processes can be monitored via QMP, similar to QEMU
+itself. The QMP monitor socket is specified the same as for a QEMU
+process:
+
+ disk-proc -qmp unix:/tmp/disk-mon,server
+
+can be monitored over the UNIX socket path _/tmp/disk-mon_.
+
+QEMU command line
+
+The QEMU command line options will need to be modified to indicate which
+items are emulated by a separate program, and which remain emulated by
+QEMU itself.
+
+identifying remote emulation processes
+
+Remote emulation processes will be identified to QEMU using a _-remote_
+command line option. This option can either specify a command that QEMU
+will execute, or can specify a UNIX domain socket that QEMU can use to
+connect to an existing process. Both forms require a “id” option that
+identifies the process to later _-device_ options. The process version
+is:
+
+ -remote id=disk-proc,command="disk-proc <backend list>"
+
+And the socket version is:
+
+ -remote id=disk-proc,socket="/tmp/disk0-sock"
+
+In the latter case, the remote process must be given the same socket on
+its command line when it is executed:
+
+ disk-proc -socket /tmp/disk0-sock <backend list>
+
+identifying devices emulated remotely
+
+Devices that are to be emulated in a separate process will be identify
+the remote process with a “remote” option on their _-device_ command
+line specification. e.g., an LSI SCSI controller and disk can be
+specified as:
+
+ -device lsi53c895a,id=scsi0
+ -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
+
+If these devices are emulated by remote process “disk-proc,” as
+described in the previous section, the QEMU command line would be:
+
+ -device lsi53c895a,id=scsi0,remote=disk-proc
+ -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0,remote=disk-proc
+
+Some devices are implicitly created by the machine object. e.g., the q35
+machine object will create its PCI bus, and attach an ich9-ahci IDE
+controller to it. In this case, options will need to be added to the
+_-machine_ command line. e.g.,
+
+ -machine pc-q35,ide-remote=disk-proc
+
+will use the remote process with an “id” of “disk-proc” to emulate the
+IDE controller and its disks.
+
+The disks themselves still need to be specified with _-remote_ option,
+as in the example above. e.g.,
+
+ -device ide-hd,drive=drive0,bus=ide.0,unit=0,remote=disk-proc
+
+
+QEMU management of remote processes
+
+Each _-remote_ instance on the QEMU command line will create a remote
+process proxy instance in QEMU. They will be held on a _QList_ that can
+be searched for by its “id” property. The remote process proxy will also
+establish a communication channel between QEMU and the remote process.
+This can be done in one of two methods: direction execution of the
+process by QEMU with fork() and exec() system calls, or by connecting to
+an existing process.
+
+direct execution
+
+When the remote process is directly executed, the remote process proxy
+will setup a communication channel between itself and the emulation
+process. This channel will be created using socketpair() and the remote
+process side of the pair will be given to the process as file descriptor
+0.
+
+connecting to an existing process
+
+Some environments wish to deny QEMU the ability to execute fork() and
+exec() In these case, emulation processes will be started before QEMU,
+and a UNIX domain socket will be given to each emulation process to
+communicate with QEMU over. After communication is established, the
+socket will be unlinked from the file system space by the QEMU process.
+
+communication with emulation process
+
+primary socket
+
+Whether the process was executed by QEMU or externally, there will be a
+primary socket for communication between QEMU and the remote process.
+This channel will handle configuration commands from QEMU to the
+process, either from the QEMU command line, or from QMP commands that
+affect the devices being emulated by the process. This channel will only
+allow one message to be pending at a time; if additional messages
+arrive, they must wait for previous ones to be acknowledged from the
+remote side.
+
+secondary sockets
+
+The primary socket can pass the file descriptors of secondary sockets
+for operations that occur in parallel with commands on the primary
+channel. These include MMIO operations generated by the guest, interrupt
+notifications generated by the devices being emulated, or _vmstate_ for
+live migration. These secondary sockets will be created at the behest of
+the device proxies that require them. A disk device proxy wouldn’t need
+any secondary sockets, but a disk controller device proxy may need both
+an MMIO socket and an interrupt socket.
+
+emulation process attached via QMP command
+
+There will be a new “attach-process” QMP command to facilitate device
+hot-plug. This command’s arguments will be the same as the _-remote_
+command line when it’s used to attach to a remote process. i.e., it will
+need an “id” argument so that hot-plugged devices can later find it, and
+a “socket” argument to identify the UNIX domain socket that will be used
+to communicate with QEMU.
+
+
+QEMU device proxy objects
+
+QEMU has an object model based on sub-classes inherited from the
+“object” super-class. The sub-classes that are of interest here are the
+“device” and “bus” sub-classes whose child sub-classes make up the
+device tree of a QEMU emulated system.
+
+The proxy object model will use device proxy objects to replace the
+device emulation code within the QEMU process. These objects will live
+in the same place in the object and bus hierarchies as the objects they
+replace. i.e., the proxy object for an LSI SCSI controller will be a
+sub-class of the “pci-device” class, and will have the same PCI bus
+parent and the same SCSI bus child objects as the LSI controller object
+it replaces.
+
+After the QEMU command line has been parsed, the remote devices will be
+instantiated in the same manner as local devices are. (i.e.,
+qdev_device_add()). In order to distinguish them from regular _-device_
+device objects, their class name will be the name of the class it
+replaces, with “-proxy” appended. e.g., the “lsi53c895a” proxy class
+will be “lsi53c895a-proxy.”
+
+device JSON description
+
+The remote process needs a JSON representation of the command line
+options used to create the object. This JSON representation is used to
+create the corresponding object in the emulation process. e.g., for an
+LSI SCSI controller invoked as:
+
+ -device lsi53c895a,id=scsi0,remote=lsi-scsi
+
+the proxy object would create a
+
+ { "driver" : "lsi53c895a", "id" : "scsi0" }
+
+JSON description. The “driver” option is assigned to the device name
+when the command line is parsed, so the “-proxy” appended by the command
+line parsing code is removed. The “remote” option isn’t needed in the
+JSON description since it only applies to the proxy object in the QEMU
+process.
+
+device object whitelist
+
+Some device objects may not need a proxy. These are devices with no
+direct guest interfaces. (e.g., no MMIO, PIO, or interrupts). There will
+be a whitelist of such devices, and any devices on this list will not be
+instantiated in QEMU. Their JSON representation will still be sent to
+the remote process, so the object can be created there.
+
+object initialization
+
+QEMU object initialization occurs in two phases. The first
+initialization happens once per object class. (i.e., there can be many
+SCSI disks in an emulated system, but the “scsi-hd” class has its
+class_init() function called only once) The second phase happens when
+each object’s instance_init() function is called to initialize each
+instance of the object.
+
+All device objects are sub-classes of the “device” class, so they also
+have a realize() function that is called after instance_init() is called
+and after the object’s static properties have been initialized. Many
+device objects don’t even provide an instance_init() function, and do
+all their per-instance work in realize().
+
+class_init
+
+The class_init() method of a proxy object will, in general behave
+similarly to the object it replaces, including setting any static
+properties and methods needed by the proxy.
+
+instance_init / realize
+
+The instance_init() and realize() functions would only need to perform
+tasks related to being a proxy, such are registering its own MMIO
+handlers, or creating a child bus that other proxy devices can be
+attached to later.
+
+Other tasks will are device-specific. For example, PCI device objects
+will initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.
+
+address space registration
+
+Most devices are driven by guest device driver accesses to IO addresses
+or ports. The QEMU device emulation code uses QEMU’s memory region
+function calls (such as memory_region_init_io()) to add callback
+functions that QEMU will invoke when the guest accesses the device’s
+areas of the IO address space. When a guest driver does access the
+device, the VM will exit HW virtualization mode and return to QEMU,
+which will then lookup and execute the corresponding callback function.
+
+A proxy object would need to mirror the memory region calls the actual
+device emulator would perform in its initialization code, but with its
+own callbacks. When invoked by QEMU as a result of a guest IO operation,
+they will forward the operation to the device emulation process.
+
+PCI config space
+
+PCI devices also have a configuration space that can be accessed by the
+guest driver. Guest accesses to this space is not handled by the device
+emulation object, but by its PCI parent object. Much of this space is
+read-only, but certain registers (especially BAR and MSI-related ones)
+need to be propagated to the emulation process.
+
+PCI parent proxy
+
+One way to propagate guest PCI config accesses is to create a
+“pci-device-proxy” class that can serve as the parent of a PCI device
+proxy object. This class’s parent would be “pci-device” and it would
+override the PCI parent’s config_read() and config_write() methods with
+ones that forward these operations to the emulation program.
+
+interrupt receipt
+
+A proxy for a device that generates interrupts will need to create a
+socket to receive interrupt indications from the emulation process. An
+incoming interrupt indication would then be sent up to its bus parent to
+be injected into the guest. For example, a PCI device object may use
+pci_set_irq().
+
+live migration
+
+The proxy will register to save and restore any _vmstate_ it needs over
+a live migration event. The device proxy does not need to manage the
+remote device’s _vmstate_; that will be handled by the remote process
+proxy (see below).
+
+
+QEMU remote device operation
+
+Generic device operations, such as DMA, will be performs by the remote
+process proxy by sending messages to the remote process.
+
+DMA operations
+
+DMA operations would be handled much like vhost applications do. One of
+the initial messages sent to the emulation process is a guest memory
+table. Each entry in this table consists of a file descriptor and size
+that the emulation process can mmap() to directly access guest memory,
+similar to vhost_user_set_mem_table(). Note guest memory must be backed
+by file descriptors, such as when QEMU is given the _-mem-path_ command
+line option.
+
+IOMMU operations
+
+When the emulated system includes an IOMMU, the remote process proxy in
+QEMU will need to create a socket for IOMMU requests from the emulation
+process. It will handle those requests with an
+address_space_get_iotlb_entry() call. In order to handle IOMMU unmaps,
+the remote process proxy will also register as a listener on the
+device’s DMA address space. When an IOMMU memory region is created
+within the DMA address space, an IOMMU notifier for unmaps will be added
+to the memory region that will forward unmaps to the emulation process
+over the IOMMU socket.
+
+device hot-plug via QMP
+
+An QMP “device_add” command can add a device emulated by a remote
+process. It needs to add a “remote” option to the command, just as the
+_-device_ command line option does. The remote process may either be one
+started at QEMU startup, or be one added by the “add-process” QMP
+command described above. In either case, the remote process proxy will
+forward the new device’s JSON description to the corresponding emulation
+process.
+
+live migration
+
+The remote process proxy will also register for live migration
+notifications with vmstate_register(). When called to save state, the
+proxy will send the remote process a secondary socket file descriptor to
+save the remote process’s device _vmstate_ over. The incoming byte
+stream length and data will be saved as the proxy’s _vmstate_. When the
+proxy is resumed on its new host, this _vmstate_ will be extracted, and
+a secondary socket file descriptor will be sent to the new remote
+process through which it receives the _vmstate_ in order to restore the
+devices there.
+
+device emulation in remote process
+
+The parts of QEMU that the emulation program will need include the
+object model; the memory emulation objects; the device emulation objects
+of the targeted device, and any dependent devices; and, the device’s
+backends. It will also need code to setup the machine environment,
+handle requests from the QEMU process, and route machine-level requests
+(such as interrupts or IOMMU mappings) back to the QEMU process.
+
+initialization
+
+The process initialization sequence will follow the same sequence
+followed by QEMU. It will first initialize the backend objects, then
+device emulation objects. The JSON descriptions sent by the QEMU process
+will drive which objects need to be created.
+
+- address spaces
+
+ Before the device objects are created, the initial address spaces
+ and memory regions must be configured with memory_map_init(). This
+ creates a RAM memory region object (_system_memory_) and an IO
+ memory region object (_system_io_).
+
+- RAM
+
+ RAM memory region creation will follow how pc_memory_init() creates
+ them, but must use memory_region_init_ram_from_fd() instead of
+ memory_region_allocate_system_memory(). The file descriptors needed
+ will be supplied by the guest memory table from above. Those RAM
+ regions would then be added to the _system_memory_ memory region
+ with memory_region_add_subregion().
+
+- PCI
+
+ IO initialization will be driven by the JSON descriptions sent from
+ the QEMU process. For a PCI device, a PCI bus will need to be
+ created with pci_root_bus_new(), and a PCI memory region will need
+ to be created and added to the _system_memory_ memory region with
+ memory_region_add_subregion_overlap(). The overlap version is
+ required for architectures where PCI memory overlaps with RAM
+ memory.
+
+MMIO handling
+
+The device emulation objects will use memory_region_init_io() to install
+their MMIO handlers, and pci_register_bar() to associate those handlers
+with a PCI BAR, as they do within QEMU currently.
+
+In order to use address_space_rw() in the emulation process to handle
+MMIO requests from QEMU, the PCI physical addresses must be the same in
+the QEMU process and the device emulation process. In order to
+accomplish that, guest BAR programming must also be forwarded from QEMU
+to the emulation process.
+
+interrupt injection
+
+When device emulation wants to inject an interrupt into the VM, the
+request climbs the device’s bus object hierarchy until the point where a
+bus object knows how to signal the interrupt to the guest. The details
+depend on the type of interrupt being raised.
+
+- PCI pin interrupts
+
+ On x86 systems, there is an emulated IOAPIC object attached to the
+ root PCI bus object, and the root PCI object forwards interrupt
+ requests to it. The IOAPIC object, in turn, calls the KVM driver to
+ inject the corresponding interrupt into the VM. The simplest way to
+ handle this in an emulation process would be to setup the root PCI
+ bus driver (via pci_bus_irqs()) to send a interrupt request back to
+ the QEMU process, and have the device proxy object reflect it up the
+ PCI tree there.
+
+- PCI MSI/X interrupts
+
+ PCI MSI/X interrupts are implemented in HW as DMA writes to a
+ CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
+ these DMA writes, then calls into the KVM driver to inject the
+ interrupt into the VM. A simple emulation process implementation
+ would be to send the MSI DMA address from QEMU as a message at
+ initialization, then install an address space handler at that
+ address which forwards the MSI message back to QEMU.
+
+DMA operations
+
+When a emulation object wants to DMA into or out of guest memory, it
+first must use dma_memory_map() to convert the DMA address to a local
+virtual address. The emulation process memory region objects setup above
+will be used to translate the DMA address to a local virtual address the
+device emulation code can access.
+
+IOMMU
+
+When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
+regions to translate the DMA address to a guest physical address before
+that physical address can be translated to a local virtual address. The
+emulation process will need similar functionality.
+
+- IOTLB cache
+
+ The emulation process will maintain a cache of recent IOMMU
+ translations (the IOTLB). When the translate() callback of an IOMMU
+ memory region is invoked, the IOTLB cache will be searched for an
+ entry that will map the DMA address to a guest PA. On a cache miss,
+ a message will be sent back to QEMU requesting the corresponding
+ translation entry, which be both be used to return a guest address
+ and be added to the cache.
+
+- IOTLB purge
+
+ The IOMMU emulation will also need to act on unmap requests from
+ QEMU. These happen when the guest IOMMU driver purges an entry from
+ the guest’s translation table.
+
+live migration
+
+When a remote process receives a live migration indication from QEMU, it
+will set up a channel using the received file descriptor with
+qio_channel_socket_new_fd(). This channel will be used to create a
+_QEMUfile_ that can be passed to qemu_save_device_state() to send the
+process’s device state back to QEMU. This method will be reversed on
+restore - the channel will be passed to qemu_loadvm_state() to restore
+the device state.
+