Message ID | 1456771254-17511-9-git-send-email-armbru@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Feb 29, 2016 at 7:40 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This started as an attempt to update ivshmem_device_spec.txt for
> clarity, accuracy and completeness while working on its code, and
> quickly became a full rewrite. Since the diff would be useless
> anyway, I'm using the opportunity to rename the file to
> ivshmem-spec.txt.
>
> I tried hard to ensure the new text contradicts neither the old text
> nor the code. If the new text contradicts the old text but not the
> code, it's probably a bug in the old text. If the new text
> contradicts both, its probably a bug in the new text.
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

> ---
>  docs/specs/ivshmem-spec.txt        | 244 +++++++++++++++++++++++++++++++++++++
>  docs/specs/ivshmem_device_spec.txt | 161 ------------------------
>  2 files changed, 244 insertions(+), 161 deletions(-)
>  create mode 100644 docs/specs/ivshmem-spec.txt
>  delete mode 100644 docs/specs/ivshmem_device_spec.txt
>
> diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
> new file mode 100644
> index 0000000..0835ba1
> --- /dev/null
> +++ b/docs/specs/ivshmem-spec.txt
> @@ -0,0 +1,244 @@
> += Device Specification for Inter-VM shared memory device =
> +
> +The Inter-VM shared memory device (ivshmem) is designed to share a
> +memory region between multiple QEMU processes running different guests
> +and the host. In order for all guests to be able to pick up the
> +shared memory area, it is modeled by QEMU as a PCI device exposing
> +said memory to the guest as a PCI BAR.
> +
> +The device can use a shared memory object on the host directly, or it
> +can obtain one from an ivshmem server.
> +
> +In the latter case, the device can additionally interrupt its peers, and
> +get interrupted by its peers.
> +
> +
> +== Configuring the ivshmem PCI device ==
> +
> +There are two basic configurations:
> +
> +- Just shared memory: -device ivshmem,shm=NAME,...
> +
> +  This uses shared memory object NAME.
> +
> +- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
> +
> +  An ivshmem server must already be running on the host. The device
> +  connects to the server's UNIX domain socket via character device
> +  CHR.
> +
> +  Each peer gets assigned a unique ID by the server. IDs must be
> +  between 0 and 65535.
> +
> +  Interrupts are message-signaled by default (MSI-X). With msi=off
> +  the device has no MSI-X capability, and uses legacy INTx instead.
> +  vectors=N configures the number of vectors to use.
> +
> +For more details on ivshmem device properties, see The QEMU Emulator
> +User Documentation (qemu-doc.*).
> +
> +
> +== The ivshmem PCI device's guest interface ==
> +
> +The device has vendor ID 1af4, device ID 1110, revision 0.
> +
> +=== PCI BARs ===
> +
> +The ivshmem PCI device has two or three BARs:
> +
> +- BAR0 holds device registers (256 Byte MMIO)
> +- BAR1 holds MSI-X table and PBA (only when using MSI-X)
> +- BAR2 maps the shared memory object
> +
> +There are two ways to use this device:
> +
> +- If you only need the shared memory part, BAR2 suffices. This way,
> +  you have access to the shared memory in the guest and can use it as
> +  you see fit. Memnic, for example, uses ivshmem this way from guest
> +  user space (see http://dpdk.org/browse/memnic).
> +
> +- If you additionally need the capability for peers to interrupt each
> +  other, you need BAR0 and, if using MSI-X, BAR1. You will most
> +  likely want to write a kernel driver to handle interrupts. Requires
> +  the device to be configured for interrupts, obviously.
> +
> +If the device is configured for interrupts, BAR2 is initially invalid.
> +It becomes safely accessible only after the ivshmem server provided
> +the shared memory. Guest software should wait for the IVPosition
> +register (described below) to become non-negative before accessing
> +BAR2.
> +
> +The device is not capable to tell guest software whether it is
> +configured for interrupts.
> +
> +=== PCI device registers ===
> +
> +BAR 0 contains the following registers:
> +
> +  Offset  Size  Access      On reset  Function
> +       0     4  read/write         0  Interrupt Mask
> +                                      bit 0: peer interrupt
> +                                      bit 1..31: reserved
> +       4     4  read/write         0  Interrupt Status
> +                                      bit 0: peer interrupt
> +                                      bit 1..31: reserved
> +       8     4  read-only    0 or -1  IVPosition
> +      12     4  write-only       N/A  Doorbell
> +                                      bit 0..15: vector
> +                                      bit 16..31: peer ID
> +      16   240  none             N/A  reserved
> +
> +Software should only access the registers as specified in column
> +"Access". Reserved bits should be ignored on read, and preserved on
> +write.
> +
> +Interrupt Status and Mask Register together control the legacy INTx
> +interrupt when the device has no MSI-X capability: INTx is asserted
> +when the bit-wise AND of Status and Mask is non-zero and the device
> +has no MSI-X capability. Interrupt Status Register bit 0 becomes 1
> +when an interrupt request from a peer is received. Reading the
> +register clears it.
> +
> +IVPosition Register: if the device is not configured for interrupts,
> +this is zero. Else, it's -1 for a short while after reset, then
> +changes to the device's ID (between 0 and 65535).
> +
> +There is no good way for software to find out whether the device is
> +configured for interrupts. A positive IVPosition means interrupts,
> +but zero could be either. The initial -1 cannot be reliably observed.
> +
> +Doorbell Register: writing this register requests to interrupt a peer.
> +The written value's high 16 bits are the ID of the peer to interrupt,
> +and its low 16 bits select an interrupt vector.
> +
> +If the device is not configured for interrupts, the write is ignored.
> +
> +If the interrupt hasn't completed setup, the write is ignored. The
> +device is not capable to tell guest software whether setup is
> +complete. Interrupts can regress to this state on migration.
> +
> +If the peer with the requested ID isn't connected, or it has fewer
> +interrupt vectors connected, the write is ignored. The device is not
> +capable to tell guest software what peers are connected, or how many
> +interrupt vectors are connected.
> +
> +If the peer doesn't use MSI-X, its Interrupt Status register is set to
> +1. This asserts INTx unless masked by the Interrupt Mask register.
> +The device is not capable to communicate the interrupt vector to guest
> +software then.
> +
> +If the peer uses MSI-X, the interrupt for this vector becomes pending.
> +There is no way for software to clear the pending bit, and a polling
> +mode of operation is therefore impossible with MSI-X.
> +
> +With multiple MSI-X vectors, different vectors can be used to indicate
> +different events have occurred. The semantics of interrupt vectors
> +are left to the application.
> +
> +
> +== Interrupt infrastructure ==
> +
> +When configured for interrupts, the peers share eventfd objects in
> +addition to shared memory. The shared resources are managed by an
> +ivshmem server.
> +
> +=== The ivshmem server ===
> +
> +The server listens on a UNIX domain socket.
> +
> +For each new client that connects to the server, the server
> +- picks an ID,
> +- creates eventfd file descriptors for the interrupt vectors,
> +- sends the ID and the file descriptor for the shared memory to the
> +  new client,
> +- sends connect notifications for the new client to the other clients
> +  (these contain file descriptors for sending interrupts),
> +- sends connect notifications for the other clients to the new client,
> +  and
> +- sends interrupt setup messages to the new client (these contain file
> +  descriptors for receiving interrupts).
> +
> +When a client disconnects from the server, the server sends disconnect
> +notifications to the other clients.
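
For readers following the register description quoted above, the BAR0 offsets and the Doorbell encoding (peer ID in bits 16..31, vector in bits 0..15) could be sketched in C roughly as follows. This is an illustration only, not code from the patch: the names ivshmem_reg, ivshmem_doorbell_value(), ivshmem_bar2_ready() and ivshmem_ring() are hypothetical, and bar0 is assumed to point at the already mapped 256-byte register BAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets within BAR0, taken from the register table. */
enum ivshmem_reg {
    IVSHMEM_INTR_MASK   = 0,
    IVSHMEM_INTR_STATUS = 4,
    IVSHMEM_IVPOSITION  = 8,
    IVSHMEM_DOORBELL    = 12,
};

/* Build a Doorbell value: peer ID in bits 16..31, vector in bits 0..15. */
static inline uint32_t ivshmem_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/*
 * The spec asks guests to wait for IVPosition to become non-negative
 * before touching BAR2.  Viewed as a signed 32-bit value, that means
 * the top bit of the raw register value is clear.
 */
static inline int ivshmem_bar2_ready(uint32_t ivposition)
{
    return (ivposition & 0x80000000u) == 0;
}

/* Hypothetical helper: write the Doorbell register through mapped BAR0. */
static inline void ivshmem_ring(volatile uint32_t *bar0,
                                uint16_t peer_id, uint16_t vector)
{
    assert(bar0 != NULL);
    bar0[IVSHMEM_DOORBELL / 4] = ivshmem_doorbell_value(peer_id, vector);
}
```

Note that, per the quoted text, a write that names an unconnected peer or vector is silently ignored, so a helper like this cannot report whether the interrupt was delivered.
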
> +
> +The next section describes the protocol in detail.
> +
> +If the server terminates without sending disconnect notifications for
> +its connected clients, the clients can elect to continue. They can
> +communicate with each other normally, but won't receive disconnect
> +notification on disconnect, and no new clients can connect. There is
> +no way for the clients to connect to a restarted the server. The
> +device is not capable to tell guest software whether the server is
> +still up.
> +
> +Example server code is in contrib/ivshmem-server/. Not to be used in
> +production. It assumes all clients use the same number of interrupt
> +vectors.
> +
> +A standalone client is in contrib/ivshmem-client/. It can be useful
> +for debugging.
> +
> +=== The ivshmem Client-Server Protocol ===
> +
> +An ivshmem device configured for interrupts connects to an ivshmem
> +server. This section details the protocol between the two.
> +
> +The connection is one-way: the server sends messages to the client.
> +Each message consists of a single 8 byte little-endian signed number,
> +and may be accompanied by a file descriptor via SCM_RIGHTS. Both
> +client and server close the connection on error.
> +
> +On connect, the server sends the following messages in order:
> +
> +1. The protocol version number, currently zero. The client should
> +   close the connection on receipt of versions it can't handle.
> +
> +2. The client's ID. This is unique among all clients of this server.
> +   IDs must be between 0 and 65535, because the Doorbell register
> +   provides only 16 bits for them.
> +
> +3. The number -1, accompanied by the file descriptor for the shared
> +   memory.
> +
> +4. Connect notifications for existing other clients, if any. This is
> +   a peer ID (number between 0 and 65535 other than the client's ID),
> +   repeated N times. Each repetition is accompanied by one file
> +   descriptor. These are for interrupting the peer with that ID using
> +   vector 0,..,N-1, in order. If the client is configured for fewer
> +   vectors, it closes the extra file descriptors. If it is configured
> +   for more, the extra vectors remain unconnected.
> +
> +5. Interrupt setup. This is the client's own ID, repeated N times.
> +   Each repetition is accompanied by one file descriptor. These are
> +   for receiving interrupts from peers using vector 0,..,N-1, in
> +   order. If the client is configured for fewer vectors, it closes
> +   the extra file descriptors. If it is configured for more, the
> +   extra vectors remain unconnected.
> +
> +From then on, the server sends these kinds of messages:
> +
> +6. Connection / disconnection notification. This is a peer ID.
> +
> +   - If the number comes with a file descriptor, it's a connection
> +     notification, exactly like in step 4.
> +
> +   - Else, it's a disconnection notification for the peer with that ID.
> +
> +Known bugs:
> +
> +* The protocol changed incompatibly in QEMU 2.5. Before, messages
> +  were native endian long, and there was no version number.
> +
> +* The protocol is poorly designed.
> +
> +=== The ivshmem Client-Client Protocol ===
> +
> +An ivshmem device configured for interrupts receives eventfd file
> +descriptors for interrupting peers and getting interrupted by peers
> +from the server, as explained in the previous section.
> +
> +To interrupt a peer, the device writes the 8-byte integer 1 in native
> +byte order to the respective file descriptor.
> +
> +To receive an interrupt, the device reads and discards as many 8-byte
> +integers as it can.
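
The two protocols quoted above might be exercised from a standalone client along these lines. This is a hedged sketch under stated assumptions, not the QEMU or contrib/ivshmem-client code: all function names are hypothetical, the socket is assumed to deliver each 8-byte message in one piece (a real client would have to handle short reads), and the eventfd passed to ivshmem_drain() is assumed to be non-blocking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Decode one server message: 8 bytes, little-endian, signed. */
static int64_t ivshmem_decode_msg(const unsigned char buf[8])
{
    uint64_t v = 0;
    int64_t out;

    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | buf[i];
    }
    memcpy(&out, &v, sizeof(out));   /* reinterpret the bits as signed */
    return out;
}

/*
 * Receive one message plus the file descriptor that may accompany it
 * via SCM_RIGHTS.  Returns 0 on success, -1 on error or short read.
 * *fd is -1 when the message came without a descriptor.
 */
static int ivshmem_recv_msg(int sock, int64_t *msg, int *fd)
{
    unsigned char data[8];
    union {
        char space[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;        /* forces suitable alignment */
    } ctl;
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr mh;
    struct cmsghdr *c;

    assert(msg != NULL && fd != NULL);
    memset(&mh, 0, sizeof(mh));
    mh.msg_iov = &iov;
    mh.msg_iovlen = 1;
    mh.msg_control = ctl.space;
    mh.msg_controllen = sizeof(ctl.space);

    if (recvmsg(sock, &mh, 0) != (ssize_t)sizeof(data)) {
        return -1;
    }
    *msg = ivshmem_decode_msg(data);
    *fd = -1;
    for (c = CMSG_FIRSTHDR(&mh); c != NULL; c = CMSG_NXTHDR(&mh, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            memcpy(fd, CMSG_DATA(c), sizeof(int));
        }
    }
    return 0;
}

/* Client-client: interrupt a peer by writing the 8-byte integer 1. */
static int ivshmem_kick(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Receive side: read and discard 8-byte integers until none are left. */
static void ivshmem_drain(int efd)
{
    uint64_t junk;

    while (read(efd, &junk, sizeof(junk)) == (ssize_t)sizeof(junk)) {
        /* value is discarded; only the wakeup matters */
    }
}
```

With such helpers, a client would read the version, then its own ID, then -1 with the shared memory descriptor, and afterwards tell step 4 from step 5 by comparing each received ID against its own, and a connect from a disconnect notification by whether *fd is -1.
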
> diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
> deleted file mode 100644
> index d318d65..0000000
> --- a/docs/specs/ivshmem_device_spec.txt
> +++ /dev/null
> @@ -1,161 +0,0 @@
> -
> -Device Specification for Inter-VM shared memory device
> -------------------------------------------------------
> -
> -The Inter-VM shared memory device is designed to share a memory region (created
> -on the host via the POSIX shared memory API) between multiple QEMU processes
> -running different guests. In order for all guests to be able to pick up the
> -shared memory area, it is modeled by QEMU as a PCI device exposing said memory
> -to the guest as a PCI BAR.
> -The memory region does not belong to any guest, but is a POSIX memory object on
> -the host. The host can access this shared memory if needed.
> -
> -The device also provides an optional communication mechanism between guests
> -sharing the same memory object. More details about that in the section 'Guest to
> -guest communication' section.
> -
> -
> -The Inter-VM PCI device
> ------------------------
> -
> -From the VM point of view, the ivshmem PCI device supports three BARs.
> -
> -- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
> -  not used.
> -- BAR1 is used for MSI-X when it is enabled in the device.
> -- BAR2 is used to access the shared memory object.
> -
> -It is your choice how to use the device but you must choose between two
> -behaviors :
> -
> -- basically, if you only need the shared memory part, you will map BAR2.
> -  This way, you have access to the shared memory in guest and can use it as you
> -  see fit (memnic, for example, uses it in userland
> -  http://dpdk.org/browse/memnic).
> -
> -- BAR0 and BAR1 are used to implement an optional communication mechanism
> -  through interrupts in the guests. If you need an event mechanism between the
> -  guests accessing the shared memory, you will most likely want to write a
> -  kernel driver that will handle interrupts. See details in the section 'Guest
> -  to guest communication' section.
> -
> -The behavior is chosen when starting your QEMU processes:
> -- no communication mechanism needed, the first QEMU to start creates the shared
> -  memory on the host, subsequent QEMU processes will use it.
> -
> -- communication mechanism needed, an ivshmem server must be started before any
> -  QEMU processes, then each QEMU process connects to the server unix socket.
> -
> -For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
> -
> -
> -Guest to guest communication
> ----------------------------
> -
> -This section details the communication mechanism between the guests accessing
> -the ivhsmem shared memory.
> -
> -*ivshmem server*
> -
> -This server code is available in qemu.git/contrib/ivshmem-server.
> -
> -The server must be started on the host before any guest.
> -It creates a shared memory object then waits for clients to connect on a unix
> -socket. All the messages are little-endian int64_t integer.
> -
> -For each client (QEMU process) that connects to the server:
> -- the server sends a protocol version, if client does not support it, the client
> -  closes the communication,
> -- the server assigns an ID for this client and sends this ID to him as the first
> -  message,
> -- the server sends a fd to the shared memory object to this client,
> -- the server creates a new set of host eventfds associated to the new client and
> -  sends this set to all already connected clients,
> -- finally, the server sends all the eventfds sets for all clients to the new
> -  client.
> -
> -The server signals all clients when one of them disconnects.
> -
> -The client IDs are limited to 16 bits because of the current implementation (see
> -Doorbell register in 'PCI device registers' subsection). Hence only 65536
> -clients are supported.
> -
> -All the file descriptors (fd to the shared memory, eventfds for each client)
> -are passed to clients using SCM_RIGHTS over the server unix socket.
> -
> -Apart from the current ivshmem implementation in QEMU, an ivshmem client has
> -been provided in qemu.git/contrib/ivshmem-client for debug.
> -
> -*QEMU as an ivshmem client*
> -
> -At initialisation, when creating the ivshmem device, QEMU first receives a
> -protocol version and closes communication with server if it does not match.
> -Then, QEMU gets its ID from the server then makes it available through BAR0
> -IVPosition register for the VM to use (see 'PCI device registers' subsection).
> -QEMU then uses the fd to the shared memory to map it to BAR2.
> -eventfds for all other clients received from the server are stored to implement
> -BAR0 Doorbell register (see 'PCI device registers' subsection).
> -Finally, eventfds assigned to this QEMU process are used to send interrupts in
> -this VM.
> -
> -*PCI device registers*
> -
> -From the VM point of view, the ivshmem PCI device supports 4 registers of
> -32-bits each.
> -
> -enum ivshmem_registers {
> -    IntrMask = 0,
> -    IntrStatus = 4,
> -    IVPosition = 8,
> -    Doorbell = 12
> -};
> -
> -The first two registers are the interrupt mask and status registers. Mask and
> -status are only used with pin-based interrupts. They are unused with MSI
> -interrupts.
> -
> -Status Register: The status register is set to 1 when an interrupt occurs.
> -
> -Mask Register: The mask register is bitwise ANDed with the interrupt status
> -and the result will raise an interrupt if it is non-zero. However, since 1 is
> -the only value the status will be set to, it is only the first bit of the mask
> -that has any effect. Therefore interrupts can be masked by setting the first
> -bit to 0 and unmasked by setting the first bit to 1.
> -
> -IVPosition Register: The IVPosition register is read-only and reports the
> -guest's ID number. The guest IDs are non-negative integers. When using the
> -server, since the server is a separate process, the VM ID will only be set when
> -the device is ready (shared memory is received from the server and accessible
> -via the device). If the device is not ready, the IVPosition will return -1.
> -Applications should ensure that they have a valid VM ID before accessing the
> -shared memory.
> -
> -Doorbell Register: To interrupt another guest, a guest must write to the
> -Doorbell register. The doorbell register is 32-bits, logically divided into
> -two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
> -16-bits are the interrupt vector to trigger. The semantics of the value
> -written to the doorbell depends on whether the device is using MSI or a regular
> -pin-based interrupt. In short, MSI uses vectors while regular interrupts set
> -the status register.
> -
> -Regular Interrupts
> -
> -If regular interrupts are used (due to either a guest not supporting MSI or the
> -user specifying not to use them on startup) then the value written to the lower
> -16-bits of the Doorbell register results is arbitrary and will trigger an
> -interrupt in the destination guest.
> -
> -Message Signalled Interrupts
> -
> -An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
> -written to the Doorbell register must be between 0 and the maximum number of
> -vectors the guest supports. The lower 16 bits written to the doorbell is the
> -MSI vector that will be raised in the destination guest. The number of MSI
> -vectors is configurable but it is set when the VM is started.
> -
> -The important thing to remember with MSI is that it is only a signal, no status
> -is set (since MSI interrupts are not shared). All information other than the
> -interrupt itself should be communicated via the shared memory region. Devices
> -supporting multiple MSI vectors can use different vectors to indicate different
> -events have occurred. The semantics of interrupt vectors are left to the
> -user's discretion.
> --
> 2.4.3
>
>
On 02/29/2016 11:40 AM, Markus Armbruster wrote:
> This started as an attempt to update ivshmem_device_spec.txt for
> clarity, accuracy and completeness while working on its code, and
> quickly became a full rewrite. Since the diff would be useless
> anyway, I'm using the opportunity to rename the file to
> ivshmem-spec.txt.
>
> I tried hard to ensure the new text contradicts neither the old text
> nor the code. If the new text contradicts the old text but not the
> code, it's probably a bug in the old text. If the new text
> contradicts both, its probably a bug in the new text.
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---

> +If the server terminates without sending disconnect notifications for
> +its connected clients, the clients can elect to continue. They can
> +communicate with each other normally, but won't receive disconnect
> +notification on disconnect, and no new clients can connect. There is
> +no way for the clients to connect to a restarted the server. The

s/the server/server/

> +device is not capable to tell guest software whether the server is
> +still up.

Wow - lots of shortcomings in the server protocol. Food for thought for
future improvements, but I'm happy with your approach of just
documenting pitfalls for now.

> +
> +Known bugs:
> +
> +* The protocol changed incompatibly in QEMU 2.5. Before, messages
> +  were native endian long, and there was no version number.
> +
> +* The protocol is poorly designed.
Eric Blake <eblake@redhat.com> writes:

> On 02/29/2016 11:40 AM, Markus Armbruster wrote:
>> This started as an attempt to update ivshmem_device_spec.txt for
>> clarity, accuracy and completeness while working on its code, and
>> quickly became a full rewrite. Since the diff would be useless
>> anyway, I'm using the opportunity to rename the file to
>> ivshmem-spec.txt.
>>
>> I tried hard to ensure the new text contradicts neither the old text
>> nor the code. If the new text contradicts the old text but not the
>> code, it's probably a bug in the old text. If the new text
>> contradicts both, its probably a bug in the new text.
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>
>> +If the server terminates without sending disconnect notifications for
>> +its connected clients, the clients can elect to continue. They can
>> +communicate with each other normally, but won't receive disconnect
>> +notification on disconnect, and no new clients can connect. There is
>> +no way for the clients to connect to a restarted the server. The
>
> s/the server/server/

Will fix, thanks!

>> +device is not capable to tell guest software whether the server is
>> +still up.
>
> Wow - lots of shortcomings in the server protocol. Food for thought for
> future improvements, but I'm happy with your approach of just
> documenting pitfalls for now.

Best we can do for 2.6 anyway :)

>> +
>> +Known bugs:
>> +
>> +* The protocol changed incompatibly in QEMU 2.5. Before, messages
>> +  were native endian long, and there was no version number.
>> +
>> +* The protocol is poorly designed.
diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt new file mode 100644 index 0000000..0835ba1 --- /dev/null +++ b/docs/specs/ivshmem-spec.txt @@ -0,0 +1,244 @@ += Device Specification for Inter-VM shared memory device = + +The Inter-VM shared memory device (ivshmem) is designed to share a +memory region between multiple QEMU processes running different guests +and the host. In order for all guests to be able to pick up the +shared memory area, it is modeled by QEMU as a PCI device exposing +said memory to the guest as a PCI BAR. + +The device can use a shared memory object on the host directly, or it +can obtain one from an ivshmem server. + +In the latter case, the device can additionally interrupt its peers, and +get interrupted by its peers. + + +== Configuring the ivshmem PCI device == + +There are two basic configurations: + +- Just shared memory: -device ivshmem,shm=NAME,... + + This uses shared memory object NAME. + +- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,... + + An ivshmem server must already be running on the host. The device + connects to the server's UNIX domain socket via character device + CHR. + + Each peer gets assigned a unique ID by the server. IDs must be + between 0 and 65535. + + Interrupts are message-signaled by default (MSI-X). With msi=off + the device has no MSI-X capability, and uses legacy INTx instead. + vectors=N configures the number of vectors to use. + +For more details on ivshmem device properties, see The QEMU Emulator +User Documentation (qemu-doc.*). + + +== The ivshmem PCI device's guest interface == + +The device has vendor ID 1af4, device ID 1110, revision 0. + +=== PCI BARs === + +The ivshmem PCI device has two or three BARs: + +- BAR0 holds device registers (256 Byte MMIO) +- BAR1 holds MSI-X table and PBA (only when using MSI-X) +- BAR2 maps the shared memory object + +There are two ways to use this device: + +- If you only need the shared memory part, BAR2 suffices. 
This way, + you have access to the shared memory in the guest and can use it as + you see fit. Memnic, for example, uses ivshmem this way from guest + user space (see http://dpdk.org/browse/memnic). + +- If you additionally need the capability for peers to interrupt each + other, you need BAR0 and, if using MSI-X, BAR1. You will most + likely want to write a kernel driver to handle interrupts. Requires + the device to be configured for interrupts, obviously. + +If the device is configured for interrupts, BAR2 is initially invalid. +It becomes safely accessible only after the ivshmem server provided +the shared memory. Guest software should wait for the IVPosition +register (described below) to become non-negative before accessing +BAR2. + +The device is not capable to tell guest software whether it is +configured for interrupts. + +=== PCI device registers === + +BAR 0 contains the following registers: + + Offset Size Access On reset Function + 0 4 read/write 0 Interrupt Mask + bit 0: peer interrupt + bit 1..31: reserved + 4 4 read/write 0 Interrupt Status + bit 0: peer interrupt + bit 1..31: reserved + 8 4 read-only 0 or -1 IVPosition + 12 4 write-only N/A Doorbell + bit 0..15: vector + bit 16..31: peer ID + 16 240 none N/A reserved + +Software should only access the registers as specified in column +"Access". Reserved bits should be ignored on read, and preserved on +write. + +Interrupt Status and Mask Register together control the legacy INTx +interrupt when the device has no MSI-X capability: INTx is asserted +when the bit-wise AND of Status and Mask is non-zero and the device +has no MSI-X capability. Interrupt Status Register bit 0 becomes 1 +when an interrupt request from a peer is received. Reading the +register clears it. + +IVPosition Register: if the device is not configured for interrupts, +this is zero. Else, it's -1 for a short while after reset, then +changes to the device's ID (between 0 and 65535). 
+ +There is no good way for software to find out whether the device is +configured for interrupts. A positive IVPosition means interrupts, +but zero could be either. The initial -1 cannot be reliably observed. + +Doorbell Register: writing this register requests to interrupt a peer. +The written value's high 16 bits are the ID of the peer to interrupt, +and its low 16 bits select an interrupt vector. + +If the device is not configured for interrupts, the write is ignored. + +If the interrupt hasn't completed setup, the write is ignored. The +device is not capable to tell guest software whether setup is +complete. Interrupts can regress to this state on migration. + +If the peer with the requested ID isn't connected, or it has fewer +interrupt vectors connected, the write is ignored. The device is not +capable to tell guest software what peers are connected, or how many +interrupt vectors are connected. + +If the peer doesn't use MSI-X, its Interrupt Status register is set to +1. This asserts INTx unless masked by the Interrupt Mask register. +The device is not capable to communicate the interrupt vector to guest +software then. + +If the peer uses MSI-X, the interrupt for this vector becomes pending. +There is no way for software to clear the pending bit, and a polling +mode of operation is therefore impossible with MSI-X. + +With multiple MSI-X vectors, different vectors can be used to indicate +different events have occurred. The semantics of interrupt vectors +are left to the application. + + +== Interrupt infrastructure == + +When configured for interrupts, the peers share eventfd objects in +addition to shared memory. The shared resources are managed by an +ivshmem server. + +=== The ivshmem server === + +The server listens on a UNIX domain socket. 
+ +For each new client that connects to the server, the server +- picks an ID, +- creates eventfd file descriptors for the interrupt vectors, +- sends the ID and the file descriptor for the shared memory to the + new client, +- sends connect notifications for the new client to the other clients + (these contain file descriptors for sending interrupts), +- sends connect notifications for the other clients to the new client, + and +- sends interrupt setup messages to the new client (these contain file + descriptors for receiving interrupts). + +When a client disconnects from the server, the server sends disconnect +notifications to the other clients. + +The next section describes the protocol in detail. + +If the server terminates without sending disconnect notifications for +its connected clients, the clients can elect to continue. They can +communicate with each other normally, but won't receive disconnect +notification on disconnect, and no new clients can connect. There is +no way for the clients to connect to a restarted the server. The +device is not capable to tell guest software whether the server is +still up. + +Example server code is in contrib/ivshmem-server/. Not to be used in +production. It assumes all clients use the same number of interrupt +vectors. + +A standalone client is in contrib/ivshmem-client/. It can be useful +for debugging. + +=== The ivshmem Client-Server Protocol === + +An ivshmem device configured for interrupts connects to an ivshmem +server. This section details the protocol between the two. + +The connection is one-way: the server sends messages to the client. +Each message consists of a single 8 byte little-endian signed number, +and may be accompanied by a file descriptor via SCM_RIGHTS. Both +client and server close the connection on error. + +On connect, the server sends the following messages in order: + +1. The protocol version number, currently zero. 
The client should + close the connection on receipt of versions it can't handle. + +2. The client's ID. This is unique among all clients of this server. + IDs must be between 0 and 65535, because the Doorbell register + provides only 16 bits for them. + +3. The number -1, accompanied by the file descriptor for the shared + memory. + +4. Connect notifications for existing other clients, if any. This is + a peer ID (number between 0 and 65535 other than the client's ID), + repeated N times. Each repetition is accompanied by one file + descriptor. These are for interrupting the peer with that ID using + vector 0,..,N-1, in order. If the client is configured for fewer + vectors, it closes the extra file descriptors. If it is configured + for more, the extra vectors remain unconnected. + +5. Interrupt setup. This is the client's own ID, repeated N times. + Each repetition is accompanied by one file descriptor. These are + for receiving interrupts from peers using vector 0,..,N-1, in + order. If the client is configured for fewer vectors, it closes + the extra file descriptors. If it is configured for more, the + extra vectors remain unconnected. + +From then on, the server sends these kinds of messages: + +6. Connection / disconnection notification. This is a peer ID. + + - If the number comes with a file descriptor, it's a connection + notification, exactly like in step 4. + + - Else, it's a disconnection notification for the peer with that ID. + +Known bugs: + +* The protocol changed incompatibly in QEMU 2.5. Before, messages + were native endian long, and there was no version number. + +* The protocol is poorly designed. + +=== The ivshmem Client-Client Protocol === + +An ivshmem device configured for interrupts receives eventfd file +descriptors for interrupting peers and getting interrupted by peers +from the server, as explained in the previous section. 
+ +To interrupt a peer, the device writes the 8-byte integer 1 in native +byte order to the respective file descriptor. + +To receive an interrupt, the device reads and discards as many 8-byte +integers as it can. diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt deleted file mode 100644 index d318d65..0000000 --- a/docs/specs/ivshmem_device_spec.txt +++ /dev/null @@ -1,161 +0,0 @@ - -Device Specification for Inter-VM shared memory device ------------------------------------------------------- - -The Inter-VM shared memory device is designed to share a memory region (created -on the host via the POSIX shared memory API) between multiple QEMU processes -running different guests. In order for all guests to be able to pick up the -shared memory area, it is modeled by QEMU as a PCI device exposing said memory -to the guest as a PCI BAR. -The memory region does not belong to any guest, but is a POSIX memory object on -the host. The host can access this shared memory if needed. - -The device also provides an optional communication mechanism between guests -sharing the same memory object. More details about that in the section 'Guest to -guest communication' section. - - -The Inter-VM PCI device ------------------------ - -From the VM point of view, the ivshmem PCI device supports three BARs. - -- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is - not used. -- BAR1 is used for MSI-X when it is enabled in the device. -- BAR2 is used to access the shared memory object. - -It is your choice how to use the device but you must choose between two -behaviors : - -- basically, if you only need the shared memory part, you will map BAR2. - This way, you have access to the shared memory in guest and can use it as you - see fit (memnic, for example, uses it in userland - http://dpdk.org/browse/memnic). - -- BAR0 and BAR1 are used to implement an optional communication mechanism - through interrupts in the guests. 
If you need an event mechanism between the
-  guests accessing the shared memory, you will most likely want to write a
-  kernel driver that will handle interrupts. See details in the section 'Guest
-  to guest communication' section.
-
-The behavior is chosen when starting your QEMU processes:
-- no communication mechanism needed, the first QEMU to start creates the shared
-  memory on the host, subsequent QEMU processes will use it.
-
-- communication mechanism needed, an ivshmem server must be started before any
-  QEMU processes, then each QEMU process connects to the server unix socket.
-
-For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
-
-
-Guest to guest communication
-----------------------------
-
-This section details the communication mechanism between the guests accessing
-the ivhsmem shared memory.
-
-*ivshmem server*
-
-This server code is available in qemu.git/contrib/ivshmem-server.
-
-The server must be started on the host before any guest.
-It creates a shared memory object then waits for clients to connect on a unix
-socket. All the messages are little-endian int64_t integer.
-
-For each client (QEMU process) that connects to the server:
-- the server sends a protocol version, if client does not support it, the client
-  closes the communication,
-- the server assigns an ID for this client and sends this ID to him as the first
-  message,
-- the server sends a fd to the shared memory object to this client,
-- the server creates a new set of host eventfds associated to the new client and
-  sends this set to all already connected clients,
-- finally, the server sends all the eventfds sets for all clients to the new
-  client.
-
-The server signals all clients when one of them disconnects.
-
-The client IDs are limited to 16 bits because of the current implementation (see
-Doorbell register in 'PCI device registers' subsection). Hence only 65536
-clients are supported.
-
-All the file descriptors (fd to the shared memory, eventfds for each client)
-are passed to clients using SCM_RIGHTS over the server unix socket.
-
-Apart from the current ivshmem implementation in QEMU, an ivshmem client has
-been provided in qemu.git/contrib/ivshmem-client for debug.
-
-*QEMU as an ivshmem client*
-
-At initialisation, when creating the ivshmem device, QEMU first receives a
-protocol version and closes communication with server if it does not match.
-Then, QEMU gets its ID from the server then makes it available through BAR0
-IVPosition register for the VM to use (see 'PCI device registers' subsection).
-QEMU then uses the fd to the shared memory to map it to BAR2.
-eventfds for all other clients received from the server are stored to implement
-BAR0 Doorbell register (see 'PCI device registers' subsection).
-Finally, eventfds assigned to this QEMU process are used to send interrupts in
-this VM.
-
-*PCI device registers*
-
-From the VM point of view, the ivshmem PCI device supports 4 registers of
-32-bits each.
-
-enum ivshmem_registers {
-    IntrMask = 0,
-    IntrStatus = 4,
-    IVPosition = 8,
-    Doorbell = 12
-};
-
-The first two registers are the interrupt mask and status registers. Mask and
-status are only used with pin-based interrupts. They are unused with MSI
-interrupts.
-
-Status Register: The status register is set to 1 when an interrupt occurs.
-
-Mask Register: The mask register is bitwise ANDed with the interrupt status
-and the result will raise an interrupt if it is non-zero. However, since 1 is
-the only value the status will be set to, it is only the first bit of the mask
-that has any effect. Therefore interrupts can be masked by setting the first
-bit to 0 and unmasked by setting the first bit to 1.
-
-IVPosition Register: The IVPosition register is read-only and reports the
-guest's ID number. The guest IDs are non-negative integers.
When using the
-server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible
-via the device). If the device is not ready, the IVPosition will return -1.
-Applications should ensure that they have a valid VM ID before accessing the
-shared memory.
-
-Doorbell Register: To interrupt another guest, a guest must write to the
-Doorbell register. The doorbell register is 32-bits, logically divided into
-two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
-16-bits are the interrupt vector to trigger. The semantics of the value
-written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt. In short, MSI uses vectors while regular interrupts set
-the status register.
-
-Regular Interrupts
-
-If regular interrupts are used (due to either a guest not supporting MSI or the
-user specifying not to use them on startup) then the value written to the lower
-16-bits of the Doorbell register results is arbitrary and will trigger an
-interrupt in the destination guest.
-
-Message Signalled Interrupts
-
-An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
-written to the Doorbell register must be between 0 and the maximum number of
-vectors the guest supports. The lower 16 bits written to the doorbell is the
-MSI vector that will be raised in the destination guest. The number of MSI
-vectors is configurable but it is set when the VM is started.
-
-The important thing to remember with MSI is that it is only a signal, no status
-is set (since MSI interrupts are not shared). All information other than the
-interrupt itself should be communicated via the shared memory region. Devices
-supporting multiple MSI vectors can use different vectors to indicate different
-events have occurred. The semantics of interrupt vectors are left to the
-user's discretion.
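
Not part of the patch: a minimal C sketch of the Doorbell layout the removed text describes (high 16 bits are the peer ID to interrupt, low 16 bits are the vector). The register offsets are copied from the enum quoted in the diff. The helper names ivshmem_doorbell_value, ivshmem_doorbell_split and ivshmem_ring are hypothetical and come from neither QEMU nor any guest driver, and bar0 is assumed to be a pointer to BAR0 that the guest has already mapped as 32-bit registers.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets within BAR0, as listed in the removed spec text. */
enum ivshmem_registers {
    IntrMask = 0,
    IntrStatus = 4,
    IVPosition = 8,
    Doorbell = 12
};

/* Hypothetical helper: compose a Doorbell value.
 * High 16 bits: ID of the peer to interrupt; low 16 bits: vector. */
static inline uint32_t ivshmem_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Hypothetical helper: split a Doorbell value back into its two fields. */
static inline void ivshmem_doorbell_split(uint32_t value,
                                          uint16_t *peer_id, uint16_t *vector)
{
    assert(peer_id && vector);
    *peer_id = (uint16_t)(value >> 16);
    *vector = (uint16_t)(value & 0xffff);
}

/* Hypothetical helper: ring a peer's doorbell.
 * Assumption: bar0 points at the guest's mapping of BAR0. */
static inline void ivshmem_ring(volatile uint32_t *bar0,
                                uint16_t peer_id, uint16_t vector)
{
    bar0[Doorbell / sizeof(uint32_t)] = ivshmem_doorbell_value(peer_id, vector);
}
```

For example, interrupting peer 3 on vector 1 would write 0x00030001 to offset 12 of BAR0. Because both fields are 16 bits wide, the uint16_t parameters already enforce the 0 to 65535 ID range that the spec text mentions.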