@@ -3,254 +3,399 @@
## Background
The design for *Non-Cooperative Migration of Guests*[1] explains that extra
-save records are required in the migrations stream to allow a guest running
-PV drivers to be migrated without its co-operation. Moreover the save
-records must include details of registered xenstore watches as well as
-content; information that cannot currently be recovered from `xenstored`,
-and hence some extension to the xenstore protocol[2] will also be required.
-
-The *libxenlight Domain Image Format* specification[3] already defines a
-record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
-transferring xenstore data pertaining to the domain directly as it is
-specified such that keys are relative to the path
-`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
-define at least one new save record type.
+save records are required in the migration stream to allow a guest running PV
+drivers to be migrated without its co-operation. Moreover the save records must
+include details of registered xenstore watches as well as content; information
+that cannot currently be recovered from `xenstored`, and hence some extension
+to the xenstored implementations will also be required.
+
+As a similar set of data is needed for transferring xenstore data from one
+instance to another when live updating xenstored, this document proposes an
+image format for a 'migration stream' suitable for both purposes.
## Proposal
-### New Save Record
+The image format consists of a _header_ followed by one or more _records_. Each
+record consists of a type and length field, followed by any data mandated by
+the record type. At minimum there will be a single record of type `END`
+(defined below).
-A new mandatory record type should be defined within the libxenlight Domain
-Image Format:
+### Header
-`0x00000007: DOMAIN_XENSTORE_DATA`
+The header identifies the stream as a `xenstore` stream, including the version
+of the specification that it complies with.
-An arbitrary number of these records may be present in the migration
-stream and may appear in any order. The format of each record should be as
-follows:
+All fields in this header must be in _big-endian_ byte order, regardless of
+the setting of the endianness bit.
```
0 1 2 3 4 5 6 7 octet
+-------+-------+-------+-------+-------+-------+-------+-------+
-| type | record specific data |
-+-------------------------------+ |
-...
-+---------------------------------------------------------------+
+| ident |
++-------------------------------+-------------------------------+
+| version | flags |
++-------------------------------+-------------------------------+
```
-where type is one of the following values
+| Field | Description |
+|-----------|---------------------------------------------------|
+| `ident` | 0x78656e73746f7265 ('xenstore' in ASCII) |
+| | |
+| `version` | 0x00000001 (the version of the specification) |
+| | |
+| `flags` | 0 (LSB): Endianness: 0 = little, 1 = big |
+| | |
+| | 1-31: Reserved (must be zero) |
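
The header layout above can be checked with a few lines of code. The following is a minimal sketch, not part of the specification: the names `parse_header`, `XS_IDENT`, `XS_VERSION` and `FLAG_BIG_ENDIAN` are hypothetical, and the returned value is simply the Python `struct` byte-order prefix that a reader might use for the rest of the stream.

```python
import struct

XS_IDENT = 0x78656E73746F7265   # 'xenstore' in ASCII
XS_VERSION = 0x00000001         # version of the specification
FLAG_BIG_ENDIAN = 0x00000001    # flags bit 0: 0 = little, 1 = big


def parse_header(buf: bytes) -> str:
    """Validate the 16-octet header and return a struct byte-order prefix.

    The header itself is always big-endian, whatever the endianness bit says.
    """
    ident, version, flags = struct.unpack(">QII", buf[:16])
    if ident != XS_IDENT:
        raise ValueError("not a xenstore stream")
    if version != XS_VERSION:
        raise ValueError("unsupported version %#x" % version)
    if flags & ~FLAG_BIG_ENDIAN:
        raise ValueError("reserved flag bits must be zero")
    return ">" if flags & FLAG_BIG_ENDIAN else "<"


# Example: a version 1 header announcing a little-endian stream.
header = b"xenstore" + struct.pack(">II", XS_VERSION, 0)
byte_order = parse_header(header)
```
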
-| Field | Description |
-|--------|--------------------------------------------------|
-| `type` | 0x00000000: invalid |
-| | 0x00000001: NODE_DATA |
-| | 0x00000002: WATCH_DATA |
-| | 0x00000003: TRANSACTION_DATA |
-| | 0x00000004 - 0xFFFFFFFF: reserved for future use |
+### Records
+Records immediately follow the header and have the following format:
-and data is one of the record data formats described in the following
-sections.
+
+```
+ 0 1 2 3 4 5 6 7 octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| type | len |
++-------------------------------+-------------------------------+
+| body
+...
+| | padding (0 to 7 octets) |
++-------+-------------------------------------------------------+
+```
+
+NOTE: padding octets here and in all subsequent format specifications must be
+ written as zero and should be ignored when the stream is read.
-NOTE: The record data does not contain an overall length because the
-libxenlight record header specifies the length.
+| Field | Description |
+|--------|------------------------------------------------------|
+| `type` | 0x00000000: END |
+| | 0x00000001: GLOBAL_DATA |
+| | 0x00000002: CONNECTION_DATA |
+| | 0x00000003: WATCH_DATA |
+| | 0x00000004: TRANSACTION_DATA |
+| | 0x00000005: NODE_DATA |
+| | 0x00000006 - 0xFFFFFFFF: reserved for future use |
+| | |
+| `len` | The length (in octets) of `body` |
+| | |
+| `body` | The type-specific record data |
+Some records will depend on other records in the migration stream. Records
+upon which other records depend must always appear earlier in the stream.
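
As an illustration of the record framing, the sketch below walks a buffer of records until it reaches `END`. It rests on two assumptions that the text above does not state explicitly: that `len` excludes the padding, and that the padding rounds each body up to the next 8-octet boundary. The function name `iter_records` and the `order` parameter (a `struct` byte-order prefix chosen from the header's endianness bit) are hypothetical.

```python
import struct

REC_END = 0x00000000


def iter_records(buf: bytes, order: str = "<"):
    """Yield (type, body) for each record, stopping after the END record.

    Assumes `len` counts only the body and that each body is followed by
    0 to 7 zero octets so that the next record starts 8-octet aligned.
    """
    offset = 0
    while True:
        rtype, length = struct.unpack_from(order + "II", buf, offset)
        offset += 8
        body = buf[offset:offset + length]
        if len(body) != length:
            raise ValueError("truncated record body")
        yield rtype, body
        if rtype == REC_END:
            return
        offset += (length + 7) & ~7   # skip body plus padding


# Example stream: a TRANSACTION_DATA record, a 3-octet body padded with
# 5 zero octets, and the mandatory END record.
stream = (struct.pack("<II", 4, 8) + struct.pack("<II", 1, 7)
          + struct.pack("<II", 3, 3) + b"abc" + bytes(5)
          + struct.pack("<II", REC_END, 0))
records = list(iter_records(stream))
```

A stream that ends without an `END` record makes `struct.unpack_from` raise `struct.error`, which a real reader would want to report as a malformed image.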
-**NODE_DATA**
+The various formats of the type-specific data are described in the following
+sections:
+\pagebreak
-Each NODE_DATA record specifies a single node in xenstore and is formatted
-as follows:
+### END
+The end record marks the end of the image, and is the final record
+in the stream.
```
- 0 1 2 3 octet
-+-------+-------+-------+-------+
-| NODE_DATA |
-+-------------------------------+
-| path length |
-+-------------------------------+
-| path data |
-...
-| pad (0 to 3 octets) |
-+-------------------------------+
-| perm count (N) |
-+-------------------------------+
-| perm0 |
-+-------------------------------+
-...
-+-------------------------------+
-| permN |
-+-------------------------------+
-| value length |
-+-------------------------------+
-| value data |
-...
-| pad (0 to 3 octets) |
-+-------------------------------+
+ 0 1 2 3 4 5 6 7 octet
++-------+-------+-------+-------+-------+-------+-------+-------+
```
-where perm0..N are formatted as follows:
+The end record contains no fields; its body length is 0.
+
+\pagebreak
+
+### GLOBAL_DATA
+
+This record is only relevant for live update. It contains details of global
+xenstored state that needs to be restored.
```
- 0 1 2 3 octet
+ 0 1 2 3 octet
+-------+-------+-------+-------+
-| perm | pad | domid |
+| rw-socket-fd |
++-------------------------------+
+| ro-socket-fd |
+-------------------------------+
```
-path length and value length are specified in octets (excluding the NUL
-terminator of the path). perm should be one of the ASCII values `w`, `r`,
-`b` or `n` as described in [2]. All pad values should be 0.
-All paths should be absolute (i.e. start with `/`) and as described in
-[2].
+| Field | Description |
+|----------------|----------------------------------------------|
+| `rw-socket-fd` | The file descriptor of the socket accepting |
+| | read-write connections |
+| | |
+| `ro-socket-fd` | The file descriptor of the socket accepting |
+| | read-only connections |
+
+xenstored will resume in the original process context. Hence `rw-socket-fd` and
+`ro-socket-fd` simply specify the file descriptors of the sockets. Sockets
+are not always used, however, and so -1 will be used to denote an unused
+socket.
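
A reader might decode this body as sketched below. This assumes both fields are signed 32-bit integers in the stream's byte order; `parse_global_data` is a hypothetical name, and mapping -1 to `None` is only one way to represent an unused socket.

```python
import struct


def parse_global_data(body: bytes, order: str = "<"):
    """Return (rw_socket_fd, ro_socket_fd), with None for an unused socket."""
    rw_fd, ro_fd = struct.unpack_from(order + "ii", body)
    return (None if rw_fd == -1 else rw_fd,
            None if ro_fd == -1 else ro_fd)


# Example: read-write socket on fd 5, no read-only socket.
fds = parse_global_data(struct.pack("<ii", 5, -1))
```
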
-**WATCH_DATA**
+\pagebreak
+### CONNECTION_DATA
-Each WATCH_DATA record specifies a registered watch and is formatted as
-follows:
+For live update the image format will contain a `CONNECTION_DATA` record for
+each connection to xenstore. For migration it will only contain a record for
+the domain being migrated.
```
- 0 1 2 3 octet
-+-------+-------+-------+-------+
-| WATCH_DATA |
-+-------------------------------+
-| wpath length |
-+-------------------------------+
-| wpath data |
-...
-| pad (0 to 3 octets) |
-+-------------------------------+
+ 0 1 2 3 4 5 6 7 octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| conn-id | conn-type | conn-spec
...
++-------------------------------+-------------------------------+
+| data-len | data
+-------------------------------+
-| token length |
-+-------------------------------+
-| token data |
...
-| pad (0 to 3 octets) |
-+-------------------------------+
```
-wpath length and token length are specified in octets (excluding the NUL
-terminator). The wpath should be as described for the `WATCH` operation in
-[2]. The token is an arbitrary string of octets not containing any NUL
-values.
+| Field | Description |
+|-------------|-------------------------------------------------|
+| `conn-id` | A non-zero number used to identify this |
+| | connection in subsequent connection-specific |
+| | records |
+| | |
+| `conn-type` | 0x0000: shared ring |
+| | 0x0001: socket |
+| | 0x0002 - 0xFFFF: reserved for future use |
+| | |
+| `conn-spec` | See below |
+| | |
+| `data-len` | The length (in octets) of any pending data not |
+| | yet written to the connection |
+| | |
+| `data` | Pending data (may be empty) |
-**TRANSACTION_DATA**
+The format of `conn-spec` is dependent upon `conn-type`.
+\pagebreak
-Each TRANSACTION_DATA record specifies an open transaction and is formatted
-as follows:
+For `shared ring` connections it is as follows:
```
- 0 1 2 3 octet
-+-------+-------+-------+-------+
-| TRANSACTION_DATA |
-+-------------------------------+
-| tx_id |
-+-------------------------------+
+ 0 1 2 3 4 5 6 7 octet
+ +-------+-------+
+ | flags |
++---------------+---------------+---------------+---------------+
+| domid | tdomid | evtchn |
++-------------------------------+-------------------------------+
```
-where tx_id is the non-zero identifier values of an open transaction.
+| Field | Description |
+|-----------|---------------------------------------------------|
+| `domid` | The domain-id that owns the shared page |
+| | |
+| `tdomid` | The domain-id that `domid` acts on behalf of if |
+| | it has been subject to a SET_TARGET |
+| | operation [2] or DOMID_INVALID [3] otherwise |
+| | |
+| `flags` | Must be zero |
+| | |
+| `evtchn` | The port number of the interdomain channel used |
+| | by `domid` to communicate with xenstored |
+| | |
-### Protocol Extension
+The GFN of the shared page is not specified, since the ABI guarantees that
+entry 1 in `domid`'s grant table will always contain it.
-Before xenstore state is migrated it is necessary to wait for any pending
-reads, writes, watch registrations etc. to complete, and also to make sure
-that xenstored does not start processing any new requests (so that new
-requests remain pending on the shared ring for subsequent processing on the
-new host). Hence the following operation is needed:
+For `socket` connections it is as follows:
-```
-QUIESCE <domid>|
-Complete processing of any request issued by the specified domain, and
-do not process any further requests from the shared ring.
+```
+ +-------+-------+
+ | flags |
++---------------+---------------+---------------+---------------+
+| socket-fd | pad |
++-------------------------------+-------------------------------+
```
-The `WATCH` operation does not allow specification of a `<domid>`; it is
-assumed that the watch pertains to the domain that owns the shared ring
-over which the operation is passed. Hence, for the tool-stack to be able
-to register a watch on behalf of a domain a new operation is needed:
-```
-ADD_DOMAIN_WATCHES <domid>|<watch>|+
+| Field | Description |
+|-------------|-------------------------------------------------|
+| `flags` | A bit-wise OR of: |
+| | 0x0001: read-only |
+| | |
+| `socket-fd` | The file descriptor of the connected socket |
-Adds watches on behalf of the specified domain.
+This type of connection is only relevant for live update, where xenstored
+resumes in the original process context. Hence `socket-fd` simply specifies
+the file descriptor of the socket connection.
-<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
-operation are identical to the domain issuing WATCH <path>|<token>| for
-each <watch>.
-```
+\pagebreak
+
+### WATCH_DATA
+
+The image format will contain a `WATCH_DATA` record for each watch registered
+by a connection for which there is a `CONNECTION_DATA` record previously present.
-The watch information for a domain also needs to be extracted from the
-sending xenstored so the following operation is also needed:
```
-GET_DOMAIN_WATCHES <domid>|<index> <gencnt>|<watch>|*
+ 0 1 2 3 octet
++-------+-------+-------+-------+
+| conn-id |
++---------------+---------------+
+| wpath-len | token-len |
++---------------+---------------+
+| wpath
+...
+| token
+...
+```
+
+
+| Field | Description |
+|-------------|-------------------------------------------------|
+| `conn-id` | The connection that issued the `WATCH` |
+| | operation [2] |
+| | |
+| `wpath-len` | The length (in octets) of `wpath` including the |
+| | NUL terminator |
+| | |
+| `token-len` | The length (in octets) of `token` including the |
+| | NUL terminator |
+| | |
+| `wpath` | The watch path, as specified in the `WATCH` |
+| | operation |
+| | |
+| `token` | The watch identifier token, as specified in the |
+| | `WATCH` operation |
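
The field layout above could be decoded along the following lines. This is a sketch under stated assumptions: `conn-id` is taken to be an unsigned 32-bit value and the two lengths unsigned 16-bit values, as the diagram suggests, and `parse_watch_data` is a hypothetical helper rather than anything defined by xenstored.

```python
import struct


def parse_watch_data(body: bytes, order: str = "<"):
    """Return (conn_id, wpath, token) with the NUL terminators removed."""
    conn_id, wpath_len, token_len = struct.unpack_from(order + "IHH", body)
    wpath = body[8:8 + wpath_len]
    token = body[8 + wpath_len:8 + wpath_len + token_len]
    if len(wpath) != wpath_len or len(token) != token_len:
        raise ValueError("truncated WATCH_DATA body")
    # Both lengths include the NUL terminator, so both strings must end in one.
    if not wpath.endswith(b"\0") or not token.endswith(b"\0"):
        raise ValueError("missing NUL terminator")
    return conn_id, wpath[:-1], token[:-1]


# Example: connection 7 watching a device path with token "tok".
example_wpath = b"/local/domain/1/device\0"
example_token = b"tok\0"
example_body = (struct.pack("<IHH", 7, len(example_wpath), len(example_token))
                + example_wpath + example_token)
watch = parse_watch_data(example_body)
```
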
+
+\pagebreak
-Gets the list of watches that are currently registered for the domain.
+### TRANSACTION_DATA
-<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
-will start at <index> items into the the overall list of watches and may
-be truncated (at a <watch> boundary) such that the returned data fits
-within XENSTORE_PAYLOAD_MAX.
+The image format will contain a `TRANSACTION_DATA` record for each transaction
+that is pending on a connection for which there is a `CONNECTION_DATA` record
+previously present.
-If <index> is beyond the end of the overall list then the returned sub-
-list will be empty. If the value of <gencnt> changes then it indicates
-that the overall watch list has changed and thus it may be necessary
-to re-issue the operation for previous values of <index>.
+
+```
+ 0 1 2 3 octet
++-------+-------+-------+-------+
+| conn-id |
++-------------------------------+
+| tx-id |
++-------------------------------+
```
-To deal with transactions that were pending when the domain is migrated
-it is necessary to start transactions with the same tx_id on behalf of the
-domain in the receiving xenstored.
-NOTE: For safety each such transaction should result in an `EAGAIN` when
-the `TRANSACTION_END` operation is performed, as modifications made under
-the tx_id will not be part of the migration stream.
+| Field | Description |
+|----------------|----------------------------------------------|
+| `conn-id` | The connection that issued the |
+| | `TRANSACTION_START` operation [2] |
+| | |
+| `tx-id` | The transaction id passed back to the domain |
+| | by the `TRANSACTION_START` operation |
-The `TRANSACTION_START` operation does not allow specification of a
-`<domid>`; it is assumed that the transaction pertains to the domain that
-owns the shared ring over which the operation is passed. Neither does it
-allow a `<transid>` to be specified; it is always chosen by xenstored.
-Hence, for the tool-stack to be able to open a transaction on behalf of a
-domain a new operation is needed:
+\pagebreak
+### NODE_DATA
+
+For live update the image format will contain a `NODE_DATA` record for each
+node in xenstore. For migration it will only contain a record for the nodes
+relating to the domain being migrated. The `NODE_DATA` may be related to
+a _committed_ node (globally visible in xenstored) or a _pending_ node (created
+or modified by a transaction for which there is also a `TRANSACTION_DATA`
+record previously present).
+
+
+```
+ 0 1 2 3 octet
++-------+-------+-------+-------+
+| conn-id |
++-------------------------------+
+| tx-id |
++---------------+---------------+
+| path-len | value-len |
++---------------+---------------+
+| access | perm-count |
++---------------+---------------+
+| perm1 |
++-------------------------------+
+...
++-------------------------------+
+| permN |
++---------------+---------------+
+| path
+...
+| value
+...
```
-START_DOMAIN_TRANSACTION <domid>|<transid>|
-Starts a transaction on behalf of a domain.
-The semantics of this are similar to the domain issuing
-TRANSACTION_START and receiving the specified <transid> as the response.
-The main difference is that the transaction will be immediately marked as
-'conflicting' such that when the domain issues TRANSACTION_END T|, it will
-result in EAGAIN.
+| Field | Description |
+|--------------|------------------------------------------------|
+| `conn-id` | If this value is non-zero then this record |
+| | relates to a pending transaction |
+| | |
+| `tx-id` | This value should be ignored if `conn-id` is |
+| | zero. Otherwise it specifies the id of the |
+| | pending transaction |
+| | |
+| `path-len` | The length (in octets) of `path` including the |
+| | NUL terminator |
+| | |
+| `value-len` | The length (in octets) of `value` (which will |
+| | be zero for a deleted node) |
+| | |
+| `access` | This value should be ignored if this record |
+| | does not relate to a pending transaction, |
+| | otherwise it specifies the accesses made to |
+| | the node and hence is a bitwise OR of: |
+| | |
+| | 0x0001: read |
+| | 0x0002: written |
+| | |
+| | The value will be zero for a deleted node |
+| | |
+| `perm-count` | The number (N) of node permission specifiers |
+| | (which will be 0 for a node deleted in a |
+| | pending transaction) |
+| | |
+| `perm1..N` | A list of zero or more node permission |
+| | specifiers (see below) |
+| | |
+| `path` | The absolute path of the node |
+| | |
+| `value` | The node value (which may be empty or contain |
+| | NUL octets) |
+
+
+A node permission specifier has the following format:
+
+
+```
+ 0 1 2 3 octet
++-------+-------+-------+-------+
+| perm | pad | domid |
++-------+-------+---------------+
```
-It may also be desirable to state in the protocol specification that
-the `INTRODUCE` operation should not clear the `<gfn>` specified such that
-a `RELEASE` operation followed by an `INTRODUCE` operation form an
-idempotent pair. The current implementation of *C xentored* does this
-(in the `domain_conn_reset()` function) but this could be dropped as this
-behaviour is not currently specified and the page will always be zeroed
-for a newly created domain.
+| Field | Description |
+|---------|-----------------------------------------------------|
+| `perm` | One of the ASCII values `w`, `r`, `b` or `n` as |
+| | specified for the `SET_PERMS` operation [2] |
+| | |
+| `domid` | The domain-id to which the permission relates |
+Note that perm1 defines the domain owning the node. See [4] for more
+explanation of node permissions.
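
To show how the `NODE_DATA` fields and the permission specifiers fit together, here is a minimal decoding sketch. It assumes the field widths shown in the diagrams (32-bit `conn-id` and `tx-id`, 16-bit lengths, `access` and `perm-count`, 4-octet permission specifiers); the function name `parse_node_data` and the dictionary keys are hypothetical.

```python
import struct


def parse_node_data(body: bytes, order: str = "<") -> dict:
    """Decode a NODE_DATA body into a dictionary of its fields."""
    (conn_id, tx_id, path_len, value_len,
     access, perm_count) = struct.unpack_from(order + "IIHHHH", body)
    offset = 16
    perms = []
    for _ in range(perm_count):
        # Each specifier is: perm (1 octet), pad (1 octet), domid (2 octets).
        perm, _pad, domid = struct.unpack_from(order + "cBH", body, offset)
        perms.append((perm.decode("ascii"), domid))
        offset += 4
    path = body[offset:offset + path_len]
    offset += path_len
    value = body[offset:offset + value_len]
    if len(path) != path_len or len(value) != value_len:
        raise ValueError("truncated NODE_DATA body")
    return {
        "pending": conn_id != 0,   # non-zero conn-id: part of a transaction
        "conn_id": conn_id,
        "tx_id": tx_id,            # only meaningful when pending
        "access": access,          # only meaningful when pending
        "perms": perms,            # the first entry names the owning domain
        "path": path.rstrip(b"\0"),
        "value": value,            # may be empty or contain NUL octets
    }


# Example: a committed node owned by domain 0 and readable by domain 5.
example_perms = struct.pack("<cBH", b"n", 0, 0) + struct.pack("<cBH", b"r", 0, 5)
example_path = b"/local/domain/5/name\0"
example_value = b"guest"
node = parse_node_data(
    struct.pack("<IIHHHH", 0, 0, len(example_path), len(example_value), 0, 2)
    + example_perms + example_path + example_value)
```
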
* * *
[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+
[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
-[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
+
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/xen.h;hb=HEAD#l612
+
+[4] See https://wiki.xen.org/wiki/XenBus
\ No newline at end of file
@@ -289,23 +289,6 @@ IS_DOMAIN_INTRODUCED <domid>| T| or F|
ie, if INTRODUCE for the domain has not yet been followed by
domain destruction or explicit RELEASE.
-RESUME <domid>|
-
- Arranges that @releaseDomain events will once more be
- generated when the domain becomes shut down. This might have
- to be used if a domain were to be shut down (generating one
- @releaseDomain) and then subsequently restarted, since the
- state-sensitive algorithm in xenstored will not otherwise send
- further watch event notifications if the domain were to be
- shut down again.
-
- It is not clear whether this is possible since one would
- normally expect a domain not to be restarted after being shut
- down without being destroyed in the meantime. There are
- currently no users of this request in xen-unstable.
-
- xenstored prevents the use of RESUME other than by dom0.
-
SET_TARGET <domid>|<tdomid>|
Notifies xenstored that domain <domid> is targeting domain
<tdomid>. This grants domain <domid> full access to paths