From patchwork Mon Apr 27 07:53:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paul Durrant X-Patchwork-Id: 11511391 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3C2F281 for ; Mon, 27 Apr 2020 07:54:27 +0000 (UTC) Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 0D28C2078E for ; Mon, 27 Apr 2020 07:54:26 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=xen.org header.i=@xen.org header.b="KFC3Jcqe" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0D28C2078E Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=xen.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=xen-devel-bounces@lists.xenproject.org Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1jSyaM-0003dP-KD; Mon, 27 Apr 2020 07:53:50 +0000 Received: from all-amaz-eas1.inumbo.com ([34.197.232.57] helo=us1-amaz-eas2.inumbo.com) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1jSyaK-0003dK-G9 for xen-devel@lists.xenproject.org; Mon, 27 Apr 2020 07:53:48 +0000 X-Inumbo-ID: 3922cca2-885c-11ea-9745-12813bfff9fa Received: from mail.xenproject.org (unknown [104.130.215.37]) by us1-amaz-eas2.inumbo.com (Halon) with ESMTPS id 3922cca2-885c-11ea-9745-12813bfff9fa; Mon, 27 Apr 2020 07:53:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=xen.org; s=20200302mail; h=Message-Id:Date:Subject:Cc:To:From:Sender:Reply-To: MIME-Version:Content-Type:Content-Transfer-Encoding:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=S2VSSQT2kfz3fL8yU/OB6ihdy22xumVyR3OIKwJAw10=; b=KFC3Jcqe2X22p2b3CcG5hWiANn VSdrsvupXkJKwjlxh45Rj223WNMJbEeTlXCbDvssilxs4VvD/CHh/2FXeSjYaNKp0K95w9yRAGeAd 4d9tQO3kKBzRS90UOhtA6t4U0ZE7PMM7c1QqJ6DQ0BZg30a1Jtrxk+w63kPrHNlOFIis=; Received: from xenbits.xenproject.org ([104.239.192.120]) by mail.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1jSyaI-0000Gk-6p; Mon, 27 Apr 2020 07:53:46 +0000 Received: from [54.239.6.186] (helo=CBG-R90WXYV0.amazon.com) by xenbits.xenproject.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.89) (envelope-from ) id 1jSyaH-0001yS-Mv; Mon, 27 Apr 2020 07:53:46 +0000 From: Paul Durrant To: xen-devel@lists.xenproject.org Subject: [PATCH v2] docs/designs: re-work the xenstore migration document... Date: Mon, 27 Apr 2020 08:53:42 +0100 Message-Id: <20200427075342.149-1-paul@xen.org> X-Mailer: git-send-email 2.17.1 X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Cc: Paul Durrant Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" From: Paul Durrant ... to specify a separate migration stream that will also be suitable for live update. The original scope of the document was to support non-cooperative migration of guests [1] but, since then, live update of xenstored has been brought into scope. Thus it makes more sense to define a separate image format for serializing xenstore state that is suitable for both purposes. The document has been limited to specifying a new image format. The mechanism for acquiring the image for live update or migration is not covered as that is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is also expected that, when the first implementation of live update or migration making use of this specification is committed, that the document is moved from docs/designs into docs/specs. [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md Signed-off-by: Paul Durrant --- Juergen Gross Andrew Cooper George Dunlap Ian Jackson Jan Beulich Julien Grall Stefano Stabellini Wei Liu v2: - Address comments from Juergen --- docs/designs/xenstore-migration.md | 475 +++++++++++++++++++---------- 1 file changed, 312 insertions(+), 163 deletions(-) diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md index 6ab351e8fe..815aaeee59 100644 --- a/docs/designs/xenstore-migration.md +++ b/docs/designs/xenstore-migration.md @@ -3,254 +3,403 @@ ## Background The design for *Non-Cooperative Migration of Guests*[1] explains that extra -save records are required in the migrations stream to allow a guest running -PV drivers to be migrated without its co-operation. Moreover the save -records must include details of registered xenstore watches as well as -content; information that cannot currently be recovered from `xenstored`, -and hence some extension to the xenstore protocol[2] will also be required. - -The *libxenlight Domain Image Format* specification[3] already defines a -record type `EMULATOR_XENSTORE_DATA` but this is not suitable for -transferring xenstore data pertaining to the domain directly as it is -specified such that keys are relative to the path -`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to -define at least one new save record type. +save records are required in the migrations stream to allow a guest running PV +drivers to be migrated without its co-operation. Moreover the save records must +include details of registered xenstore watches as well ascontent; information +that cannot currently be recovered from `xenstored`, and hence some extension +to the xenstored implementations will also be required. + +As a similar set of data is needed for transferring xenstore data from one +instance to another when live updating xenstored this document proposes an +image format for a 'migration stream' suitable for both purposes. ## Proposal -### New Save Record +The image format consists of a _header_ followed by 1 or more _records_. Each +record consists of a type and length field, followed by any data mandated by +the record type. At minimum there will be a single record of type `END` +(defined below). -A new mandatory record type should be defined within the libxenlight Domain -Image Format: +### Header -`0x00000007: DOMAIN_XENSTORE_DATA` +The header identifies the stream as a `xenstore` stream, including the version +of the specification that it complies with. -An arbitrary number of these records may be present in the migration -stream and may appear in any order. The format of each record should be as -follows: +All fields in this header must be in _big-endian_ byte order, regardless of +the setting of the endianness bit. ``` 0 1 2 3 4 5 6 7 octet +-------+-------+-------+-------+-------+-------+-------+-------+ -| type | record specific data | -+-------------------------------+ | -... -+---------------------------------------------------------------+ +| ident | ++-------------------------------+-------------------------------| +| version | flags | ++-------------------------------+-------------------------------+ ``` -where type is one of the following values +| Field | Description | +|-----------|---------------------------------------------------| +| `ident` | 0x78656e73746f7265 ('xenstore' in ASCII) | +| | | +| `version` | 0x00000001 (the version of the specification) | +| | | +| `flags` | 0 (LSB): Endianness: 0 = little, 1 = big | +| | | +| | 1-31: Reserved (must be zero) | -| Field | Description | -|--------|--------------------------------------------------| -| `type` | 0x00000000: invalid | -| | 0x00000001: NODE_DATA | -| | 0x00000002: WATCH_DATA | -| | 0x00000003: TRANSACTION_DATA | -| | 0x00000004 - 0xFFFFFFFF: reserved for future use | +### Records +Records immediately follow the header and have the following format: -and data is one of the record data formats described in the following -sections. +``` + 0 1 2 3 4 5 6 7 octet ++-------+-------+-------+-------+-------+-------+-------+-------+ +| type | len | ++-------------------------------+-------------------------------+ +| body +... +| | padding (0 to 7 octets) | ++-------+-------------------------------------------------------+ +``` + +NOTE: padding octets here and in all subsequent format specifications must be + written as zero and should be ignored when the stream is read. -NOTE: The record data does not contain an overall length because the -libxenlight record header specifies the length. +| Field | Description | +|--------|------------------------------------------------------| +| `type` | 0x00000000: END | +| | 0x00000001: GLOBAL_DATA | +| | 0x00000002: CONNECTION_DATA | +| | 0x00000003: WATCH_DATA | +| | 0x00000004: TRANSACTION_DATA | +| | 0x00000005: NODE_DATA | +| | 0x00000006 - 0xFFFFFFFF: reserved for future use | +| | | +| `len` | The length (in octets) of `body` | +| | | +| `body` | The type-specific record data | -**NODE_DATA** +Some records will depend on other records in the migration stream. Records +upon which other records depend must always appear earlier in the stream. +The various formats of the type-specific data are described in the following +sections: -Each NODE_DATA record specifies a single node in xenstore and is formatted -as follows: +\pagebreak +### END + +The end record marks the end of the image, and is the final record +in the stream. ``` - 0 1 2 3 octet -+-------+-------+-------+-------+ -| NODE_DATA | -+-------------------------------+ -| path length | -+-------------------------------+ -| path data | -... -| pad (0 to 3 octets) | -+-------------------------------+ -| perm count (N) | -+-------------------------------+ -| perm0 | -+-------------------------------+ -... -+-------------------------------+ -| permN | -+-------------------------------+ -| value length | -+-------------------------------+ -| value data | -... -| pad (0 to 3 octets) | -+-------------------------------+ + 0 1 2 3 4 5 6 7 octet ++-------+-------+-------+-------+-------+-------+-------+-------+ ``` -where perm0..N are formatted as follows: +The end record contains no fields; its body length is 0. + +\pagebreak + +### GLOBAL_DATA + +This record is only relevant for live update. It contains details of global +xenstored state that needs to be restored. ``` - 0 1 2 3 octet + 0 1 2 3 octet +-------+-------+-------+-------+ -| perm | pad | domid | +| rw-socket-fd | ++-------------------------------+ +| ro-socket-fd | +-------------------------------+ ``` -path length and value length are specified in octets (excluding the NUL -terminator of the path). perm should be one of the ASCII values `w`, `r`, -`b` or `n` as described in [2]. All pad values should be 0. -All paths should be absolute (i.e. start with `/`) and as described in -[2]. +| Field | Description | +|----------------|----------------------------------------------| +| `rw-socket-fd` | The file descriptor of the socket accepting | +| | read-write connections | +| | | +| `ro-socket-fd` | The file descriptor of the socket accepting | +| | read-only connections | + +xenstored will resume in the original process context. Hence `rw-socket-fd` and +`ro-socket-fd` simply specify the file descriptors of the sockets. -**WATCH_DATA** +\pagebreak +### CONNECTION_DATA -Each WATCH_DATA record specifies a registered watch and is formatted as -follows: +For live update the image format will contain a `CONNECTION_DATA` record for +each connection to xenstore. For migration it will only contain a record for +the domain being migrated. ``` - 0 1 2 3 octet -+-------+-------+-------+-------+ -| WATCH_DATA | -+-------------------------------+ -| wpath length | -+-------------------------------+ -| wpath data | -... -| pad (0 to 3 octets) | -+-------------------------------+ + 0 1 2 3 4 5 6 7 octet ++-------+-------+-------+-------+-------+-------+-------+-------+ +| conn-id | pad | ++---------------+-----------------------------------------------+ +| conn-type | conn-spec ... ++-------------------------------+-------------------------------+ +| data-len | data +-------------------------------+ -| token length | -+-------------------------------+ -| token data | ... -| pad (0 to 3 octets) | -+-------------------------------+ ``` -wpath length and token length are specified in octets (excluding the NUL -terminator). The wpath should be as described for the `WATCH` operation in -[2]. The token is an arbitrary string of octets not containing any NUL -values. +| Field | Description | +|-------------|-------------------------------------------------| +| `conn-id` | A non-zero number used to identify this | +| | connection in subsequent connection-specific | +| | records | +| | | +| `conn-type` | 0x0000: shared ring | +| | 0x0001: socket | +| | | +| `conn-spec` | See below | +| | | +| `data-len` | The length (in octets) of any pending data not | +| | yet written to the connection | +| | | +| `data` | Pending data (may be empty) | -**TRANSACTION_DATA** +The format of `conn-spec` is dependent upon `conn-type`. +\pagebreak -Each TRANSACTION_DATA record specifies an open transaction and is formatted -as follows: +For `shared ring` connections it is as follows: ``` - 0 1 2 3 octet -+-------+-------+-------+-------+ -| TRANSACTION_DATA | -+-------------------------------+ -| tx_id | -+-------------------------------+ + 0 1 2 3 4 5 6 7 octet + +-------+-------+-------+-------+-------+-------+ + | domid | tdomid | flags | ++---------------+---------------+---------------+---------------+ +| revtchn | levtchn | ++-------------------------------+-------------------------------+ +| mfn | ++---------------------------------------------------------------+ ``` -where tx_id is the non-zero identifier values of an open transaction. - -### Protocol Extension +| Field | Description | +|------------|--------------------------------------------------| +| `domid` | The domain-id that owns the shared page | +| | | +| `tdomid` | The domain-id that `domid` acts on behalf of if | +| | it has been subject to an SET_TARGET | +| | operation [2] or DOMID_INVALID otherwise | +| | | +| `flags` | A bit-wise OR of: | +| | 0x0001: INTRODUCE has been issued | +| | 0x0002: RELEASE has been issued | +| | | +| `revtchn` | The port number of the interdomain channel used | +| | by `domid` to communicate with xenstored | +| | | +| `levtchn` | For a live update this will be the port number | +| | of the interdomain channel used by xenstored | +| | itself otherwise, for migration, it will be -1 | +| | | +| `mfn` | The MFN of the shared page for a live update or | +| | INVALID_MFN otherwise | + +Since the ABI guarantees that entry 1 in `domid`'s grant table will always +contain the GFN of the shared page, so for a live update `mfn` can be used to +give confidence that `domid` has not been re-cycled during the update. + + +For `socket` connections it is as follows: -Before xenstore state is migrated it is necessary to wait for any pending -reads, writes, watch registrations etc. to complete, and also to make sure -that xenstored does not start processing any new requests (so that new -requests remain pending on the shared ring for subsequent processing on the -new host). Hence the following operation is needed: ``` -QUIESCE | - -Complete processing of any request issued by the specified domain, and -do not process any further requests from the shared ring. + 0 1 2 3 4 5 6 7 octet + +-------+-------+-------+-------+-------+-------+ + | flags | socket-fd | + +---------------+-------------------------------+ ``` -The `WATCH` operation does not allow specification of a ``; it is -assumed that the watch pertains to the domain that owns the shared ring -over which the operation is passed. Hence, for the tool-stack to be able -to register a watch on behalf of a domain a new operation is needed: -``` -ADD_DOMAIN_WATCHES ||+ +| Field | Description | +|-------------|-------------------------------------------------| +| `flags` | A bit-wise OR of: | +| | 0001: read-only | +| | | +| `socket-fd` | The file descriptor of the connected socket | -Adds watches on behalf of the specified domain. +This type of connection is only relevant for live update, where the xenstored +resumes in the original process context. Hence `socket-fd` simply specify +the file descriptor of the socket connection. - is a NUL separated tuple of |. The semantics of this -operation are identical to the domain issuing WATCH || for -each . -``` +\pagebreak + +### WATCH_DATA + +The image format will contain a `WATCH_DATA` record for each watch registered +by a connection for which there is `CONNECTION_DATA` record previously present. -The watch information for a domain also needs to be extracted from the -sending xenstored so the following operation is also needed: ``` -GET_DOMAIN_WATCHES | ||* + 0 1 2 3 octet ++-------+-------+-------+-------+ +| conn-id | ++---------------+---------------+ +| wpath-len | token-len | ++---------------+---------------+ +| wpath +... +| token +... +``` + -Gets the list of watches that are currently registered for the domain. +| Field | Description | +|-------------|-------------------------------------------------| +| `conn-id` | The connection that issued the `WATCH` | +| | operation [2] | +| | | +| `wpath-len` | The length (in octets) of `wpath` including the | +| | NUL terminator | +| | | +| `token-len` | The length (in octets) of `token` including the | +| | NUL terminator | +| | | +| `wpath` | The watch path, as specified in the `WATCH` | +| | operation | +| | | +| `token` | The watch identifier token, as specified in the | +| | `WATCH` operation | + +\pagebreak + +### TRANSACTION_DATA + +The image format will contain a `TRANSACTION_DATA` record for each transaction +that is pending on a connection for which there is `CONNECTION_DATA` record +previously present. - is a NUL separated tuple of |. The sub-list returned -will start at items into the the overall list of watches and may -be truncated (at a boundary) such that the returned data fits -within XENSTORE_PAYLOAD_MAX. -If is beyond the end of the overall list then the returned sub- -list will be empty. If the value of changes then it indicates -that the overall watch list has changed and thus it may be necessary -to re-issue the operation for previous values of . ``` + 0 1 2 3 octet ++-------+-------+-------+-------+ +| conn-id | ++-------------------------------+ +| tx-id | ++-------------------------------+ +``` + + +| Field | Description | +|----------------|----------------------------------------------| +| `conn-id` | The connection that issued the | +| | `TRANSACTION_START` operation [2] | +| | | +| `tx-id` | The transaction id passed back to the domain | +| | by the `TRANSACTION_START` operation | + +\pagebreak -To deal with transactions that were pending when the domain is migrated -it is necessary to start transactions with the same tx_id on behalf of the -domain in the receiving xenstored. +### NODE_DATA -NOTE: For safety each such transaction should result in an `EAGAIN` when -the `TRANSACTION_END` operation is performed, as modifications made under -the tx_id will not be part of the migration stream. +For live update the image format will contain a `NODE_DATA` record for each +node in xenstore. For migration it will only contain a record for the nodes +relating to the domain being migrated. The `NODE_DATA` may be related to +a _committed_ node (globally visible in xenstored) or a _pending_ node (created +or modified by a transaction for which there is also a `TRANSACTION_DATA` +record previously present). -The `TRANSACTION_START` operation does not allow specification of a -``; it is assumed that the transaction pertains to the domain that -owns the shared ring over which the operation is passed. Neither does it -allow a `` to be specified; it is always chosen by xenstored. -Hence, for the tool-stack to be able to open a transaction on behalf of a -domain a new operation is needed: ``` -START_DOMAIN_TRANSACTION || + 0 1 2 3 octet ++-------+-------+-------+-------+ +| conn-id | ++-------------------------------+ +| tx-id | ++---------------+---------------+ +| access | perm-count | ++---------------+---------------+ +| perm1 | ++-------------------------------+ +... ++-------------------------------+ +| permN | ++---------------+---------------+ +| path-len | value-len | ++---------------+---------------+ +| path +... +| value +... +``` + + +| Field | Description | +|--------------|------------------------------------------------| +| `conn-id` | If this value is non-zero then this record | +| | related to a pending transaction | +| | | +| `tx-id` | This value should be ignored if `conn-id` is | +| | zero. Otherwise it specifies the id of the | +| | pending transaction | +| | | +| `access` | This value should be ignored if this record | +| | does not relate to a pending transaction, | +| | otherwise it specifies the accesses made to | +| | the node and hence is a bitwise OR of: | +| | | +| | 0x0001: read | +| | 0x0002: written | +| | | +| | The value will be zero for a deleted node | +| | | +| `perm-count` | The number (N) of node permission specifiers | +| | (which will be 0 for a node deleted in a | +| | pending transaction) | +| | | +| `perm1..N` | A list of zero or more node permission | +| | specifiers (see below) | +| | | +| `path-len` | The length (in octets) of `path` including the | +| | NUL terminator | +| | | +| `value-len` | The length (in octets) of `value` (which will | +| | be zero for a deleted node) | +| | | +| `path` | The absolute path of the node | +| | | +| `value` | The node value (which may be empty or contain | +| | NUL octets) | + + +A node permission specifier has the following format: -Starts a transaction on behalf of a domain. -The semantics of this are similar to the domain issuing -TRANSACTION_START and receiving the specified as the response. -The main difference is that the transaction will be immediately marked as -'conflicting' such that when the domain issues TRANSACTION_END T|, it will -result in EAGAIN. +``` + 0 1 2 3 octet ++-------+-------+-------+-------+ +| perm | pad | domid | ++-------+-------+---------------+ ``` -It may also be desirable to state in the protocol specification that -the `INTRODUCE` operation should not clear the `` specified such that -a `RELEASE` operation followed by an `INTRODUCE` operation form an -idempotent pair. The current implementation of *C xentored* does this -(in the `domain_conn_reset()` function) but this could be dropped as this -behaviour is not currently specified and the page will always be zeroed -for a newly created domain. +| Field | Description | +|---------|-----------------------------------------------------| +| `perm` | One of the ASCII values `w`, `r`, `b` or `n` as | +| | specified for the `SET_PERMS` operation [2] | +| | | +| `domid` | The domain-id to which the permission relates | * * * [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md + [2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt -[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc