diff mbox series

[V1,08/26] migration: vmstate_info_void_ptr

Message ID 1714406135-451286-9-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Headers show
Series Live update: cpr-exec | expand

Commit Message

Steven Sistare April 29, 2024, 3:55 p.m. UTC
Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
can be saved in the migration stream.  This will be needed for CPR.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h | 15 +++++++++++++++
 migration/vmstate-types.c   | 24 ++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

Comments

Fabiano Rosas May 7, 2024, 9:33 p.m. UTC | #1
Steve Sistare <steven.sistare@oracle.com> writes:

> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> can be saved in the migration stream.  This will be needed for CPR.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Peter Xu May 27, 2024, 6:31 p.m. UTC | #2
On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> can be saved in the migration stream.  This will be needed for CPR.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

This is really tricky.

From a first glance, I don't think migrating a VA is valid at all for
migration even if with exec.. and looks insane to me for a cross-process
migration, which seems to be allowed to use as a generic VMSD helper.. as
VA is the address space barrier for different processes and I think it
normally even apply to generic execve(), and we're trying to jailbreak for
some reason..

It definitely won't work for any generic migration as sizeof(void*) can be
different afaict between hosts, e.g. 32bit -> 64bit migrations.

Some description would be really helpful in this commit message,
e.g. explain the users and why.  Do we need to poison that for generic VMSD
use (perhaps with prefixed underscores)?  I think I'll need to read on the
rest to tell..

Thanks,
Steven Sistare May 28, 2024, 3:10 p.m. UTC | #3
On 5/27/2024 2:31 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
>> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
>> can be saved in the migration stream.  This will be needed for CPR.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> This is really tricky.
> 
>  From a first glance, I don't think migrating a VA is valid at all for
> migration even if with exec.. and looks insane to me for a cross-process
> migration, which seems to be allowed to use as a generic VMSD helper.. as
> VA is the address space barrier for different processes and I think it
> normally even apply to generic execve(), and we're trying to jailbreak for
> some reason..
> 
> It definitely won't work for any generic migration as sizeof(void*) can be
> different afaict between hosts, e.g. 32bit -> 64bit migrations.
> 
> Some description would be really helpful in this commit message,
> e.g. explain the users and why.  Do we need to poison that for generic VMSD
> use (perhaps with prefixed underscores)?  I think I'll need to read on the
> rest to tell..

Short answer: we never dereference the void* in the new process.  And must not.

Longer answer:

During CPR for vfio, each mapped DMA region is re-registered in the new
process using the new VA.  The ioctl to re-register identifies the mapping
by IOVA and length.

The same requirement holds for CPR of iommufd devices.  However, in the
iommufd framework, IOVA does not uniquely identify a dma mapping, and we
need to use the old VA as the unique identifier.  The new process
re-registers each mapping, passing the old VA and new VA to the kernel.
The old VA is never dereferenced in the new process, we just need its value.

I suspected that the void* which must not be dereferenced might make people
uncomfortable.  I have an older version of my code which adds a uint64_t
field to RAMBlock for recording and migrating the old VA.  The saving and
loading code is slightly less elegant, but no big deal.  Would you prefer
that?

- Steve
Peter Xu May 28, 2024, 6:21 p.m. UTC | #4
On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:
> On 5/27/2024 2:31 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> > > Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> > > can be saved in the migration stream.  This will be needed for CPR.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > This is really tricky.
> > 
> >  From a first glance, I don't think migrating a VA is valid at all for
> > migration even if with exec.. and looks insane to me for a cross-process
> > migration, which seems to be allowed to use as a generic VMSD helper.. as
> > VA is the address space barrier for different processes and I think it
> > normally even apply to generic execve(), and we're trying to jailbreak for
> > some reason..
> > 
> > It definitely won't work for any generic migration as sizeof(void*) can be
> > different afaict between hosts, e.g. 32bit -> 64bit migrations.
> > 
> > Some description would be really helpful in this commit message,
> > e.g. explain the users and why.  Do we need to poison that for generic VMSD
> > use (perhaps with prefixed underscores)?  I think I'll need to read on the
> > rest to tell..
> 
> Short answer: we never dereference the void* in the new process.  And must not.
> 
> Longer answer:
> 
> During CPR for vfio, each mapped DMA region is re-registered in the new
> process using the new VA.  The ioctl to re-register identifies the mapping
> by IOVA and length.
> 
> The same requirement holds for CPR of iommufd devices.  However, in the
> iommufd framework, IOVA does not uniquely identify a dma mapping, and we
> need to use the old VA as the unique identifier.  The new process
> re-registers each mapping, passing the old VA and new VA to the kernel.
> The old VA is never dereferenced in the new process, we just need its value.
> 
> I suspected that the void* which must not be dereferenced might make people
> uncomfortable.  I have an older version of my code which adds a uint64_t
> field to RAMBlock for recording and migrating the old VA.  The saving and
> loading code is slightly less elegant, but no big deal.  Would you prefer
> that?

I see, thanks for explaining.  Yes that sounds better to me.  Re the
ugliness: is that about a pre_save() plus one extra uint64_t field?  In
that case it looks better comparing to migrating "void*".

I'm trying to read some context on the vaddr remap thing from you, and I
found this:

https://lore.kernel.org/all/Y90bvBnrvRAcEQ%2F%2F@nvidia.com/

So it will work with iommufd now?  Meanwhile, what's the status for mdev?
Looks like it isn't supported yet for both.

Thanks,
Steven Sistare May 29, 2024, 5:30 p.m. UTC | #5
On 5/28/2024 2:21 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:
>> On 5/27/2024 2:31 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
>>>> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
>>>> can be saved in the migration stream.  This will be needed for CPR.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> This is really tricky.
>>>
>>>   From a first glance, I don't think migrating a VA is valid at all for
>>> migration even if with exec.. and looks insane to me for a cross-process
>>> migration, which seems to be allowed to use as a generic VMSD helper.. as
>>> VA is the address space barrier for different processes and I think it
>>> normally even apply to generic execve(), and we're trying to jailbreak for
>>> some reason..
>>>
>>> It definitely won't work for any generic migration as sizeof(void*) can be
>>> different afaict between hosts, e.g. 32bit -> 64bit migrations.
>>>
>>> Some description would be really helpful in this commit message,
>>> e.g. explain the users and why.  Do we need to poison that for generic VMSD
>>> use (perhaps with prefixed underscores)?  I think I'll need to read on the
>>> rest to tell..
>>
>> Short answer: we never dereference the void* in the new process.  And must not.
>>
>> Longer answer:
>>
>> During CPR for vfio, each mapped DMA region is re-registered in the new
>> process using the new VA.  The ioctl to re-register identifies the mapping
>> by IOVA and length.
>>
>> The same requirement holds for CPR of iommufd devices.  However, in the
>> iommufd framework, IOVA does not uniquely identify a dma mapping, and we
>> need to use the old VA as the unique identifier.  The new process
>> re-registers each mapping, passing the old VA and new VA to the kernel.
>> The old VA is never dereferenced in the new process, we just need its value.
>>
>> I suspected that the void* which must not be dereferenced might make people
>> uncomfortable.  I have an older version of my code which adds a uint64_t
>> field to RAMBlock for recording and migrating the old VA.  The saving and
>> loading code is slightly less elegant, but no big deal.  Would you prefer
>> that?
> 
> I see, thanks for explaining.  Yes that sounds better to me.  Re the
> ugliness: is that about a pre_save() plus one extra uint64_t field?  In
> that case it looks better comparing to migrating "void*".

Yes.  Will do.

> I'm trying to read some context on the vaddr remap thing from you, and I
> found this:
> 
> https://lore.kernel.org/all/Y90bvBnrvRAcEQ%2F%2F@nvidia.com/
> 
> So it will work with iommufd now?  Meanwhile, what's the status for mdev?
> Looks like it isn't supported yet for both.

I am currently working on the kernel and qemu code to support iommufd.
It is an RFE on top of this initial cpr-exec work that I will post separately
later.  I do know that it will require the old VA, so I am proposing to
preserve old VA now in the migration stream to avoid adding extra backwards
compatibility code later.

I have prototyped userland code that supports mdev, based on jason's suggestion to
fork an extra process to handle mdev translations during the transition from old
to new qemu, but it is a work in progres and adds complexity, and I do not plan to
submit that any time soon.  Another RFE.  For now mdev is not supported.

- Steve
diff mbox series

Patch

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a39c0e6..bb885d9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -236,6 +236,7 @@  extern const VMStateInfo vmstate_info_uint8;
 extern const VMStateInfo vmstate_info_uint16;
 extern const VMStateInfo vmstate_info_uint32;
 extern const VMStateInfo vmstate_info_uint64;
+extern const VMStateInfo vmstate_info_void_ptr;
 
 /** Put this in the stream when migrating a null pointer.*/
 #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
@@ -326,6 +327,17 @@  extern const VMStateInfo vmstate_info_qlist;
     .offset       = vmstate_offset_value(_state, _field, _type),     \
 }
 
+#define VMSTATE_SINGLE_TEST_NO_CHECK(_field, _state, _test,          \
+                                     _version, _info, _type) {       \
+    .name         = (stringify(_field)),                             \
+    .version_id   = (_version),                                      \
+    .field_exists = (_test),                                         \
+    .size         = sizeof(_type),                                   \
+    .info         = &(_info),                                        \
+    .flags        = VMS_SINGLE,                                      \
+    .offset       = offsetof(_state, _field)                         \
+}
+
 #define VMSTATE_SINGLE_FULL(_field, _state, _test, _version, _info,  \
                             _type, _err_hint) {                      \
     .name         = (stringify(_field)),                             \
@@ -952,6 +964,9 @@  extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64(_f, _s)                                        \
     VMSTATE_UINT64_V(_f, _s, 0)
 
+#define VMSTATE_VOID_PTR(_f, _s)                                      \
+    VMSTATE_SINGLE_TEST_NO_CHECK(_f, _s, NULL, 0, vmstate_info_void_ptr, void *)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8(_f, _s)                                         \
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index e83bfcc..097ecad 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -314,6 +314,30 @@  const VMStateInfo vmstate_info_uint64 = {
     .put  = put_uint64,
 };
 
+/* 64 bit pointer */
+
+static int get_void_ptr(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field)
+{
+    void **v = pv;
+    qemu_get_be64s(f, (uint64_t *)v);
+    return 0;
+}
+
+static int put_void_ptr(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field, JSONWriter *vmdesc)
+{
+    void **v = pv;
+    qemu_put_be64s(f, (uint64_t *)v);
+    return 0;
+}
+
+const VMStateInfo vmstate_info_void_ptr = {
+    .name = "void_ptr",
+    .get  = get_void_ptr,
+    .put  = put_void_ptr,
+};
+
 static int get_nullptr(QEMUFile *f, void *pv, size_t size,
                        const VMStateField *field)