[v2,00/17] Multifd

Message ID: cover.1724701542.git.maciej.szmigiero@oracle.com (mailing list archive)

Maciej S. Szmigiero Aug. 27, 2024, 5:54 p.m. UTC
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This is an updated v2 patch series of the v1 series located here:
https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/

Changes from v1:
* Extended the QEMU thread-pool with non-AIO (generic) pool support and
implemented automatic memory management for its work element
function argument.

* Introduced a multifd device state save thread pool and ported the VFIO
multifd device state save implementation to use this thread pool instead
of VFIO internally managed individual threads.

* Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/

* Moved device state related multifd code to new multifd-device-state.c
file where it made sense.

* Implemented a max in-flight VFIO device state buffer count limit to
allow capping the maximum recipient memory usage.

* Removed unnecessary explicit memory barriers from multifd_send().

* A few small changes like updated comments, code formatting,
fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
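To illustrate the first bullet above: the generic pool lets a work item carry a destructor for its argument, so the pool frees the argument once the work function has run. The sketch below is a hypothetical stand-in, not the actual QEMU thread_pool_submit() API: it runs the work synchronously, and all names (pool_submit, run_demo, etc.) are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*WorkFunc)(void *arg);
typedef void (*DestroyNotify)(void *arg);

static int destroyed;   /* counts how many work arguments were auto-freed */

static void count_destroy(void *arg)
{
    destroyed++;
    free(arg);
}

/*
 * Illustrative stand-in for a generic (non-AIO) pool submit.  A real
 * pool would queue the work item and run it on a worker thread; here it
 * runs synchronously.  The point being illustrated is ownership: once
 * submitted, the pool owns 'arg' and calls 'destroy' on it after the
 * work function returns, so the submitter never has to free it.
 */
static void pool_submit(WorkFunc func, void *arg, DestroyNotify destroy)
{
    func(arg);
    if (destroy) {
        destroy(arg);
    }
}

static void bump(void *arg)
{
    int *p = arg;
    *p += 1;
}

int run_demo(void)
{
    for (int i = 0; i < 3; i++) {
        int *buf = malloc(sizeof(*buf));
        *buf = i;
        pool_submit(bump, buf, count_destroy);
    }
    return destroyed;   /* every argument was freed by the "pool" */
}
```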


For convenience, this patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio

Based-on: <20240823173911.6712-1-farosas@suse.de>


Maciej S. Szmigiero (17):
  vfio/migration: Add save_{iterate,complete_precopy}_started trace
    events
  migration/ram: Add load start trace event
  migration/multifd: Zero p->flags before starting filling a packet
  thread-pool: Add a DestroyNotify parameter to
    thread_pool_submit{,_aio}()
  thread-pool: Implement non-AIO (generic) pool support
  migration: Add save_live_complete_precopy_{begin,end} handlers
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  migration: Add load_finish handler and associated functions
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Convert multifd_send()::next_channel to atomic
  migration/multifd: Add an explicit MultiFDSendData destructor
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add migration_has_device_state_support()
  migration: Add save_live_complete_precopy_thread handler
  vfio/migration: Multifd device state transfer support - receive side
  vfio/migration: Add x-migration-multifd-transfer VFIO property
  vfio/migration: Multifd device state transfer support - send side

 backends/tpm/tpm_backend.c       |   2 +-
 block/file-win32.c               |   2 +-
 hw/9pfs/coth.c                   |   3 +-
 hw/ppc/spapr_nvdimm.c            |   4 +-
 hw/vfio/migration.c              | 520 ++++++++++++++++++++++++++++++-
 hw/vfio/pci.c                    |   9 +
 hw/vfio/trace-events             |  14 +-
 hw/virtio/virtio-pmem.c          |   2 +-
 include/block/thread-pool.h      |  12 +-
 include/hw/vfio/vfio-common.h    |  22 ++
 include/migration/misc.h         |  15 +
 include/migration/register.h     |  97 ++++++
 include/qemu/typedefs.h          |   4 +
 migration/meson.build            |   1 +
 migration/migration.c            |   6 +
 migration/migration.h            |   3 +
 migration/multifd-device-state.c | 193 ++++++++++++
 migration/multifd-nocomp.c       |   9 +-
 migration/multifd-qpl.c          |   2 +-
 migration/multifd-uadk.c         |   2 +-
 migration/multifd-zlib.c         |   2 +-
 migration/multifd-zstd.c         |   2 +-
 migration/multifd.c              | 249 ++++++++++++---
 migration/multifd.h              |  65 +++-
 migration/ram.c                  |   1 +
 migration/savevm.c               | 152 ++++++++-
 migration/savevm.h               |   7 +
 migration/trace-events           |   1 +
 tests/unit/test-thread-pool.c    |   8 +-
 util/thread-pool.c               |  83 ++++-
 30 files changed, 1406 insertions(+), 86 deletions(-)
 create mode 100644 migration/multifd-device-state.c
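One of the changes listed above caps recipient memory usage by limiting the number of in-flight device state buffers. A minimal single-threaded model of such a cap follows; the names and the limit of 4 are invented for illustration, and a real implementation (the series applies the idea to VFIO device state buffers on the receive side) would block or defer rather than fail the submission.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_IN_FLIGHT = 4 };   /* illustrative cap, not the series' value */

static int in_flight;

/* Returns false once the cap is reached; a real implementation would
 * wait for a completion instead of failing. */
bool buffer_submit(void)
{
    if (in_flight >= MAX_IN_FLIGHT) {
        return false;
    }
    in_flight++;
    return true;
}

/* Called when a buffer has been consumed, releasing one slot. */
void buffer_complete(void)
{
    assert(in_flight > 0);
    in_flight--;
}
```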

Comments

Fabiano Rosas Aug. 28, 2024, 8:46 p.m. UTC | #1
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> (..)
>
> For convenience, this patch set is also available as a git tree:
> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio

With this branch I'm getting:

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/uri/plain/none
...
qemu-system-x86_64: ../util/thread-pool.c:354: thread_pool_set_minmax_threads: Assertion `max_threads > 0' failed.
Broken pipe


$ ./tests/qemu-iotests/check -p -qcow2 068
...
+qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
Maciej S. Szmigiero Aug. 28, 2024, 9:58 p.m. UTC | #2
On 28.08.2024 22:46, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> (..)
>>
>> For convenience, this patch set is also available as a git tree:
>> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
> 
> With this branch I'm getting:
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/uri/plain/none
> ...
> qemu-system-x86_64: ../util/thread-pool.c:354: thread_pool_set_minmax_threads: Assertion `max_threads > 0' failed.
> Broken pipe
> 

Oops, I should have tested this patch set in setups without any VFIO devices too.

Fixed this now (together with that RAM tracepoint thing) and updated the GitHub tree;
the above test now passes.

Tomorrow I will test the whole multifd VFIO migration once again to be sure.

> $ ./tests/qemu-iotests/check -p -qcow2 068
> ...
> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
> 

I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
somehow after migration_incoming_state_destroy() already destroyed the migration state?
Will investigate this in detail tomorrow.

By the way, this test does not seem to be run by the default "make check".

Thanks,
Maciej
Fabiano Rosas Aug. 29, 2024, 12:51 a.m. UTC | #3
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 28.08.2024 22:46, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>> 
>>> (..)
>> 
>> (..)
>
> (..)
>
>> $ ./tests/qemu-iotests/check -p -qcow2 068
>> ...
>> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
>> 
>
> I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
> somehow after migration_incoming_state_destroy() already destroyed the migration state?
> Will investigate this in detail tomorrow.

Usually something breaks and then the cleanup code rushes in and frees
state while other parts are still using it.

We also had issues recently with code not incrementing the migration
state refcount properly:

27eb8499ed ("migration: Fix use-after-free of migration state object")
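The commit referenced above fixed exactly this class of bug: a user of the migration state object did not take a reference, so the object could be freed while still in use. The general pattern can be sketched as below, with invented names (state_new/state_ref/state_unref and the `freed` flag are for illustration only; QEMU itself reference-counts the migration state through its QOM object):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct State {
    int refcount;
} State;

static int freed;   /* set by the destructor, for demonstration only */

State *state_new(void)
{
    State *s = calloc(1, sizeof(*s));
    s->refcount = 1;   /* creator holds the initial reference */
    return s;
}

void state_ref(State *s)
{
    s->refcount++;
}

void state_unref(State *s)
{
    assert(s->refcount > 0);
    if (--s->refcount == 0) {
        freed = 1;
        free(s);
    }
}
```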

>
> By the way, this test seems to not be run by the default "make check".
>
> Thanks,
> Maciej
Maciej S. Szmigiero Aug. 29, 2024, 8:02 p.m. UTC | #4
On 29.08.2024 02:51, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> On 28.08.2024 22:46, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> (..)
>>>
>>> With this branch I'm getting:
>>>
>>> (..)
>>> $ ./tests/qemu-iotests/check -p -qcow2 068
>>> ...
>>> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
>>>
>>
>> I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
>> somehow after migration_incoming_state_destroy() already destroyed the migration state?
>> Will investigate this in detail tomorrow.
> 
> Usually something breaks and then the clean up code rushes and frees
> state while other parts are still using it.
> 
> We also had issues recently with code not incrementing the migration
> state refcount properly:
> 
> 27eb8499ed ("migration: Fix use-after-free of migration state object")

Looks like MigrationIncomingState is just for "true" incoming migration,
which can be started just once - so it is destroyed after the first
attempt and never reinitialized.

On the other hand, MigrationState is for both true incoming migration
and snapshot load - the latter of which can be started multiple times.

Moved these variables to MigrationState, updated the GitHub tree and now
this test passes.

Thanks,
Maciej
Cédric Le Goater Oct. 11, 2024, 1:58 p.m. UTC | #5
Hello Maciej,

On 8/27/24 19:54, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This is an updated v2 patch series of the v1 series located here:
> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
> 
> Changes from v1:
> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
> implemented automatic memory management support for its work element
> function argument.
> 
> * Introduced a multifd device state save thread pool, ported the VFIO
> multifd device state save implementation to use this thread pool instead
> of VFIO internally managed individual threads.
> 
> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
> 
> * Moved device state related multifd code to new multifd-device-state.c
> file where it made sense.
> 
> * Implemented a max in-flight VFIO device state buffer count limit to
> allow capping the maximum recipient memory usage.
> 
> * Removed unnecessary explicit memory barriers from multifd_send().
> 
> * A few small changes like updated comments, code formatting,
> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
> 
> 
> For convenience, this patch set is also available as a git tree:
> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
> 
> Based-on: <20240823173911.6712-1-farosas@suse.de>


I must admit, I’m a bit lost in all the discussions. Could you please
resend a v3 on top of the master branch, incorporating the points
discussed and agreed upon? Many thanks to Peter for leading the
discussion; his expertise in this area is invaluable.

Please include a summary of the proposed design (and alternatives) in
the cover letter. Diagrams of the communication flows between src/dest
threads would help in understanding the proposal. Such a level of
detail should eventually go under docs/devel/migration, so it would be
good to invest some time in that.

Performance figures would be good to have in the cover letter. The ones
from your presentation at KVM Forum 2024 should be fine unless you have
changed the design since.

From there, we can test and stress the series to evaluate the benefits
of the changes for mlx5 VF and vGPU migration. Once we have the
results, we can determine how to upstream the changes, either all at
once or by splitting the series.


Quoting Peter,

   "I am sorry to ask for this, Fabiano already blames me for this,
   but.. logically it'll be best we use no new thread in the series,
   then one patch on top with your new thread solution to justify its
   performance benefits and worthwhile to having those threads at all."

I fully support this step-by-step approach. VFIO migration is a recent
feature. It can be stressed in a complex environment and is not fully
optimized for certain workloads. However, I would prefer to introduce
changes progressively to ensure stability is maintained. It is now
acceptable to introduce experimental knobs to explore alternative
solutions.

Also, quoting again Peter,

   "PS: I'd suggest if you really need those threads it should still be
    managed by migration framework like the src thread pool. "

yes, I would prefer to see the VFIO subsystem rely on common QEMU APIs
and in this case, a QEMU multifd API too.

Thanks,

C.
Maciej S. Szmigiero Oct. 15, 2024, 9:12 p.m. UTC | #6
Hi Cédric,

On 11.10.2024 15:58, Cédric Le Goater wrote:
> Hello Maciej,
> 
> On 8/27/24 19:54, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> (..)
> 
> 
> I must admit, I’m a bit lost in all the discussions. Could you please
> resend a v3 on top of the master branch, incorporating the points
> discussed and agreed upon? Many thanks to Peter for leading the
> discussion, his expertise in this area is invaluable.
> 
> (..)

I am preparing a v3 right now with the review comments (hopefully)
addressed.

The updated patch set will be based on the current QEMU git master
since I see that Fabiano's patches have been already merged there.

> Thanks,
> 
> C.

Thanks,
Maciej