From patchwork Fri Dec 15 17:28:18 2023
X-Patchwork-Submitter: Eugenio Perez Martin
X-Patchwork-Id: 13494743
From: Eugenio Pérez <eperezma@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Michael S. Tsirkin", si-wei.liu@oracle.com, Lei Yang, Jason Wang,
    Dragos Tatulea, Zhu Lingshan, Parav Pandit, Stefano Garzarella,
    Laurent Vivier
Subject: [PATCH for 9.0 00/12] Map memory at destination .load_setup in
    vDPA-net migration
Date: Fri, 15 Dec 2023 18:28:18 +0100
Message-Id: <20231215172830.2540987-1-eperezma@redhat.com>
Tsirkin" , si-wei.liu@oracle.com, Lei Yang , Jason Wang , Dragos Tatulea , Zhu Lingshan , Parav Pandit , Stefano Garzarella , Laurent Vivier Subject: [PATCH for 9.0 00/12] Map memory at destination .load_setup in vDPA-net migration Date: Fri, 15 Dec 2023 18:28:18 +0100 Message-Id: <20231215172830.2540987-1-eperezma@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Received-SPF: pass client-ip=170.10.133.124; envelope-from=eperezma@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neigher traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment the migration starts. Moving that operation allows QEMU to communicate the kernel the maps while the workload is still running in the source, so Linux can start mapping them. Also, the destination of the guest memory may finish before the destination QEMU maps all the memory. In this case, the rest of the memory will be mapped at the same time as before applying this series, when the device is starting. So we're only improving with this series. If the destination has the switchover_ack capability enabled, the destination hold the migration until all the memory is mapped. This needs to be applied on top of [1]. That series performs some code reorganization that allows to map the guest memory without knowing the queue layout the guest configure on the device. This series reduced the downtime in the stop-and-copy phase of the live migration from 20s~30s to 5s, with a 128G mem guest and two mlx5_vdpa devices, per [2]. Future directions on top of this series may include: * Iterative migration of virtio-net devices, as it may reduce downtime per [3]. vhost-vdpa net can apply the configuration through CVQ in the destination while the source is still migrating. * Move more things ahead of migration time, like DRIVER_OK. * Check that the devices of the destination are valid, and cancel the migration in case it is not. v1 from RFC v2: * Hold on migration if memory has not been mapped in full with switchover_ack. * Revert map if the device is not started. RFC v2: * Delegate map to another thread so it does no block QMP. * Fix not allocating iova_tree if x-svq=on at the destination. * Rebased on latest master. * More cleanups of current code, that might be split from this series too. 
[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01986.html
[2] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
[3] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/

Eugenio Pérez (12):
  vdpa: do not set virtio status bits if unneeded
  vdpa: make batch_begin_once early return
  vdpa: merge _begin_batch into _batch_begin_once
  vdpa: extract out _dma_end_batch from _listener_commit
  vdpa: factor out stop path of vhost_vdpa_dev_start
  vdpa: check for iova tree initialized at net_client_start
  vdpa: set backend capabilities at vhost_vdpa_init
  vdpa: add vhost_vdpa_load_setup
  vdpa: approve switchover after memory map in the migration destination
  vdpa: add vhost_vdpa_net_load_setup NetClient callback
  vdpa: add vhost_vdpa_net_switchover_ack_needed
  virtio_net: register incremental migration handlers

 include/hw/virtio/vhost-vdpa.h |  32 ++++
 include/net/net.h              |   8 +
 hw/net/virtio-net.c            |  48 ++++++
 hw/virtio/vhost-vdpa.c         | 274 +++++++++++++++++++++++++++------
 net/vhost-vdpa.c               |  43 +++++-
 5 files changed, 357 insertions(+), 48 deletions(-)

Tested-by: Lei Yang