From patchwork Thu Nov 10 16:48:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13039011 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A9746C433FE for ; Thu, 10 Nov 2022 16:50:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232237AbiKJQuW (ORCPT ); Thu, 10 Nov 2022 11:50:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232511AbiKJQt5 (ORCPT ); Thu, 10 Nov 2022 11:49:57 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1AC056450 for ; Thu, 10 Nov 2022 08:48:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668098898; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=mCeFUvd8Kq8Qt7Mn84nMP6ZZ97jPIWZ4yJQYiSlWipM=; b=VFOj3ozp4aboLBh5Ba55xrsPwzZNKy5ingvVQuhaEMTqG+lJHJYQW7JC4fPP5TN3wJ1OLN OSxAz8I7YfblfCth6bUGVDc1hfUfehevycKr1W8yZC0psSzB+FFnBDEVaqKqU1tMKM0mdP vui1uJmDyHtwTYxEfIXbR8z1ow4+paU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-38-cUdW0m5kOUiIb1_hL1gLog-1; Thu, 10 Nov 2022 11:48:12 -0500 X-MC-Unique: cUdW0m5kOUiIb1_hL1gLog-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id C028887B2AC; Thu, 10 Nov 2022 16:48:09 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8081C40C94AA; Thu, 10 Nov 2022 16:48:09 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Richard Henderson , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, Emanuele Giuseppe Esposito Subject: [PATCH v2 1/3] accel: introduce accelerator blocker API Date: Thu, 10 Nov 2022 11:48:05 -0500 Message-Id: <20221110164807.1306076-2-eesposit@redhat.com> In-Reply-To: <20221110164807.1306076-1-eesposit@redhat.com> References: <20221110164807.1306076-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This API allows the accelerators to prevent vcpus from issuing new ioctls while execting a critical section marked with the accel-ioctl_inhibit_begin/end functions. Note that all functions submitting ioctls must mark where the ioctl is being called with accel_{cpu_}set_in_ioctl(). This API requires the caller to always hold the BQL. API documentation is in sysemu/accel-blocker.h Internally, it uses a QemuLockCnt together with a per-CPU QemuLockCnt (to minimize cache line bouncing) to keep avoid that new ioctls run when the critical section starts, and a QemuEvent to wait that all running ioctls finish. Signed-off-by: Emanuele Giuseppe Esposito --- accel/accel-blocker.c | 139 +++++++++++++++++++++++++++++++++ accel/meson.build | 2 +- include/sysemu/accel-blocker.h | 45 +++++++++++ 3 files changed, 185 insertions(+), 1 deletion(-) create mode 100644 accel/accel-blocker.c create mode 100644 include/sysemu/accel-blocker.h diff --git a/accel/accel-blocker.c b/accel/accel-blocker.c new file mode 100644 index 0000000000..2701a05945 --- /dev/null +++ b/accel/accel-blocker.c @@ -0,0 +1,139 @@ +/* + * QEMU accel blocker class + * + * Copyright (c) 2014 Red Hat Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "qemu/osdep.h" +#include "qemu/thread.h" +#include "qemu/main-loop.h" +#include "hw/core/cpu.h" +#include "sysemu/accel-blocker.h" + +static QemuLockCnt accel_in_ioctl_lock; +static QemuEvent accel_in_ioctl_event; + +void accel_blocker_init(void) +{ + qemu_lockcnt_init(&accel_in_ioctl_lock); + qemu_event_init(&accel_in_ioctl_event, false); +} + +void accel_set_in_ioctl(bool in_ioctl) +{ + if (likely(qemu_mutex_iothread_locked())) { + return; + } + if (in_ioctl) { + /* block if lock is taken in kvm_ioctl_inhibit_begin() */ + qemu_lockcnt_inc(&accel_in_ioctl_lock); + } else { + qemu_lockcnt_dec(&accel_in_ioctl_lock); + /* change event to SET. If event was BUSY, wake up all waiters */ + qemu_event_set(&accel_in_ioctl_event); + } +} + +void accel_cpu_set_in_ioctl(CPUState *cpu, bool in_ioctl) +{ + if (unlikely(qemu_mutex_iothread_locked())) { + return; + } + if (in_ioctl) { + /* block if lock is taken in kvm_ioctl_inhibit_begin() */ + qemu_lockcnt_inc(&cpu->in_ioctl_lock); + } else { + qemu_lockcnt_dec(&cpu->in_ioctl_lock); + /* change event to SET. If event was BUSY, wake up all waiters */ + qemu_event_set(&accel_in_ioctl_event); + } +} + +static int accel_in_ioctls(void) +{ + CPUState *cpu; + int ret = qemu_lockcnt_count(&accel_in_ioctl_lock); + + CPU_FOREACH(cpu) { + ret += qemu_lockcnt_count(&cpu->in_ioctl_lock); + } + + return ret; +} + +void accel_ioctl_inhibit_begin(void) +{ + CPUState *cpu; + + /* + * We allow to inhibit only when holding the BQL, so we can identify + * when an inhibitor wants to issue an ioctl easily. + */ + g_assert(qemu_mutex_iothread_locked()); + + /* Block further invocations of the ioctls outside the BQL. */ + CPU_FOREACH(cpu) { + qemu_lockcnt_lock(&cpu->in_ioctl_lock); + } + qemu_lockcnt_lock(&accel_in_ioctl_lock); + + /* Keep waiting until there are running ioctls */ + while (accel_in_ioctls()) { + /* Reset event to FREE. */ + qemu_event_reset(&accel_in_ioctl_event); + + if (accel_in_ioctls()) { + + CPU_FOREACH(cpu) { + /* exit the ioctl */ + qemu_cpu_kick(cpu); + } + + /* + * If event is still FREE, and there are ioctls still in progress, + * wait. + * + * If an ioctl finishes before qemu_event_wait(), it will change + * the event state to SET. This will prevent qemu_event_wait() from + * blocking, but it's not a problem because if other ioctls are + * still running (accel_in_ioctls is true) the loop will iterate + * once more and reset the event status to FREE so that it can wait + * properly. + * + * If an ioctls finishes while qemu_event_wait() is blocking, then + * it will be waken up, but also here the while loop makes sure + * to re-enter the wait if there are other running ioctls. + */ + qemu_event_wait(&accel_in_ioctl_event); + } + } +} + +void accel_ioctl_inhibit_end(void) +{ + CPUState *cpu; + + qemu_lockcnt_unlock(&accel_in_ioctl_lock); + CPU_FOREACH(cpu) { + qemu_lockcnt_unlock(&cpu->in_ioctl_lock); + } +} + diff --git a/accel/meson.build b/accel/meson.build index b9a963cf80..a0d49c4f31 100644 --- a/accel/meson.build +++ b/accel/meson.build @@ -1,4 +1,4 @@ -specific_ss.add(files('accel-common.c')) +specific_ss.add(files('accel-common.c', 'accel-blocker.c')) softmmu_ss.add(files('accel-softmmu.c')) user_ss.add(files('accel-user.c')) diff --git a/include/sysemu/accel-blocker.h b/include/sysemu/accel-blocker.h new file mode 100644 index 0000000000..135ebea566 --- /dev/null +++ b/include/sysemu/accel-blocker.h @@ -0,0 +1,45 @@ +/* + * Accelerator blocking API, to prevent new ioctls from starting and wait the + * running ones finish. + * This mechanism differs from pause/resume_all_vcpus() in that it does not + * release the BQL. + * + * Copyright (c) 2014 Red Hat Inc. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ +#ifndef ACCEL_BLOCKER_H +#define ACCEL_BLOCKER_H + +#include "qemu/osdep.h" +#include "qemu/accel.h" +#include "sysemu/cpus.h" + +extern void accel_blocker_init(void); + +/* + * accel_set_in_ioctl/accel_cpu_set_in_ioctl: + * Mark when ioctl is about to run or just finished. + * If @in_ioctl is true, then mark it is beginning. Otherwise marks that it is + * ending. + * + * These functions will block after accel_ioctl_inhibit_begin() is called, + * preventing new ioctls to run. They will continue only after + * accel_ioctl_inibith_end(). + */ +extern void accel_set_in_ioctl(bool in_ioctl); +extern void accel_cpu_set_in_ioctl(CPUState *cpu, bool in_ioctl); + +/* + * accel_ioctl_inhibit_begin/end: start/end critical section + * Between these two calls, no ioctl marked with accel_set_in_ioctl() and + * accel_cpu_set_in_ioctl() is allowed to run. + * + * This allows the caller to access shared data or perform operations without + * worrying of concurrent vcpus accesses. + */ +extern void accel_ioctl_inhibit_begin(void); +extern void accel_ioctl_inhibit_end(void); + +#endif /* ACCEL_BLOCKER_H */ From patchwork Thu Nov 10 16:48:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13039009 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7CD7C4332F for ; Thu, 10 Nov 2022 16:50:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232711AbiKJQuT (ORCPT ); Thu, 10 Nov 2022 11:50:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49494 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232210AbiKJQtw (ORCPT ); Thu, 10 Nov 2022 11:49:52 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 16D662DC9 for ; Thu, 10 Nov 2022 08:48:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668098897; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=mJciTuRnRb4l+sIQoG140nyybXVYv2g1v3IFlBtCDkY=; b=b7dFXW/vhwBoH5JvYEG4SBFwPDZ1EOZN/e3SbFjekO2CrtjKuTUMeReRKBbtcVW3NRn1Gl oX5FDYFg6AN1rKdEbcWRPgbMvZvYErZOPAV2PUY8Uavb4Eq2X+YXwwYClq7H1HzUn+ZmAc /na0PeQ+jzT1YhnKscSDEO2HisbzyEA= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-163-rNaO_w88PSayqrpT6FtzBg-1; Thu, 10 Nov 2022 11:48:13 -0500 X-MC-Unique: rNaO_w88PSayqrpT6FtzBg-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1B21D101A528; Thu, 10 Nov 2022 16:48:10 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id C92B640C6F73; Thu, 10 Nov 2022 16:48:09 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Richard Henderson , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, Emanuele Giuseppe Esposito , David Hildenbrand Subject: [PATCH v2 2/3] KVM: keep track of running ioctls Date: Thu, 10 Nov 2022 11:48:06 -0500 Message-Id: <20221110164807.1306076-3-eesposit@redhat.com> In-Reply-To: <20221110164807.1306076-1-eesposit@redhat.com> References: <20221110164807.1306076-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Using the new accel-blocker API, mark where ioctls are being called in KVM. Next, we will implement the critical section that will take care of performing memslots modifications atomically, therefore preventing any new ioctl from running and allowing the running ones to finish. Signed-off-by: David Hildenbrand Signed-off-by: Emanuele Giuseppe Esposito --- accel/kvm/kvm-all.c | 7 +++++++ hw/core/cpu-common.c | 2 ++ include/hw/core/cpu.h | 3 +++ 3 files changed, 12 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index f99b0becd8..dfc6fe76db 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms) assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size()); s->sigmask_len = 8; + accel_blocker_init(); #ifdef KVM_CAP_SET_GUEST_DEBUG QTAILQ_INIT(&s->kvm_sw_breakpoints); @@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...) va_end(ap); trace_kvm_vm_ioctl(type, arg); + accel_set_in_ioctl(true); ret = ioctl(s->vmfd, type, arg); + accel_set_in_ioctl(false); if (ret == -1) { ret = -errno; } @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type, ...) va_end(ap); trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg); + accel_cpu_set_in_ioctl(cpu, true); ret = ioctl(cpu->kvm_fd, type, arg); + accel_cpu_set_in_ioctl(cpu, false); if (ret == -1) { ret = -errno; } @@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...) va_end(ap); trace_kvm_device_ioctl(fd, type, arg); + accel_set_in_ioctl(true); ret = ioctl(fd, type, arg); + accel_set_in_ioctl(false); if (ret == -1) { ret = -errno; } diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c index f9fdd46b9d..8d6a4b1b65 100644 --- a/hw/core/cpu-common.c +++ b/hw/core/cpu-common.c @@ -237,6 +237,7 @@ static void cpu_common_initfn(Object *obj) cpu->nr_threads = 1; qemu_mutex_init(&cpu->work_mutex); + qemu_lockcnt_init(&cpu->in_ioctl_lock); QSIMPLEQ_INIT(&cpu->work_list); QTAILQ_INIT(&cpu->breakpoints); QTAILQ_INIT(&cpu->watchpoints); @@ -248,6 +249,7 @@ static void cpu_common_finalize(Object *obj) { CPUState *cpu = CPU(obj); + qemu_lockcnt_destroy(&cpu->in_ioctl_lock); qemu_mutex_destroy(&cpu->work_mutex); } diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index f9b58773f7..15053663bc 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -397,6 +397,9 @@ struct CPUState { uint32_t kvm_fetch_index; uint64_t dirty_pages; + /* Use by accel-block: CPU is executing an ioctl() */ + QemuLockCnt in_ioctl_lock; + /* Used for events with 'vcpu' and *without* the 'disabled' properties */ DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS); DECLARE_BITMAP(trace_dstate, CPU_TRACE_DSTATE_MAX_EVENTS); From patchwork Thu Nov 10 16:48:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13039010 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8A867C4332F for ; Thu, 10 Nov 2022 16:50:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232735AbiKJQuV (ORCPT ); Thu, 10 Nov 2022 11:50:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49520 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232096AbiKJQty (ORCPT ); Thu, 10 Nov 2022 11:49:54 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D6F9BC8B for ; Thu, 10 Nov 2022 08:48:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668098898; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FnmIzFs7Npf3HFGOMIuNdk5GFtHnhbFpd0CPbW0qThc=; b=Qf6DgnsguQzmFlQGWD8Gvr4TRAhqWdNumsPJKcgvkYevynxOfdbDrN/0X6uK14cowYongL Nx6dYfILYj2G+ASIRi5zGurhaDWFug61uN+QJA7vh3wpg9lv45eF0C3wrZ+O/HqeZtFz4S xe1hf8N8oPBhdNpSdC1p2WSVnD0vwnw= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-411-W8_bNZVPNjeVGdN0mETRxQ-1; Thu, 10 Nov 2022 11:48:12 -0500 X-MC-Unique: W8_bNZVPNjeVGdN0mETRxQ-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 6E40E85A59D; Thu, 10 Nov 2022 16:48:10 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id 24F5640C94AA; Thu, 10 Nov 2022 16:48:10 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Richard Henderson , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, David Hildenbrand , Emanuele Giuseppe Esposito Subject: [PATCH v2 3/3] kvm: Atomic memslot updates Date: Thu, 10 Nov 2022 11:48:07 -0500 Message-Id: <20221110164807.1306076-4-eesposit@redhat.com> In-Reply-To: <20221110164807.1306076-1-eesposit@redhat.com> References: <20221110164807.1306076-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Hildenbrand If we update an existing memslot (e.g., resize, split), we temporarily remove the memslot to re-add it immediately afterwards. These updates are not atomic, especially not for KVM VCPU threads, such that we can get spurious faults. Let's inhibit most KVM ioctls while performing relevant updates, such that we can perform the update just as if it would happen atomically without additional kernel support. We capture the add/del changes and apply them in the notifier commit stage instead. There, we can check for overlaps and perform the ioctl inhibiting only if really required (-> overlap). To keep things simple we don't perform additional checks that wouldn't actually result in an overlap -- such as !RAM memory regions in some cases (see kvm_set_phys_mem()). To minimize cache-line bouncing, use a separate indicator (in_ioctl_lock) per CPU. Also, make sure to hold the kvm_slots_lock while performing both actions (removing+re-adding). We have to wait until all IOCTLs were exited and block new ones from getting executed. This approach cannot result in a deadlock as long as the inhibitor does not hold any locks that might hinder an IOCTL from getting finished and exited - something fairly unusual. The inhibitor will always hold the BQL. AFAIKs, one possible candidate would be userfaultfd. If a page cannot be placed (e.g., during postcopy), because we're waiting for a lock, or if the userfaultfd thread cannot process a fault, because it is waiting for a lock, there could be a deadlock. However, the BQL is not applicable here, because any other guest memory access while holding the BQL would already result in a deadlock. Nothing else in the kernel should block forever and wait for userspace intervention. Note: pause_all_vcpus()/resume_all_vcpus() or start_exclusive()/end_exclusive() cannot be used, as they either drop the BQL or require to be called without the BQL - something inhibitors cannot handle. We need a low-level locking mechanism that is deadlock-free even when not releasing the BQL. Signed-off-by: David Hildenbrand Signed-off-by: Emanuele Giuseppe Esposito Tested-by: Emanuele Giuseppe Esposito --- accel/kvm/kvm-all.c | 100 ++++++++++++++++++++++++++++++++++----- include/sysemu/kvm_int.h | 8 ++++ 2 files changed, 97 insertions(+), 11 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index dfc6fe76db..ec478dbe69 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -31,6 +31,7 @@ #include "sysemu/kvm_int.h" #include "sysemu/runstate.h" #include "sysemu/cpus.h" +#include "sysemu/accel-blocker.h" #include "qemu/bswap.h" #include "exec/memory.h" #include "exec/ram_addr.h" @@ -46,6 +47,7 @@ #include "sysemu/hw_accel.h" #include "kvm-cpus.h" #include "sysemu/dirtylimit.h" +#include "qemu/range.h" #include "hw/boards.h" #include "monitor/stats.h" @@ -1292,6 +1294,7 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size) kvm_max_slot_size = max_slot_size; } +/* Called with KVMMemoryListener.slots_lock held */ static void kvm_set_phys_mem(KVMMemoryListener *kml, MemoryRegionSection *section, bool add) { @@ -1326,14 +1329,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, ram = memory_region_get_ram_ptr(mr) + mr_offset; ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset; - kvm_slots_lock(); - if (!add) { do { slot_size = MIN(kvm_max_slot_size, size); mem = kvm_lookup_matching_slot(kml, start_addr, slot_size); if (!mem) { - goto out; + return; } if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) { /* @@ -1371,7 +1372,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, start_addr += slot_size; size -= slot_size; } while (size); - goto out; + return; } /* register the new slot */ @@ -1396,9 +1397,6 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, ram += slot_size; size -= slot_size; } while (size); - -out: - kvm_slots_unlock(); } static void *kvm_dirty_ring_reaper_thread(void *data) @@ -1455,18 +1453,94 @@ static void kvm_region_add(MemoryListener *listener, MemoryRegionSection *section) { KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener); + KVMMemoryUpdate *update; + + update = g_new0(KVMMemoryUpdate, 1); + update->section = memory_region_section_new_copy(section); - memory_region_ref(section->mr); - kvm_set_phys_mem(kml, section, true); + QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next); } static void kvm_region_del(MemoryListener *listener, MemoryRegionSection *section) { KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener); + KVMMemoryUpdate *update; - kvm_set_phys_mem(kml, section, false); - memory_region_unref(section->mr); + update = g_new0(KVMMemoryUpdate, 1); + update->section = memory_region_section_new_copy(section); + + QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next); +} + +static void kvm_region_commit(MemoryListener *listener) +{ + KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, + listener); + KVMMemoryUpdate *u1, *u2; + bool need_inhibit = false; + + if (QSIMPLEQ_EMPTY(&kml->transaction_add) && + QSIMPLEQ_EMPTY(&kml->transaction_del)) { + return; + } + + /* + * We have to be careful when regions to add overlap with ranges to remove. + * We have to simulate atomic KVM memslot updates by making sure no ioctl() + * is currently active. + * + * The lists are order by addresses, so it's easy to find overlaps. + */ + u1 = QSIMPLEQ_FIRST(&kml->transaction_del); + u2 = QSIMPLEQ_FIRST(&kml->transaction_add); + while (u1 && u2) { + Range r1, r2; + + range_init_nofail(&r1, u1->section->offset_within_address_space, + int128_get64(u1->section->size)); + range_init_nofail(&r2, u2->section->offset_within_address_space, + int128_get64(u2->section->size)); + + if (range_overlaps_range(&r1, &r2)) { + need_inhibit = true; + break; + } + if (range_lob(&r1) < range_lob(&r2)) { + u1 = QSIMPLEQ_NEXT(u1, next); + } else { + u2 = QSIMPLEQ_NEXT(u2, next); + } + } + + + kvm_slots_lock(); + if (need_inhibit) { + accel_ioctl_inhibit_begin(); + } + + /* Remove all memslots before adding the new ones. */ + QSIMPLEQ_FOREACH_SAFE(u1, &kml->transaction_del, next, u2) { + kvm_set_phys_mem(kml, u1->section, false); + memory_region_unref(u1->section->mr); + + QSIMPLEQ_REMOVE(&kml->transaction_del, u1, KVMMemoryUpdate, next); + memory_region_section_free_copy(u1->section); + g_free(u1); + } + QSIMPLEQ_FOREACH_SAFE(u1, &kml->transaction_add, next, u2) { + memory_region_ref(u1->section->mr); + kvm_set_phys_mem(kml, u1->section, true); + + QSIMPLEQ_REMOVE(&kml->transaction_add, u1, KVMMemoryUpdate, next); + memory_region_section_free_copy(u1->section); + g_free(u1); + } + + if (need_inhibit) { + accel_ioctl_inhibit_end(); + } + kvm_slots_unlock(); } static void kvm_log_sync(MemoryListener *listener, @@ -1610,8 +1684,12 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml, kml->slots[i].slot = i; } + QSIMPLEQ_INIT(&kml->transaction_add); + QSIMPLEQ_INIT(&kml->transaction_del); + kml->listener.region_add = kvm_region_add; kml->listener.region_del = kvm_region_del; + kml->listener.commit = kvm_region_commit; kml->listener.log_start = kvm_log_start; kml->listener.log_stop = kvm_log_stop; kml->listener.priority = 10; diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h index 3b4adcdc10..5ea2d7924b 100644 --- a/include/sysemu/kvm_int.h +++ b/include/sysemu/kvm_int.h @@ -12,6 +12,7 @@ #include "exec/memory.h" #include "qapi/qapi-types-common.h" #include "qemu/accel.h" +#include "qemu/queue.h" #include "sysemu/kvm.h" typedef struct KVMSlot @@ -31,10 +32,17 @@ typedef struct KVMSlot ram_addr_t ram_start_offset; } KVMSlot; +typedef struct KVMMemoryUpdate { + QSIMPLEQ_ENTRY(KVMMemoryUpdate) next; + MemoryRegionSection *section; +} KVMMemoryUpdate; + typedef struct KVMMemoryListener { MemoryListener listener; KVMSlot *slots; int as_id; + QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add; + QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del; } KVMMemoryListener; #define KVM_MSI_HASHTAB_SIZE 256