From patchwork Thu Mar 20 01:55:35 2025
X-Patchwork-Submitter: Changyuan Lyu
X-Patchwork-Id: 14023300
Date: Wed, 19 Mar 2025 18:55:35 -0700
Subject: [PATCH v5 00/16] kexec: introduce Kexec HandOver (KHO)
From: Changyuan Lyu

Hi,

This is the next version of Alexander Graf and Mike Rapoport's
"kexec: introduce Kexec HandOver (KHO)" series, with bitmaps for
preserving folios and address ranges, a new hashtable-based KHO state
tree API, and a reduced blackout window with kexec_file_load.

The patches are also available in git:

v4 -> v5:
  - New: Preserve folios and address ranges in bitmaps [1].
    Removed the `mem` property.
  - New: Hash table based API for manipulating the KHO state tree.
  - Change the concept of "active phase" to "finalization phase". KHO
    users can add/remove data into/from KHO DT anytime before the
    finalization phase.
  - Decouple kexec_file_load and KHO FDT creation. kexec_file_load can be
    done before KHO FDT is created.
  - Update the example usecase (reserve_mem) using the new KHO API,
    replace underscores with dashes in reserve-mem fdt generation.
  - Drop the YAMLs for now and add a brief description of KHO FDT before
    KHO schema is stable.
  - Move all sysfs interfaces to debugfs.
  - Fixed the memblock test reported in [2].
  - Incorporate fix for kho_locate_mem_hole() with
    !CONFIG_KEXEC_HANDOVER [3] into "kexec: Add KHO support to kexec file
    loads".

[1]
[2]
[3]

= Original cover letter =

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers
Conference 2023 presentation for details:

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
memblock's reserve_mem. With this patch set applied, memory that was
reserved using "reserve_mem" command line options remains intact after
kexec and it is guaranteed to reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
    pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
    address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
    specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. KHO is enabled on the kernel command line with `kho=on`. Drivers
can add data to, or remove data from, the KHO root state tree hash table
at any time. When the hash table is converted to FDT (automatically by the
kernel in the case of kexec_file_load, or manually by userspace through
debugfs kho/out/finalize), the kernel invokes callbacks to every driver
that supports KHO to serialize its state. When the actual kexec happens,
the fdt is part of the image set that we boot into. In addition, we keep
"scratch regions" available for kexec: physically contiguous memory
regions that are guaranteed to not have any memory that KHO would
preserve.

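The finalization flow described above can be illustrated with a small
user-space sketch. This is not the API of this series: the names
kho_sketch_register(), kho_sketch_finalize(), struct kho_sketch_state and
reserve_mem_sketch_serialize() are hypothetical, a flat string buffer
stands in for the FDT, and the region values are made up. It only shows
the ordering: users register a serialize callback at any time before
finalization, and finalization invokes each callback once to build the
blob that would be handed to the next kernel.

```c
#include <stdio.h>
#include <string.h>

#define KHO_SKETCH_MAX_USERS 8
#define KHO_SKETCH_BLOB_SIZE 256

/* Hypothetical callback type: write this user's state into buf. */
typedef int (*kho_sketch_serialize_fn)(char *buf, size_t len);

struct kho_sketch_state {
	kho_sketch_serialize_fn users[KHO_SKETCH_MAX_USERS];
	size_t nr_users;
	int finalized;                   /* set once the blob is built */
	char blob[KHO_SKETCH_BLOB_SIZE]; /* stand-in for the KHO FDT */
};

/* Register a user; only allowed before finalization (sketch assumption). */
static int kho_sketch_register(struct kho_sketch_state *st,
			       kho_sketch_serialize_fn fn)
{
	if (st->finalized || st->nr_users == KHO_SKETCH_MAX_USERS)
		return -1;
	st->users[st->nr_users++] = fn;
	return 0;
}

/* Finalize: invoke every registered callback to serialize its state. */
static int kho_sketch_finalize(struct kho_sketch_state *st)
{
	size_t used = 0;

	if (st->finalized)
		return -1;
	st->blob[0] = '\0';
	for (size_t i = 0; i < st->nr_users; i++) {
		size_t room = sizeof(st->blob) - used;
		int n = st->users[i](st->blob + used, room);

		if (n < 0 || (size_t)n >= room)
			return -1; /* callback failed or blob too small */
		used += (size_t)n;
	}
	st->finalized = 1;
	return 0;
}

/* Example user: records one made-up reserve_mem region. */
static int reserve_mem_sketch_serialize(char *buf, size_t len)
{
	return snprintf(buf, len, "reserve-mem{start=0x%llx,size=0x%llx};",
			0x100000000ULL, 0x1000000ULL);
}
```

A caller would zero-initialize a struct kho_sketch_state, register
reserve_mem_sketch_serialize, and call kho_sketch_finalize() once before
the kexec. Refusing registration after finalization is a choice of this
sketch that mirrors the "finalization phase" wording in the v5 changelog.
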
The new kernel bootstraps itself using the scratch regions and sets all
handed over memory as in use. When drivers that support KHO initialize,
they introspect the fdt and recover their state from it. This includes
memory reservations, where the driver can either discard or claim
reservations.

== Limitations ==

Currently KHO is only implemented for file based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented in kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter. KHO will automatically create scratch regions. If you want to
set the scratch size explicitly you can use the "kho_scratch=" command
line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a
16 MiB low memory scratch area, a 512 MiB global scratch region, and
256 MiB per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with the reserve_mem
command line option. Then you invoke file based "kexec -l",

  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.

Optionally, you can finalize the KHO FDT early by

  # echo 1 > /sys/kernel/debug/kho/out/finalize

which allows you to preview the FDT to be passed to the next kernel at
/sys/kernel/debug/kho/out/fdt.

== Changelog ==

v3 -> v4:
  - Major rework of scratch management. Rather than force scratch memory
    allocations only very early in boot now we rely on scratch for all
    memblock allocations.
  - Use simple example usecase (reserve_mem instead of ftrace)
  - merge all KHO functionality into a single kernel/kexec_handover.c file
  - rename CONFIG_KEXEC_KHO to CONFIG_KEXEC_HANDOVER

v2 -> v3:
  - Fix make dt_binding_check
  - Add descriptions for each object
  - s/trace_flags/trace-flags/
  - s/global_trace/global-trace/
  - Make all additionalProperties false
  - Change subject to reflect subsystem (dt-bindings)
  - Fix indentation
  - Remove superfluous examples
  - Convert to 64bit syntax
  - Move to kho directory
  - s/"global_trace"/"global-trace"/
  - s/"global_trace"/"global-trace"/
  - s/"trace_flags"/"trace-flags"/
  - Fix wording
  - Add Documentation to MAINTAINERS file
  - Remove kho reference on read error
  - Move handover_dt unmap up
  - s/reserve_scratch_mem/mark_phys_as_cma/
  - Remove ifdeffery
  - Remove superfluous comment

v1 -> v2:
  - Removed: tracing: Introduce names for ring buffers
  - Removed: tracing: Introduce names for events
  - New: kexec: Add config option for KHO
  - New: kexec: Add documentation for KHO
  - New: tracing: Initialize fields before registering
  - New: devicetree: Add bindings for ftrace KHO
  - test bot warning fixes
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Remove / reduce ifdefs
  - Select crc32
  - Leave anything that requires a name in trace.c to keep buffers
    unnamed entities
  - Put events as array into a property, use fingerprint instead of
    names to identify them
  - Reduce footprint without CONFIG_FTRACE_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - make kho_get_fdt() const
  - Add stubs for return_mem and claim_mem
  - make kho_get_fdt() const
  - Get events as array from a property, use fingerprint instead of
    names to identify events
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Leave the node generation code that needs to know the name in
    trace.c so that ring buffers can stay anonymous
  - s/kho_reserve/kho_reserve_scratch/g
  - Move kho enums out of ifdef
  - Move from names to fdt offsets. That way, trace.c can find the trace
    array offset and then the ring buffer code only needs to read out
    its per-CPU data. That way it can stay oblivious to its name.
  - Make kho_get_fdt() const

Alexander Graf (9):
  memblock: Add support for scratch memory
  kexec: add Kexec HandOver (KHO) generation helpers
  kexec: add KHO parsing support
  kexec: add KHO support to kexec file loads
  kexec: add config option for KHO
  arm64: add KHO support
  x86: add KHO support
  memblock: add KHO support for reserve_mem
  Documentation: add documentation for KHO

Changyuan Lyu (1):
  hashtable: add macro HASHTABLE_INIT

Mike Rapoport (Microsoft) (5):
  mm/mm_init: rename init_reserved_page to init_deferred_page
  memblock: add MEMBLOCK_RSRV_KERN flag
  memblock: introduce memmap_init_kho_scratch()
  kexec: enable KHO support for memory preservation
  x86/setup: use memblock_reserve_kern for memory used by kernel

steven chen (1):
  kexec: define functions to map and unmap segments

 .../admin-guide/kernel-parameters.txt        |   25 +
 Documentation/kho/concepts.rst               |   70 +
 Documentation/kho/fdt.rst                    |   62 +
 Documentation/kho/index.rst                  |   14 +
 Documentation/kho/usage.rst                  |  118 ++
 Documentation/subsystem-apis.rst             |    1 +
 MAINTAINERS                                  |    3 +-
 arch/arm64/Kconfig                           |    3 +
 arch/x86/Kconfig                             |    3 +
 arch/x86/boot/compressed/kaslr.c             |   52 +-
 arch/x86/include/asm/setup.h                 |    4 +
 arch/x86/include/uapi/asm/setup_data.h       |   13 +-
 arch/x86/kernel/e820.c                       |   18 +
 arch/x86/kernel/kexec-bzimage64.c            |   36 +
 arch/x86/kernel/setup.c                      |   41 +-
 arch/x86/realmode/init.c                     |    2 +
 drivers/of/fdt.c                             |   33 +
 drivers/of/kexec.c                           |   37 +
 include/linux/hashtable.h                    |    7 +-
 include/linux/kexec.h                        |   12 +
 include/linux/kexec_handover.h               |  202 ++
 include/linux/memblock.h                     |   41 +-
 kernel/Kconfig.kexec                         |   15 +
 kernel/Makefile                              |    1 +
 kernel/kexec_core.c                          |   58 +
 kernel/kexec_file.c                          |   19 +
 kernel/kexec_handover.c                      | 1755 +++++++++++++++++
 kernel/kexec_internal.h                      |   18 +
 mm/Kconfig                                   |    4 +
 mm/internal.h                                |    2 +
 mm/memblock.c                                |  303 ++-
 mm/mm_init.c                                 |   19 +-
 tools/testing/memblock/tests/alloc_api.c     |   22 +-
 .../memblock/tests/alloc_helpers_api.c       |    4 +-
 tools/testing/memblock/tests/alloc_nid_api.c |   20 +-
 35 files changed, 2988 insertions(+), 49 deletions(-)
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/fdt.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst
 create mode 100644 include/linux/kexec_handover.h
 create mode 100644 kernel/kexec_handover.c

base-commit: a7f2e10ecd8f18b83951b0bab47ddaf48f93bf47
---