From patchwork Thu Apr 27 00:08:36 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Anthony Yznaga
X-Patchwork-Id: 13225044
From: Anthony Yznaga
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org,
    hpa@zytor.com, dave.hansen@linux.intel.com, luto@kernel.org,
    peterz@infradead.org, rppt@kernel.org, akpm@linux-foundation.org,
    ebiederm@xmission.com, keescook@chromium.org, graf@amazon.com,
    jason.zeng@intel.com, lei.l.li@intel.com, steven.sistare@oracle.com,
    fam.zheng@bytedance.com, mgalaxy@akamai.com, kexec@lists.infradead.org
Subject: [RFC v3 00/21] Preserved-over-Kexec RAM
Date: Wed, 26 Apr 2023 17:08:36 -0700
Message-Id: <1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com>
X-Mailer: git-send-email 1.9.4

Sending out this RFC in part to gauge community interest.

This patchset implements preserved-over-kexec memory storage or PKRAM as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1].
They introduce the PKRAM kernel API.

One use case for PKRAM is preserving guest memory and/or auxiliary
supporting data (e.g. iommu data) across kexec to support reboot of the
host with minimal disruption to the guest. PKRAM provides a flexible way
of doing this without requiring that the amount of memory used be a fixed
size created a priori. Another use case is for databases to preserve their
block caches in shared memory across reboot.

Changes since RFC v2
  - Rebased onto 6.3
  - Updated API to save/load folios rather than file pages
  - Omitted previous patches for implementing and optimizing preservation
    and restoration of shmem files to reduce the number of patches and
    focus on core functionality.

Changes since RFC v1
  - Rebased onto 5.12-rc4
  - Refined the API to reduce the number of calls and better support
    multithreading.
  - Allow preserving byte data of arbitrary length (was previously limited
    to one page).
  - Build a new memblock reserved list with the preserved ranges and then
    substitute it for the existing one. (Mike Rapoport)
  - Use mem_avoid_overlap() to avoid kaslr stepping on preserved ranges.
    (Kees Cook)

-- Implementation details --

* To aid in quickly finding contiguous ranges of memory containing
  preserved pages, a pseudo physical mapping pagetable is populated with
  pages as they are preserved.

* If a page to be preserved is found to be in a range of memory that was
  previously reserved during early boot, or in a range of memory where the
  kernel will be loaded on kexec, the page will be copied to a page
  outside of those ranges and the new page will be preserved. A compound
  page will be copied to and preserved as individual base pages.
  Note that this means that a page that cannot be moved (e.g. pinned for
  DMA) currently cannot safely be preserved. This could be addressed by
  adding functionality to kexec to reconfigure the destination addresses
  for the sections of an already-loaded kexec kernel.

* A single page is allocated for the PKRAM super block.
  For the next kernel kexec boot to find preserved memory metadata, the
  pfn of the PKRAM super block, which is exported via /sys/kernel/pkram,
  is passed in the 'pkram' boot option.

* In the newly booted kernel, PKRAM adds all preserved pages to the
  memblock reserve list during early boot so that they will not be
  recycled.

* Since kexec may load the new kernel code to any memory region, it could
  destroy preserved memory. When the kernel selects the memory region
  (kexec_file_load syscall), kexec will avoid preserved pages. When the
  user selects the kexec memory region to use (kexec_load syscall), kexec
  load will fail if there is a conflict with preserved pages. Pages
  preserved after a kexec kernel is loaded will be relocated if they
  conflict with the selected memory region.

[1] https://lkml.org/lkml/2013/7/1/211

Anthony Yznaga (21):
  mm: add PKRAM API stubs and Kconfig
  mm: PKRAM: implement node load and save functions
  mm: PKRAM: implement object load and save functions
  mm: PKRAM: implement folio stream operations
  mm: PKRAM: implement byte stream operations
  mm: PKRAM: link nodes by pfn before reboot
  mm: PKRAM: introduce super block
  PKRAM: track preserved pages in a physical mapping pagetable
  PKRAM: pass a list of preserved ranges to the next kernel
  PKRAM: prepare for adding preserved ranges to memblock reserved
  mm: PKRAM: reserve preserved memory at boot
  PKRAM: free the preserved ranges list
  PKRAM: prevent inadvertent use of a stale superblock
  PKRAM: provide a way to ban pages from use by PKRAM
  kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  PKRAM: provide a way to check if a memory range has preserved pages
  kexec: PKRAM: avoid clobbering already preserved pages
  mm: PKRAM: allow preserved memory to be freed from userspace
  PKRAM: disable feature when running the kdump kernel
  x86/KASLR: PKRAM: support physical kaslr
  x86/boot/compressed/64: use 1GB pages for mappings

 arch/x86/boot/compressed/Makefile       |    3 +
 arch/x86/boot/compressed/ident_map_64.c |    9 +-
 arch/x86/boot/compressed/kaslr.c        |   10 +-
 arch/x86/boot/compressed/misc.h         |   10 +
 arch/x86/boot/compressed/pkram.c        |  110 ++
 arch/x86/kernel/setup.c                 |    3 +
 arch/x86/mm/init_64.c                   |    3 +
 include/linux/pkram.h                   |  116 ++
 kernel/kexec.c                          |    9 +
 kernel/kexec_core.c                     |    3 +
 kernel/kexec_file.c                     |   15 +
 mm/Kconfig                              |    9 +
 mm/Makefile                             |    2 +
 mm/pkram.c                              | 1753 +++++++++++++++++++++++++++++++
 mm/pkram_pagetable.c                    |  375 +++++++
 15 files changed, 2424 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c
 create mode 100644 mm/pkram_pagetable.c
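
For illustration only (not code from this series): in the kexec_load case
described in the implementation details, the load fails if the
user-selected region conflicts with preserved pages. At its core that is
an overlap test between the requested segment and the preserved ranges.
Below is a minimal userspace sketch, assuming the preserved ranges are
available as a flat array of inclusive pfn ranges. struct pfn_range,
ranges_overlap() and segment_hits_preserved() are hypothetical names chosen
for this example, and the series itself tracks preserved pages in a
pagetable rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type: an inclusive range of page frame numbers. */
struct pfn_range {
	unsigned long start;	/* first pfn in the range */
	unsigned long end;	/* last pfn in the range (inclusive) */
};

/* Two inclusive ranges overlap when each starts at or before the other's end. */
static bool ranges_overlap(const struct pfn_range *a, const struct pfn_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * Hypothetical helper: return true if the candidate kexec segment touches
 * any of the n preserved ranges. A linear scan is used for clarity; it is
 * an assumption of this sketch, not how the patches do the lookup.
 */
static bool segment_hits_preserved(const struct pfn_range *seg,
				   const struct pfn_range *preserved, size_t n)
{
	size_t i;

	assert(seg->start <= seg->end);	/* reject malformed input early */

	for (i = 0; i < n; i++)
		if (ranges_overlap(seg, &preserved[i]))
			return true;

	return false;
}
```

A caller modelling the kexec_load path would reject the load when
segment_hits_preserved() returns true for any requested segment, while the
kexec_file_load path would instead keep searching for a region where it
returns false.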