From patchwork Tue Aug 10 13:40:49 2021
X-Patchwork-Submitter: David Edmondson
X-Patchwork-Id: 12428981
From: David Edmondson
To: qemu-devel@nongnu.org
Cc: Kevin Wolf, Peter Maydell, qemu-block@nongnu.org,
 "Michael S. Tsirkin", Xu Yandong, John Snow, Markus Armbruster,
 Max Reitz, Alex Bennée, Shannon Zhao, Zheng Xiang,
 qemu-arm@nongnu.org, haibinzhang(张海斌), Stefan Hajnoczi,
 Paolo Bonzini, Igor Mammedov, David Edmondson,
 Philippe Mathieu-Daudé, Stefano Garzarella
Subject: [PATCH v4 0/1] hw/pflash_cfi01: Allow an administrator to reduce
 the memory consumption of flash devices
Date: Tue, 10 Aug 2021 14:40:49 +0100
Message-Id: <20210810134050.396747-1-david.edmondson@oracle.com>

As described in
https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com
and
https://lore.kernel.org/r/20210222174757.2329740-1-david.edmondson@oracle.com
I'd like to reduce the amount of memory consumed by QEMU mapping UEFI
images on aarch64.

To recap:

> Currently ARM UEFI images are typically built as 2MB/768kB flash
> images for code and variables respectively. These images are both
> then padded out to 64MB before being loaded by QEMU.
>
> Because the images are 64MB each, QEMU allocates 128MB of memory to
> read them, and then proceeds to read all 128MB from disk (dirtying
> the memory). Of this 128MB less than 3MB is useful - the rest is
> zero padding.
>
> On a machine with 100 VMs this wastes over 12GB of memory.

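As a rough check of the figures in the recap, the arithmetic can be
written out as below. This is only a sketch: it assumes that the "MB"
and "kB" in the recap mean MiB and KiB, and the function and constant
names are made up for illustration (they do not exist in QEMU).

```c
#include <assert.h>
#include <stdint.h>

/* All sizes in KiB. Reading the recap's "MB"/"kB" as MiB/KiB is an
 * assumption made for this sketch. */
enum {
    PADDED_IMAGE_KIB = 64 * 1024, /* each image is padded out to 64MB */
    CODE_IMAGE_KIB   = 2 * 1024,  /* useful part of the code image */
    VARS_IMAGE_KIB   = 768,       /* useful part of the variables image */
};

/* Zero padding read into memory by one VM (code + variables images). */
uint64_t wasted_kib_per_vm(void)
{
    return 2 * PADDED_IMAGE_KIB - (CODE_IMAGE_KIB + VARS_IMAGE_KIB);
}

/* Zero padding held in memory by vm_count VMs, in MiB (rounded down). */
uint64_t wasted_mib(uint64_t vm_count)
{
    /* Guard against overflow in the multiplication below. */
    assert(vm_count <= UINT64_MAX / wasted_kib_per_vm());
    return vm_count * wasted_kib_per_vm() / 1024;
}
```

Under those assumptions each VM carries 128256KiB of padding, and
wasted_mib(100) comes to 12525MiB, which is consistent with the "over
12GB" quoted above.
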
Some of the cleanups in the previous patches were incorporated, but the
patch that reduced memory consumption was not accepted. This is
essentially that patch, rebased after some unrelated changes.

Having investigated alternatives, I think that the patch here is useful
as it stands.

All read/write operations to areas outside of the underlying block
device are handled directly. Reads return 0, writes either fail
(read-only devices) or are discarded (writable devices). This reduces
the memory consumption for the AAVMF code image from 64MiB to around
2MiB and that for the AAVMF vars from 64MiB to 768KiB (presuming that
the UEFI images are adjusted accordingly).

For read-only devices (such as the AAVMF code) this seems completely
safe.

For writable devices there is a change in behaviour - previously it was
possible to write anywhere in the extent of the flash device, read back
the data written and have that data persist through a restart of QEMU.
This is no longer the case - writes outside of the extent of the
underlying backing block device will be discarded. That is, a read
after a write will *not* return the written data, either immediately or
after a QEMU restart - it will return zeros.

Looking at the AAVMF implementation, it seems to me that if the initial
VARS image is prepared as being 768KiB in size (which it is), AAVMF
itself will not attempt to write outside of that extent, and so I
believe that this is an acceptable compromise.

It would be relatively straightforward to allow writes outside of the
backing device to persist for the lifetime of a particular QEMU by
allocating memory on demand (i.e. when there is a write to the relevant
region). This would allow a read to return the relevant data, but only
until a QEMU restart, at which point the data would be lost.

It may be possible to persist writes by extending the underlying
backing device to accommodate a new extent.
This would definitely add complication, as ideally the size of the
memory sub-region would also be updated. I have not investigated this
further.

There was a suggestion in a previous thread that perhaps the pflash
driver could be re-worked to use the block IO interfaces to access the
underlying device "on demand" rather than reading in the entire image
at startup (at least, that's how I understood the comment).

An implementation of this based around mapping the flash region only
for IO, which meant that every read or write had to be handled directly
by the pflash driver (there was no ROMD style operation), made booting
an aarch64 VM significantly slower - getting through the firmware went
from under 1 second to around 10 seconds. It's possible that this could
be improved by caching blocks or some other mechanism, but I have not
pursued it further.

Philippe implemented a suggestion to use mmap() to avoid the need to
allocate (and dirty) memory for read-only pflash images in
https://lore.kernel.org/qemu-devel/20210301115329.411762-1-philmd@redhat.com/.
This solution was, I believe, considered incomplete, as:
- it does not handle the case where the image underlying a pflash
  device is changed via QAPI,
- it does not handle writable devices.

There is also an assumption that multiple QEMU instances on a single
host will share the same AAVMF code image (to benefit from a shared
mapping) - this is not the case in the environment that I am looking to
support.

If using mmap() for read-only devices is particularly valuable, it
could be combined with the patches here - the benefit would be
cumulative.

The only drawback that I see with this patch is the change in behaviour
for writes beyond the extent of an underlying image. Unless the AAVMF
build process is modified to generate smaller images (768kB for the
variables, for example), this will never be a problem in reality, as
the underlying image will match the size of the device.
Only when a deliberate decision is taken to use an image smaller than
the device does this drawback come to the fore, which is a tradeoff
that an administrator can choose to make if they wish.

v2:
- Unify the approach for both read-only and writable devices, saving
  another 63MiB per QEMU instance.

v3:
- Add Reviewed-by: for two changes (Philippe).
- Call blk_pread() directly rather than using
  blk_check_size_and_read_all(), given that we know how much we want
  to read.

v4:
- Remove changes already upstream.
- Rebase.

David Edmondson (1):
  hw/pflash_cfi01: Allow backing devices to be smaller than memory
    region

 hw/block/pflash_cfi01.c | 105 ++++++++++++++++++++++++++++++++--------
 hw/block/trace-events   |   3 ++
 2 files changed, 87 insertions(+), 21 deletions(-)
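
For readers who want the behaviour change for out-of-range accesses
spelled out, here is a small stand-alone model of the semantics
described in this cover letter. It is a sketch only and is not the code
in the patch: the structure and function names are hypothetical, it
works a byte at a time, and it ignores the CFI command sequences that
real flash writes go through. It also assumes that every write to a
read-only device fails, wherever it lands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a flash device whose backing image may be
 * smaller than the device. None of these names come from QEMU. */
struct flash_model {
    uint8_t *data;       /* holds only the first blk_len bytes */
    size_t   blk_len;    /* size of the underlying block device */
    size_t   total_len;  /* size of the flash device seen by the guest */
    bool     read_only;
};

/* Read one byte; offsets beyond the backing image read as zero. */
uint8_t flash_model_read(const struct flash_model *f, size_t offset)
{
    assert(f->blk_len <= f->total_len && offset < f->total_len);
    return offset < f->blk_len ? f->data[offset] : 0;
}

/* Write one byte. Returns false when the write fails (read-only
 * device). Returns true otherwise, including when the offset lies
 * beyond the backing image and the write is silently discarded. */
bool flash_model_write(struct flash_model *f, size_t offset, uint8_t value)
{
    assert(f->blk_len <= f->total_len && offset < f->total_len);
    if (f->read_only) {
        return false;
    }
    if (offset < f->blk_len) {
        f->data[offset] = value;
    }
    return true;
}
```

With a 4 byte backing image behind a 16 byte device, a write to offset
8 reports success but a following read of offset 8 still returns 0,
which is the drawback discussed above. Writes below offset 4 behave as
before.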