From patchwork Mon Jun 19 06:51:21 2023
X-Patchwork-Submitter: mawupeng
X-Patchwork-Id: 13284029
From: Wupeng Ma
CC: Wei Yang, "Michael S. Tsirkin", Jason Wang, Pankaj Gupta, Michal Hocko, Oscar Salvador
Subject: [PATCH stable 5.10] mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block
Date: Mon, 19 Jun 2023 14:51:21 +0800
Message-ID: <20230619065121.1720912-1-mawupeng1@huawei.com>
X-Mailer: git-send-email 2.25.1

From: David Hildenbrand

commit 8dc4bb58a146655eb057247d7c9d19e73928715b upstream.

virtio-mem soon wants to use offline_and_remove_memory() on memory that
exceeds a single Linux memory block (memory_block_size_bytes()). Let's
remove that restriction.

Let's remember the old state and try to restore that if anything goes
wrong. While re-onlining can, in general, fail, it's highly unlikely to
happen (usually only when a notifier fails to allocate memory, and these
are rather rare).

This will be used by virtio-mem to offline+remove memory ranges that are
bigger than a single memory block - for example, with a device block
size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory
block size of 128 MiB.

While we could compress the state into 2 bits, using 8 bits is much
easier.

This handling is similar to, but different from,
acpi_scan_try_to_offline():

a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG
optimization is still relevant - it should only apply to ZONE_NORMAL
(where we have no guarantees). If relevant, we can always add it.

b) acpi_scan_try_to_offline() simply onlines all memory in case
something goes wrong. It doesn't restore the previous online type. Let's
do that, so we won't overwrite what e.g., user space configured.

Reviewed-by: Wei Yang
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Pankaj Gupta
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Wei Yang
Cc: Andrew Morton
Signed-off-by: David Hildenbrand
Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.com
Signed-off-by: Michael S. Tsirkin
Acked-by: Andrew Morton
Signed-off-by: Ma Wupeng
---
 mm/memory_hotplug.c | 105 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 89 insertions(+), 16 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f0633f9a9116..9ec9e1e67705 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1788,39 +1788,112 @@ int remove_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 
+static int try_offline_memory_block(struct memory_block *mem, void *arg)
+{
+	uint8_t online_type = MMOP_ONLINE_KERNEL;
+	uint8_t **online_types = arg;
+	struct page *page;
+	int rc;
+
+	/*
+	 * Sense the online_type via the zone of the memory block. Offlining
+	 * with multiple zones within one memory block will be rejected
+	 * by offlining code ... so we don't care about that.
+	 */
+	page = pfn_to_online_page(section_nr_to_pfn(mem->start_section_nr));
+	if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE)
+		online_type = MMOP_ONLINE_MOVABLE;
+
+	rc = device_offline(&mem->dev);
+	/*
+	 * Default is MMOP_OFFLINE - change it only if offlining succeeded,
+	 * so try_reonline_memory_block() can do the right thing.
+	 */
+	if (!rc)
+		**online_types = online_type;
+
+	(*online_types)++;
+	/* Ignore if already offline. */
+	return rc < 0 ? rc : 0;
+}
+
+static int try_reonline_memory_block(struct memory_block *mem, void *arg)
+{
+	uint8_t **online_types = arg;
+	int rc;
+
+	if (**online_types != MMOP_OFFLINE) {
+		mem->online_type = **online_types;
+		rc = device_online(&mem->dev);
+		if (rc < 0)
+			pr_warn("%s: Failed to re-online memory: %d",
+				__func__, rc);
+	}
+
+	/* Continue processing all remaining memory blocks. */
+	(*online_types)++;
+	return 0;
+}
+
 /*
- * Try to offline and remove a memory block. Might take a long time to
- * finish in case memory is still in use. Primarily useful for memory devices
- * that logically unplugged all memory (so it's no longer in use) and want to
- * offline + remove the memory block.
+ * Try to offline and remove memory. Might take a long time to finish in case
+ * memory is still in use. Primarily useful for memory devices that logically
+ * unplugged all memory (so it's no longer in use) and want to offline + remove
+ * that memory.
  */
 int offline_and_remove_memory(int nid, u64 start, u64 size)
 {
-	struct memory_block *mem;
-	int rc = -EINVAL;
+	const unsigned long mb_count = size / memory_block_size_bytes();
+	uint8_t *online_types, *tmp;
+	int rc;
 
 	if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
-	    size != memory_block_size_bytes())
-		return rc;
+	    !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+		return -EINVAL;
+
+	/*
+	 * We'll remember the old online type of each memory block, so we can
+	 * try to revert whatever we did when offlining one memory block fails
+	 * after offlining some others succeeded.
+	 */
+	online_types = kmalloc_array(mb_count, sizeof(*online_types),
+				     GFP_KERNEL);
+	if (!online_types)
+		return -ENOMEM;
+	/*
+	 * Initialize all states to MMOP_OFFLINE, so when we abort processing in
+	 * try_offline_memory_block(), we'll skip all unprocessed blocks in
+	 * try_reonline_memory_block().
+	 */
+	memset(online_types, MMOP_OFFLINE, mb_count);
 
 	lock_device_hotplug();
-	mem = find_memory_block(__pfn_to_section(PFN_DOWN(start)));
-	if (mem)
-		rc = device_offline(&mem->dev);
-	/* Ignore if the device is already offline. */
-	if (rc > 0)
-		rc = 0;
+
+	tmp = online_types;
+	rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
 
 	/*
-	 * In case we succeeded to offline the memory block, remove it.
+	 * In case we succeeded to offline all memory, remove it.
 	 * This cannot fail as it cannot get onlined in the meantime.
 	 */
 	if (!rc) {
 		rc = try_remove_memory(nid, start, size);
-		WARN_ON_ONCE(rc);
+		if (rc)
+			pr_err("%s: Failed to remove memory: %d", __func__, rc);
+	}
+
+	/*
+	 * Rollback what we did. While memory onlining might theoretically fail
+	 * (nacked by a notifier), it barely ever happens.
+	 */
+	if (rc) {
+		tmp = online_types;
+		walk_memory_blocks(start, size, &tmp,
+				   try_reonline_memory_block);
 	}
 	unlock_device_hotplug();
 
+	kfree(online_types);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(offline_and_remove_memory);
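
Not part of the patch: for readers who want to see the save-and-rollback
idea in isolation, here is a minimal user-space sketch in C. It only
models the control flow and is not kernel code. struct block,
walk_blocks(), try_offline(), try_reonline(), offline_all_or_rollback()
and the ST_* values are hypothetical stand-ins for struct memory_block,
walk_memory_blocks(), the two callbacks added above and the MMOP_*
online types. A "busy" flag is assumed here to simulate a block that
refuses to go offline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's MMOP_* online types. */
enum { ST_OFFLINE = 0, ST_ONLINE_KERNEL = 1, ST_ONLINE_MOVABLE = 2 };

/* Hypothetical model of one memory block. */
struct block {
	uint8_t state;	/* current online type */
	int busy;	/* nonzero: offlining this block fails */
};

typedef int (*walk_fn)(struct block *blk, void *arg);

/* Visit blocks in order and stop at the first nonzero return value. */
static int walk_blocks(struct block *blks, size_t n, void *arg, walk_fn fn)
{
	for (size_t i = 0; i < n; i++) {
		int rc = fn(&blks[i], arg);

		if (rc)
			return rc;
	}
	return 0;
}

/* Offline one block; record its old state only if we changed it. */
static int try_offline(struct block *blk, void *arg)
{
	uint8_t **cursor = arg;
	int rc = 0;

	if (blk->state != ST_OFFLINE) {
		if (blk->busy) {
			rc = -1;
		} else {
			**cursor = blk->state;
			blk->state = ST_OFFLINE;
		}
	}
	(*cursor)++;	/* advance to the slot of the next block */
	return rc;
}

/* Restore one block if a previous state was recorded for it. */
static int try_reonline(struct block *blk, void *arg)
{
	uint8_t **cursor = arg;

	if (**cursor != ST_OFFLINE)
		blk->state = **cursor;
	(*cursor)++;
	return 0;	/* keep going over all remaining blocks */
}

static int offline_all_or_rollback(struct block *blks, size_t n)
{
	uint8_t *saved, *tmp;
	int rc;

	assert(blks != NULL);
	if (!n)
		return -1;
	/* calloc() zeroes the array, so every slot starts as ST_OFFLINE. */
	saved = calloc(n, sizeof(*saved));
	if (!saved)
		return -1;

	tmp = saved;
	rc = walk_blocks(blks, n, &tmp, try_offline);
	if (rc) {
		/* Rewind the cursor and undo only what we offlined. */
		tmp = saved;
		walk_blocks(blks, n, &tmp, try_reonline);
	}
	free(saved);
	return rc;
}
```

As in the patch, the state array starts out as "offline" everywhere, so
blocks that were already offline, or that were never reached because an
earlier block failed, are left untouched by the rollback walk. The
cursor is passed as a pointer to a pointer so that each callback can
advance it without the walker knowing about the array.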