From patchwork Thu Jun 25 22:35:34 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wei Yang
X-Patchwork-Id: 11626179
From: Wei Yang
To: akpm@linux-foundation.org, osalvador@suse.de, dan.j.williams@intel.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, david@redhat.com, Wei Yang
Subject: [Patch v2] mm/sparse: never partially remove memmap for early section
Date: Fri, 26 Jun 2020 06:35:34 +0800
Message-Id: <20200625223534.18024-1-richard.weiyang@linux.alibaba.com>
X-Mailer: git-send-email
 2.20.1 (Apple Git-117)
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:

For early sections, the memmap is handled specially even when sub-section
hotplug is enabled: the memmap can only be populated as a whole.

Quoted from the comment of section_activate():

    * The early init code does not consider partially populated
    * initial sections, it simply assumes that memory will never be
    * referenced. If we hot-add memory into such a section then we
    * do not need to populate the memmap and can simply reuse what
    * is already there.

However, the current section_deactivate() breaks this rule. When a
sub-section is hot-removed, section_deactivate() depopulates its memmap.
As a consequence, if this sub-section is hot-added again, its memmap is
never properly populated.

The problem can be reproduced with the following steps:

1. Hack qemu to allow a sub-section aligned early section

    diff --git a/hw/i386/pc.c b/hw/i386/pc.c
    index 51b3050d01..c6a78d83c0 100644
    --- a/hw/i386/pc.c
    +++ b/hw/i386/pc.c
    @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
         }

         machine->device_memory->base =
    -        ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
    +        0x100000000ULL + x86ms->above_4g_mem_size;

         if (pcmc->enforce_aligned_dimm) {
             /* size device region assuming 1G page max alignment per slot */

2. Boot qemu with PSE disabled and a sub-section aligned memory size

   Part of the qemu command would look like this:

    sudo x86_64-softmmu/qemu-system-x86_64 \
        --enable-kvm -cpu host,pse=off \
        -m 4160M,maxmem=20G,slots=1 \
        -smp sockets=2,cores=16 \
        -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
        -machine pc,nvdimm \
        -nographic \
        -object memory-backend-ram,id=mem0,size=8G \
        -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k

3. Re-configure a pmem device with a sub-section size in the guest

    ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

Then you would see the following call trace:

    pmem0: detected capacity change from 0 to 16777216
    BUG: unable to handle page fault for address: ffffec73c51000b4
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
    Oops: 0002 [#1] SMP NOPTI
    CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
    RIP: 0010:memmap_init_zone+0x154/0x1c2
    Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
    RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
    RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
    RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
    RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
    R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
    R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
    FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
    Call Trace:
     move_pfn_range_to_zone+0x128/0x150
     memremap_pages+0x4e4/0x5a0
     devm_memremap_pages+0x1e/0x60
     dev_dax_probe+0x69/0x160 [device_dax]
     really_probe+0x298/0x3c0
     driver_probe_device+0xe1/0x150
     ? driver_allows_async_probing+0x50/0x50
     bus_for_each_drv+0x7e/0xc0
     __device_attach+0xdf/0x160
     bus_probe_device+0x8e/0xa0
     device_add+0x3b9/0x740
     __devm_create_dev_dax+0x127/0x1c0
     __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
     dax_pmem_probe+0xc/0x1b [dax_pmem]
     nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
     really_probe+0x147/0x3c0
     driver_probe_device+0xe1/0x150
     device_driver_attach+0x53/0x60
     bind_store+0xd1/0x110
     kernfs_fop_write+0xce/0x1b0
     vfs_write+0xb6/0x1a0
     ksys_write+0x5f/0xe0
     do_syscall_64+0x4d/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang
Acked-by: David Hildenbrand
---
 mm/sparse.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index b2b9a3e34696..a06085738295 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -825,10 +825,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
 	}
 
-	if (section_is_early && memmap)
-		free_map_bootmem(memmap);
-	else
+	/*
+	 * The memmap of early sections is always fully populated. See
+	 * section_activate() and pfn_valid() .
+	 */
+	if (!section_is_early)
 		depopulate_section_memmap(pfn, nr_pages, altmap);
+	else if (memmap)
+		free_map_bootmem(memmap);
 
 	if (empty)
 		ms->section_mem_map = (unsigned long)NULL;
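
For review purposes, the change in the branch above can be modelled in user space. The following is only a sketch, not kernel code: the enum, old_action(), new_action() and check_model() are hypothetical names introduced here, and the model assumes that `memmap` is non-NULL only once the whole section has become empty, which is my reading of the surrounding function rather than something shown in the hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the possible outcomes; these are not kernel names. */
enum memmap_action {
	ACTION_DEPOPULATE,   /* stands in for depopulate_section_memmap() */
	ACTION_FREE_BOOTMEM, /* stands in for free_map_bootmem() */
	ACTION_KEEP          /* memmap is left untouched */
};

/* The branch as it reads before the patch. */
static enum memmap_action old_action(bool section_is_early, bool memmap)
{
	if (section_is_early && memmap)
		return ACTION_FREE_BOOTMEM;
	return ACTION_DEPOPULATE;
}

/* The branch as it reads after the patch. */
static enum memmap_action new_action(bool section_is_early, bool memmap)
{
	if (!section_is_early)
		return ACTION_DEPOPULATE;
	if (memmap)
		return ACTION_FREE_BOOTMEM;
	return ACTION_KEEP;
}

/*
 * Walk all four input combinations: the two versions should differ only
 * for an early section whose memmap is not being freed (assumed here to
 * mean a partial, sub-section removal).
 */
static void check_model(void)
{
	for (int early = 0; early <= 1; early++) {
		for (int map = 0; map <= 1; map++) {
			bool differs = old_action(early, map) != new_action(early, map);

			assert(differs == (early && !map));
		}
	}
	/* The previously buggy case: the old code depopulated, the new code keeps. */
	assert(old_action(true, false) == ACTION_DEPOPULATE);
	assert(new_action(true, false) == ACTION_KEEP);
}
```

Calling check_model() from a small main() exercises every combination. Under the stated assumption, the only behavioural change is that a sub-section removal from an early section no longer tears down part of its memmap.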