From patchwork Thu Nov 8 10:04:13 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10673875
From: Michal Hocko
To:
Cc: Andrew Morton, Oscar Salvador, LKML, Michal Hocko, Wen Congyang,
 Tang Chen, Miroslav Benes, Vlastimil Babka
Subject: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove
Date: Thu, 8 Nov 2018 11:04:13 +0100
Message-Id: <20181108100413.966-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.19.1

From: Michal Hocko

Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed though we are clearing this association

try_offline_node
  check_and_unmap_cpu_on_node
    unmap_cpu_on_node
      numa_clear_node
        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such
a node will get NUMA_NO_NODE. This is problematic for two reasons. First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access. We have encountered this when loading kvm module

BUG: unable to handle kernel paging request at 00000000000021c0
IP: [] __alloc_pages_nodemask+0x93/0xb70
PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
Oops: 0000 [#1] SMP
[...]
CPU: 88 PID: 1223749 Comm: modprobe Tainted: G W 4.4.156-94.64-default #1
task: ffff88727eff1880 ti: ffff887354490000 task.ti: ffff887354490000
RIP: 0010:[] [] __alloc_pages_nodemask+0x93/0xb70
RSP: 0018:ffff887354493b40 EFLAGS: 00010202
RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
FS: 00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 0000000000000086 014000c014d20400 ffff887354493bb8 ffff882614d20f4c
 0000000000000000 0000000000000046 0000000000000046 ffffffff810ac0c9
 ffff88ffe78c0000 ffffffff0000009f ffffe8ffe82d3500 ffff88ff8ac55000
Call Trace:
 [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
 [] hardware_setup+0x781/0x849 [kvm_intel]
 [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
 [] kvm_init+0x7c/0x2d0 [kvm]
 [] vmx_init+0x1e/0x32c [kvm_intel]
 [] do_one_initcall+0xca/0x1f0
 [] do_init_module+0x5a/0x1d7
 [] load_module+0x1393/0x1c90
 [] SYSC_finit_module+0x70/0xa0
 [] entry_SYSCALL_64_fastpath+0x1e/0xb7
DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel but the code is basically the same in the current
Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
would recognize NUMA_NO_NODE and use alloc_pages_node which would
translate it to numa_mem_id but that is wrong as well because it would
use a cpu affinity of the local CPU which might be quite far from the
original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined. We do not
want to lose that link as there is no arch-independent way to get it
from the early boot time AFAICS.

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node
then a fallback node will be used.

Thanks to Vlastimil Babka for his live system debugging skills that
helped debug the issue.

Debugged-by: Vlastimil Babka
Reported-by: Miroslav Benes
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Cc: Wen Congyang
Cc: Tang Chen
Signed-off-by: Michal Hocko
---

Hi,
please note that I am sending this as an RFC even though this has been
confirmed to fix the oops in the kvm_intel module, because I cannot
simply tell that there are no other side effects that I do not see from
reading the code. I would appreciate some background from the people who
introduced this code in e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear
cpu_to_node() when offlining the node") because the changelog doesn't
really explain the motivation much.
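
For readers who want to see the two problems from the changelog in
isolation, here is a minimal user-space sketch. It is not kernel code:
every toy_* name, the table sizes and the cpu/node layout are made up
for illustration, and the bounds check in toy_lookup_node() stands in
for the place where an unchecked index with NUMA_NO_NODE would read out
of bounds.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE	(-1)
#define TOY_NR_CPUS	4	/* assumption: tiny example machine */
#define TOY_NR_NODES	2

/* Hypothetical stand-in for pg_data_t; only the node id is modelled. */
struct toy_pgdat {
	int node_id;
};

/* Toy per-cpu numa_node table: cpus 0-1 on node 0, cpus 2-3 on node 1. */
static int toy_numa_node[TOY_NR_CPUS] = { 0, 0, 1, 1 };
static struct toy_pgdat toy_node_data[TOY_NR_NODES] = { { 0 }, { 1 } };

/* Model of cpu_to_node(): return the default node recorded for a cpu. */
int toy_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return toy_numa_node[cpu];
}

/* Model of the removed unmap_cpu_on_node(): forget every cpu of a node. */
void toy_unmap_cpu_on_node(int nid)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_numa_node[cpu] == nid)
			toy_numa_node[cpu] = NUMA_NO_NODE;
}

/*
 * Bounds-checked node lookup. NULL marks the case where an unchecked
 * array index (nid == NUMA_NO_NODE) would read out of bounds.
 */
struct toy_pgdat *toy_lookup_node(int nid)
{
	if (nid < 0 || nid >= TOY_NR_NODES)
		return NULL;
	return &toy_node_data[nid];
}

/* Count the cpus that could be linked back to a node on re-online. */
int toy_cpus_on_node(int nid)
{
	int count = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_cpu_to_node(cpu) == nid)
			count++;
	return count;
}
```

Before toy_unmap_cpu_on_node(1) is called, toy_lookup_node(toy_cpu_to_node(2))
yields a valid entry and toy_cpus_on_node(1) reports both cpus. After the
call the lookup hits the NUMA_NO_NODE case and no cpu can be associated
with node 1 any more, which mirrors the two issues described above.
Keeping the table untouched on node removal avoids both.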

 mm/memory_hotplug.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2b2b3ccbbfb5..87aeafac54ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1753,34 +1753,6 @@ static int check_cpu_on_node(pg_data_t *pgdat)
 	return 0;
 }
 
-static void unmap_cpu_on_node(pg_data_t *pgdat)
-{
-#ifdef CONFIG_ACPI_NUMA
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		if (cpu_to_node(cpu) == pgdat->node_id)
-			numa_clear_node(cpu);
-#endif
-}
-
-static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
-{
-	int ret;
-
-	ret = check_cpu_on_node(pgdat);
-	if (ret)
-		return ret;
-
-	/*
-	 * the node will be offlined when we come here, so we can clear
-	 * the cpu_to_node() now.
-	 */
-
-	unmap_cpu_on_node(pgdat);
-	return 0;
-}
-
 /**
  * try_offline_node
  * @nid: the node ID
@@ -1813,7 +1785,7 @@ void try_offline_node(int nid)
 		return;
 	}
 
-	if (check_and_unmap_cpu_on_node(pgdat))
+	if (check_cpu_on_node(pgdat))
 		return;
 
 	/*