From patchwork Tue Jun 1 06:51:45 2021
X-Patchwork-Submitter: Bharata B Rao
X-Patchwork-Id: 12290501
From: Bharata B Rao
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, aneesh.kumar@linux.ibm.com, dennis@kernel.org,
 tj@kernel.org, cl@linux.com, akpm@linux-foundation.org,
 amakhalov@vmware.com, guro@fb.com, vbabka@suse.cz,
 srikar@linux.vnet.ibm.com, psampat@linux.ibm.com, ego@linux.vnet.ibm.com,
 Bharata B Rao
Subject: [RFC PATCH v0 1/3] percpu: CPU hotplug support for alloc_percpu()
Date: Tue, 1 Jun 2021 12:21:45 +0530
Message-Id: <20210601065147.53735-2-bharata@linux.ibm.com>
In-Reply-To: <20210601065147.53735-1-bharata@linux.ibm.com>
References: <20210601065147.53735-1-bharata@linux.ibm.com>
The percpu allocator allocates memory for all the possible CPUs. This can
lead to wastage of memory when the number of possible CPUs is significantly
higher than the number of online CPUs. This can be avoided if the percpu
allocator were to allocate only for the online CPUs and extend the
allocation for other CPUs as and when they become online.

Essentially the population of the chunk, which involves allocating the
pages for the chunk unit that corresponds to the CPU and mapping them to
the vmalloc range, can be delayed until CPU hotplug time.

To achieve this, add CPU hotplug callback support to the percpu allocator
and let it set up the percpu allocation corresponding to the newly coming
up CPU at hotplug time. The vmalloc range is allocated for all the possible
CPUs upfront, but at hotplug time only the populated pages of the chunk are
set up (allocated and mapped) for the unit that corresponds to the new CPU.
The same is undone (unit pages unmapped and freed) at unplug time.

This by itself isn't sufficient because some callers of alloc_percpu()
expect the percpu variables/pointers for all the possible CPUs to have been
initialized at allocation time itself. Hence allow them to register a
callback via alloc_percpu() variants that would be called back at hotplug
time for any necessary initialization of percpu variables.

This is very much an experimental patch with major unsolved and unaddressed
aspects listed below:

- Memcg charging has been changed to account for online CPUs; however,
  growing and removing the charge corresponding to the hotplugged CPU
  hasn't been done yet.
- The CPU hotplug support has been added only to the vmalloc based percpu
  allocator.
- All the callers of alloc_percpu() that need initialization callbacks
  haven't been changed to use the new variants. I have changed only those
  callers that I ran into when booting a minimal powerpc KVM guest in my
  environment.
- Yet to audit all the callers of alloc_percpu() and verify whether the
  approach taken here would fit their use of percpu memory and whether
  they are in a position to handle the required initialization during CPU
  hotplug time.
- The patches may break git blame; the intention right now is just to make
  it easy to illustrate the approach taken.

Signed-off-by: Bharata B Rao
---
 include/linux/cpuhotplug.h |   2 +
 include/linux/percpu.h     |  15 +++
 mm/percpu-internal.h       |   9 ++
 mm/percpu-vm.c             | 199 +++++++++++++++++++++++++++++++++++
 mm/percpu.c                | 209 ++++++++++++++++++++++++++++++++++++-
 5 files changed, 430 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..ae20a14967eb 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -36,6 +36,8 @@ enum cpuhp_state {
 	CPUHP_X86_MCE_DEAD,
 	CPUHP_VIRT_NET_DEAD,
 	CPUHP_SLUB_DEAD,
+	CPUHP_PERCPU_SETUP,
+	CPUHP_PERCPU_ALLOC,
 	CPUHP_DEBUG_OBJ_DEAD,
 	CPUHP_MM_WRITEBACK_DEAD,
 	CPUHP_MM_VMSTAT_DEAD,
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 5e76af742c80..145fdb9318d1 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -100,6 +100,7 @@ typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size,
 typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
 typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
 typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
+typedef int (*pcpu_cpuhp_fn_t)(void __percpu *ptr, unsigned int cpu, void *data);
 
 extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 							     int nr_units);
@@ -133,6 +134,11 @@ extern void __init setup_per_cpu_areas(void);
 
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
 extern void __percpu *__alloc_percpu(size_t size, size_t align);
+extern void __percpu *__alloc_percpu_gfp_cb(size_t size, size_t align,
+					    gfp_t gfp, pcpu_cpuhp_fn_t fn,
+					    void *data);
+extern void __percpu *__alloc_percpu_cb(size_t size, size_t align,
+					pcpu_cpuhp_fn_t fn, void *data);
 extern void free_percpu(void __percpu *__pdata);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
@@ -143,6 +149,15 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),		\
 						__alignof__(type))
 
+#define alloc_percpu_gfp_cb(type, gfp, fn, data)			\
+	(typeof(type) __percpu *)__alloc_percpu_gfp_cb(sizeof(type),	\
+						__alignof__(type), gfp,	\
+						fn, data)
+#define alloc_percpu_cb(type, fn, data)					\
+	(typeof(type) __percpu *)__alloc_percpu_cb(sizeof(type),	\
+						__alignof__(type),	\
+						fn, data)
+
 extern unsigned long pcpu_nr_pages(void);
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index ae26b118e246..8064e7c43b9f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -57,6 +57,8 @@ struct pcpu_chunk {
 #endif
 
 	struct list_head	list;	/* linked to pcpu_slot lists */
+	struct list_head	cpuhp;	/* list of registered cpu hotplug
+					   notifiers */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
 	void			*base_addr;	/* base address of this chunk */
@@ -282,4 +284,11 @@ static inline void pcpu_stats_chunk_dealloc(void)
 
 #endif /* !CONFIG_PERCPU_STATS */
 
+struct percpu_cpuhp_notifier {
+	void __percpu *ptr;
+	void *data;
+	pcpu_cpuhp_fn_t cb;
+	struct list_head list;
+};
+
 #endif
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 8d3844bc0c7c..3250e1c9aeaf 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -41,6 +41,67 @@ static struct page **pcpu_get_pages(void)
 	return pages;
 }
 
+/**
+ * pcpu_alloc_pages_cpu - allocates pages for @chunk for a given cpu
+ * @cpu: target cpu
+ * @chunk: target chunk
+ * @pages: array to put the allocated pages into, indexed by pcpu_page_idx()
+ * @page_start: page index of the first page to be allocated
+ * @page_end: page index of the last page to be allocated + 1
+ * @gfp: allocation flags passed to the underlying allocator
+ *
+ * Allocate pages [@page_start,@page_end) into @pages for the given cpu.
+ * The allocation is for @chunk.  Percpu core doesn't care about the
+ * content of @pages and will pass it verbatim to pcpu_map_pages().
+ */
+static int pcpu_alloc_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				struct page **pages, int page_start,
+				int page_end, gfp_t gfp)
+{
+	int i;
+
+	gfp |= __GFP_HIGHMEM;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
+
+		*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+		if (!*pagep)
+			goto err;
+	}
+	return 0;
+
+err:
+	while (--i >= page_start)
+		__free_page(pages[pcpu_page_idx(cpu, i)]);
+
+	return -ENOMEM;
+}
+
+/**
+ * pcpu_free_pages_cpu - free pages which were allocated for @chunk for @cpu
+ * @cpu: cpu for which the pages were allocated
+ * @chunk: chunk pages were allocated for
+ * @pages: array of pages to be freed, indexed by pcpu_page_idx()
+ * @page_start: page index of the first page to be freed
+ * @page_end: page index of the last page to be freed + 1
+ *
+ * Free pages [@page_start,@page_end) in @pages for @cpu.
+ * The pages were allocated for @chunk.
+ */
+static void pcpu_free_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				struct page **pages, int page_start,
+				int page_end)
+{
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page *page = pages[pcpu_page_idx(cpu, i)];
+
+		if (page)
+			__free_page(page);
+	}
+}
+
 /**
  * pcpu_free_pages - free pages which were allocated for @chunk
  * @chunk: chunk pages were allocated for
@@ -137,6 +198,37 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
 	vunmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT));
 }
 
+/**
+ * pcpu_unmap_pages_cpu - unmap pages out of a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @pages: pages array which can be used to pass information to free
+ * @page_start: page index of the first page to unmap
+ * @page_end: page index of the last page to unmap + 1
+ *
+ * For the given cpu, unmap pages [@page_start,@page_end) out of @chunk.
+ * Corresponding elements in @pages were cleared by the caller and can
+ * be used to carry information to pcpu_free_pages() which will be
+ * called after all unmaps are finished.  The caller should call
+ * proper pre/post flush functions.
+ */
+static void pcpu_unmap_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				 struct page **pages, int page_start,
+				 int page_end)
+{
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page *page;
+
+		page = pcpu_chunk_page(chunk, cpu, i);
+		WARN_ON(!page);
+		pages[pcpu_page_idx(cpu, i)] = page;
+	}
+	__pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+			   page_end - page_start);
+}
+
 /**
  * pcpu_unmap_pages - unmap pages out of a pcpu_chunk
  * @chunk: chunk of interest
@@ -197,6 +289,41 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
 					PAGE_KERNEL, pages, PAGE_SHIFT);
 }
 
+/**
+ * pcpu_map_pages_cpu - map pages into a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @pages: pages array containing pages to be mapped
+ * @page_start: page index of the first page to map
+ * @page_end: page index of the last page to map + 1
+ *
+ * For the given cpu, map pages [@page_start,@page_end) into @chunk.  The
+ * caller is responsible for calling pcpu_post_map_flush() after all
+ * mappings are complete.
+ *
+ * This function is responsible for setting up whatever is necessary for
+ * reverse lookup (addr -> chunk).
+ */
+static int pcpu_map_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+			      struct page **pages, int page_start,
+			      int page_end)
+{
+	int i, err;
+
+	err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+			       &pages[pcpu_page_idx(cpu, page_start)],
+			       page_end - page_start);
+	if (err < 0)
+		goto err;
+
+	for (i = page_start; i < page_end; i++)
+		pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
+				    chunk);
+	return 0;
+err:
+	pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
+	return err;
+}
+
 /**
  * pcpu_map_pages - map pages into a pcpu_chunk
  * @chunk: chunk of interest
@@ -260,6 +387,40 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
 		pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
 }
 
+/**
+ * pcpu_populate_chunk_cpu - populate and map an area of a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @page_start: the start page
+ * @page_end: the end page
+ * @gfp: allocation flags passed to the underlying memory allocator
+ *
+ * For the given cpu, populate and map pages [@page_start,@page_end) into
+ * @chunk.
+ *
+ * CONTEXT:
+ * pcpu_alloc_mutex, does GFP_KERNEL allocation.
+ */
+static int pcpu_populate_chunk_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				   int page_start, int page_end, gfp_t gfp)
+{
+	struct page **pages;
+
+	pages = pcpu_get_pages();
+	if (!pages)
+		return -ENOMEM;
+
+	if (pcpu_alloc_pages_cpu(cpu, chunk, pages, page_start, page_end, gfp))
+		return -ENOMEM;
+
+	if (pcpu_map_pages_cpu(cpu, chunk, pages, page_start, page_end)) {
+		pcpu_free_pages_cpu(cpu, chunk, pages, page_start, page_end);
+		return -ENOMEM;
+	}
+	pcpu_post_map_flush(chunk, page_start, page_end);
+
+	return 0;
+}
+
 /**
  * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
  * @chunk: chunk of interest
@@ -294,6 +455,44 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 	return 0;
 }
 
+/**
+ * pcpu_depopulate_chunk_cpu - depopulate and unmap an area of a pcpu_chunk
+ * for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk to depopulate
+ * @page_start: the start page
+ * @page_end: the end page
+ *
+ * For the given cpu, depopulate and unmap pages [@page_start,@page_end)
+ * from @chunk.
+ *
+ * CONTEXT:
+ * pcpu_alloc_mutex.
+ */
+static void pcpu_depopulate_chunk_cpu(unsigned int cpu,
+				      struct pcpu_chunk *chunk,
+				      int page_start, int page_end)
+{
+	struct page **pages;
+
+	/*
+	 * If control reaches here, there must have been at least one
+	 * successful population attempt so the temp pages array must
+	 * be available now.
+	 */
+	pages = pcpu_get_pages();
+	BUG_ON(!pages);
+
+	/* unmap and free */
+	pcpu_pre_unmap_flush(chunk, page_start, page_end);
+
+	pcpu_unmap_pages_cpu(cpu, chunk, pages, page_start, page_end);
+
+	/* no need to flush tlb, vmalloc will handle it lazily */
+
+	pcpu_free_pages_cpu(cpu, chunk, pages, page_start, page_end);
+}
+
 /**
  * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
  * @chunk: chunk to depopulate
diff --git a/mm/percpu.c b/mm/percpu.c
index f99e9306b939..ca8ca541bede 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1324,6 +1324,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 					     alloc_size);
 
 	INIT_LIST_HEAD(&chunk->list);
+	INIT_LIST_HEAD(&chunk->cpuhp);
 
 	chunk->base_addr = (void *)aligned_addr;
 	chunk->start_offset = start_offset;
@@ -1404,6 +1405,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 		return NULL;
 
 	INIT_LIST_HEAD(&chunk->list);
+	INIT_LIST_HEAD(&chunk->cpuhp);
 	chunk->nr_pages = pcpu_unit_pages;
 	region_bits = pcpu_chunk_map_bits(chunk);
@@ -1659,6 +1661,161 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void pcpu_cpuhp_register(struct pcpu_chunk *chunk,
+				struct percpu_cpuhp_notifier *n)
+{
+	list_add(&n->list, &chunk->cpuhp);
+}
+
+static void pcpu_cpuhp_deregister(struct pcpu_chunk *chunk,
+				  void __percpu *ptr)
+{
+	struct percpu_cpuhp_notifier *n, *next;
+
+	list_for_each_entry_safe(n, next, &chunk->cpuhp, list)
+		if (n->ptr == ptr) {
+			list_del(&n->list);
+			kfree(n);
+			return;
+		}
+}
+
+static void __pcpu_cpuhp_setup(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			unsigned int rs, re;
+
+			if (chunk == pcpu_first_chunk)
+				continue;
+
+			bitmap_for_each_set_region(chunk->populated, rs, re, 0,
+						   chunk->nr_pages)
+				pcpu_populate_chunk_cpu(cpu, chunk, rs, re,
+							GFP_KERNEL);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_setup - cpu hotplug callback for the percpu allocator
+ * @cpu: cpu that is being hotplugged
+ *
+ * Allocates and maps the pages that correspond to @cpu's unit
+ * in all chunks.
+ */
+static int percpu_cpuhp_setup(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_setup(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static void __pcpu_cpuhp_destroy(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			unsigned int rs, re;
+
+			if (chunk == pcpu_first_chunk)
+				continue;
+
+			bitmap_for_each_set_region(chunk->populated, rs, re, 0,
+						   chunk->nr_pages)
+				pcpu_depopulate_chunk_cpu(cpu, chunk, rs, re);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_destroy - cpu unplug callback for the percpu allocator
+ * @cpu: cpu that is being hotplugged
+ *
+ * Unmaps and frees the pages that correspond to @cpu's unit
+ * in all chunks.
+ */
+static int percpu_cpuhp_destroy(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_destroy(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static void __pcpu_cpuhp_alloc(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	struct percpu_cpuhp_notifier *n;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			list_for_each_entry(n, &chunk->cpuhp, list)
+				n->cb(n->ptr, cpu, n->data);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_alloc - cpu hotplug callback for executing any initialization
+ * routines registered by the callers of alloc_percpu_cb()
+ * @cpu: cpu that is being hotplugged
+ */
+static int percpu_cpuhp_alloc(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_alloc(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static int percpu_cpuhp_free(unsigned int cpu)
+{
+	return 0;
+}
+
+/**
+ * percpu_hotplug_setup - register cpu hotplug callbacks for the percpu
+ * allocator and its callers
+ */
+static int percpu_hotplug_setup(void)
+{
+	/* Callback for percpu allocator */
+	if (cpuhp_setup_state(CPUHP_PERCPU_SETUP, "percpu:setup",
+			      percpu_cpuhp_setup, percpu_cpuhp_destroy))
+		return -EINVAL;
+
+	/* Callback for the callers of alloc_percpu() */
+	if (cpuhp_setup_state(CPUHP_PERCPU_ALLOC, "percpu:alloc",
+			      percpu_cpuhp_alloc, percpu_cpuhp_free))
+		return -EINVAL;
+
+	return 0;
+}
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1675,7 +1832,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
  * Percpu pointer to the allocated area on success, NULL on failure.
 */
 static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
-				 gfp_t gfp)
+				 gfp_t gfp, pcpu_cpuhp_fn_t cb, void *data)
 {
 	gfp_t pcpu_gfp;
 	bool is_atomic;
@@ -1690,6 +1847,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	unsigned long flags;
 	void __percpu *ptr;
 	size_t bits, bit_align;
+	struct percpu_cpuhp_notifier *n;
 
 	gfp = current_gfp_context(gfp);
 	/* whitelisted flags that can be passed to the backing allocators */
@@ -1697,6 +1855,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
 	do_warn = !(gfp & __GFP_NOWARN);
 
+	if (cb) {
+		n = kmalloc(sizeof(*n), gfp);
+		if (!n)
+			return NULL;
+	}
+
 	/*
 	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE,
 	 * therefore alignment must be a minimum of that many bytes.
@@ -1847,6 +2011,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
 
+	if (cb) {
+		n->ptr = ptr;
+		n->cb = cb;
+		n->data = data;
+		pcpu_cpuhp_register(chunk, n);
+	}
+
 	return ptr;
 
 fail_unlock:
@@ -1870,6 +2041,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+	kfree(n);
 
 	return NULL;
 }
@@ -1891,7 +2063,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 */
 void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
 {
-	return pcpu_alloc(size, align, false, gfp);
+	return pcpu_alloc(size, align, false, gfp, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
 
@@ -1904,7 +2076,7 @@ EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
 */
 void __percpu *__alloc_percpu(size_t size, size_t align)
 {
-	return pcpu_alloc(size, align, false, GFP_KERNEL);
+	return pcpu_alloc(size, align, false, GFP_KERNEL, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(__alloc_percpu);
 
@@ -1926,7 +2098,33 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
 */
 void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 {
-	return pcpu_alloc(size, align, true, GFP_KERNEL);
+	return pcpu_alloc(size, align, true, GFP_KERNEL, NULL, NULL);
+}
+
+/**
+ * alloc_percpu() variants that take a callback to handle any required
+ * initialization of the percpu ptr corresponding to the cpu that is
+ * coming online.
+ * @cb: This callback will be called whenever a cpu is hotplugged.
+ */
+void __percpu *__alloc_percpu_gfp_cb(size_t size, size_t align, gfp_t gfp,
+				     pcpu_cpuhp_fn_t cb, void *data)
+{
+	return pcpu_alloc(size, align, false, gfp, cb, data);
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu_gfp_cb);
+
+void __percpu *__alloc_percpu_cb(size_t size, size_t align, pcpu_cpuhp_fn_t cb,
+				 void *data)
+{
+	return pcpu_alloc(size, align, false, GFP_KERNEL, cb, data);
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu_cb);
+
+void __percpu *__alloc_reserved_percpu_cb(size_t size, size_t align,
+					  pcpu_cpuhp_fn_t cb, void *data)
+{
+	return pcpu_alloc(size, align, true, GFP_KERNEL, cb, data);
 }
 
 /**
@@ -2116,6 +2314,7 @@ void free_percpu(void __percpu *ptr)
 		}
 	}
 
+	pcpu_cpuhp_deregister(chunk, ptr);
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
 
 	spin_unlock_irqrestore(&pcpu_lock, flags);
@@ -2426,6 +2625,8 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		}							\
 } while (0)
 
+	PCPU_SETUP_BUG_ON(percpu_hotplug_setup() < 0);
+
 	/* sanity checks */
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 #ifdef CONFIG_SMP

From patchwork Tue Jun 1 06:51:46 2021
X-Patchwork-Submitter: Bharata B Rao
X-Patchwork-Id: 12290503
From: Bharata B Rao
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, aneesh.kumar@linux.ibm.com, dennis@kernel.org,
 tj@kernel.org, cl@linux.com, akpm@linux-foundation.org,
 amakhalov@vmware.com, guro@fb.com, vbabka@suse.cz,
 srikar@linux.vnet.ibm.com, psampat@linux.ibm.com, ego@linux.vnet.ibm.com,
 Bharata B Rao
Subject: [RFC PATCH v0 2/3] percpu: Limit percpu allocator to online cpus
Date: Tue, 1 Jun 2021 12:21:46 +0530
Message-Id: <20210601065147.53735-3-bharata@linux.ibm.com>
In-Reply-To: <20210601065147.53735-1-bharata@linux.ibm.com>
References: <20210601065147.53735-1-bharata@linux.ibm.com>

Now that the percpu allocator supports growing of memory for a newly
coming up CPU at hotplug time, limit the allocation, mapping and memcg
charging of memory to online CPUs. Also change the percpu memory
reporting in /proc/meminfo to reflect the populated pages of only
online CPUs.
TODO: Address percpu memcg charging and uncharging from the CPU
hotplug callbacks.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 mm/percpu-vm.c | 12 ++++++------
 mm/percpu.c    | 20 +++++++++++++-------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 3250e1c9aeaf..79ce104c963a 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -118,7 +118,7 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu;
 	int i;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page *page = pages[pcpu_page_idx(cpu, i)];
 
@@ -149,7 +149,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 
 	gfp |= __GFP_HIGHMEM;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
@@ -164,7 +164,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 	while (--i >= page_start)
 		__free_page(pages[pcpu_page_idx(cpu, i)]);
 
-	for_each_possible_cpu(tcpu) {
+	for_each_online_cpu(tcpu) {
 		if (tcpu == cpu)
 			break;
 		for (i = page_start; i < page_end; i++)
@@ -248,7 +248,7 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu;
 	int i;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page *page;
 
@@ -344,7 +344,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu, tcpu;
 	int i, err;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				       &pages[pcpu_page_idx(cpu, page_start)],
 				       page_end - page_start);
@@ -357,7 +357,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	}
 	return 0;
 err:
-	for_each_possible_cpu(tcpu) {
+	for_each_online_cpu(tcpu) {
 		if (tcpu == cpu)
 			break;
 		__pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
diff --git a/mm/percpu.c b/mm/percpu.c
index ca8ca541bede..83b6bcfcfa80 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1594,7 +1594,7 @@ static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 	if (!objcg)
 		return PCPU_CHUNK_ROOT;
 
-	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+	if (obj_cgroup_charge(objcg, gfp, size * num_online_cpus())) {
 		obj_cgroup_put(objcg);
 		return PCPU_FAIL_ALLOC;
 	}
@@ -1615,10 +1615,10 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				size * num_possible_cpus());
+				size * num_online_cpus());
 		rcu_read_unlock();
 	} else {
-		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_uncharge(objcg, size * num_online_cpus());
 		obj_cgroup_put(objcg);
 	}
 }
@@ -1633,11 +1633,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+	obj_cgroup_uncharge(objcg, size * num_online_cpus());
 
 	rcu_read_lock();
 	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
+			-(size * num_online_cpus()));
 	rcu_read_unlock();
 
 	obj_cgroup_put(objcg);
@@ -1680,6 +1680,9 @@ static void pcpu_cpuhp_deregister(struct pcpu_chunk *chunk,
 	}
 }
 
+/*
+ * TODO: Grow the memcg charge
+ */
 static void __pcpu_cpuhp_setup(enum pcpu_chunk_type type, unsigned int cpu)
 {
 	int slot;
@@ -1720,6 +1723,9 @@ static int percpu_cpuhp_setup(unsigned int cpu)
 	return 0;
 }
 
+/*
+ * TODO: Reduce the memcg charge
+ */
 static void __pcpu_cpuhp_destroy(enum pcpu_chunk_type type, unsigned int cpu)
 {
 	int slot;
@@ -2000,7 +2006,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	pcpu_schedule_balance_work();
 
 	/* clear the areas and return address relative to base address */
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
 
 	ptr =
__addr_to_pcpu_ptr(chunk->base_addr + off);
@@ -3372,7 +3378,7 @@ void __init setup_per_cpu_areas(void)
  */
 unsigned long pcpu_nr_pages(void)
 {
-	return pcpu_nr_populated * pcpu_nr_units;
+	return pcpu_nr_populated * num_online_cpus();
 }
 
 /*

From patchwork Tue Jun 1 06:51:47 2021
From: Bharata B Rao <bharata@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, aneesh.kumar@linux.ibm.com, dennis@kernel.org,
    tj@kernel.org, cl@linux.com, akpm@linux-foundation.org,
    amakhalov@vmware.com, guro@fb.com, vbabka@suse.cz,
    srikar@linux.vnet.ibm.com, psampat@linux.ibm.com,
    ego@linux.vnet.ibm.com, Bharata B Rao <bharata@linux.ibm.com>
Subject: [RFC PATCH v0 3/3] percpu: Avoid using percpu ptrs of non-existing cpus
Date: Tue, 1 Jun 2021 12:21:47 +0530
Message-Id: <20210601065147.53735-4-bharata@linux.ibm.com>
In-Reply-To: <20210601065147.53735-1-bharata@linux.ibm.com>
References: <20210601065147.53735-1-bharata@linux.ibm.com>
Prevent callers of alloc_percpu() from using the percpu pointers of
non-existent CPUs. Also switch those callers that require
initialization of percpu data for newly onlined CPUs to the new
variant alloc_percpu_cb().

Note: Not all callers have been modified here.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 fs/namespace.c           |  4 ++--
 kernel/cgroup/rstat.c    | 20 ++++++++++++++++----
 kernel/sched/cpuacct.c   | 10 +++++-----
 kernel/sched/psi.c       | 14 +++++++++++---
 lib/percpu-refcount.c    |  4 ++--
 lib/percpu_counter.c     |  2 +-
 net/ipv4/fib_semantics.c |  2 +-
 net/ipv6/route.c         |  6 +++---
 8 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f1a78ba369..b6ea584b99e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -182,7 +182,7 @@ int mnt_get_count(struct mount *mnt)
 	int count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
 	}
 
@@ -294,7 +294,7 @@ static unsigned int mnt_get_writers(struct mount *mnt)
 	unsigned int count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
 	}
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cee265cb535c..b25c59138c0b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -152,7 +152,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		raw_spinlock_t *cpu_lock =
			per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 		struct cgroup *pos = NULL;
 
@@ -245,19 +245,31 @@ void cgroup_rstat_flush_release(void)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+static int cgroup_rstat_cpuhp_handler(void __percpu *ptr, unsigned int cpu, void *data)
+{
+	struct cgroup *cgrp = (struct cgroup *)data;
+	struct cgroup_rstat_cpu *rstatc = per_cpu_ptr(ptr, cpu);
+
+	rstatc->updated_children = cgrp;
+	u64_stats_init(&rstatc->bsync);
+	return 0;
+}
+
 int cgroup_rstat_init(struct cgroup *cgrp)
 {
 	int cpu;
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_cb(struct cgroup_rstat_cpu,
+						  cgroup_rstat_cpuhp_handler,
+						  cgrp);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
 
 	/* ->updated_children list is self terminated */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
 
 		rstatc->updated_children = cgrp;
@@ -274,7 +286,7 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
 	cgroup_rstat_flush(cgrp);
 
 	/* sanity check */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
 
 		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 104a1bade14f..81dd53387ba5 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -160,7 +160,7 @@ static u64 __cpuusage_read(struct cgroup_subsys_state *css,
 	u64 totalcpuusage = 0;
 	int i;
 
-	for_each_possible_cpu(i)
+	for_each_online_cpu(i)
 		totalcpuusage += cpuacct_cpuusage_read(ca, i, index);
 
 	return totalcpuusage;
@@ -195,7 +195,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 	if (val)
 		return -EINVAL;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		cpuacct_cpuusage_write(ca, cpu, 0);
 
 	return 0;
@@ -208,7 +208,7 @@ static int __cpuacct_percpu_seq_show(struct seq_file *m,
 	u64 percpu;
 	int i;
 
-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {
 		percpu = cpuacct_cpuusage_read(ca, i, index);
 		seq_printf(m, "%llu ", (unsigned long long) percpu);
 	}
@@ -242,7 +242,7 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 		seq_printf(m, " %s", cpuacct_stat_desc[index]);
 	seq_puts(m, "\n");
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 
 		seq_printf(m, "%d", cpu);
@@ -275,7 +275,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	int stat;
 
 	memset(val, 0, sizeof(val));
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
 		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3cff41f..228977aa4780 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -186,7 +186,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
@@ -321,7 +321,7 @@ static void collect_percpu_times(struct psi_group *group,
 	 * the sampling period. This eliminates artifacts from uneven
 	 * loading, or even entirely idle CPUs.
 	 */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		u32 times[NR_PSI_STATES];
 		u32 nonidle;
 		u32 cpu_changed_states;
@@ -935,12 +935,20 @@ void psi_memstall_leave(unsigned long *flags)
 }
 
 #ifdef CONFIG_CGROUPS
+static int psi_cpuhp_handler(void __percpu *ptr, unsigned int cpu, void *unused)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(ptr, cpu);
+
+	seqcount_init(&groupc->seq);
+	return 0;
+}
+
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_cb(struct psi_group_cpu, psi_cpuhp_handler, NULL);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index a1071cdefb5a..aeba43c33600 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -173,7 +173,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	unsigned long count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		count += *per_cpu_ptr(percpu_count, cpu);
 
 	pr_debug("global %lu percpu %lu\n",
@@ -253,7 +253,7 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 	 * zeroing is visible to all percpu accesses which can see the
	 * following __PERCPU_REF_ATOMIC clearing.
 	 */
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		*per_cpu_ptr(percpu_count, cpu) = 0;
 
 	smp_store_release(&ref->percpu_count_ptr,
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ed610b75dc32..db40abc6f0f5 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -63,7 +63,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&fbc->lock, flags);
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a632b66bc13a..dbfd14b0077f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -194,7 +194,7 @@ static void rt_fibinfo_free_cpus(struct rtable __rcu * __percpu *rtp)
 	if (!rtp)
 		return;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct rtable *rt;
 
 		rt = rcu_dereference_protected(*per_cpu_ptr(rtp, cpu), 1);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a22822bdbf39..e7db3a5fe5c5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -165,7 +165,7 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev)
 	if (dev == loopback_dev)
 		return;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 		struct rt6_info *rt;
 
@@ -3542,7 +3542,7 @@ void fib6_nh_release(struct fib6_nh *fib6_nh)
 	if (fib6_nh->rt6i_pcpu) {
 		int cpu;
 
-		for_each_possible_cpu(cpu) {
+		for_each_online_cpu(cpu) {
 			struct rt6_info **ppcpu_rt;
 			struct rt6_info *pcpu_rt;
 
@@ -6569,7 +6569,7 @@ int __init ip6_route_init(void)
 #endif
 #endif
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 
 		INIT_LIST_HEAD(&ul->head);
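The alloc_percpu_cb() pattern this patch relies on can be sketched in userspace (all names here — `mock_alloc_percpu_cb`, `mock_cpu_online`, `mock_init_slot` — are hypothetical stand-ins, not the kernel implementation): storage exists for every possible CPU, the init callback runs only for CPUs online at allocation time, and the callback is remembered so a later hotplug event can initialize a newly onlined CPU's slot.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_MOCK 8

/* Callback shape mirrors the handlers above: (ptr, cpu, data). */
typedef int (*pcpu_init_cb_t)(void *ptr, unsigned int cpu, void *data);

struct mock_pcpu {
	long slot[NR_CPUS_MOCK];	/* one "percpu" slot per possible CPU */
	pcpu_init_cb_t cb;		/* remembered for hotplug time */
	void *cb_data;
	unsigned long online_mask;
};

/* Allocate slots for all possible CPUs, but run the init callback
 * only for the CPUs that are currently online. */
static struct mock_pcpu *mock_alloc_percpu_cb(unsigned long online_mask,
					      pcpu_init_cb_t cb, void *data)
{
	struct mock_pcpu *p = calloc(1, sizeof(*p));
	unsigned int cpu;

	if (!p)
		return NULL;
	p->cb = cb;
	p->cb_data = data;
	p->online_mask = online_mask;
	for (cpu = 0; cpu < NR_CPUS_MOCK; cpu++)
		if (online_mask & (1UL << cpu))
			cb(p->slot, cpu, data);
	return p;
}

/* Analogue of the CPU-hotplug online callback: initialize the slot
 * of a CPU that just came online. */
static void mock_cpu_online(struct mock_pcpu *p, unsigned int cpu)
{
	p->online_mask |= 1UL << cpu;
	p->cb(p->slot, cpu, p->cb_data);
}

/* Example init callback, shaped like cgroup_rstat_cpuhp_handler(). */
static int mock_init_slot(void *ptr, unsigned int cpu, void *data)
{
	((long *)ptr)[cpu] = *(long *)data;
	return 0;
}
```

With CPUs 0 and 1 online at allocation time, only their slots are initialized; onlining CPU 2 later runs the remembered callback for its slot, which is why callers that need per-CPU initialization must pass it to alloc_percpu_cb() rather than loop over possible CPUs themselves.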