From patchwork Mon Nov 7 22:39:17 2022
X-Patchwork-Submitter: Song Liu
X-Patchwork-Id: 13035366
From: Song Liu
Subject: [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
Date: Mon, 7 Nov 2022 14:39:17 -0800
Message-ID: <20221107223921.3451913-2-song@kernel.org>
In-Reply-To: <20221107223921.3451913-1-song@kernel.org>
References: <20221107223921.3451913-1-song@kernel.org>

execmem_alloc is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new tree of vmap_area, the free_text_area_* tree, is introduced in
addition to free_vmap_area_* and vmap_area_*. execmem_alloc allocates
pages from free_text_area_*. When there isn't enough space left in
free_text_area_*, new PMD_SIZE page(s) are allocated from free_vmap_area_*
and added to free_text_area_*. To be more accurate, the vmap_area is first
added to the vmap_area_* tree and then moved to free_text_area_*. This
extra move simplifies the logic of execmem_alloc.

vmap_area in the free_text_area_* tree are backed with memory, but they
cannot keep a valid vm pointer because the tree operations need
subtree_max_size, which shares storage with it. Therefore, the vm_struct
for these vmap_area are stored in a separate list, all_text_vm.

The new tree allows separate handling of < PAGE_SIZE allocations, as
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of execmem_alloc can handle BPF programs, which use 64-byte
aligned allocations, and modules, which use PAGE_SIZE aligned
allocations.
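
As an illustration only (not part of this patch), the sub-page carving
described above can be sketched in userspace C. The names struct
free_range, align_up() and carve() are hypothetical and exist only in
this sketch; the patch itself does this step with roundup(),
classify_va_fit_type() and adjust_va_to_fit_type() on vmap_area nodes.
The sketch assumes a power-of-two alignment and tracks a single free
range instead of a tree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one free range in free_text_area_*. */
struct free_range {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
};

/* Round x up to a multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t x, uintptr_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Carve [addr, addr + size) out of *fr, where addr is fr->start rounded
 * up to align. On success, *head receives the (possibly empty) gap left
 * in front of the allocation, *fr shrinks to the remainder behind it,
 * and addr is returned. Returns 0 if the request does not fit.
 */
static uintptr_t carve(struct free_range *fr, struct free_range *head,
		       uintptr_t size, uintptr_t align)
{
	uintptr_t addr = align_up(fr->start, align);

	/* addr < fr->start catches wrap-around in align_up() */
	if (addr < fr->start || addr > fr->end || fr->end - addr < size)
		return 0;

	head->start = fr->start;
	head->end = addr;
	fr->start = addr + size;
	return addr;
}

/* Example: a 64-byte aligned request, then a page aligned one. */
static void carve_demo(void)
{
	struct free_range fr = { 0x1010, 0x200000 }, head;
	uintptr_t bpf_prog = carve(&fr, &head, 200, 64);
	uintptr_t mod_text = carve(&fr, &head, 4096, 4096);

	assert(bpf_prog % 64 == 0 && mod_text % 4096 == 0);
	printf("bpf %#lx, module %#lx\n",
	       (unsigned long)bpf_prog, (unsigned long)mod_text);
}
```

Both requests come out of the same huge-page backed range, which is the
point of keeping a separate tree that does not assume PAGE_SIZE
granularity.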

Memory allocated by execmem_alloc() is set to RO+X before returning to the
caller. Therefore, the caller cannot write to the memory directly.
Instead, the caller is required to use execmem_fill() to update the memory.
For the safety and security of X memory, execmem_fill() checks that the
range being updated lies entirely within the memory allocated by a single
execmem_alloc() call. execmem_fill() uses a text_poke-like mechanism and
requires arch support. Specifically, the arch needs to implement
arch_fill_execmem().

In execmem_free(), the memory is first erased with
arch_invalidate_execmem(). Then, the memory is added to free_text_area_*.
If this free creates a big enough contiguous free space (at least one
aligned PMD_SIZE region), execmem_free() will try to free the backing
vm_struct.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 320 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 337 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..30aa8c187d40 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *execmem_alloc(unsigned long size, unsigned long align) __alloc_size(1);
+void *execmem_fill(void *dst, void *src, size_t len);
+void execmem_free(void *addr);
+void *arch_fill_execmem(void *dst, void *src, size_t len);
+int arch_invalidate_execmem(void *ptr, size_t len);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..e3039fd4f65b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+void execmem_free(void *addr) { }
+
 /*
  * sys_brk() for the most part doesn't need the global kernel
  * lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..6cc72c795ee5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#ifndef PMD_ALIGN
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#endif
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +774,38 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area for execmem_alloc()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augmented red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area in this tree are backed by RO+X memory, but they do
+ * not have valid vm pointer (because we need subtree_max_size).
+ * The vm for these vmap_area are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3350,289 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define EXEC_MEM_START MODULES_VADDR
+#define EXEC_MEM_END MODULES_END
+#else
+#define EXEC_MEM_START VMALLOC_START
+#define EXEC_MEM_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	spin_lock(&free_text_area_lock);
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * execmem_alloc - allocate virtually contiguous RO+X memory
+ * @size: allocation size
+ * @align: desired alignment
+ *
+ * This is used to allocate dynamic kernel text, such as module text and BPF
+ * programs. Callers must use execmem_fill() to update the returned memory.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+		/*
+		 * Not enough contiguous space in free_text_area_root, try to
+		 * allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, EXEC_MEM_START,
+					   EXEC_MEM_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr))
+			goto err_free_va;	/* free_text_area_lock already dropped */
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out:
+	spin_unlock(&free_text_area_lock);
+err_free_va:
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+
+void __weak *arch_fill_execmem(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_execmem(void *ptr, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * execmem_fill - Copy text to RO+X memory allocated by execmem_alloc()
+ * @dst: pointer to memory allocated by execmem_alloc()
+ * @src: pointer to data being copied from
+ * @len: number of bytes to be copied
+ *
+ * execmem_fill() will only update memory allocated by a single execmem_alloc()
+ * call. If dst + len goes beyond the boundary of one allocation,
+ * execmem_fill() is aborted.
+ *
+ * Return: result of arch_fill_execmem(), or ERR_PTR(-EINVAL) on a bad range.
+ */
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	struct vmap_area *va;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+	/*
+	 * If no va, or va has a vm attached, this memory is not allocated
+	 * by execmem_alloc().
+	 */
+	if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm))
+		goto err_out;
+	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+		goto err_out;
+
+	spin_unlock(&vmap_area_lock);
+
+	return arch_fill_execmem(dst, src, len);
+
+err_out:
+	spin_unlock(&vmap_area_lock);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * execmem_free - Release memory allocated by execmem_alloc()
+ * @addr: Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void execmem_free(void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* Invalidate text in the region */
+	arch_invalidate_execmem(addr, va->va_end - va->va_start);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free vm when the free space covers at least one
+	 * PMD_SIZE aligned, contiguous PMD_SIZE region.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm are ready to be freed
+	 * after one execmem_free(). But we free at most one vm for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size: allocation size