From patchwork Mon Apr 7 16:53:29 2025
X-Patchwork-Submitter: Toke Høiland-Jørgensen <toke@redhat.com>
X-Patchwork-Id: 14041337
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Mon, 07 Apr 2025 18:53:29 +0200
Subject: [PATCH net-next v8 2/2] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
Message-Id: <20250407-page-pool-track-dma-v8-2-da9500d4ba21@redhat.com>
References: <20250407-page-pool-track-dma-v8-0-da9500d4ba21@redhat.com>
In-Reply-To: <20250407-page-pool-track-dma-v8-0-da9500d4ba21@redhat.com>
Miller" , Jakub Kicinski , Jesper Dangaard Brouer , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Andrew Lunn , Eric Dumazet , Paolo Abeni , Ilias Apalodimas , Simon Horman , Andrew Morton , Mina Almasry , Yonglong Liu , Yunsheng Lin , Pavel Begunkov , Matthew Wilcox Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, linux-rdma@vger.kernel.org, linux-mm@kvack.org, =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= , Qiuling Ren , Yuying Ma X-Mailer: b4 0.14.2 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: k-ji6eiakX_stZpSWtFDSeJ4pkWKGs1ssYiuoNNgGM0_1744044922 X-Mimecast-Originator: redhat.com X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 945C81A000D X-Stat-Signature: etsrgcs4dbrc7unb1yiskn93y78s4nrz X-Rspam-User: X-HE-Tag: 1744044925-847496 X-HE-Meta: U2FsdGVkX19HiaYD2Xi2N+rL09dZaByCs3pL+drClcWMHFB4LPvwl+3j+XXZhd6Pn+Wyw4P8p3u1QoBNqwC738X6kq+0F+teWcuJa0eWJbXW9o1uYWQ2TYf/nizCKvItOAWNfyy/wXVnXrzs1zG4AOW2mJ5rE6eMCVJJ/qHlcRxrDBGnq+KQuYDfFYi2SFsO7JjrDv/PUeyV73pV7zuyIboa4P+hVUMKlsiYlINyNeqrF1BhwvF2A1cxaVG9lYka693SeVSv689VQG8wNubuN+cM3mnMU2ym8M4PtmR0dF5UgiSY5MicU0YadU0lolS4VfOmMeX5ILY9GtHrHrzHwmJ1d4qUX1KNujLzA2gvrpSd0aheOHL8hG9IUIeyb9QjbTqmh6WMZ6qQTO3CuL1s0hj7xQnLlHZCdbfJCRcW7u39N8zBfn2R3mz743t07HqAdM41c7nb7Fgw7sBBZo9NRa/3f1hBcktBMF19LAhqeZ6ve3PHP4AAzhvDsJhmzw50U24EwJNNWYev+kFc3cyciPVStvcQOfYtBe9AJfi8sDJhi38At3yjrxXhIKjOd7r0WF0ZVNDAG+Ylpn/xGfWI1P31FaqRb+7nWmv1EuGtgkgHpIFl7X13lKtfyL4hesiwlHK/OQnpICZxQI1mCZEyGabCSbm6V0272fbbW4zB8AoSB5lGGFNJQMHPPOG9ruHQVs6P1vXM5RTBz0H7tDAPginBN30tbsXmFAI/NmiNUx0ngA6LOqg+834Qy7zRo4c3h2OM47uOJQdJyVcgAofPrDniQ9rbWzvYJcWQUp7Ag1KAntzsdMX05awUOMMeBYZ2vSEhR3Ieggo3ClM/zlz6EjG7VuVa0VC8fOcQFmrk16mazPr8oM/6oAOn5M3rZoV3O+E7e2cvYX9NZWVUG8j7H5Unv/jFsaCA+tY3irEQ0vt0aZJDErvtVNGhlpw4nBEaxHzbBE6+OuR1I5Aw5IT V9CF3724 k8Djlqpi8cN4QEbeNGZkEwn0ouUqa47jGseivSCQmoSumI8DDtX6FnTXt39v/D4dFYWQGu8bCgWSR5YDtbinEv5gP3ByV7H+7hQ26vFxt2ye3eZ9YOeeolTKZ+BPvHSz53BYAUM4Myfx5CYxQr/ZnM7dLh0ehztFcICS7GTUCQRSWj6k9DjSxmDmTs3zHg1OzMhoakp+KbkUwIaFK94yfbztfGNCkrkY9Z1522nh+uhIlPnAHfSnHh1uyoJh/L+awU/oPgUwlNg0/NXjeFaeCTciVidkib8NJvhMLyHku+dPwbmXkjQw650E9rhGoHkz7vqmD1hZLME5ZCxPug+0fPLINkQd0eG4Myou/mEs9Wt9k5/4pKvluSiVk9ySeyf3LdPKcLH7HRvZCJtVpXDn7d6jBct3AwKdHAn4I9k5pZMQbaXMNQW847LukknWwi9sOtUU21OxL+CaOxeXyBGzpkmfGp0N0cHA09A3/Vlw5qFL+f7oifq2uc1dmEsnhl7fn5gKqgdgdhAxIbHhHDCyJk4XwZw+vW2O3nQ44cWVrLMmEP0bRw/XsjcDDJbFHXVBINbCPbosHMXC/ifPtf93zBFgyCyETtzAfwd/r+FlEC8qrBsc2V/2Lld4RNZMWkeLB2VhNmdN081Ag5zysk5VW4qnI3cCDBxJwR4MMUJylBNxF0RSbJlgCenjrQx4ntYwv/yTCZDhgdSqAK5I= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return. To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina[0], and turns out to be fairly straight forward: We simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. 
Then, when a page pool is torn down, it can simply walk the xarray and unmap
all pages still present there before returning, which also allows us to get
rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no
additional synchronisation is needed, as a page will only ever be unmapped
once.

To avoid having to walk the entire xarray on unmap to find the page reference,
we stash the ID assigned by xa_alloc() into the page structure itself, using
the upper bits of the pp_magic field. This requires a couple of defines to
avoid conflicting with the POINTER_POISON_DELTA define, but this is all
evaluated at compile-time, so does not affect run-time performance. The bitmap
calculations in this patch give the following number of bits for different
architectures (a worked re-derivation of these numbers follows below):

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Stashing a value into the unused bits of pp_magic does have the effect that it
can make the value stored there lie outside the unmappable range (as governed
by the mmap_min_addr sysctl), for architectures that don't define
ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased
to the pp_magic field (such as page->lru.next) is dereferenced while the page
is owned by page_pool, that could lead to a dereference into userspace, which
is a security concern. The risk of this is mitigated by the fact that (a) we
always clear pp_magic before releasing a page from page_pool, and (b) this
would need a use-after-free bug for struct page, which can have many other
risks since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that this
risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA map/unmap, no
additional code is needed in the fast path, meaning the performance overhead
of this tracking is negligible there. A micro-benchmark shows that the total
overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns;
sum for both map and unmap[2]). Since this cost is only paid on DMA map and
unmap, it seems like an acceptable cost to fix the late unmap issue. Further
optimisation can narrow the cases where this cost is paid (for instance by
eliding the tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a full node
incurs only 9 bytes of overhead per slot it tracks (in practice, it probably
won't be this efficient, but in any case it should be an acceptable overhead).
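The per-architecture bit counts listed above can be sanity-checked outside the
kernel. The small userspace program below re-derives them, using
__builtin_ctzll()/__builtin_clzll() in place of the kernel's __ffs()/__fls();
the poison values are taken from the x86-64 and PPC64 ILLEGAL_POINTER_VALUE
Kconfig defaults (0xdead000000000000 and 0x5deadbeef0000000). It is an
illustration only, not part of the patch.

/* Re-derivation of PP_DMA_INDEX_BITS for a few configurations. */
#include <stdio.h>
#include <stdint.h>

static unsigned int dma_index_bits(uint64_t poison_delta, unsigned int bits_per_long)
{
	/* PP_SIGNATURE - POISON_POINTER_DELTA is always 0x40 */
	uint64_t signature_base = 0x40;
	/* PP_DMA_INDEX_SHIFT = 1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA) */
	unsigned int shift = 1 + (63 - __builtin_clzll(signature_base));
	unsigned int bits;

	if (poison_delta)
		/* stay below the lowest set bit of the poison value */
		bits = __builtin_ctzll(poison_delta) - shift;
	else
		/* no poison value: leave the two topmost bits clear */
		bits = bits_per_long - shift - 2;

	return bits < 32 ? bits : 32;	/* MIN(32, ...) */
}

int main(void)
{
	printf("32-bit: %u bits\n", dma_index_bits(0, 32));                      /* 23 */
	printf("PPC64:  %u bits\n", dma_index_bits(0x5deadbeef0000000ULL, 64));  /* 21 */
	printf("x86-64: %u bits\n", dma_index_bits(0xdead000000000000ULL, 64));  /* 32 */
	return 0;
}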
[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com

Reported-by: Yonglong Liu
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry
Reviewed-by: Mina Almasry
Reviewed-by: Jesper Dangaard Brouer
Tested-by: Jesper Dangaard Brouer
Tested-by: Qiuling Ren
Tested-by: Yuying Ma
Tested-by: Yonglong Liu
Acked-by: Jesper Dangaard Brouer
Signed-off-by: Toke Høiland-Jørgensen
---
 include/linux/mm.h            | 46 +++++++++++++++++++++---
 include/linux/poison.h        |  4 +++
 include/net/page_pool/types.h |  6 ++++
 net/core/netmem_priv.h        | 28 ++++++++++++++-
 net/core/page_pool.c          | 81 ++++++++++++++++++++++++++++++++++++-------
 5 files changed, 147 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f9ef1634f75701ae0be146add1ea2c11beb6e48..6f1d835b1213845940c49f2dd591b7392abbb472 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4248,13 +4248,51 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 #define VM_SEALED_SYSMAP	VM_NONE
 #endif

+/*
+ * DMA mapping IDs for page_pool
+ *
+ * When DMA-mapping a page, page_pool allocates an ID (from an xarray) and
+ * stashes it in the upper bits of page->pp_magic. We always want to be able to
+ * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP
+ * pages can have arbitrary kernel pointers stored in the same field as pp_magic
+ * (since it overlaps with page->lru.next), so we must ensure that we cannot
+ * mistake a valid kernel pointer with any of the values we write into this
+ * field.
+ *
+ * On architectures that set POISON_POINTER_DELTA, this is already ensured,
+ * since this value becomes part of PP_SIGNATURE; meaning we can just use the
+ * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the
+ * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is
+ * 0, we make sure that we leave the two topmost bits empty, as that guarantees
+ * we won't mistake a valid kernel pointer for a value we set, regardless of the
+ * VMSPLIT setting.
+ *
+ * Altogether, this means that the number of bits available is constrained by
+ * the size of an unsigned long (at the upper end, subtracting two bits per the
+ * above), and the definition of PP_SIGNATURE (with or without
+ * POISON_POINTER_DELTA).
+ */
+#define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA))
+#if POISON_POINTER_DELTA > 0
+/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA
+ * index to not overlap with that if set
+ */
+#define PP_DMA_INDEX_BITS MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT)
+#else
+/* Always leave out the topmost two; see above. */
+#define PP_DMA_INDEX_BITS MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2)
+#endif
+
+#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
+				   PP_DMA_INDEX_SHIFT)
+
 /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
  * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page.
- * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
- * the pfmemalloc page.
+ * the head page of compound page and bit 1 for pfmemalloc page, as well as the
+ * bits used for the DMA index. page_is_pfmemalloc() is checked in
+ * __page_pool_put_page() to avoid recycling the pfmemalloc page.
  */
-#define PP_MAGIC_MASK ~0x3UL
+#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)

 #ifdef CONFIG_PAGE_POOL
 static inline bool page_pool_page_is_pp(struct page *page)
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 331a9a996fa8746626afa63ea462b85ca3e5938b..8ca2235f78d5d9c070ae816cfd57fe2984db5562 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -70,6 +70,10 @@
 #define KEY_DESTROY		0xbd

 /********** net/core/page_pool.c **********/
+/*
+ * page_pool uses additional free bits within this value to store data, see the
+ * definition of PP_DMA_INDEX_MASK in mm.h
+ */
 #define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)

 /********** net/core/skbuff.c **********/
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 31e6c5c6724b1cffbf5ad2535b3eaee5dec54d9d..b201e7d81dc493b0ec9ccbe73a7e6b2aa55af514 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -6,6 +6,7 @@
 #include <linux/dma-direction.h>
 #include <linux/ptr_ring.h>
 #include <linux/types.h>
+#include <linux/xarray.h>
 #include <net/netmem.h>

 #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
@@ -33,6 +34,9 @@
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
 				 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)

+/* Index limit to stay within PP_DMA_INDEX_BITS for DMA indices */
+#define PP_DMA_INDEX_LIMIT	XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -221,6 +225,8 @@ struct page_pool {
 	void *mp_priv;
 	const struct memory_provider_ops *mp_ops;

+	struct xarray dma_mapped;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index f33162fd281c23e109273ba09950c5d0a2829bc9..cd95394399b40c3604934ba7898eeeeacb8aee99 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -5,7 +5,7 @@

 static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp_magic;
+	return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
 }

 static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
@@ -15,6 +15,8 @@ static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)

 static inline void netmem_clear_pp_magic(netmem_ref netmem)
 {
+	WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
+
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }

@@ -33,4 +35,28 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
 {
 	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
 }
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return 0;
+
+	magic = __netmem_clear_lsb(netmem)->pp_magic;
+
+	return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+					unsigned long id)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return;
+
+	magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT);
+	__netmem_clear_lsb(netmem)->pp_magic = magic;
+}
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7745ad924ae2d801580a6760eba9393e1cf67b01..2b7684865941854660d32b8d1bb00a72fb550563 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -276,8 +276,7 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);

-	if (pool->dma_map)
-		get_device(pool->p.dev);
+	xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);

 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		netdev_assert_locked(pool->slow.netdev);
@@ -320,9 +319,7 @@ static int page_pool_init(struct page_pool *pool,
 static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
-
-	if (pool->dma_map)
-		put_device(pool->p.dev);
+	xa_destroy(&pool->dma_mapped);

 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
@@ -463,13 +460,21 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 					  netmem_ref netmem,
 					  u32 dma_sync_size)
 {
-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
-		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) {
+		rcu_read_lock();
+		/* re-check under rcu_read_lock() to sync with page_pool_scrub() */
+		if (pool->dma_sync)
+			__page_pool_dma_sync_for_device(pool, netmem,
+							dma_sync_size);
+		rcu_read_unlock();
+	}
 }

-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
 {
 	dma_addr_t dma;
+	int err;
+	u32 id;

 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -483,15 +488,30 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;

-	if (page_pool_set_dma_addr_netmem(netmem, dma))
+	if (page_pool_set_dma_addr_netmem(netmem, dma)) {
+		WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
 		goto unmap_failed;
+	}

+	if (in_softirq())
+		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+			       PP_DMA_INDEX_LIMIT, gfp);
+	else
+		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+				  PP_DMA_INDEX_LIMIT, gfp);
+	if (err) {
+		WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@");
+		goto unset_failed;
+	}
+
+	netmem_set_dma_index(netmem, id);
 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);

 	return true;

+unset_failed:
+	page_pool_set_dma_addr_netmem(netmem, 0);
 unmap_failed:
-	WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
 	dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order,
 			     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC |
 			     DMA_ATTR_WEAK_ORDERING);
@@ -508,7 +528,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;

-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
 		put_page(page);
 		return NULL;
 	}
@@ -554,7 +574,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
-		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
@@ -656,6 +676,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							  netmem_ref netmem)
 {
+	struct page *old, *page = netmem_to_page(netmem);
+	unsigned long id;
 	dma_addr_t dma;

 	if (!pool->dma_map)
@@ -664,6 +686,17 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 		 */
 		return;

+	id = netmem_get_dma_index(netmem);
+	if (!id)
+		return;
+
+	if (in_softirq())
+		old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+	else
+		old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+	if (old != page)
+		return;
+
 	dma = page_pool_get_dma_addr_netmem(netmem);

 	/* When page is unmapped, it cannot be returned to our pool */
@@ -671,6 +704,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
 	page_pool_set_dma_addr_netmem(netmem, 0);
+	netmem_set_dma_index(netmem, 0);
 }

 /* Disconnects a page (from a page_pool). API users can have a need
@@ -1080,8 +1114,29 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)

 static void page_pool_scrub(struct page_pool *pool)
 {
+	unsigned long id;
+	void *ptr;
+
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	if (!pool->destroy_cnt++ && pool->dma_map) {
+		if (pool->dma_sync) {
+			/* Disable page_pool_dma_sync_for_device() */
+			pool->dma_sync = false;
+
+			/* Make sure all concurrent returns that may see the old
+			 * value of dma_sync (and thus perform a sync) have
+			 * finished before doing the unmapping below. Skip the
+			 * wait if the device doesn't actually need syncing, or
+			 * if there are no outstanding mapped pages.
+			 */
+			if (dma_dev_need_sync(pool->p.dev) &&
+			    !xa_empty(&pool->dma_mapped))
+				synchronize_net();
+		}
+
+		xa_for_each(&pool->dma_mapped, id, ptr)
+			__page_pool_release_page_dma(pool, page_to_netmem(ptr));
+	}

 	/* No more consumers should exist, but producers could still
 	 * be in-flight.