From patchwork Tue Dec 19 21:03:38 2023
X-Patchwork-Id: 13499071
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper
Date: Tue, 19 Dec 2023 13:03:38 -0800
Message-Id: <20231219210357.4029713-2-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

The final version will depend on what ppiov ends up looking like; add a
convenience helper for now.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/helpers.h | 5 +++++
 net/core/page_pool.c            | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 95f4d579cbc4..92804c499833 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -86,6 +86,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 
 /* page_pool_iov support */
 
+static inline struct page *page_pool_mangle_ppiov(struct page_pool_iov *ppiov)
+{
+	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+}
+
 static inline struct dmabuf_genpool_chunk_owner *
 page_pool_iov_owner(const struct page_pool_iov *ppiov)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c0bc62ee77c6..38eff947f679 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1074,7 +1074,7 @@ static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, (struct page *)ppiov,
 				   pool->pages_state_hold_cnt);
-	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+	return page_pool_mangle_ppiov(ppiov);
 }
 
 static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
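[Editor's illustration, not part of the patch: the helper only tags a low
pointer bit (PP_DEVMEM), so a matching "unmangle" step is implied on the
consumer side; the series uses page_is_page_pool_iov()/page_to_page_pool_iov()
for that. A minimal sketch of the round trip, with hypothetical names:]

/* sketch only: how a PP_DEVMEM-tagged "page" pointer round-trips */
static inline bool example_is_ppiov(struct page *page)
{
	return (unsigned long)page & PP_DEVMEM;
}

static inline struct page_pool_iov *example_to_ppiov(struct page *page)
{
	return (struct page_pool_iov *)((unsigned long)page & ~PP_DEVMEM);
}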
From patchwork Tue Dec 19 21:03:39 2023
X-Patchwork-Id: 13499072
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
Date: Tue, 19 Dec 2023 13:03:39 -0800
Message-Id: <20231219210357.4029713-3-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

There will be more users of struct page_pool_iov, and ppiovs from one
subsystem must not be used by another. That should never happen for any
sane application, but we need to enforce it in case of bugs and/or
malicious users.
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 net/ipv4/tcp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33a8bb63fbf5..9c6b18eebb5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
 		}
 
 		ppiov = skb_frag_page_pool_iov(frag);
+
+		/* Disallow non devmem owned buffers */
+		if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
+			err = -ENODEV;
+			goto out;
+		}
+
 		end = start + skb_frag_size(frag);
 		copy = end - offset;
From patchwork Tue Dec 19 21:03:40 2023
X-Patchwork-Id: 13499074
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle
Date: Tue, 19 Dec 2023 13:03:40 -0800
Message-Id: <20231219210357.4029713-4-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

The final version will depend on what the ppiov infrastructure ends up
looking like.

The page pool tracks how many pages were allocated and returned, which
serves for refcounting the pool, so every page/frag allocated should
eventually come back to the page pool through the appropriate paths,
e.g. by calling page_pool_put_page().

For normal page pools (i.e. without memory providers attached), it's
fine to return a page while it's still refcounted by something in the
stack; in that case we "detach" the page from the pool and rely on the
page refcount for it to return back to the kernel.

Memory providers are different, at least ppiov-based ones: they need all
their buffers to eventually come back. So, apart from custom pp
->release handlers, we catch the moment someone puts down a ppiov and
call its memory provider to handle it, i.e. __page_pool_iov_free().

The first problem is that __page_pool_iov_free() hard-codes devmem
handling, while other providers need a flexible way to specify their own
callbacks.

The second problem is that it doesn't go through the generic page pool
paths and so can't do the pp accounting mentioned above. We can't even
safely rely on page_pool_put_page() having been called earlier to do the
pp refcounting, because by then the page pool might already be destroyed
and ppiov->pp would point to garbage.

The solution is to make the pp ->release callback responsible for
properly recycling its buffers, e.g. calling what used to be
__page_pool_iov_free() in the devmem case. page_pool_iov_put_many() now
returns buffers to the page pool.
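[Editor's sketch of the resulting put path, not part of the patch; the
names follow the diff below but the control flow is condensed:]

/* simplified flow after this change (illustration only) */
static inline void sketch_ppiov_put(struct page_pool_iov *ppiov)
{
	/* page_pool_iov_put_many(ppiov, 1) now funnels into the pool: */
	page_pool_put_defragged_page(ppiov->pp, page_pool_mangle_ppiov(ppiov),
				     -1, false);
	/*
	 * ... which reaches page_pool_return_page() and then the
	 * provider's ->release_page(). For devmem that is
	 * mp_dmabuf_devmem_release_page(): it drops the final reference,
	 * calls netdev_free_devmem(), and returns true so the pool can do
	 * its release accounting.
	 */
}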
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/page_pool/helpers.h | 15 ++++++++--- net/core/page_pool.c | 46 +++++++++++++++++---------------- 2 files changed, 35 insertions(+), 26 deletions(-) diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index 92804c499833..ef380ee8f205 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -137,15 +137,22 @@ static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov, refcount_add(count, &ppiov->refcount); } -void __page_pool_iov_free(struct page_pool_iov *ppiov); +static inline bool page_pool_iov_sub_and_test(struct page_pool_iov *ppiov, + unsigned int count) +{ + return refcount_sub_and_test(count, &ppiov->refcount); +} static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov, unsigned int count) { - if (!refcount_sub_and_test(count, &ppiov->refcount)) - return; + if (count > 1) + WARN_ON_ONCE(page_pool_iov_sub_and_test(ppiov, count - 1)); - __page_pool_iov_free(ppiov); +#ifdef CONFIG_PAGE_POOL + page_pool_put_defragged_page(ppiov->pp, page_pool_mangle_ppiov(ppiov), + -1, false); +#endif } /* page pool mm helpers */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 38eff947f679..ecf90a1ccabe 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -599,6 +599,16 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) page_pool_set_dma_addr(page, 0); } +static void page_pool_return_provider(struct page_pool *pool, struct page *page) +{ + int count; + + if (pool->mp_ops->release_page(pool, page)) { + count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); + trace_page_pool_state_release(pool, page, count); + } +} + /* Disconnects a page (from a page_pool). API users can have a need * to disconnect a page (from a page_pool), to allow it to be used as * a regular page (that will eventually be returned to the normal @@ -607,13 +617,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) void page_pool_return_page(struct page_pool *pool, struct page *page) { int count; - bool put; - put = true; - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - put = pool->mp_ops->release_page(pool, page); - else - __page_pool_release_page_dma(pool, page); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) { + page_pool_return_provider(pool, page); + return; + } + + __page_pool_release_page_dma(pool, page); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -621,10 +631,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - if (put) { - page_pool_clear_pp_info(page); - put_page(page); - } + page_pool_clear_pp_info(page); + put_page(page); /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). 
@@ -1034,15 +1042,6 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-void __page_pool_iov_free(struct page_pool_iov *ppiov)
-{
-	if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
-		return;
-
-	netdev_free_devmem(ppiov);
-}
-EXPORT_SYMBOL_GPL(__page_pool_iov_free);
-
 /*** "Dmabuf devmem memory provider" ***/
 
 static int mp_dmabuf_devmem_init(struct page_pool *pool)
@@ -1093,9 +1092,12 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
 		return false;
 
 	ppiov = page_to_page_pool_iov(page);
-	page_pool_iov_put_many(ppiov, 1);
-	/* We don't want the page pool put_page()ing our page_pool_iovs. */
-	return false;
+
+	if (!page_pool_iov_sub_and_test(ppiov, 1))
+		return false;
+	netdev_free_devmem(ppiov);
+	/* tell page_pool that the ppiov is released */
+	return true;
 }
 
 const struct pp_memory_provider_ops dmabuf_devmem_ops = {
From patchwork Tue Dec 19 21:03:41 2023
X-Patchwork-Id: 13499073
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov
Date: Tue, 19 Dec 2023 13:03:41 -0800
Message-Id: <20231219210357.4029713-5-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

Teach napi_pp_put_page() how to work with ppiov.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/helpers.h |  2 +-
 net/core/page_pool.c            |  3 ---
 net/core/skbuff.c               | 28 ++++++++++++++++------------
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index ef380ee8f205..aca3a52d0e22 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -381,7 +381,7 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
 	long ret;
 
 	if (page_is_page_pool_iov(page))
-		return -EINVAL;
+		return 0;
 
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ecf90a1ccabe..71af9835638e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -922,9 +922,6 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	struct page *page;
 
-	if (pool->destroy_cnt)
-		return;
-
 	/* Empty alloc cache, assume caller made sure this is
 	 * no-longer in use, and page_pool_alloc_pages() cannot be
 	 * call concurrently.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f44c53b0ca27..cf523d655f92 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -896,19 +896,23 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
 	bool allow_direct = false;
 	struct page_pool *pp;
 
-	page = compound_head(page);
-
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
+	if (page_is_page_pool_iov(page)) {
+		pp = page_to_page_pool_iov(page)->pp;
+	} else {
+		page = compound_head(page);
+
+		/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+		 * in order to preserve any existing bits, such as bit 0 for the
+		 * head page of compound page and bit 1 for pfmemalloc page, so
+		 * mask those bits for freeing side when doing below checking,
+		 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+		 * to avoid recycling the pfmemalloc page.
+		 */
+		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+			return false;
 
-	pp = page->pp;
+		pp = page->pp;
+	}
 
 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
From patchwork Tue Dec 19 21:03:42 2023
X-Patchwork-Id: 13499075
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback
Date: Tue, 19 Dec 2023 13:03:42 -0800
Message-Id: <20231219210357.4029713-6-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

The page pool now waits for all ppiovs to return before destroying
itself, and for that to happen the memory provider might need to push
out some buffers, flush caches and so on.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index a701310b9811..fd846cac9fb6 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -134,6 +134,7 @@ enum pp_memory_provider_type {
 struct pp_memory_provider_ops {
 	int (*init)(struct page_pool *pool);
 	void (*destroy)(struct page_pool *pool);
+	void (*scrub)(struct page_pool *pool);
 	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
 	bool (*release_page)(struct page_pool *pool, struct page *page);
 };
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 71af9835638e..9e3073d61a97 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -947,6 +947,8 @@ static int page_pool_release(struct page_pool *pool)
 {
 	int inflight;
 
+	if (pool->mp_ops && pool->mp_ops->scrub)
+		pool->mp_ops->scrub(pool);
 	page_pool_scrub(pool);
 	inflight = page_pool_inflight(pool);
 	if (!inflight)
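[Editor's sketch, not part of the patch: what a provider implementing
->scrub might look like. The io_uring provider that actually uses this
comes later in the series; the "example_mp" names are hypothetical and
the use of pool->mp_priv for provider state is an assumption.]

/* illustration: flush provider-side caches when the pool is torn down */
static void example_mp_scrub(struct page_pool *pool)
{
	struct example_mp *mp = pool->mp_priv;	/* hypothetical state */
	struct page_pool_iov *ppiov;

	/* push back every buffer still sitting in provider caches */
	while ((ppiov = example_mp_cache_pop(mp)))
		page_pool_iov_put_many(ppiov, 1);
}

static const struct pp_memory_provider_ops example_mp_ops = {
	.init		= example_mp_init,
	.destroy	= example_mp_destroy,
	.scrub		= example_mp_scrub,
	.alloc_pages	= example_mp_alloc_pages,
	.release_page	= example_mp_release_page,
};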
From patchwork Tue Dec 19 21:03:43 2023
X-Patchwork-Id: 13499076
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 06/20] io_uring: separate header for exported net bits
Date: Tue, 19 Dec 2023 13:03:43 -0800
Message-Id: <20231219210357.4029713-7-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: Pavel Begunkov

We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
Reviewed-by: Jens Axboe
---
 include/linux/io_uring.h     |  6 ------
 include/linux/io_uring/net.h | 18 ++++++++++++++++++
 io_uring/uring_cmd.c         |  1 +
 net/socket.c                 |  2 +-
 4 files changed, 20 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/io_uring/net.h

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index d8fc93492dc5..88d9aae7681b 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -12,7 +12,6 @@ void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
 const char *io_uring_get_opcode(u8 opcode);
-int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static inline void io_uring_files_cancel(void)
 {
@@ -49,11 +48,6 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
 	return "";
 }
-static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
-					unsigned int issue_flags)
-{
-	return -EOPNOTSUPP;
-}
 #endif
 
 #endif
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
new file mode 100644
index 000000000000..b58f39fed4d5
--- /dev/null
+++ b/include/linux/io_uring/net.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_NET_H
+#define _LINUX_IO_URING_NET_H
+
+struct io_uring_cmd;
+
+#if defined(CONFIG_IO_URING)
+int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
+
+#else
+static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
+					unsigned int issue_flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 34030583b9b2..c98749eff5ce 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
diff --git a/net/socket.c b/net/socket.c
index 3379c64217a4..d75246450a3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -88,7 +88,7 @@
 #include
 #include
 #include
-#include
+#include
 
 #include
 #include
From patchwork Tue Dec 19 21:03:44 2023
X-Patchwork-Id: 13499077
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 07/20] io_uring: add interface queue
Date: Tue, 19 Dec 2023 13:03:44 -0800
Message-Id: <20231219210357.4029713-8-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: David Wei

This patch introduces a new object in io_uring called an interface queue
(ifq) which contains:

* A pool region allocated by userspace and registered w/ io_uring where
  Rx data is written to.
* A net device and one specific Rx queue in it that will be configured
  for ZC Rx.
* A pair of shared ringbuffers w/ userspace, dubbed registered buf
  (rbuf) rings. Each entry contains a pool region id and an offset + len
  within that region. The kernel writes entries into the completion ring
  to tell userspace where Rx data is relative to the start of a region.
  Userspace writes entries into the refill ring to tell the kernel when
  it is done with the data.

For now, each io_uring instance has a single ifq, and each ifq has a
single pool region associated with one Rx queue.

Add a new opcode to io_uring_register that sets up an ifq. Size and
offsets of the shared ringbuffers are returned to userspace for it to
mmap. The implementation will be added in a later patch.
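[Editor's sketch, not part of the patch: how userspace might drive the
new registration opcode via a raw io_uring_register(2) call (no liburing
helper is assumed). The queue/region ids and ring sizes are illustrative;
ring_fd must be an io_uring created with IORING_SETUP_DEFER_TASKRUN, as
required by the implementation below.]

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int register_zc_rx_ifq(int ring_fd, unsigned ifindex, unsigned rxq)
{
	struct io_uring_zc_rx_ifq_reg reg;

	memset(&reg, 0, sizeof(reg));
	reg.if_idx = ifindex;		/* netdev to attach to */
	reg.if_rxq_id = rxq;		/* hw rx queue to convert to ZC */
	reg.region_id = 0;		/* single pool region for now */
	reg.rq_entries = 256;		/* refill ring size */
	reg.cq_entries = 256;		/* completion ring size */

	if (syscall(__NR_io_uring_register, ring_fd,
		    IORING_REGISTER_ZC_RX_IFQ, &reg, 1) < 0)
		return -1;

	/* reg.mmap_sz, reg.rq_off and reg.cq_off are filled in by the
	 * kernel (wired up in the following mmap patch) */
	return 0;
}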
Signed-off-by: David Wei --- include/linux/io_uring_types.h | 8 +++ include/uapi/linux/io_uring.h | 51 +++++++++++++++ io_uring/Makefile | 2 +- io_uring/io_uring.c | 13 ++++ io_uring/zc_rx.c | 116 +++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 37 +++++++++++ 6 files changed, 226 insertions(+), 1 deletion(-) create mode 100644 io_uring/zc_rx.c create mode 100644 io_uring/zc_rx.h diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index bebab36abce8..e87053b200f2 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -38,6 +38,8 @@ enum io_uring_cmd_flags { IO_URING_F_COMPAT = (1 << 12), }; +struct io_zc_rx_ifq; + struct io_wq_work_node { struct io_wq_work_node *next; }; @@ -182,6 +184,10 @@ struct io_rings { struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp; }; +struct io_rbuf_ring { + struct io_uring rq, cq; +}; + struct io_restriction { DECLARE_BITMAP(register_op, IORING_REGISTER_LAST); DECLARE_BITMAP(sqe_op, IORING_OP_LAST); @@ -383,6 +389,8 @@ struct io_ring_ctx { struct io_rsrc_data *file_data; struct io_rsrc_data *buf_data; + struct io_zc_rx_ifq *ifq; + /* protected by ->uring_lock */ struct list_head rsrc_ref_list; struct io_alloc_cache rsrc_node_cache; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f1c16f817742..024a6f79323b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -558,6 +558,9 @@ enum { /* register a range of fixed file slots for automatic slot allocation */ IORING_REGISTER_FILE_ALLOC_RANGE = 25, + /* register a network interface queue for zerocopy */ + IORING_REGISTER_ZC_RX_IFQ = 26, + /* this goes last */ IORING_REGISTER_LAST, @@ -750,6 +753,54 @@ enum { SOCKET_URING_OP_SETSOCKOPT, }; +struct io_uring_rbuf_rqe { + __u32 off; + __u32 len; + __u16 region; + __u8 __pad[6]; +}; + +struct io_uring_rbuf_cqe { + __u32 off; + __u32 len; + __u16 region; + __u8 sock; + __u8 flags; + __u8 __pad[2]; +}; + +struct io_rbuf_rqring_offsets { + __u32 head; + __u32 tail; + __u32 rqes; + __u8 __pad[4]; +}; + +struct io_rbuf_cqring_offsets { + __u32 head; + __u32 tail; + __u32 cqes; + __u8 __pad[4]; +}; + +/* + * Argument for IORING_REGISTER_ZC_RX_IFQ + */ +struct io_uring_zc_rx_ifq_reg { + __u32 if_idx; + /* hw rx descriptor ring id */ + __u32 if_rxq_id; + __u32 region_id; + __u32 rq_entries; + __u32 cq_entries; + __u32 flags; + __u16 cpu; + + __u32 mmap_sz; + struct io_rbuf_rqring_offsets rq_off; + struct io_rbuf_cqring_offsets cq_off; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/Makefile b/io_uring/Makefile index e5be47e4fc3b..6c4b4ed37a1f 100644 --- a/io_uring/Makefile +++ b/io_uring/Makefile @@ -8,6 +8,6 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \ statx.o net.o msg_ring.o timeout.o \ sqpoll.o fdinfo.o tctx.o poll.o \ cancel.o kbuf.o rsrc.o rw.o opdef.o \ - notif.o waitid.o + notif.o waitid.o zc_rx.o obj-$(CONFIG_IO_WQ) += io-wq.o obj-$(CONFIG_FUTEX) += futex.o diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 1d254f2c997d..7fff01d57e9e 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -95,6 +95,7 @@ #include "notif.h" #include "waitid.h" #include "futex.h" +#include "zc_rx.h" #include "timeout.h" #include "poll.h" @@ -2919,6 +2920,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) return; mutex_lock(&ctx->uring_lock); + io_unregister_zc_rx_ifqs(ctx); if (ctx->buf_data) __io_sqe_buffers_unregister(ctx); if (ctx->file_data) @@ -3109,6 +3111,11 @@ static __cold void 
io_ring_exit_work(struct work_struct *work) io_cqring_overflow_kill(ctx); mutex_unlock(&ctx->uring_lock); } + if (ctx->ifq) { + mutex_lock(&ctx->uring_lock); + io_shutdown_zc_rx_ifqs(ctx); + mutex_unlock(&ctx->uring_lock); + } if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) io_move_task_work_from_local(ctx); @@ -4609,6 +4616,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_file_alloc_range(ctx, arg); break; + case IORING_REGISTER_ZC_RX_IFQ: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_ifq(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c new file mode 100644 index 000000000000..5fc94cad5e3a --- /dev/null +++ b/io_uring/zc_rx.c @@ -0,0 +1,116 @@ +// SPDX-License-Identifier: GPL-2.0 +#if defined(CONFIG_PAGE_POOL) +#include +#include +#include +#include + +#include + +#include "io_uring.h" +#include "kbuf.h" +#include "zc_rx.h" + +static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, + struct io_uring_zc_rx_ifq_reg *reg) +{ + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP; + size_t off, size, rq_size, cq_size; + void *ptr; + + off = sizeof(struct io_rbuf_ring); + rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe); + cq_size = reg->cq_entries * sizeof(struct io_uring_rbuf_cqe); + size = off + rq_size + cq_size; + ptr = (void *) __get_free_pages(gfp, get_order(size)); + if (!ptr) + return -ENOMEM; + ifq->ring = (struct io_rbuf_ring *)ptr; + ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off); + ifq->cqes = (struct io_uring_rbuf_cqe *)((char *)ifq->rqes + rq_size); + return 0; +} + +static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq) +{ + if (ifq->ring) + folio_put(virt_to_folio(ifq->ring)); +} + +static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq; + + ifq = kzalloc(sizeof(*ifq), GFP_KERNEL); + if (!ifq) + return NULL; + + ifq->if_rxq_id = -1; + ifq->ctx = ctx; + return ifq; +} + +static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +{ + io_free_rbuf_ring(ifq); + kfree(ifq); +} + +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg) +{ + struct io_uring_zc_rx_ifq_reg reg; + struct io_zc_rx_ifq *ifq; + int ret; + + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) + return -EINVAL; + if (copy_from_user(®, arg, sizeof(reg))) + return -EFAULT; + if (ctx->ifq) + return -EBUSY; + if (reg.if_rxq_id == -1) + return -EINVAL; + + ifq = io_zc_rx_ifq_alloc(ctx); + if (!ifq) + return -ENOMEM; + + /* TODO: initialise network interface */ + + ret = io_allocate_rbuf_ring(ifq, ®); + if (ret) + goto err; + + /* TODO: map zc region and initialise zc pool */ + + ifq->rq_entries = reg.rq_entries; + ifq->cq_entries = reg.cq_entries; + ifq->if_rxq_id = reg.if_rxq_id; + ctx->ifq = ifq; + + return 0; +err: + io_zc_rx_ifq_free(ifq); + return ret; +} + +void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq = ctx->ifq; + + lockdep_assert_held(&ctx->uring_lock); + + if (!ifq) + return; + + ctx->ifq = NULL; + io_zc_rx_ifq_free(ifq); +} + +void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx) +{ + lockdep_assert_held(&ctx->uring_lock); +} + +#endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h new file mode 100644 index 000000000000..aab57c1a4c5d --- /dev/null +++ b/io_uring/zc_rx.h @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0 +#ifndef IOU_ZC_RX_H +#define IOU_ZC_RX_H + +struct io_zc_rx_ifq { + struct 
io_ring_ctx *ctx;
+	struct net_device *dev;
+	struct io_rbuf_ring *ring;
+	struct io_uring_rbuf_rqe *rqes;
+	struct io_uring_rbuf_cqe *cqes;
+	u32 rq_entries;
+	u32 cq_entries;
+
+	/* hw rx descriptor ring id */
+	u32 if_rxq_id;
+};
+
+#if defined(CONFIG_PAGE_POOL)
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg);
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+					struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
From patchwork Tue Dec 19 21:03:45 2023
X-Patchwork-Id: 13499078
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer,
    David Ahern, Mina Almasry
Subject: [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers
Date: Tue, 19 Dec 2023 13:03:45 -0800
Message-Id: <20231219210357.4029713-9-dw@davidwei.uk>
In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk>

From: David Wei

This patch adds mmap support for ifq rbuf rings. There are two rings and
a struct io_rbuf_ring that contains the head and tail ptrs into each
ring.

Just like the io_uring SQ/CQ rings, userspace issues a single mmap call
using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. An opaque
ptr is returned to userspace, which is then expected to use the offsets
returned in the registration struct to get access to the head/tail and
rings.

Signed-off-by: David Wei
Reviewed-by: Jens Axboe
---
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/io_uring.c           |  5 +++++
 io_uring/zc_rx.c              | 19 ++++++++++++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024a6f79323b..839933e562e6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -428,6 +428,8 @@ enum {
 #define IORING_OFF_PBUF_RING		0x80000000ULL
 #define IORING_OFF_PBUF_SHIFT		16
 #define IORING_OFF_MMAP_MASK		0xf8000000ULL
+#define IORING_OFF_RBUF_RING		0x20000000ULL
+#define IORING_OFF_RBUF_SHIFT		16
 
 /*
  * Filled with the offset for mmap(2)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7fff01d57e9e..02d6d638bd65 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3516,6 +3516,11 @@ static void *io_uring_validate_mmap_request(struct file *file,
 			return ERR_PTR(-EINVAL);
 		break;
 		}
+	case IORING_OFF_RBUF_RING:
+		if (!ctx->ifq)
+			return ERR_PTR(-EINVAL);
+		ptr = ctx->ifq->ring;
+		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 5fc94cad5e3a..7e3e6f6d446b 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -61,6 +61,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 {
 	struct io_uring_zc_rx_ifq_reg reg;
 	struct io_zc_rx_ifq *ifq;
+	size_t ring_sz, rqes_sz, cqes_sz;
 	int ret;
 
 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
@@ -87,8 +88,24 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	ifq->rq_entries = reg.rq_entries;
 	ifq->cq_entries = reg.cq_entries;
 	ifq->if_rxq_id = reg.if_rxq_id;
-	ctx->ifq = ifq;
 
+	ring_sz = sizeof(struct io_rbuf_ring);
+	rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
+	cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries;
+	reg.mmap_sz = ring_sz + rqes_sz + cqes_sz;
+	reg.rq_off.rqes = ring_sz;
+	reg.cq_off.cqes = ring_sz + rqes_sz;
+	reg.rq_off.head = offsetof(struct io_rbuf_ring, rq.head);
+	reg.rq_off.tail = offsetof(struct io_rbuf_ring, rq.tail);
+	reg.cq_off.head = offsetof(struct io_rbuf_ring, cq.head);
+	reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail);
+
+	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	ctx->ifq = ifq;
 	return 0;
 err:
 	io_zc_rx_ifq_free(ifq);
 	return ret;
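[Editor's sketch, not part of the patch: the userspace side described in
the commit message, i.e. one mmap() at IORING_OFF_RBUF_RING and then
locating heads/tails and ring entries via the offsets returned by
registration. Assumes "reg" was filled in by IORING_REGISTER_ZC_RX_IFQ.]

#include <stdint.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

struct rbuf_rings {
	uint32_t *rq_head, *rq_tail;
	uint32_t *cq_head, *cq_tail;
	struct io_uring_rbuf_rqe *rqes;
	struct io_uring_rbuf_cqe *cqes;
};

static int map_rbuf_rings(int ring_fd, struct io_uring_zc_rx_ifq_reg *reg,
			  struct rbuf_rings *r)
{
	char *ptr = mmap(NULL, reg->mmap_sz, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_POPULATE, ring_fd,
			 IORING_OFF_RBUF_RING);
	if (ptr == MAP_FAILED)
		return -1;

	r->rq_head = (uint32_t *)(ptr + reg->rq_off.head);
	r->rq_tail = (uint32_t *)(ptr + reg->rq_off.tail);
	r->cq_head = (uint32_t *)(ptr + reg->cq_off.head);
	r->cq_tail = (uint32_t *)(ptr + reg->cq_off.tail);
	r->rqes = (struct io_uring_rbuf_rqe *)(ptr + reg->rq_off.rqes);
	r->cqes = (struct io_uring_rbuf_cqe *)(ptr + reg->cq_off.cqes);
	return 0;
}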
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command Date: Tue, 19 Dec 2023 13:03:46 -0800 Message-Id: <20231219210357.4029713-10-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: David Wei RFC ONLY, NOT FOR UPSTREAM This will be replaced with a separate ndo callback or some other mechanism in next patchset revisions. This patch adds a new XDP_SETUP_ZC_RX command that will be used in a later patch to enable or disable ZC RX for a specific RX queue. We are open to suggestions on a better way of doing this. Google's TCP devmem proposal sets up struct netdev_rx_queue which persists across device reset, then expects userspace to use an out-of-band method (e.g. ethtool) to reset the device, thus re-filling a hardware Rx queue. Signed-off-by: David Wei --- include/linux/netdevice.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a4bdc35c7d6f..5b4df0b6a6c0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1097,6 +1097,7 @@ enum bpf_netdev_command { BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, XDP_SETUP_XSK_POOL, + XDP_SETUP_ZC_RX, }; struct bpf_prog_offload_ops; @@ -1135,6 +1136,11 @@ struct netdev_bpf { struct xsk_buff_pool *pool; u16 queue_id; } xsk; + /* XDP_SETUP_ZC_RX */ + struct { + struct io_zc_rx_ifq *ifq; + u16 queue_id; + } zc_rx; }; }; From patchwork Tue Dec 19 21:03:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499080 Received: from mail-pj1-f53.google.com (mail-pj1-f53.google.com [209.85.216.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 680733D0C6 for ; Tue, 19 Dec 2023 21:04:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="gQnYYc+4" Received: by mail-pj1-f53.google.com with SMTP id 98e67ed59e1d1-28bd09e35e8so73272a91.0 for ; Tue, 19 Dec 2023 13:04:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019854; x=1703624654; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=wAMvX5PF2J42C1cn2NekqDrZ68M45niiPpeKzGsshQo=; b=gQnYYc+4cUb/LyvXOAYXV4CcWCkXGxnl1WiRbq+UPsXTpqZhsq3ebvzMqkHRlvHdXG CLGIPmwqIN0hCAHiYhYI+GoDZMfRyfO2IEqSvLktGi1NYoyBnD8AviV1J7iKGN5aTWMr 9iJNSzzaOiJnC1BJuFDYezq88uHMftcv/ERbai4AO1KE8zz9bni9ni9vuToxqw2824cz Z9sM9Vqk5fD5J1sHLTq9iVCu8t0VhM1Gwsyu/JB9+9KLUgC/YVfudx4hdjr50OYuQoki kwVcIuP6tnBNmeW03YEeXH8qhq7JuhTTdglim1i1JJWXNQxUN8NV6JkneykYm2DAefG0 Akyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019854; x=1703624654; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=wAMvX5PF2J42C1cn2NekqDrZ68M45niiPpeKzGsshQo=; b=oa4UciaFaM6idrjgfDISnbgb/yefk4Bobmz+RWmMjfQsLk/37bN+aQsZ4ZrV8ehkaE ZGhv7Il5deBepQykVy2nAIlxMWWLlIOvO0GamlJ+KTlQ2PGjSh+9n35IKPxi/uF94XVp vRIr1CqO3kkH13VA/CDVeMFW9hQDyg5Q1hkgQlivTCTWY//2qtgAlwreQ9YqRdZ1dBMH Ez5fw411+8lLzzN1suh6qygzXTFZbvmKUqKH4Tw+B77qWwN4bXFYkA1Bh03nfgaFKfAZ zhIxZk43mkvjGVooiQdXp2rdTis9XmgvgJE6MB/cUu7SOS1ANdtjuL0I0PzErU4mFtF/ olnQ== X-Gm-Message-State: AOJu0YxxzjUhjoPAy7P2NOY4tDGz2PAhzghlbJTyJ66Qo1sCSSWiP3E6 LA8wFmeSRxpB0In5v2qeju8CmuPQFOm/f9bhT+PLlQ== X-Google-Smtp-Source: AGHT+IHoQAQ/k1SxzwciIcmi+jcQ+upozI8wcoH4KzgZpCsiXfMS4tOsd7N3uxiOUng9G36LVHu5GQ== X-Received: by 2002:a17:90b:120d:b0:28b:a3bf:8aaa with SMTP id gl13-20020a17090b120d00b0028ba3bf8aaamr1684113pjb.53.1703019854284; Tue, 19 Dec 2023 13:04:14 -0800 (PST) Received: from localhost (fwdproxy-prn-000.fbsv.net. [2a03:2880:ff::face:b00c]) by smtp.gmail.com with ESMTPSA id k9-20020a170902c40900b001d0969c5b68sm21470889plk.139.2023.12.19.13.04.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:14 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq Date: Tue, 19 Dec 2023 13:03:47 -0800 Message-Id: <20231219210357.4029713-11-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: David Wei This patch sets up ZC for an Rx queue in a net device when an ifq is registered with io_uring. The Rx queue is specified in the registration struct. For now since there is only one ifq, its destruction is implicit during io_uring cleanup. 
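For illustration only, a rough userspace sketch of registering an ifq against one hardware Rx queue with this series applied. The struct fields and the IORING_REGISTER_ZC_RX_IFQ opcode come from earlier patches in the series; the helper name, the ring sizes and the raw io_uring_register(2) invocation are assumptions made for the example, not part of this patch:

	/* Illustrative sketch, not part of this patch: register a ZC ifq for
	 * one hw Rx queue. Assumes the uapi additions from this series are
	 * installed; error handling omitted. */
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/io_uring.h>

	static int register_zc_ifq(int ring_fd, int ifindex, int rxq_id)
	{
		struct io_uring_zc_rx_ifq_reg reg;

		memset(&reg, 0, sizeof(reg));
		reg.if_idx = ifindex;		/* net device to attach to */
		reg.if_rxq_id = rxq_id;		/* hw Rx queue switched to ZC */
		reg.rq_entries = 4096;		/* refill (rq) ring entries */
		reg.cq_entries = 4096;		/* completion (cq) ring entries */

		/* on success the kernel fills in mmap_sz, rq_off and cq_off */
		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_ZC_RX_IFQ, &reg, 1);
	}
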
Signed-off-by: David Wei --- io_uring/zc_rx.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 7e3e6f6d446b..259e08a34ab2 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -4,6 +4,7 @@ #include #include #include +#include #include @@ -11,6 +12,34 @@ #include "kbuf.h" #include "zc_rx.h" +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); + +static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, + u16 queue_id) +{ + struct netdev_bpf cmd; + bpf_op_t ndo_bpf; + + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return -EINVAL; + + cmd.command = XDP_SETUP_ZC_RX; + cmd.zc_rx.ifq = ifq; + cmd.zc_rx.queue_id = queue_id; + return ndo_bpf(dev, &cmd); +} + +static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id); +} + +static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); +} + static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, struct io_uring_zc_rx_ifq_reg *reg) { @@ -52,6 +81,10 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { + if (ifq->if_rxq_id != -1) + io_close_zc_rxq(ifq); + if (ifq->dev) + dev_put(ifq->dev); io_free_rbuf_ring(ifq); kfree(ifq); } @@ -77,18 +110,25 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq) return -ENOMEM; - /* TODO: initialise network interface */ - ret = io_allocate_rbuf_ring(ifq, ®); if (ret) goto err; + ret = -ENODEV; + ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx); + if (!ifq->dev) + goto err; + /* TODO: map zc region and initialise zc pool */ ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; ifq->if_rxq_id = reg.if_rxq_id; + ret = io_open_zc_rxq(ifq); + if (ret) + goto err; + ring_sz = sizeof(struct io_rbuf_ring); rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; @@ -101,6 +141,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); if (copy_to_user(arg, ®, sizeof(reg))) { + io_close_zc_rxq(ifq); ret = -EFAULT; goto err; } From patchwork Tue Dec 19 21:03:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499083 Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com [209.85.214.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 929CA3AC30 for ; Tue, 19 Dec 2023 21:04:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="ndC5qhFj" Received: by mail-pl1-f182.google.com with SMTP id d9443c01a7336-1d3dc30ae01so9161075ad.0 for ; Tue, 19 Dec 2023 13:04:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019855; x=1703624655; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=4hgHL/ZvjBTD8SimX6JIDvgoHS0Ce7mxEyaQY6vMoO4=; b=ndC5qhFj70vcYE3zzd1Z8oKtmAK7yQds9L97HIBTmUCSPXqX8eIuod2sJUCT/kWV2o 1WD7ug9WugzyrQ5igdGgk/kAWRDxlFdbYFwl0CuL0RWE8gPOh4ydGYTiy0x/NiZmppgU mtahQZQjxV/il001JfBpurOMQkeSR5D2nLSNOQnSoQvGHMoQIjmwWFLtZrPK8LzuSq0U dvGNYlQKzdnDAM4Ohs/GTAStmdb8PmwHl2IK2YoWNwetMraxGLnQ4NbDc4D2S+hUNjzh HuT4P9fTYjCICCIC4kYAqbod/Y1G/5qYdBzTgns9ji2zaJ00sd93yjsAPGauz4RMDkh4 bQ9Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019855; x=1703624655; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=4hgHL/ZvjBTD8SimX6JIDvgoHS0Ce7mxEyaQY6vMoO4=; b=MpkJhrmO36zD/vtJBqrMMcGh8+WWWhR3+k+hlrazoQU+0Wrcm/wa/N7I/6qpmMirX6 hRBBAoXz3Nk0pmCTel03xAzglq40vskzgi5sOwJD/xWHsNoDCBodxYX46MI3Ze/SqpJ7 1+IroX+xZGx9ro/4oSGCxmHz/X9O7/UxmeGFRoo4uluB774Ux3WHA3oHyLlMwmTHnGUF LTRV9FXWof8/WxjIvkR+DTSn7P9VXV5AYcaY2iaWbCE5yCaJf0uS8yVRbOvLUFzphVER 3aXUWtNiH69DUWsJpIkZLy8H30Y939oSeCgiheKvycUtVUIw+OOp0ezIfiqbk9o+DU1d LJDw== X-Gm-Message-State: AOJu0YxBYIAcw4vADYxk5XaVvRks59sIh3VGhK0cfoFFIYYiugAendHb wIRnNJgLholYF5VLbq2J1kBd+PuSfSCf5IiTacZSNg== X-Google-Smtp-Source: AGHT+IGpRb8Vk7U6vTaxpT/fIISuWL242u8wPDqXvpB/Hrzrc8SZbdwoUsYlgc7jNB61e7DGfKmrpA== X-Received: by 2002:a17:902:7b8d:b0:1d0:6ffd:9e36 with SMTP id w13-20020a1709027b8d00b001d06ffd9e36mr18734326pll.136.1703019855312; Tue, 19 Dec 2023 13:04:15 -0800 (PST) Received: from localhost (fwdproxy-prn-015.fbsv.net. [2a03:2880:ff:f::face:b00c]) by smtp.gmail.com with ESMTPSA id c2-20020a170902848200b001d09c539c96sm10014931plo.229.2023.12.19.13.04.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:15 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration Date: Tue, 19 Dec 2023 13:03:48 -0800 Message-Id: <20231219210357.4029713-12-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov We want userspace to explicitly list all sockets it'll be using with a particular zc ifq, so we can properly configure them, e.g. binding the sockets to the corresponding interface and setting steering rules. We'll also need it to better control ifq lifetime and for termination / unregistration purposes. TODO: remove zc_rx_idx from struct socket, which will fix zc_rx_idx token init races and re-registration bug. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/net.h | 2 + include/uapi/linux/io_uring.h | 7 +++ io_uring/io_uring.c | 6 +++ io_uring/net.c | 20 ++++++++ io_uring/zc_rx.c | 89 +++++++++++++++++++++++++++++++++-- io_uring/zc_rx.h | 17 +++++++ net/socket.c | 1 + 7 files changed, 139 insertions(+), 3 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index c9b4a63791a4..867061a91d30 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -126,6 +126,8 @@ struct socket { const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. 
*/ struct socket_wq wq; + + unsigned zc_rx_idx; }; /* diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 839933e562e6..f4ba58bce3bd 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -562,6 +562,7 @@ enum { /* register a network interface queue for zerocopy */ IORING_REGISTER_ZC_RX_IFQ = 26, + IORING_REGISTER_ZC_RX_SOCK = 27, /* this goes last */ IORING_REGISTER_LAST, @@ -803,6 +804,12 @@ struct io_uring_zc_rx_ifq_reg { struct io_rbuf_cqring_offsets cq_off; }; +struct io_uring_zc_rx_sock_reg { + __u32 sockfd; + __u32 zc_rx_ifq_idx; + __u32 __resv[2]; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 02d6d638bd65..47859599469d 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -4627,6 +4627,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_zc_rx_ifq(ctx, arg); break; + case IORING_REGISTER_ZC_RX_SOCK: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_sock(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/net.c b/io_uring/net.c index 75d494dad7e2..454ba301ae6b 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -16,6 +16,7 @@ #include "net.h" #include "notif.h" #include "rsrc.h" +#include "zc_rx.h" #if defined(CONFIG_NET) struct io_shutdown { @@ -955,6 +956,25 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +static __maybe_unused +struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) +{ + unsigned token = READ_ONCE(sock->zc_rx_idx); + unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; + unsigned sock_idx = token & IO_ZC_IFQ_IDX_MASK; + struct io_zc_rx_ifq *ifq; + + if (ifq_idx) + return NULL; + ifq = req->ctx->ifq; + if (!ifq || sock_idx >= ifq->nr_sockets) + return NULL; + if (ifq->sockets[sock_idx] != req->file) + return NULL; + return ifq; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 259e08a34ab2..06e2c54d3f3d 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -11,6 +11,7 @@ #include "io_uring.h" #include "kbuf.h" #include "zc_rx.h" +#include "rsrc.h" typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); @@ -79,10 +80,31 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) return ifq; } -static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +static void io_shutdown_ifq(struct io_zc_rx_ifq *ifq) { - if (ifq->if_rxq_id != -1) + int i; + + if (!ifq) + return; + + for (i = 0; i < ifq->nr_sockets; i++) { + if (ifq->sockets[i]) { + fput(ifq->sockets[i]); + ifq->sockets[i] = NULL; + } + } + ifq->nr_sockets = 0; + + if (ifq->if_rxq_id != -1) { io_close_zc_rxq(ifq); + ifq->if_rxq_id = -1; + } +} + +static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +{ + io_shutdown_ifq(ifq); + if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -141,7 +163,6 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); if (copy_to_user(arg, ®, sizeof(reg))) { - io_close_zc_rxq(ifq); ret = -EFAULT; goto err; } @@ -162,6 +183,8 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) if (!ifq) return; + WARN_ON_ONCE(ifq->nr_sockets); + ctx->ifq = NULL; io_zc_rx_ifq_free(ifq); } @@ -169,6 +192,66 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx) { 
lockdep_assert_held(&ctx->uring_lock); + + io_shutdown_ifq(ctx->ifq); +} + +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + struct io_uring_zc_rx_sock_reg sr; + struct io_zc_rx_ifq *ifq; + struct socket *sock; + struct file *file; + int ret = -EEXIST; + int idx; + + if (copy_from_user(&sr, arg, sizeof(sr))) + return -EFAULT; + if (sr.__resv[0] || sr.__resv[1]) + return -EINVAL; + if (sr.zc_rx_ifq_idx != 0 || !ctx->ifq) + return -EINVAL; + + ifq = ctx->ifq; + if (ifq->nr_sockets >= ARRAY_SIZE(ifq->sockets)) + return -EINVAL; + + BUILD_BUG_ON(ARRAY_SIZE(ifq->sockets) > IO_ZC_IFQ_IDX_MASK); + + file = fget(sr.sockfd); + if (!file) + return -EBADF; + + if (io_file_need_scm(file)) { + fput(file); + return -EBADF; + } + + sock = sock_from_file(file); + if (unlikely(!sock || !sock->sk)) { + fput(file); + return -ENOTSOCK; + } + + idx = ifq->nr_sockets; + lock_sock(sock->sk); + if (!sock->zc_rx_idx) { + unsigned token; + + token = idx + (sr.zc_rx_ifq_idx << IO_ZC_IFQ_IDX_OFFSET); + WRITE_ONCE(sock->zc_rx_idx, token); + ret = 0; + } + release_sock(sock->sk); + + if (ret) { + fput(file); + return -EINVAL; + } + ifq->sockets[idx] = file; + ifq->nr_sockets++; + return 0; } #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index aab57c1a4c5d..9257dda77e92 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -2,6 +2,13 @@ #ifndef IOU_ZC_RX_H #define IOU_ZC_RX_H +#include +#include + +#define IO_ZC_MAX_IFQ_SOCKETS 16 +#define IO_ZC_IFQ_IDX_OFFSET 16 +#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; @@ -13,6 +20,9 @@ struct io_zc_rx_ifq { /* hw rx descriptor ring id */ u32 if_rxq_id; + + unsigned nr_sockets; + struct file *sockets[IO_ZC_MAX_IFQ_SOCKETS]; }; #if defined(CONFIG_PAGE_POOL) @@ -20,6 +30,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx); void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx); +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg); #else static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg) @@ -32,6 +44,11 @@ static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx) { } +static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + return -EOPNOTSUPP; +} #endif #endif diff --git a/net/socket.c b/net/socket.c index d75246450a3c..a9cef870309a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -637,6 +637,7 @@ struct socket *sock_alloc(void) sock = SOCKET_I(inode); + sock->zc_rx_idx = 0; inode->i_ino = get_next_ino(); inode->i_mode = S_IFSOCK | S_IRWXUGO; inode->i_uid = current_fsuid(); From patchwork Tue Dec 19 21:03:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499081 Received: from mail-pj1-f45.google.com (mail-pj1-f45.google.com [209.85.216.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3ECD63D0D9 for ; Tue, 19 Dec 2023 21:04:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: 
smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="QHalPbvH" Received: by mail-pj1-f45.google.com with SMTP id 98e67ed59e1d1-28b71490fbbso1613341a91.0 for ; Tue, 19 Dec 2023 13:04:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019856; x=1703624656; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tJ8hPjWHfatZ/BzlXTQEl7ypf6OdinK6dQcdfVsR+fI=; b=QHalPbvH6VAGFB+ML/YOP8G66Y1pJOMJ7PVotVN4kS9dVIi9GujUUEApSeSm/i+QjS ZyI0Ejrw0PxFYXM6lwHRKlz88kYi+JIXzg+JKp8Y6fKNccsYHHG5HfcwvbuP1arjpYsq Xlrpvg4hBF2h8K3ak0lKLYxC5WRPZiB/pCFrxpjLLoNylbzUyjoyyBpU6hdV6Vlew38r yutxbLnF9S9BqJOQ06JCLB4UnfCZ+C/ZaD3n7M60VxBg0AN+41mwRoHP2SMSZBU/xEKA 8V+SuB/OZpfJWGSuRXeJcFomQ72iw1Z0bt1UfyWmTPRmkdjlIGb0VzqjZ55g/lF4vIy1 9VIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019856; x=1703624656; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tJ8hPjWHfatZ/BzlXTQEl7ypf6OdinK6dQcdfVsR+fI=; b=XAAfCQ9+eWfxfoG8y2g+eLi+DRUmjZaCE6HSGR8eArHoydL/cWc/dDv6KsNgc2IpOT mLxyoQjK2ejpZHoP0/3c2K2ZgFs3kEiTsdIAoobgrDaWdIOKreHp0mex2kAiucfT8pHt ZdrviHpMGts94ntCwyWUoG4QgX//4IxUue0MkZYpurN4GoZQy6J7WOA1GjuQIRK2PBVj VVKnR7pP6h2Hrjufr8MpOaqD6d43QhTZLZJjyy9NUAdbRRcFBHNi9bSBjPZVl231Zypz EKRS8KuyptmvUYdxkIHT5+KQ32m2cXoFHHavEJP9uYmSIZ8spKUpNu2uRIsdhWeToYVc SRhQ== X-Gm-Message-State: AOJu0Yx4mokwziNm9TfmpdOEqn6Yi5zCFPhOZEiBEzri/CMWAEaq0B2h 3ecFfZ2pUiglrdqdcnvTlIsY4i05IW/8yTiEOK+Asw== X-Google-Smtp-Source: AGHT+IHvfTVtAhN39iKMzYo0U58+EnVbhBRtZTa0KHZT79kAWK1dvewUF2g8Zu6YRl+BF77iiu/Htw== X-Received: by 2002:a17:90a:a40b:b0:28b:440f:766d with SMTP id y11-20020a17090aa40b00b0028b440f766dmr2611424pjp.90.1703019856226; Tue, 19 Dec 2023 13:04:16 -0800 (PST) Received: from localhost (fwdproxy-prn-020.fbsv.net. [2a03:2880:ff:14::face:b00c]) by smtp.gmail.com with ESMTPSA id u12-20020a17090a890c00b0028bbd30172csm1965513pjn.56.2023.12.19.13.04.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:15 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 12/20] io_uring: add ZC buf and pool Date: Tue, 19 Dec 2023 13:03:49 -0800 Message-Id: <20231219210357.4029713-13-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: David Wei [TODO: REVIEW COMMIT MESSAGE] This patch adds two objects: * Zero copy buffer representation, holding a page, its mapped dma_addr, and a refcount for lifetime management. * Zero copy pool, spiritually similar to page pool, that holds ZC bufs and hands them out to net devices. Pool regions are registered w/ io_uring using the registered buffer API, with a 1:1 mapping between region and nr_iovec in io_uring_register_buffers. 
This does the heavy lifting of pinning and chunking into bvecs into a struct io_mapped_ubuf for us. For now as there is only one pool region per ifq, there is no separate API for adding/removing regions yet and it is mapped implicitly during ifq registration. Signed-off-by: David Wei --- include/linux/io_uring/net.h | 8 +++ io_uring/zc_rx.c | 135 ++++++++++++++++++++++++++++++++++- io_uring/zc_rx.h | 15 ++++ 3 files changed, 157 insertions(+), 1 deletion(-) diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h index b58f39fed4d5..d994d26116d0 100644 --- a/include/linux/io_uring/net.h +++ b/include/linux/io_uring/net.h @@ -2,8 +2,16 @@ #ifndef _LINUX_IO_URING_NET_H #define _LINUX_IO_URING_NET_H +#include + struct io_uring_cmd; +struct io_zc_rx_buf { + struct page_pool_iov ppiov; + struct page *page; + dma_addr_t dma; +}; + #if defined(CONFIG_IO_URING) int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 06e2c54d3f3d..1e656b481725 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -5,6 +5,7 @@ #include #include #include +#include #include @@ -15,6 +16,11 @@ typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); +static inline struct device *netdev2dev(struct net_device *dev) +{ + return dev->dev.parent; +} + static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, u16 queue_id) { @@ -67,6 +73,129 @@ static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq) folio_put(virt_to_folio(ifq->ring)); } +static int io_zc_rx_init_buf(struct device *dev, struct page *page, u16 pool_id, + u32 pgid, struct io_zc_rx_buf *buf) +{ + dma_addr_t addr = 0; + + /* Skip dma setup for devices that don't do any DMA transfers */ + if (dev) { + addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + if (dma_mapping_error(dev, addr)) + return -ENOMEM; + } + + buf->dma = addr; + buf->page = page; + refcount_set(&buf->ppiov.refcount, 0); + buf->ppiov.owner = NULL; + buf->ppiov.pp = NULL; + get_page(page); + return 0; +} + +static void io_zc_rx_free_buf(struct device *dev, struct io_zc_rx_buf *buf) +{ + struct page *page = buf->page; + + if (dev) + dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + put_page(page); +} + +static int io_zc_rx_map_pool(struct io_zc_rx_pool *pool, + struct io_mapped_ubuf *imu, + struct device *dev) +{ + struct io_zc_rx_buf *buf; + struct page *page; + int i, ret; + + for (i = 0; i < imu->nr_bvecs; i++) { + page = imu->bvec[i].bv_page; + buf = &pool->bufs[i]; + ret = io_zc_rx_init_buf(dev, page, pool->pool_id, i, buf); + if (ret) + goto err; + + pool->freelist[i] = i; + } + + pool->free_count = imu->nr_bvecs; + return 0; +err: + while (i--) { + buf = &pool->bufs[i]; + io_zc_rx_free_buf(dev, buf); + } + return ret; +} + +static int io_zc_rx_create_pool(struct io_ring_ctx *ctx, + struct io_zc_rx_ifq *ifq, + u16 id) +{ + struct device *dev = netdev2dev(ifq->dev); + struct io_mapped_ubuf *imu; + struct io_zc_rx_pool *pool; + int nr_pages; + int ret; + + if (ifq->pool) + return -EFAULT; + + if (unlikely(id >= ctx->nr_user_bufs)) + return -EFAULT; + id = array_index_nospec(id, ctx->nr_user_bufs); + imu = ctx->user_bufs[id]; + if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK) + return -EFAULT; + + ret = -ENOMEM; + nr_pages = imu->nr_bvecs; + pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL); + if (!pool) + goto err; + + pool->bufs = 
kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL); + if (!pool->bufs) + goto err_buf; + + ret = io_zc_rx_map_pool(pool, imu, dev); + if (ret) + goto err_map; + + pool->ifq = ifq; + pool->pool_id = id; + pool->nr_bufs = nr_pages; + spin_lock_init(&pool->freelist_lock); + ifq->pool = pool; + return 0; +err_map: + kvfree(pool->bufs); +err_buf: + kvfree(pool); +err: + return ret; +} + +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + struct device *dev = netdev2dev(pool->ifq->dev); + struct io_zc_rx_buf *buf; + + for (int i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + io_zc_rx_free_buf(dev, buf); + } + kvfree(pool->bufs); + kvfree(pool); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -105,6 +234,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { io_shutdown_ifq(ifq); + if (ifq->pool) + io_zc_rx_destroy_pool(ifq->pool); if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -141,7 +272,9 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq->dev) goto err; - /* TODO: map zc region and initialise zc pool */ + ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id); + if (ret) + goto err; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 9257dda77e92..af1d865525d2 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -3,15 +3,30 @@ #define IOU_ZC_RX_H #include +#include #include #define IO_ZC_MAX_IFQ_SOCKETS 16 #define IO_ZC_IFQ_IDX_OFFSET 16 #define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) +struct io_zc_rx_pool { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *bufs; + u32 nr_bufs; + u16 pool_id; + + /* freelist */ + spinlock_t freelist_lock; + u32 free_count; + u32 freelist[]; +}; + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; + struct io_zc_rx_pool *pool; + struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; From patchwork Tue Dec 19 21:03:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499082 Received: from mail-pg1-f174.google.com (mail-pg1-f174.google.com [209.85.215.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1B69A3DB8C for ; Tue, 19 Dec 2023 21:04:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="cyivTOJr" Received: by mail-pg1-f174.google.com with SMTP id 41be03b00d2f7-5cdbc7bebecso515737a12.1 for ; Tue, 19 Dec 2023 13:04:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019857; x=1703624657; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=eEGPLnsr8EuY8VD2FnKVCOVQnMaFt7IkWGFTilccx/A=; b=cyivTOJrwoLI0/YLu3/Ahq31WLXLq4pkB4jViJT9/yx4PwKdylYdEnL5Epq4TWmRUD oyAiY5C1B/+M/ySO9TjFvXA+6XBLhMt97/Dx2XardPGVBuFHDL7m8ZgFRVXjlJ2tu08C 
VUhLE3zh/zeWAhYiggyYZi3inWeBoxaSTM/8MWf2udzQFHgZAUq39AtJ4RJ8Xcva+wMn C7xi8vvUK9SBTfJ0QtLB9IvJqgIN/3MmwXl4cBr1KXx4X99tL7nVu1JotwzCAArY0VoF FT3nY+h3jhoz1MKz9cWaaW8VxtyWbZH19thI+ARrQDspG6inheCgirph0dCzk74cKArq a5DA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019857; x=1703624657; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=eEGPLnsr8EuY8VD2FnKVCOVQnMaFt7IkWGFTilccx/A=; b=MsnJtgAk1bXO3Yca9u9CsDHwFs9vVrqe3jscyHwvCqeTmPQ/Yr8tSnFAV3TwSUGoIj 6ir0PnayYQTYI8zeOuMlG+7O9W0MKjPdWxBL3hp+vPLHXtbfjaQ7Ig0sWHHqR+Fl1vpz +YkQU6wgmJYnr/Iobu5kg6C18rtUzfSPSHWTxw0/xCoxVDcGUzfcrQTn4LnETW1FJkyF IVqzFE7XPOGICHT0hmOCB5NvDApUC2f7P63dzFG3bDofjCxl/FRUxV8nF1AzUk+zaYyq gf5zHSjteDwS4j/+k9BeSNRpzgDTMYxuL3F/WlA7Ns9b4Ll7Egzn2tfVTN3gEKLFmAD4 iCgQ== X-Gm-Message-State: AOJu0YyXg7SRakXr+cvrzC02lvELVTWThg4lL7UOrJe/k6W/BA0UjJnJ orUyyO+5DPzZ6DmuodVkilsQLxwhUSBfuJpse7Qa3A== X-Google-Smtp-Source: AGHT+IF8bw1k9pZ9asY2+rvU14RiZot/ft1Sk+wFRr4Kw+D4bW9SqnE775wocw9nZCzUzNSxjlss3A== X-Received: by 2002:a05:6a21:1f03:b0:18a:d4c3:1350 with SMTP id ry3-20020a056a211f0300b0018ad4c31350mr7409363pzb.44.1703019857139; Tue, 19 Dec 2023 13:04:17 -0800 (PST) Received: from localhost (fwdproxy-prn-003.fbsv.net. [2a03:2880:ff:3::face:b00c]) by smtp.gmail.com with ESMTPSA id c14-20020aa781ce000000b006d082dd8086sm16864175pfn.214.2023.12.19.13.04.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:16 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx Date: Tue, 19 Dec 2023 13:03:50 -0800 Message-Id: <20231219210357.4029713-14-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov We're adding a new pp memory provider to implement io_uring zerocopy receive. It'll be "registered" in pp and used in later paches. The typical life cycle of a buffer goes as follows: first it's allocated to a driver with the initial refcount set to 1. The drivers fills it with data, puts it into an skb and passes down the stack, where it gets queued up to a socket. Later, a zc io_uring request will be receiving data from the socket from a task context. At that point io_uring will tell the userspace that this buffer has some data by posting an appropriate completion. It'll also elevating the refcount by IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading the data. When the userspace is done with the buffer it should return it back to io_uring by adding an entry to the buffer refill ring. When necessary io_uring will poll the refill ring, compare references including IO_ZC_RX_UREF and reuse the buffer. Initally, all buffers are placed in a spinlock protected ->freelist. It's a slow path stash, where buffers are considered to be unallocated and not exposed to core page pool. On allocation, pp will first try all its caches, and the ->alloc_pages callback if everything else failed. The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill ring. 
The consumption from the ring is always done in the attached napi context, so no additional synchronisation required. If that fails we'll be getting buffers from the ->freelist. Note: only ->freelist are considered unallocated for page pool, so we only add pages_state_hold_cnt when allocating from there. Subsequently, as page_pool_return_page() and others bump the ->pages_state_release_cnt counter, io_pp_zc_release_page() can only use ->freelist, which is not a problem as it's not a slow path. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring/net.h | 5 + io_uring/zc_rx.c | 204 +++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 6 ++ 3 files changed, 215 insertions(+) diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h index d994d26116d0..13244ae5fc4a 100644 --- a/include/linux/io_uring/net.h +++ b/include/linux/io_uring/net.h @@ -13,6 +13,11 @@ struct io_zc_rx_buf { }; #if defined(CONFIG_IO_URING) + +#if defined(CONFIG_PAGE_POOL) +extern const struct pp_memory_provider_ops io_uring_pp_zc_ops; +#endif + int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags); #else diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 1e656b481725..ff1dac24ac40 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include @@ -387,4 +388,207 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx, return 0; } +static inline struct io_zc_rx_buf *io_iov_to_buf(struct page_pool_iov *iov) +{ + return container_of(iov, struct io_zc_rx_buf, ppiov); +} + +static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + return buf - pool->bufs; +} + +static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) +{ + refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount); +} + +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) +{ + if (page_pool_iov_refcount(&buf->ppiov) < IO_ZC_RX_UREF) + return false; + + return page_pool_iov_sub_and_test(&buf->ppiov, IO_ZC_RX_UREF); +} + +static inline struct page *io_zc_buf_to_pp_page(struct io_zc_rx_buf *buf) +{ + return page_pool_mangle_ppiov(&buf->ppiov); +} + +static inline void io_zc_add_pp_cache(struct page_pool *pp, + struct io_zc_rx_buf *buf) +{ + refcount_set(&buf->ppiov.refcount, 1); + pp->alloc.cache[pp->alloc.count++] = io_zc_buf_to_pp_page(buf); +} + +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + u32 entries; + + entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head; + return min(entries, ifq->rq_entries); +} + +static void io_zc_rx_ring_refill(struct page_pool *pp, + struct io_zc_rx_ifq *ifq) +{ + unsigned int entries = io_zc_rx_rqring_entries(ifq); + unsigned int mask = ifq->rq_entries - 1; + struct io_zc_rx_pool *pool = ifq->pool; + + if (unlikely(!entries)) + return; + + while (entries--) { + unsigned int rq_idx = ifq->cached_rq_head++ & mask; + struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx]; + u32 pgid = rqe->off / PAGE_SIZE; + struct io_zc_rx_buf *buf = &pool->bufs[pgid]; + + if (!io_zc_rx_put_buf_uref(buf)) + continue; + io_zc_add_pp_cache(pp, buf); + if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL) + break; + } + smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head); +} + +static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq) +{ + struct io_zc_rx_pool *pool = ifq->pool; + + spin_lock_bh(&pool->freelist_lock); + while (pool->free_count && pp->alloc.count < 
PP_ALLOC_CACHE_REFILL) { + struct io_zc_rx_buf *buf; + u32 pgid; + + pgid = pool->freelist[--pool->free_count]; + buf = &pool->bufs[pgid]; + + io_zc_add_pp_cache(pp, buf); + pp->pages_state_hold_cnt++; + trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf), + pp->pages_state_hold_cnt); + } + spin_unlock_bh(&pool->freelist_lock); +} + +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + spin_lock_bh(&pool->freelist_lock); + pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf); + spin_unlock_bh(&pool->freelist_lock); +} + +static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + + /* pp should already be ensuring that */ + if (unlikely(pp->alloc.count)) + goto out_return; + + io_zc_rx_ring_refill(pp, ifq); + if (likely(pp->alloc.count)) + goto out_return; + + io_zc_rx_refill_slow(pp, ifq); + if (!pp->alloc.count) + return NULL; +out_return: + return pp->alloc.cache[--pp->alloc.count]; +} + +static bool io_pp_zc_release_page(struct page_pool *pp, struct page *page) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct page_pool_iov *ppiov; + + if (WARN_ON_ONCE(!page_is_page_pool_iov(page))) + return false; + + ppiov = page_to_page_pool_iov(page); + + if (!page_pool_iov_sub_and_test(ppiov, 1)) + return false; + + io_zc_rx_recycle_buf(ifq->pool, io_iov_to_buf(ppiov)); + return true; +} + +static void io_pp_zc_scrub(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct io_zc_rx_pool *pool = ifq->pool; + struct io_zc_rx_buf *buf; + int i; + + for (i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + + if (io_zc_rx_put_buf_uref(buf)) { + /* just return it to the page pool, it'll clean it up */ + refcount_set(&buf->ppiov.refcount, 1); + page_pool_iov_put_many(&buf->ppiov, 1); + } + } +} + +static void io_zc_rx_init_pool(struct io_zc_rx_pool *pool, + struct page_pool *pp) +{ + struct io_zc_rx_buf *buf; + int i; + + for (i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + buf->ppiov.pp = pp; + } +} + +static int io_pp_zc_init(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + + if (!ifq) + return -EINVAL; + if (pp->p.order != 0) + return -EINVAL; + if (!pp->p.napi) + return -EINVAL; + + io_zc_rx_init_pool(ifq->pool, pp); + percpu_ref_get(&ifq->ctx->refs); + ifq->pp = pp; + return 0; +} + +static void io_pp_zc_destroy(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct io_zc_rx_pool *pool = ifq->pool; + + ifq->pp = NULL; + + if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs)) + return; + percpu_ref_put(&ifq->ctx->refs); +} + +const struct pp_memory_provider_ops io_uring_pp_zc_ops = { + .alloc_pages = io_pp_zc_alloc_pages, + .release_page = io_pp_zc_release_page, + .init = io_pp_zc_init, + .destroy = io_pp_zc_destroy, + .scrub = io_pp_zc_scrub, +}; +EXPORT_SYMBOL(io_uring_pp_zc_ops); + + #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index af1d865525d2..00d864700c67 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -10,6 +10,9 @@ #define IO_ZC_IFQ_IDX_OFFSET 16 #define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) +#define IO_ZC_RX_UREF 0x10000 +#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1) + struct io_zc_rx_pool { struct io_zc_rx_ifq *ifq; struct io_zc_rx_buf *bufs; @@ -26,12 +29,15 @@ struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; struct io_zc_rx_pool *pool; + struct page_pool *pp; struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct 
io_uring_rbuf_cqe *cqes; u32 rq_entries; u32 cq_entries; + u32 cached_rq_head; + u32 cached_cq_tail; /* hw rx descriptor ring id */ u32 if_rxq_id; From patchwork Tue Dec 19 21:03:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499084 Received: from mail-pl1-f181.google.com (mail-pl1-f181.google.com [209.85.214.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 075C83C48B for ; Tue, 19 Dec 2023 21:04:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="oux1MZOq" Received: by mail-pl1-f181.google.com with SMTP id d9443c01a7336-1d3ad3ad517so13782175ad.0 for ; Tue, 19 Dec 2023 13:04:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019858; x=1703624658; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=+R3fraE1AUJLiBUl+irWqkEYdm8NtMUbP7MGr9N7dq0=; b=oux1MZOq0AzQgkcRmTCoj873Fn++FIU+nsu6uzWncnnZnBKhfjnsVhQmhuanI8ANrp 441FQdX/PIBwqAwSTclUPmw51AHBolPPqVidjmbvYdV6jpbKyhLeOMAD2f+rb0XPFRXD 1UF59ZJDEmS68G/aZQalCwtyuIb07xxrw3gaBQDCEMk6POItlGSjVr7inj5wFKxXH3En 82J3w82CPGNwPSykYLF5NjH2kMYS9vBQhV8LsCPTez/TQ2wdqRHuF33hoG0Dp0qJp8NE UhOikBtYWJdeeZPiIuI6Cn0SDmXR5uEtFVBRFpUUepmYOQYQwJd7wSYsFnLCV0/qd1lv 2Hfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019858; x=1703624658; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=+R3fraE1AUJLiBUl+irWqkEYdm8NtMUbP7MGr9N7dq0=; b=qV18tgA549bKsEVBc/qGQKVTrDf/pX0ZLfpuxRAAAnu8jk7UhaDJuLig+mONTqFklP kgmutE7v90KFjiq6+fx2kXVzj+b+oqEOykbH52xbesOakEse+tdMjT8AOb1nBwQKCnHn fWJwSXqSBre4nuI9SC7PI23oh+gQVVi9xtgsN2VJcu718BA3A30XcYLPEIA9CiOelXsq roQF832HZSCKnIibSRXzqdrtO6kliEE/yj/VMHWCF/zvspFuIQCTad0SMV7okfHPt9pZ YR5Tj8dNaVdyzkRAIuR7YUlNRe97M0jQNKQ4mmifo+gmYfYmA9smC6zBvzqyZFfPuCgQ /uwg== X-Gm-Message-State: AOJu0YzUzKw9mTm/UR1AFqGfvL4iksRbj68olFNrLX+Nhf38OeQUDgkd elxbzNyY+okZpaMnacZjuqamNg4HfItvygz8LxIe6A== X-Google-Smtp-Source: AGHT+IHv2Ih+zIMLBLTSl+oRkb/qmBFcX3CARWtV0GirJ8tYSwTCNeW0vzrX5FsoKxAiegCWCFM5LA== X-Received: by 2002:a17:902:704c:b0:1d0:6ffd:ae22 with SMTP id h12-20020a170902704c00b001d06ffdae22mr9681493plt.137.1703019858245; Tue, 19 Dec 2023 13:04:18 -0800 (PST) Received: from localhost (fwdproxy-prn-002.fbsv.net. [2a03:2880:ff:2::face:b00c]) by smtp.gmail.com with ESMTPSA id u2-20020a170902e80200b001acae9734c0sm4100365plg.266.2023.12.19.13.04.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:17 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider Date: Tue, 19 Dec 2023 13:03:51 -0800 Message-Id: <20231219210357.4029713-15-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov Allow creating a special io_uring pp memory providers, which will be for implementing io_uring zerocopy receive. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/page_pool/types.h | 1 + net/core/page_pool.c | 6 ++++++ 2 files changed, 7 insertions(+) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index fd846cac9fb6..f54ee759e362 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -129,6 +129,7 @@ struct mem_provider; enum pp_memory_provider_type { __PP_MP_NONE, /* Use system allocator directly */ PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */ + PP_MP_IOU_ZCRX, /* io_uring zerocopy receive provider */ }; struct pp_memory_provider_ops { diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 9e3073d61a97..ebf5ff009d9d 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -21,6 +21,7 @@ #include #include #include +#include #include @@ -242,6 +243,11 @@ static int page_pool_init(struct page_pool *pool, case PP_MP_DMABUF_DEVMEM: pool->mp_ops = &dmabuf_devmem_ops; break; +#if defined(CONFIG_IO_URING) + case PP_MP_IOU_ZCRX: + pool->mp_ops = &io_uring_pp_zc_ops; + break; +#endif default: err = -EINVAL; goto free_ptr_ring; From patchwork Tue Dec 19 21:03:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499085 Received: from mail-pg1-f176.google.com (mail-pg1-f176.google.com [209.85.215.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0FF073FB11 for ; Tue, 19 Dec 2023 21:04:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="XfHxmjbw" Received: by mail-pg1-f176.google.com with SMTP id 41be03b00d2f7-5ca1b4809b5so1742960a12.3 for ; Tue, 19 Dec 2023 13:04:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019859; x=1703624659; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=EseMquZ+ud914kgVcMqdU4pc5Hd6eVscu9u0Nzy12vg=; b=XfHxmjbw/YAU1WsBGvOQc6EaJ6VM8unnd7mW5cWXcN2yPX8+g78cXiTF61i42H2y+c 2SZsfsXfCVXI69KXTKB87S8pFG++5NbUYOxasySWnzkuWe9CgHTyqQ3lau1wakOq2e3s PJWR2Vqg+8lWj3x7HDuKw6AfgEmJkVx6iThxV6a26jFPBiXleHF4ad8UwfynvNyBX0Oe sBpc6HGE0HYdhpmPhoNr49RnOrpsOHisXqkww4cbfd2KpDFOxPHOJ0uHGpfMthCErbb/ lug0unYJxquvLoq2/HrCoCGtw6dN+wGEFOWJO9TgpBDDvEzfD2FVI/Wf2VwqB/UKajpn B9rg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019859; 
x=1703624659; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=EseMquZ+ud914kgVcMqdU4pc5Hd6eVscu9u0Nzy12vg=; b=KiTUudjXvFgp+ggBn4w6tQIUdWyS1hk/NYR0XfarblunntfFQED/+C6A8rVrtq18BG TRF63m2I8eXA7DqD6NgGT8MWkpZ0RlTa3O/6wz71vL1fxk8GCSnT+0BYIdE90a/yNxxv udlqbMr52JH546wThciUIt9h4Q2JzNLzPmdoK8D7T8tCcSeGU6U399jjhqT7NtkY8VVJ kGD05ZIzyhalW5I9VCQnpQ/FxYyiOE4EjPX1J4MU/7kxjMDfRBCNxHuWG+CxKJndATXJ WTYRn2QFEEghZsWayan4LSK4E5FOYaQZflDoaohgldhty87AzlULKgopcRrzkmq9WkX/ eAIg== X-Gm-Message-State: AOJu0Yy6oyt+c6q/wjS6cFnAdJ9+HaU5kFYDZvV/VQFCQEpnYHjumQIr ylJEOsmaD33pSmA+Bg4Ip3dSu8lZ5FFGfjH7/g1XXw== X-Google-Smtp-Source: AGHT+IGKtSzi/RRmXkuPnD6lrUqbaSUkwLiVuC2nC80xHz/yFrUKP7xukH1K1+e1c60fL24LqB6u4g== X-Received: by 2002:a05:6a21:1a6:b0:194:bb77:b263 with SMTP id le38-20020a056a2101a600b00194bb77b263mr703965pzb.55.1703019859125; Tue, 19 Dec 2023 13:04:19 -0800 (PST) Received: from localhost (fwdproxy-prn-000.fbsv.net. [2a03:2880:ff::face:b00c]) by smtp.gmail.com with ESMTPSA id v4-20020aa78084000000b006cde2090154sm20613615pff.218.2023.12.19.13.04.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:18 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 15/20] io_uring: add io_recvzc request Date: Tue, 19 Dec 2023 13:03:52 -0800 Message-Id: <20231219210357.4029713-16-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: David Wei This patch adds an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that is set up for ZC Rx. The request reads skbs from a socket where its page frags are tagged w/ a magic cookie in their page private field. For each frag, entries are written into the ifq rbuf completion ring, and the total number of bytes read is returned to user as an io_uring completion event. Multishot requests work. There is no need to specify provided buffers as data is returned in the ifq rbuf completion rings. Userspace is expected to look into the ifq rbuf completion ring when it receives an io_uring completion event. The addr3 field is used to encode params in the following format: addr3 = (readlen << 32); readlen is the max amount of data to read from the socket. ifq_id is the interface queue id, and currently only 0 is supported. 
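As an illustrative sketch (not part of this patch), a multishot RECV_ZC request could be queued from userspace roughly as follows; the helper name is made up, and a liburing ring set up with IORING_SETUP_DEFER_TASKRUN is assumed:

	/* Illustrative only: queue a multishot IORING_OP_RECV_ZC against a
	 * socket previously registered with IORING_REGISTER_ZC_RX_SOCK.
	 * Payload locations are later read from the ifq rbuf completion ring. */
	#include <string.h>
	#include <liburing.h>

	static void queue_recvzc(struct io_uring *ring, int sockfd,
				 unsigned int readlen)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_RECV_ZC;
		sqe->fd = sockfd;
		sqe->ioprio = IORING_RECV_MULTISHOT;
		/* upper 32 bits: max bytes to read; low 16 bits: ifq id (only 0) */
		sqe->addr3 = (__u64)readlen << 32;
	}
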
Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 1 + io_uring/net.c | 119 ++++++++++++++++- io_uring/opdef.c | 16 +++ io_uring/zc_rx.c | 240 +++++++++++++++++++++++++++++++++- io_uring/zc_rx.h | 5 + 5 files changed, 375 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f4ba58bce3bd..f57f394744fe 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -253,6 +253,7 @@ enum io_uring_op { IORING_OP_FUTEX_WAIT, IORING_OP_FUTEX_WAKE, IORING_OP_FUTEX_WAITV, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/net.c b/io_uring/net.c index 454ba301ae6b..7a2aadf6962c 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -71,6 +71,16 @@ struct io_sr_msg { struct io_kiocb *notif; }; +struct io_recvzc { + struct file *file; + unsigned len; + unsigned done_io; + unsigned msg_flags; + u16 flags; + + u32 datalen; +}; + static inline bool io_check_multishot(struct io_kiocb *req, unsigned int issue_flags) { @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, unsigned int cflags; cflags = io_put_kbuf(req, issue_flags); - if (msg->msg_inq && msg->msg_inq != -1) + if (msg && msg->msg_inq && msg->msg_inq != -1) cflags |= IORING_CQE_F_SOCK_NONEMPTY; if (!(req->flags & REQ_F_APOLL_MULTISHOT)) { @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, io_recv_prep_retry(req); /* Known not-empty or unknown state, retry */ if (cflags & IORING_CQE_F_SOCK_NONEMPTY || - msg->msg_inq == -1) + (msg && msg->msg_inq == -1)) return false; if (issue_flags & IO_URING_F_MULTISHOT) *ret = IOU_ISSUE_SKIP_COMPLETE; @@ -956,9 +966,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } -static __maybe_unused -struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, - struct socket *sock) +static struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) { unsigned token = READ_ONCE(sock->zc_rx_idx); unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; @@ -975,6 +984,106 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, return ifq; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + u64 recvzc_cmd; + + recvzc_cmd = READ_ONCE(sqe->addr3); + zc->datalen = recvzc_cmd >> 32; + if (recvzc_cmd & 0xffff) + return -EINVAL; + if (!(req->ctx->flags & IORING_SETUP_DEFER_TASKRUN)) + return -EINVAL; + if (unlikely(sqe->file_index || sqe->addr2)) + return -EINVAL; + + zc->len = READ_ONCE(sqe->len); + zc->flags = READ_ONCE(sqe->ioprio); + if (zc->flags & ~(RECVMSG_FLAGS)) + return -EINVAL; + zc->msg_flags = READ_ONCE(sqe->msg_flags); + if (zc->msg_flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + if (zc->msg_flags & MSG_ERRQUEUE) + req->flags |= REQ_F_CLEAR_POLLIN; + if (zc->flags & IORING_RECV_MULTISHOT) { + if (zc->msg_flags & MSG_WAITALL) + return -EINVAL; + if (req->opcode == IORING_OP_RECV && zc->len) + return -EINVAL; + req->flags |= REQ_F_APOLL_MULTISHOT; + } + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + zc->msg_flags |= MSG_CMSG_COMPAT; +#endif + zc->done_io = 0; + return 0; +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct socket *sock; + unsigned flags; + int ret, min_ret = 0; + bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct io_zc_rx_ifq *ifq; + + if (issue_flags & 
IO_URING_F_UNLOCKED) + return -EAGAIN; + + if (!(req->flags & REQ_F_POLLED) && + (zc->flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + ifq = io_zc_verify_sock(req, sock); + if (!ifq) + return -EINVAL; + +retry_multishot: + flags = zc->msg_flags; + if (force_nonblock) + flags |= MSG_DONTWAIT; + if (flags & MSG_WAITALL) + min_ret = zc->len; + + ret = io_zc_rx_recv(ifq, sock, zc->datalen, flags); + if (ret < min_ret) { + if (ret == -EAGAIN && force_nonblock) { + if (issue_flags & IO_URING_F_MULTISHOT) + return IOU_ISSUE_SKIP_COMPLETE; + return -EAGAIN; + } + if (ret > 0 && io_net_retry(sock, flags)) { + zc->len -= ret; + zc->done_io += ret; + req->flags |= REQ_F_PARTIAL_IO; + return -EAGAIN; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + req_set_fail(req); + } else if ((flags & MSG_WAITALL) && (flags & (MSG_TRUNC | MSG_CTRUNC))) { + req_set_fail(req); + } + + if (ret > 0) + ret += zc->done_io; + else if (zc->done_io) + ret = zc->done_io; + + if (!io_recv_finish(req, &ret, 0, ret <= 0, issue_flags)) + goto retry_multishot; + + return ret; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index 799db44283c7..a90231566d09 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -35,6 +35,7 @@ #include "rw.h" #include "waitid.h" #include "futex.h" +#include "zc_rx.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -467,6 +468,18 @@ const struct io_issue_def io_issue_defs[] = { .issue = io_futexv_wait, #else .prep = io_eopnotsupp_prep, +#endif + }, + [IORING_OP_RECV_ZC] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, #endif }, }; @@ -704,6 +717,9 @@ const struct io_cold_def io_cold_defs[] = { [IORING_OP_FUTEX_WAITV] = { .name = "FUTEX_WAITV", }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index ff1dac24ac40..acb70ca23150 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -15,8 +16,20 @@ #include "zc_rx.h" #include "rsrc.h" +struct io_zc_rx_args { + struct io_zc_rx_ifq *ifq; + struct socket *sock; +}; + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); +static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + + return ifq->cached_cq_tail - READ_ONCE(ring->cq.head); +} + static inline struct device *netdev2dev(struct net_device *dev) { return dev->dev.parent; @@ -399,7 +412,7 @@ static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool, return buf - pool->bufs; } -static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) +static void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) { refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount); } @@ -590,5 +603,230 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = { }; EXPORT_SYMBOL(io_uring_pp_zc_ops); +static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + struct io_uring_rbuf_cqe *cqe; + unsigned int cq_idx, queued, free, entries; + unsigned int mask = ifq->cq_entries - 1; + + cq_idx = ifq->cached_cq_tail & mask; + smp_rmb(); + queued = min(io_zc_rx_cqring_entries(ifq), 
ifq->cq_entries); + free = ifq->cq_entries - queued; + entries = min(free, ifq->cq_entries - cq_idx); + if (!entries) + return NULL; + + cqe = &ifq->cqes[cq_idx]; + ifq->cached_cq_tail++; + return cqe; +} + +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, + int off, int len, unsigned sock_idx) +{ + off += skb_frag_off(frag); + + if (likely(page_is_page_pool_iov(frag->bv_page))) { + struct io_uring_rbuf_cqe *cqe; + struct io_zc_rx_buf *buf; + struct page_pool_iov *ppiov; + + ppiov = page_to_page_pool_iov(frag->bv_page); + if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX || + ppiov->pp->mp_priv != ifq) + return -EFAULT; + + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) + return -ENOBUFS; + + buf = io_iov_to_buf(ppiov); + io_zc_rx_get_buf_uref(buf); + + cqe->region = 0; + cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off; + cqe->len = len; + cqe->sock = sock_idx; + cqe->flags = 0; + } else { + return -EOPNOTSUPP; + } + + return len; +} + +static int +zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct io_zc_rx_args *args = desc->arg.data; + struct io_zc_rx_ifq *ifq = args->ifq; + struct socket *sock = args->sock; + unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK; + struct sk_buff *frag_iter; + unsigned start, start_off; + int i, copy, end, off; + int ret = 0; + + start = skb_headlen(skb); + start_off = offset; + + if (offset < start) + return -EOPNOTSUPP; + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + WARN_ON(start > offset + len); + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_frag(ifq, frag, off, copy, sock_idx); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + WARN_ON(start > offset + len); + + end = start + frag_iter->len; + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail); + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk) +{ + struct io_zc_rx_args args = { + .ifq = ifq, + .sock = sk->sk_socket, + }; + read_descriptor_t rd_desc = { + .count = 1, + .arg.data = &args, + }; + + return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb); +} + +static int io_zc_rx_tcp_recvmsg(struct io_zc_rx_ifq *ifq, struct sock *sk, + unsigned int recv_limit, + int flags, int *addr_len) +{ + size_t used; + long timeo; + int ret; + + ret = used = 0; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (recv_limit) { + ret = io_zc_rx_tcp_read(ifq, sk); + if (ret < 0) + break; + if (!ret) { + if (used) + break; + if (sock_flag(sk, SOCK_DONE)) + break; + if (sk->sk_err) { + ret = sock_error(sk); + break; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + if (sk->sk_state == TCP_CLOSE) { + ret = -ENOTCONN; + break; + } + if (!timeo) { + ret = -EAGAIN; + break; + } + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + sk_wait_data(sk, &timeo, NULL); + if (signal_pending(current)) { + ret = 
sock_intr_errno(timeo); + break; + } + continue; + } + recv_limit -= ret; + used += ret; + + if (!timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (sk->sk_err || sk->sk_state == TCP_CLOSE || + (sk->sk_shutdown & RCV_SHUTDOWN) || + signal_pending(current)) + break; + } + release_sock(sk); + /* TODO: handle timestamping */ + return used ? used : ret; +} + +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags) +{ + struct sock *sk = sock->sk; + const struct proto *prot; + int addr_len = 0; + int ret; + + if (flags & MSG_ERRQUEUE) + return -EOPNOTSUPP; + + prot = READ_ONCE(sk->sk_prot); + if (prot->recvmsg != tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + + ret = io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len); + + return ret; +} #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 00d864700c67..3e8f07e4b252 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -72,4 +72,9 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, } #endif +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags); + #endif From patchwork Tue Dec 19 21:03:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499086 Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F33A3405C6 for ; Tue, 19 Dec 2023 21:04:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="dCsXqDtQ" Received: by mail-pl1-f177.google.com with SMTP id d9443c01a7336-1d3ce28ac3cso16788065ad.0 for ; Tue, 19 Dec 2023 13:04:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019860; x=1703624660; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=GxpOp5keoBQGOz00dSxKjcoORly3Uq6QcOZ8S/MSAKQ=; b=dCsXqDtQQRdS3qE2x27ZjDcAt67Sxxj3q5igk7TxkZ7nIOZpby+uKKUQVB4gyTNR2B d/L3Wpryht4RLo4k/VLZvzmRNzmt5fUat5grHDhOtZgChXjsR5otxIYKad4XuUJyY0Fk zx0b8m/zhlm9lzDjzaosaAtMzmQ4BWFP7oaZccY2Z4/vpXsfE8yEcycYeI6k8YK25iHd GWbnI3Wx4fwBP/OAD9HUMlgpNmm0kna8DXXEVjtzE67UZtOZxo10aXK3Dj2XqMzJDHSd xxlnPBXLh7ACFeLSdSXiB+sbaDE6l13TnIDlTzitzRhwD17Sy6gYrLqhODabsWHvzvp9 kq3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019860; x=1703624660; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=GxpOp5keoBQGOz00dSxKjcoORly3Uq6QcOZ8S/MSAKQ=; b=jm3jfXT/8bICV2L3UQBEtH/F1ah3X1Cn4IJwAQ4eiBHH4y+S0K3PXz9tlBtsK77oLk B5OzDpEoMqP9RvkkPbXd8xbOKwkwuri8TUBlAsB4WmPItDdwruYUpBSvUTLxPsstmCaB 
GjMKGzrfo9/A41NdgG965aDVuspmH1BHcm7DcxOLYhfMZ61lp+lxDirYIInBlM6MQoaW QY4VvjvrdEcWN0INBh0/yeZRURigFHgeBZcLdW6VkkYr0SDd5UyTGfPgBRj1N9QklZc+ auD4Hi2JnyadCWKsDE7Fb6BCo5w8ZtDl1Nzbdb3DPaQBJEh1YWWsUstmRn/nM9aWwrG7 21RA== X-Gm-Message-State: AOJu0YyBZw170CMORZ+vXCw1HPrxMGFD+mNnoruel2nZFmVkZbvj3F68 iYDwgv1dsQJyYjEJ6SunkHNWrBnkhvuZkEpgRn+BnQ== X-Google-Smtp-Source: AGHT+IGggFl0JrUM4oTE/WqbU6Jj7xNuoLiaUQdJMih11dS+6jNAD97QTKlC1myhDJGi0F2OblMPZQ== X-Received: by 2002:a17:903:11c5:b0:1d0:6ffd:8367 with SMTP id q5-20020a17090311c500b001d06ffd8367mr9469743plh.114.1703019860208; Tue, 19 Dec 2023 13:04:20 -0800 (PST) Received: from localhost (fwdproxy-prn-013.fbsv.net. [2a03:2880:ff:d::face:b00c]) by smtp.gmail.com with ESMTPSA id l5-20020a170903120500b001d349fcb70dsm12563264plh.202.2023.12.19.13.04.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:19 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 16/20] net: execute custom callback from napi Date: Tue, 19 Dec 2023 13:03:53 -0800 Message-Id: <20231219210357.4029713-17-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov Sometimes we want to access a napi protected resource from task context like in the case of io_uring zc falling back to copy and accessing the buffer ring. Add a helper function that allows to execute a custom function from napi context by first stopping it similarly to napi_busy_loop(). Experimental, needs much polishing and sharing bits with napi_busy_loop(). 
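For illustration, a rough caller-side sketch of the intended usage (all names except napi_execute() and page_pool_dev_alloc_pages() are made up for this example; patch 17 of this series applies the same pattern for its copy fallback):

/* Sketch only: run a short callback with the napi instance stopped, so it
 * can safely touch napi-protected state such as a page pool.
 */
struct refill_state {
	struct page_pool *pp;
	struct page *page;
};

static void refill_cb(void *arg)
{
	struct refill_state *state = arg;

	/* napi is stopped here, so there are no concurrent softirq users
	 * of the pool.
	 */
	state->page = page_pool_dev_alloc_pages(state->pp);
}

static struct page *refill_one(struct napi_struct *napi, struct page_pool *pp)
{
	struct refill_state state = { .pp = pp };

	napi_execute(napi, refill_cb, &state);
	return state.page;
}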
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/busy_poll.h | 7 +++++++ net/core/dev.c | 46 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+) diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index 4dabeb6c76d3..64238467e00a 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -47,6 +47,8 @@ bool sk_busy_loop_end(void *p, unsigned long start_time); void napi_busy_loop(unsigned int napi_id, bool (*loop_end)(void *, unsigned long), void *loop_end_arg, bool prefer_busy_poll, u16 budget); +void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg); #else /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long net_busy_loop_on(void) @@ -59,6 +61,11 @@ static inline bool sk_can_busy_loop(struct sock *sk) return false; } +static inline void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg) +{ +} + #endif /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long busy_loop_current_time(void) diff --git a/net/core/dev.c b/net/core/dev.c index e55750c47245..2dd4f3846535 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6537,6 +6537,52 @@ void napi_busy_loop(unsigned int napi_id, } EXPORT_SYMBOL(napi_busy_loop); +void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg) +{ + bool done = false; + unsigned long val; + void *have_poll_lock = NULL; + + rcu_read_lock(); + + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_disable(); + for (;;) { + local_bh_disable(); + val = READ_ONCE(napi->state); + + /* If multiple threads are competing for this napi, + * we avoid dirtying napi->state as much as we can. + */ + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED | + NAPIF_STATE_IN_BUSY_POLL)) + goto restart; + + if (cmpxchg(&napi->state, val, + val | NAPIF_STATE_IN_BUSY_POLL | + NAPIF_STATE_SCHED) != val) + goto restart; + + have_poll_lock = netpoll_poll_lock(napi); + cb(cb_arg); + done = true; + gro_normal_list(napi); + local_bh_enable(); + break; +restart: + local_bh_enable(); + if (unlikely(need_resched())) + break; + cpu_relax(); + } + if (done) + busy_poll_stop(napi, have_poll_lock, false, 1); + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_enable(); + rcu_read_unlock(); +} + #endif /* CONFIG_NET_RX_BUSY_POLL */ static void napi_hash_add(struct napi_struct *napi) From patchwork Tue Dec 19 21:03:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499087 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D362B405DF for ; Tue, 19 Dec 2023 21:04:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="AdL53sTm" Received: by mail-pf1-f180.google.com with SMTP id d2e1a72fcca58-6d5c4cb8a4cso1857078b3a.3 for ; Tue, 19 Dec 2023 13:04:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019861; x=1703624661; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=88y6GqmlC4OquYgwdteTX7sFZdaMlDFOxYfP/ATHY4U=; b=AdL53sTmXAOuiJZ0HZu2jKQQj+9POMGtPmzl1IEyLlF1LowjmXhdnVlu1mKcdhfnc2 qBYU/UAj2+PRoy00+6Y4vCzpDFcq5z69+0PIiZ5XZFCRVLUc6NK5GGQoAgjogllawytu WeJTuj7YvAfuuVnNdfZ0sdYEMVR3xhvn1WV1WdPquQyEMCuVUX5lO4FjAZdSm3t8v1Q2 XwDTQ7/yDilBSgZU5mpa3NQXJ7hSeEX2+IUasxrn+BmLZ0ASQ/n/idbCPdTLj1hIayCy ZmlrhaCPU8ybwCGnmgF4+raQuC/TLH8nwtLtGHmKp6hBFgLJYBLNR+fRI9GbE9hKh0/E urmg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019861; x=1703624661; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=88y6GqmlC4OquYgwdteTX7sFZdaMlDFOxYfP/ATHY4U=; b=uXSUFFGw5EM/CRB0uJGTIw1s14s2EyeAObHeY45q6X7R6kTmzwH5rqPnduOcSDoUs9 oSx4mF2JmPQo1NEZWqv7RLSKlDG5vL/TtydNC9dcAklRwkjHO/974U17VhUWjyI+rXOl 8J7WI/cjFaMnLz2GLAA0WFkYeBFxi8yHss11OUyKxjYCMUQ440dxF9beQE7YysYOJ0NC GRj07QySHxUn8ItM6At5nOkcy4wLEZsJZLyvLYqpuNiuHB4Jwm+5V5BEh66DhGcX8xcQ D9mmfVapKyZROlarzTGiwSERYQD5WYoFYck4Xh04b4SOZ8eQS3PCsUE2NB7KPQ9V94s3 OpeA== X-Gm-Message-State: AOJu0YzW4kvY7hwpVuy52vHtIzeu5iiUYmb8S2Dd0ku5FrvnP+fv9lnd /CYP7Iu8D4OBk5PgSu+hY8sHgvGe3V28R3LtF0KVLQ== X-Google-Smtp-Source: AGHT+IEpQz41IIokOuYaoRjPu5IHvUE6igpmNn5ccwm0j3KSVHCAAgpLK/9gBFnbdEc/qrCQqXJ9+g== X-Received: by 2002:a17:902:e88f:b0:1d0:7072:e241 with SMTP id w15-20020a170902e88f00b001d07072e241mr12430623plg.49.1703019861044; Tue, 19 Dec 2023 13:04:21 -0800 (PST) Received: from localhost (fwdproxy-prn-010.fbsv.net. [2a03:2880:ff:a::face:b00c]) by smtp.gmail.com with ESMTPSA id az4-20020a170902a58400b001b7f40a8959sm21395712plb.76.2023.12.19.13.04.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:20 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback Date: Tue, 19 Dec 2023 13:03:54 -0800 Message-Id: <20231219210357.4029713-18-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov Currently, if user fails to keep up with the network and doesn't refill the buffer ring fast enough the NIC/driver will start dropping packets. That might be too punishing. Add a fallback path, which would allow drivers to allocate normal pages when there is starvation, then zc_rx_recv_skb() we'll detect them and copy into the user specified buffers, when they become available. That should help with adoption and also help the user striking the right balance allocating just the right amount of zerocopy buffers but also being resilient to sudden surges in traffic. 
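In outline, the per-frag decision that zc_rx_recv_frag() makes after this patch looks like the following. This is a simplified sketch rather than the diff below: post_zc_cqe() is a made-up stand-in for the inline cqe filling, and error handling plus the per-page iteration of the copy path are omitted.

static int recv_frag_outline(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
			     int off, int len, unsigned sock_idx)
{
	struct page *page = skb_frag_page(frag);
	u8 *vaddr;
	int ret;

	off += skb_frag_off(frag);

	if (page_is_page_pool_iov(page)) {
		/* Data already sits in a user-visible zc buffer: take a user
		 * reference and post a completion describing it, no copy.
		 */
		return post_zc_cqe(ifq, page_to_page_pool_iov(page),
				   off, len, sock_idx);
	}

	/* The driver fell back to a normal kernel page while the refill ring
	 * was starved: copy the payload into freshly grabbed zc buffers so
	 * userspace still sees the data in its buffer area.
	 */
	vaddr = kmap_local_page(page);
	ret = zc_rx_copy_chunk(ifq, vaddr, off, len, sock_idx);
	kunmap_local(vaddr);
	return ret;
}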
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 126 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 120 insertions(+), 6 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index acb70ca23150..f7d99d569885 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -21,6 +22,11 @@ struct io_zc_rx_args { struct socket *sock; }; +struct io_zc_refill_data { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *buf; +}; + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) @@ -603,6 +609,39 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = { }; EXPORT_SYMBOL(io_uring_pp_zc_ops); +static void io_napi_refill(void *data) +{ + struct io_zc_refill_data *rd = data; + struct io_zc_rx_ifq *ifq = rd->ifq; + void *page; + + if (WARN_ON_ONCE(!ifq->pp)) + return; + + page = page_pool_dev_alloc_pages(ifq->pp); + if (!page) + return; + if (WARN_ON_ONCE(!page_is_page_pool_iov(page))) + return; + + rd->buf = io_iov_to_buf(page_to_page_pool_iov(page)); +} + +static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq) +{ + struct io_zc_refill_data rd = { + .ifq = ifq, + }; + + napi_execute(ifq->pp->p.napi, io_napi_refill, &rd); + return rd.buf; +} + +static inline void io_zc_return_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + ifq->cached_cq_tail--; +} + static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) { struct io_uring_rbuf_cqe *cqe; @@ -622,6 +661,51 @@ static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq * return cqe; } +static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, + unsigned int offset, size_t len, + unsigned sock_idx) +{ + size_t copy_size, copied = 0; + struct io_uring_rbuf_cqe *cqe; + struct io_zc_rx_buf *buf; + int ret = 0, off = 0; + u8 *vaddr; + + do { + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) { + ret = -ENOBUFS; + break; + } + buf = io_zc_get_buf_task_safe(ifq); + if (!buf) { + io_zc_return_rbuf_cqe(ifq); + ret = -ENOMEM; + break; + } + + vaddr = kmap_local_page(buf->page); + copy_size = min_t(size_t, PAGE_SIZE, len); + memcpy(vaddr, data + offset, copy_size); + kunmap_local(vaddr); + + cqe->region = 0; + cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off; + cqe->len = copy_size; + cqe->flags = 0; + cqe->sock = sock_idx; + + io_zc_rx_get_buf_uref(buf); + page_pool_iov_put_many(&buf->ppiov, 1); + + offset += copy_size; + len -= copy_size; + copied += copy_size; + } while (offset < len); + + return copied ? copied : ret; +} + static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, int off, int len, unsigned sock_idx) { @@ -650,7 +734,22 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, cqe->sock = sock_idx; cqe->flags = 0; } else { - return -EOPNOTSUPP; + struct page *page = skb_frag_page(frag); + u32 p_off, p_len, t, copied = 0; + u8 *vaddr; + int ret = 0; + + skb_frag_foreach_page(frag, off, len, + page, p_off, p_len, t) { + vaddr = kmap_local_page(page); + ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len, sock_idx); + kunmap_local(vaddr); + + if (ret < 0) + return copied ? 
copied : ret; + copied += ret; + } + len = copied; } return len; @@ -665,15 +764,30 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, struct socket *sock = args->sock; unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK; struct sk_buff *frag_iter; - unsigned start, start_off; + unsigned start, start_off = offset; int i, copy, end, off; int ret = 0; - start = skb_headlen(skb); - start_off = offset; + if (unlikely(offset < skb_headlen(skb))) { + ssize_t copied; + size_t to_copy; - if (offset < start) - return -EOPNOTSUPP; + to_copy = min_t(size_t, skb_headlen(skb) - offset, len); + copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy, + sock_idx); + if (copied < 0) { + ret = copied; + goto out; + } + offset += copied; + len -= copied; + if (!len) + goto out; + if (offset != skb_headlen(skb)) + goto out; + } + + start = skb_headlen(skb); for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { const skb_frag_t *frag; From patchwork Tue Dec 19 21:03:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499088 Received: from mail-oo1-f43.google.com (mail-oo1-f43.google.com [209.85.161.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 329DF40BF3 for ; Tue, 19 Dec 2023 21:04:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="EJ4kPyrn" Received: by mail-oo1-f43.google.com with SMTP id 006d021491bc7-591c3dc7265so2418596eaf.0 for ; Tue, 19 Dec 2023 13:04:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019862; x=1703624662; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=cIyS0qBLv2Ng8nfVCtfxHFE9coSuv9Yrxpz6D1RsN7A=; b=EJ4kPyrnCLitPryvfTEVArLlCVcYyLCXDto2g82tCT15oFvzVGUgRzScnbWKo6qEjv 9PCMqpC/OZkop3Eazrw6kRU+gA+eu1zYgodHDmcg9wgN68d5zaa7PPn8XZgpOfPYz+ld 7pai523V4SNVMAVmovIRUL7SiX5wLuGxt5Csliv0OLj3sP8rdYx92xliFvXmuHHcOHWo AmeTgpiebhyqVIuXXDBfnFqm1eSy+r5SgZDbbW27K4QfQAq3TMl1GSa3PwcoDx02QUve 6MY7KDLmI98kjRbwYMLS+4jNiX5dbBL+I2iNSX4KxwrLjzWE7GvftT4jl/GyuoE+hXLb Sfhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019862; x=1703624662; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=cIyS0qBLv2Ng8nfVCtfxHFE9coSuv9Yrxpz6D1RsN7A=; b=b7+xSoA4BlFNE69++XXJmT8irUNy70Py3YD08Gemo9z784TIyFrlxooClv7+Ih96Br H4Fe2hUe4cKWmq5GvJju9RrdOGnjzjrxdsk5ZkD+l7i1SNRbk1Jsx8AGL/x+RUccJjmC Mw+1VlED7KTOIjVbRM5KfalZtpJkamMDp5yqym7va7fDnzyFp36jokqnlFgNOSk0A0T7 ZMJ3FxK8tmpOrhCVmMXnsz2O+5m4c47r12/Q4KoDwQlpiZ6jm7+X6pnWNvYELECCP/c0 7A6PhtpRyMv/gBKLZJeWhyh7xJDv8YTCbwFwnWPpQ0ko7bA96WpDbRytn9FCxfk/bO/1 c1GA== X-Gm-Message-State: AOJu0Yzv9ZJWMqTSOdHdFbUGiDPrzK3EuY9bG0feORQfQ7P3pLEGZlo4 nVPNswm90lq/CTdV0n2dFMT+J1SNsqR5fSH1PuAfUw== X-Google-Smtp-Source: AGHT+IFvVsl5rOhi1y2F4/tSu5S+c7GqtoW/ErMsl6lkfBz1N/WlWCxRk6BDJFQtIHdc0xjgKGXyDA== X-Received: 
by 2002:a05:6358:c325:b0:173:50b:26ed with SMTP id fk37-20020a056358c32500b00173050b26edmr286038rwb.36.1703019862058; Tue, 19 Dec 2023 13:04:22 -0800 (PST) Received: from localhost (fwdproxy-prn-005.fbsv.net. [2a03:2880:ff:5::face:b00c]) by smtp.gmail.com with ESMTPSA id e7-20020a056a001a8700b006ce835b77d9sm3615155pfv.20.2023.12.19.13.04.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:21 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 18/20] veth: add support for io_uring zc rx Date: Tue, 19 Dec 2023 13:03:55 -0800 Message-Id: <20231219210357.4029713-19-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov NOT FOR UPSTREAM, TESTING ONLY. Add io_uring zerocopy support for veth. It's not actually zerocopy, we copy data in napi, which is early enough in the stack to be useful for testing. Note, we'll need some virtual dev support for testing, but that should not be in the way of real workloads. Signed-off-by: David Wei --- drivers/net/veth.c | 211 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 205 insertions(+), 6 deletions(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 57efb3454c57..dd00e172979f 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #define DRV_NAME "veth" @@ -75,6 +76,7 @@ struct veth_priv { struct bpf_prog *_xdp_prog; struct veth_rq *rq; unsigned int requested_headroom; + bool zc_installed; }; struct veth_xdp_tx_bq { @@ -335,9 +337,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev, const struct net_device *rcv, const struct sk_buff *skb) { + struct veth_priv *rcv_priv = netdev_priv(rcv); + return !(dev->features & NETIF_F_ALL_TSO) || (skb->destructor == sock_wfree && - rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)); + rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)) || + rcv_priv->zc_installed; } static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) @@ -726,6 +731,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq, struct sk_buff *skb = *pskb; u32 frame_sz; + if (WARN_ON_ONCE(1)) + return -EFAULT; + if (skb_shared(skb) || skb_head_is_locked(skb) || skb_shinfo(skb)->nr_frags || skb_headroom(skb) < XDP_PACKET_HEADROOM) { @@ -827,6 +835,90 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq, return -ENOMEM; } +static noinline struct sk_buff *veth_iou_rcv_skb(struct veth_rq *rq, + struct sk_buff *skb) +{ + struct sk_buff *nskb; + u32 size, len, off, max_head_size; + struct page *page; + int ret, i, head_off; + void *vaddr; + + /* Testing only, randomly send normal pages to test copy fallback */ + if (ktime_get_ns() % 16 == 0) + return skb; + + skb_prepare_for_gro(skb); + max_head_size = skb_headlen(skb); + + rcu_read_lock(); + nskb = napi_alloc_skb(&rq->xdp_napi, max_head_size); + if (!nskb) + goto drop; + + skb_copy_header(nskb, skb); + skb_mark_for_recycle(nskb); + + size = max_head_size; + if (skb_copy_bits(skb, 0, nskb->data, size)) { + consume_skb(nskb); + goto drop; + } + skb_put(nskb, size); + 
head_off = skb_headroom(nskb) - skb_headroom(skb); + skb_headers_offset_update(nskb, head_off); + + /* Allocate paged area of new skb */ + off = size; + len = skb->len - off; + + for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) { + struct io_zc_rx_buf *buf; + void *ppage; + + ppage = page_pool_dev_alloc_pages(rq->page_pool); + if (!ppage) { + consume_skb(nskb); + goto drop; + } + if (WARN_ON_ONCE(!page_is_page_pool_iov(ppage))) { + consume_skb(nskb); + goto drop; + } + + buf = container_of(page_to_page_pool_iov(ppage), + struct io_zc_rx_buf, ppiov); + page = buf->page; + + if (WARN_ON_ONCE(buf->ppiov.pp != rq->page_pool)) + goto drop; + + size = min_t(u32, len, PAGE_SIZE); + skb_add_rx_frag(nskb, i, ppage, 0, size, PAGE_SIZE); + + vaddr = kmap_atomic(page); + ret = skb_copy_bits(skb, off, vaddr, size); + kunmap_atomic(vaddr); + + if (ret) { + consume_skb(nskb); + goto drop; + } + len -= size; + off += size; + } + rcu_read_unlock(); + + consume_skb(skb); + skb = nskb; + return skb; +drop: + rcu_read_unlock(); + kfree_skb(skb); + return NULL; +} + + static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, struct veth_xdp_tx_bq *bq, @@ -970,8 +1062,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, /* ndo_start_xmit */ struct sk_buff *skb = ptr; - stats->xdp_bytes += skb->len; - skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + if (rq->page_pool->p.memory_provider == PP_MP_IOU_ZCRX) { + skb = veth_iou_rcv_skb(rq, skb); + } else { + stats->xdp_bytes += skb->len; + skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + } + if (skb) { if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC)) netif_receive_skb(skb); @@ -1030,15 +1127,21 @@ static int veth_poll(struct napi_struct *napi, int budget) return done; } -static int veth_create_page_pool(struct veth_rq *rq) +static int veth_create_page_pool(struct veth_rq *rq, struct io_zc_rx_ifq *ifq) { struct page_pool_params pp_params = { .order = 0, .pool_size = VETH_RING_SIZE, .nid = NUMA_NO_NODE, .dev = &rq->dev->dev, + .napi = &rq->xdp_napi, }; + if (ifq) { + pp_params.mp_priv = ifq; + pp_params.memory_provider = PP_MP_IOU_ZCRX; + } + rq->page_pool = page_pool_create(&pp_params); if (IS_ERR(rq->page_pool)) { int err = PTR_ERR(rq->page_pool); @@ -1056,7 +1159,7 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end) int err, i; for (i = start; i < end; i++) { - err = veth_create_page_pool(&priv->rq[i]); + err = veth_create_page_pool(&priv->rq[i], NULL); if (err) goto err_page_pool; } @@ -1112,9 +1215,17 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end) for (i = start; i < end; i++) { struct veth_rq *rq = &priv->rq[i]; + void *ptr; + int nr = 0; rq->rx_notify_masked = false; - ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free); + + while ((ptr = ptr_ring_consume(&rq->xdp_ring))) { + veth_ptr_free(ptr); + nr++; + } + + ptr_ring_cleanup(&rq->xdp_ring, NULL); } for (i = start; i < end; i++) { @@ -1350,6 +1461,9 @@ static int veth_set_channels(struct net_device *dev, struct net_device *peer; int err; + if (priv->zc_installed) + return -EINVAL; + /* sanity check. 
Upper bounds are already enforced by the caller */ if (!ch->rx_count || !ch->tx_count) return -EINVAL; @@ -1427,6 +1541,8 @@ static int veth_open(struct net_device *dev) struct net_device *peer = rtnl_dereference(priv->peer); int err; + priv->zc_installed = false; + if (!peer) return -ENOTCONN; @@ -1604,6 +1720,84 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr) rcu_read_unlock(); } +static int __veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + bool napi_already_on = veth_gro_requested(dev) && (dev->flags & IFF_UP); + unsigned qid = xdp->zc_rx.queue_id; + struct veth_priv *priv = netdev_priv(dev); + struct net_device *peer; + struct veth_rq *rq; + int ret; + + if (priv->_xdp_prog) + return -EINVAL; + if (qid >= dev->real_num_rx_queues) + return -EINVAL; + if (!(dev->flags & IFF_UP)) + return -EOPNOTSUPP; + if (dev->real_num_rx_queues != 1) + return -EINVAL; + rq = &priv->rq[qid]; + + if (!xdp->zc_rx.ifq) { + if (!priv->zc_installed) + return -EINVAL; + + veth_napi_del(dev); + priv->zc_installed = false; + if (!veth_gro_requested(dev) && netif_running(dev)) { + dev->features &= ~NETIF_F_GRO; + netdev_features_change(dev); + } + return 0; + } + + if (priv->zc_installed) + return -EINVAL; + + peer = rtnl_dereference(priv->peer); + peer->hw_features &= ~NETIF_F_GSO_SOFTWARE; + + ret = veth_create_page_pool(rq, xdp->zc_rx.ifq); + if (ret) + return ret; + + ret = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL); + if (ret) { + page_pool_destroy(rq->page_pool); + rq->page_pool = NULL; + return ret; + } + + priv->zc_installed = true; + + if (!veth_gro_requested(dev)) { + /* user-space did not require GRO, but adding XDP + * is supposed to get GRO working + */ + dev->features |= NETIF_F_GRO; + netdev_features_change(dev); + } + if (!napi_already_on) { + netif_napi_add(dev, &rq->xdp_napi, veth_poll); + napi_enable(&rq->xdp_napi); + rcu_assign_pointer(rq->napi, &rq->xdp_napi); + } + return 0; +} + +static int veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + int ret; + + rtnl_lock(); + ret = __veth_iou_set(dev, xdp); + rtnl_unlock(); + return ret; +} + static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, struct netlink_ext_ack *extack) { @@ -1613,6 +1807,9 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, unsigned int max_mtu; int err; + if (priv->zc_installed) + return -EINVAL; + old_prog = priv->_xdp_prog; priv->_xdp_prog = prog; peer = rtnl_dereference(priv->peer); @@ -1691,6 +1888,8 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return veth_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_ZC_RX: + return veth_iou_set(dev, xdp); default: return -EINVAL; } From patchwork Tue Dec 19 21:03:56 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499089 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B180340BFD for ; Tue, 19 Dec 2023 21:04:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) 
header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="KC8owpaR" Received: by mail-pj1-f52.google.com with SMTP id 98e67ed59e1d1-28b82dc11e6so1433483a91.1 for ; Tue, 19 Dec 2023 13:04:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019863; x=1703624663; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=aEQ6N11l7v8r/nsFkAQ5qbz/E3VFP+byl5ecjp5V+Xo=; b=KC8owpaRkA58F5+Z25p+K2xkowfxp4/q1LQm6UrBkDSn+16W93lYhGwzbSTPHeZKP5 Ya2DbPp6CM7EzydW+fWitqJpTQRZLe2hEBN248j6JTuAR3yTCiHaQ/07/1vTR71ua7SV rwmuLIHEzLtKOCHNXLBG4HkuJUlINu0BLdhoR29q3TlfKw+/Tmj0uGWEZHEHw941DLI6 WMYnH7K8DKyd0/SRAlqHFsyAupiYqi2h8zohThRcH327GfeFQTTEAax9wgQPCnQiUZET JSEKRluk4n2rgC9eSCPcsXLUvSKP5tir6XtvYDZK4a79MuJMS7w8NimN/FQBYVrWEZvm PYCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019863; x=1703624663; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aEQ6N11l7v8r/nsFkAQ5qbz/E3VFP+byl5ecjp5V+Xo=; b=MU0gOHsE+GLi1FD8fFOBuc8+VzT7c62bj5jqMkWwIZMvysEnW8ybU8OFWVOZ+3qxbq ZNQhlApS88jctWtv4HDFfqgP2zBUdFdv7IbMXv8BqKzdjcZQyNgFL12iBx4PWvllKxlN nBK1n7eWTdgypBjAbJUyHDjQtKUD9u3+j8EnWjAJG8AXVtTDqxx0fyjv40Pv215hCRcs yerHrnIDjkGMqUI1JLlmyxt0jKIlc9NGZmmV37PjHzoOi99y+srNee3UVqp8iVueIyCS +h1R+xNsmmq578py0Dilv21UIL+9iPrJctjo/oJfEITofMg1u6bIFaGNudJ3ITm/IFWk aoMw== X-Gm-Message-State: AOJu0YwOopVvArx9G8JarDCOPeveSC9dI+y1O27e+YzI2koZClhTgr13 uYK1B4CWLEicLIQzMv6I3RzJTr1twP0rOAySBAYpAw== X-Google-Smtp-Source: AGHT+IEMwF0JJgfJeogY5EMwB8SgA6SA/QjJDc0LnTgjbd/Zd8UA2CKgdeV6xAqemSSppOsZza1xMQ== X-Received: by 2002:a17:90a:d790:b0:28b:2e12:4fb3 with SMTP id z16-20020a17090ad79000b0028b2e124fb3mr3330341pju.33.1703019862907; Tue, 19 Dec 2023 13:04:22 -0800 (PST) Received: from localhost (fwdproxy-prn-002.fbsv.net. [2a03:2880:ff:2::face:b00c]) by smtp.gmail.com with ESMTPSA id dj15-20020a17090ad2cf00b0028bbf4c0264sm1752924pjb.10.2023.12.19.13.04.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:22 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get Date: Tue, 19 Dec 2023 13:03:56 -0800 Message-Id: <20231219210357.4029713-20-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov io_uring pp memory provider doesn't have contiguous dma addresses, implement page_pool_iov_dma_addr() via callbacks. Note: it might be better to stash dma address into struct page_pool_iov. 
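For comparison, the alternative mentioned in the note would look roughly like the following (the dma_addr field is hypothetical and not part of this series); it trades one extra field per ppiov, filled in by the memory provider when the buffer is set up, for avoiding an indirect call on the hot path:

static inline dma_addr_t
page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
{
	/* hypothetical: dma_addr stashed in the ppiov at registration time
	 * by the memory provider, instead of calling back into the provider
	 * for every packet
	 */
	return ppiov->dma_addr;
}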
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/page_pool/helpers.h | 5 +---- include/net/page_pool/types.h | 2 ++ io_uring/zc_rx.c | 8 ++++++++ net/core/page_pool.c | 9 +++++++++ 4 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index aca3a52d0e22..10dba1f2aa0c 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -105,10 +105,7 @@ static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov) static inline dma_addr_t page_pool_iov_dma_addr(const struct page_pool_iov *ppiov) { - struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); - - return owner->base_dma_addr + - ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); + return ppiov->pp->mp_ops->ppiov_dma_addr(ppiov); } static inline unsigned long diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index f54ee759e362..1b9266835ab6 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -125,6 +125,7 @@ struct page_pool_stats { #endif struct mem_provider; +struct page_pool_iov; enum pp_memory_provider_type { __PP_MP_NONE, /* Use system allocator directly */ @@ -138,6 +139,7 @@ struct pp_memory_provider_ops { void (*scrub)(struct page_pool *pool); struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp); bool (*release_page)(struct page_pool *pool, struct page *page); + dma_addr_t (*ppiov_dma_addr)(const struct page_pool_iov *ppiov); }; extern const struct pp_memory_provider_ops dmabuf_devmem_ops; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index f7d99d569885..20fb89e6bad7 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -600,12 +600,20 @@ static void io_pp_zc_destroy(struct page_pool *pp) percpu_ref_put(&ifq->ctx->refs); } +static dma_addr_t io_pp_zc_ppiov_dma_addr(const struct page_pool_iov *ppiov) +{ + struct io_zc_rx_buf *buf = io_iov_to_buf((struct page_pool_iov *)ppiov); + + return buf->dma; +} + const struct pp_memory_provider_ops io_uring_pp_zc_ops = { .alloc_pages = io_pp_zc_alloc_pages, .release_page = io_pp_zc_release_page, .init = io_pp_zc_init, .destroy = io_pp_zc_destroy, .scrub = io_pp_zc_scrub, + .ppiov_dma_addr = io_pp_zc_ppiov_dma_addr, }; EXPORT_SYMBOL(io_uring_pp_zc_ops); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ebf5ff009d9d..6586631ecc2e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -1105,10 +1105,19 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, return true; } +static dma_addr_t mp_dmabuf_devmem_ppiov_dma_addr(const struct page_pool_iov *ppiov) +{ + struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); + + return owner->base_dma_addr + + ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); +} + const struct pp_memory_provider_ops dmabuf_devmem_ops = { .init = mp_dmabuf_devmem_init, .destroy = mp_dmabuf_devmem_destroy, .alloc_pages = mp_dmabuf_devmem_alloc_pages, .release_page = mp_dmabuf_devmem_release_page, + .ppiov_dma_addr = mp_dmabuf_devmem_ppiov_dma_addr, }; EXPORT_SYMBOL(dmabuf_devmem_ops); From patchwork Tue Dec 19 21:03:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499090 Received: from mail-ot1-f43.google.com (mail-ot1-f43.google.com [209.85.210.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with 
ESMTPS id C5BE340C07 for ; Tue, 19 Dec 2023 21:04:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="WJhzViu7" Received: by mail-ot1-f43.google.com with SMTP id 46e09a7af769-6d9f4eed60eso3868580a34.1 for ; Tue, 19 Dec 2023 13:04:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019864; x=1703624664; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=UaWUxwWMhpizTTTVDJEx3BaZ8y03KGcwpzKbAhXUqpk=; b=WJhzViu7VLsvKmdEJt8RM9QWBU830WM/jDux05YkwR1Sejx4gfuMWXu4yNwnOGbos5 oQsaO0qiJ0BG4SS5l/BUNx5l/E4rhlw9uH+32WCTRp+RXtIsv4mpQrYzuoP0M8Bj7eiW mpDi/NPj/Dte0Ng7ooZSM34nvjkGVFy/52OSU3VFZYJ8k3zRC5Q5wPvAk9khZMoYPqMW UrYonltAPTRMxpkQAEP3wcdpb6XLzXj5He4SC4ta+F/f5rmQb/7sTpRxxpL53cj5ESBt ywPHELf6d1avOHjRBIbh+QRU7LHPse2X10Q6tGV0Xfn2Wl1A9PtJMbltrXgL6JqBbtRn 3B/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019864; x=1703624664; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UaWUxwWMhpizTTTVDJEx3BaZ8y03KGcwpzKbAhXUqpk=; b=xKUXxBpz8PK3JETYA5wWkfHHSS6xudpahseKxw8ap0NzyqGh8s/goQ7/sh9dgPOtG6 sHqTrwDyWSToDAFhUH8a9/aOGugLOuoZc6MOQuZu3fRn8cJECZVy5YMRbbm6hrwO35Ln kSQWmd5X/8LramXF+e8C0akqgpJ5OC2hGeHfo1bjB49sNSAvy7lEsDVTzdM5v1zRdoZ0 XKDW36BgG9vVl4S2ghSyH5Gktfkpn+5M3sFX4/OpdREgh3qyiwwwA2Y1IopdMIJb9eXy SyBsENGw8F7BQurvhDoXNC5xHrYrEqhhGRF00c2spH90lvbf6+75w7DDZIgSWl0lO1lx fWtw== X-Gm-Message-State: AOJu0YzEbGT15o9a2DmUXBNVqkWYvWJOlyndLe6Dh99X5uZoFqiQLB4j 3wk9hA1xmG1RYG3J0Q9U16VQ+wmDSxoloUJABE8qwQ== X-Google-Smtp-Source: AGHT+IEtTwvrmgbTpMQqNKAuQVjNzW+eoJccv8bAayAZyfBOqrS0CFXslrtBQSxc+fixpkFaqhl1Kw== X-Received: by 2002:a05:6358:99a0:b0:170:17eb:203c with SMTP id j32-20020a05635899a000b0017017eb203cmr19345106rwb.37.1703019863804; Tue, 19 Dec 2023 13:04:23 -0800 (PST) Received: from localhost (fwdproxy-prn-118.fbsv.net. [2a03:2880:ff:76::face:b00c]) by smtp.gmail.com with ESMTPSA id fn6-20020a056a002fc600b006d838632671sm3803511pfb.101.2023.12.19.13.04.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:23 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool Date: Tue, 19 Dec 2023 13:03:57 -0800 Message-Id: <20231219210357.4029713-21-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov TESTING ONLY Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 71 +++++++++++++++++-- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 7 ++ drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 3 + 3 files changed, 75 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 039f8d995a26..d9fb8633f226 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -55,6 +55,7 @@ #include #include #include +#include #include "bnxt_hsi.h" #include "bnxt.h" @@ -875,6 +876,25 @@ static inline u8 *__bnxt_alloc_rx_frag(struct bnxt *bp, dma_addr_t *mapping, return data; } +static inline struct page *bnxt_get_real_page(struct page *page) +{ + struct io_zc_rx_buf *buf; + + if (page_is_page_pool_iov(page)) { + buf = container_of(page_to_page_pool_iov(page), + struct io_zc_rx_buf, ppiov); + page = buf->page; + } + return page; +} + +static inline void *bnxt_get_page_address(struct page *frag) +{ + struct page *page = bnxt_get_real_page(frag); + + return page_address(page); +} + int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 prod, gfp_t gfp) { @@ -892,7 +912,7 @@ int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, mapping += bp->rx_dma_offset; rx_buf->data = page; - rx_buf->data_ptr = page_address(page) + offset + bp->rx_offset; + rx_buf->data_ptr = bnxt_get_page_address(page) + offset + bp->rx_offset; } else { u8 *data = __bnxt_alloc_rx_frag(bp, &mapping, gfp); @@ -954,8 +974,9 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp, if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE) page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp); - else + else { page = __bnxt_alloc_rx_64k_page(bp, &mapping, rxr, gfp, &offset); + } if (!page) return -ENOMEM; @@ -1079,6 +1100,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp, return NULL; } skb_mark_for_recycle(skb); + skb_reserve(skb, bp->rx_offset); __skb_put(skb, len); @@ -1118,7 +1140,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp, } skb_mark_for_recycle(skb); - off = (void *)data_ptr - page_address(page); + off = (void *)data_ptr - bnxt_get_page_address(page); skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE); memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN, payload + NET_IP_ALIGN); @@ -2032,7 +2054,6 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr, goto next_rx; } } else { - skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr->page_pool, &xdp, rxcmp1); if (!skb) { /* we should be able to free the old skb here */ bnxt_xdp_buff_frags_free(rxr, &xdp); @@ -3402,7 +3423,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp) } static int bnxt_alloc_rx_page_pool(struct bnxt *bp, - struct bnxt_rx_ring_info *rxr) + struct bnxt_rx_ring_info *rxr, + int qid) { struct page_pool_params pp = { 0 }; @@ -3416,6 +3438,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp, pp.max_len = PAGE_SIZE; 
pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; + if (bp->iou_ifq && qid == bp->iou_qid) { + pp.mp_priv = bp->iou_ifq; + pp.memory_provider = PP_MP_IOU_ZCRX; + pp.max_len = PAGE_SIZE; + pp.flags = 0; + } + rxr->page_pool = page_pool_create(&pp); if (IS_ERR(rxr->page_pool)) { int err = PTR_ERR(rxr->page_pool); @@ -3442,7 +3471,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp) ring = &rxr->rx_ring_struct; - rc = bnxt_alloc_rx_page_pool(bp, rxr); + rc = bnxt_alloc_rx_page_pool(bp, rxr, i); if (rc) return rc; @@ -14347,6 +14376,36 @@ void bnxt_print_device_info(struct bnxt *bp) pcie_print_link_status(bp->pdev); } +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp) +{ + unsigned ifq_idx = xdp->zc_rx.queue_id; + + if (ifq_idx >= bp->rx_nr_rings) + return -EINVAL; + if (PAGE_SIZE != BNXT_RX_PAGE_SIZE) + return -EINVAL; + + bnxt_rtnl_lock_sp(bp); + if (!!bp->iou_ifq == !!xdp->zc_rx.ifq) { + bnxt_rtnl_unlock_sp(bp); + return -EINVAL; + } + if (netif_running(bp->dev)) { + int rc; + + bnxt_ulp_stop(bp); + bnxt_close_nic(bp, true, false); + + bp->iou_qid = ifq_idx; + bp->iou_ifq = xdp->zc_rx.ifq; + + rc = bnxt_open_nic(bp, true, false); + bnxt_ulp_start(bp, rc); + } + bnxt_rtnl_unlock_sp(bp); + return 0; +} + static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { struct net_device *dev; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index e31164e3b8fb..1003f9260805 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -2342,6 +2342,10 @@ struct bnxt { #endif u32 thermal_threshold_type; enum board_idx board_idx; + + /* io_uring zerocopy */ + void *iou_ifq; + unsigned iou_qid; }; #define BNXT_NUM_RX_RING_STATS 8 @@ -2556,4 +2560,7 @@ int bnxt_get_port_parent_id(struct net_device *dev, void bnxt_dim_work(struct work_struct *work); int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi); void bnxt_print_device_info(struct bnxt *bp); + +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp); + #endif diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 4791f6a14e55..a3ae02c31ffc 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -466,6 +466,9 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: rc = bnxt_xdp_set(bp, xdp->prog); break; + case XDP_SETUP_ZC_RX: + return bnxt_zc_rx(bp, xdp); + break; default: rc = -EINVAL; break;