From patchwork Tue Dec 19 21:03:38 2023
From: David Wei
Subject: [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper
Date: Tue, 19 Dec 2023 13:03:38 -0800
Message-Id: <20231219210357.4029713-2-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

The final version will depend on what ppiov ends up looking like, but
add a convenience helper for now.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/helpers.h | 5 +++++
 net/core/page_pool.c            | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 95f4d579cbc4..92804c499833 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -86,6 +86,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 
 /* page_pool_iov support */
 
+static inline struct page *page_pool_mangle_ppiov(struct page_pool_iov *ppiov)
+{
+	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+}
+
 static inline struct dmabuf_genpool_chunk_owner *
 page_pool_iov_owner(const struct page_pool_iov *ppiov)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c0bc62ee77c6..38eff947f679 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1074,7 +1074,7 @@ static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, (struct page *)ppiov,
 				   pool->pages_state_hold_cnt);
-	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+	return page_pool_mangle_ppiov(ppiov);
 }
 
 static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
From patchwork Tue Dec 19 21:03:39 2023
From: David Wei
Subject: [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
Date: Tue, 19 Dec 2023 13:03:39 -0800
Message-Id: <20231219210357.4029713-3-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

There will be more users of struct page_pool_iov, and ppiovs from one
subsystem must not be used by another. That should never happen for any
sane application, but we need to enforce it in case of bugs and/or
malicious users.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 net/ipv4/tcp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33a8bb63fbf5..9c6b18eebb5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
 		}
 
 		ppiov = skb_frag_page_pool_iov(frag);
+
+		/* Disallow non devmem owned buffers */
+		if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
+			err = -ENODEV;
+			goto out;
+		}
+
 		end = start + skb_frag_size(frag);
 		copy = end - offset;
From patchwork Tue Dec 19 21:03:40 2023
From: David Wei
Subject: [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle
Date: Tue, 19 Dec 2023 13:03:40 -0800
Message-Id: <20231219210357.4029713-4-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM
The final version will depend on how the ppiov infra ends up looking.

The page pool tracks how many pages were allocated and returned, which is
used to refcount the pool, so every page/frag allocated should eventually
come back to the page pool via an appropriate path, e.g. by calling
page_pool_put_page().

For normal page pools (i.e. without memory providers attached), it's fine
to return a page while it's still referenced by something in the stack: we
"detach" the page from the pool and rely on the page refcount for it to
return back to the kernel.

Memory providers are different, at least ppiov based ones: they need all
their buffers to eventually return back. So, apart from custom pp
->release handlers, we catch the moment someone puts down a ppiov and call
its memory provider to handle it, i.e. __page_pool_iov_free().

The first problem is that __page_pool_iov_free() hard codes devmem
handling, while other providers need a flexible way to specify their own
callbacks. The second problem is that it doesn't go through the generic
page pool paths and so can't do the pp accounting mentioned above. And we
can't even safely rely on page_pool_put_page() having been called earlier
to do the pp refcounting, because by then the page pool might already have
been destroyed and ppiov->pp would point to garbage.

The solution is to make the pp ->release callback responsible for properly
recycling its buffers, e.g. calling what was __page_pool_iov_free() before
in the devmem case. page_pool_iov_put_many() will then return buffers to
the page pool.
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/helpers.h | 15 ++++++++---
 net/core/page_pool.c            | 46 +++++++++++++++++----------------
 2 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 92804c499833..ef380ee8f205 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -137,15 +137,22 @@ static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
 	refcount_add(count, &ppiov->refcount);
 }
 
-void __page_pool_iov_free(struct page_pool_iov *ppiov);
+static inline bool page_pool_iov_sub_and_test(struct page_pool_iov *ppiov,
+					      unsigned int count)
+{
+	return refcount_sub_and_test(count, &ppiov->refcount);
+}
 
 static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
 					  unsigned int count)
 {
-	if (!refcount_sub_and_test(count, &ppiov->refcount))
-		return;
+	if (count > 1)
+		WARN_ON_ONCE(page_pool_iov_sub_and_test(ppiov, count - 1));
 
-	__page_pool_iov_free(ppiov);
+#ifdef CONFIG_PAGE_POOL
+	page_pool_put_defragged_page(ppiov->pp, page_pool_mangle_ppiov(ppiov),
+				     -1, false);
+#endif
 }
 
 /* page pool mm helpers */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 38eff947f679..ecf90a1ccabe 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -599,6 +599,16 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 	page_pool_set_dma_addr(page, 0);
 }
 
+static void page_pool_return_provider(struct page_pool *pool, struct page *page)
+{
+	int count;
+
+	if (pool->mp_ops->release_page(pool, page)) {
+		count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
+		trace_page_pool_state_release(pool, page, count);
+	}
+}
+
 /* Disconnects a page (from a page_pool).  API users can have a need
  * to disconnect a page (from a page_pool), to allow it to be used as
  * a regular page (that will eventually be returned to the normal
@@ -607,13 +617,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
-	bool put;
 
-	put = true;
-	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
-		put = pool->mp_ops->release_page(pool, page);
-	else
-		__page_pool_release_page_dma(pool, page);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) {
+		page_pool_return_provider(pool, page);
+		return;
+	}
+
+	__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -621,10 +631,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	if (put) {
-		page_pool_clear_pp_info(page);
-		put_page(page);
-	}
+	page_pool_clear_pp_info(page);
+	put_page(page);
 
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -1034,15 +1042,6 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-void __page_pool_iov_free(struct page_pool_iov *ppiov)
-{
-	if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
-		return;
-
-	netdev_free_devmem(ppiov);
-}
-EXPORT_SYMBOL_GPL(__page_pool_iov_free);
-
 /*** "Dmabuf devmem memory provider" ***/
 
 static int mp_dmabuf_devmem_init(struct page_pool *pool)
@@ -1093,9 +1092,12 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
 		return false;
 
 	ppiov = page_to_page_pool_iov(page);
-	page_pool_iov_put_many(ppiov, 1);
-	/* We don't want the page pool put_page()ing our page_pool_iovs. */
-	return false;
+
+	if (!page_pool_iov_sub_and_test(ppiov, 1))
+		return false;
+	netdev_free_devmem(ppiov);
+	/* tell page_pool that the ppiov is released */
+	return true;
 }
 
 const struct pp_memory_provider_ops dmabuf_devmem_ops = {
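To make the new accounting concrete, an illustrative call flow (not code
from the patch) for dropping the last reference on a devmem ppiov:

/*
 * Illustrative flow after this patch, assuming a devmem ppiov whose
 * refcount drops to zero:
 *
 *   page_pool_iov_put_many(ppiov, 1)
 *     -> page_pool_put_defragged_page(ppiov->pp, mangled ppiov, -1, false)
 *        -> page_pool_return_page()
 *           -> page_pool_return_provider()
 *              -> pool->mp_ops->release_page()   // mp_dmabuf_devmem_release_page()
 *                   -> netdev_free_devmem(ppiov) // last ref gone, returns true
 *              -> pages_state_release_cnt++      // pp accounting stays balanced
 */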
From patchwork Tue Dec 19 21:03:41 2023
From: David Wei
Subject: [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov
Date: Tue, 19 Dec 2023 13:03:41 -0800
Message-Id: <20231219210357.4029713-5-dw@davidwei.uk>

From: Pavel Begunkov

NOT FOR UPSTREAM

Teach napi_pp_put_page() how to work with ppiov.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/helpers.h |  2 +-
 net/core/page_pool.c            |  3 ---
 net/core/skbuff.c               | 28 ++++++++++++++++------------
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index ef380ee8f205..aca3a52d0e22 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -381,7 +381,7 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
 	long ret;
 
 	if (page_is_page_pool_iov(page))
-		return -EINVAL;
+		return 0;
 
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ecf90a1ccabe..71af9835638e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -922,9 +922,6 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	struct page *page;
 
-	if (pool->destroy_cnt)
-		return;
-
 	/* Empty alloc cache, assume caller made sure this is
 	 * no-longer in use, and page_pool_alloc_pages() cannot be
 	 * call concurrently.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f44c53b0ca27..cf523d655f92 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -896,19 +896,23 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
 	bool allow_direct = false;
 	struct page_pool *pp;
 
-	page = compound_head(page);
-
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
+	if (page_is_page_pool_iov(page)) {
+		pp = page_to_page_pool_iov(page)->pp;
+	} else {
+		page = compound_head(page);
+
+		/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+		 * in order to preserve any existing bits, such as bit 0 for the
+		 * head page of compound page and bit 1 for pfmemalloc page, so
+		 * mask those bits for freeing side when doing below checking,
+		 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+		 * to avoid recycling the pfmemalloc page.
+		 */
+		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+			return false;
 
-	pp = page->pp;
+		pp = page->pp;
+	}
 
 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
From patchwork Tue Dec 19 21:03:42 2023
From: David Wei
Subject: [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback
Date: Tue, 19 Dec 2023 13:03:42 -0800
Message-Id: <20231219210357.4029713-6-dw@davidwei.uk>

From: Pavel Begunkov

The page pool now waits for all ppiovs to return before destroying
itself, and for that to happen the memory provider might need to push out
some buffers, flush caches and so on. Add an optional ->scrub callback
that the page pool invokes on release.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index a701310b9811..fd846cac9fb6 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -134,6 +134,7 @@ enum pp_memory_provider_type {
 struct pp_memory_provider_ops {
 	int (*init)(struct page_pool *pool);
 	void (*destroy)(struct page_pool *pool);
+	void (*scrub)(struct page_pool *pool);
 	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
 	bool (*release_page)(struct page_pool *pool, struct page *page);
 };
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 71af9835638e..9e3073d61a97 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -947,6 +947,8 @@ static int page_pool_release(struct page_pool *pool)
 {
 	int inflight;
 
+	if (pool->mp_ops && pool->mp_ops->scrub)
+		pool->mp_ops->scrub(pool);
 	page_pool_scrub(pool);
 	inflight = page_pool_inflight(pool);
 	if (!inflight)
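As a rough illustration of the consumer side (not part of the patch), a
provider that keeps buffers in a private cache might implement ->scrub
along these lines; my_mp_scrub() and my_provider_pop_cached() are
hypothetical names:

/* Hypothetical provider-side scrub: drain buffers still sitting in
 * provider-private caches so page_pool_release() can observe zero
 * inflight buffers. All my_* names are illustrative.
 */
static void my_mp_scrub(struct page_pool *pool)
{
	struct page_pool_iov *ppiov;

	/* my_provider_pop_cached() stands in for whatever private cache the
	 * provider keeps; putting each buffer sends it back through the pool. */
	while ((ppiov = my_provider_pop_cached(pool)))
		page_pool_iov_put_many(ppiov, 1);
}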
From patchwork Tue Dec 19 21:03:43 2023
From: David Wei
Subject: [RFC PATCH v3 06/20] io_uring: separate header for exported net bits
Date: Tue, 19 Dec 2023 13:03:43 -0800
Message-Id: <20231219210357.4029713-7-dw@davidwei.uk>

From: Pavel Begunkov

We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.

Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
Reviewed-by: Jens Axboe
---
 include/linux/io_uring.h     |  6 ------
 include/linux/io_uring/net.h | 18 ++++++++++++++++++
 io_uring/uring_cmd.c         |  1 +
 net/socket.c                 |  2 +-
 4 files changed, 20 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/io_uring/net.h

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index d8fc93492dc5..88d9aae7681b 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -12,7 +12,6 @@ void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
 const char *io_uring_get_opcode(u8 opcode);
-int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static inline void io_uring_files_cancel(void)
 {
@@ -49,11 +48,6 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
 	return "";
 }
-static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
-				    unsigned int issue_flags)
-{
-	return -EOPNOTSUPP;
-}
 #endif
 
 #endif
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
new file mode 100644
index 000000000000..b58f39fed4d5
--- /dev/null
+++ b/include/linux/io_uring/net.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_NET_H
+#define _LINUX_IO_URING_NET_H
+
+struct io_uring_cmd;
+
+#if defined(CONFIG_IO_URING)
+int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
+
+#else
+static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
+				    unsigned int issue_flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 34030583b9b2..c98749eff5ce 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include <linux/io_uring/net.h>
 #include
 #include
diff --git a/net/socket.c b/net/socket.c
index 3379c64217a4..d75246450a3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -88,7 +88,7 @@
 #include
 #include
 #include
-#include <linux/io_uring.h>
+#include <linux/io_uring/net.h>
 #include
 #include
From patchwork Tue Dec 19 21:03:44 2023
From: David Wei
Subject: [RFC PATCH v3 07/20] io_uring: add interface queue
Date: Tue, 19 Dec 2023 13:03:44 -0800
Message-Id: <20231219210357.4029713-8-dw@davidwei.uk>

From: David Wei

This patch introduces a new object in io_uring called an interface queue
(ifq), which contains:

* A pool region allocated by userspace and registered w/ io_uring where
  Rx data is written to.
* A net device and one specific Rx queue in it that will be configured
  for ZC Rx.
* A pair of shared ringbuffers w/ userspace, dubbed registered buf (rbuf)
  rings. Each entry contains a pool region id and an offset + len within
  that region. The kernel writes entries into the completion ring to tell
  userspace where Rx data is relative to the start of a region. Userspace
  writes entries into the refill ring to tell the kernel when it is done
  with the data.

For now, each io_uring instance has a single ifq, and each ifq has a
single pool region associated with one Rx queue.

Add a new opcode to io_uring_register that sets up an ifq. Size and
offsets of the shared ringbuffers are returned to userspace for it to
mmap. The implementation will be added in a later patch.
Signed-off-by: David Wei
---
 include/linux/io_uring_types.h |   8 +++
 include/uapi/linux/io_uring.h  |  51 +++++++++++++++
 io_uring/Makefile              |   2 +-
 io_uring/io_uring.c            |  13 ++++
 io_uring/zc_rx.c               | 116 +++++++++++++++++++++++++++++++++
 io_uring/zc_rx.h               |  37 +++++++++++
 6 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 io_uring/zc_rx.c
 create mode 100644 io_uring/zc_rx.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index bebab36abce8..e87053b200f2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -38,6 +38,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_COMPAT		= (1 << 12),
 };
 
+struct io_zc_rx_ifq;
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -182,6 +184,10 @@ struct io_rings {
 	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
+struct io_rbuf_ring {
+	struct io_uring rq, cq;
+};
+
 struct io_restriction {
 	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
 	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
@@ -383,6 +389,8 @@ struct io_ring_ctx {
 	struct io_rsrc_data		*file_data;
 	struct io_rsrc_data		*buf_data;
 
+	struct io_zc_rx_ifq		*ifq;
+
 	/* protected by ->uring_lock */
 	struct list_head		rsrc_ref_list;
 	struct io_alloc_cache		rsrc_node_cache;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f1c16f817742..024a6f79323b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -558,6 +558,9 @@ enum {
 	/* register a range of fixed file slots for automatic slot allocation */
 	IORING_REGISTER_FILE_ALLOC_RANGE	= 25,
 
+	/* register a network interface queue for zerocopy */
+	IORING_REGISTER_ZC_RX_IFQ		= 26,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
@@ -750,6 +753,54 @@ enum {
 	SOCKET_URING_OP_SETSOCKOPT,
 };
 
+struct io_uring_rbuf_rqe {
+	__u32	off;
+	__u32	len;
+	__u16	region;
+	__u8	__pad[6];
+};
+
+struct io_uring_rbuf_cqe {
+	__u32	off;
+	__u32	len;
+	__u16	region;
+	__u8	sock;
+	__u8	flags;
+	__u8	__pad[2];
+};
+
+struct io_rbuf_rqring_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	rqes;
+	__u8	__pad[4];
+};
+
+struct io_rbuf_cqring_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	cqes;
+	__u8	__pad[4];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZC_RX_IFQ
+ */
+struct io_uring_zc_rx_ifq_reg {
+	__u32	if_idx;
+	/* hw rx descriptor ring id */
+	__u32	if_rxq_id;
+	__u32	region_id;
+	__u32	rq_entries;
+	__u32	cq_entries;
+	__u32	flags;
+	__u16	cpu;
+
+	__u32	mmap_sz;
+	struct io_rbuf_rqring_offsets rq_off;
+	struct io_rbuf_cqring_offsets cq_off;
+};
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index e5be47e4fc3b..6c4b4ed37a1f 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -8,6 +8,6 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
 					cancel.o kbuf.o rsrc.o rw.o opdef.o \
-					notif.o waitid.o
+					notif.o waitid.o zc_rx.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1d254f2c997d..7fff01d57e9e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -95,6 +95,7 @@
 #include "notif.h"
 #include "waitid.h"
 #include "futex.h"
+#include "zc_rx.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2919,6 +2920,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
+	io_unregister_zc_rx_ifqs(ctx);
 	if (ctx->buf_data)
 		__io_sqe_buffers_unregister(ctx);
 	if (ctx->file_data)
@@ -3109,6 +3111,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		io_cqring_overflow_kill(ctx);
 		mutex_unlock(&ctx->uring_lock);
 	}
+	if (ctx->ifq) {
+		mutex_lock(&ctx->uring_lock);
+		io_shutdown_zc_rx_ifqs(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
 
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 		io_move_task_work_from_local(ctx);
@@ -4609,6 +4616,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_file_alloc_range(ctx, arg);
 		break;
+	case IORING_REGISTER_ZC_RX_IFQ:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_zc_rx_ifq(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
new file mode 100644
index 000000000000..5fc94cad5e3a
--- /dev/null
+++ b/io_uring/zc_rx.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#if defined(CONFIG_PAGE_POOL)
+#include
+#include
+#include
+#include
+
+#include
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "zc_rx.h"
+
+static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq,
+				 struct io_uring_zc_rx_ifq_reg *reg)
+{
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
+	size_t off, size, rq_size, cq_size;
+	void *ptr;
+
+	off = sizeof(struct io_rbuf_ring);
+	rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe);
+	cq_size = reg->cq_entries * sizeof(struct io_uring_rbuf_cqe);
+	size = off + rq_size + cq_size;
+	ptr = (void *) __get_free_pages(gfp, get_order(size));
+	if (!ptr)
+		return -ENOMEM;
+	ifq->ring = (struct io_rbuf_ring *)ptr;
+	ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off);
+	ifq->cqes = (struct io_uring_rbuf_cqe *)((char *)ifq->rqes + rq_size);
+	return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq)
+{
+	if (ifq->ring)
+		folio_put(virt_to_folio(ifq->ring));
+}
+
+static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+	struct io_zc_rx_ifq *ifq;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	ifq->if_rxq_id = -1;
+	ifq->ctx = ctx;
+	return ifq;
+}
+
+static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+{
+	io_free_rbuf_ring(ifq);
+	kfree(ifq);
+}
+
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+	struct io_uring_zc_rx_ifq_reg reg;
+	struct io_zc_rx_ifq *ifq;
+	int ret;
+
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EINVAL;
+	if (copy_from_user(&reg, arg, sizeof(reg)))
+		return -EFAULT;
+	if (ctx->ifq)
+		return -EBUSY;
+	if (reg.if_rxq_id == -1)
+		return -EINVAL;
+
+	ifq = io_zc_rx_ifq_alloc(ctx);
+	if (!ifq)
+		return -ENOMEM;
+
+	/* TODO: initialise network interface */
+
+	ret = io_allocate_rbuf_ring(ifq, &reg);
+	if (ret)
+		goto err;
+
+	/* TODO: map zc region and initialise zc pool */
+
+	ifq->rq_entries = reg.rq_entries;
+	ifq->cq_entries = reg.cq_entries;
+	ifq->if_rxq_id = reg.if_rxq_id;
+	ctx->ifq = ifq;
+
+	return 0;
+err:
+	io_zc_rx_ifq_free(ifq);
+	return ret;
+}
+
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+	struct io_zc_rx_ifq *ifq = ctx->ifq;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!ifq)
+		return;
+
+	ctx->ifq = NULL;
+	io_zc_rx_ifq_free(ifq);
+}
+
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+	lockdep_assert_held(&ctx->uring_lock);
+}
+
+#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
new file mode 100644
index 000000000000..aab57c1a4c5d
--- /dev/null
+++ b/io_uring/zc_rx.h
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+struct io_zc_rx_ifq {
+	struct io_ring_ctx		*ctx;
+	struct net_device		*dev;
+	struct io_rbuf_ring		*ring;
+	struct io_uring_rbuf_rqe	*rqes;
+	struct io_uring_rbuf_cqe	*cqes;
+	u32				rq_entries;
+	u32				cq_entries;
+
+	/* hw rx descriptor ring id */
+	u32				if_rxq_id;
+};
+
+#if defined(CONFIG_PAGE_POOL)
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg);
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+					struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
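For orientation, a hypothetical userspace caller might register an ifq
roughly as follows once the implementation lands; the raw
io_uring_register(2) syscall is used, and "eth0", queue 0 and the ring
sizes are placeholders:

/* Sketch only: registering a ZC Rx interface queue from userspace.
 * ring_fd must be an io_uring created with IORING_SETUP_DEFER_TASKRUN,
 * which the kernel side requires.
 */
#include <string.h>
#include <net/if.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>	/* patched uapi header with the new structs */

static int register_zc_rx_ifq(int ring_fd, struct io_uring_zc_rx_ifq_reg *reg)
{
	memset(reg, 0, sizeof(*reg));
	reg->if_idx     = if_nametoindex("eth0");
	reg->if_rxq_id  = 0;		/* hw rx queue to attach to */
	reg->region_id  = 0;		/* pool region, registered separately */
	reg->rq_entries = 256;		/* refill (rqe) ring size */
	reg->cq_entries = 4096;		/* completion (cqe) ring size */

	/* On success the kernel fills reg->mmap_sz, reg->rq_off and
	 * reg->cq_off, which the next patch uses to mmap the rings. */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_ZC_RX_IFQ, reg, 1);
}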
From patchwork Tue Dec 19 21:03:45 2023
From: David Wei
Subject: [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers
Date: Tue, 19 Dec 2023 13:03:45 -0800
Message-Id: <20231219210357.4029713-9-dw@davidwei.uk>

From: David Wei

This patch adds mmap support for ifq rbuf rings. There are two rings and
a struct io_rbuf_ring that contains the head and tail pointers into each
ring.

Just like the io_uring SQ/CQ rings, userspace issues a single mmap call
using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. An opaque ptr
is returned to userspace, which is then expected to use the offsets
returned in the registration struct to get access to the head/tail and
rings.

Signed-off-by: David Wei
Reviewed-by: Jens Axboe
---
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/io_uring.c           |  5 +++++
 io_uring/zc_rx.c              | 19 ++++++++++++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024a6f79323b..839933e562e6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -428,6 +428,8 @@ enum {
 #define IORING_OFF_PBUF_RING		0x80000000ULL
 #define IORING_OFF_PBUF_SHIFT		16
 #define IORING_OFF_MMAP_MASK		0xf8000000ULL
+#define IORING_OFF_RBUF_RING		0x20000000ULL
+#define IORING_OFF_RBUF_SHIFT		16
 
 /*
  * Filled with the offset for mmap(2)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7fff01d57e9e..02d6d638bd65 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3516,6 +3516,11 @@ static void *io_uring_validate_mmap_request(struct file *file,
 			return ERR_PTR(-EINVAL);
 		break;
 		}
+	case IORING_OFF_RBUF_RING:
+		if (!ctx->ifq)
+			return ERR_PTR(-EINVAL);
+		ptr = ctx->ifq->ring;
+		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 5fc94cad5e3a..7e3e6f6d446b 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -61,6 +61,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 {
 	struct io_uring_zc_rx_ifq_reg reg;
 	struct io_zc_rx_ifq *ifq;
+	size_t ring_sz, rqes_sz, cqes_sz;
 	int ret;
 
 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
@@ -87,8 +88,24 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	ifq->rq_entries = reg.rq_entries;
 	ifq->cq_entries = reg.cq_entries;
 	ifq->if_rxq_id = reg.if_rxq_id;
-	ctx->ifq = ifq;
 
+	ring_sz = sizeof(struct io_rbuf_ring);
+	rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
+	cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries;
+	reg.mmap_sz = ring_sz + rqes_sz + cqes_sz;
+	reg.rq_off.rqes = ring_sz;
+	reg.cq_off.cqes = ring_sz + rqes_sz;
+	reg.rq_off.head = offsetof(struct io_rbuf_ring, rq.head);
+	reg.rq_off.tail = offsetof(struct io_rbuf_ring, rq.tail);
+	reg.cq_off.head = offsetof(struct io_rbuf_ring, cq.head);
+	reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail);
+
+	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	ctx->ifq = ifq;
 	return 0;
 err:
 	io_zc_rx_ifq_free(ifq);
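Continuing the hypothetical userspace sketch from the previous patch, the
returned offsets would be used roughly like this; all names outside the
uapi structs are placeholders:

/* Sketch only: mapping the shared rbuf rings and locating their parts. */
#include <sys/mman.h>

struct rbuf_rings {
	unsigned int *rq_head, *rq_tail;
	unsigned int *cq_head, *cq_tail;
	struct io_uring_rbuf_rqe *rqes;
	struct io_uring_rbuf_cqe *cqes;
};

static int map_rbuf_rings(int ring_fd, const struct io_uring_zc_rx_ifq_reg *reg,
			  struct rbuf_rings *r)
{
	char *ptr;

	ptr = mmap(NULL, reg->mmap_sz, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_RBUF_RING);
	if (ptr == MAP_FAILED)
		return -1;

	/* The kernel-provided offsets locate the head/tail words and the
	 * rqe/cqe arrays inside the single opaque mapping. */
	r->rq_head = (unsigned int *)(ptr + reg->rq_off.head);
	r->rq_tail = (unsigned int *)(ptr + reg->rq_off.tail);
	r->cq_head = (unsigned int *)(ptr + reg->cq_off.head);
	r->cq_tail = (unsigned int *)(ptr + reg->cq_off.tail);
	r->rqes = (struct io_uring_rbuf_rqe *)(ptr + reg->rq_off.rqes);
	r->cqes = (struct io_uring_rbuf_cqe *)(ptr + reg->cq_off.cqes);
	return 0;
}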
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command Date: Tue, 19 Dec 2023 13:03:46 -0800 Message-Id: <20231219210357.4029713-10-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: David Wei RFC ONLY, NOT FOR UPSTREAM This will be replaced with a separate ndo callback or some other mechanism in next patchset revisions. This patch adds a new XDP_SETUP_ZC_RX command that will be used in a later patch to enable or disable ZC RX for a specific RX queue. We are open to suggestions on a better way of doing this. Google's TCP devmem proposal sets up struct netdev_rx_queue which persists across device reset, then expects userspace to use an out-of-band method (e.g. ethtool) to reset the device, thus re-filling a hardware Rx queue. Signed-off-by: David Wei --- include/linux/netdevice.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a4bdc35c7d6f..5b4df0b6a6c0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1097,6 +1097,7 @@ enum bpf_netdev_command { BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, XDP_SETUP_XSK_POOL, + XDP_SETUP_ZC_RX, }; struct bpf_prog_offload_ops; @@ -1135,6 +1136,11 @@ struct netdev_bpf { struct xsk_buff_pool *pool; u16 queue_id; } xsk; + /* XDP_SETUP_ZC_RX */ + struct { + struct io_zc_rx_ifq *ifq; + u16 queue_id; + } zc_rx; }; }; From patchwork Tue Dec 19 21:03:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499100 Received: from mail-pj1-f48.google.com (mail-pj1-f48.google.com [209.85.216.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E29913D0BD for ; Tue, 19 Dec 2023 21:04:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="gQnYYc+4" Received: by mail-pj1-f48.google.com with SMTP id 98e67ed59e1d1-28b4d49293fso2082327a91.2 for ; Tue, 19 Dec 2023 13:04:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019854; x=1703624654; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=wAMvX5PF2J42C1cn2NekqDrZ68M45niiPpeKzGsshQo=; b=gQnYYc+4cUb/LyvXOAYXV4CcWCkXGxnl1WiRbq+UPsXTpqZhsq3ebvzMqkHRlvHdXG CLGIPmwqIN0hCAHiYhYI+GoDZMfRyfO2IEqSvLktGi1NYoyBnD8AviV1J7iKGN5aTWMr 9iJNSzzaOiJnC1BJuFDYezq88uHMftcv/ERbai4AO1KE8zz9bni9ni9vuToxqw2824cz Z9sM9Vqk5fD5J1sHLTq9iVCu8t0VhM1Gwsyu/JB9+9KLUgC/YVfudx4hdjr50OYuQoki kwVcIuP6tnBNmeW03YEeXH8qhq7JuhTTdglim1i1JJWXNQxUN8NV6JkneykYm2DAefG0 Akyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019854; x=1703624654; 
From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq Date: Tue, 19 Dec 2023 13:03:47 -0800 Message-Id: <20231219210357.4029713-11-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> X-Patchwork-State: RFC From: David Wei
This patch sets up ZC for an Rx queue in a net device when an ifq is registered with io_uring. The Rx queue is specified in the registration struct. For now since there is only one ifq, its destruction is implicit during io_uring cleanup.
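On the driver side, the enable/disable request arrives through the ndo_bpf callback using the XDP_SETUP_ZC_RX command added in the previous patch. A minimal sketch of how a driver might dispatch it is below; mydrv_bpf() and mydrv_setup_zc_rx() are hypothetical names used only for illustration, and other bpf_netdev_command cases are elided:

	static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
	{
		switch (bpf->command) {
		case XDP_SETUP_ZC_RX:
			/* a NULL ifq means "tear down zero-copy on this queue" */
			return mydrv_setup_zc_rx(dev, bpf->zc_rx.ifq,
						 bpf->zc_rx.queue_id);
		default:
			return -EINVAL;
		}
	}

The io_open_zc_rxq()/io_close_zc_rxq() helpers added in this patch drive exactly this path, passing the ifq pointer on open and NULL on close.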
Signed-off-by: David Wei --- io_uring/zc_rx.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 7e3e6f6d446b..259e08a34ab2 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -4,6 +4,7 @@ #include #include #include +#include #include @@ -11,6 +12,34 @@ #include "kbuf.h" #include "zc_rx.h" +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); + +static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, + u16 queue_id) +{ + struct netdev_bpf cmd; + bpf_op_t ndo_bpf; + + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return -EINVAL; + + cmd.command = XDP_SETUP_ZC_RX; + cmd.zc_rx.ifq = ifq; + cmd.zc_rx.queue_id = queue_id; + return ndo_bpf(dev, &cmd); +} + +static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id); +} + +static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); +} + static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, struct io_uring_zc_rx_ifq_reg *reg) { @@ -52,6 +81,10 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { + if (ifq->if_rxq_id != -1) + io_close_zc_rxq(ifq); + if (ifq->dev) + dev_put(ifq->dev); io_free_rbuf_ring(ifq); kfree(ifq); } @@ -77,18 +110,25 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq) return -ENOMEM; - /* TODO: initialise network interface */ - ret = io_allocate_rbuf_ring(ifq, ®); if (ret) goto err; + ret = -ENODEV; + ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx); + if (!ifq->dev) + goto err; + /* TODO: map zc region and initialise zc pool */ ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; ifq->if_rxq_id = reg.if_rxq_id; + ret = io_open_zc_rxq(ifq); + if (ret) + goto err; + ring_sz = sizeof(struct io_rbuf_ring); rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; @@ -101,6 +141,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); if (copy_to_user(arg, ®, sizeof(reg))) { + io_close_zc_rxq(ifq); ret = -EFAULT; goto err; } From patchwork Tue Dec 19 21:03:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499102 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f171.google.com (mail-pl1-f171.google.com [209.85.214.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0060F3D0CB for ; Tue, 19 Dec 2023 21:04:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="ndC5qhFj" Received: by mail-pl1-f171.google.com with SMTP id d9443c01a7336-1d3536cd414so39180875ad.2 for ; Tue, 19 Dec 2023 13:04:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019855; x=1703624655; darn=vger.kernel.org; 
From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration Date: Tue, 19 Dec 2023 13:03:48 -0800 Message-Id: <20231219210357.4029713-12-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov
We want userspace to explicitly list all sockets it'll be using with a particular zc ifq, so that we can properly configure them, e.g. bind the sockets to the corresponding interface and set up steering rules. We'll also need it to better control ifq lifetime and for termination / unregistration purposes.
TODO: remove zc_rx_idx from struct socket, which will fix the zc_rx_idx token init races and the re-registration bug.
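From userspace this is just another io_uring_register(2) opcode. A rough sketch, assuming the uapi additions from this patch are in linux/io_uring.h; register_zc_rx_sock() is an illustrative helper and there is no liburing wrapper for it yet:

	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/io_uring.h>

	/* Register an already-connected TCP socket with ifq 0 of this ring. */
	static int register_zc_rx_sock(int ring_fd, int sock_fd)
	{
		struct io_uring_zc_rx_sock_reg reg;

		memset(&reg, 0, sizeof(reg));	/* __resv[] must be zero */
		reg.sockfd = sock_fd;
		reg.zc_rx_ifq_idx = 0;		/* only a single ifq is supported for now */

		/* nr_args must be 1, see __io_uring_register() */
		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_ZC_RX_SOCK, &reg, 1);
	}

Registration has to happen before issuing zero-copy receives on that socket, since the receive path looks up the socket's zc_rx_idx token to find the ifq.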
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/net.h | 2 + include/uapi/linux/io_uring.h | 7 +++ io_uring/io_uring.c | 6 +++ io_uring/net.c | 20 ++++++++ io_uring/zc_rx.c | 89 +++++++++++++++++++++++++++++++++-- io_uring/zc_rx.h | 17 +++++++ net/socket.c | 1 + 7 files changed, 139 insertions(+), 3 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index c9b4a63791a4..867061a91d30 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -126,6 +126,8 @@ struct socket { const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. */ struct socket_wq wq; + + unsigned zc_rx_idx; }; /* diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 839933e562e6..f4ba58bce3bd 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -562,6 +562,7 @@ enum { /* register a network interface queue for zerocopy */ IORING_REGISTER_ZC_RX_IFQ = 26, + IORING_REGISTER_ZC_RX_SOCK = 27, /* this goes last */ IORING_REGISTER_LAST, @@ -803,6 +804,12 @@ struct io_uring_zc_rx_ifq_reg { struct io_rbuf_cqring_offsets cq_off; }; +struct io_uring_zc_rx_sock_reg { + __u32 sockfd; + __u32 zc_rx_ifq_idx; + __u32 __resv[2]; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 02d6d638bd65..47859599469d 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -4627,6 +4627,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_zc_rx_ifq(ctx, arg); break; + case IORING_REGISTER_ZC_RX_SOCK: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_sock(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/net.c b/io_uring/net.c index 75d494dad7e2..454ba301ae6b 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -16,6 +16,7 @@ #include "net.h" #include "notif.h" #include "rsrc.h" +#include "zc_rx.h" #if defined(CONFIG_NET) struct io_shutdown { @@ -955,6 +956,25 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +static __maybe_unused +struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) +{ + unsigned token = READ_ONCE(sock->zc_rx_idx); + unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; + unsigned sock_idx = token & IO_ZC_IFQ_IDX_MASK; + struct io_zc_rx_ifq *ifq; + + if (ifq_idx) + return NULL; + ifq = req->ctx->ifq; + if (!ifq || sock_idx >= ifq->nr_sockets) + return NULL; + if (ifq->sockets[sock_idx] != req->file) + return NULL; + return ifq; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 259e08a34ab2..06e2c54d3f3d 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -11,6 +11,7 @@ #include "io_uring.h" #include "kbuf.h" #include "zc_rx.h" +#include "rsrc.h" typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); @@ -79,10 +80,31 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) return ifq; } -static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +static void io_shutdown_ifq(struct io_zc_rx_ifq *ifq) { - if (ifq->if_rxq_id != -1) + int i; + + if (!ifq) + return; + + for (i = 0; i < ifq->nr_sockets; i++) { + if (ifq->sockets[i]) { + fput(ifq->sockets[i]); + ifq->sockets[i] = NULL; + } + } + ifq->nr_sockets = 0; + + if (ifq->if_rxq_id != -1) { io_close_zc_rxq(ifq); + ifq->if_rxq_id = -1; + } +} + +static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +{ + 
io_shutdown_ifq(ifq); + if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -141,7 +163,6 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); if (copy_to_user(arg, ®, sizeof(reg))) { - io_close_zc_rxq(ifq); ret = -EFAULT; goto err; } @@ -162,6 +183,8 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) if (!ifq) return; + WARN_ON_ONCE(ifq->nr_sockets); + ctx->ifq = NULL; io_zc_rx_ifq_free(ifq); } @@ -169,6 +192,66 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx) { lockdep_assert_held(&ctx->uring_lock); + + io_shutdown_ifq(ctx->ifq); +} + +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + struct io_uring_zc_rx_sock_reg sr; + struct io_zc_rx_ifq *ifq; + struct socket *sock; + struct file *file; + int ret = -EEXIST; + int idx; + + if (copy_from_user(&sr, arg, sizeof(sr))) + return -EFAULT; + if (sr.__resv[0] || sr.__resv[1]) + return -EINVAL; + if (sr.zc_rx_ifq_idx != 0 || !ctx->ifq) + return -EINVAL; + + ifq = ctx->ifq; + if (ifq->nr_sockets >= ARRAY_SIZE(ifq->sockets)) + return -EINVAL; + + BUILD_BUG_ON(ARRAY_SIZE(ifq->sockets) > IO_ZC_IFQ_IDX_MASK); + + file = fget(sr.sockfd); + if (!file) + return -EBADF; + + if (io_file_need_scm(file)) { + fput(file); + return -EBADF; + } + + sock = sock_from_file(file); + if (unlikely(!sock || !sock->sk)) { + fput(file); + return -ENOTSOCK; + } + + idx = ifq->nr_sockets; + lock_sock(sock->sk); + if (!sock->zc_rx_idx) { + unsigned token; + + token = idx + (sr.zc_rx_ifq_idx << IO_ZC_IFQ_IDX_OFFSET); + WRITE_ONCE(sock->zc_rx_idx, token); + ret = 0; + } + release_sock(sock->sk); + + if (ret) { + fput(file); + return -EINVAL; + } + ifq->sockets[idx] = file; + ifq->nr_sockets++; + return 0; } #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index aab57c1a4c5d..9257dda77e92 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -2,6 +2,13 @@ #ifndef IOU_ZC_RX_H #define IOU_ZC_RX_H +#include +#include + +#define IO_ZC_MAX_IFQ_SOCKETS 16 +#define IO_ZC_IFQ_IDX_OFFSET 16 +#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; @@ -13,6 +20,9 @@ struct io_zc_rx_ifq { /* hw rx descriptor ring id */ u32 if_rxq_id; + + unsigned nr_sockets; + struct file *sockets[IO_ZC_MAX_IFQ_SOCKETS]; }; #if defined(CONFIG_PAGE_POOL) @@ -20,6 +30,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx); void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx); +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg); #else static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg) @@ -32,6 +44,11 @@ static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx) static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx) { } +static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + return -EOPNOTSUPP; +} #endif #endif diff --git a/net/socket.c b/net/socket.c index d75246450a3c..a9cef870309a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -637,6 +637,7 @@ struct socket *sock_alloc(void) sock = SOCKET_I(inode); + sock->zc_rx_idx = 0; inode->i_ino = get_next_ino(); inode->i_mode = S_IFSOCK | S_IRWXUGO; inode->i_uid = current_fsuid(); From patchwork Tue Dec 19 
21:03:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499103 Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F1B253A8EE for ; Tue, 19 Dec 2023 21:04:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="QHalPbvH" Received: by mail-pl1-f169.google.com with SMTP id d9443c01a7336-1d3ac87553bso15405585ad.3 for ; Tue, 19 Dec 2023 13:04:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019856; x=1703624656; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tJ8hPjWHfatZ/BzlXTQEl7ypf6OdinK6dQcdfVsR+fI=; b=QHalPbvH6VAGFB+ML/YOP8G66Y1pJOMJ7PVotVN4kS9dVIi9GujUUEApSeSm/i+QjS ZyI0Ejrw0PxFYXM6lwHRKlz88kYi+JIXzg+JKp8Y6fKNccsYHHG5HfcwvbuP1arjpYsq Xlrpvg4hBF2h8K3ak0lKLYxC5WRPZiB/pCFrxpjLLoNylbzUyjoyyBpU6hdV6Vlew38r yutxbLnF9S9BqJOQ06JCLB4UnfCZ+C/ZaD3n7M60VxBg0AN+41mwRoHP2SMSZBU/xEKA 8V+SuB/OZpfJWGSuRXeJcFomQ72iw1Z0bt1UfyWmTPRmkdjlIGb0VzqjZ55g/lF4vIy1 9VIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019856; x=1703624656; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tJ8hPjWHfatZ/BzlXTQEl7ypf6OdinK6dQcdfVsR+fI=; b=GYhyV14DPNyRWPmPOjC/JGuEk9dlSf+9sSuhQKQp7KThUx+z6+vlza+R42JuX1PEUw MUrTqQupz/jrRf0iVYpTlSQVzGVOjmptkypvWXyvV1CGKwaLlXPoVMkW6yP3Q3QFKwWD jp7Dix7QE769PcNp7osGiu0VA1UMfQ1ctlVpeKWiMriqAZ1iM8VHBuT0Pb5s9M4tvias hyar3OygYwCnv0KN4l2fYNTByDeKS3xEsZqp9qyghB0zpRMNamw/YCGJDa/oTI+pH8Lo til91q6ishisstNhJhY8yihsX06vlLY3QqcO6xK65tXzymkdBbpnGhrJGTXub5JcZ9AV aEPg== X-Gm-Message-State: AOJu0Yx6Ypna8KjqMpjJcT1l3ewbQ0d77kZmM9Yf5eFI2JPakAkfgPDO IJvJKJXBESRuqH+BLN2VK7gY9A== X-Google-Smtp-Source: AGHT+IHvfTVtAhN39iKMzYo0U58+EnVbhBRtZTa0KHZT79kAWK1dvewUF2g8Zu6YRl+BF77iiu/Htw== X-Received: by 2002:a17:90a:a40b:b0:28b:440f:766d with SMTP id y11-20020a17090aa40b00b0028b440f766dmr2611424pjp.90.1703019856226; Tue, 19 Dec 2023 13:04:16 -0800 (PST) Received: from localhost (fwdproxy-prn-020.fbsv.net. [2a03:2880:ff:14::face:b00c]) by smtp.gmail.com with ESMTPSA id u12-20020a17090a890c00b0028bbd30172csm1965513pjn.56.2023.12.19.13.04.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:15 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 12/20] io_uring: add ZC buf and pool Date: Tue, 19 Dec 2023 13:03:49 -0800 Message-Id: <20231219210357.4029713-13-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-State: RFC From: David Wei [TODO: REVIEW COMMIT MESSAGE] This patch adds two objects: * Zero copy buffer representation, holding a page, its mapped dma_addr, and a refcount for lifetime management. * Zero copy pool, spiritually similar to page pool, that holds ZC bufs and hands them out to net devices. Pool regions are registered w/ io_uring using the registered buffer API, with a 1:1 mapping between region and nr_iovec in io_uring_register_buffers. This does the heavy lifting of pinning and chunking into bvecs into a struct io_mapped_ubuf for us. For now as there is only one pool region per ifq, there is no separate API for adding/removing regions yet and it is mapped implicitly during ifq registration. Signed-off-by: David Wei --- include/linux/io_uring/net.h | 8 +++ io_uring/zc_rx.c | 135 ++++++++++++++++++++++++++++++++++- io_uring/zc_rx.h | 15 ++++ 3 files changed, 157 insertions(+), 1 deletion(-) diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h index b58f39fed4d5..d994d26116d0 100644 --- a/include/linux/io_uring/net.h +++ b/include/linux/io_uring/net.h @@ -2,8 +2,16 @@ #ifndef _LINUX_IO_URING_NET_H #define _LINUX_IO_URING_NET_H +#include + struct io_uring_cmd; +struct io_zc_rx_buf { + struct page_pool_iov ppiov; + struct page *page; + dma_addr_t dma; +}; + #if defined(CONFIG_IO_URING) int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 06e2c54d3f3d..1e656b481725 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -5,6 +5,7 @@ #include #include #include +#include #include @@ -15,6 +16,11 @@ typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); +static inline struct device *netdev2dev(struct net_device *dev) +{ + return dev->dev.parent; +} + static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, u16 queue_id) { @@ -67,6 +73,129 @@ static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq) folio_put(virt_to_folio(ifq->ring)); } +static int io_zc_rx_init_buf(struct device *dev, struct page *page, u16 pool_id, + u32 pgid, struct io_zc_rx_buf *buf) +{ + dma_addr_t addr = 0; + + /* Skip dma setup for devices that don't do any DMA transfers */ + if (dev) { + addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + if (dma_mapping_error(dev, addr)) + return -ENOMEM; + } + + buf->dma = addr; + buf->page = page; + refcount_set(&buf->ppiov.refcount, 0); + buf->ppiov.owner = NULL; + buf->ppiov.pp = NULL; + get_page(page); + return 0; +} + +static void io_zc_rx_free_buf(struct device *dev, struct io_zc_rx_buf *buf) +{ + struct page *page = buf->page; + + if (dev) + dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + put_page(page); +} + +static int io_zc_rx_map_pool(struct io_zc_rx_pool *pool, + struct io_mapped_ubuf *imu, + struct device *dev) +{ + struct io_zc_rx_buf *buf; + struct page *page; + int i, ret; + + for (i = 0; i < imu->nr_bvecs; i++) { + page = 
imu->bvec[i].bv_page; + buf = &pool->bufs[i]; + ret = io_zc_rx_init_buf(dev, page, pool->pool_id, i, buf); + if (ret) + goto err; + + pool->freelist[i] = i; + } + + pool->free_count = imu->nr_bvecs; + return 0; +err: + while (i--) { + buf = &pool->bufs[i]; + io_zc_rx_free_buf(dev, buf); + } + return ret; +} + +static int io_zc_rx_create_pool(struct io_ring_ctx *ctx, + struct io_zc_rx_ifq *ifq, + u16 id) +{ + struct device *dev = netdev2dev(ifq->dev); + struct io_mapped_ubuf *imu; + struct io_zc_rx_pool *pool; + int nr_pages; + int ret; + + if (ifq->pool) + return -EFAULT; + + if (unlikely(id >= ctx->nr_user_bufs)) + return -EFAULT; + id = array_index_nospec(id, ctx->nr_user_bufs); + imu = ctx->user_bufs[id]; + if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK) + return -EFAULT; + + ret = -ENOMEM; + nr_pages = imu->nr_bvecs; + pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL); + if (!pool) + goto err; + + pool->bufs = kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL); + if (!pool->bufs) + goto err_buf; + + ret = io_zc_rx_map_pool(pool, imu, dev); + if (ret) + goto err_map; + + pool->ifq = ifq; + pool->pool_id = id; + pool->nr_bufs = nr_pages; + spin_lock_init(&pool->freelist_lock); + ifq->pool = pool; + return 0; +err_map: + kvfree(pool->bufs); +err_buf: + kvfree(pool); +err: + return ret; +} + +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + struct device *dev = netdev2dev(pool->ifq->dev); + struct io_zc_rx_buf *buf; + + for (int i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + io_zc_rx_free_buf(dev, buf); + } + kvfree(pool->bufs); + kvfree(pool); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -105,6 +234,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { io_shutdown_ifq(ifq); + if (ifq->pool) + io_zc_rx_destroy_pool(ifq->pool); if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -141,7 +272,9 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq->dev) goto err; - /* TODO: map zc region and initialise zc pool */ + ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id); + if (ret) + goto err; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 9257dda77e92..af1d865525d2 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -3,15 +3,30 @@ #define IOU_ZC_RX_H #include +#include #include #define IO_ZC_MAX_IFQ_SOCKETS 16 #define IO_ZC_IFQ_IDX_OFFSET 16 #define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) +struct io_zc_rx_pool { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *bufs; + u32 nr_bufs; + u16 pool_id; + + /* freelist */ + spinlock_t freelist_lock; + u32 free_count; + u32 freelist[]; +}; + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; + struct io_zc_rx_pool *pool; + struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; From patchwork Tue Dec 19 21:03:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499104 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB0963D561 for ; Tue, 19 Dec 2023 21:04:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk 
From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx Date: Tue, 19 Dec 2023 13:03:50 -0800 Message-Id: <20231219210357.4029713-14-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> X-Patchwork-State: RFC From: Pavel Begunkov
We're adding a new pp memory provider to implement io_uring zerocopy receive. It'll be "registered" in pp and used in later patches. The typical life cycle of a buffer goes as follows: first it's allocated to a driver with the initial refcount set to 1. The driver fills it with data, puts it into an skb and passes it down the stack, where it gets queued up to a socket.
Later, a zc io_uring request will be receiving data from the socket from a task context. At that point io_uring will tell the userspace that this buffer has some data by posting an appropriate completion. It'll also elevating the refcount by IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading the data. When the userspace is done with the buffer it should return it back to io_uring by adding an entry to the buffer refill ring. When necessary io_uring will poll the refill ring, compare references including IO_ZC_RX_UREF and reuse the buffer. Initally, all buffers are placed in a spinlock protected ->freelist. It's a slow path stash, where buffers are considered to be unallocated and not exposed to core page pool. On allocation, pp will first try all its caches, and the ->alloc_pages callback if everything else failed. The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill ring. The consumption from the ring is always done in the attached napi context, so no additional synchronisation required. If that fails we'll be getting buffers from the ->freelist. Note: only ->freelist are considered unallocated for page pool, so we only add pages_state_hold_cnt when allocating from there. Subsequently, as page_pool_return_page() and others bump the ->pages_state_release_cnt counter, io_pp_zc_release_page() can only use ->freelist, which is not a problem as it's not a slow path. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring/net.h | 5 + io_uring/zc_rx.c | 204 +++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 6 ++ 3 files changed, 215 insertions(+) diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h index d994d26116d0..13244ae5fc4a 100644 --- a/include/linux/io_uring/net.h +++ b/include/linux/io_uring/net.h @@ -13,6 +13,11 @@ struct io_zc_rx_buf { }; #if defined(CONFIG_IO_URING) + +#if defined(CONFIG_PAGE_POOL) +extern const struct pp_memory_provider_ops io_uring_pp_zc_ops; +#endif + int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags); #else diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 1e656b481725..ff1dac24ac40 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include @@ -387,4 +388,207 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx, return 0; } +static inline struct io_zc_rx_buf *io_iov_to_buf(struct page_pool_iov *iov) +{ + return container_of(iov, struct io_zc_rx_buf, ppiov); +} + +static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + return buf - pool->bufs; +} + +static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) +{ + refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount); +} + +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) +{ + if (page_pool_iov_refcount(&buf->ppiov) < IO_ZC_RX_UREF) + return false; + + return page_pool_iov_sub_and_test(&buf->ppiov, IO_ZC_RX_UREF); +} + +static inline struct page *io_zc_buf_to_pp_page(struct io_zc_rx_buf *buf) +{ + return page_pool_mangle_ppiov(&buf->ppiov); +} + +static inline void io_zc_add_pp_cache(struct page_pool *pp, + struct io_zc_rx_buf *buf) +{ + refcount_set(&buf->ppiov.refcount, 1); + pp->alloc.cache[pp->alloc.count++] = io_zc_buf_to_pp_page(buf); +} + +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + u32 entries; + + entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head; + return min(entries, ifq->rq_entries); +} 
+ +static void io_zc_rx_ring_refill(struct page_pool *pp, + struct io_zc_rx_ifq *ifq) +{ + unsigned int entries = io_zc_rx_rqring_entries(ifq); + unsigned int mask = ifq->rq_entries - 1; + struct io_zc_rx_pool *pool = ifq->pool; + + if (unlikely(!entries)) + return; + + while (entries--) { + unsigned int rq_idx = ifq->cached_rq_head++ & mask; + struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx]; + u32 pgid = rqe->off / PAGE_SIZE; + struct io_zc_rx_buf *buf = &pool->bufs[pgid]; + + if (!io_zc_rx_put_buf_uref(buf)) + continue; + io_zc_add_pp_cache(pp, buf); + if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL) + break; + } + smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head); +} + +static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq) +{ + struct io_zc_rx_pool *pool = ifq->pool; + + spin_lock_bh(&pool->freelist_lock); + while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) { + struct io_zc_rx_buf *buf; + u32 pgid; + + pgid = pool->freelist[--pool->free_count]; + buf = &pool->bufs[pgid]; + + io_zc_add_pp_cache(pp, buf); + pp->pages_state_hold_cnt++; + trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf), + pp->pages_state_hold_cnt); + } + spin_unlock_bh(&pool->freelist_lock); +} + +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + spin_lock_bh(&pool->freelist_lock); + pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf); + spin_unlock_bh(&pool->freelist_lock); +} + +static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + + /* pp should already be ensuring that */ + if (unlikely(pp->alloc.count)) + goto out_return; + + io_zc_rx_ring_refill(pp, ifq); + if (likely(pp->alloc.count)) + goto out_return; + + io_zc_rx_refill_slow(pp, ifq); + if (!pp->alloc.count) + return NULL; +out_return: + return pp->alloc.cache[--pp->alloc.count]; +} + +static bool io_pp_zc_release_page(struct page_pool *pp, struct page *page) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct page_pool_iov *ppiov; + + if (WARN_ON_ONCE(!page_is_page_pool_iov(page))) + return false; + + ppiov = page_to_page_pool_iov(page); + + if (!page_pool_iov_sub_and_test(ppiov, 1)) + return false; + + io_zc_rx_recycle_buf(ifq->pool, io_iov_to_buf(ppiov)); + return true; +} + +static void io_pp_zc_scrub(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct io_zc_rx_pool *pool = ifq->pool; + struct io_zc_rx_buf *buf; + int i; + + for (i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + + if (io_zc_rx_put_buf_uref(buf)) { + /* just return it to the page pool, it'll clean it up */ + refcount_set(&buf->ppiov.refcount, 1); + page_pool_iov_put_many(&buf->ppiov, 1); + } + } +} + +static void io_zc_rx_init_pool(struct io_zc_rx_pool *pool, + struct page_pool *pp) +{ + struct io_zc_rx_buf *buf; + int i; + + for (i = 0; i < pool->nr_bufs; i++) { + buf = &pool->bufs[i]; + buf->ppiov.pp = pp; + } +} + +static int io_pp_zc_init(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + + if (!ifq) + return -EINVAL; + if (pp->p.order != 0) + return -EINVAL; + if (!pp->p.napi) + return -EINVAL; + + io_zc_rx_init_pool(ifq->pool, pp); + percpu_ref_get(&ifq->ctx->refs); + ifq->pp = pp; + return 0; +} + +static void io_pp_zc_destroy(struct page_pool *pp) +{ + struct io_zc_rx_ifq *ifq = pp->mp_priv; + struct io_zc_rx_pool *pool = ifq->pool; + + ifq->pp = NULL; + + if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs)) + return; + percpu_ref_put(&ifq->ctx->refs); +} 
+ +const struct pp_memory_provider_ops io_uring_pp_zc_ops = { + .alloc_pages = io_pp_zc_alloc_pages, + .release_page = io_pp_zc_release_page, + .init = io_pp_zc_init, + .destroy = io_pp_zc_destroy, + .scrub = io_pp_zc_scrub, +}; +EXPORT_SYMBOL(io_uring_pp_zc_ops); + + #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index af1d865525d2..00d864700c67 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -10,6 +10,9 @@ #define IO_ZC_IFQ_IDX_OFFSET 16 #define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) +#define IO_ZC_RX_UREF 0x10000 +#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1) + struct io_zc_rx_pool { struct io_zc_rx_ifq *ifq; struct io_zc_rx_buf *bufs; @@ -26,12 +29,15 @@ struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; struct io_zc_rx_pool *pool; + struct page_pool *pp; struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; u32 rq_entries; u32 cq_entries; + u32 cached_rq_head; + u32 cached_cq_tail; /* hw rx descriptor ring id */ u32 if_rxq_id; From patchwork Tue Dec 19 21:03:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499105 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f172.google.com (mail-pl1-f172.google.com [209.85.214.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DDDA03EA78 for ; Tue, 19 Dec 2023 21:04:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="oux1MZOq" Received: by mail-pl1-f172.google.com with SMTP id d9443c01a7336-1d3b5f9860bso13764695ad.3 for ; Tue, 19 Dec 2023 13:04:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019858; x=1703624658; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=+R3fraE1AUJLiBUl+irWqkEYdm8NtMUbP7MGr9N7dq0=; b=oux1MZOq0AzQgkcRmTCoj873Fn++FIU+nsu6uzWncnnZnBKhfjnsVhQmhuanI8ANrp 441FQdX/PIBwqAwSTclUPmw51AHBolPPqVidjmbvYdV6jpbKyhLeOMAD2f+rb0XPFRXD 1UF59ZJDEmS68G/aZQalCwtyuIb07xxrw3gaBQDCEMk6POItlGSjVr7inj5wFKxXH3En 82J3w82CPGNwPSykYLF5NjH2kMYS9vBQhV8LsCPTez/TQ2wdqRHuF33hoG0Dp0qJp8NE UhOikBtYWJdeeZPiIuI6Cn0SDmXR5uEtFVBRFpUUepmYOQYQwJd7wSYsFnLCV0/qd1lv 2Hfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019858; x=1703624658; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=+R3fraE1AUJLiBUl+irWqkEYdm8NtMUbP7MGr9N7dq0=; b=bLxAKatA5dPs5QodFdlw0my4LUsDNw3KQ/fl71pmjA4XlEPoZrWHhZX+cAG2MUyd4j 3h54hp3vUfGLqEs3s+0XwDuJP3X5mU8a9XE4mv8L0LwnyAADw1GQRGVOOXckls1Zq+pM aqOXIyRbxGwD3ejE1ZfTXzDNn9GUaglXmCpJinLJZvi7rCdPhIZS7wECStIFrSlg8je6 AHrJ/cqW6D/gYFwnXgRK6PpG40XMdKELXRNMH91m+UlWARoKY2mDewz6Hc0i9s37kkfy 8YinxoKiwjhWzzRRpyef3fyrQyV/h2sJNOwD855RUO1chZxKkS+y0Nl8lI52Pq0Rx5Zu f4yw== X-Gm-Message-State: AOJu0Yx+U+asfFSvSMsoRUwV1QU0SpEg1HbvIEbqaXdkeKxz2ukeX169 
uvu68I3zfLquR//u6ikRSC/Shw== X-Google-Smtp-Source: AGHT+IHv2Ih+zIMLBLTSl+oRkb/qmBFcX3CARWtV0GirJ8tYSwTCNeW0vzrX5FsoKxAiegCWCFM5LA== X-Received: by 2002:a17:902:704c:b0:1d0:6ffd:ae22 with SMTP id h12-20020a170902704c00b001d06ffdae22mr9681493plt.137.1703019858245; Tue, 19 Dec 2023 13:04:18 -0800 (PST) Received: from localhost (fwdproxy-prn-002.fbsv.net. [2a03:2880:ff:2::face:b00c]) by smtp.gmail.com with ESMTPSA id u2-20020a170902e80200b001acae9734c0sm4100365plg.266.2023.12.19.13.04.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:17 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider Date: Tue, 19 Dec 2023 13:03:51 -0800 Message-Id: <20231219210357.4029713-15-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov Allow creating a special io_uring pp memory providers, which will be for implementing io_uring zerocopy receive. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/page_pool/types.h | 1 + net/core/page_pool.c | 6 ++++++ 2 files changed, 7 insertions(+) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index fd846cac9fb6..f54ee759e362 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -129,6 +129,7 @@ struct mem_provider; enum pp_memory_provider_type { __PP_MP_NONE, /* Use system allocator directly */ PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */ + PP_MP_IOU_ZCRX, /* io_uring zerocopy receive provider */ }; struct pp_memory_provider_ops { diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 9e3073d61a97..ebf5ff009d9d 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -21,6 +21,7 @@ #include #include #include +#include #include @@ -242,6 +243,11 @@ static int page_pool_init(struct page_pool *pool, case PP_MP_DMABUF_DEVMEM: pool->mp_ops = &dmabuf_devmem_ops; break; +#if defined(CONFIG_IO_URING) + case PP_MP_IOU_ZCRX: + pool->mp_ops = &io_uring_pp_zc_ops; + break; +#endif default: err = -EINVAL; goto free_ptr_ring; From patchwork Tue Dec 19 21:03:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499106 Received: from mail-pg1-f173.google.com (mail-pg1-f173.google.com [209.85.215.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DBE0C3C08E for ; Tue, 19 Dec 2023 21:04:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="XfHxmjbw" Received: by mail-pg1-f173.google.com with SMTP id 41be03b00d2f7-5c6ce4dffb5so1755072a12.0 for ; Tue, 19 Dec 2023 13:04:19 -0800 (PST) 
From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 15/20] io_uring: add io_recvzc request Date: Tue, 19 Dec 2023 13:03:52 -0800 Message-Id: <20231219210357.4029713-16-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> X-Patchwork-State: RFC From: David Wei
This patch adds an io_uring opcode IORING_OP_RECV_ZC for doing ZC reads from a socket that is set up for ZC Rx. The request reads skbs from a socket whose page frags are tagged with a magic cookie in their page private field. For each frag, an entry is written into the ifq rbuf completion ring, and the total number of bytes read is returned to the user as an io_uring completion event. Multishot requests work. There is no need to specify provided buffers as data is returned in the ifq rbuf completion rings. Userspace is expected to look into the ifq rbuf completion ring when it receives an io_uring completion event.
The addr3 field is used to encode the parameters in the following format: addr3 = (readlen << 32) | ifq_id; readlen is the max amount of data to read from the socket, and ifq_id is the interface queue id, of which currently only 0 is supported.
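For reference, a minimal sketch of what issuing such a request could look like from userspace. This assumes the uapi header from this series; liburing is only used for ring/SQE plumbing, queue_recv_zc() is an illustrative helper rather than an existing liburing API, and the ring is assumed to have been set up with IORING_SETUP_DEFER_TASKRUN with the socket already registered against the ifq:

	#include <string.h>
	#include <liburing.h>

	static int queue_recv_zc(struct io_uring *ring, int sock_fd,
				 unsigned long long readlen)
	{
		struct io_uring_sqe *sqe;

		sqe = io_uring_get_sqe(ring);
		if (!sqe)
			return -EBUSY;

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_RECV_ZC;
		sqe->fd = sock_fd;
		sqe->ioprio = IORING_RECV_MULTISHOT;	/* stay armed across completions */
		/* high 32 bits: readlen; low 16 bits: ifq id, must be 0 for now */
		sqe->addr3 = readlen << 32;

		return io_uring_submit(ring);
	}

The completion's res field carries the number of bytes received; the data itself is never copied and is instead described by entries in the ifq rbuf completion ring mmap'ed at registration time.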
Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 1 + io_uring/net.c | 119 ++++++++++++++++- io_uring/opdef.c | 16 +++ io_uring/zc_rx.c | 240 +++++++++++++++++++++++++++++++++- io_uring/zc_rx.h | 5 + 5 files changed, 375 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f4ba58bce3bd..f57f394744fe 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -253,6 +253,7 @@ enum io_uring_op { IORING_OP_FUTEX_WAIT, IORING_OP_FUTEX_WAKE, IORING_OP_FUTEX_WAITV, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/net.c b/io_uring/net.c index 454ba301ae6b..7a2aadf6962c 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -71,6 +71,16 @@ struct io_sr_msg { struct io_kiocb *notif; }; +struct io_recvzc { + struct file *file; + unsigned len; + unsigned done_io; + unsigned msg_flags; + u16 flags; + + u32 datalen; +}; + static inline bool io_check_multishot(struct io_kiocb *req, unsigned int issue_flags) { @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, unsigned int cflags; cflags = io_put_kbuf(req, issue_flags); - if (msg->msg_inq && msg->msg_inq != -1) + if (msg && msg->msg_inq && msg->msg_inq != -1) cflags |= IORING_CQE_F_SOCK_NONEMPTY; if (!(req->flags & REQ_F_APOLL_MULTISHOT)) { @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, io_recv_prep_retry(req); /* Known not-empty or unknown state, retry */ if (cflags & IORING_CQE_F_SOCK_NONEMPTY || - msg->msg_inq == -1) + (msg && msg->msg_inq == -1)) return false; if (issue_flags & IO_URING_F_MULTISHOT) *ret = IOU_ISSUE_SKIP_COMPLETE; @@ -956,9 +966,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } -static __maybe_unused -struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, - struct socket *sock) +static struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) { unsigned token = READ_ONCE(sock->zc_rx_idx); unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; @@ -975,6 +984,106 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, return ifq; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + u64 recvzc_cmd; + + recvzc_cmd = READ_ONCE(sqe->addr3); + zc->datalen = recvzc_cmd >> 32; + if (recvzc_cmd & 0xffff) + return -EINVAL; + if (!(req->ctx->flags & IORING_SETUP_DEFER_TASKRUN)) + return -EINVAL; + if (unlikely(sqe->file_index || sqe->addr2)) + return -EINVAL; + + zc->len = READ_ONCE(sqe->len); + zc->flags = READ_ONCE(sqe->ioprio); + if (zc->flags & ~(RECVMSG_FLAGS)) + return -EINVAL; + zc->msg_flags = READ_ONCE(sqe->msg_flags); + if (zc->msg_flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + if (zc->msg_flags & MSG_ERRQUEUE) + req->flags |= REQ_F_CLEAR_POLLIN; + if (zc->flags & IORING_RECV_MULTISHOT) { + if (zc->msg_flags & MSG_WAITALL) + return -EINVAL; + if (req->opcode == IORING_OP_RECV && zc->len) + return -EINVAL; + req->flags |= REQ_F_APOLL_MULTISHOT; + } + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + zc->msg_flags |= MSG_CMSG_COMPAT; +#endif + zc->done_io = 0; + return 0; +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct socket *sock; + unsigned flags; + int ret, min_ret = 0; + bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct io_zc_rx_ifq *ifq; + + if (issue_flags & 
IO_URING_F_UNLOCKED) + return -EAGAIN; + + if (!(req->flags & REQ_F_POLLED) && + (zc->flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + ifq = io_zc_verify_sock(req, sock); + if (!ifq) + return -EINVAL; + +retry_multishot: + flags = zc->msg_flags; + if (force_nonblock) + flags |= MSG_DONTWAIT; + if (flags & MSG_WAITALL) + min_ret = zc->len; + + ret = io_zc_rx_recv(ifq, sock, zc->datalen, flags); + if (ret < min_ret) { + if (ret == -EAGAIN && force_nonblock) { + if (issue_flags & IO_URING_F_MULTISHOT) + return IOU_ISSUE_SKIP_COMPLETE; + return -EAGAIN; + } + if (ret > 0 && io_net_retry(sock, flags)) { + zc->len -= ret; + zc->done_io += ret; + req->flags |= REQ_F_PARTIAL_IO; + return -EAGAIN; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + req_set_fail(req); + } else if ((flags & MSG_WAITALL) && (flags & (MSG_TRUNC | MSG_CTRUNC))) { + req_set_fail(req); + } + + if (ret > 0) + ret += zc->done_io; + else if (zc->done_io) + ret = zc->done_io; + + if (!io_recv_finish(req, &ret, 0, ret <= 0, issue_flags)) + goto retry_multishot; + + return ret; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index 799db44283c7..a90231566d09 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -35,6 +35,7 @@ #include "rw.h" #include "waitid.h" #include "futex.h" +#include "zc_rx.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -467,6 +468,18 @@ const struct io_issue_def io_issue_defs[] = { .issue = io_futexv_wait, #else .prep = io_eopnotsupp_prep, +#endif + }, + [IORING_OP_RECV_ZC] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, #endif }, }; @@ -704,6 +717,9 @@ const struct io_cold_def io_cold_defs[] = { [IORING_OP_FUTEX_WAITV] = { .name = "FUTEX_WAITV", }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index ff1dac24ac40..acb70ca23150 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -15,8 +16,20 @@ #include "zc_rx.h" #include "rsrc.h" +struct io_zc_rx_args { + struct io_zc_rx_ifq *ifq; + struct socket *sock; +}; + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); +static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + + return ifq->cached_cq_tail - READ_ONCE(ring->cq.head); +} + static inline struct device *netdev2dev(struct net_device *dev) { return dev->dev.parent; @@ -399,7 +412,7 @@ static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool, return buf - pool->bufs; } -static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) +static void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf) { refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount); } @@ -590,5 +603,230 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = { }; EXPORT_SYMBOL(io_uring_pp_zc_ops); +static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + struct io_uring_rbuf_cqe *cqe; + unsigned int cq_idx, queued, free, entries; + unsigned int mask = ifq->cq_entries - 1; + + cq_idx = ifq->cached_cq_tail & mask; + smp_rmb(); + queued = min(io_zc_rx_cqring_entries(ifq), 
ifq->cq_entries); + free = ifq->cq_entries - queued; + entries = min(free, ifq->cq_entries - cq_idx); + if (!entries) + return NULL; + + cqe = &ifq->cqes[cq_idx]; + ifq->cached_cq_tail++; + return cqe; +} + +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, + int off, int len, unsigned sock_idx) +{ + off += skb_frag_off(frag); + + if (likely(page_is_page_pool_iov(frag->bv_page))) { + struct io_uring_rbuf_cqe *cqe; + struct io_zc_rx_buf *buf; + struct page_pool_iov *ppiov; + + ppiov = page_to_page_pool_iov(frag->bv_page); + if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX || + ppiov->pp->mp_priv != ifq) + return -EFAULT; + + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) + return -ENOBUFS; + + buf = io_iov_to_buf(ppiov); + io_zc_rx_get_buf_uref(buf); + + cqe->region = 0; + cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off; + cqe->len = len; + cqe->sock = sock_idx; + cqe->flags = 0; + } else { + return -EOPNOTSUPP; + } + + return len; +} + +static int +zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct io_zc_rx_args *args = desc->arg.data; + struct io_zc_rx_ifq *ifq = args->ifq; + struct socket *sock = args->sock; + unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK; + struct sk_buff *frag_iter; + unsigned start, start_off; + int i, copy, end, off; + int ret = 0; + + start = skb_headlen(skb); + start_off = offset; + + if (offset < start) + return -EOPNOTSUPP; + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + WARN_ON(start > offset + len); + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_frag(ifq, frag, off, copy, sock_idx); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + WARN_ON(start > offset + len); + + end = start + frag_iter->len; + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail); + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk) +{ + struct io_zc_rx_args args = { + .ifq = ifq, + .sock = sk->sk_socket, + }; + read_descriptor_t rd_desc = { + .count = 1, + .arg.data = &args, + }; + + return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb); +} + +static int io_zc_rx_tcp_recvmsg(struct io_zc_rx_ifq *ifq, struct sock *sk, + unsigned int recv_limit, + int flags, int *addr_len) +{ + size_t used; + long timeo; + int ret; + + ret = used = 0; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (recv_limit) { + ret = io_zc_rx_tcp_read(ifq, sk); + if (ret < 0) + break; + if (!ret) { + if (used) + break; + if (sock_flag(sk, SOCK_DONE)) + break; + if (sk->sk_err) { + ret = sock_error(sk); + break; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + if (sk->sk_state == TCP_CLOSE) { + ret = -ENOTCONN; + break; + } + if (!timeo) { + ret = -EAGAIN; + break; + } + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + sk_wait_data(sk, &timeo, NULL); + if (signal_pending(current)) { + ret = 
sock_intr_errno(timeo); + break; + } + continue; + } + recv_limit -= ret; + used += ret; + + if (!timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (sk->sk_err || sk->sk_state == TCP_CLOSE || + (sk->sk_shutdown & RCV_SHUTDOWN) || + signal_pending(current)) + break; + } + release_sock(sk); + /* TODO: handle timestamping */ + return used ? used : ret; +} + +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags) +{ + struct sock *sk = sock->sk; + const struct proto *prot; + int addr_len = 0; + int ret; + + if (flags & MSG_ERRQUEUE) + return -EOPNOTSUPP; + + prot = READ_ONCE(sk->sk_prot); + if (prot->recvmsg != tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + + ret = io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len); + + return ret; +} #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 00d864700c67..3e8f07e4b252 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -72,4 +72,9 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, } #endif +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags); + #endif From patchwork Tue Dec 19 21:03:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499107 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f174.google.com (mail-pl1-f174.google.com [209.85.214.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E50AB3EA7A for ; Tue, 19 Dec 2023 21:04:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="dCsXqDtQ" Received: by mail-pl1-f174.google.com with SMTP id d9443c01a7336-1d3ce28ac3cso16788035ad.0 for ; Tue, 19 Dec 2023 13:04:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019860; x=1703624660; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=GxpOp5keoBQGOz00dSxKjcoORly3Uq6QcOZ8S/MSAKQ=; b=dCsXqDtQQRdS3qE2x27ZjDcAt67Sxxj3q5igk7TxkZ7nIOZpby+uKKUQVB4gyTNR2B d/L3Wpryht4RLo4k/VLZvzmRNzmt5fUat5grHDhOtZgChXjsR5otxIYKad4XuUJyY0Fk zx0b8m/zhlm9lzDjzaosaAtMzmQ4BWFP7oaZccY2Z4/vpXsfE8yEcycYeI6k8YK25iHd GWbnI3Wx4fwBP/OAD9HUMlgpNmm0kna8DXXEVjtzE67UZtOZxo10aXK3Dj2XqMzJDHSd xxlnPBXLh7ACFeLSdSXiB+sbaDE6l13TnIDlTzitzRhwD17Sy6gYrLqhODabsWHvzvp9 kq3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019860; x=1703624660; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=GxpOp5keoBQGOz00dSxKjcoORly3Uq6QcOZ8S/MSAKQ=; b=RvgTHrMzGfB4W4Igw3Zw+HzTH39MS2/g0wghvSZyQ5wexXd7+wtQmXZesKu3uF5gE7 eRpB0LQ6m5hF9hGQOgRYRMsUHoDl2KcBA3Qt/qLzZMw73d/udv4UYvKNEKl0A4tO7jnZ 
c9LkLjFEva67ROE5+kqoO0KT4cZGEaV3eEiE3vyy5GL4Yh2LvAf0TgdxxNacXEQN6+r5 AjbZohcVG0Q0aicXhq3MuOB5Nq7J0AB+ZHbRcGeA/o6BMNdJEWd+F469eSTxfnoxoV5N huJ1BJriVL+g0Kxbc81NawJglBlNCfqWE7+fWuntinwI7bKlu6HjWEyjfIkPdBcPOJEo EGcg== X-Gm-Message-State: AOJu0YxjuFZJb2cCGFCk2AZC3UaMUq8K+Ajr3Xtv9I1H/deh+u1owEgA ePi9xxIA6vGNLOu6tKMyImOQrw== X-Google-Smtp-Source: AGHT+IGggFl0JrUM4oTE/WqbU6Jj7xNuoLiaUQdJMih11dS+6jNAD97QTKlC1myhDJGi0F2OblMPZQ== X-Received: by 2002:a17:903:11c5:b0:1d0:6ffd:8367 with SMTP id q5-20020a17090311c500b001d06ffd8367mr9469743plh.114.1703019860208; Tue, 19 Dec 2023 13:04:20 -0800 (PST) Received: from localhost (fwdproxy-prn-013.fbsv.net. [2a03:2880:ff:d::face:b00c]) by smtp.gmail.com with ESMTPSA id l5-20020a170903120500b001d349fcb70dsm12563264plh.202.2023.12.19.13.04.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:19 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 16/20] net: execute custom callback from napi Date: Tue, 19 Dec 2023 13:03:53 -0800 Message-Id: <20231219210357.4029713-17-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov Sometimes we want to access a napi protected resource from task context like in the case of io_uring zc falling back to copy and accessing the buffer ring. Add a helper function that allows to execute a custom function from napi context by first stopping it similarly to napi_busy_loop(). Experimental, needs much polishing and sharing bits with napi_busy_loop(). 
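
For illustration, a rough sketch of how a task-context caller might use the helper (hypothetical caller; the my_* names are made up for this example, only napi_execute() itself, page_pool_dev_alloc_pages() and pp->p.napi are taken from the series):

struct my_refill_ctx {
        struct page_pool *pp;
        struct page *page;
};

static void my_refill_cb(void *arg)
{
        struct my_refill_ctx *ctx = arg;

        /* runs with the napi instance stopped and bottom halves disabled */
        ctx->page = page_pool_dev_alloc_pages(ctx->pp);
}

static struct page *my_alloc_from_task(struct page_pool *pp)
{
        struct my_refill_ctx ctx = { .pp = pp };

        /* pp->p.napi is assumed to point at the napi driving this pool */
        napi_execute(pp->p.napi, my_refill_cb, &ctx);
        return ctx.page;
}

The io_uring copy fallback later in this series uses the helper in exactly this shape to pull a buffer out of a napi-protected page pool.
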
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/busy_poll.h | 7 +++++++ net/core/dev.c | 46 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+) diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index 4dabeb6c76d3..64238467e00a 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -47,6 +47,8 @@ bool sk_busy_loop_end(void *p, unsigned long start_time); void napi_busy_loop(unsigned int napi_id, bool (*loop_end)(void *, unsigned long), void *loop_end_arg, bool prefer_busy_poll, u16 budget); +void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg); #else /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long net_busy_loop_on(void) @@ -59,6 +61,11 @@ static inline bool sk_can_busy_loop(struct sock *sk) return false; } +static inline void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg) +{ +} + #endif /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long busy_loop_current_time(void) diff --git a/net/core/dev.c b/net/core/dev.c index e55750c47245..2dd4f3846535 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6537,6 +6537,52 @@ void napi_busy_loop(unsigned int napi_id, } EXPORT_SYMBOL(napi_busy_loop); +void napi_execute(struct napi_struct *napi, + void (*cb)(void *), void *cb_arg) +{ + bool done = false; + unsigned long val; + void *have_poll_lock = NULL; + + rcu_read_lock(); + + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_disable(); + for (;;) { + local_bh_disable(); + val = READ_ONCE(napi->state); + + /* If multiple threads are competing for this napi, + * we avoid dirtying napi->state as much as we can. + */ + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED | + NAPIF_STATE_IN_BUSY_POLL)) + goto restart; + + if (cmpxchg(&napi->state, val, + val | NAPIF_STATE_IN_BUSY_POLL | + NAPIF_STATE_SCHED) != val) + goto restart; + + have_poll_lock = netpoll_poll_lock(napi); + cb(cb_arg); + done = true; + gro_normal_list(napi); + local_bh_enable(); + break; +restart: + local_bh_enable(); + if (unlikely(need_resched())) + break; + cpu_relax(); + } + if (done) + busy_poll_stop(napi, have_poll_lock, false, 1); + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_enable(); + rcu_read_unlock(); +} + #endif /* CONFIG_NET_RX_BUSY_POLL */ static void napi_hash_add(struct napi_struct *napi) From patchwork Tue Dec 19 21:03:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499108 Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9B8CD405D3 for ; Tue, 19 Dec 2023 21:04:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="AdL53sTm" Received: by mail-pl1-f169.google.com with SMTP id d9443c01a7336-1d3e2972f65so6160715ad.3 for ; Tue, 19 Dec 2023 13:04:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019861; x=1703624661; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=88y6GqmlC4OquYgwdteTX7sFZdaMlDFOxYfP/ATHY4U=; b=AdL53sTmXAOuiJZ0HZu2jKQQj+9POMGtPmzl1IEyLlF1LowjmXhdnVlu1mKcdhfnc2 qBYU/UAj2+PRoy00+6Y4vCzpDFcq5z69+0PIiZ5XZFCRVLUc6NK5GGQoAgjogllawytu WeJTuj7YvAfuuVnNdfZ0sdYEMVR3xhvn1WV1WdPquQyEMCuVUX5lO4FjAZdSm3t8v1Q2 XwDTQ7/yDilBSgZU5mpa3NQXJ7hSeEX2+IUasxrn+BmLZ0ASQ/n/idbCPdTLj1hIayCy ZmlrhaCPU8ybwCGnmgF4+raQuC/TLH8nwtLtGHmKp6hBFgLJYBLNR+fRI9GbE9hKh0/E urmg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019861; x=1703624661; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=88y6GqmlC4OquYgwdteTX7sFZdaMlDFOxYfP/ATHY4U=; b=fU99bvjbzY3TqLONRTfQMChrBtkMYgqIVqVJVGDomCGCiS/HHeJKAAbxH0hMgSbCpt 9fy/AKDy8NLtEP9PHRRXtWCUX5b3Bnd/RWSpJGUHUHwBC/V1VD/M4rFUZq4hzCtITC5N 8QnQtPKgYifgvsLFF1qCCk1Aai+8wGl2nlg8AmWzgbpCopoOES7FAQzjStHeeu5oknuy X58S9aCPif0ImtZlIPuVjKYk8+W+iU93d+wxzjC3abPp07NQN+2j6hgKI2/SHiFD6I8e lhbI32716q9KGB1IhHaSQMAC23U9+Lrkpc08O1eGRV6SQ554EUIiuLG4PgCQ1e6FIYoQ UpTg== X-Gm-Message-State: AOJu0YzuTTCkQdcpKmftzfD6QRSn8UUDrlHhzJj1gvvXxhC3ThJujsu2 gyUmP1c/2jv/hmMMlvm0vApq1g== X-Google-Smtp-Source: AGHT+IEpQz41IIokOuYaoRjPu5IHvUE6igpmNn5ccwm0j3KSVHCAAgpLK/9gBFnbdEc/qrCQqXJ9+g== X-Received: by 2002:a17:902:e88f:b0:1d0:7072:e241 with SMTP id w15-20020a170902e88f00b001d07072e241mr12430623plg.49.1703019861044; Tue, 19 Dec 2023 13:04:21 -0800 (PST) Received: from localhost (fwdproxy-prn-010.fbsv.net. [2a03:2880:ff:a::face:b00c]) by smtp.gmail.com with ESMTPSA id az4-20020a170902a58400b001b7f40a8959sm21395712plb.76.2023.12.19.13.04.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:20 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback Date: Tue, 19 Dec 2023 13:03:54 -0800 Message-Id: <20231219210357.4029713-18-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-State: RFC From: Pavel Begunkov Currently, if user fails to keep up with the network and doesn't refill the buffer ring fast enough the NIC/driver will start dropping packets. That might be too punishing. Add a fallback path, which would allow drivers to allocate normal pages when there is starvation, then zc_rx_recv_skb() we'll detect them and copy into the user specified buffers, when they become available. That should help with adoption and also help the user striking the right balance allocating just the right amount of zerocopy buffers but also being resilient to sudden surges in traffic. 
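
In outline, the core of the fallback added below is the following (simplified sketch for readability only; the real zc_rx_copy_chunk() below also loops over multiple pages, tracks partial progress and transfers buffer references):

static ssize_t copy_chunk_outline(struct io_zc_rx_ifq *ifq, const void *data,
                                  size_t len, unsigned sock_idx)
{
        struct io_uring_rbuf_cqe *cqe;
        struct io_zc_rx_buf *buf;
        void *vaddr;

        cqe = io_zc_get_rbuf_cqe(ifq);          /* reserve a completion slot */
        if (!cqe)
                return -ENOBUFS;
        buf = io_zc_get_buf_task_safe(ifq);     /* borrow a zc buffer via napi_execute() */
        if (!buf)
                return -ENOMEM;

        len = min_t(size_t, len, PAGE_SIZE);
        vaddr = kmap_local_page(buf->page);
        memcpy(vaddr, data, len);               /* the one copy this path makes */
        kunmap_local(vaddr);

        /* the completion looks the same as in the zerocopy case */
        cqe->region = 0;
        cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE;
        cqe->len = len;
        cqe->sock = sock_idx;
        cqe->flags = 0;
        return len;
}

copy_chunk_outline() is not part of the patch; the helpers it calls are the ones introduced by this patch and the previous ones.
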
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 126 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 120 insertions(+), 6 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index acb70ca23150..f7d99d569885 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -21,6 +22,11 @@ struct io_zc_rx_args { struct socket *sock; }; +struct io_zc_refill_data { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *buf; +}; + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) @@ -603,6 +609,39 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = { }; EXPORT_SYMBOL(io_uring_pp_zc_ops); +static void io_napi_refill(void *data) +{ + struct io_zc_refill_data *rd = data; + struct io_zc_rx_ifq *ifq = rd->ifq; + void *page; + + if (WARN_ON_ONCE(!ifq->pp)) + return; + + page = page_pool_dev_alloc_pages(ifq->pp); + if (!page) + return; + if (WARN_ON_ONCE(!page_is_page_pool_iov(page))) + return; + + rd->buf = io_iov_to_buf(page_to_page_pool_iov(page)); +} + +static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq) +{ + struct io_zc_refill_data rd = { + .ifq = ifq, + }; + + napi_execute(ifq->pp->p.napi, io_napi_refill, &rd); + return rd.buf; +} + +static inline void io_zc_return_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + ifq->cached_cq_tail--; +} + static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) { struct io_uring_rbuf_cqe *cqe; @@ -622,6 +661,51 @@ static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq * return cqe; } +static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, + unsigned int offset, size_t len, + unsigned sock_idx) +{ + size_t copy_size, copied = 0; + struct io_uring_rbuf_cqe *cqe; + struct io_zc_rx_buf *buf; + int ret = 0, off = 0; + u8 *vaddr; + + do { + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) { + ret = -ENOBUFS; + break; + } + buf = io_zc_get_buf_task_safe(ifq); + if (!buf) { + io_zc_return_rbuf_cqe(ifq); + ret = -ENOMEM; + break; + } + + vaddr = kmap_local_page(buf->page); + copy_size = min_t(size_t, PAGE_SIZE, len); + memcpy(vaddr, data + offset, copy_size); + kunmap_local(vaddr); + + cqe->region = 0; + cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off; + cqe->len = copy_size; + cqe->flags = 0; + cqe->sock = sock_idx; + + io_zc_rx_get_buf_uref(buf); + page_pool_iov_put_many(&buf->ppiov, 1); + + offset += copy_size; + len -= copy_size; + copied += copy_size; + } while (offset < len); + + return copied ? copied : ret; +} + static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, int off, int len, unsigned sock_idx) { @@ -650,7 +734,22 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, cqe->sock = sock_idx; cqe->flags = 0; } else { - return -EOPNOTSUPP; + struct page *page = skb_frag_page(frag); + u32 p_off, p_len, t, copied = 0; + u8 *vaddr; + int ret = 0; + + skb_frag_foreach_page(frag, off, len, + page, p_off, p_len, t) { + vaddr = kmap_local_page(page); + ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len, sock_idx); + kunmap_local(vaddr); + + if (ret < 0) + return copied ? 
copied : ret; + copied += ret; + } + len = copied; } return len; @@ -665,15 +764,30 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, struct socket *sock = args->sock; unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK; struct sk_buff *frag_iter; - unsigned start, start_off; + unsigned start, start_off = offset; int i, copy, end, off; int ret = 0; - start = skb_headlen(skb); - start_off = offset; + if (unlikely(offset < skb_headlen(skb))) { + ssize_t copied; + size_t to_copy; - if (offset < start) - return -EOPNOTSUPP; + to_copy = min_t(size_t, skb_headlen(skb) - offset, len); + copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy, + sock_idx); + if (copied < 0) { + ret = copied; + goto out; + } + offset += copied; + len -= copied; + if (!len) + goto out; + if (offset != skb_headlen(skb)) + goto out; + } + + start = skb_headlen(skb); for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { const skb_frag_t *frag; From patchwork Tue Dec 19 21:03:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499109 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-oo1-f54.google.com (mail-oo1-f54.google.com [209.85.161.54]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 07F1240BED for ; Tue, 19 Dec 2023 21:04:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="EJ4kPyrn" Received: by mail-oo1-f54.google.com with SMTP id 006d021491bc7-593f6fb21a5so419430eaf.2 for ; Tue, 19 Dec 2023 13:04:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019862; x=1703624662; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=cIyS0qBLv2Ng8nfVCtfxHFE9coSuv9Yrxpz6D1RsN7A=; b=EJ4kPyrnCLitPryvfTEVArLlCVcYyLCXDto2g82tCT15oFvzVGUgRzScnbWKo6qEjv 9PCMqpC/OZkop3Eazrw6kRU+gA+eu1zYgodHDmcg9wgN68d5zaa7PPn8XZgpOfPYz+ld 7pai523V4SNVMAVmovIRUL7SiX5wLuGxt5Csliv0OLj3sP8rdYx92xliFvXmuHHcOHWo AmeTgpiebhyqVIuXXDBfnFqm1eSy+r5SgZDbbW27K4QfQAq3TMl1GSa3PwcoDx02QUve 6MY7KDLmI98kjRbwYMLS+4jNiX5dbBL+I2iNSX4KxwrLjzWE7GvftT4jl/GyuoE+hXLb Sfhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019862; x=1703624662; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=cIyS0qBLv2Ng8nfVCtfxHFE9coSuv9Yrxpz6D1RsN7A=; b=ANKKLObLhIyYJ4EVSQlF2yuKM3gzMqvGdW0BHO26qn4ogRqZvkbJkTjyEwJUYZbLu8 4zj9NyZGjsS4uBOwhe6YRIigoSSW35nSDbGvJT63kpJBaR8EN+xe72hNcYWk1cFLaVZ6 jgpE9Nay1zrmFNIhdHClou+H0u6npHqfhio1+yTuowRkY418YgTJ4VR2TU9pVX5/BSB2 f1NNiVeDsLFTMQsF3p5MuAclG8DIPmr75nh5cjagEX5qfUpHOWjs5HJDbVh0jnB65JMr dAeLmvRR3v7sXTX3avnIQ76+K4CCzFzovsUUWYF2oTG6p+R6eJSAfMSezNA2bBvboOgJ YJvw== X-Gm-Message-State: AOJu0YwA98cUVwCXcW1n8f6tvxl1lNhbiV580CKIh9NfP8bsK6iymuFs tvQVQVw6PanTadZn5dKBuFiWQA== X-Google-Smtp-Source: 
AGHT+IFvVsl5rOhi1y2F4/tSu5S+c7GqtoW/ErMsl6lkfBz1N/WlWCxRk6BDJFQtIHdc0xjgKGXyDA== X-Received: by 2002:a05:6358:c325:b0:173:50b:26ed with SMTP id fk37-20020a056358c32500b00173050b26edmr286038rwb.36.1703019862058; Tue, 19 Dec 2023 13:04:22 -0800 (PST) Received: from localhost (fwdproxy-prn-005.fbsv.net. [2a03:2880:ff:5::face:b00c]) by smtp.gmail.com with ESMTPSA id e7-20020a056a001a8700b006ce835b77d9sm3615155pfv.20.2023.12.19.13.04.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:21 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 18/20] veth: add support for io_uring zc rx Date: Tue, 19 Dec 2023 13:03:55 -0800 Message-Id: <20231219210357.4029713-19-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov NOT FOR UPSTREAM, TESTING ONLY. Add io_uring zerocopy support for veth. It's not actually zerocopy, we copy data in napi, which is early enough in the stack to be useful for testing. Note, we'll need some virtual dev support for testing, but that should not be in the way of real workloads. Signed-off-by: David Wei --- drivers/net/veth.c | 211 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 205 insertions(+), 6 deletions(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 57efb3454c57..dd00e172979f 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #define DRV_NAME "veth" @@ -75,6 +76,7 @@ struct veth_priv { struct bpf_prog *_xdp_prog; struct veth_rq *rq; unsigned int requested_headroom; + bool zc_installed; }; struct veth_xdp_tx_bq { @@ -335,9 +337,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev, const struct net_device *rcv, const struct sk_buff *skb) { + struct veth_priv *rcv_priv = netdev_priv(rcv); + return !(dev->features & NETIF_F_ALL_TSO) || (skb->destructor == sock_wfree && - rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)); + rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)) || + rcv_priv->zc_installed; } static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) @@ -726,6 +731,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq, struct sk_buff *skb = *pskb; u32 frame_sz; + if (WARN_ON_ONCE(1)) + return -EFAULT; + if (skb_shared(skb) || skb_head_is_locked(skb) || skb_shinfo(skb)->nr_frags || skb_headroom(skb) < XDP_PACKET_HEADROOM) { @@ -827,6 +835,90 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq, return -ENOMEM; } +static noinline struct sk_buff *veth_iou_rcv_skb(struct veth_rq *rq, + struct sk_buff *skb) +{ + struct sk_buff *nskb; + u32 size, len, off, max_head_size; + struct page *page; + int ret, i, head_off; + void *vaddr; + + /* Testing only, randomly send normal pages to test copy fallback */ + if (ktime_get_ns() % 16 == 0) + return skb; + + skb_prepare_for_gro(skb); + max_head_size = skb_headlen(skb); + + rcu_read_lock(); + nskb = napi_alloc_skb(&rq->xdp_napi, max_head_size); + if (!nskb) + goto drop; + + skb_copy_header(nskb, skb); + 
skb_mark_for_recycle(nskb); + + size = max_head_size; + if (skb_copy_bits(skb, 0, nskb->data, size)) { + consume_skb(nskb); + goto drop; + } + skb_put(nskb, size); + head_off = skb_headroom(nskb) - skb_headroom(skb); + skb_headers_offset_update(nskb, head_off); + + /* Allocate paged area of new skb */ + off = size; + len = skb->len - off; + + for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) { + struct io_zc_rx_buf *buf; + void *ppage; + + ppage = page_pool_dev_alloc_pages(rq->page_pool); + if (!ppage) { + consume_skb(nskb); + goto drop; + } + if (WARN_ON_ONCE(!page_is_page_pool_iov(ppage))) { + consume_skb(nskb); + goto drop; + } + + buf = container_of(page_to_page_pool_iov(ppage), + struct io_zc_rx_buf, ppiov); + page = buf->page; + + if (WARN_ON_ONCE(buf->ppiov.pp != rq->page_pool)) + goto drop; + + size = min_t(u32, len, PAGE_SIZE); + skb_add_rx_frag(nskb, i, ppage, 0, size, PAGE_SIZE); + + vaddr = kmap_atomic(page); + ret = skb_copy_bits(skb, off, vaddr, size); + kunmap_atomic(vaddr); + + if (ret) { + consume_skb(nskb); + goto drop; + } + len -= size; + off += size; + } + rcu_read_unlock(); + + consume_skb(skb); + skb = nskb; + return skb; +drop: + rcu_read_unlock(); + kfree_skb(skb); + return NULL; +} + + static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, struct veth_xdp_tx_bq *bq, @@ -970,8 +1062,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, /* ndo_start_xmit */ struct sk_buff *skb = ptr; - stats->xdp_bytes += skb->len; - skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + if (rq->page_pool->p.memory_provider == PP_MP_IOU_ZCRX) { + skb = veth_iou_rcv_skb(rq, skb); + } else { + stats->xdp_bytes += skb->len; + skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + } + if (skb) { if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC)) netif_receive_skb(skb); @@ -1030,15 +1127,21 @@ static int veth_poll(struct napi_struct *napi, int budget) return done; } -static int veth_create_page_pool(struct veth_rq *rq) +static int veth_create_page_pool(struct veth_rq *rq, struct io_zc_rx_ifq *ifq) { struct page_pool_params pp_params = { .order = 0, .pool_size = VETH_RING_SIZE, .nid = NUMA_NO_NODE, .dev = &rq->dev->dev, + .napi = &rq->xdp_napi, }; + if (ifq) { + pp_params.mp_priv = ifq; + pp_params.memory_provider = PP_MP_IOU_ZCRX; + } + rq->page_pool = page_pool_create(&pp_params); if (IS_ERR(rq->page_pool)) { int err = PTR_ERR(rq->page_pool); @@ -1056,7 +1159,7 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end) int err, i; for (i = start; i < end; i++) { - err = veth_create_page_pool(&priv->rq[i]); + err = veth_create_page_pool(&priv->rq[i], NULL); if (err) goto err_page_pool; } @@ -1112,9 +1215,17 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end) for (i = start; i < end; i++) { struct veth_rq *rq = &priv->rq[i]; + void *ptr; + int nr = 0; rq->rx_notify_masked = false; - ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free); + + while ((ptr = ptr_ring_consume(&rq->xdp_ring))) { + veth_ptr_free(ptr); + nr++; + } + + ptr_ring_cleanup(&rq->xdp_ring, NULL); } for (i = start; i < end; i++) { @@ -1350,6 +1461,9 @@ static int veth_set_channels(struct net_device *dev, struct net_device *peer; int err; + if (priv->zc_installed) + return -EINVAL; + /* sanity check. 
Upper bounds are already enforced by the caller */ if (!ch->rx_count || !ch->tx_count) return -EINVAL; @@ -1427,6 +1541,8 @@ static int veth_open(struct net_device *dev) struct net_device *peer = rtnl_dereference(priv->peer); int err; + priv->zc_installed = false; + if (!peer) return -ENOTCONN; @@ -1604,6 +1720,84 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr) rcu_read_unlock(); } +static int __veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + bool napi_already_on = veth_gro_requested(dev) && (dev->flags & IFF_UP); + unsigned qid = xdp->zc_rx.queue_id; + struct veth_priv *priv = netdev_priv(dev); + struct net_device *peer; + struct veth_rq *rq; + int ret; + + if (priv->_xdp_prog) + return -EINVAL; + if (qid >= dev->real_num_rx_queues) + return -EINVAL; + if (!(dev->flags & IFF_UP)) + return -EOPNOTSUPP; + if (dev->real_num_rx_queues != 1) + return -EINVAL; + rq = &priv->rq[qid]; + + if (!xdp->zc_rx.ifq) { + if (!priv->zc_installed) + return -EINVAL; + + veth_napi_del(dev); + priv->zc_installed = false; + if (!veth_gro_requested(dev) && netif_running(dev)) { + dev->features &= ~NETIF_F_GRO; + netdev_features_change(dev); + } + return 0; + } + + if (priv->zc_installed) + return -EINVAL; + + peer = rtnl_dereference(priv->peer); + peer->hw_features &= ~NETIF_F_GSO_SOFTWARE; + + ret = veth_create_page_pool(rq, xdp->zc_rx.ifq); + if (ret) + return ret; + + ret = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL); + if (ret) { + page_pool_destroy(rq->page_pool); + rq->page_pool = NULL; + return ret; + } + + priv->zc_installed = true; + + if (!veth_gro_requested(dev)) { + /* user-space did not require GRO, but adding XDP + * is supposed to get GRO working + */ + dev->features |= NETIF_F_GRO; + netdev_features_change(dev); + } + if (!napi_already_on) { + netif_napi_add(dev, &rq->xdp_napi, veth_poll); + napi_enable(&rq->xdp_napi); + rcu_assign_pointer(rq->napi, &rq->xdp_napi); + } + return 0; +} + +static int veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + int ret; + + rtnl_lock(); + ret = __veth_iou_set(dev, xdp); + rtnl_unlock(); + return ret; +} + static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, struct netlink_ext_ack *extack) { @@ -1613,6 +1807,9 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, unsigned int max_mtu; int err; + if (priv->zc_installed) + return -EINVAL; + old_prog = priv->_xdp_prog; priv->_xdp_prog = prog; peer = rtnl_dereference(priv->peer); @@ -1691,6 +1888,8 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return veth_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_ZC_RX: + return veth_iou_set(dev, xdp); default: return -EINVAL; } From patchwork Tue Dec 19 21:03:56 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499110 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pj1-f45.google.com (mail-pj1-f45.google.com [209.85.216.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 71F4D3EA7A for ; Tue, 19 Dec 2023 21:04:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass 
(2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="KC8owpaR" Received: by mail-pj1-f45.google.com with SMTP id 98e67ed59e1d1-28b06be7cf6so2306618a91.2 for ; Tue, 19 Dec 2023 13:04:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019863; x=1703624663; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=aEQ6N11l7v8r/nsFkAQ5qbz/E3VFP+byl5ecjp5V+Xo=; b=KC8owpaRkA58F5+Z25p+K2xkowfxp4/q1LQm6UrBkDSn+16W93lYhGwzbSTPHeZKP5 Ya2DbPp6CM7EzydW+fWitqJpTQRZLe2hEBN248j6JTuAR3yTCiHaQ/07/1vTR71ua7SV rwmuLIHEzLtKOCHNXLBG4HkuJUlINu0BLdhoR29q3TlfKw+/Tmj0uGWEZHEHw941DLI6 WMYnH7K8DKyd0/SRAlqHFsyAupiYqi2h8zohThRcH327GfeFQTTEAax9wgQPCnQiUZET JSEKRluk4n2rgC9eSCPcsXLUvSKP5tir6XtvYDZK4a79MuJMS7w8NimN/FQBYVrWEZvm PYCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019863; x=1703624663; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aEQ6N11l7v8r/nsFkAQ5qbz/E3VFP+byl5ecjp5V+Xo=; b=tLVWpKFJ5Q9cA4ZyxBAO9kQZPgC/MGL+RRFG9SyB8mXSqsb/25+qwnEjzEBXo3l1RP oRCGbapB5NyDWT4NpHqQkc+sgItokQ6rssgcGOsyq+OeGzxa39h7hDW1f0dJPBh8o5L6 1nZ5OqHDnAhmUNtYLE6f9eX91eQIIHdvbEMMZXz4oTQiv9ra9GDbvw457G20OzixcqHP klysM1c2Pkeq/agO3VlDveas+1pdGU+iuhEoC4DNSMO6Uqglw3G+lcw0z7eGp7BNDAMc JcnvJkoJmwyr8ZEIPPE1NSQD8tJUYsf8yyMW8IeYG23ffUP749ScZVPJ3LFKhMCbQSjq 8UWQ== X-Gm-Message-State: AOJu0Yy7ZjYSOkzTwAM4EgGh/A8xfcFhCFoYCRMLZunGA/F3L71eEGl0 kVbnlq1vt/EbBpTIGyuebqjR1g== X-Google-Smtp-Source: AGHT+IEMwF0JJgfJeogY5EMwB8SgA6SA/QjJDc0LnTgjbd/Zd8UA2CKgdeV6xAqemSSppOsZza1xMQ== X-Received: by 2002:a17:90a:d790:b0:28b:2e12:4fb3 with SMTP id z16-20020a17090ad79000b0028b2e124fb3mr3330341pju.33.1703019862907; Tue, 19 Dec 2023 13:04:22 -0800 (PST) Received: from localhost (fwdproxy-prn-002.fbsv.net. [2a03:2880:ff:2::face:b00c]) by smtp.gmail.com with ESMTPSA id dj15-20020a17090ad2cf00b0028bbf4c0264sm1752924pjb.10.2023.12.19.13.04.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:22 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get Date: Tue, 19 Dec 2023 13:03:56 -0800 Message-Id: <20231219210357.4029713-20-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov io_uring pp memory provider doesn't have contiguous dma addresses, implement page_pool_iov_dma_addr() via callbacks. Note: it might be better to stash dma address into struct page_pool_iov. 
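
For comparison, the alternative mentioned in the note would look roughly like this (hypothetical, not part of this patch): store the address in the iov when the provider maps it, keeping page_pool_iov_dma_addr() a plain load instead of an indirect call:

struct page_pool_iov {
        /* ... existing fields ... */
        dma_addr_t dma_addr;    /* hypothetical: filled in by the memory provider */
};

static inline dma_addr_t
page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
{
        return ppiov->dma_addr;
}

That trades one indirect call on the hot path against growing every iov by a dma_addr_t.
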
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/page_pool/helpers.h | 5 +---- include/net/page_pool/types.h | 2 ++ io_uring/zc_rx.c | 8 ++++++++ net/core/page_pool.c | 9 +++++++++ 4 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index aca3a52d0e22..10dba1f2aa0c 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -105,10 +105,7 @@ static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov) static inline dma_addr_t page_pool_iov_dma_addr(const struct page_pool_iov *ppiov) { - struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); - - return owner->base_dma_addr + - ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); + return ppiov->pp->mp_ops->ppiov_dma_addr(ppiov); } static inline unsigned long diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index f54ee759e362..1b9266835ab6 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -125,6 +125,7 @@ struct page_pool_stats { #endif struct mem_provider; +struct page_pool_iov; enum pp_memory_provider_type { __PP_MP_NONE, /* Use system allocator directly */ @@ -138,6 +139,7 @@ struct pp_memory_provider_ops { void (*scrub)(struct page_pool *pool); struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp); bool (*release_page)(struct page_pool *pool, struct page *page); + dma_addr_t (*ppiov_dma_addr)(const struct page_pool_iov *ppiov); }; extern const struct pp_memory_provider_ops dmabuf_devmem_ops; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index f7d99d569885..20fb89e6bad7 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -600,12 +600,20 @@ static void io_pp_zc_destroy(struct page_pool *pp) percpu_ref_put(&ifq->ctx->refs); } +static dma_addr_t io_pp_zc_ppiov_dma_addr(const struct page_pool_iov *ppiov) +{ + struct io_zc_rx_buf *buf = io_iov_to_buf((struct page_pool_iov *)ppiov); + + return buf->dma; +} + const struct pp_memory_provider_ops io_uring_pp_zc_ops = { .alloc_pages = io_pp_zc_alloc_pages, .release_page = io_pp_zc_release_page, .init = io_pp_zc_init, .destroy = io_pp_zc_destroy, .scrub = io_pp_zc_scrub, + .ppiov_dma_addr = io_pp_zc_ppiov_dma_addr, }; EXPORT_SYMBOL(io_uring_pp_zc_ops); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ebf5ff009d9d..6586631ecc2e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -1105,10 +1105,19 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, return true; } +static dma_addr_t mp_dmabuf_devmem_ppiov_dma_addr(const struct page_pool_iov *ppiov) +{ + struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); + + return owner->base_dma_addr + + ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); +} + const struct pp_memory_provider_ops dmabuf_devmem_ops = { .init = mp_dmabuf_devmem_init, .destroy = mp_dmabuf_devmem_destroy, .alloc_pages = mp_dmabuf_devmem_alloc_pages, .release_page = mp_dmabuf_devmem_release_page, + .ppiov_dma_addr = mp_dmabuf_devmem_ppiov_dma_addr, }; EXPORT_SYMBOL(dmabuf_devmem_ops); From patchwork Tue Dec 19 21:03:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13499111 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-oi1-f172.google.com (mail-oi1-f172.google.com [209.85.167.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by 
smtp.subspace.kernel.org (Postfix) with ESMTPS id C182B40C04 for ; Tue, 19 Dec 2023 21:04:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=davidwei.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="WJhzViu7" Received: by mail-oi1-f172.google.com with SMTP id 5614622812f47-3b9efed2e6fso3825758b6e.0 for ; Tue, 19 Dec 2023 13:04:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1703019864; x=1703624664; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=UaWUxwWMhpizTTTVDJEx3BaZ8y03KGcwpzKbAhXUqpk=; b=WJhzViu7VLsvKmdEJt8RM9QWBU830WM/jDux05YkwR1Sejx4gfuMWXu4yNwnOGbos5 oQsaO0qiJ0BG4SS5l/BUNx5l/E4rhlw9uH+32WCTRp+RXtIsv4mpQrYzuoP0M8Bj7eiW mpDi/NPj/Dte0Ng7ooZSM34nvjkGVFy/52OSU3VFZYJ8k3zRC5Q5wPvAk9khZMoYPqMW UrYonltAPTRMxpkQAEP3wcdpb6XLzXj5He4SC4ta+F/f5rmQb/7sTpRxxpL53cj5ESBt ywPHELf6d1avOHjRBIbh+QRU7LHPse2X10Q6tGV0Xfn2Wl1A9PtJMbltrXgL6JqBbtRn 3B/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703019864; x=1703624664; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UaWUxwWMhpizTTTVDJEx3BaZ8y03KGcwpzKbAhXUqpk=; b=Psz0stNWzefVO6ozeyp0B4TBkHnvwbqzrknD3Mfk7Jj9pLBre2U6QaZ5WN5tZOLrrb 8nEbWW/iKUFZBMlOz1R18zyUZfy1CzVuLyL3L2hKrDWcdOQn+nO81gFdueqbcdAr4BF/ qPqNtBgfNKyAandy8yCUeBGqO2EDmILPtE0ZFvePmpjVay5T6KKEYD1JpV/KbMeE6jLs ol+Ych5GILaVz1PXMxKAgybRZychS03j0hw8MGwC64cFGZRfGY7Ftx1CnsuGTSPd7mRh G4Ab23yygeZhaDOaGzQcGL33nyNSJa8VrVBT5VSAArLYAwOPz7xu59KGtEIx5cjGj+H2 Qdvg== X-Gm-Message-State: AOJu0YwJeGpfOIQmEmfr3K7GWwRk8n3rqsXTEh+D74dp6pVKBgpGTtTG gI3Eah+2hKPbbPZQ0umv4eylrQ== X-Google-Smtp-Source: AGHT+IEtTwvrmgbTpMQqNKAuQVjNzW+eoJccv8bAayAZyfBOqrS0CFXslrtBQSxc+fixpkFaqhl1Kw== X-Received: by 2002:a05:6358:99a0:b0:170:17eb:203c with SMTP id j32-20020a05635899a000b0017017eb203cmr19345106rwb.37.1703019863804; Tue, 19 Dec 2023 13:04:23 -0800 (PST) Received: from localhost (fwdproxy-prn-118.fbsv.net. [2a03:2880:ff:76::face:b00c]) by smtp.gmail.com with ESMTPSA id fn6-20020a056a002fc600b006d838632671sm3803511pfb.101.2023.12.19.13.04.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Dec 2023 13:04:23 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry Subject: [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool Date: Tue, 19 Dec 2023 13:03:57 -0800 Message-Id: <20231219210357.4029713-21-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231219210357.4029713-1-dw@davidwei.uk> References: <20231219210357.4029713-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Pavel Begunkov TESTING ONLY Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 71 +++++++++++++++++-- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 7 ++ drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 3 + 3 files changed, 75 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 039f8d995a26..d9fb8633f226 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -55,6 +55,7 @@ #include #include #include +#include #include "bnxt_hsi.h" #include "bnxt.h" @@ -875,6 +876,25 @@ static inline u8 *__bnxt_alloc_rx_frag(struct bnxt *bp, dma_addr_t *mapping, return data; } +static inline struct page *bnxt_get_real_page(struct page *page) +{ + struct io_zc_rx_buf *buf; + + if (page_is_page_pool_iov(page)) { + buf = container_of(page_to_page_pool_iov(page), + struct io_zc_rx_buf, ppiov); + page = buf->page; + } + return page; +} + +static inline void *bnxt_get_page_address(struct page *frag) +{ + struct page *page = bnxt_get_real_page(frag); + + return page_address(page); +} + int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 prod, gfp_t gfp) { @@ -892,7 +912,7 @@ int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, mapping += bp->rx_dma_offset; rx_buf->data = page; - rx_buf->data_ptr = page_address(page) + offset + bp->rx_offset; + rx_buf->data_ptr = bnxt_get_page_address(page) + offset + bp->rx_offset; } else { u8 *data = __bnxt_alloc_rx_frag(bp, &mapping, gfp); @@ -954,8 +974,9 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp, if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE) page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp); - else + else { page = __bnxt_alloc_rx_64k_page(bp, &mapping, rxr, gfp, &offset); + } if (!page) return -ENOMEM; @@ -1079,6 +1100,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp, return NULL; } skb_mark_for_recycle(skb); + skb_reserve(skb, bp->rx_offset); __skb_put(skb, len); @@ -1118,7 +1140,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp, } skb_mark_for_recycle(skb); - off = (void *)data_ptr - page_address(page); + off = (void *)data_ptr - bnxt_get_page_address(page); skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE); memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN, payload + NET_IP_ALIGN); @@ -2032,7 +2054,6 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr, goto next_rx; } } else { - skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr->page_pool, &xdp, rxcmp1); if (!skb) { /* we should be able to free the old skb here */ bnxt_xdp_buff_frags_free(rxr, &xdp); @@ -3402,7 +3423,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp) } static int bnxt_alloc_rx_page_pool(struct bnxt *bp, - struct bnxt_rx_ring_info *rxr) + struct bnxt_rx_ring_info *rxr, + int qid) { struct page_pool_params pp = { 0 }; @@ -3416,6 +3438,13 @@ static int 
bnxt_alloc_rx_page_pool(struct bnxt *bp, pp.max_len = PAGE_SIZE; pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; + if (bp->iou_ifq && qid == bp->iou_qid) { + pp.mp_priv = bp->iou_ifq; + pp.memory_provider = PP_MP_IOU_ZCRX; + pp.max_len = PAGE_SIZE; + pp.flags = 0; + } + rxr->page_pool = page_pool_create(&pp); if (IS_ERR(rxr->page_pool)) { int err = PTR_ERR(rxr->page_pool); @@ -3442,7 +3471,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp) ring = &rxr->rx_ring_struct; - rc = bnxt_alloc_rx_page_pool(bp, rxr); + rc = bnxt_alloc_rx_page_pool(bp, rxr, i); if (rc) return rc; @@ -14347,6 +14376,36 @@ void bnxt_print_device_info(struct bnxt *bp) pcie_print_link_status(bp->pdev); } +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp) +{ + unsigned ifq_idx = xdp->zc_rx.queue_id; + + if (ifq_idx >= bp->rx_nr_rings) + return -EINVAL; + if (PAGE_SIZE != BNXT_RX_PAGE_SIZE) + return -EINVAL; + + bnxt_rtnl_lock_sp(bp); + if (!!bp->iou_ifq == !!xdp->zc_rx.ifq) { + bnxt_rtnl_unlock_sp(bp); + return -EINVAL; + } + if (netif_running(bp->dev)) { + int rc; + + bnxt_ulp_stop(bp); + bnxt_close_nic(bp, true, false); + + bp->iou_qid = ifq_idx; + bp->iou_ifq = xdp->zc_rx.ifq; + + rc = bnxt_open_nic(bp, true, false); + bnxt_ulp_start(bp, rc); + } + bnxt_rtnl_unlock_sp(bp); + return 0; +} + static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { struct net_device *dev; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index e31164e3b8fb..1003f9260805 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -2342,6 +2342,10 @@ struct bnxt { #endif u32 thermal_threshold_type; enum board_idx board_idx; + + /* io_uring zerocopy */ + void *iou_ifq; + unsigned iou_qid; }; #define BNXT_NUM_RX_RING_STATS 8 @@ -2556,4 +2560,7 @@ int bnxt_get_port_parent_id(struct net_device *dev, void bnxt_dim_work(struct work_struct *work); int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi); void bnxt_print_device_info(struct bnxt *bp); + +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp); + #endif diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 4791f6a14e55..a3ae02c31ffc 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -466,6 +466,9 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: rc = bnxt_xdp_set(bp, xdp->prog); break; + case XDP_SETUP_ZC_RX: + return bnxt_zc_rx(bp, xdp); + break; default: rc = -EINVAL; break;