Message ID | 1689966321-17337-1-git-send-email-haiyangz@microsoft.com (mailing list archive)
---|---
State | Superseded
Delegated to: | Netdev Maintainers
Series | [V3,net-next] net: mana: Add page pool for RX buffers
On 21/07/2023 21.05, Haiyang Zhang wrote:
> Add page pool for RX buffers for faster buffer cycle and reduce CPU
> usage.
>
> The standard page pool API is used.
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> V3:
> Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
> V2:
> Use the standard page pool API as suggested by Jesper Dangaard Brouer
>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
>  include/net/mana/mana.h                       |  3 +
>  2 files changed, 78 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a499e460594b..4307f25f8c7a 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[...]
> @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>
>  	if (rxq->xdp_flush)
>  		xdp_do_flush();
> +
> +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());

I don't think this page_pool_nid_changed() call is needed, if you do
as I suggest below (nid = NUMA_NO_NODE).

>  }
>
>  static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
[...]
> @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
>  	return 0;
>  }
>
> +static int mana_create_page_pool(struct mana_rxq *rxq)
> +{
> +	struct page_pool_params pprm = {};

You are implicitly assigning NUMA node id zero.

> +	int ret;
> +
> +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
> +	pprm.napi = &rxq->rx_cq.napi;

You likely want to assign pprm.nid to NUMA_NO_NODE:

	pprm.nid = NUMA_NO_NODE;

For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled,
this setting will automatically select the (preferred) NUMA node (via
``numa_mem_id()``) based on where NAPI RX-processing is currently
running. The effect is that page_pool will only use recycled memory when
the NUMA node matches the running CPU. This assumes the CPU refilling the
driver RX-ring also runs RX-NAPI.

If a driver wants more control over the NUMA node memory selection, it
can assign something other than ``NUMA_NO_NODE`` to ``pp_params.nid``
and adjust it at runtime via ``page_pool_nid_changed()``.

I will update [1] with this info.

 - Docs [1] https://kernel.org/doc/html/latest/networking/page_pool.html#registration

> +
> +	rxq->page_pool = page_pool_create(&pprm);
> +
> +	if (IS_ERR(rxq->page_pool)) {
> +		ret = PTR_ERR(rxq->page_pool);
> +		rxq->page_pool = NULL;
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
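For reference, a minimal sketch of the pool setup with the recommended NUMA_NO_NODE default applied (RX_BUFFERS_PER_QUEUE and the rx_cq.napi field are taken from the patch under review; this illustrates the suggestion, not the submitted code):

	static int mana_create_page_pool(struct mana_rxq *rxq)
	{
		struct page_pool_params pprm = {};
		int ret;

		pprm.pool_size = RX_BUFFERS_PER_QUEUE;	/* one page per RX buffer slot */
		pprm.napi = &rxq->rx_cq.napi;
		pprm.nid = NUMA_NO_NODE;	/* follow the node of the CPU running RX-NAPI */

		rxq->page_pool = page_pool_create(&pprm);
		if (IS_ERR(rxq->page_pool)) {
			ret = PTR_ERR(rxq->page_pool);
			rxq->page_pool = NULL;
			return ret;
		}

		return 0;
	}

With nid set to NUMA_NO_NODE, the pool resolves the preferred node via numa_mem_id() at allocation time, so no page_pool_nid_changed() call is needed in the poll path.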
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Monday, July 24, 2023 7:29 AM
> Subject: Re: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
[...]
> You likely want to assign pprm.nid to NUMA_NO_NODE:
>
> 	pprm.nid = NUMA_NO_NODE;
>
[...]
> If a driver wants more control over the NUMA node memory selection, it
> can assign something other than ``NUMA_NO_NODE`` to ``pp_params.nid``
> and adjust it at runtime via ``page_pool_nid_changed()``.

Our driver is using NUMA 0 by default, so I implicitly assign the NUMA node
id to zero during pool init.

And, if the IRQ/CPU affinity is changed, page_pool_nid_changed() will update
the nid for the pool. Does this sound good?

Thanks,
-Haiyang
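A small sketch of the runtime adjustment being described, assuming the rxq->page_pool field added by this patch (the helper name is hypothetical, for illustration only):

	/* Hypothetical helper: re-target the pool to the memory node of the
	 * CPU currently running RX-NAPI. After an IRQ/CPU affinity change,
	 * this keeps recycled pages local to the new CPU instead of a remote node.
	 */
	static void mana_pool_follow_cpu(struct mana_rxq *rxq)
	{
		page_pool_nid_changed(rxq->page_pool, numa_mem_id());
	}

In the posted V3 this call is made directly from mana_poll_rx_cq(), as shown in the diff below.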
> -----Original Message-----
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Sent: Monday, July 24, 2023 11:46 AM
> Subject: RE: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
[...]
> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
> to zero during pool init.
>
> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
> will update the nid for the pool. Does this sound good?

Also, since our driver is getting the default node from here:
	gc->numa_node = dev_to_node(&pdev->dev);
I will update this patch to set the default node as above, instead of
implicitly assigning it to 0.

Thanks,
- Haiyang
On 24/07/2023 20.35, Haiyang Zhang wrote:
> [...]
>> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
>> to zero during pool init.
>>
>> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
>> will update the nid for the pool. Does this sound good?
>>
>
> Also, since our driver is getting the default node from here:
> 	gc->numa_node = dev_to_node(&pdev->dev);
> I will update this patch to set the default node as above, instead of
> implicitly assigning it to 0.
>

In that case, I agree that it makes sense to use dev_to_node(&pdev->dev), like:

	pprm.nid = dev_to_node(&pdev->dev);

The driver must have a reason for assigning gc->numa_node for this hardware,
which is okay. That is why the page_pool API allows the driver to control this.

But then I don't think you should call page_pool_nid_changed() like

	page_pool_nid_changed(rxq->page_pool, numa_mem_id());

because then you will (at the first packet processing event) revert the
dev_to_node() setting to use the numa_mem_id() of the processing/running CPU.
(In effect this will be the same as setting NUMA_NO_NODE.)

I know mlx5 does call page_pool_nid_changed(), but they showed benchmark
numbers that this was the preferred action, even when the sysadmin had
"misconfigured" the default smp_affinity so that RX-processing happens on a
remote NUMA node. AFAIK mlx5 keeps the descriptor rings on the originally
configured NUMA node that corresponds to the NIC PCIe slot.

--Jesper
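Relative to the pool-init sketch earlier in the thread, the variant being discussed here would seed the nid from the device node and drop the runtime call (gc->numa_node is the driver field quoted above; treating it as reachable from the pool-create path is an assumption made for illustration):

	/* In mana_create_page_pool(), instead of NUMA_NO_NODE: */
	pprm.nid = gc->numa_node;	/* gc->numa_node = dev_to_node(&pdev->dev) */

	/* ...and the page_pool_nid_changed(rxq->page_pool, numa_mem_id())
	 * call in mana_poll_rx_cq() would be omitted, so the dev_to_node()
	 * choice is not reverted on the first poll.
	 */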
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Tuesday, July 25, 2023 2:01 PM
[...]
> In that case, I agree that it makes sense to use dev_to_node(&pdev->dev), like:
>
> 	pprm.nid = dev_to_node(&pdev->dev);
>
> But then I don't think you should call page_pool_nid_changed() like
>
> 	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
>
> because then you will (at the first packet processing event) revert the
> dev_to_node() setting to use the numa_mem_id() of the processing/running CPU.
> (In effect this will be the same as setting NUMA_NO_NODE.)
>
> I know mlx5 does call page_pool_nid_changed(), but they showed benchmark
> numbers that this was the preferred action, even when the sysadmin had
> "misconfigured" the default smp_affinity so that RX-processing happens on a
> remote NUMA node. AFAIK mlx5 keeps the descriptor rings on the originally
> configured NUMA node that corresponds to the NIC PCIe slot.

In mana_gd_setup_irqs(), we set the default IRQ/CPU affinity to gc->numa_node
too, so it won't revert the initial nid setting.

Currently, the Azure hypervisor always indicates NUMA 0 as the default. (In
the future, it will start to provide the accurate default dev node.) When a
user manually changes the IRQ/CPU affinity for perf tuning, we want to allow
page_pool_nid_changed() to update the pool. Is this OK?

Thanks,
- Haiyang
On 25/07/2023 21.02, Haiyang Zhang wrote:
>
>> -----Original Message-----
>> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
>> Sent: Tuesday, July 25, 2023 2:01 PM
[...]
>
> In mana_gd_setup_irqs(), we set the default IRQ/CPU affinity to gc->numa_node
> too, so it won't revert the initial nid setting.
>
> Currently, the Azure hypervisor always indicates NUMA 0 as the default. (In
> the future, it will start to provide the accurate default dev node.) When a
> user manually changes the IRQ/CPU affinity for perf tuning, we want to allow
> page_pool_nid_changed() to update the pool. Is this OK?
>

If I were you, I would wait with the page_pool_nid_changed() "optimization"
and do a benchmark to see if it actually has a benefit. (You can do this in
another patch.) (In an Azure hypervisor environment it might not be the right
choice.)

This reminds me, do you have any benchmark data on the improvement this
patch (using page_pool) gave?

--Jesper
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Wednesday, July 26, 2023 5:23 AM
[...]
>
> If I were you, I would wait with the page_pool_nid_changed() "optimization"
> and do a benchmark to see if it actually has a benefit. (You can do this in
> another patch.) (In an Azure hypervisor environment it might not be the right
> choice.)

Ok, I will submit a patch without the page_pool_nid_changed() optimization
for now, and will do more testing on this.

> This reminds me, do you have any benchmark data on the improvement this
> patch (using page_pool) gave?

With an iperf test using 128 threads, this patch improved the throughput by
12-15%, and decreased the IRQ-associated CPU's usage from 99-100% to 10-50%.

Thanks,
- Haiyang
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a499e460594b..4307f25f8c7a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1414,8 +1414,8 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 	return skb;
 }
 
-static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
-			struct mana_rxq *rxq)
+static void mana_rx_skb(void *buf_va, bool from_pool,
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
@@ -1448,6 +1448,9 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	if (!skb)
 		goto drop;
 
+	if (from_pool)
+		skb_mark_for_recycle(skb);
+
 	skb->dev = napi->dev;
 
 	skb->protocol = eth_type_trans(skb, ndev);
@@ -1498,9 +1501,14 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	u64_stats_update_end(&rx_stats->syncp);
 
 drop:
-	WARN_ON_ONCE(rxq->xdp_save_va);
-	/* Save for reuse */
-	rxq->xdp_save_va = buf_va;
+	if (from_pool) {
+		page_pool_recycle_direct(rxq->page_pool,
+					 virt_to_head_page(buf_va));
+	} else {
+		WARN_ON_ONCE(rxq->xdp_save_va);
+		/* Save for reuse */
+		rxq->xdp_save_va = buf_va;
+	}
 
 	++ndev->stats.rx_dropped;
 
@@ -1508,11 +1516,13 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool is_napi)
+			     dma_addr_t *da, bool *from_pool, bool is_napi)
 {
 	struct page *page;
 	void *va;
 
+	*from_pool = false;
+
 	/* Reuse XDP dropped page if available */
 	if (rxq->xdp_save_va) {
 		va = rxq->xdp_save_va;
@@ -1533,17 +1543,22 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 			return NULL;
 		}
 	} else {
-		page = dev_alloc_page();
+		page = page_pool_dev_alloc_pages(rxq->page_pool);
 		if (!page)
 			return NULL;
 
+		*from_pool = true;
 		va = page_to_virt(page);
 	}
 
 	*da = dma_map_single(dev, va + rxq->headroom, rxq->datasize,
 			     DMA_FROM_DEVICE);
 	if (dma_mapping_error(dev, *da)) {
-		put_page(virt_to_head_page(va));
+		if (*from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, is_napi);
+		else
+			put_page(virt_to_head_page(va));
+
 		return NULL;
 	}
 
@@ -1552,21 +1567,25 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf)
+			       struct mana_recv_buf_oob *rxoob, void **old_buf,
+			       bool *old_fp)
 {
+	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, true);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, true);
 	if (!va)
 		return;
 
 	dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 			 DMA_FROM_DEVICE);
 	*old_buf = rxoob->buf_va;
+	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
+	rxoob->from_pool = from_pool;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -1580,6 +1599,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct device *dev = gc->dev;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
+	bool old_fp;
 
 	apc = netdev_priv(ndev);
 
@@ -1622,12 +1642,12 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	rxbuf_oob = &rxq->rx_oobs[curr];
 	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf);
+	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
 	/* Unsuccessful refill will have old_buf == NULL.
 	 * In this case, mana_rx_skb() will drop the packet.
 	 */
-	mana_rx_skb(old_buf, oob, rxq);
+	mana_rx_skb(old_buf, old_fp, oob, rxq);
 
 drop:
 	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 
 	if (rxq->xdp_flush)
 		xdp_do_flush();
+
+	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
 }
 
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
@@ -1881,6 +1903,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	struct mana_recv_buf_oob *rx_oob;
 	struct device *dev = gc->dev;
 	struct napi_struct *napi;
+	struct page *page;
 	int i;
 
 	if (!rxq)
@@ -1913,10 +1936,18 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		dma_unmap_single(dev, rx_oob->sgl[0].address,
 				 rx_oob->sgl[0].size, DMA_FROM_DEVICE);
 
-		put_page(virt_to_head_page(rx_oob->buf_va));
+		page = virt_to_head_page(rx_oob->buf_va);
+
+		if (rx_oob->from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, false);
+		else
+			put_page(page);
+
 		rx_oob->buf_va = NULL;
 	}
 
+	page_pool_destroy(rxq->page_pool);
+
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
@@ -1927,18 +1958,20 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
 
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, false);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, false);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
+	rx_oob->from_pool = from_pool;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
@@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
 	return 0;
 }
 
+static int mana_create_page_pool(struct mana_rxq *rxq)
+{
+	struct page_pool_params pprm = {};
+	int ret;
+
+	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
+	pprm.napi = &rxq->rx_cq.napi;
+
+	rxq->page_pool = page_pool_create(&pprm);
+
+	if (IS_ERR(rxq->page_pool)) {
+		ret = PTR_ERR(rxq->page_pool);
+		rxq->page_pool = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 					u32 rxq_idx, struct mana_eq *eq,
 					struct net_device *ndev)
@@ -2037,6 +2089,13 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	mana_get_rxbuf_cfg(ndev->mtu, &rxq->datasize, &rxq->alloc_size,
 			   &rxq->headroom);
 
+	/* Create page pool for RX queue */
+	err = mana_create_page_pool(rxq);
+	if (err) {
+		netdev_err(ndev, "Create page pool err:%d\n", err);
+		goto out;
+	}
+
 	err = mana_alloc_rx_wqe(apc, rxq, &rq_size, &cq_size);
 	if (err)
 		goto out;
@@ -2108,8 +2167,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	WARN_ON(xdp_rxq_info_reg(&rxq->xdp_rxq, ndev, rxq_idx,
 				 cq->napi.napi_id));
-	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq,
-					   MEM_TYPE_PAGE_SHARED, NULL));
+	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
+					   rxq->page_pool));
 
 	napi_enable(&cq->napi);
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 024ad8ddb27e..b12859511839 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -280,6 +280,7 @@ struct mana_recv_buf_oob {
 	struct gdma_wqe_request wqe_req;
 
 	void *buf_va;
+	bool from_pool; /* allocated from a page pool */
 
 	/* SGL of the buffer going to be sent has part of the work request.
 	 */
 	u32 num_sge;
@@ -330,6 +331,8 @@ struct mana_rxq {
 	bool xdp_flush;
 	int xdp_rc; /* XDP redirect return code */
 
+	struct page_pool *page_pool;
+
 	/* MUST BE THE LAST MEMBER:
 	 * Each receive buffer has an associated mana_recv_buf_oob.
 	 */
Add page pool for RX buffers for faster buffer cycle and reduce CPU
usage.

The standard page pool API is used.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V3:
Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
V2:
Use the standard page pool API as suggested by Jesper Dangaard Brouer

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
 include/net/mana/mana.h                       |  3 +
 2 files changed, 78 insertions(+), 16 deletions(-)