Message ID | 20230728231829.235716-4-michael.chan@broadcom.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | bnxt_en: Add support for page pool
On Fri, 28 Jul 2023 16:18:29 -0700 Michael Chan wrote:
> +	pp.dma_dir = bp->rx_dir;
> +	pp.max_len = BNXT_RX_PAGE_SIZE;

I _think_ you need PAGE_SIZE here.

This should be smaller than PAGE_SIZE only if you're wasting the rest
of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.

Adding Jesper to CC to keep me honest.

> +	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
On 29/07/2023 02.42, Jakub Kicinski wrote:
> On Fri, 28 Jul 2023 16:18:29 -0700 Michael Chan wrote:
>> +	pp.dma_dir = bp->rx_dir;
>> +	pp.max_len = BNXT_RX_PAGE_SIZE;
>
> I _think_ you need PAGE_SIZE here.
>

I actually think pp.max_len = BNXT_RX_PAGE_SIZE is correct here.
(Although it can be optimized, see below.)

> This should be smaller than PAGE_SIZE only if you're wasting the rest
> of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
>

Remember pp.max_len is used for dma_sync_for_device.
If the driver is smart, it can set pp.max_len according to the MTU, as
the (DMA sync for) device knows the hardware will not go beyond this.
On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
optimized for this. I remember it had HUGE effects on the ARM
EspressoBin board.

> Adding Jesper to CC to keep me honest.

Adding Ilias to keep me honest ;-)

To follow/understand these changes, reviewers need to keep the context
of patch 1/3 in mind [1].

[1] https://lore.kernel.org/all/20230728231829.235716-2-michael.chan@broadcom.com/

>
>> +	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;

--Jesper
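For illustration, a minimal sketch of the optimization Jesper is describing, assuming a hypothetical page-per-packet driver — this is not bnxt code, and "rx_headroom" and "mtu" are placeholder variables:

    struct page_pool_params pp = { 0 };

    pp.flags   = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
    pp.dma_dir = DMA_FROM_DEVICE;
    /* HW writes packets starting at this offset within the page */
    pp.offset  = rx_headroom;
    /* HW never writes past one MTU-sized buffer, so the
     * dma_sync_for_device done by page_pool can stop there.
     */
    pp.max_len = SKB_DATA_ALIGN(ETH_HLEN + mtu);

With PP_FLAG_DMA_SYNC_DEV set, page_pool syncs at most pp.max_len bytes (starting at pp.offset) for the device when a page comes back to the pool, instead of the driver syncing manually on every refill.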
On Mon, 31 Jul 2023 19:47:08 +0200 Jesper Dangaard Brouer wrote:
> > This should be smaller than PAGE_SIZE only if you're wasting the rest
> > of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> > PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
>
> Remember pp.max_len is used for dma_sync_for_device.
> If the driver is smart, it can set pp.max_len according to the MTU, as
> the (DMA sync for) device knows the hardware will not go beyond this.
> On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> optimized for this. I remember it had HUGE effects on the ARM
> EspressoBin board.

Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
they will be filled with TCP payload start to end. page_pool_put_page()
does nothing for a non-last frag, so we'll only sync for the last
(BNXT_RX_PAGE-sized) frag released, and we need to sync the entire
host page.
On Mon, Jul 31, 2023 at 11:00 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 31 Jul 2023 19:47:08 +0200 Jesper Dangaard Brouer wrote:
> > > This should be smaller than PAGE_SIZE only if you're wasting the rest
> > > of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> > > PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
> >
> > Remember pp.max_len is used for dma_sync_for_device.
> > If the driver is smart, it can set pp.max_len according to the MTU, as
> > the (DMA sync for) device knows the hardware will not go beyond this.
> > On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> > optimized for this. I remember it had HUGE effects on the ARM
> > EspressoBin board.
>
> Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
> they will be filled with TCP payload start to end. page_pool_put_page()
> does nothing for a non-last frag, so we'll only sync for the last
> (BNXT_RX_PAGE-sized) frag released, and we need to sync the entire
> host page.

Correct, there is no MTU here. Remember this matters only when
PAGE_SIZE > BNXT_RX_PAGE_SIZE (e.g. 64K PAGE_SIZE and 32K
BNXT_RX_PAGE_SIZE). I think we want to dma_sync_for_device for 32K in
this case.
On Mon, 31 Jul 2023 11:16:55 -0700 Michael Chan wrote:
> > > Remember pp.max_len is used for dma_sync_for_device.
> > > If the driver is smart, it can set pp.max_len according to the MTU, as
> > > the (DMA sync for) device knows the hardware will not go beyond this.
> > > On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> > > optimized for this. I remember it had HUGE effects on the ARM
> > > EspressoBin board.
> >
> > Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
> > they will be filled with TCP payload start to end. page_pool_put_page()
> > does nothing for a non-last frag, so we'll only sync for the last
> > (BNXT_RX_PAGE-sized) frag released, and we need to sync the entire
> > host page.
>
> Correct, there is no MTU here. Remember this matters only when
> PAGE_SIZE > BNXT_RX_PAGE_SIZE (e.g. 64K PAGE_SIZE and 32K
> BNXT_RX_PAGE_SIZE). I think we want to dma_sync_for_device for 32K in
> this case.

Maybe I'm misunderstanding. Let me tell you how I think this works and
perhaps we should update the docs based on this discussion.

Note that the max_len is applied to the full host page when the full
host page is returned. Not to fragments, and not at allocation.

The .max_len is the max offset within the host page that the HW may
access. For page-per-packet, 1500B MTU this could matter quite a bit,
because we only have to sync ~1500B rather than 4096B.

   some wasted headroom/padding, pp.offset can be used to skip
  /      device may touch this section
 /      /                      device will not touch, sync not needed
/      /                      /
|**| ===== MTU 1500B ====== | - skb_shinfo and unused --- |
    <------ .max_len -------->

For fragmented pages it becomes:

               middle skb_shinfo                     remainder
              /                                     /
             |                                     |
|**| == MTU == | - shinfo- |**| == MTU == | - shinfo- |+++|
    <------------------ .max_len ------------------>

So max_len will only exclude the _last_ shinfo and the wasted space
(remainder of dividing the page by the buffer size). We must sync
_all_ packet sections ("== MTU ==") within the page.

In bnxt's case - the page is fragmented (latter diagram), and there is
no start offset or wasted space. Ergo .max_len = PAGE_SIZE.

Where did I get off the track?
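To put rough numbers on the two diagrams above — a sketch only, not code from any driver; the 4K page / 1500B MTU figures are simply the example used in the message:

    /* Page-per-packet, 4K page, 1500B MTU (first diagram): the device
     * only writes the MTU-sized buffer after the headroom, so only
     * ~1500B needs syncing, not the whole 4K page.
     */
    pp.offset  = XDP_PACKET_HEADROOM;
    pp.max_len = SKB_DATA_ALIGN(ETH_HLEN + 1500);  /* roughly the HW buffer size */

    /* Fragmented page, as in bnxt aggregation buffers (second diagram):
     * every fragment is filled with payload, so the sync must cover the
     * whole host page.
     */
    pp.offset  = 0;
    pp.max_len = PAGE_SIZE;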
On Mon, Jul 31, 2023 at 11:44 AM Jakub Kicinski <kuba@kernel.org> wrote:
> Maybe I'm misunderstanding. Let me tell you how I think this works and
> perhaps we should update the docs based on this discussion.
>
> Note that the max_len is applied to the full host page when the full
> host page is returned. Not to fragments, and not at allocation.
>

I think I am beginning to understand what the confusion is. These 32K
page fragments within the page may not belong to the same (GRO)
packet. So we cannot dma_sync the whole page at the same time.

Without setting PP_FLAG_DMA_SYNC_DEV, the driver code should be
something like this:

	mapping = page_pool_get_dma_addr(page) + offset;
	dma_sync_single_for_device(dev, mapping, BNXT_RX_PAGE_SIZE, bp->rx_dir);

offset may be 0, 32K, etc.

Since the PP_FLAG_DMA_SYNC_DEV logic is not aware of this offset, we
actually must do our own dma_sync and not use PP_FLAG_DMA_SYNC_DEV in
this case. Does that sound right?
On Mon, 31 Jul 2023 13:20:04 -0700 Michael Chan wrote:
> I think I am beginning to understand what the confusion is. These 32K
> page fragments within the page may not belong to the same (GRO)
> packet.

Right.

> So we cannot dma_sync the whole page at the same time.

I wouldn't phrase it like that.

> Without setting PP_FLAG_DMA_SYNC_DEV, the driver code should be
> something like this:
>
> 	mapping = page_pool_get_dma_addr(page) + offset;
> 	dma_sync_single_for_device(dev, mapping, BNXT_RX_PAGE_SIZE, bp->rx_dir);
>
> offset may be 0, 32K, etc.
>
> Since the PP_FLAG_DMA_SYNC_DEV logic is not aware of this offset, we
> actually must do our own dma_sync and not use PP_FLAG_DMA_SYNC_DEV in
> this case. Does that sound right?

No, no, all I'm saying is that with the current code (in page pool)
you can't be very intelligent about the sync'ing. Every time a page
enters the pool - the whole page should be synced. But that's fine,
it's still better to let page pool do the syncing than trying to do it
manually in the driver (since freshly allocated pages do not have to
be synced).

I think the confusion comes partially from the fact that the driver
only ever deals with fragments (32k), but internally page pool does
recycling in full pages (64k). And .max_len is part of the recycling
machinery, so to speak, not part of the allocation machinery.

tl;dr just set .max_len = PAGE_SIZE and all will be right.
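The recycling-side behaviour Jakub refers to lives in net/core/page_pool.c; roughly, and paraphrased from memory rather than quoted verbatim, the sync done when a page re-enters the pool looks like this:

    static void page_pool_dma_sync_for_device(struct page_pool *pool,
                                              struct page *page,
                                              unsigned int dma_sync_size)
    {
            dma_addr_t dma_addr = page_pool_get_dma_addr(page);

            /* capped by the pool's max_len, synced from pool->p.offset */
            dma_sync_size = min(dma_sync_size, pool->p.max_len);
            dma_sync_single_range_for_device(pool->p.dev, dma_addr,
                                             pool->p.offset, dma_sync_size,
                                             pool->p.dma_dir);
    }

This runs per page, not per fragment, which is why .max_len has to describe the whole host page in the fragmented case.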
On Mon, Jul 31, 2023 at 1:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
> tl;dr just set .max_len = PAGE_SIZE and all will be right.
OK I think I got it now. The page is only recycled when all the
fragments are recycled and so we can let page pool DMA sync the whole
page at that time.
On 31/07/2023 23.11, Michael Chan wrote:
> On Mon, Jul 31, 2023 at 1:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> tl;dr just set .max_len = PAGE_SIZE and all will be right.
>
> OK I think I got it now. The page is only recycled when all the
> fragments are recycled and so we can let page pool DMA sync the whole
> page at that time.

Yes, Jakub is right, I see that now.

When using the page_pool "frag" API (e.g. page_pool_dev_alloc_frag),
the optimization I talked about isn't valid. We simply have to DMA
sync the entire page when it gets back to the recycle stage.

--Jesper
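Based on the conclusion above, the page pool setup in bnxt_alloc_rx_page_pool() would presumably end up along these lines — a sketch of the agreed direction, not necessarily the exact next revision of the patch:

    pp.dma_dir = bp->rx_dir;
    pp.max_len = PAGE_SIZE;     /* sync the whole host page on recycle */
    pp.flags   = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
    if (PAGE_SIZE > BNXT_RX_PAGE_SIZE)
            pp.flags |= PP_FLAG_PAGE_FRAG;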
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index adf785b7aa42..b35bc92094ce 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -759,7 +759,6 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 					 unsigned int *offset,
 					 gfp_t gfp)
 {
-	struct device *dev = &bp->pdev->dev;
 	struct page *page;
 
 	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
@@ -772,12 +771,7 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 	if (!page)
 		return NULL;
 
-	*mapping = dma_map_page_attrs(dev, page, *offset, BNXT_RX_PAGE_SIZE,
-				      bp->rx_dir, DMA_ATTR_WEAK_ORDERING);
-	if (dma_mapping_error(dev, *mapping)) {
-		page_pool_recycle_direct(rxr->page_pool, page);
-		return NULL;
-	}
+	*mapping = page_pool_get_dma_addr(page) + *offset;
 
 	return page;
 }
@@ -996,8 +990,8 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+				bp->rx_dir);
 	skb = build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
@@ -1030,8 +1024,8 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+				bp->rx_dir);
 
 	if (unlikely(!payload))
 		payload = eth_get_headlen(bp->dev, data_ptr, len);
@@ -1147,9 +1141,8 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 			return 0;
 		}
 
-		dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-				     bp->rx_dir,
-				     DMA_ATTR_WEAK_ORDERING);
+		dma_sync_single_for_cpu(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
+					bp->rx_dir);
 
 		total_frag_len += frag_len;
 		prod = NEXT_RX_AGG(prod);
@@ -2945,10 +2938,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 
 		rx_buf->data = NULL;
 		if (BNXT_RX_PAGE_MODE(bp)) {
-			mapping -= bp->rx_dma_offset;
-			dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-					     bp->rx_dir,
-					     DMA_ATTR_WEAK_ORDERING);
 			page_pool_recycle_direct(rxr->page_pool, data);
 		} else {
 			dma_unmap_single_attrs(&pdev->dev, mapping,
@@ -2969,9 +2958,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 		if (!page)
 			continue;
 
-		dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
-				     BNXT_RX_PAGE_SIZE, bp->rx_dir,
-				     DMA_ATTR_WEAK_ORDERING);
 		rx_agg_buf->page = NULL;
 		__clear_bit(i, rxr->rx_agg_bmap);
 
@@ -3203,7 +3189,9 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.nid = dev_to_node(&bp->pdev->dev);
 	pp.napi = &rxr->bnapi->napi;
 	pp.dev = &bp->pdev->dev;
-	pp.dma_dir = DMA_BIDIRECTIONAL;
+	pp.dma_dir = bp->rx_dir;
+	pp.max_len = BNXT_RX_PAGE_SIZE;
+	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE)
 		pp.flags |= PP_FLAG_PAGE_FRAG;
 