Message ID | 1354536876-6274-1-git-send-email-nicolas.ferre@atmel.com (mailing list archive) |
---|---|
State | New, archived |
> Allocate regular pages to use as backing for the RX ring and use the
> DMA API to sync the caches. This should give a bit better performance
> since it allows the CPU to do burst transfers from memory. It is also
> a necessary step on the way to reduce the amount of copying done by
> the driver.

I've not tried to understand the patches, but you have to be
very careful using non-snooped memory for descriptor rings.
No amount of DMA API calls can sort out some of the issues.

Basically you must not dirty a cache line that contains data that the
MAC unit might still write to.

For the receive ring this means that you must not set up new rx buffers
for ring entries until the MAC unit has filled all the ring entries in
the same cache line. This probably means only adding rx buffers in
blocks of 8 or 16 (or even more if there are large cache lines).
I can't see any code in the patch that does this.

Doing the same for the tx ring is more difficult, especially if you
can't stop the MAC unit polling the TX ring on a timer basis.
Basically you can only give the MAC tx packets if either it is idle,
or if the tx ring containing the new entries starts on a cache line
boundary. If the MAC unit is polling the ring, then to give it multiple
items you may need to update the 'owner' bit in the first ring entry
last - just in case the cache line gets written out before you've
finished.

	David
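To make the constraint concrete, here is a rough sketch of what "only hand descriptors back to the MAC a whole cache line at a time" could look like. The descriptor layout, cache-line size and all names below are illustrative assumptions, not code from the patch or from any existing driver.

```c
#include <linux/dma-mapping.h>
#include <linux/types.h>

/* Hypothetical descriptor layout and sizes, for illustration only. */
struct rx_desc {
	u32 addr;	/* buffer address; bit 0 = "used by software" */
	u32 ctrl;
};

#define CACHE_LINE_BYTES	32
#define DESC_PER_LINE		(CACHE_LINE_BYTES / sizeof(struct rx_desc))
#define DESC_RX_USED		0x00000001U
#define RX_BUF_SIZE		128

/*
 * Hand a whole cache line of descriptors back to the MAC in one go.
 * 'first' is the index of the first descriptor in that cache line, and
 * every descriptor in the line must already have been consumed, so the
 * CPU never dirties a line the MAC may still be writing to.
 */
static void refill_rx_line(struct device *dev, struct rx_desc *ring,
			   dma_addr_t ring_dma, unsigned int first,
			   dma_addr_t buf_dma)
{
	unsigned int i;

	for (i = 0; i < DESC_PER_LINE; i++) {
		ring[first + i].ctrl = 0;
		ring[first + i].addr = (u32)(buf_dma + i * RX_BUF_SIZE) &
				       ~DESC_RX_USED;
	}

	/* Write the dirty line back so the MAC sees the new owners. */
	dma_sync_single_for_device(dev, ring_dma + first * sizeof(*ring),
				   CACHE_LINE_BYTES, DMA_TO_DEVICE);
}
```

As the reply below points out, none of this applies to the patch under review, since its descriptor rings are in coherent memory.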
On 12/03/2012 01:43 PM, David Laight :
>> Allocate regular pages to use as backing for the RX ring and use the
>> DMA API to sync the caches. This should give a bit better performance
>> since it allows the CPU to do burst transfers from memory. It is also
>> a necessary step on the way to reduce the amount of copying done by
>> the driver.
>
> I've not tried to understand the patches, but you have to be
> very careful using non-snooped memory for descriptor rings.
> No amount of DMA API calls can sort out some of the issues.

David,

Maybe I have not described the patch properly but the non-coherent
memory is not used for descriptor rings. It is used for the DMA buffers
pointed to by the descriptors (which are allocated as coherent memory).

As buffers are filled up by the interface DMA and then, afterwards, used
by the driver to pass data to the net layer, it seems to me that the use
of non-coherent memory is sensible.

Do you still have reservations about this patch?

Best regards,
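The memory split being described looks roughly like this. This is a simplified sketch with a hypothetical helper and no proper error unwinding, not the patch itself; the real code is in the diff at the end of the thread.

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Sketch of the split: the descriptor ring comes from dma_alloc_coherent()
 * (snooped/uncached as the platform requires), while the data buffers are
 * ordinary pages that are streaming-mapped and synced around each
 * reception. Error handling is deliberately minimal here.
 */
static int alloc_rx_resources(struct device *dev, size_t ring_bytes,
			      void **ring, dma_addr_t *ring_dma,
			      struct page **page, dma_addr_t *buf_dma)
{
	*ring = dma_alloc_coherent(dev, ring_bytes, ring_dma, GFP_KERNEL);
	if (!*ring)
		return -ENOMEM;

	*page = alloc_page(GFP_KERNEL);
	if (!*page)
		return -ENOMEM;	/* a real driver would free the ring here */

	*buf_dma = dma_map_page(dev, *page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	return dma_mapping_error(dev, *buf_dma) ? -ENOMEM : 0;
}
```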
> On 12/03/2012 01:43 PM, David Laight :
> >> Allocate regular pages to use as backing for the RX ring and use the
> >> DMA API to sync the caches. This should give a bit better performance
> >> since it allows the CPU to do burst transfers from memory. It is also
> >> a necessary step on the way to reduce the amount of copying done by
> >> the driver.
> >
> > I've not tried to understand the patches, but you have to be
> > very careful using non-snooped memory for descriptor rings.
> > No amount of DMA API calls can sort out some of the issues.
>
> David,
>
> Maybe I have not described the patch properly but the non-coherent
> memory is not used for descriptor rings. It is used for the DMA buffers
> pointed to by the descriptors (which are allocated as coherent memory).
>
> As buffers are filled up by the interface DMA and then, afterwards, used
> by the driver to pass data to the net layer, it seems to me that the use
> of non-coherent memory is sensible.

Ah, ok - difficult to actually determine from a fast read of the code.
So you invalidate (I think that is the right term) all the cache lines
that are part of each rx buffer before giving it back to the MAC unit.
(Maybe that first time, and just those cache lines that might have been
written to after reception - I'd worry about whether the CRC is written
into the rx buffer!)

I was wondering if the code needs to do per page allocations?
Perhaps that is necessary to avoid needing a large block of
contiguous physical memory (and virtual addresses)?

I know from some experiments done many years ago that a data
copy in the MAC tx and rx path isn't necessarily as bad as
people may think - especially if it removes complicated
'buffer loaning' schemes and/or iommu setup (or bounce
buffers due to limited hardware memory addressing).

The rx copy can usually be made to be a 'whole word' copy
(ie you copy the two bytes of garbage that (mis)align the
destination MAC address, and some bytes after the CRC).
With some hardware I believe it is possible for the cache
controller to do cache-line aligned copies very quickly!
(Some very new x86 cpus might be doing this for 'rep movsd'.)

The copy in the rx path is also better for short packets
that can end up queued for userspace (although a copy in
the socket code would solve that one).

	David
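The "whole word" rx copy being described could be sketched like this. The padding constant and helper name are assumptions, and the skb is assumed to have been allocated with room for the rounded-up length; this is not the driver's code.

```c
#include <linux/kernel.h>	/* ALIGN() */
#include <linux/skbuff.h>
#include <linux/string.h>

#define RX_PAD	2	/* pad in front of the frame so the IP header is aligned */

/*
 * Copy a received frame including the two leading pad bytes and a few
 * trailing bytes, so both source and destination stay word aligned for
 * the whole memcpy(). The extra bytes are trimmed off afterwards.
 */
static void copy_frame_word_aligned(struct sk_buff *skb, const void *rx_buf,
				    unsigned int frame_len)
{
	unsigned int copy_len = ALIGN(RX_PAD + frame_len, 4);

	memcpy(skb_put(skb, copy_len), rx_buf, copy_len);

	skb_pull(skb, RX_PAD);		/* drop the alignment bytes */
	skb_trim(skb, frame_len);	/* drop the bytes copied past the end */
}
```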
On 12/03/2012 03:25 PM, David Laight :
>> On 12/03/2012 01:43 PM, David Laight :
>>>> Allocate regular pages to use as backing for the RX ring and use the
>>>> DMA API to sync the caches. This should give a bit better performance
>>>> since it allows the CPU to do burst transfers from memory. It is also
>>>> a necessary step on the way to reduce the amount of copying done by
>>>> the driver.
>>>
>>> I've not tried to understand the patches, but you have to be
>>> very careful using non-snooped memory for descriptor rings.
>>> No amount of DMA API calls can sort out some of the issues.
>>
>> David,
>>
>> Maybe I have not described the patch properly but the non-coherent
>> memory is not used for descriptor rings. It is used for the DMA buffers
>> pointed to by the descriptors (which are allocated as coherent memory).
>>
>> As buffers are filled up by the interface DMA and then, afterwards, used
>> by the driver to pass data to the net layer, it seems to me that the use
>> of non-coherent memory is sensible.
>
> Ah, ok - difficult to actually determine from a fast read of the code.
> So you invalidate (I think that is the right term) all the cache lines
> that are part of each rx buffer before giving it back to the MAC unit.
> (Maybe that first time, and just those cache lines that might have been
> written to after reception - I'd worry about whether the CRC is written
> into the rx buffer!)

If I understand correctly, you mean that the call to:

	dma_sync_single_range_for_device(&bp->pdev->dev, phys,
			pg_offset, frag_len, DMA_FROM_DEVICE);

in the rx path after having copied the data to skb is not needed?
That is also the conclusion that I found after thinking about
this again... I will check this.

For the CRC, my driver is not using the CRC offloading feature for the
moment. So no CRC is written by the device.

> I was wondering if the code needs to do per page allocations?
> Perhaps that is necessary to avoid needing a large block of
> contiguous physical memory (and virtual addresses)?

The page management seems interesting for future management of RX
buffers as skb fragments: that will allow us to avoid copying received data.

> I know from some experiments done many years ago that a data
> copy in the MAC tx and rx path isn't necessarily as bad as
> people may think - especially if it removes complicated
> 'buffer loaning' schemes and/or iommu setup (or bounce
> buffers due to limited hardware memory addressing).
>
> The rx copy can usually be made to be a 'whole word' copy
> (ie you copy the two bytes of garbage that (mis)align the
> destination MAC address, and some bytes after the CRC).
> With some hardware I believe it is possible for the cache
> controller to do cache-line aligned copies very quickly!
> (Some very new x86 cpus might be doing this for 'rep movsd'.)

Well, on our side, the "memory bus" resource is precious, so I imagine
that even with an optimized copy, limiting the use of this resource
should be better.

> The copy in the rx path is also better for short packets
> that can end up queued for userspace (although a copy in
> the socket code would solve that one).

Sure, some patches by Haavard that I am working on at the moment take
care of copying, in any case, the first 62 bytes (+ 2 bytes of
alignment) of each packet, so that we cover the case of short packets
and headers...

Thanks for your comments, best regards,
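The "always copy the first 62+2 bytes" idea mentioned above is essentially a copy-break scheme. One common way to structure it is sketched below; the helper and constants are hypothetical (not the patches being referred to), the skb is assumed to have at least that much tailroom, and a lowmem page is assumed so page_address() can be used where the driver itself uses kmap_atomic().

```c
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/skbuff.h>

#define RX_HDR_COPY	64	/* 62 bytes of headers + 2 bytes of alignment */
#define RX_BUF_TRUESIZE	128	/* size accounted per fragment, assumed */

/*
 * Copy the leading bytes (headers, and whole short packets) into the
 * skb's linear area and attach the remainder of the buffer as a page
 * fragment, taking an extra page reference so the page survives until
 * the stack frees the skb.
 */
static void build_rx_skb(struct sk_buff *skb, struct page *page,
			 unsigned int offset, unsigned int len)
{
	unsigned int hdr_len = min_t(unsigned int, len, RX_HDR_COPY);
	void *buf = page_address(page) + offset;

	skb_copy_to_linear_data(skb, buf, hdr_len);
	skb_put(skb, hdr_len);

	if (len > hdr_len) {
		get_page(page);
		skb_add_rx_frag(skb, 0, page, offset + hdr_len,
				len - hdr_len, RX_BUF_TRUESIZE);
	}
}
```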
> If I understand correctly, you mean that the call to:
>
> 	dma_sync_single_range_for_device(&bp->pdev->dev, phys,
> 			pg_offset, frag_len, DMA_FROM_DEVICE);
>
> in the rx path after having copied the data to skb is not needed?
> That is also the conclusion that I found after thinking about
> this again... I will check this.

You need to make sure that the memory isn't in the data cache
when you give the rx buffer back to the MAC.
(and ensure the cpu doesn't read it until the rx is complete.)
I've NFI what that dma_sync call does - you need to invalidate
the cache lines.

> For the CRC, my driver is not using the CRC offloading feature for the
> moment. So no CRC is written by the device.

I was thinking it would matter if the MAC wrote the CRC into the
buffer (even though it was excluded from the length).
It doesn't - you only need to worry about data you've read.

> > I was wondering if the code needs to do per page allocations?
> > Perhaps that is necessary to avoid needing a large block of
> > contiguous physical memory (and virtual addresses)?
>
> The page management seems interesting for future management of RX
> buffers as skb fragments: that will allow us to avoid copying received data.

Dunno - the complexities of such buffer loaning schemes often
exceed the gain of avoiding the data copy.
Using buffers allocated to the skb is a bit different - since
you completely forget about the memory once you pass the skb
upstream.

Some quick sums indicate you might want to allocate 8k memory
blocks and split into 5 buffers.

	David
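One plausible reading of those "quick sums", with the intermediate figures filled in (only the 8 KiB block size and the result of 5 come from the mail; the 1518/1536 per-buffer figures are an assumption):

```c
#include <linux/kernel.h>	/* ALIGN() */

/* A full Ethernet frame: 1500 payload + 14 header + 4 FCS = 1518 bytes.
 * Add 2 bytes of alignment and round up to a cache-line multiple: 1536.
 * An 8 KiB block therefore holds 8192 / 1536 = 5 such buffers. */
#define MAX_FRAME_LEN		1518
#define RX_BLOCK_BUF_LEN	ALIGN(MAX_FRAME_LEN + 2, 64)		/* 1536 */
#define RX_BLOCK_SIZE		8192
#define RX_BUFS_PER_BLOCK	(RX_BLOCK_SIZE / RX_BLOCK_BUF_LEN)	/* 5 */
```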
On 12/05/2012 10:35 AM, David Laight :
>> If I understand correctly, you mean that the call to:
>>
>> 	dma_sync_single_range_for_device(&bp->pdev->dev, phys,
>> 			pg_offset, frag_len, DMA_FROM_DEVICE);
>>
>> in the rx path after having copied the data to skb is not needed?
>> That is also the conclusion that I found after thinking about
>> this again... I will check this.
>
> You need to make sure that the memory isn't in the data cache
> when you give the rx buffer back to the MAC.
> (and ensure the cpu doesn't read it until the rx is complete.)
> I've NFI what that dma_sync call does - you need to invalidate
> the cache lines.

The invalidate of the cache lines is done by
dma_sync_single_range_for_device(..., DMA_FROM_DEVICE), so I need to
keep it.

>> For the CRC, my driver is not using the CRC offloading feature for the
>> moment. So no CRC is written by the device.
>
> I was thinking it would matter if the MAC wrote the CRC into the
> buffer (even though it was excluded from the length).
> It doesn't - you only need to worry about data you've read.
>
>>> I was wondering if the code needs to do per page allocations?
>>> Perhaps that is necessary to avoid needing a large block of
>>> contiguous physical memory (and virtual addresses)?
>>
>> The page management seems interesting for future management of RX
>> buffers as skb fragments: that will allow us to avoid copying received data.
>
> Dunno - the complexities of such buffer loaning schemes often
> exceed the gain of avoiding the data copy.
> Using buffers allocated to the skb is a bit different - since
> you completely forget about the memory once you pass the skb
> upstream.
>
> Some quick sums indicate you might want to allocate 8k memory
> blocks and split into 5 buffers.

Well, for the 10/100 MACB interface, I am stuck with 128-byte buffers!
So this use of pages seems sensible.
On the other hand, it is true that I may have to reconsider the GEM
memory management (this one is able to cover 128 B to 10 KB rx DMA
buffers)...

Best regards,
> Well, for the 10/100 MACB interface, I am stuck with 128-byte buffers!
> So this use of pages seems sensible.

If you have DMA-coherent memory you can make the rx buffer space
be an array of short buffers referenced by adjacent ring entries
(possibly with the last one slightly short to allow for the 2 byte
offset).
Then, when a big frame ends up in multiple buffers, you need a
maximum of two copies to extract the data.

	David
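A minimal sketch of that extraction, assuming the layout described above (one contiguous block of RX_RING_SIZE buffers of RX_BUFFER_SIZE bytes each, matching the driver's defines; the helper itself is hypothetical): the only discontinuity is the wrap at the end of the block, so two memcpy() calls always suffice.

```c
#include <linux/kernel.h>	/* min() */
#include <linux/string.h>
#include <linux/types.h>

#define RX_BUFFER_SIZE	128	/* as in the driver */
#define RX_RING_SIZE	512	/* as in the driver */

/*
 * Copy a frame out of one contiguous block of RX buffers, where ring
 * entry N points at offset N * RX_BUFFER_SIZE into the block.
 */
static void copy_frame(void *dst, const u8 *buf_area,
		       unsigned int first_entry, unsigned int frame_len)
{
	unsigned int start = first_entry * RX_BUFFER_SIZE;
	unsigned int area_len = RX_RING_SIZE * RX_BUFFER_SIZE;
	unsigned int first_chunk = min(frame_len, area_len - start);

	memcpy(dst, buf_area + start, first_chunk);
	if (first_chunk < frame_len)	/* frame wrapped past the end */
		memcpy((u8 *)dst + first_chunk, buf_area,
		       frame_len - first_chunk);
}
```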
```diff
diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 6a59bce..c2955da 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -10,6 +10,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/clk.h>
+#include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/kernel.h>
@@ -35,6 +36,8 @@
 #define RX_BUFFER_SIZE		128
 #define RX_RING_SIZE		512 /* must be power of 2 */
 #define RX_RING_BYTES		(sizeof(struct macb_dma_desc) * RX_RING_SIZE)
+#define RX_BUFFERS_PER_PAGE	(PAGE_SIZE / RX_BUFFER_SIZE)
+#define RX_RING_PAGES		(RX_RING_SIZE / RX_BUFFERS_PER_PAGE)
 
 #define TX_RING_SIZE		128 /* must be power of 2 */
 #define TX_RING_BYTES		(sizeof(struct macb_dma_desc) * TX_RING_SIZE)
@@ -90,9 +93,16 @@ static struct macb_dma_desc *macb_rx_desc(struct macb *bp, unsigned int index)
 	return &bp->rx_ring[macb_rx_ring_wrap(index)];
 }
 
-static void *macb_rx_buffer(struct macb *bp, unsigned int index)
+static struct macb_rx_page *macb_rx_page(struct macb *bp, unsigned int index)
 {
-	return bp->rx_buffers + RX_BUFFER_SIZE * macb_rx_ring_wrap(index);
+	unsigned int entry = macb_rx_ring_wrap(index);
+
+	return &bp->rx_page[entry / RX_BUFFERS_PER_PAGE];
+}
+
+static unsigned int macb_rx_page_offset(struct macb *bp, unsigned int index)
+{
+	return (index % RX_BUFFERS_PER_PAGE) * RX_BUFFER_SIZE;
 }
 
 void macb_set_hwaddr(struct macb *bp)
@@ -528,11 +538,15 @@ static void macb_tx_interrupt(struct macb *bp)
 static int macb_rx_frame(struct macb *bp, unsigned int first_frag,
 			 unsigned int last_frag)
 {
-	unsigned int len;
-	unsigned int frag;
-	unsigned int offset;
-	struct sk_buff *skb;
-	struct macb_dma_desc *desc;
+	unsigned int		len;
+	unsigned int		frag;
+	unsigned int		skb_offset;
+	unsigned int		pg_offset;
+	struct macb_rx_page	*rx_page;
+	dma_addr_t		phys;
+	void			*buf;
+	struct sk_buff		*skb;
+	struct macb_dma_desc	*desc;
 
 	desc = macb_rx_desc(bp, last_frag);
 	len = MACB_BFEXT(RX_FRMLEN, desc->ctrl);
@@ -566,7 +580,7 @@ static int macb_rx_frame(struct macb *bp, unsigned int first_frag,
 		return 1;
 	}
 
-	offset = 0;
+	skb_offset = 0;
 	len += NET_IP_ALIGN;
 	skb_checksum_none_assert(skb);
 	skb_put(skb, len);
@@ -574,13 +588,28 @@
 	for (frag = first_frag; ; frag++) {
 		unsigned int frag_len = RX_BUFFER_SIZE;
 
-		if (offset + frag_len > len) {
+		if (skb_offset + frag_len > len) {
 			BUG_ON(frag != last_frag);
-			frag_len = len - offset;
+			frag_len = len - skb_offset;
 		}
-		skb_copy_to_linear_data_offset(skb, offset,
-				macb_rx_buffer(bp, frag), frag_len);
-		offset += RX_BUFFER_SIZE;
+
+		rx_page = macb_rx_page(bp, frag);
+		pg_offset = macb_rx_page_offset(bp, frag);
+		phys = rx_page->phys;
+
+		dma_sync_single_range_for_cpu(&bp->pdev->dev, phys,
+				pg_offset, frag_len, DMA_FROM_DEVICE);
+
+		buf = kmap_atomic(rx_page->page);
+		skb_copy_to_linear_data_offset(skb, skb_offset,
+				buf + pg_offset, frag_len);
+		kunmap_atomic(buf);
+
+		skb_offset += frag_len;
+
+		dma_sync_single_range_for_device(&bp->pdev->dev, phys,
+				pg_offset, frag_len, DMA_FROM_DEVICE);
+
 		desc = macb_rx_desc(bp, frag);
 		desc->addr &= ~MACB_BIT(RX_USED);
 
@@ -860,86 +889,90 @@ static int macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-static void macb_free_consistent(struct macb *bp)
+static void macb_free_rings(struct macb *bp)
 {
-	if (bp->tx_skb) {
-		kfree(bp->tx_skb);
-		bp->tx_skb = NULL;
-	}
-	if (bp->rx_ring) {
-		dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES,
-				  bp->rx_ring, bp->rx_ring_dma);
-		bp->rx_ring = NULL;
-	}
-	if (bp->tx_ring) {
-		dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES,
-				  bp->tx_ring, bp->tx_ring_dma);
-		bp->tx_ring = NULL;
-	}
-	if (bp->rx_buffers) {
-		dma_free_coherent(&bp->pdev->dev,
-				  RX_RING_SIZE * RX_BUFFER_SIZE,
-				  bp->rx_buffers, bp->rx_buffers_dma);
-		bp->rx_buffers = NULL;
+	int i;
+
+	for (i = 0; i < RX_RING_PAGES; i++) {
+		struct macb_rx_page *rx_page = &bp->rx_page[i];
+
+		if (!rx_page->page)
+			continue;
+
+		dma_unmap_page(&bp->pdev->dev, rx_page->phys,
+			       PAGE_SIZE, DMA_FROM_DEVICE);
+		put_page(rx_page->page);
+		rx_page->page = NULL;
 	}
+
+	kfree(bp->tx_skb);
+	kfree(bp->rx_page);
+	dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES, bp->tx_ring,
+			  bp->tx_ring_dma);
+	dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES, bp->rx_ring,
+			  bp->rx_ring_dma);
 }
 
-static int macb_alloc_consistent(struct macb *bp)
+static int macb_init_rings(struct macb *bp)
 {
-	int size;
+	struct page *page;
+	dma_addr_t phys;
+	unsigned int page_idx;
+	unsigned int ring_idx;
+	unsigned int i;
 
-	size = TX_RING_SIZE * sizeof(struct macb_tx_skb);
-	bp->tx_skb = kmalloc(size, GFP_KERNEL);
-	if (!bp->tx_skb)
-		goto out_err;
-
-	size = RX_RING_BYTES;
-	bp->rx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
+	bp->rx_ring = dma_alloc_coherent(&bp->pdev->dev, RX_RING_BYTES,
 					 &bp->rx_ring_dma, GFP_KERNEL);
 	if (!bp->rx_ring)
-		goto out_err;
+		goto err_alloc_rx_ring;
+
 	netdev_dbg(bp->dev,
 		   "Allocated RX ring of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->rx_ring_dma, bp->rx_ring);
+		   RX_RING_BYTES, (unsigned long)bp->rx_ring_dma, bp->rx_ring);
 
-	size = TX_RING_BYTES;
-	bp->tx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
+	bp->tx_ring = dma_alloc_coherent(&bp->pdev->dev, TX_RING_BYTES,
 					 &bp->tx_ring_dma, GFP_KERNEL);
 	if (!bp->tx_ring)
-		goto out_err;
-	netdev_dbg(bp->dev,
-		   "Allocated TX ring of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->tx_ring_dma, bp->tx_ring);
-
-	size = RX_RING_SIZE * RX_BUFFER_SIZE;
-	bp->rx_buffers = dma_alloc_coherent(&bp->pdev->dev, size,
-					    &bp->rx_buffers_dma, GFP_KERNEL);
-	if (!bp->rx_buffers)
-		goto out_err;
+		goto err_alloc_tx_ring;
+
 	netdev_dbg(bp->dev,
-		   "Allocated RX buffers of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->rx_buffers_dma, bp->rx_buffers);
+		   "Allocated TX ring of %d bytes at 0x%08lx (mapped %p)\n",
+		   TX_RING_BYTES, (unsigned long)bp->tx_ring_dma, bp->tx_ring);
 
-	return 0;
+	bp->rx_page = kcalloc(RX_RING_PAGES, sizeof(struct macb_rx_page),
+			      GFP_KERNEL);
+	if (!bp->rx_page)
+		goto err_alloc_rx_page;
 
-out_err:
-	macb_free_consistent(bp);
-	return -ENOMEM;
-}
+	bp->tx_skb = kcalloc(TX_RING_SIZE, sizeof(struct macb_tx_skb),
+			     GFP_KERNEL);
+	if (!bp->tx_skb)
+		goto err_alloc_tx_skb;
 
-static void macb_init_rings(struct macb *bp)
-{
-	int i;
-	dma_addr_t addr;
+	for (page_idx = 0, ring_idx = 0; page_idx < RX_RING_PAGES; page_idx++) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			goto err_alloc_page;
+
+		phys = dma_map_page(&bp->pdev->dev, page, 0, PAGE_SIZE,
+				    DMA_FROM_DEVICE);
+		if (dma_mapping_error(&bp->pdev->dev, phys))
+			goto err_map_page;
+
+		bp->rx_page[page_idx].page = page;
+		bp->rx_page[page_idx].phys = phys;
 
-	addr = bp->rx_buffers_dma;
-	for (i = 0; i < RX_RING_SIZE; i++) {
-		bp->rx_ring[i].addr = addr;
-		bp->rx_ring[i].ctrl = 0;
-		addr += RX_BUFFER_SIZE;
+		for (i = 0; i < RX_BUFFERS_PER_PAGE; i++, ring_idx++) {
+			bp->rx_ring[ring_idx].addr = phys;
+			bp->rx_ring[ring_idx].ctrl = 0;
+			phys += RX_BUFFER_SIZE;
+		}
 	}
 	bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
 
+	netdev_dbg(bp->dev, "Allocated %u RX buffers (%lu pages)\n",
+		   RX_RING_SIZE, RX_RING_PAGES);
+
 	for (i = 0; i < TX_RING_SIZE; i++) {
 		bp->tx_ring[i].addr = 0;
 		bp->tx_ring[i].ctrl = MACB_BIT(TX_USED);
@@ -947,6 +980,28 @@ static void macb_init_rings(struct macb *bp)
 	bp->tx_ring[TX_RING_SIZE - 1].ctrl |= MACB_BIT(TX_WRAP);
 
 	bp->rx_tail = bp->tx_head = bp->tx_tail = 0;
+
+	return 0;
+
+err_map_page:
+	__free_page(page);
+err_alloc_page:
+	while (page_idx--) {
+		dma_unmap_page(&bp->pdev->dev, bp->rx_page[page_idx].phys,
+			       PAGE_SIZE, DMA_FROM_DEVICE);
+		__free_page(bp->rx_page[page_idx].page);
+	}
+	kfree(bp->tx_skb);
+err_alloc_tx_skb:
+	kfree(bp->rx_page);
+err_alloc_rx_page:
+	dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES, bp->tx_ring,
+			  bp->tx_ring_dma);
+err_alloc_tx_ring:
+	dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES, bp->rx_ring,
+			  bp->rx_ring_dma);
+err_alloc_rx_ring:
+	return -ENOMEM;
 }
 
 static void macb_reset_hw(struct macb *bp)
@@ -1221,16 +1276,15 @@ static int macb_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	err = macb_alloc_consistent(bp);
+	err = macb_init_rings(bp);
 	if (err) {
-		netdev_err(dev, "Unable to allocate DMA memory (error %d)\n",
+		netdev_err(dev, "Unable to allocate DMA rings (error %d)\n",
 			   err);
 		return err;
 	}
 
 	napi_enable(&bp->napi);
 
-	macb_init_rings(bp);
 	macb_init_hw(bp);
 
 	/* schedule a link state check */
@@ -1257,7 +1311,7 @@ static int macb_close(struct net_device *dev)
 	netif_carrier_off(dev);
 	spin_unlock_irqrestore(&bp->lock, flags);
 
-	macb_free_consistent(bp);
+	macb_free_rings(bp);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 570908b..e82242b 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -453,6 +453,23 @@ struct macb_dma_desc {
 #define MACB_TX_USED_SIZE			1
 
 /**
+ * struct macb_rx_page - data associated with a page used as RX buffers
+ * @page: Physical page used as storage for the buffers
+ * @phys: DMA address of the page
+ *
+ * Each page is used to provide %MACB_RX_BUFFERS_PER_PAGE RX buffers.
+ * The page gets an initial reference when it is inserted into the
+ * ring, and an additional reference each time it is passed up the
+ * stack as a fragment. When all the buffers have been used, we drop
+ * the initial reference and allocate a new page. Any additional
+ * references are dropped when the higher layers free the skb.
+ */
+struct macb_rx_page {
+	struct page		*page;
+	dma_addr_t		phys;
+};
+
+/**
  * struct macb_tx_skb - data about an skb which is being transmitted
  * @skb: skb currently being transmitted
  * @mapping: DMA address of the skb's data buffer
@@ -543,6 +560,7 @@ struct macb {
 	unsigned int		rx_tail;
 	struct macb_dma_desc	*rx_ring;
+	struct macb_rx_page	*rx_page;
 	void			*rx_buffers;
 
 	unsigned int		tx_head, tx_tail;
```